I have been meaning to write about this for such a long time. Ever since the lockdown happened, I have been wanting to take a look at a particular biomedical dataset that looks extremely interesting to me: the OpenTrials dataset. If you are not familiar with it yet, this is how they describe it:
OpenTrials is a collaboration between Open Knowledge International and Dr Ben Goldacre from the University of Oxford DataLab. It aims to locate, match, and share all publicly accessible data and documents, on all trials conducted, on all medicines and other treatments, globally.
It's a super interesting initiative, and it really flows from the idea that in much of the very intensive, expensive biomedical research, we should be looking at how to better use and re-use the knowledge that we are building up. Kind of like what people in the CovidGraph.org initiative, het.io (remember the interview I did with Daniel - so great!) and others are doing.
Downloading and restoring the dataset
It's a bit hidden, but you can actually download a (slightly older, but still useful) copy of the OpenTrials database from their website. It comes as a Postgres dump file: I got the latest one from http://datastore.opentrials.net/public/opentrials-api-2018-04-01.dump.
I then installed Postgres on my laptop, as well as the super easy to use PgAdmin administration tooling.
I was then able to import the .dump file really easily, by running a few simple commands. I needed to take a bit of care with the security settings (you need sufficient privileges to do this kind of thing on a real server, naturally), but that was all there was to it.
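For reference, a restore like this typically boils down to two commands: `createdb` followed by `pg_restore`. Here is a minimal Python sketch that assembles them, assuming the dump is in Postgres' custom format. The database name `opentrials` and the `postgres` user are just placeholders, so adjust them to your own setup.

```python
import shlex
from pathlib import Path


def build_restore_commands(dump_file, dbname="opentrials", user="postgres"):
    """Return the createdb and pg_restore command lines as argv lists.

    dbname and user are placeholder defaults, not values from the post.
    """
    dump_path = Path(dump_file)
    create_db = ["createdb", "--username", user, dbname]
    restore = [
        "pg_restore",
        "--username", user,
        "--dbname", dbname,
        "--no-owner",       # do not try to recreate the original object owners
        "--no-privileges",  # skip the GRANT/REVOKE statements in the dump
        str(dump_path),
    ]
    return [create_db, restore]


if __name__ == "__main__":
    for argv in build_restore_commands("opentrials-api-2018-04-01.dump"):
        # Print the commands so they can be reviewed before running them.
        # To execute instead: subprocess.run(argv, check=True)
        print(shlex.join(argv))
```

The `--no-owner` and `--no-privileges` flags are one way to sidestep the ownership and permission statements in a dump when restoring on a local machine. On a real server you would want to think more carefully about who owns what.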
After these simple steps I had the OpenTrials database running, but then I wanted to get it imported into Neo4j. How to do that?
Importing the OpenTrials Postgres database into Neo4j
For this I used an import tool that takes care of three things:
- inspecting the schema of a relational database system (Postgres is supported, but many others are too!)
- mapping that relational schema onto a "default" graph data model, which you can tweak to your heart's desire
- importing the data from the relational database into a running Neo4j instance, in a variety of different ways (online/offline/...).
The default mapping is already super interesting, but in many cases you will want to tweak it rather than accept the suggestion as is. Relational database modelling and graph database modelling tend to be quite a bit different, because the considerations that drive each model are quite different as well. In the case of OpenTrials, I mainly had to play around with the datatype suggestions: not all suggested datatypes were appropriate or correct in this case, so I basically ended up turning all data elements of all labels into String data types. Clearly not ideal, but a good starting point that would not get in the way of anything.
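To make the idea of a "default" mapping a bit more tangible, here is a rough Python sketch of one common rule of thumb: ordinary tables become node labels, foreign keys become relationships, and pure join tables collapse into a single relationship. This is an illustration under my own assumptions, not the actual logic of any particular import tool, and the table and column names are simplified, hypothetical examples. The `force_string` switch mimics the "make everything a String" shortcut described above.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    columns: dict                                     # column name -> relational type
    primary_key: list
    foreign_keys: dict = field(default_factory=dict)  # column name -> referenced table


def is_join_table(table):
    """True if every column is a foreign key and there are exactly two of them."""
    return (len(table.foreign_keys) == 2
            and set(table.columns) == set(table.foreign_keys))


def default_graph_model(tables, force_string=True):
    """Map tables to node labels and foreign keys to relationship types."""
    nodes, relationships = {}, []
    for table in tables:
        if is_join_table(table):
            # A pure join table becomes one relationship between its two targets.
            source, target = table.foreign_keys.values()
            relationships.append((source, table.name.upper(), target))
            continue
        # Non-key columns become node properties, optionally all typed as String.
        nodes[table.name] = {
            column: ("String" if force_string else sql_type)
            for column, sql_type in table.columns.items()
            if column not in table.foreign_keys
        }
        # Each remaining foreign key becomes a relationship to the referenced table.
        for column, target in table.foreign_keys.items():
            relationships.append((table.name, column.upper(), target))
    return nodes, relationships


if __name__ == "__main__":
    # Hypothetical, heavily simplified schema for illustration only.
    schema = [
        Table("trials", {"id": "uuid", "public_title": "text"}, ["id"]),
        Table("conditions", {"id": "uuid", "name": "text"}, ["id"]),
        Table("trials_conditions",
              {"trial_id": "uuid", "condition_id": "uuid"},
              ["trial_id", "condition_id"],
              {"trial_id": "trials", "condition_id": "conditions"}),
    ]
    nodes, relationships = default_graph_model(schema)
    print(nodes)
    print(relationships)
```

Forcing everything to String loses type information such as dates and numbers, so any later query that needs those has to convert the values itself.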
Next up: pushing the import button:
I found that the entire process went extremely smoothly: after about 15-20 minutes everything was imported into my little laptop database, and ready to be explored.
Quick exploration of OpenTrials in Neo4j
Once the data is in, we can start looking at some stats on the number of trials that have been run for specific conditions.
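As an illustration, a "trials per condition" count could look roughly like the sketch below, using the official `neo4j` Python driver. The `Trial` and `Condition` labels, the `TRIALS_CONDITIONS` relationship type and the `name` property are assumptions on my part: the real names depend on how the import mapped the schema, so check them in your own database first.

```python
# Hypothetical labels, relationship type and property: adjust to your own graph.
TRIALS_PER_CONDITION = """
MATCH (t:Trial)-[:TRIALS_CONDITIONS]->(c:Condition)
RETURN c.name AS condition, count(t) AS trials
ORDER BY trials DESC
LIMIT $limit
"""


def trials_per_condition(session, limit=10):
    """Run the count query and return (condition, number of trials) pairs."""
    result = session.run(TRIALS_PER_CONDITION, limit=limit)
    return [(record["condition"], record["trials"]) for record in result]


def report(uri, user, password, limit=10):
    """Connect to a running Neo4j instance and print the top conditions."""
    # Imported lazily so the rest of the sketch works without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            for condition, trials in trials_per_condition(session, limit):
                print(f"{trials:>6}  {condition}")
```

Calling something like `report("bolt://localhost:7687", "neo4j", "<your password>")` would then print the conditions with the most trials, assuming a local instance on the default Bolt port.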
Getting a bit more personal: looking into Glanzmann's disease
There are a ton of other queries that you could run here. I have experimented a bit more with other diseases and illnesses, and also tried to take a look at the geographical perspective included in the dataset. I have put all of these queries together in a gist over here, for you to take a look at. Let me know if you find some other interesting stuff!
Hope this was another interesting read for you - I definitely enjoyed writing this up!
All the best
Rik