Friday, 18 September 2020

OpenTrials in Neo4j - with a simple ETL job

I have been meaning to write about this for such a long time. Ever since the lockdown happened, I have been wanting to take a look at a particular biomedical dataset that looks extremely interesting to me: the OpenTrials dataset. If you are not familiar with this yet, this is what they say:

OpenTrials is a collaboration between Open Knowledge International and Dr Ben Goldacre from the University of Oxford DataLab. It aims to locate, match, and share all publicly accessible data and documents, on all trials conducted, on all medicines and other treatments, globally. 

It's a super interesting initiative, and it really flows from the idea that in much of the very intensive, expensive biomedical research, we should be looking at how to better use and re-use the knowledge that we are building up. Kind of like what people in the CovidGraph.org initiative, het.io (remember the interview I did with Daniel - so great!) and others are doing. 


Downloading and restoring the dataset

It's a bit hidden, but you can actually download a (slightly older, but still) dataset of the OpenTrials dataset from their website. The dataset is actually a Postgres dump file: I got the latest one from http://datastore.opentrials.net/public/opentrials-api-2018-04-01.dump

I then installed Postgres on my laptop, as well as the super easy to use PgAdmin administration tooling.

I was then able to import the .dump file really easily, by running a few simple commands. I needed to take a bit of care with the security settings (you need sufficient privileges to do this kind of thing on a real server, naturally), but that was all there was to it.



After these simple steps I had the OpenTrials database running, but then I wanted to get it imported into Neo4j. How to do that?

Importing the OpenTrials Postgres database into Neo4j

I know that there are a million ways that I could have done this, but it turns out that this has become SOOOOOO easy these days. All I needed to do was to install a Neo4j GraphApp for this: https://install.graphapp.io/ to the rescue. I grabbed the Neo4j ETL tool (the first app on the page) and installed it into my Neo4j Desktop

Neo4j ETL tool is a super cool piece of kit. If you read through the documentation, you will find that it has a really cool way of 
  • inspecting the schema of a relational database system (Postgres is supported, but many others are too!)
  • mapping that relational schema onto a "default" graph data model, which you can tweak to your hearts desire
  • importing the data from the relational database into a running Neo4j instance - in a variety of different ways (online/offline/...).
It really is quite neat. So let's take you through that exercise for the OpenTrials project. 

After installing the GraphApp, I can grab a running Neo4j database, and run the Neo4j ETL Tool against that by clicking the dropdown:
Then, I would need to load the JDBC connection to the running OpenTrials Postgres relational database, and the running Neo4j database:

After doing that, you will find that the ETL tool actually comes up with a suggested graph data model:

This is already super interesting - and in many cases you will want to tweak this and not go for the default suggestion here. Relational database modelling and graph database modelling tend to be quite a bit different, as the parameters of the modelling deployment are quite different as well. So you will want to tweak this. In the case of OpenTrials, I had to mainly play around with the datatype suggestions a bit: not all suggested datatypes were appropriate or correct in this case, so I basically ended up making all data elements of all labels into String data types. Clearly not ideal, but a good starting point that would not get into the way of anything.

Next up: pushing the import button:

I found that the entire process went extremely smoothly: after about 15-20mins everything was imported into my little laptop database, and ready to be explored.

Quick exploration of OpenTrials in Neo4j

I only did a very first little exploration of the dataset in Neo4j at this point, but already I thought it was quite interesting and easy to work with. The model - as imported from the ETL tool - now looks like this in Neo4j:
The stats all seem to match up with what we had in Postgres:


And we can start looking at some stats with regards to the number of Trials that have been run for specific conditions.
And by looking at this in some detail, we very quickly found out that many Trials seem to be related to a lot of Conditions - which is kind of interesting in and of itself. I am not knowledgeable about this domain at all, but I thought it was interesting - there's probably a very good explanation for this.

Getting a bit more personal: looking into Glanzmann's disease

So the data is all there to play with. There's lot's of place where you could take this, but I decided to take it sonewhere personal. I have been blessed with (touching wood!) a good health so far in my life, and so has most of my family been. There is/was one noteable exception: one of my cousin's sons has a very rare blood disease called Glanzmann Thrombasthenia. From the outside, this disease is a little bit like Haemophilia, where the symptoms include a very significant difficulty to make blood clots - resulting in very dangerous excessive bleeding if not treated. My family member has received very special and good care of the past few decades, and thankfully has been able to survive - but it's a very, very serious condition for which there are no cures and very few medications. So I wanted to see if, in the OpenTrials dataset, there was any research related to this available. Worth a shot, right?

I ran a very simple query:

match (c:Condition)--(conn) where toLower(c.name) contains "glanzmann" return c,conn;

and got this result:
Let's expand this a bit more:

//finding clincaltrials.gov
match (c:Condition)--(t:Trial)--(conn) 
where toLower(c.name) contains "glanzmann" 
return c,t, conn

And then we get:
In both of these results, a very interesting name immediately pops up: NovoSeven. This is a very, VERY expensive piece of medication that has saved my family member's life multiple times. It's the only thing that works. There's more to be explored there for sure - but I will keep it at this for now.

There's a ton of other queries that you could run here. I have experimented a bit more with other diseases and illnesses, and also tried to take a look at the geographical perspective included in the dataset. I have put all of these queries together in a gist over here, for you to take a look at. Let me know if you find out some other interesting stuff?


Hope this was another interesting read for you - I definitely enjoyed writing this up!


All the best


Rik



No comments:

Post a comment