Wednesday 28 November 2018

Working with the ICIJ Medical Devices dataset in Neo4j

Just last weekend our friends at the ICIJ published another really interesting case of investigative journalism - tracking down and publishing the quite absurd and disturbing practices of the medical devices industry. The entire case with all of the developing stories can be found at https://medicaldevices.icij.org/ - take a look as it really is quite fascinating. Of course that meant that I wanted to see what that data looked like in Neo4j, and if I could have a play. I didn't have time for a full detailed exploration yet - but hopefully this will also give others the opportunity to chime in. So let's see.

The Medical Devices dataset as a graph

This turned out to be surprisingly easy. Just download the Zip file from the ICIJ website: https://medicaldevices.icij.org/download/icij-imddb-2018-11-25.zip, unzip this, and then we get 3 comma-separated-values files:
  • one for the Devices that are being reported on
  • one for the Events that are being reported (whenever something happens to a device (eg. a recall) then that is logged and reported)
  • one for the Manufacturers of the medical devices.
That's easy enough.
In order to import it I had to take a look at the structure of the data and figure out a reasonable model. Here's how I did that. I first grabbed the headers of the three files, and put them into a google sheet. This allowed me to quickly get a first look at the structure:

I immediately noticed that 
  • the Events referred to Devices by their IDs
  • the Devices referred to their Manufacturers by their IDs
So I can start with importing and connecting these first into a very simple model:
So let's do that.

Import the Medical Devices dataset into our model

So here's a simple script to import the three files into the simplistic model. All of the details are also on github.

First we import the 3 csv files:
USING PERIODIC COMMIT
LOAD CSV with headers from "file:///devices.csv" as line
CREATE (d:Device)
set d = line;

USING PERIODIC COMMIT
LOAD CSV with headers from "file:///events.csv" as line
CREATE (e:Event)
set e = line;

USING PERIODIC COMMIT
LOAD CSV with headers from "file:///manufacturers.csv" as line
CREATE (m:Manufacturer)
set m = line;

CREATE index on :Event(id);
CREATE index on :Device(id);
CREATE index on :Manufacturer(id);
CREATE index on :Manufacturer(name);

Then we connect the events to the devices using their ID properties:
MATCH (e:Event), (d:Device)
WHERE d.id = e.device_id
CREATE (d)-[:RELATED_TO]->(e);
And we also connect the devices to their manufacturers using the ID properties:
MATCH (d:Device), (m:Manufacturer)
WHERE d.manufacturer_id = m.id
CREATE (m)-[:MANUFACTURES]->(d);

That's already good, and happens pretty quickly. But as we have seen so often before, the graph model usually allows for some easy and quick additional normalisation, by splitting off some interesting properties into their own nodes, and connecting these to the original entity. We have a couple of interesting properties like that in this dataset, and in the spreadsheet above I have marked these in bright colours. The properties to be normalised are:
  • Countries of the Events
  • EventTypes of the Events
  • Classification of the Devices
  • Source of the Manufacturer information
  • Source of the Event information
  • the Parent Companies of the Manufacturers
Let's do that. 

Normalise the Medical Devices dataset some more

The way to do this is pretty much the same for every property that we want to normalise: 
  1. we create an index for the new Node and Property. This is necessary so that we can do efficient MERGE operations in the next step,
  2. We read the properties from the original event, and spin them out into new Nodes+Properties using a MERGE operation - which assures us that the relevant node will only get created once,
  3. and then finally we connect the original node to the new node with a new relationship. To do this efficiently and make sure that we batch the transaction, we do this using a fantastic apoc procedure called apoc.periodic.iterate. This makes it MUCH faster,
So heree are the queries

Normalise the Countries of Events

CREATE index on :Country(name);

MATCH (e:Event)
MERGE (c:Country {name: e.country});

CALL apoc.periodic.iterate("
MATCH (e:Event), (c:Country) WHERE e.country = c.name RETURN e,c", "
MERGE (e)-[:LOCATED_IN]->(c)",
{batchSize:5000, parallel:false});

Normalise the EventTypes of Events

CREATE index on :EventType(name);

MATCH (e:Event)
MERGE (et:EventType {name: e.type});

CALL apoc.periodic.iterate("
MATCH (e:Event), (et:EventType) WHERE e.type = et.name RETURN e, et", "
MERGE (e)-[:OF_TYPE]->(et)",
{batchSize:5000, parallel:false});


Normalise the DeviceClassification of Devices

MATCH (d:Device)
MERGE (dc:DeviceClassification {name: d.classification});

CREATE index on :DeviceClassification(name);

CALL apoc.periodic.iterate("
MATCH (d:Device), (dc:DeviceClassification) WHERE d.classification = dc.name RETURN d, dc", "
MERGE (d)-[:HAS_CLASSIFICATION]->(dc)",
{batchSize:5000, parallel:false});


Normalise the Source or Manufacturer information

MATCH (m:Manufacturer)
MERGE (s:Source {name: m.source});

CREATE index on :Source(name);

CALL apoc.periodic.iterate("
MATCH (m:Manufacturer), (s:Source) WHERE m.source = s.name RETURN m, s", "
MERGE (s)-[:IS_SOURCE_FOR]->(m)",
{batchSize:5000, parallel:false});

Normalise the Source or Event information

MATCH (e:Event)
MERGE (s:Source {name: e.source});

CALL apoc.periodic.iterate("
MATCH (e:Event), (s:Source) WHERE e.source = s.name RETURN e, s", "
MERGE (s)-[:IS_SOURCE_FOR]->(e)",
{batchSize:5000, parallel:false});


Normalise the Parent Companies of Manufacturers

Create index on :ParentCompany(name);

MATCH (m:Manufacturer)
WHERE m.parent_company is not null AND m.parent_company <> ""
with m
MERGE (pc:ParentCompany {name: m.parent_company});

CALL apoc.periodic.iterate("
MATCH (pc:ParentCompany), (m:Manufacturer) WHERE pc.name = m.parent_company RETURN pc, m", "
MERGE (pc)-[:PARENT_OF]->(m)",
{batchSize:5000, parallel:false});

Once we have done this, we can look at the schema in the database, and find that it is indeed getting quite interesting and graphy at this point.

Some data cleanup

When I then started looking into our model in the database, I found that there were couple of data quality and inconsistency issues that I should really clean up quite easily. So I did.

Cleanup Event action classification

I found that events have actions associated with them, and that this action gets classified into categories. However, sometimes this category used "Class 1/2/3" indications, and sometimes it was "Class I/II/III". So let's just get that sorted with a simple query:

CREATE index on :Event(action_classification);

MATCH (e:Event)
WHERE e.action_classification = "I" OR e.action_classification = "Class I"
SET e.action_classification = "Class 1";

MATCH (e:Event)
WHERE e.action_classification = "II" OR e.action_classification = "Class II"
SET e.action_classification = "Class 2";

MATCH (e:Event)
WHERE e.action_classification = "III" OR e.action_classification = "Class III"
SET e.action_classification = "Class 3";

and then there was something similar with the Device Risk Class.

Cleanup Device Risk Class

Same thing: just making all the class numbers the same.
MATCH (d:Device)
WHERE d.risk_class = "II"
SET d.risk_class = "2";

MATCH (d:Device)
WHERE d.risk_class = "III"
SET d.risk_class = 3;

MATCH (d:Device)
WHERE d.risk_class = "Unclassified"
SET d.risk_class = "Not Classified";

Querying the ICIJ Medical Devices dataset in Neo4j

As I indicated earlier, I did not have a lot of time to experiment, yet, but I did start to do so - and here are some queries that I ran.

Parent companies with most events by class

Parent Companies are of course interesting entities. Let's see if we can group these by frequecy of the events occurring:
MATCH (pc:ParentCompany)-->(m:Manufacturer)-->(d:Device)-->(e:Event)
RETURN pc.name, e.action_classification, count(e)
ORDER BY count(e) desc
LIMIT 10;
That gives us:


And of course makes me want to zoom into "Zimmer Biomet Holdings" - and what better way to do that then by using Bloom. Very quickly I get:
which I can then explore more.

Devices with lots of events

I then started to look into some if the specific devices that had lots of events associated with them, and that had some importance in society - if they had been sold a lot. Unfortunately, I found that the "quantity in commerce" property in the Device nodes is not very reliable or workable:
MATCH (d:Device)--(e:Event)
WHERE d.quantity_in_commerce is not null
RETURN d.name, d.quantity_in_commerce, e.action_classification, count(e)
ORDER BY count(e) desc
LIMIT 10
gave me this:
So I think we would need to do some further data cleansing on that property in order to do any meaningful research.

The graph of breast implants

Next I started to look at some of the case studies of the ICIJ. For example: read  the story of the breast implants, and let's look at the main companies in there: Allergan and Mentor. A simple query helps us out:

MATCH path = (m:Manufacturer)--(d:Device)--(e:Event)--(conn)
WHERE m.name contains "Allergan" or m.name contains "Mentor"
RETURN path

Then we can explore the graph in the Neo4j Browser quite easily:
Let's do one more story.

The graph of heart valves

The final graph that I wanted to show is the one described in this crazy story about the experimentation going on with heart valves: read it over here. Two companies (Medtronic and Edwards) seem to be the focus of the article. So let's explore. I started with Medtronic

MATCH path = (m:Manufacturer)--(d:Device)--(e:Event)--(conn)
WHERE m.name contains "Medtronic"
RETURN path
and very quickly found that the graph was absolutely crazy huge. No way to run a meaningful query in the Neo4j browser - but there are definitely meaningful ways to do it in Neo4j Bloom: just type the search phrase

manufacturer with name Medtronic with device with event

and you immediately see some interesting clusters appear.


Even more interesting when we add the Parent Company of Medtronic to the mix:

parentcompany with manufacturer with name Medtronic with device with event with country

in the search phrase box and we get this:

I can't help but feel that Medtronic is definitely worth some additional research. That graph is way too large to be good.

Similar queries for Edwards were a lot less heavy:
MATCH path = (m:Manufacturer)--(d:Device)--(e:Event)--(conn)
WHERE m.name contains "Edwards"
RETURN path
Gave me:



That's all I have at the moment. I hope you found this interesting, and that you can play around with it as well a little more. There's definitely a wealth of interesting information in here.

As always, feedback welcome!

Cheers

Rik

No comments:

Post a Comment