The Medical Devices dataset as a graph
This turned out to be surprisingly easy. Just download the Zip file from the ICIJ website: https://medicaldevices.icij.org/download/icij-imddb-2018-11-25.zip, unzip this, and then we get 3 comma-separated-values files:- one for the Devices that are being reported on
- one for the Events that are being reported (whenever something happens to a device (eg. a recall) then that is logged and reported)
- one for the Manufacturers of the medical devices.
In order to import it I had to take a look at the structure of the data and figure out a reasonable model. Here's how I did that. I first grabbed the headers of the three files, and put them into a google sheet. This allowed me to quickly get a first look at the structure:
- the Events referred to Devices by their IDs
- the Devices referred to their Manufacturers by their IDs
So let's do that.
First we import the 3 csv files:
Then we connect the events to the devices using their ID properties:
And we also connect the devices to their manufacturers using the ID properties:
That's already good, and happens pretty quickly. But as we have seen so often before, the graph model usually allows for some easy and quick additional normalisation, by splitting off some interesting properties into their own nodes, and connecting these to the original entity. We have a couple of interesting properties like that in this dataset, and in the spreadsheet above I have marked these in bright colours. The properties to be normalised are:
and then there was something similar with the Device Risk Class.
Import the Medical Devices dataset into our model
So here's a simple script to import the three files into the simplistic model. All of the details are also on github.First we import the 3 csv files:
USING PERIODIC COMMIT
LOAD CSV with headers from "file:///devices.csv" as line CREATE (d:Device) set d = line; USING PERIODIC COMMIT LOAD CSV with headers from "file:///events.csv" as line CREATE (e:Event) set e = line; USING PERIODIC COMMIT LOAD CSV with headers from "file:///manufacturers.csv" as line CREATE (m:Manufacturer) set m = line; CREATE index on :Event(id); CREATE index on :Device(id); CREATE index on :Manufacturer(id);
CREATE index on :Manufacturer(name);
|
Then we connect the events to the devices using their ID properties:
MATCH (e:Event), (d:Device)
WHERE d.id = e.device_id CREATE (d)-[:RELATED_TO]->(e); |
MATCH (d:Device), (m:Manufacturer)
WHERE d.manufacturer_id = m.id CREATE (m)-[:MANUFACTURES]->(d); |
That's already good, and happens pretty quickly. But as we have seen so often before, the graph model usually allows for some easy and quick additional normalisation, by splitting off some interesting properties into their own nodes, and connecting these to the original entity. We have a couple of interesting properties like that in this dataset, and in the spreadsheet above I have marked these in bright colours. The properties to be normalised are:
- Countries of the Events
- EventTypes of the Events
- Classification of the Devices
- Source of the Manufacturer information
- Source of the Event information
- the Parent Companies of the Manufacturers
Let's do that.
Normalise the Medical Devices dataset some more
The way to do this is pretty much the same for every property that we want to normalise:
- we create an index for the new Node and Property. This is necessary so that we can do efficient MERGE operations in the next step,
- We read the properties from the original event, and spin them out into new Nodes+Properties using a MERGE operation - which assures us that the relevant node will only get created once,
- and then finally we connect the original node to the new node with a new relationship. To do this efficiently and make sure that we batch the transaction, we do this using a fantastic apoc procedure called apoc.periodic.iterate. This makes it MUCH faster,
So heree are the queries
Normalise the Countries of EventsCREATE index on :Country(name);MATCH (e:Event) MERGE (c:Country {name: e.country}); CALL apoc.periodic.iterate(" MATCH (e:Event), (c:Country) WHERE e.country = c.name RETURN e,c", " MERGE (e)-[:LOCATED_IN]->(c)", {batchSize:5000, parallel:false}); |
Normalise the EventTypes of Events
CREATE index on :EventType(name);
MATCH (e:Event) MERGE (et:EventType {name: e.type}); CALL apoc.periodic.iterate(" MATCH (e:Event), (et:EventType) WHERE e.type = et.name RETURN e, et", " MERGE (e)-[:OF_TYPE]->(et)", {batchSize:5000, parallel:false}); |
Normalise the DeviceClassification of Devices
MATCH (d:Device)
MERGE (dc:DeviceClassification {name: d.classification}); CREATE index on :DeviceClassification(name); CALL apoc.periodic.iterate(" MATCH (d:Device), (dc:DeviceClassification) WHERE d.classification = dc.name RETURN d, dc", " MERGE (d)-[:HAS_CLASSIFICATION]->(dc)", {batchSize:5000, parallel:false}); |
Normalise the Source or Manufacturer information
MATCH (m:Manufacturer)
MERGE (s:Source {name: m.source}); CREATE index on :Source(name); CALL apoc.periodic.iterate(" MATCH (m:Manufacturer), (s:Source) WHERE m.source = s.name RETURN m, s", " MERGE (s)-[:IS_SOURCE_FOR]->(m)", {batchSize:5000, parallel:false}); |
Normalise the Source or Event information
MATCH (e:Event)
MERGE (s:Source {name: e.source}); CALL apoc.periodic.iterate(" MATCH (e:Event), (s:Source) WHERE e.source = s.name RETURN e, s", " MERGE (s)-[:IS_SOURCE_FOR]->(e)", {batchSize:5000, parallel:false}); |
Normalise the Parent Companies of Manufacturers
Create index on :ParentCompany(name);
MATCH (m:Manufacturer) WHERE m.parent_company is not null AND m.parent_company <> "" with m MERGE (pc:ParentCompany {name: m.parent_company}); CALL apoc.periodic.iterate(" MATCH (pc:ParentCompany), (m:Manufacturer) WHERE pc.name = m.parent_company RETURN pc, m", " MERGE (pc)-[:PARENT_OF]->(m)", {batchSize:5000, parallel:false}); |
Once we have done this, we can look at the schema in the database, and find that it is indeed getting quite interesting and graphy at this point.
Some data cleanup
When I then started looking into our model in the database, I found that there were couple of data quality and inconsistency issues that I should really clean up quite easily. So I did.Cleanup Event action classification
I found that events have actions associated with them, and that this action gets classified into categories. However, sometimes this category used "Class 1/2/3" indications, and sometimes it was "Class I/II/III". So let's just get that sorted with a simple query:
CREATE index on :Event(action_classification);
MATCH (e:Event) WHERE e.action_classification = "I" OR e.action_classification = "Class I" SET e.action_classification = "Class 1"; MATCH (e:Event) WHERE e.action_classification = "II" OR e.action_classification = "Class II" SET e.action_classification = "Class 2"; MATCH (e:Event) WHERE e.action_classification = "III" OR e.action_classification = "Class III" SET e.action_classification = "Class 3"; |
and then there was something similar with the Device Risk Class.
Cleanup Device Risk Class
Same thing: just making all the class numbers the same.
MATCH (d:Device)
WHERE d.risk_class = "II" SET d.risk_class = "2"; MATCH (d:Device) WHERE d.risk_class = "III" SET d.risk_class = 3; MATCH (d:Device) WHERE d.risk_class = "Unclassified" SET d.risk_class = "Not Classified"; |
Querying the ICIJ Medical Devices dataset in Neo4j
As I indicated earlier, I did not have a lot of time to experiment, yet, but I did start to do so - and here are some queries that I ran.
Parent companies with most events by class
Parent Companies are of course interesting entities. Let's see if we can group these by frequecy of the events occurring:
MATCH (pc:ParentCompany)-->(m:Manufacturer)-->(d:Device)-->(e:Event)
RETURN pc.name, e.action_classification, count(e) ORDER BY count(e) desc LIMIT 10; |
And of course makes me want to zoom into "Zimmer Biomet Holdings" - and what better way to do that then by using Bloom. Very quickly I get:
Devices with lots of events
I then started to look into some if the specific devices that had lots of events associated with them, and that had some importance in society - if they had been sold a lot. Unfortunately, I found that the "quantity in commerce" property in the Device nodes is not very reliable or workable:
gave me this:
So I think we would need to do some further data cleansing on that property in order to do any meaningful research.
MATCH (d:Device)--(e:Event)
WHERE d.quantity_in_commerce is not null
RETURN d.name, d.quantity_in_commerce, e.action_classification, count(e)
ORDER BY count(e) desc
LIMIT 10
|
The graph of breast implants
Next I started to look at some of the case studies of the ICIJ. For example: read the story of the breast implants, and let's look at the main companies in there: Allergan and Mentor. A simple query helps us out:
MATCH path = (m:Manufacturer)--(d:Device)--(e:Event)--(conn)
WHERE m.name contains "Allergan" or m.name contains "Mentor" RETURN path |
Then we can explore the graph in the Neo4j Browser quite easily:
Let's do one more story.
The graph of heart valves
The final graph that I wanted to show is the one described in this crazy story about the experimentation going on with heart valves: read it over here. Two companies (Medtronic and Edwards) seem to be the focus of the article. So let's explore. I started with Medtronic
MATCH path = (m:Manufacturer)--(d:Device)--(e:Event)--(conn)
WHERE m.name contains "Medtronic"
RETURN path
and very quickly found that the graph was absolutely crazy huge. No way to run a meaningful query in the Neo4j browser - but there are definitely meaningful ways to do it in Neo4j Bloom: just type the search phraseWHERE m.name contains "Medtronic"
RETURN path
manufacturer with name Medtronic with device with event
and you immediately see some interesting clusters appear.
Even more interesting when we add the Parent Company of Medtronic to the mix:
parentcompany with manufacturer with name Medtronic with device with event with country
in the search phrase box and we get this:
I can't help but feel that Medtronic is definitely worth some additional research. That graph is way too large to be good.
Similar queries for Edwards were a lot less heavy:
MATCH path = (m:Manufacturer)--(d:Device)--(e:Event)--(conn)
WHERE m.name contains "Edwards"
RETURN path
Gave me:WHERE m.name contains "Edwards"
RETURN path
That's all I have at the moment. I hope you found this interesting, and that you can play around with it as well a little more. There's definitely a wealth of interesting information in here.
As always, feedback welcome!
Cheers
Rik
No comments:
Post a Comment