Tuesday 21 April 2020

(Covid-19) Contact Tracing Blogpost - part 1/4

Part 1/4: creating and importing a synthetic contact tracing graph

As we are living through these very interesting times, and many countries are still going through a massive operation to slow down the spread of the SARS-CoV-2 virus and the devastating effects of the CoViD-19 disease it causes, there is of course also a lot of discussion going on about what we will do after the initial surge of the virus has passed, and when the various countries and regions will start opening up their economies.

A tactic many countries seem to be taking is the implementation of some kind of Contact Tracing. Using the technology on our phones and our pervasive internet connectivity, we could imagine a way to implement "distancing" and isolation of people who are either already victims of, or vulnerable to, CoViD-19. This seems like a logical and useful tactic that could help us open up our economies for business, while still maintaining the basic attitude of wanting to "flatten the curve". Of course there are still many, many issues with this approach, not least with regard to patient privacy and political freedoms, but it seems like an interesting track to explore, at least. Many government organisations have therefore started to explore this, and are working with some of the industry giants like Google and Apple to make it a reality.

This evolution started a whole range of discussions inside Neo4j, especially with regard to the usefulness of a graph database for making sense of some of these contact tracing databases. I remember reading Christakis and Fowler's book Connected, and understanding that virus outbreaks are one of those cases where our direct contacts don't necessarily matter - or at least do not matter alone. Indirect contacts, between our friends' friends' friends, can be just as important. So lots of interesting, graph-oriented questions arise: How could we maximise the effect of our distancing measures, and of any contact tracing applications that we put in place? How could we use the excellent predictive power of the graph to find out which of a person's connections could be most risky? How can we use graph analytics to better understand the structural strengths and weaknesses of our social networks? And many more.

So, being locked down myself (although Belgium clearly has a much softer stance than, for example, France or Italy), I thought I would spend some time exploring this. That's what this blogpost series is going to be about - so let's get right to it.


Creating the synthetic dataset

Of course, today, we don't have a contact tracing dataset available to play with - so I had to create one myself. That was the first assignment, which I joyfully embraced. As usual, I turned to my beloved Google Sheets to do that, and created this spreadsheet to help us along. It's a very simple dataset, containing a handful of entities, and it's available for you to play with over here. But let me take you through it.

Person worksheet

The first sheet you will find in the workbook is the Person worksheet. I have put the following data in this sheet:

  • PersonId: just an id for a Person
  • PersonName: from a random name generator that I found online.
  • Healthstatus: randomly chosen between "Sick" and "Healthy"
  • ConfirmedTime: randomly chosen in the last 30 days. This field is meant to refer to the date/time that a person got tested, and got their health status (sick or healthy) confirmed. Note that this uses a datetime data type, and can be used for lots of interesting queries afterwards.


As you will see later on, we will be using the CSV export of this sheet in our import steps. You can find that CSV over here.
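To give you an idea of what that looks like, here is what a row of that CSV might contain. The name and values below are invented for illustration - note that ConfirmedTime needs to be in a format that Cypher's datetime() function can parse, such as ISO 8601:

PersonId,PersonName,Healthstatus,ConfirmedTime
1,Jane Smith,Sick,2020-04-08T10:30:00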

Place worksheet

The second sheet that you will find in the workbook is the Place worksheet. This one contains the following information:

  • PlaceId: just an ID for every place
  • PlaceName: just a random place name
  • PlaceType: randomly chosen from a list of place types (Restaurant, Bar, Mall, Hospital, Park, School, Theater, Grocery shop) which you can actually see on the "Metah" worksheet
  • Lat and Long: geospatial coordinates, randomly calculated in the neighborhood of a city center. We will use these to generate geospatial data type properties later on - so that we can actually use distance calculations etc. I have currently generated these random points in the vicinity of the center of Antwerp, where I live.


As you will see later on, we will be using the CSV export of this sheet in our import steps. You can find that CSV over here.

Visits worksheet

The last sheet in this synthetic data workbook is all about the Visits that the Persons make to the Places - tying it all together, so to speak. Here are the property fields that I have included here:

  • VisitId: just an ID for the Visit
  • PersonId: the ID of a Person, randomly selected from the list of persons
  • PlaceId: the ID of a Place, randomly selected from the list of places
  • StartTime: a randomly chosen time in the last 30 days
  • EndTime: the StartTime plus a random number between 0 and 1 - since spreadsheets store timestamps as fractional day numbers, this makes every visit last up to 24 hours


For our import later on, we can download this worksheet as a CSV file from this location.

Now that we have this Google Sheet in good order, we can continue with the actual import of this data into a Neo4j database.

Importing the dataset

Once we have the dataset in the right format, it's pretty easy to perform the actual import - the tools to do this have become so fast and straightforward these days that the scripts (all available on this github repo) are really quite simple.

Import the Persons

Here's the first step:

load csv with headers from
"https://docs.google.com/spreadsheets/u/0/d/1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE/export?format=csv&id=1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE&gid=0" as csv
create (p:Person {id: csv.PersonId, name:csv.PersonName, healthstatus:csv.Healthstatus, confirmedtime:datetime(csv.ConfirmedTime)});

Nothing special about this, except for the fact that I have used a datetime data type for the confirmedtime property.
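If you want to quickly check that the datetime conversion worked as expected, a small query like this one (just a sanity check - not part of the import itself) lists the most recently confirmed persons:

match (p:Person)
return p.name, p.healthstatus, p.confirmedtime
order by p.confirmedtime desc
limit 5;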

Import the Places

Then we can import all the places in our graph:

load csv with headers from
"https://docs.google.com/spreadsheets/u/0/d/1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE/export?format=csv&id=1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE&gid=205425553" as csv
create (p:Place {id: csv.PlaceId, name:csv.PlaceName, type:csv.PlaceType, location:point({x: toFloat(csv.Lat), y: toFloat(csv.Long)})});

Again, the only special thing here is the location property, which uses a geospatial data type. Note that the x and y keys create a point in the cartesian coordinate system; if you wanted true geographic (WGS 84) points, you would use the latitude and longitude keys instead.
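Just to illustrate what that location property enables, here's a little sketch of a query (not part of the import) that finds the pairs of places closest to each other. Because we used cartesian points, distance() returns a value in the units of the coordinate system rather than in metres:

match (a:Place), (b:Place)
where id(a) < id(b)
return a.name as place1, b.name as place2, distance(a.location, b.location) as dist
order by dist asc
limit 5;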

Then to make things easy for the next step in the process (the creation of the Visits), we are adding some indexes.

create index on :Place(id);
create index on :Place(location);
create index on :Place(name);
create index on :Person(id);
create index on :Person(name);
create index on :Person(healthstatus);
create index on :Person(confirmedtime);
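Note that this is the Neo4j 3.x index syntax. If you are running this on a Neo4j 4.x database, the same indexes would be created with the newer syntax, for example:

create index for (pl:Place) on (pl.id);
create index for (p:Person) on (p.id);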

And last but not least, we are importing the Visits.

Import the Visits

Again, since we prepared the dataset quite well, we can do this very easily. You will find that I am creating a bit of duplicate data here, which in my experience is quite normal in a graph dataset: we are connecting the Person to a Visit to a Place, as well as directly connecting the Person to the Place via a VISITS relationship.

load csv with headers from
"https://docs.google.com/spreadsheets/d/1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE/export?format=csv&id=1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE&gid=1261126668" as csv
match (p:Person {id:csv.PersonId}), (pl:Place {id:csv.PlaceId})
create (p)-[:PERFORMS_VISIT]->(v:Visit {id:csv.VisitId, starttime:datetime(csv.StartTime), endtime:datetime(csv.EndTime)})-[:LOCATED_AT]->(pl)
create (p)-[vi:VISITS {id:csv.VisitId, starttime:datetime(csv.StartTime), endtime:datetime(csv.EndTime)}]->(pl)
set v.duration=duration.inSeconds(v.starttime,v.endtime)
set vi.duration=duration.inSeconds(vi.starttime,vi.endtime);

We are also calculating the duration of the visit, and setting it as a property on both the node and the relationship.
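That duration property makes for some easy querying afterwards. As a quick illustration (again, just a sketch), this query lists the longest visits in the dataset - sorting on the seconds component of the duration, which duration.inSeconds conveniently fills in:

match (p:Person)-[:PERFORMS_VISIT]->(v:Visit)-[:LOCATED_AT]->(pl:Place)
return p.name as person, pl.name as place, v.duration as duration
order by v.duration.seconds desc
limit 10;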

As a last and totally optional step, I am also connecting the Places to a Region node, just to highlight the general area where places are located.

create (r:Region {name:"Antwerp"})-[:PART_OF]->(c:Country {name:"Belgium"})-[:PART_OF]->(co:Continent {name:"Europe"});
match (r:Region {name:"Antwerp"}), (pl:Place)
create (pl)-[:PART_OF]->(r);
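With that hierarchy in place, a simple variable-length path query (just an illustration) immediately tells us how many places sit under every level of the hierarchy:

match (pl:Place)-[:PART_OF*]->(area)
return labels(area)[0] as level, area.name as name, count(pl) as nrOfPlaces;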

Once we have done that, we can see that the import operation has succeeded in a matter of seconds, and that the resulting model looks like what we had expected.

Now we can proceed to play around with some queries on this shiny new database. That will be what we do in part 2 of this blogpost series.

Hope this was already interesting - comments welcome as always.

Cheers

Rik
