Tuesday 21 April 2020

(Covid-19) Contact Tracing Blogpost - part 2/4

Part 2/4: Querying the contact tracing graph

Note that these queries require the following environment: Neo4j Desktop 1.2.7, and either Neo4j Enterprise 3.5.17 with apoc 3.5.0.9, or Neo4j Enterprise 4.0.3 with apoc 4.0.0.6 (NOT later! a bug in apoc.coll.max/apoc.coll.min needs to be resolved first)

In Part 1 we created and imported a contact tracing graph. Now, we are ready to experiment with some interesting graphy queries.

The most interesting part about many of these queries, I find, is that they all rely on the fundamental principle of "hypothesis-free querying". What I mean by this is that graph queries, in my experience and opinion, have this wonderful quality that you can interact with the data in a way that does not require you to hypothesize too much about the structure of the dataset. This is important, because very often I just won't know what I don't know, and making meaningful hypotheses is actually really hard. The fact that we don't have to do that is a great win.

As always, you will find all the queries on github, so that you can have a play with them yourself as well. So let's dive right in.

Who has a sick person potentially infected?

To answer that, I will "grab" a sick person from the dataset, and then just walk the dataset from the person to the other persons that are currently healthy. The query goes like this:

match (p:Person {healthstatus:"Sick"})
with p
limit 1
match (p)--(v1:Visit)--(pl:Place)--(v2:Visit)--(p2:Person {healthstatus:"Healthy"})
return p.name as Spreader, v1.starttime as SpreaderStarttime, v1.endtime as SpreaderEndtime, pl.name as PlaceVisited, p2.name as Target, v2.starttime as TargetStarttime, v2.endtime as TargetEndtime;



I get the results in no time:


Or this slightly modified query gives me the results in a visual graph format:

match (p:Person {healthstatus:"Sick"})
with p
limit 1
match path = (p)-->(v1:Visit)-->(pl:Place)<--(v2:Visit)<--(p2:Person {healthstatus:"Healthy"})
return path;

And again the results show very quickly:

Of course, I should know by now that I have actually simplified the data model by closing the triangle:

And that we can therefore simplify the query by using the VISITS relationship:

match (p:Person {healthstatus:"Sick"})
with p
limit 1
match path = (p)-[:VISITS]->(pl:Place)<-[:VISITS]-(p2:Person {healthstatus:"Healthy"})
return path;

The resulting graph is also quite a bit more readable because of it.

Now we can go a step further, and make this even more sophisticated.

Higher risk of infection: Time overlap between visits

The idea in this next set of queries is to not just look at the connections between sick and healthy people who visited a specific location, but to actually look at the timing of these visits, and whether or not they overlap. In other words, if a healthy person visited a location while the sick person was there, then that could increase the healthy person's chances of being infected.

So I needed to see how two time windows would overlap. There are quite a few scenarios here, but I ended up following the developer knowledge base article over here. It uses logical reduction to assess the overlap: two ranges overlap if the following applies

Max(StartA, StartB) <= Min(EndA, EndB)

This essentially means that the latest of the start times of our visits must occur before (or at the same time as) the earliest of the end times of the visits for the ranges to overlap. This we can work with, and write a query for:

match (p:Person {healthstatus:"Sick"})-[v1:VISITS]->(pl:Place)
with p,v1,pl
limit 10
match path = (p)-[v1]->(pl)<-[v2:VISITS]-(p2:Person {healthstatus:"Healthy"})
with path, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
return path;

In the example above, we have had to convert the start and end times to epochMillis, and then use the apoc.coll.max and apoc.coll.min functions to find the maximum and minimum that we need to compare. Nothing too complicated.
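
To see the overlap test in isolation, here is a minimal sketch that needs neither the graph nor APOC. The two visit windows are made-up literal values, not data from the dataset, and the CASE expressions are only a stand-in for apoc.coll.max and apoc.coll.min, which works here because exactly two values are being compared:

```cypher
// Two hypothetical visit windows (illustrative values, not taken from the dataset)
WITH datetime('2020-04-21T10:00:00Z') AS startA, datetime('2020-04-21T11:00:00Z') AS endA,
     datetime('2020-04-21T10:30:00Z') AS startB, datetime('2020-04-21T12:00:00Z') AS endB
// With only two values, CASE can stand in for apoc.coll.max / apoc.coll.min
WITH CASE WHEN startA.epochMillis > startB.epochMillis
          THEN startA.epochMillis ELSE startB.epochMillis END AS maxStart,
     CASE WHEN endA.epochMillis < endB.epochMillis
          THEN endA.epochMillis ELSE endB.epochMillis END AS minEnd
// The windows overlap when the latest start is not after the earliest end
RETURN maxStart <= minEnd AS overlaps;
```

Here the latest start (10:30) comes before the earliest end (11:00), so overlaps comes back true; moving window B to 11:30-12:00 would make it false. The APOC-based query above is the one that was run against the graph, and its result is what is referred to next.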

The result looks like this:


On to the next query.

Find sick person that has visited places since being infected

There are obviously some more interesting questions that we could ask about the visits that a person makes to different places. We have timing information for the health status and for the visits that the person makes, so the question naturally becomes whether a sick person has visited a place after they were confirmed to be sick. Here's a query:

match (p:Person {healthstatus:"Sick"})-[visited]->(pl:Place)
where p.confirmedtime < visited.starttime
return p, visited, pl
limit 10;

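There are a few results here:

And we can refine that even more, by looking at whether or not that sick person has visited that place more than once (by including the place in the pattern more than once). Note that the comparison is flipped here: the query keeps repeat visits where at least one of the two started before the confirmation time: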

match (pl:Place)<-[v2]-(p:Person {healthstatus:"Sick"})-[v1]->(pl:Place)
where p.confirmedtime > v1.starttime
or p.confirmedtime > v2.starttime
return *;

This also gives some interesting results:

Let's look at some other interesting ideas.

Risk of healthy people increases with overlap time

As you have seen above, we have created a few queries to understand the overlap of visits between different people. The next logical step, of course, is to understand how much overlap time there was between sick and healthy people, assuming that more overlap time means a higher risk for the healthy person.

To do this, we are using the same principle as above to find the "overlapping" healthy and sick people, and then we sum up all of the overlap times (defined as the difference between the minimum end time and the maximum start time, which is in milliseconds because we compare epochMillis values).
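
As a small worked example of that filter-and-sum step, the sketch below runs on three made-up [maxStart, minEnd] pairs for a single hypothetical person. The numbers are illustrative assumptions, not values from the dataset, and the conversion to minutes is an addition for readability:

```cypher
// Hypothetical [maxStart, minEnd] pairs in epoch milliseconds for one healthy person
UNWIND [[0, 1800000], [3600000, 4500000], [9000000, 8000000]] AS w
WITH w[0] AS maxStart, w[1] AS minEnd
// The third pair has maxStart > minEnd, so it is not an overlap and is dropped
WHERE maxStart <= minEnd
RETURN sum(minEnd - maxStart) AS overlaptime,              // 1800000 + 900000 = 2700000 ms
       sum(minEnd - maxStart) / 60000 AS overlapMinutes;   // 45 minutes
```

The query over the graph applies the same filter and sum to every healthy person's overlapping visits: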

match (hp:Person {healthstatus:"Healthy"})-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(sp:Person {healthstatus:"Sick"})
with hp, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
return hp.name, hp.healthstatus, sum(minEnd-maxStart) as overlaptime
order by overlaptime desc;

This then gives you a kind of "ranking" of risk for the healthy people, as below:


We can of course also look at this visually:

match (hp:Person {healthstatus:"Healthy"})-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(sp:Person {healthstatus:"Sick"})
with hp, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
with hp, sum(minEnd-maxStart) as overlaptime
order by overlaptime desc
limit 10
match (hp)-[v]-(pl:Place)
return hp,v,pl;

And then we get this graph representation for the "first" (or highest risk) 10 people:

On to our last set of example queries.

Finding the hotspots

The final set of queries that I experimented with in this dataset has to do with the hotspots, i.e. the places that seem to be attracting more sick people than others.

Here's a query that counts the visits to a place (using a degree function in apoc), and compares that with the number of visits by sick people - and therefore also calculates a relative "riskiness" for a place:

match (p:Person {healthstatus:"Sick"})-[v:VISITS]->(pl:Place)
with distinct pl.name as placename, count(v) as nrofsickvisits, apoc.node.degree.in(pl,'VISITS') as totalnrofvisits
order by nrofsickvisits desc
limit 10
return placename, nrofsickvisits, totalnrofvisits, round(toFloat(nrofsickvisits)/toFloat(totalnrofvisits)*10000)/100 as percentageofsickvisits;
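
The rounding expression in the return clause can be checked on its own. This is a minimal sketch with made-up counts (3 sick visits out of 7 in total, an assumption for illustration only):

```cypher
// Hypothetical counts for one place: 3 visits by sick people out of 7 visits in total
WITH 3 AS nrofsickvisits, 7 AS totalnrofvisits
// toFloat avoids integer division (3/7 would truncate to 0);
// scaling by 10000, rounding, then dividing by 100 keeps two decimals of a percentage
RETURN round(toFloat(nrofsickvisits) / toFloat(totalnrofvisits) * 10000) / 100 AS percentageofsickvisits;
```

With these numbers the expression evaluates to 42.86, i.e. roughly 43% of the visits to that place were by sick people. Run against the graph, the full query above produces the ranking that is referred to next.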

That gives us this result:

Which we can also represent graphically quite easily:

match (p:Person {healthstatus:"Sick"})-[v:VISITS]->(pl:Place)
with distinct pl.name as placename, count(v) as nrofsickvisits, pl
order by nrofsickvisits desc
limit 10
match (pl)<-[v]-(p:Person)
return pl,p,v;


That brings me to the end of the basic querying that I did on this dataset to illustrate some of the really neat things that we can do here. In part 3 of this series I will take it one step further, and look at some interesting graph analytics.

Hope this was already interesting - comments welcome as always.

Cheers

Rik

PS: part 1 is over here, in case you missed it

