Tuesday, 21 April 2020

(Covid-19) Contact Tracing Blogpost - part 3/4

Part 3/4: Graph Analytics on the contact tracing graph

Note that these queries require the following environment: Neo4j Desktop 1.2.7, Neo4j Enterprise 3.5.17, APOC 3.5.0.9 and GDS 1.1. At the time of writing, Neo4j 4.0.3 is not yet supported by GDS 1.1.

One of the fantastic qualities of the graph data model, I have always found, is that it can give you interesting insights without even looking at the data itself. The structure of the network alone can lead you to revelations that you would not have considered before. That is why Neo4j has invested a ton of effort in giving our industry a completely new set of capabilities for discovering these structural insights more easily, in the form of the new Graph Data Science Library. We have recently released the product, and you should read up on it in detail - and I think it would be a great and interesting idea to explore it on the Contact Tracing dataset that we built in part 1 and queried in part 2.

Some data prep for analytics: inferring a new relationship

In order to do that, there's actually something missing: a relationship between two Persons that infers the fact that those two people have MET. We can derive that from the overlap between their visits to the same place, leveraging a query from part 2. So this is what we are going to do: create a MEETS relationship between two Person nodes, based on the overlap between their visits, like this:

// find pairs of people that visited the same place
match (p1:Person)-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(p2:Person)
where id(p1)<id(p2)
// compute the overlap window between the two visits
with p1, p2, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
// only keep pairs of visits that actually overlap in time
where maxStart <= minEnd
// add up the overlaps of all their joint visits
with p1, p2, sum(minEnd-maxStart) as meetTime
create (p1)-[:MEETS {meettime: duration({seconds: meetTime/1000})}]->(p2);


As you can see, we are storing the length of the inferred meeting as a duration property on the relationship. The result appears very quickly.
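If you want to verify the result for yourself, a quick count of the newly created relationships is a simple sanity check (the exact number will of course depend on your generated dataset):

// count the number of inferred MEETS relationships
match ()-[m:MEETS]->()
return count(m) as numberOfMeetings;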


And as we explore some of these MEETS relationships, we can see that they totally add up.
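For example, a query along these lines - just a sketch, feel free to adjust the LIMIT - lets us eyeball a few meetings together with the visits they were inferred from:

// look at a few inferred meetings, together with the underlying visits
match (p1:Person)-[m:MEETS]->(p2:Person)
match (p1)-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(p2)
return p1.name as person1, p2.name as person2, pl, m.meettime as meettime,
       v1.starttime as visit1start, v2.starttime as visit2start
limit 10;

Now, we can use this for our Graph Data Science ideas.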

About graph data science algorithms

We have used some of these algorithms before on this blog, but I think it is still useful to reiterate the basics here. As I mentioned before, graphs have this unique characteristic of enabling insight from structure. What I mean by this is that the actual data does not even matter for us to learn something new about the dataset: the structural characteristics, how things are connected together, already tell us something. Entire books have been written about this, and my go-to resource for learning more about this stuff is always going to be Networks, Crowds, and Markets by David Easley and Jon Kleinberg. It even has an entire chapter about epidemiology, which I would highly recommend.

To work with these structural analyses, the academic community, together with the software industry, has come up with an entire set of algorithms to perform them. Broadly speaking, we can categorize these analysis methods into the following 4 categories:
  • Centrality analysis: this is a category of analysis that tries to assess the importance of distinct nodes in the network
  • Pathfinding: understanding how to get from one part of the network to another, by walking along the relationships between the nodes
  • Similarity analysis: understanding how similar different elements in a network are to one another
  • Community detection: understanding whether, based on some of the structural and non-structural characteristics of a network element, it belongs together with other network elements, as part of the same group / clique / community.
No doubt there are even more analyses possible, but at Neo4j we have been focussing on these categories, implementing them in a new set of tools that allow for very efficient experimentation and operationalisation of these analyses. We do this in the Graph Data Science Library, which was originally prototyped as part of Neo4j Labs, and has now graduated into a professionally engineered product offered by Neo4j.

In the next section of this post, we will be using this toolset to understand and assess the parts of the Contact Tracing graph that we have built. Let's get to that now.

Running Graph Data Science Algos

To run our graph data science algorithms, we are going to install the Graph Data Science Library plugin. This is super easy to do in the Neo4j Desktop, but you can also refer to the installation instructions over here.


Simply run

RETURN gds.version()

to check that the installation went well. Once we have that, we can start using the library for our analysis from the command line or from the Neo4j Browser, but I personally really like to use the Graph Data Science / Graph Algorithms Playground graphapp, aka Neuler. You can install this from https://install.graphapp.io/.
Once installed, you can open the graphapp from the menu in the Neo4j Desktop, and then we are ready to start running some algorithms. We will cover three of them here: two centrality algorithms to understand the importance of Person nodes (based on their PageRank and their betweenness), and one community detection algorithm (based on the Louvain algorithm). Let's do that.
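Before we dive into Neuler: you can also ask the library itself what it has on board. The gds.list() procedure returns the catalogue of available procedures - a quick sketch (the exact YIELD columns may vary slightly between GDS versions):

// list the algorithm procedures that the library exposes
CALL gds.list()
YIELD name, description
RETURN name, description
ORDER BY name;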



Graph Algo nr 1: PageRank

PageRank is probably the most famous algorithm there is in the graph world, because all of us use it every day. It is the basic algorithm that was, and still is, key to the success of Google, and it is very well described over here. It is frequently used in lots of domains other than web search, as it counts the number and the quality of the links pointing to a node to estimate how important that node is. It works very well in Neo4j - and is super easy to spin up in Neuler:
Above you can see the table results of running the algorithm, and the Code tab gives you a detailed view of what is being done under the hood:

:param limit => (10);
:param config => ({
    nodeProjection: 'Person',
    relationshipProjection: {
        relType: {
            type: 'MEETS',
            orientation: 'NATURAL',
            properties: {}
        }
    },
    relationshipWeightProperty: null,
    dampingFactor: 0.85,
    maxIterations: 20,
    writeProperty: 'pagerank'
});

CALL gds.pageRank.write($config);
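As an aside: if you would rather inspect the scores without writing anything back to the graph, the same kind of projection also works in stream mode. This is just a sketch, using an anonymous projection over the same Person / MEETS graph:

// stream PageRank scores instead of writing them to the graph
CALL gds.pageRank.stream({
    nodeProjection: 'Person',
    relationshipProjection: 'MEETS',
    dampingFactor: 0.85,
    maxIterations: 20
})
YIELD nodeId, score
// resolve the internal node id back to the actual Person node
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
LIMIT 10;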

To look at the results, which have been stored as a "pagerank" property on all the Person nodes, we run this:

MATCH (node)
WHERE not(node[$config.writeProperty] is null)
RETURN node.name as name, node[$config.writeProperty] AS pagerank, node.betweenness as betweenness
ORDER BY pagerank DESC
LIMIT 10;

And we see the results:

Note that the "betweenness" property is still empty at this point - we will calculate this below. You can also look at the top 10 pagerank nodes and their neighborhoods visually:

MATCH (node)
WHERE not(node[$config.writeProperty] is null)
with node, node[$config.writeProperty] AS score
ORDER BY score DESC
LIMIT 10
MATCH (node)-[r]-(conn)
RETURN node, r, conn;

It looks like this:

This should give us a first indication of the important nodes in our graph - and of course we will complement that with the metrics below.

Graph Algo nr 2: Betweenness

The next algorithm is, in my book, even more fascinating. Look at Betweenness centrality for some more background, but essentially this algorithm provides us with a metric for the importance of a node - not because of its data, but because of its structural position in the graph. A node with a high betweenness score has a crucial interconnecting function in the network: it allows nodes in one part of the network to reach nodes in a different part, through the connectivity provided by that very "between" node. In the context of a contact tracing dataset, and epidemiology specifically, this sounds very important - these nodes will be crucial to containing or spreading the virus outbreak.

Using Neuler, we are able to configure and run the algorithm very quickly, again:
The code that is being run looks like this:

:param limit => (20);
:param config => ({
    nodeProjection: 'Person',
    relationshipProjection: {
        relType: {
            type: 'MEETS',
            orientation: 'NATURAL',
            properties: {}
        }
    },
    writeProperty: 'betweenness'
});

CALL gds.alpha.betweenness.write($config);
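Note that exact betweenness is an expensive computation on larger graphs. That is not a problem on our small contact tracing dataset, but for bigger graphs the library also includes an approximate, sampled variant (RA-Brandes) in its alpha tier. A hedged sketch - do double-check the exact parameter names against the GDS documentation for your version:

// approximate betweenness (RA-Brandes), sampling a subset of start nodes
CALL gds.alpha.betweenness.sampled.write({
    nodeProjection: 'Person',
    relationshipProjection: 'MEETS',
    strategy: 'random',
    probability: 0.5,
    writeProperty: 'approxBetweenness'
});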

And running this query:

MATCH (node)
WHERE not(node[$config.writeProperty] is null)
RETURN node.name as name, node.pagerank as pagerank, node[$config.writeProperty] AS betweenness
ORDER BY betweenness DESC
LIMIT 10;

gives us the following betweenness results table: 

We can of course also get these results visually in a graph visualization:

MATCH (node)
WHERE not(node[$config.writeProperty] is null)
with node, node[$config.writeProperty] AS score
ORDER BY score DESC
LIMIT 10
MATCH (node)-[r]-(conn)
RETURN node, r, conn;


This clearly gets us another step closer to understanding the interesting parts of the network. 
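It can also be revealing to put the two centrality scores side by side: a node that scores high on betweenness but only modestly on PageRank is a classic "bridge" between different parts of the network. A quick sketch of such a comparison, using the two properties we just wrote:

// compare the two centrality scores we have calculated so far
match (p:Person)
where p.pagerank is not null and p.betweenness is not null
return p.name as name, p.pagerank as pagerank, p.betweenness as betweenness
order by betweenness desc, pagerank asc
limit 10;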

Let's move to the last part now: trying to identify the communities of nodes that belong together.

Graph Algo nr 3: Louvain community detection

Last but not least, we are going to look at an automated way of assessing which parts of the network form sub-networks: communities that belong together. There are many different ways of doing this, but a popular technique is the Louvain modularity technique. It is essentially an iterative process that keeps adjusting community assignments to optimize the modularity of the grouping - a measure of how much more densely connected the nodes within a community are than you would expect by chance. By doing that over and over again, the algorithm quickly finds groups of nodes that belong together.

There's one small thing we need to do to enable the library to run the algorithm: we need to convert the "meettime" property (a duration that is stored on the MEETS relationship) from a duration to an integer, since the algorithm can only use numeric properties as relationship weights. This is very quickly done with a simple query:

match ()-[r:MEETS]->()
set r.meettimeinseconds = r.meettime.seconds;
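A quick spot check (just a sanity check, nothing more) confirms that the conversion did what we expect:

// compare the original duration with the derived integer for a few relationships
match ()-[r:MEETS]->()
return r.meettime, r.meettimeinseconds
limit 5;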

And then we can again switch to Neuler to prepare the code:
This is what is being run:

:param limit => (50);
:param config => ({
    nodeProjection: 'Person',
    relationshipProjection: {
        relType: {
            type: 'MEETS',
            orientation: 'NATURAL',
            properties: {
                meettimeinseconds: {
                    property: 'meettimeinseconds',
                    defaultValue: 1
                }
            }
        }
    },
    relationshipWeightProperty: 'meettimeinseconds',
    includeIntermediateCommunities: false,
    seedProperty: '',
    writeProperty: 'louvain'
});

CALL gds.louvain.write($config);
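Just like with PageRank, you can also run Louvain in stream mode first, if you want to inspect the community sizes before writing anything to the graph. A sketch - note that stream mode does not accept the writeProperty key, so we repeat the projection inline:

// stream community ids and aggregate them into community sizes
CALL gds.louvain.stream({
    nodeProjection: 'Person',
    relationshipProjection: {
        relType: {
            type: 'MEETS',
            orientation: 'NATURAL',
            properties: 'meettimeinseconds'
        }
    },
    relationshipWeightProperty: 'meettimeinseconds'
})
YIELD nodeId, communityId
RETURN communityId, count(*) AS size
ORDER BY size DESC;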

And once we have done that we can check the communities (identified by a "louvain" property on every Person node) that have been found:

match (p:Person)
return p.louvain as community, count(p) as members
order by members desc;

41 communities were detected, and they varied in size quite significantly:

Using a couple of simple queries we can of course explore these communities very easily. Let's do that for community nr 489:

match (p1:Person {louvain: 489})-[v:VISITS]->(pl:Place), (p1)-[m:MEETS]->(p2:Person)
return p1, p2, pl, v, m;

Then we get a nice visual overview of that community's members, their meetings, and the places they visited.
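Your community numbers will of course be different, since the dataset is randomly generated. A slightly more generic sketch first looks up the largest community, and then pulls out its members and their meetings:

// find the id of the largest community, then pull out its members
match (p:Person)
with p.louvain as community, count(p) as size
order by size desc
limit 1
match (p1:Person {louvain: community})-[m:MEETS]->(p2:Person)
return p1, m, p2;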

More exploration can definitely be done here. But I am hoping that we have sufficiently illustrated the point of using the Graph Data Science Library, and its sophisticated algorithms, to create new structural insights on our dataset.

Hope this was interesting again - comments welcome as always. In the last post (part 4/4) we will be looking into some final points that are still dangling.

Cheers


