Monday, 21 June 2021

Revisiting Covid-19 contact tracing with Neo4j 4.3's relationship indexes

Last week was a great week at the "office" - one that I don't think I will easily forget. Not only did we host our NODES 2021 conference, but we also launched our new website, published a MASSIVE trillion-relationship graph, and announced a crazy $325M Series F investment round that will fuel our growth in years to come.

In all that good news, the release of Neo4j 4.3 kinda disappeared into the background - which is why I thought it would be fun to write a short blog post about one of the key features of this new release: relationship property indexes.

This is a really interesting feature for a number of reasons, but let me draw your attention to two main points:

  1. Relationship indexes will lead to performance improvements: all of a sudden, the Neo4j Cypher query planner is able to use a lot more information provided by these relationship indexes. The planner becomes smarter - and therefore queries become faster. We will explore this below.
  2. Relationship indexes will have interesting modelling implications: the introduction of these indexes could have far-reaching implications for how we model certain things. Here's what we mean by that:
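
Spelled out schematically, here are the two alternatives, using the labels and relationship types that we will use throughout this post:

// Model 1 - the "triangular" model: every visit is reified as an intermediate node
(:Person)-[:PERFORMS_VISIT]->(:Visit {starttime, endtime})-[:LOCATED_AT]->(:Place)

// Model 2 - the simpler model: the visit data lives on the relationship itself
(:Person)-[:VISITS {starttime, endtime}]->(:Place)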

You can see that both alternative models could be useful, but that the second model is simpler and potentially more elegant. Choosing between the two will depend on the use case - but in the past we would most often pick the first model for performance reasons, and we will see below that, with the addition of relationship indexes, this no longer needs to be the deciding factor. Let's investigate.

Create a synthetic contact tracing graph - size of Antwerp

First, let's create the graph that we are going to use to demonstrate the power of relationship indexes. We will use a graph similar to the one from the work I did last year on Covid-19 contact tracing - take a look at this old post to see how that went. We can use the "faker" plugin for the Neo4j database to generate the data: download the plugin from its github page, install it by copying the files into the plugins directory, and make sure the config is updated too - you have to allowlist fkr.* just like we do with gds.* and apoc.*, and that's it. The only difference with last year is that we will push the scale of the contact tracing graph up a bit, to approximately the size of my home city of Antwerp, Belgium.
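
As a sketch of what that config change looks like - assuming the Neo4j 4.3 setting names, where "allowlist" replaced the older dbms.security.procedures.whitelist - the relevant neo4j.conf lines would be something like:

# let the faker, gds and apoc procedures and functions run
dbms.security.procedures.unrestricted=gds.*,apoc.*,fkr.*
dbms.security.procedures.allowlist=gds.*,apoc.*,fkr.*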

Create 500000 persons

First we create the 500k (:Person) nodes. To simplify things, we do this in one transaction - which is easy enough if we just give our database some extra memory.
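
"Some extra memory" essentially means a larger heap, since the whole transaction needs to fit in it. Purely as an illustration - the sizes below are an example, not a recommendation - the neo4j.conf settings to tweak would be:

dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=4G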

foreach (i in range(1,500000) |
    create (p:Person { id : i })
    // fkr.person returns a map of fake person properties, including fullName and birthDate
    set p += fkr.person('1940-01-01','2021-06-01')
    set p.healthstatus = fkr.stringElement("Sick,Healthy")
    // a random timestamp, up to ~100 days and 10 hours in the past
    set p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
    set p.birthDate = datetime(p.birthDate)
    // a random location in the Antwerp area
    set p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
    set p.name = p.fullName
    remove p.fullName
);

About 10 seconds later, that part of the graph is already there.

Create 10000 places

In our graph, the Person nodes will be visiting specific (:Place) nodes. Adding the 10000 places to the database is instantaneous:

foreach (i in range(1,10000) |
    create (p:Place { id: i, name: "Place nr "+i})
    set p.type = fkr.stringElement("Grocery shop,Theater,Restaurant,School,Hospital,Mall,Bar,Park")
    set p.location = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
);


Put some indexes in place on the NODES

Then we will add some indexes - we don't really need them for this demo, but they could be useful for other queries. Note that the index syntax has become a bit more explicit and consistent, as you can see below:

CREATE INDEX placenodeid FOR (p:Place) ON (p.id);
CREATE INDEX placenodelocation FOR (p:Place) ON (p.location);
CREATE INDEX placenodename FOR (p:Place) ON (p.name);
CREATE INDEX personnodeid FOR (p:Person) ON (p.id);
CREATE INDEX personnodename FOR (p:Person) ON (p.name);
CREATE INDEX personnodehealthstatus FOR (p:Person) ON (p.healthstatus);
CREATE INDEX personnodeconfirmedtime FOR (p:Person) ON (p.confirmedtime);

These indexes get populated very quickly, and are ready for use.
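
To check that for ourselves, we can list the indexes and their population status - the SHOW INDEXES command has been part of Cypher since Neo4j 4.2:

SHOW INDEXES;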

Now we can proceed to add a few "place visits" for every person. 

Add 1500000 random visits to places

We will use apoc.periodic.iterate to make sure that the work gets committed in batches and doesn't lead to excessive memory use.

// 1.5M iterations, committed in batches of 25000: every iteration connects a random
// (:Person) to a random (:Place), both through an intermediate (:Visit) node (the old
// model) and through a property-carrying [:VISITS] relationship (the new model)
CALL apoc.periodic.iterate(
    'with range(1,1500000) as range
        unwind range as iteration return iteration', 
    'match (p:Person {id: toInteger(rand()*500000)+1}), (pl:Place {id:toInteger(rand()*10000)+1 })
        create (p)-[:PERFORMS_VISIT]->(v:Visit { id: iteration})-[:LOCATED_AT]->(pl)
        create (p)-[virel:VISITS]->(pl)
        set v.starttime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
        set virel.starttime = v.starttime
        set v.endtime = v.starttime + duration("PT"+toInteger(round(rand()*10))+"H"+toInteger(round(rand()*60))+"M")
        set virel.endtime = v.endtime
        set v.visittime = duration.between(v.starttime,v.endtime)
        set v.visittimeinseconds = v.visittime.seconds
        set virel.visittime = v.visittime
        set virel.visittimeinseconds = v.visittimeinseconds', 
    {batchSize:25000, parallel:false});

It takes about 1.5 minutes (89 seconds) to process this - which really is not bad at all!

Some people will be unconnected

As you may have seen in the above query, we use a rand() function to select the (:Person) and (:Place) nodes that we are connecting. This randomisation means that some people stay unconnected - which is not a problem: in real life that would also be the case, right? Some people just don't go out :) ... We can count them like this:

match (p:Person)
where not ((p)--())
return count(p);

Now we can start comparing the old model and the new model - and see what the results will be.

Querying for starttimes using OLD model / node indexes

Let's first see what the old (triangular) model from last year gives us. Here's the query that we will use, going through the (:Visit) nodes:
profile match (p:Person)-[:PERFORMS_VISIT]->(v:Visit)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;

When we run that on the database as is, we see that it is using the NodeByLabelScan operator: this generates lots of db hits, and requires 4.4 seconds to complete.

Index the visit nodes

Then we add the "older" Node index on the (:Visit) nodes:
CREATE INDEX visitnodestarttime FOR (v:Visit) ON (v.starttime);

which goes very quickly.
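
If we want to be sure that the index is fully populated and online before profiling the query again, we can wait for it explicitly with the built-in db.awaitIndexes procedure:

CALL db.awaitIndexes();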

When we re-run the query above, we see that the query planner is now using NodeIndexSeekByRange: this makes the performance fly from 4403ms down to 7ms.

Now, we will try to do the same with the new, simpler data model that leverages the new relationship indexes.

Querying for starttimes using NEW model and relationship indexes

So let's see how we can actually forget about the intermediate (:Visit) nodes, and just use the [:VISITS] relationships. Here's what the query looks like then:

profile match (p:Person)-[v:VISITS]->(pl:Place)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;

Let's run that query on the as-is database, without the relationship index. The query planner is then using a NodeByLabelScan, causing lots of db hits and 6 seconds of waiting.

Now, let's index the [:VISITS] relationships. The syntax is very similar to adding an index to the nodes - but now we do it on the relationship property:

CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
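
To double-check that the relationship index is there, we can filter the index listing - assuming the Neo4j 4.3 SHOW INDEXES output, where the entityType column distinguishes node indexes from relationship indexes:

SHOW INDEXES YIELD name, entityType, labelsOrTypes, properties
WHERE entityType = "RELATIONSHIP"
RETURN name, entityType, labelsOrTypes, properties;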

And if we now run the above query on the new model, with the relationship index in place, we see that the planner uses DirectedRelationshipIndexSeekByRange - which dramatically drops the number of db hits and cuts the wait time to less than 8 milliseconds.

This is great news, and a proof point that the new feature really does work great.

Conclusion: more modelling options, more performance

I must say that I am quite happy with this result. The new feature not only leads to very significant performance improvements, but also allows for a much simpler model. I love the fact that we now have even more flexibility, even more options to make the most of our graphs.

I hope this was a useful exercise. You can find all the queries on github, and there's also an .mdx file that you can add to your Neo4j Desktop as a Neo4j Browser guide.

All the best

Rik
