Monday 6 December 2021

Revisiting contact tracing with Neo4j 4.4's transaction batching capabilities

Yes! It's been a few months, but Saint Nicholas just brought us a brand new and shiny Neo4j 4.4 release to play with. One of the key features is a generic transaction batching capability, similar to what we have been using in apoc.periodic.iterate but now built right into the core of the database. It is referred to as the CALL { ... } IN TRANSACTIONS capability - and of course it is a really interesting feature.

So in this article I will be revisiting my earlier contact tracing blogpost, but without the need for APOC's apoc.periodic.iterate procedure. Let's see how that goes.
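
For readers who have not seen the two styles side by side, here is a minimal sketch of the difference. The :Demo label, the range of 100 and the batch size of 10 are illustrative assumptions of mine, and the APOC variant assumes the APOC plugin is installed. Run one statement or the other, not both.

```cypher
:auto UNWIND range(1,100) AS id
CALL {
  WITH id                    // import the row variable into the subquery
  CREATE (:Demo {id: id})    // :Demo is a hypothetical label, for illustration only
} IN TRANSACTIONS OF 10 ROWS;

// Roughly equivalent APOC style (requires the APOC plugin):
CALL apoc.periodic.iterate(
  "UNWIND range(1,100) AS id RETURN id",   // outer statement: produces the rows
  "CREATE (:Demo {id: id})",               // inner statement: runs once per row
  {batchSize: 10, parallel: false}         // commit after every 10 rows
);
```

If the OF ... ROWS part is left out, Neo4j 4.4 falls back to a default batch size of 1000 rows.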

Create a synthetic contact tracing graph - size of Antwerp

The first step of course is going to be similar to, if not exactly the same as, the work I did in 2020 on contact tracing. Take a look at the earlier post (http://blog.bruggen.com/2020/06/what-recommender-systems-and-contact.html) to see how that went. The key thing to recall there is that I was using the fantastic faker plugin. You can download it yourself from its GitHub page. Installation is super easy: you just need to make sure the config is updated too - and that you have whitelisted fkr.* just like you do with gds.* and apoc.*.

As with the previous post, I will be pushing the scale up to the size of my home city of Antwerp, Belgium. And critically, we will not use APOC at all - we will use the built-in transaction batching instead.

Create 500000 (Person) nodes

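Previously we did this in one transaction - which is probably at the limit of what I would normally do. But since we now have this transaction batching mechanism in Cypher, let's use it and split the work into 20 transactions of 25000 rows each. The :auto prefix tells Neo4j Browser to run the query as an auto-commit transaction, which CALL { ... } IN TRANSACTIONS requires: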

:auto UNWIND range(1,500000) AS id
CALL {
  WITH id
  CREATE (p:Person {id: id})
  SET p += fkr.person('1950-01-01','2021-12-01')
  SET p.healthstatus = fkr.stringElement("Sick,Healthy")
  SET p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
  SET p.birthDate = datetime(p.birthDate)
  SET p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
  SET p.name = p.fullName
  REMOVE p.fullName
} IN TRANSACTIONS OF 25000 ROWS;

This returns a little more slowly than a single-shot transaction would, but that is to be expected.
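
To check that all the batches actually committed, a quick count per health status can help. This is just a sanity check I would suggest, assuming the load above ran to completion on an otherwise empty database; in that case the counts should add up to 500000.

```cypher
// Count the generated Person nodes per health status
MATCH (p:Person)
RETURN p.healthstatus AS healthstatus, count(*) AS persons
ORDER BY persons DESC;
```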

Then, we will create the (Place) nodes.

Create 10000 (Place) nodes

Adding the places is instantaneous, even with two batches of 5000:

:auto UNWIND range(1,10000) AS id
CALL {
  WITH id
  CREATE (p:Place {id: id, name: "Place nr "+id})
  SET p.type = fkr.stringElement("Grocery shop,Theater,Restaurant,School,Hospital,Mall,Bar,Park")
  SET p.location = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
} IN TRANSACTIONS OF 5000 ROWS;
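
A quick way to verify the generated places is to look at their count and the bounding box of their locations. The sketch below assumes the load above completed on an otherwise empty database. Since rand() returns a value between 0 (inclusive) and 1 (exclusive), x should stay between 51.210197 and 51.220197, and y between 4.402771 and 4.412771.

```cypher
// Count the Place nodes and report the extent of their (Cartesian) locations
MATCH (pl:Place)
RETURN count(pl) AS places,
       min(pl.location.x) AS minX, max(pl.location.x) AS maxX,
       min(pl.location.y) AS minY, max(pl.location.y) AS maxY;
```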

Put in place some indexes on the NODES and future RELATIONSHIPS

We don't really need them for this demo - but they could be useful for other queries. Note that we are using the relationship-centric model here - as we showed in the last blogpost, it is at least as capable as, and much simpler than, the reified model that used (Visit) nodes.

So here we add the node indexes:

CREATE INDEX placenodeid FOR (p:Place) ON (p.id);
CREATE INDEX placenodelocation FOR (p:Place) ON (p.location);
CREATE INDEX placenodename FOR (p:Place) ON (p.name);
CREATE INDEX personnodeid FOR (p:Person) ON (p.id);
CREATE INDEX personnodename FOR (p:Person) ON (p.name);
CREATE INDEX personnodehealthstatus FOR (p:Person) ON (p.healthstatus);
CREATE INDEX personnodeconfirmedtime FOR (p:Person) ON (p.confirmedtime);

And we also add the index on the starttime property of the -[:VISITS]-> relationship:

CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
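
Indexes are populated in the background, so before kicking off a large load it can be worth waiting until they are all online. A minimal sketch, where the timeout of 300 seconds is an arbitrary choice of mine:

```cypher
// Block until all indexes are online (timeout in seconds), then list them
CALL db.awaitIndexes(300);
SHOW INDEXES;
```

The id indexes on Person and Place are the ones the MATCH lookups in the next step can make use of.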

Now we can add the 1.5M relationships - the real test of the new transaction batching functionality.

Add 1500000 random visits to places

It's pretty straightforward and similar to the previous examples, so let's just dive in:

:auto UNWIND range(1,1500000) AS iteration
CALL {
  WITH iteration
  MATCH (p:Person {id: toInteger(rand()*500000)+1}), (pl:Place {id: toInteger(rand()*10000)+1})
  CREATE (p)-[virel:VISITS]->(pl)
  SET virel.starttime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
  SET virel.endtime = virel.starttime + duration("PT"+toInteger(round(rand()*10))+"H"+toInteger(round(rand()*60))+"M")
  SET virel.visittime = duration.between(virel.starttime,virel.endtime)
  SET virel.visittimeinseconds = virel.visittime.seconds
} IN TRANSACTIONS OF 25000 ROWS;

The result was pretty quick: not even 75 seconds!
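
As before, a small check can confirm what the 60 batches left behind. I would expect the count to be 1500000, on the assumption that every MATCH found exactly one person and one place, and the start times to span roughly the last 100 days.

```cypher
// Count the VISITS relationships and report the range of their start times
MATCH ()-[v:VISITS]->()
RETURN count(v) AS visits,
       min(v.starttime) AS earliestStart,
       max(v.starttime) AS latestStart;
```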

Query on VISITS relationships

Just for completeness, I will revisit the main query that we explored in the previous blogpost here as well. This is what that query looks like:

MATCH (p:Person)-[v:VISITS]->(pl:Place)
WHERE v.starttime > datetime()-duration("P20DT17H")
AND v.starttime < datetime()-duration("P20DT10H")
RETURN p.name, sum(v.visittime) AS totalvisittime, sum(v.visittimeinseconds) AS totalvisittimeinseconds
ORDER BY totalvisittime DESC
LIMIT 10;
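
The two duration literals in that query define a window of about seven hours, from 20 days and 17 hours ago to 20 days and 10 hours ago. To make that window visible, a small sketch like this one can be run on its own (it does not touch the graph):

```cypher
// Show the boundaries and length of the time window used in the query above
WITH datetime() AS now
RETURN now - duration("P20DT17H") AS windowStart,
       now - duration("P20DT10H") AS windowEnd,
       duration.between(now - duration("P20DT17H"), now - duration("P20DT10H")) AS windowLength;
```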

Conclusion:

The new transaction batching functionality makes for a great addition to our toolbox - and clearly performs quite well. I am already looking forward to using it in other use cases!

Cheers

Rik Van Bruggen
