So in this article I will be revisiting this blogpost, but without the need for APOC's apoc.periodic.iterate feature. Let's see how that goes.
Create a synthetic contact tracing graph - size of Antwerp
The first step of course is going to be similar to, if not exactly the same as, the work I did in 2020 on contact tracing. Take a look at the original post (http://blog.bruggen.com/2020/06/what-recommender-systems-and-contact.html) to see how that went. The key thing to recall there is that I was using the fantastic faker plugin. You can download it yourself from the GitHub page. Installation is super easy: you just need to make sure the config is updated too, and that you have whitelisted fkr.* just like you do with gds.* and apoc.*.
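A quick way to confirm that the plugin was picked up is to list its functions. This is just a minimal sketch: it assumes a Neo4j version that supports the SHOW FUNCTIONS command, and that the faker functions are registered under the fkr. prefix, as they are used later in this post.

```cypher
// List the user-defined functions contributed by the faker plugin.
// An empty result suggests the jar was not loaded or the config was not updated.
SHOW FUNCTIONS
YIELD name, description
WHERE name STARTS WITH 'fkr.'
RETURN name, description
ORDER BY name;
```

If functions such as fkr.person and fkr.stringElement show up here, the load statements further down should be able to call them.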
As with the previous post, I will be pushing the scale up to the size of my home city of Antwerp, Belgium. And critically, we will not use APOC at all, but rely on Cypher's transaction batching instead.
Create 500000 (Person) nodes
Previously we did this in one transaction - which is probably at the limits of what I would normally do. But since we now have this transaction batching mechanism in Cypher, let's use it:
:auto UNWIND range(1,500000) as id
CALL {
WITH id
CREATE (p:Person {id: id})
SET p += fkr.person('1950-01-01','2021-12-01')
SET p.healthstatus = fkr.stringElement("Sick,Healthy")
SET p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
SET p.birthDate = datetime(p.birthDate)
SET p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
SET p.name = p.fullName
REMOVE p.fullName
} IN TRANSACTIONS OF 25000 ROWS;
This returns a little more slowly than a single-shot transaction would, but that is to be expected. Here's the result:
Then, we will create the (Place) nodes.
Create 10000 (Place) nodes
Adding the places is instantaneous, even with two batches of 5000:
:auto UNWIND range(1,10000) AS id
CALL {
WITH id
CREATE (p:Place { id: id, name: "Place nr "+id})
SET p.type = fkr.stringElement("Grocery shop,Theater,Restaurant,School,Hospital,Mall,Bar,Park")
SET p.location = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
} IN TRANSACTIONS OF 5000 ROWS;
The result looks like this:
Put in place some indexes on the NODES and future RELATIONSHIPS
We don't really need them for this demo, but they could be useful for other queries. Note that we are using the relationship-centric model here, as we proved in the last blogpost that it is at least as capable as, and much simpler than, the reified model that used (Visit) nodes.
So here we add the node indexes:
CREATE INDEX placenodeid FOR (p:Place) ON (p.id);
CREATE INDEX placenodelocation FOR (p:Place) ON (p.location);
CREATE INDEX placenodename FOR (p:Place) ON (p.name);
CREATE INDEX personnodeid FOR (p:Person) ON (p.id);
CREATE INDEX personnodename FOR (p:Person) ON (p.name);
CREATE INDEX personnodehealthstatus FOR (p:Person) ON (p.healthstatus);
CREATE INDEX personnodeconfirmedtime FOR (p:Person) ON (p.confirmedtime);
And we also add the index on the -[:VISITS]-> relationship property:
CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
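Index creation runs in the background, so before loading and querying it can be worth checking that everything is online. A minimal sketch, assuming a Neo4j version where SHOW INDEXES supports YIELD and WHERE, and filtering on the index name prefixes used in this post:

```cypher
// Check that the indexes created above exist and have finished populating.
SHOW INDEXES
YIELD name, entityType, labelsOrTypes, properties, state, populationPercent
WHERE name STARTS WITH 'place'
   OR name STARTS WITH 'person'
   OR name STARTS WITH 'visit'
RETURN name, entityType, labelsOrTypes, properties, state, populationPercent
ORDER BY name;
```

The entityType column distinguishes the node indexes from the relationship index on VISITS.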
Now we can add the 1.5M relationships: the real test of the new transaction batching functionality.
Add 1500000 random visits to places
It's pretty straightforward and similar to the previous examples, so let's just dive in:
:auto UNWIND range(1,1500000) AS iteration
CALL {
WITH iteration
MATCH (p:Person {id: toInteger(rand()*500000)+1}), (pl:Place {id: toInteger(rand()*10000)+1})
CREATE (p)-[virel:VISITS]->(pl)
SET virel.starttime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
SET virel.endtime = virel.starttime + duration("PT"+toInteger(round(rand()*10))+"H"+toInteger(round(rand()*60))+"M")
SET virel.visittime = duration.between(virel.starttime,virel.endtime)
SET virel.visittimeinseconds = virel.visittime.seconds
} IN TRANSACTIONS OF 25000 ROWS;
The result came back pretty quickly: in less than 75 seconds!
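Since the load picks person and place ids at random, a quick count is a cheap sanity check once it has finished. This is only a sketch, using the labels, relationship type and properties created above; the column aliases are my own.

```cypher
// Count what was loaded in the three steps, plus the range of visit durations.
MATCH (p:Person)
WITH count(p) AS persons
MATCH (pl:Place)
WITH persons, count(pl) AS places
MATCH ()-[v:VISITS]->()
RETURN persons,
       places,
       count(v) AS visits,
       min(v.visittimeinseconds) AS shortestvisitseconds,
       max(v.visittimeinseconds) AS longestvisitseconds;
```

If every batch committed, the three counts should line up with the ranges used in the UNWIND statements.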
Query on VISITS relationships
Just for completeness, I will revisit the main query that we explored in the previous blogpost here as well. This is what that query looks like:
match (p:Person)-[v:VISITS]->(pl:Place)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;
This then becomes:
Conclusion:
The new transaction batching functionality makes for a great addition to our toolbox, and clearly performs quite well. I am already looking forward to using it in other use cases!
Cheers
Rik Van Bruggen