Wednesday 23 September 2020

Exponential growth in Neo4j

With the current surges of the Covid-19 pandemic globally, there is a huge amount of debate raging in our societies - everywhere. It's almost as if the left-right duality that has been dividing many political spectra in the past few years is now translating into a new duality: more freedom for the individual (and, potentially, a higher spread of the SARS-CoV-2 virus) versus more restrictions for the individual. It's such a difficult debate - with no clear definitive outcome that I know of. There are just too many uncertainties and variations in the pandemic - I personally don't see how you can easily make generic statements about it.

One thing I do know, though, is that very smart and lovable people, in my own social and professional circle and beyond, seem to be confused by some of the data. Very often, they make seemingly rational arguments about the numbers they are seeing - while ignoring the fact that we are looking at an Exponential Growth problem. In this post, I want to talk about that a little bit, and illustrate it with an example from the Neo4j world.

What is Exponential Growth exactly?

Let’s take a look at the definition from good old Wikipedia:
Exponential growth is a specific way that a quantity may increase over time. It occurs when the instantaneous rate of change (that is, the derivative) of a quantity with respect to time is proportional to the quantity itself. Described as a function, a quantity undergoing exponential growth is an exponential function of time, that is, the variable representing time is the exponent (in contrast to other types of growth, such as quadratic growth).
The basic functions involved here are mathematically very simple:
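The chart itself did not survive here, but the three curves are easy to reproduce. A minimal sketch - the exact functions y = x, y = x^3 and y = 2^x are my assumption for the red, blue and green lines:

```python
# Compare linear, cubic and exponential growth over the same "time" range.

def linear(x):       # red line: y = x
    return x

def cubic(x):        # blue line: y = x**3
    return x ** 3

def exponential(x):  # green line: y = 2**x
    return 2 ** x

for x in range(11):
    print(f"x={x:2d}  linear={linear(x):4d}  cubic={cubic(x):5d}  exponential={exponential(x):5d}")
```

Note how the exponential curve trails the cubic one for most of the range (at x=5 it is 32 versus 125), and then blows past everything.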

As you can see, the blue (i.e. what we call "cubic" growth) and green (i.e. what we call "exponential" growth) lines rise more slowly than the red (i.e. "linear" growth) line in the early part of their evolution - but they very quickly catch up and rise far beyond it. This is where the hard part comes in: when you are in the early part of the "green line", it is very difficult, and I mean VERY DIFFICULT, to imagine what the evolution of the data is going to be like down the line.

We, as humans, just can't fathom the fact that we are looking at a flat line from 0 to 5, and that by 8, 9 or 10 the evolution is literally skyrocketing. We. Just. Can't. Get. It. Into. Our. Heads. Really.

Making Exponential Growth Tangible

Over the course of history, there have been some amazing illustrations of this phenomenon. Here are a few that you should read - just for fun:
Now let's apply this to Covid-19 - because here's the thing: this Covid-19 pandemic is an exponential problem, whether we like it or not. The reason is very simple: an epidemic like this spreads in such a way that one infected patient can, and most likely will, infect multiple other people. Different diseases - some transmitted by viruses, others differently - have different rates of infectiousness. Take a look at the Wikipedia page about the basic reproduction number of different diseases, and you will see the problem. Covid-19, and the virus that causes it, SARS-CoV-2, is a moderately infectious disease - for which we currently do not have a decent cure or treatment.

I have also created a little spreadsheet for you to take a quick look at. You can find it over here (just copy it if you want to play around with it yourself). The whole point of the lockdowns, contact tracing initiatives and other measures has been to "flatten the curve", or potentially even "crush the curve" - by lowering the R-number.
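The effect of the R-number is easy to see in a few lines of code. Here's a rough sketch of the same generational model the spreadsheet uses - the function name and the simplifying assumption (every active case infects exactly R new people per generation) are mine:

```python
# Model how the R-number drives case counts, generation by generation.

def new_cases_per_generation(r, generations, initial=1):
    """Return the number of NEW cases in each generation for a given R."""
    cases = [float(initial)]
    for _ in range(generations):
        cases.append(cases[-1] * r)
    return cases

# R above 1: exponential growth. R below 1: the outbreak dies out.
print(new_cases_per_generation(2.0, 10))   # doubling every generation
print(new_cases_per_generation(0.9, 10))   # "crushing the curve"
```

Ten generations at R=2 already means over a thousand new cases per generation; at R=0.9 the epidemic fizzles out. That is the entire rationale behind lowering the R-number.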

So what if we could illustrate this problem, and make it tangible in a Neo4j database? Let’s see what we can do here!

Exponential problems in a GraphDB!

Let's start with a very simple database. I am going to use the Neo4j Faker plugin library, like I did before with the Contact tracing testbed. It's really easy to install, and once it is, I can just run this simple query to create 100,000 Person nodes:

Create all the Person nodes

call apoc.periodic.iterate(
 "with range(1,100000) as range unwind range as item return item",
 "create (p:Person { uid : item }) set p += fkr.person('1960-01-01','2000-01-01')
 set p.birthDate = datetime(p.birthDate)
 set p.healthstatus = 'Healthy'
 set p.location = point({x: toFloat(50.25+rand()), y: toFloat(2.5+3.5*rand())})",
 {batchSize:25000, parallel:false});

Once this is done we want to add a few indexes for efficient querying:
create index on :Person(uid);
create index on :Person(healthstatus);
create index on :Person(location);

and then either wait a few seconds or just simply run this query to make sure that the indexes are up to date:
call db.awaitIndexes();

Now, in order to show you what would happen to this dataset if an exponential problem were thrown at it, we need a starting point. So let's make one Person node sick, with this simple query:

match (p:Person {uid: toInteger(round(rand()*100000))})
set p.healthstatus = "Sick"
set p.infectiontime = datetime();

Now, we are ready to see how the infections spread exponentially - and how quickly our "database gets infected". Let’s see.

Infecting the database - exponentially

To do this, I wanted to write a simple query that I could run over and over again, and that would continuously allow me to monitor the spread of the disease in the database. So here's what I had in mind: for every sick person that has not yet infected anyone, my query would find 3 healthy "unlucky ones" that will get infected. I played around with many different versions of this query, broke my head on it multiple times, but ended up with a very simple query that uses a great new Neo4j 4.x feature: the CALL {} subquery. Read more about it on the developer pages as well. Here's what it does:

CALL allows to execute subqueries, i.e. queries inside of other queries. A subquery is evaluated for each incoming input row and may produce an arbitrary number of output rows. Every output row is then combined with the input row to build the result of the subquery. That means that a subquery will influence the number of rows.

So here's how this works for our situation: in the first match clause we search for all the Sick people that have not infected anyone (yet), and then for each and every one of these Sick people we call a subquery that matches 3 Unlucky Ones that are going to get infected. The "infecting" is then done by the merge statement at the end.

match (spreaders:Person {healthstatus:"Sick"})
where not ((spreaders)-[:INFECTS]->())
call { match (unluckyones:Person {healthstatus:"Healthy"}) return unluckyones limit 3 }
merge (spreaders)-[:INFECTS {time: datetime()}]->(unluckyones)
 set unluckyones.healthstatus = "Sick"
 set unluckyones.infectiontime = datetime();

This works like a charm. The only downside is that, after some time and on bigger datasets, the operation would become too big to execute in one transaction, and we would run out of memory. To fix that memory limitation, we wrap the query in a periodic iteration instruction using the apoc plugin. The following works for any dataset size:

CALL apoc.periodic.iterate("
 match (spreaders:Person {healthstatus:'Sick'})
 where not ((spreaders)-[:INFECTS]->())
 return spreaders", "
 call { match (unluckyones:Person {healthstatus:'Healthy'}) return unluckyones limit 3 }
 merge (spreaders)-[:INFECTS {time: datetime()}]->(unluckyones)
 set unluckyones.healthstatus = 'Sick'
 set unluckyones.infectiontime = datetime()",
 {batchSize:2500, parallel:false});

If you try this out, you will see how quickly the exponential growth that we were trying to illustrate takes hold. Just run this query after every iteration:
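Before you even run it, you can predict the counts with a back-of-the-envelope model. This is an idealized sketch of what the query above intends (every not-yet-spreading Sick person infects exactly 3 Healthy people per iteration - the function name and the hard cap at the population size are mine):

```python
# Idealized model of the infection query: each iteration, every Sick person
# who has not yet infected anyone infects exactly 3 Healthy people.

def sick_after_iterations(iterations, population=100_000):
    total, new_spreaders = 1, 1   # patient zero
    counts = [total]
    for _ in range(iterations):
        # Each new spreader infects 3 people, until the population runs out.
        infections = min(new_spreaders * 3, population - total)
        total += infections
        new_spreaders = infections
        counts.append(total)
    return counts

print(sick_after_iterations(11))
# the whole 100,000-person database is "infected" after about 11 iterations
```

The counts go 1, 4, 13, 40, 121, ... - each total is roughly triple the previous one, which is exactly the "green line" from the beginning of this post.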

match (n:Person {healthstatus:"Sick"}) return count(n);

Hope this is a good illustration for everyone!

All the best!

