Saturday, 29 March 2014

Using Neo4j to Manage and Calculate Hierarchies

I recently had a number of extremely interesting conversations with a company that wants to use Neo4j for managing hierarchical data. This use case is so common, that I decided to write up my findings and experiments and share over here. Hierarchical data is extremely common in many different situations - let's just mention a few:
Conceptually and mathematically, hierarchies are nothing more than a particular type of graph. Therefore, I would automatically think that finding things in or calculating things over a hierarchy would be a great application for a graph database like neo4j - and that's what this post is about.

Generating a Product Hierarchy

In order to start looking at this, I needed an appropriately sized hierarchy to work with. My normal way of working would be to do something in a spreadsheet and then generate the graph from there - but My dear friend Michael recently wrote a blogpost that inspired me to skip that step. In fact, I can quickly generate a hierarchy myself using Cypher only. Let's talk a look at the following statements:

//Let's create the top of the tree, the PRODUCT
create (n1:PRODUCT {id:1});

//Let's create 100 children of the PRODUCT, COST_GROUPS connected to the top of the tree
match (n1:PRODUCT {id:1})
with range(1,100) as RANGE, n1
foreach (r in RANGE | create (n2:COST_GROUP {id:r})-[:PART_OF {quantity:round(rand()*100)}]->(n1) );

The first query creates the top of the hierarchy, eg. a product. Underneath that we could then create the first level (100 nodes in this example) by iterating through the range and assigning a random identifier to the node and a "quantity" to the "PART_OF" relationship. Semantically, this would mean that the node at Level 1 would be composed of 100 nodes at level 2, using a given quantity of each of the nodes - mentioned on the relationship. In fact, we are building a hierarchy like this one:

So in fact we need to generate another 4 levels underneath the two existing ones, using the same technique as above:

//for every COST_GROUP, let's connect 100 children at a 3rd level, the COST_TYPEs
match (n2:COST_GROUP)
with range(1,100) as RANGE, n2
foreach (r in RANGE | create (n3:COST_TYPE {id:r})-[:PART_OF {quantity:round(rand()*100)}]->(n2) );

//for every COST_TYPE, connect 10 COST_SUBTYPEs 
match (n3:COST_TYPE)
with range(1,10) as RANGE, n3
foreach (r in RANGE | create (n4:COST_SUBTYPE {id:r})-[:PART_OF {quantity:round(rand()*100)}]->(n3) );

//for every COST_SUBTYPE, connect 5 different COSTs
match (n4:COST_SUBTYPE)
with range(1,5) as RANGE, n4
foreach (r in RANGE | create (n5:COST {id:r})-[:PART_OF {quantity:round(rand()*100)}]->(n4) );

So now we already have a graph of 1*100*100*10*5 =500000 nodes. To add the sixth, lowest level of this hierarchy, I actually needed to run three separate queries, as my machine would run out of memory otherwise - the transaction would just be too big:

//for every COST, connect 3 COST_COMPONENTs
match (n5:COST)
create (n6:COST_COMPONENT {id:1,price:round(rand()*1000)})-[:PART_OF {quantity:round(rand()*100)}]->(n5);
match (n5:COST)
create (n6:COST_COMPONENT {id:2,price:round(rand()*1000)})-[:PART_OF {quantity:round(rand()*100)}]->(n5);
match (n5:COST)
create (n6:COST_COMPONENT {id:3,price:round(rand()*1000)})-[:PART_OF {quantity:round(rand()*100)}]->(n5);

As you can see, the sixth level of the hierarchy is a bit different than the ones above, as it includes actual pricing information for the lowest component of the product that we are building. Creating this entire structure on my machine (a 2012 MBP with 8Gbyte of RAM) took about 5minutes, which is not bad in my opinion - it's not a tiny little graph: 2110101 nodes in total.

We can take a look at the browser and see the hierarchy:

So that was already interesting :) ... but now let's see how we can work with these hierarchies, how we can do some interesting queries.

Querying and recalculating the hierarchy - a "graph global" approach

The first scenario that I wanted to explore is to see how fast neo4j would be able to (re)calculate the price of this "complex" (in the sense that it contains millions of parts) product if I would change the price (or the quantity for that matter) of one of the underlying components.  In order to do that, my initial approach was to walk the graph from top to bottom, evaluating the "quantity" property on every relationship, and multiplying that with the price of the COST_COMPONENT at the bottom. Summing all of that would then yield the total price of the product:

//calculating price based on full sweep of the tree
match (n1:PRODUCT {id:1})<-[r1]-(:COST_GROUP)<-[r2]-(:COST_TYPE)<-[r3]-(:COST_SUBTYPE)<-[r4]-(:COST)<-[r5]-(n6:COST_COMPONENT)
return sum(r1.quantity*r2.quantity*r3.quantity*r4.quantity*r5.quantity*n6.price);

One massive advantage of this approach is its simplicity. Cypher is just so great at these types of recursive operations and simplifying the notation for it in a 2 line query. The result comes back in a very reasonable time (43 seconds):
Note that I at first had forgotten to add directions to the relationships, and that this had a massive impact on performance (factor 5 approximately).

What this little test shows, in my humble opinion, and without having done any kind of optimisation of my machine/setup, neo4j is immediately very good at this - it essentially grabs the entire tree, hops along all the 2.1M nodes and relationships to grab the price and the quantities.

Doing some intermediate price calculations

One way of optimizing this kind of operation, of course, is to do some intermediate calculations at every level of the hierarchy. Essentially adding the price at every node as the sum of the prices of the underlying nodes, multiplied by the quantities on the connecting relationships. Let's set that up with the following queries:

//calculate intermediate pricing at the COST level
match (n5:COST)<-[r5]-(n6:COST_COMPONENT)
with n5, sum(r5.quantity*n6.price) as Sum
set n5.price=Sum;

//calculate intermediate pricing at the COST-SUBTYPE level
match (n4:COST_SUBTYPE)<-[r4]-(n5:COST)
with n4,sum(r4.quantity*n5.price) as Sum
set n4.price=Sum;

//calculate intermediate pricing at the COST-TYPE level
match (n3:COST_TYPE)<-[r3]-(n4:COST_SUBTYPE)
with n3,sum(r3.quantity*n4.price) as Sum
set n3.price=Sum;

//calculate intermediate pricing at the COST-GROUP level
match (n2:COST_GROUP)<-[r2]-(n3:COST_TYPE)
with n2,sum(r2.quantity*n3.price) as Sum
set n2.price=Sum;

//calculate intermediate pricing at the PRODUCT level
match (n1:PRODUCT)<-[r1]-(n2:COST_GROUP)
with n1, sum(r1.quantity*n2.price) as Sum
set n1.price=Sum;

Running these queries is really fast:
The advantage of having these intermediate pricing is of course that calculating the price of the product can now be done at every one of the levels. Let's look at two levels: if we use the intermediate pricing at COST-GROUP level, we run this query

//calculate pricing using intermediate pricing levels at the COST-GROUP level
match (n1:PRODUCT {id:1})<-[r1]-(n2:COST_GROUP)
return sum(r1.quantity*n2.price);

and the result is there instantaneously:
Running the same query at the COST-TYPE level takes a bit longer

//calculate pricing using intermediate pricing levels at the COST-TYPE level
match (n1:PRODUCT {id:1})<-[r1]-(n2:COST_GROUP)<-[r2]-(n3:COST_TYPE)
return sum(r1.quantity*r2.quantity*n3.price);

but gives the same result, fortunately:
Interesting. But what would happen if we would change one or more prices of COST_COMPONENTS? How would that affect the hierarchy/graph? Would it perform?

Doing simulations on the hierarchy - recalculating price

If we change something in the price of one of the nodes at the lowest level of the hierarchy, then that would obviously affect the price of the product at the top of the hierarchy. So how can we recalculate price? The "brute force" way is of course to re-run the sweeping query above, and run through the entire hierarchy again. But now that we have the intermediate prices, we can do this much more intelligently.

Essentially what we will do is, when we do the update of the price of the COST_COMPONENT, we will at the same time also update all of the intermediate prices above it. And, consequently, also recalculate the price of the top-level of the hierarchy, the PRODUCT level.

Here's the query that I wrote:

//Changing the price of ONE COST_COMPONENT
with n6, n6.price as OLDPRICE limit 1
set n6.price = n6.price*10
with n6.price-OLDPRICE as PRICEDIFF,n6
match n6-[r5:PART_OF]->(n5:COST)-[r4:PART_OF]->(n4:COST_SUBTYPE)-[r3:PART_OF]->(n3:COST_TYPE)-[r2:PART_OF]->(n2:COST_GROUP)-[r1:PART_OF]-(n1:PRODUCT)
set n5.price=n5.price+(PRICEDIFF*r5.quantity),
return PRICEDIFF, n1.price;

It basically does a couple of things:

  • it takes 1 COST_COMPONENT, and extracts the price (OLDPRICE)
  • it changes the price of that COST_COMPONENT to *10 as expensive
  • it stores the price difference between old and new price in a PRICEDIFF variable
  • it then runs through all the levels ABOVE the COST_COMPONENT involved, and changes the intermediate pricing of those nodes. By doing so, we get the new price at the top of the hierarchy.

the result of that changes is also there instantaneously:
We can do the same thing for a hundred COST_COMPONENTS at a time, and still get instantaneous responses. In the following query we increase the pricing of 100 COST_COMPONENTS with 5%:

//Changing the price of ONE HUNDRED COST_COMPONENTS
with n6, n6.price as OLDPRICE limit 100
set n6.price = round(n6.price*1.05)
with n6.price-OLDPRICE as PRICEDIFF,n6
match n6-[r5:PART_OF]->(n5:COST)-[r4:PART_OF]->(n4:COST_SUBTYPE)-[r3:PART_OF]->(n3:COST_TYPE)-[r2:PART_OF]->(n2:COST_GROUP)-[r1:PART_OF]-(n1:PRODUCT)
set n5.price=n5.price+(PRICEDIFF*r5.quantity),
return PRICEDIFF, n1.price;

And the result is there immediately (160ms):

According to the intermediate pricing, the pricing should now be as follows:
If we do another sweeping price calculation then we see that this approach is indeed correct.


All of the queries that I wrote for this blogpost are available in the gist over here. If you would run them in your neo4j instance you would be able to create a hierarchy yourself and take a look at the results. Just beware that the hierarchy that I created is quite big, and that some of the transactions involve hundreds of thousands of updates to neo4j in one transaction - and your machine has to be able to do that (mine bombed out on a number of occasions).

Aside from that, I took a away many different things from this weekend exploration of hierarchies in neo4j.

  • First of all, without any manipulation of intermediate calculations, neo4j performs fantastically for these types of hierarchical operations. Don't forget, we were looking at a truly large "Product" definition (2.1 millions nodes and relationships for 1 product), and we were recalculating the pricing in under a minute. 
  • Secondly, I found that if you would apply the traditional technique of "precalculating" parts of the hierarchy in neo4j (which is what we would always do in relational systems to maintain some kind of performance), the model would become a bit more complicated - but the speed would become AMAZING. 

So there you go - another fantastic use case for neo4j. Anything to do with hierarchies really should be modeled as a graph :) ...

Hope this is useful.



Sunday, 23 March 2014

Media, Politics and Graphs

My dear friend and neo4j community member Ron recently pointed me to an amazing piece of work. Thomas Boeschoten, of the Utrecht Data School among many other things, published some amazing work of analysing the Dutch Talk Shows from different perspectives, using Gephi as one of his tools.  Some of his results are nothing short of fascinating, and very cool to look at.
Now I think the world needs more people like Thomas: media have long abandoned political neutrality, and as citizens, we owe it to ourselves to understand the politics of the things that we see, read and hear in the media. Without it, democracy will be short lived and (to paraphrase de Maistre) we "will get the government we deserve". We need to understand this interplay between media and politics - and graphs can help there.

I will not try to help you understand the depths of Thomas' research (just visit his site - lots of cool stuff there, but mostly in Dutch), I would just like to take this dataset - which he kindly shared - for a spin using neo4j

Importing the dataset

As Thomas visibly already is a graphista through and through, he shared his datasets with me as a Gephi file. So that made it really easy to get the initial stuff into neo4j: all I needed to do was use the recently updated neo4j-Gephi plugin and generate the data store files. Copy those over to my neo4j server's data directory, call it graph.db and boom - we're done!

However, when I fired up the server, I soon found out that I would have to do some work :) ... the graph that Thomas created did not really have a "database-like" model (it did not do any normalisation of the model, for instance) - and the neo4j browser looked a bit boring:

I needed to add some structure to this all, in order to be able to query it meaningfully.

Adding a model

After browsing around through the data, I decided that the model that I would be playing with would look something like this:

You can see that it is not a very big graph:

but it is quite densely connected - it has a lot of relationships between the nodes:

So now I can do some more interesting queries on the data, and see if - like in Thomas' research - I kind find out some interesting stuff about this dataset.

Take it for a spin: CYPHER queries!

Let's start with some simple queries. Let's figure out how many people have visited the different shows:

match (g:GUEST)-[v:VISITED]->(sh:SHOW)
return as Show, count(v) as NrOfVisits
order by NrOfVisits desc;

And we immediately get a feel for the dominant talkshows:

But then let's see how many of these talkshow guests are politicians (or have political affiliations at least). Let's expand the query a bit:

match (g:GUEST)-[v:VISITED]->(sh:SHOW),
return as Show, count(v) as NrOfVisits
order by NrOfVisits desc;

And see if there is any difference in the way the shows are ranked:

Interesting. There are indeed some differences, as you can see.

Now let's look at another perspective in our dataset: Gender. Let's look at the distribution of male/female guests to all of these shows:

match (g:GUEST)-[:HAS_GENDER]->(gen:GENDER),
return, count(v)
order by ASC;

we can clearly still see the dominance of men in these shows:
If we then add the political dimension again, and look at gender distribution for the political visitors to the shows:

match (g:GUEST)-[:HAS_GENDER]->(gen:GENDER),
return, count(v)
order by ASC;

then we can see that the distribution is broadly the same:

I am sure there are plenty of other queries to think of, but let me do one more in this post: let's see what the overlap is - in terms of guests visiting them - between the different shows. To do that, all we need to do is calculate some paths between two shows: DWDD and P&W.

match p = AllShortestPaths((s1:SHOW {id:"DWDD"})-[*..2]-(s2:SHOW {id:"P&W"}))
return p
limit 100;

The result is exactly what you would expect: a HUGE amount of overlap - at least between these two (see above: largest) shows. Hence the "limit 100" in the query - so that my poor neo4j browser would survive:


That's about all I have at this point. You can download the database from over here. And the queries that I used above are all on github.

From my perspective, I think these kinds of datasets are extremely interesting and powerful. I would love to see more work like Thomas', from my own country or abroad, and look at this from an even broader perspective. In any case, I would like to thank and compliment Thomas on his work - and look forward to your feedback.

Hope this was useful.



Sunday, 2 March 2014

Food networks, Countries, Diets, Health - and LOAD CSV

Last weekend I took my kids to the awesome Antwerp Zoo. We have season's tickets, and go there regularly - but for some reason it had been a while since all of us had gone together. While visiting the penguins, my daughter points to this picture

and shouts: "LOOK DADDY, A NEO4J DATABASE!". I am not kidding you - true story. And she was, of course, right: it was the "web of the sea", a predator-prey network of how sealife interacts with eachother.
So that got me browsing the web for a while, looking for other examples of such networks. And before long I found a dataset that really triggered my interest: on "Follow the data" I found this article that mentioned a google spreadsheet with some really interesting stuff. It basically has a lot of information about Countries, their dietary habits, and their health statistics. Excellent. I can make a graph out of that. 

Neo4j 2.1.MO1 – native loading of CSV files in cypher

At the same time one of my colleagues pinged me about a new milestone beta release of neo4j: version 2.1.MO1. This is the first milestone release after the ground-breaking 2.0 release that came out end of last year – and it is looking like a very interesting one. One of the key new features in 2.1 is going to be a set of features that will allow us to Import Data more easily. A pet pieve of mine, as you know.
I read through the manual pages, and thought it would be easy enough to use. So I spend some time getting the spreadsheet mentioned above into the right format for import, and took it for a spin.
In the zip-file over here, you can download a couple of files that allow you to do it all yourself. But for now, let me take you through it.

Importing data with LOAD CSV

The process of importing data was really, really easy now. All you need to do is       tell Neo4j what to do (load the csv),       assign a variable to the set (csvimport in this case) and then use the column names of the set as parameters for your cypher statement. The result was there instantaneously:
One thing that I did want to do then, was to use labels to provide structure to the graph, and use it for indexing:
With that Import I have my Countries and my Food Categories imported, so now I would want to add some relationships. I chose a model like the one outlined below: a country uses different food categories, at a different rate (kcalories used per day).

So first we import the relationships between the countries and the food categories used in that country:
As you can see, the relationships hold the values of the kilo-calories that that country uses of this specific food category.

That was quick!

So now we can do some querying. Let’s see what are the food categories that Belgium and the Netherlands have in common, and that have a significant part of the diet:

When we limit the query to only the food categories that are used for more than 500 kcal per day, we get:
These are the categories that apply:

  • Animal Products
  • Vegetal Products
  •  Cereals - Excluding Beer (strange!)
  •  Wheat

Then, I decided to use LOAD CSV one last time to add some health data that was also in the original dataset: the life expectancy data of the countries in the dataset. This data contains two interesting data elements that I imported:
  • The Life Expectancy At Birth (LEAB)
  • The HEalthy Life Expectancy At Birth (HELEAB)
I decided to import both of these, in a specific way. You may have been able to tell from the model picture above, but I created an in-graph Life Expectancy Index. By importing 100 Life Expectancies (1-100 years of age) as separate nodes, and then connecting the countries to these nodes as I used LOAD CSV. I used two different types of relationships for the LEAB and the HELEAB.

The following import was easy using LOAD CSV:

So then we could actually revisit the queries above, but include these interesting health stats about life expectancies:

The result shows how Belgium and the Netherlands have identical LEABs, but different HELEABs – interesting.

I am sure there are a bunch of other interesting queries in this dataset, but for now I think I have satisfied my curiosity – and learned about an awesome new Import tool – LOAD CSV. 

Hope this was useful.