Sunday 2 March 2014

Food networks, Countries, Diets, Health - and LOAD CSV

Last weekend I took my kids to the awesome Antwerp Zoo. We have season's tickets, and go there regularly - but for some reason it had been a while since all of us had gone together. While visiting the penguins, my daughter points to this picture

and shouts: "LOOK DADDY, A NEO4J DATABASE!". I am not kidding you - true story. And she was, of course, right: it was the "web of the sea", a predator-prey network of how sealife interacts with eachother.
So that got me browsing the web for a while, looking for other examples of such networks. And before long I found a dataset that really triggered my interest: on "Follow the data" I found this article that mentioned a google spreadsheet with some really interesting stuff. It basically has a lot of information about Countries, their dietary habits, and their health statistics. Excellent. I can make a graph out of that. 

Neo4j 2.1.MO1 – native loading of CSV files in cypher

At the same time one of my colleagues pinged me about a new milestone beta release of neo4j: version 2.1.MO1. This is the first milestone release after the ground-breaking 2.0 release that came out end of last year – and it is looking like a very interesting one. One of the key new features in 2.1 is going to be a set of features that will allow us to Import Data more easily. A pet pieve of mine, as you know.
I read through the manual pages, and thought it would be easy enough to use. So I spend some time getting the spreadsheet mentioned above into the right format for import, and took it for a spin.
In the zip-file over here, you can download a couple of files that allow you to do it all yourself. But for now, let me take you through it.

Importing data with LOAD CSV

The process of importing data was really, really easy now. All you need to do is       tell Neo4j what to do (load the csv),       assign a variable to the set (csvimport in this case) and then use the column names of the set as parameters for your cypher statement. The result was there instantaneously:
One thing that I did want to do then, was to use labels to provide structure to the graph, and use it for indexing:
With that Import I have my Countries and my Food Categories imported, so now I would want to add some relationships. I chose a model like the one outlined below: a country uses different food categories, at a different rate (kcalories used per day).

So first we import the relationships between the countries and the food categories used in that country:
As you can see, the relationships hold the values of the kilo-calories that that country uses of this specific food category.

That was quick!

So now we can do some querying. Let’s see what are the food categories that Belgium and the Netherlands have in common, and that have a significant part of the diet:

When we limit the query to only the food categories that are used for more than 500 kcal per day, we get:
These are the categories that apply:

  • Animal Products
  • Vegetal Products
  •  Cereals - Excluding Beer (strange!)
  •  Wheat

Then, I decided to use LOAD CSV one last time to add some health data that was also in the original dataset: the life expectancy data of the countries in the dataset. This data contains two interesting data elements that I imported:
  • The Life Expectancy At Birth (LEAB)
  • The HEalthy Life Expectancy At Birth (HELEAB)
I decided to import both of these, in a specific way. You may have been able to tell from the model picture above, but I created an in-graph Life Expectancy Index. By importing 100 Life Expectancies (1-100 years of age) as separate nodes, and then connecting the countries to these nodes as I used LOAD CSV. I used two different types of relationships for the LEAB and the HELEAB.

The following import was easy using LOAD CSV:

So then we could actually revisit the queries above, but include these interesting health stats about life expectancies:

The result shows how Belgium and the Netherlands have identical LEABs, but different HELEABs – interesting.

I am sure there are a bunch of other interesting queries in this dataset, but for now I think I have satisfied my curiosity – and learned about an awesome new Import tool – LOAD CSV. 

Hope this was useful.