Showing posts with label networks. Show all posts

Sunday, 2 March 2014

Food networks, Countries, Diets, Health - and LOAD CSV

Last weekend I took my kids to the awesome Antwerp Zoo. We have season tickets and go there regularly, but for some reason it had been a while since all of us had gone together. While we were visiting the penguins, my daughter pointed to this picture

and shouted: "LOOK DADDY, A NEO4J DATABASE!". I am not kidding you - true story. And she was, of course, right: it was the "web of the sea", a predator-prey network showing how sea life interacts.
So that got me browsing the web for a while, looking for other examples of such networks. Before long I found a dataset that really triggered my interest: on "Follow the data" I found this article, which mentioned a Google spreadsheet with some really interesting stuff. It basically has a lot of information about countries, their dietary habits, and their health statistics. Excellent. I can make a graph out of that.

Neo4j 2.1.M01 - native loading of CSV files in Cypher

At the same time one of my colleagues pinged me about a new milestone release of Neo4j: version 2.1.M01. This is the first milestone release after the ground-breaking 2.0 release that came out at the end of last year - and it is looking like a very interesting one. One of the key additions in 2.1 is going to be a set of features that will allow us to import data more easily. A pet peeve of mine, as you know.
I read through the manual pages, and thought it would be easy enough to use. So I spent some time getting the spreadsheet mentioned above into the right format for import, and took it for a spin.
In the zip-file over here, you can download a couple of files that allow you to do it all yourself. But for now, let me take you through it.

Importing data with LOAD CSV

The process of importing data was really, really easy now. All you need to do is tell Neo4j what to do (load the CSV), assign a variable to the set (csvimport in this case), and then use the column names of the set as parameters for your Cypher statement. The result was there instantaneously:
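As a rough illustration of those three parts, a read-only statement could look like the sketch below. Only the variable name csvimport comes from the text; the file path and the `Country` column header are assumptions, not the names used in the original spreadsheet.

```cypher
// Sketch only: the file path and the "Country" column header are assumed.
// 1. LOAD CSV tells Neo4j to read the file; WITH HEADERS exposes each row
//    as a map keyed by the column names in the first line of the file.
// 2. AS csvimport assigns a variable to the current row.
LOAD CSV WITH HEADERS FROM "file:///tmp/countries.csv" AS csvimport
// 3. The column names are then used as keys on that variable.
RETURN csvimport.Country AS country
LIMIT 5;
```

Previewing a few rows like this before writing anything to the graph is a cheap way to check that the headers are spelled the way the statement expects.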
One thing that I did want to do then was to use labels to provide structure to the graph, and to use them for indexing:
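A minimal sketch of a labelled, indexed import follows. The labels `Country` and `FoodCategory`, the `name` property, the file names and the column headers are all assumptions made for illustration. Each statement is meant to be run on its own.

```cypher
// Assumed labels and property names. This is the index syntax of the
// Neo4j 2.x line; newer releases use CREATE INDEX FOR (c:Country) ON (c.name).
CREATE INDEX ON :Country(name);
CREATE INDEX ON :FoodCategory(name);

// MERGE keeps the import repeatable: running it twice does not
// create duplicate nodes for the same name.
LOAD CSV WITH HEADERS FROM "file:///tmp/countries.csv" AS csvimport
MERGE (:Country {name: csvimport.Country});

LOAD CSV WITH HEADERS FROM "file:///tmp/foodcategories.csv" AS csvimport
MERGE (:FoodCategory {name: csvimport.Category});
```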
With that import I have my Countries and my Food Categories loaded, so now I want to add some relationships. I chose a model like the one outlined below: a country uses different food categories, each at a different rate (kilocalories used per day).

So first we import the relationships between the countries and the food categories used in that country:
As you can see, each relationship holds the number of kilocalories per day that the country gets from that specific food category.
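A relationship import along these lines could be sketched as follows. The relationship type `USES`, the `kcal` property, the file name and the column headers are assumptions, and the statement presumes the Country and FoodCategory nodes already exist.

```cypher
// Assumed file layout: one row per (country, food category) pair.
LOAD CSV WITH HEADERS FROM "file:///tmp/country_food.csv" AS csvimport
// Look up the two endpoints that were imported earlier.
MATCH (c:Country {name: csvimport.Country}),
      (f:FoodCategory {name: csvimport.Category})
// One USES relationship per pair, created only if it is not there yet.
MERGE (c)-[u:USES]->(f)
// LOAD CSV reads every value as a string, so convert before storing.
SET u.kcal = toFloat(csvimport.Kcal);
```

Storing the value as a number rather than a string is what makes numeric comparisons on the relationship possible later on.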




That was quick!


So now we can do some querying. Let's see which food categories Belgium and the Netherlands have in common, and which of them make up a significant part of the diet:
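Under the same assumed names as before (`Country`, `FoodCategory`, `USES`, `kcal`), a query for the shared categories might look roughly like this:

```cypher
// Food categories that both countries are connected to.
MATCH (be:Country {name: "Belgium"})-[r1:USES]->(f:FoodCategory),
      (nl:Country {name: "Netherlands"})-[r2:USES]->(f)
RETURN f.name AS category,
       r1.kcal AS belgium_kcal,
       r2.kcal AS netherlands_kcal
ORDER BY belgium_kcal DESC;
```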




When we limit the query to only the food categories that are used for more than 500 kcal per day, we get:
These are the categories that apply:

  • Animal Products
  • Vegetal Products
  • Cereals - Excluding Beer (strange!)
  • Wheat
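The 500 kcal filter behind a list like this could be expressed roughly as below. The names are the same assumed ones (`Country`, `FoodCategory`, `USES`, `kcal`), and the sketch assumes the threshold has to hold for both countries, which the text does not spell out.

```cypher
MATCH (be:Country {name: "Belgium"})-[r1:USES]->(f:FoodCategory),
      (nl:Country {name: "Netherlands"})-[r2:USES]->(f)
// Keep only categories above 500 kcal per day in both countries (assumption).
WHERE r1.kcal > 500 AND r2.kcal > 500
RETURN f.name AS category,
       r1.kcal AS belgium_kcal,
       r2.kcal AS netherlands_kcal;
```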


Then, I decided to use LOAD CSV one last time to add some health data that was also in the original dataset: the life expectancy data of the countries in the dataset. This data contains two interesting data elements that I imported:
  • The Life Expectancy At Birth (LEAB)
  • The HEalthy Life Expectancy At Birth (HELEAB)
I decided to import both of these, in a specific way. You may have been able to tell from the model picture above, but I created an in-graph Life Expectancy Index: I first imported 100 Life Expectancies (1-100 years of age) as separate nodes, and then connected the countries to these nodes with LOAD CSV. I used two different types of relationships for the LEAB and the HELEAB.
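The index nodes themselves do not need a CSV file. A minimal sketch, assuming a label `LifeExpectancy` and a property `years` (both names are illustrative), is:

```cypher
// Create one node per year of age, 1 through 100.
FOREACH (age IN range(1, 100) |
  CREATE (:LifeExpectancy {years: age}));

// Optional: index the lookup property (Neo4j 2.x syntax), run separately.
CREATE INDEX ON :LifeExpectancy(years);
```

Modelling the ages as nodes rather than as a property on each country means that countries with the same life expectancy end up connected to the same node, which makes them easy to find by traversal.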




The following import was easy using LOAD CSV:
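Such an import could be sketched as below. The file name, the column headers `LEAB` and `HELEAB`, the `LifeExpectancy` nodes with a `years` property, and the relationship types `HAS_LEAB` and `HAS_HELEAB` are assumptions. The sketch also assumes the CSV holds whole years.

```cypher
LOAD CSV WITH HEADERS FROM "file:///tmp/life_expectancy.csv" AS csvimport
// Values arrive as strings; toInt is the 2.x-era name
// (later Neo4j versions call this function toInteger).
WITH csvimport.Country AS country,
     toInt(csvimport.LEAB) AS leab,
     toInt(csvimport.HELEAB) AS heleab
MATCH (c:Country {name: country}),
      (l:LifeExpectancy {years: leab}),
      (h:LifeExpectancy {years: heleab})
// Two relationship types, one per statistic.
MERGE (c)-[:HAS_LEAB]->(l)
MERGE (c)-[:HAS_HELEAB]->(h);
```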
 




So then we could actually revisit the queries above, but include these interesting health stats about life expectancies:
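The life expectancy part of such a query might look like the following sketch, again using the assumed names `LifeExpectancy`, `years`, `HAS_LEAB` and `HAS_HELEAB`:

```cypher
MATCH (c:Country)-[:HAS_LEAB]->(l:LifeExpectancy),
      (c)-[:HAS_HELEAB]->(h:LifeExpectancy)
WHERE c.name IN ["Belgium", "Netherlands"]
RETURN c.name AS country,
       l.years AS life_expectancy_at_birth,
       h.years AS healthy_life_expectancy_at_birth;
```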



The result shows how Belgium and the Netherlands have identical LEABs, but different HELEABs – interesting.





I am sure there are a bunch of other interesting queries in this dataset, but for now I think I have satisfied my curiosity - and learned about an awesome new import tool: LOAD CSV.


Hope this was useful.


Cheers

Rik

Tuesday, 17 December 2013

Fascinating food networks, in neo4j

When you're passionate about graphs like I am, you start to see them everywhere. And as we are getting closer to the food-heavy season of the year, it's perhaps no coincidence that the graph I will be introducing in this blogpost is about food.

A couple of weeks ago, when I woke up early (!) Sunday morning to get "pistolets" and croissants for my family from our local bakery, I immediately took notice when I saw a graph behind the bakery counter. It was a "foodpairing" graph, sponsored by the people of Puratos - a wholesale provider of bakery products, grains, etc. So I get home and start googling, and before you know it I find some terribly interesting research by Yong-Yeol (YY) Ahn, featured in a Wired article, and in Scientific American, and in Nature. This researcher had done some fascinating work in understanding all 57k recipes from Epicurious, Allrecipes and Menupan, their composing ingredients and ingredient categories, their origin and - perhaps most fascinating of all - their chemical compounds.

And best of all: he made his datasets (this one and this one) available, so that I could spend some time trying to get it into neo4j and take it for a spin.

The dataset: some graph cleanup required

The dataset was there, but clearly wasn't perfect for import yet. I would have to do some work. And like always, that work starts with a model. Time to use Arrows again, and start drawing. I ended up with this:

The challenge really was in the recipes. As you can see from the screenshot below, that data is/was hugely denormalised in the dataset that I found, and logically so: some recipes will only have a very limited number of ingredients, others will have lots and lots:

So what do you do - especially when, like me, you're not a programmer? Indeed, MS Excel to the rescue!

It turned out to be a bit of manual work, but in the end I found it very easy to create the sheet that I needed. It was even less than 500k rows long in the end - so Excel didn't really blink. You can find the final Excel file that I created over here.

Then it was really just a matter of exporting the Excel sheets to CSV files, and getting them ready for import into neo4j with neo4j-shell-tools. Again: easy enough - I had already been through this a couple of times before. You can find the zip file with all the csv files over here, and the neo4j-shell instructions are in this gist.

As you can see from the screenshot below, the dataset was well imported, without any issues, in a matter of minutes.


So then, the fun could begin! Interactive exploration, in the awesome neo4j browser.

Query fun on the foodnetwork

I have put all of the queries that I wrote on this gist over here - but I am sure you can come up with some more interesting ones.

Let's see if we can find out how many recipe-categories there would be in the different areas of the dataset. That would mean looking for the following pattern:

The cypher query would look something like this:
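Since the model picture is not reproduced here, the following is only a guess at the shape of such a query. Every label (`Area`, `Cuisine`, `Recipe`, `Ingredient`, `IngredientCategory`) and every relationship type in it is hypothetical and would need to be swapped for the names in the actual model.

```cypher
// Hypothetical model: area <- cuisine <- recipe -> ingredient -> category.
MATCH (a:Area)<-[:PART_OF]-(:Cuisine)<-[:BELONGS_TO]-(r:Recipe),
      (r)-[:CONTAINS]->(:Ingredient)-[:IS_A]->(ic:IngredientCategory)
// Count each recipe once per area and category, even if it contains
// several ingredients from the same category.
RETURN a.name AS area,
       ic.name AS category,
       count(DISTINCT r) AS recipes
ORDER BY recipes DESC
LIMIT 20;
```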


and that would yield the following result:
Clearly North America is leading the charts up here, but it's kind of interesting to compare the different continents/areas and compare what types of ingredient-categories are leading there.

Or another interesting example, zooming into the specific Cuisines: what are the most popular ingredient categories in Belgium and the Netherlands, two neighbouring countries with a lot in common. The cypher query would look something like:
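A sketch of what this could look like, with hypothetical labels and relationship types (`Cuisine`, `Recipe`, `Ingredient`, `IngredientCategory`, `BELONGS_TO`, `CONTAINS`, `IS_A`) and assumed cuisine names:

```cypher
MATCH (c:Cuisine)<-[:BELONGS_TO]-(r:Recipe),
      (r)-[:CONTAINS]->(:Ingredient)-[:IS_A]->(ic:IngredientCategory)
// The exact spelling of the cuisine names in the dataset is assumed.
WHERE c.name IN ["Belgium", "Netherlands"]
RETURN c.name AS cuisine,
       ic.name AS category,
       count(DISTINCT r) AS recipes
ORDER BY cuisine, recipes DESC;
```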

and the results would look like this (click for larger view): 

And then last but not least, let's look at some specific recipes based on actual ingredients that we like. For example, I am a big fan of a "salade Liègeoise", which is a lukewarm dish with bacon, green beans, potatoes and in some cases, hard boiled eggs. Let's see if we could find any other recipes in our database that would use these ingredients? Chances are that we would like them, no? So here goes. The cypher query would go like this:
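One way to phrase such a search is sketched below. The labels `Recipe` and `Ingredient`, the relationship type `CONTAINS`, and the exact spelling of the ingredient names are assumptions.

```cypher
MATCH (r:Recipe)-[:CONTAINS]->(i:Ingredient)
// collect() gathers all ingredient names of a recipe into one list.
WITH r, collect(i.name) AS ingredients
// Keep only recipes whose list contains every wanted ingredient.
WHERE ALL(wanted IN ["bacon", "green bean", "potato"]
          WHERE wanted IN ingredients)
RETURN r, ingredients
LIMIT 25;
```

Collecting first and filtering afterwards keeps the full ingredient list in the result, so the other ingredients of each matching recipe remain visible.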

Note the use of the "collect" function to get all the ingredients of a recipe into one resultset column. And the result is actually quite interesting:


And also visually this gives us a pretty interesting picture:
Turns out there's quite a few similar dishes that I could choose from. Gotta do that some day :) ...

And now it's your turn

If you want to play around with this dataset yourself, there are multiple options:
  • start with the zipped import files and the import script as described above
  • download the zipped graph.db directory from over here.
  • pay a visit to our friends at Graphenedb.com, who have an extremely nice sandbox environment that you can play around with. Handle with care, of course!
If you do, you may also want to apply this grass-file so that you don't have to mess around with the default settings. 

I hope you thought this was as interesting as I found it - and as always, would love to get your feedback! In any case, I wish you and your families a Merry Christmas, and a Happy New Year!

Cheers

Rik