Bruggen Blog: excel

Tuesday, 17 December 2013

Fascinating food networks, in neo4j

When you're passionate about graphs like I am, you start to see them everywhere. And as we are getting closer to the food-heavy season of the year, it's perhaps no coincidence that the graph I will be introducing in this blogpost is about food.

A couple of weeks ago, when I woke up early (!) on a Sunday morning to get "pistolets" and croissants for my family from our local bakery, I immediately took notice when I saw a graph behind the bakery counter. It was a "foodpairing" graph, sponsored by the people of Puratos - a wholesale provider of bakery products, grains, etc. So I got home and started googling, and before I knew it I had found some terribly interesting research by Yong-Yeol (YY) Ahn, featured in a Wired article, in Scientific American, and in Nature. This researcher had done some fascinating work in understanding all 57k recipes from Epicurious, Allrecipes and Menupan, their constituent ingredients and ingredient categories, their origin and - perhaps most fascinating of all - their chemical compounds.

And best of all: he made his datasets (this one and this one) available, so that I could spend some time trying to get them into neo4j and take them for a spin.

The dataset: some graph cleanup required

The dataset was there, but clearly wasn't perfect for import yet. I would have to do some work. And like always, that work starts with a model. Time to use Arrows again, and start drawing. I ended up with this:

The challenge really was in the recipes. As you can see from the screenshot below, that data is/was hugely denormalised in the dataset that I found, and logically so: some recipes will only have a very limited number of ingredients, others will have lots and lots:

So what do you do - especially when, like me, you're not a programmer? Indeed, MS Excel to the rescue!

It turned out to be a bit of manual work, but in the end I found it very easy to create the sheet that I needed. It was even less than 500k rows long in the end - so Excel didn't really blink. You can find the final Excel file that I created over here.
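For readers who would rather script this step than do it by hand, the same reshaping can be sketched in a few lines of Python. This is only an illustration, not what I actually did: it assumes each source row holds a cuisine followed by a variable number of ingredients, and the sample rows, the `normalise` function and the generated recipe ids are all made up for the example.

```python
import csv
import io

# Hypothetical sample in the assumed wide layout:
# cuisine first, then a variable number of ingredients per row.
WIDE_ROWS = """\
African,chicken,cinnamon,onion
EastAsian,soy_sauce,rice
"""


def normalise(wide_text):
    """Turn wide recipe rows into (recipe_id, cuisine, ingredient) rows."""
    long_rows = []
    reader = csv.reader(io.StringIO(wide_text))
    for recipe_id, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        cuisine, ingredients = row[0], row[1:]
        # One output row per ingredient, so every recipe has the same shape.
        for ingredient in ingredients:
            long_rows.append((recipe_id, cuisine, ingredient))
    return long_rows


long_rows = normalise(WIDE_ROWS)
for long_row in long_rows:
    print(long_row)
```

Each output row carries exactly one recipe-ingredient pair, which is the shape you want before exporting to CSV files for import.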

Then it was really just a matter of exporting the Excel sheets to CSV files, and getting them ready for import into neo4j with neo4j-shell-tools. Again: easy enough - I sort of went through this a couple of times before. You can find the zip file with all the csv files over here, and the neo4j-shell instructions are in this gist.

As you can see from the screenshot below, the dataset was well imported, without any issues, in a matter of minutes.

So then, the fun could begin! Interactive exploration, in the awesome neo4j browser.

Query fun on the foodnetwork

I have put all of the queries that I wrote on this gist over here - but I am sure you can come up with some more interesting ones.

Let's see if we can find out how many recipe-categories there would be in the different areas of the dataset. That would mean looking for the following pattern:

The cypher query would look something like this:

and that would yield the following result:

Clearly North America is leading the charts up here, but it's kind of interesting to compare the different continents/areas and compare what types of ingredient-categories are leading there.
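The counting behind that comparison is a plain group-and-count. Here is a minimal Python sketch of the idea on a handful of made-up rows; the area and category names are placeholders, not values taken from the dataset, and `category_counts` is a hypothetical helper rather than anything from the gist.

```python
from collections import Counter

# Hypothetical (area, ingredient_category) pairs,
# one pair per ingredient used in a recipe from that area.
USES = [
    ("NorthAmerica", "dairy"),
    ("NorthAmerica", "dairy"),
    ("NorthAmerica", "spice"),
    ("WesternEurope", "dairy"),
    ("WesternEurope", "vegetable"),
    ("WesternEurope", "vegetable"),
]


def category_counts(uses):
    """Count how often each ingredient category shows up per area."""
    return Counter(uses)


counts = category_counts(USES)

# Most frequent (area, category) combinations first.
for (area, category), n in counts.most_common():
    print(area, category, n)
```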

Or another interesting example, zooming into the specific Cuisines: what are the most popular ingredient categories in Belgium and the Netherlands, two neighbouring countries with a lot in common? The cypher query would look something like:

and the results would look like this:

And then last but not least, let's look at some specific recipes based on actual ingredients that we like. For example, I am a big fan of a "salade Liègeoise", which is a lukewarm dish with bacon, green beans, potatoes and in some cases, hard boiled eggs. Let's see if we could find any other recipes in our database that would use these ingredients? Chances are that we would like them, no? So here goes. The cypher query would go like this:

Note the use of the "collect" function to get all the ingredients of a recipe into one resultset column. And the result is actually quite interesting:

And also visually this gives us a pretty interesting picture:

Turns out there's quite a few similar dishes that I could choose from. Gotta do that some day :) ...
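If the idea of collecting ingredients per recipe is new to you, the logic can be sketched in plain Python: group the ingredients by recipe, then keep only the recipes that contain everything you asked for. The recipe names, ingredient rows and the `recipes_with` helper below are invented for the illustration and do not come from the dataset.

```python
from collections import defaultdict

# Hypothetical (recipe, ingredient) rows.
ROWS = [
    ("recipe_a", "bacon"), ("recipe_a", "green_bean"),
    ("recipe_a", "potato"), ("recipe_a", "egg"),
    ("recipe_b", "bacon"), ("recipe_b", "potato"), ("recipe_b", "onion"),
    ("recipe_c", "rice"), ("recipe_c", "egg"),
]

# Ingredients we want every matching recipe to contain.
WANTED = {"bacon", "potato"}


def recipes_with(rows, wanted):
    """Group ingredients per recipe and keep recipes containing all wanted ones."""
    ingredients_by_recipe = defaultdict(list)
    for recipe, ingredient in rows:
        # This grouping plays the role of "collect": one list per recipe.
        ingredients_by_recipe[recipe].append(ingredient)
    return {
        recipe: ingredients
        for recipe, ingredients in ingredients_by_recipe.items()
        if wanted <= set(ingredients)
    }


matches = recipes_with(ROWS, WANTED)
for recipe, ingredients in matches.items():
    print(recipe, ingredients)
```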

And now it's your turn

If you want to play around with this dataset yourself, there are multiple options:

start with the zipped import files and the import script as described above
download the zipped graph.db directory from over here.
pay a visit to our friends at Graphenedb.com, who have an extremely nice sandbox environment that you can play around with. Handle with care, of course!

If you do, you may also want to apply this grass-file so that you don't have to mess around with the default settings.

I hope you thought this was as interesting as I found it - and as always, would love to get your feedback! In any case, I wish you and your families

a Merry Christmas, and a Happy New Year!

Cheers

Rik
