Tuesday 17 December 2013

Fascinating food networks, in neo4j

When you're passionate about graphs like I am, you start to see them everywhere. And as we are getting closer to the food-heavy season of the year, it's perhaps no coincidence that the graph I will be introducing in this blog post is about food.

A couple of weeks ago, when I woke up early (!) on a Sunday morning to get "pistolets" and croissants for my family from our local bakery, I immediately took notice when I saw a graph behind the bakery counter. It was a "foodpairing" graph, sponsored by the people of Puratos - a wholesale provider of bakery products, grains, etc. So I got home and started googling, and before I knew it I had found some terribly interesting research by Yong-Yeol (YY) Ahn, featured in a Wired article, and in Scientific American, and in Nature. This researcher had done some fascinating work in understanding all 57k recipes from Epicurious, Allrecipes and Menupan, their composing ingredients and ingredient categories, their origin and - perhaps most fascinating of all - their chemical compounds.

And best of all: he made his datasets (this one and this one) available, so that I could spend some time trying to get them into neo4j and take them for a spin.

The dataset: some graph cleanup required

The dataset was there, but clearly wasn't perfect for import yet. I would have to do some work. And like always, that work starts with a model. Time to use Arrows again, and start drawing. I ended up with this:

The challenge really was in the recipes. As you can see from the screenshot below, that data is/was hugely denormalised in the dataset that I found, and logically so: some recipes will only have a very limited number of ingredients, others will have lots and lots:

So what do you do - especially when, like me, you're not a programmer? Indeed, MS Excel to the rescue!

It turned out to be a bit of manual work, but in the end I found it very easy to create the sheet that I needed. It was even less than 500k rows long - so Excel didn't really blink. You can find the final Excel file that I created over here.
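
For readers who would rather script this step than do it in Excel, here is a minimal Python sketch of the same normalisation idea. It assumes that each line of the recipe file holds a cuisine followed by a variable number of ingredients, separated by commas; the function name and the sample rows are made up for illustration and are not taken from the actual dataset or from my Excel file.

```python
import csv
import io


def normalise_recipes(lines):
    """Turn wide rows (cuisine, ingredient, ingredient, ...) into one
    (recipe_id, cuisine, ingredient) row per ingredient."""
    rows = []
    # Assumption: a recipe has no id of its own, so we use its line number.
    for recipe_id, fields in enumerate(csv.reader(lines), start=1):
        if not fields:
            continue  # skip empty lines
        cuisine, *ingredients = [field.strip() for field in fields]
        for ingredient in ingredients:
            if ingredient:  # ignore empty trailing cells
                rows.append((recipe_id, cuisine, ingredient))
    return rows


# Made-up sample in the assumed layout: two recipes of different lengths.
sample = io.StringIO(
    "African,chicken,cinnamon,soy_sauce\n"
    "Belgian,bacon,potato\n"
)
normalised = normalise_recipes(sample)
for row in normalised:
    print(row)
```

Every output row then maps onto one recipe-to-ingredient relationship, which is the shape that a CSV import wants.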

Then it was really just a matter of exporting the Excel sheets to CSV files, and getting them ready for import into neo4j with neo4j-shell-tools. Again: easy enough - I had sort of gone through this a couple of times before. You can find the zip file with all the CSV files over here, and the neo4j-shell instructions are in this gist.

As you can see from the screenshot below, the dataset imported cleanly, without any issues, in a matter of minutes.


So then, the fun could begin! Interactive exploration, in the awesome neo4j browser.

Query fun on the foodnetwork

I have put all of the queries that I wrote on this gist over here - but I am sure you can come up with some more interesting ones.

Let's see if we can find out how many recipe-categories there are in the different areas of the dataset. That would mean looking for the following pattern:

The cypher query would look something like this:


and that would yield the following result:
Clearly North America is leading the charts up here, but it's kind of interesting to compare the different continents/areas and see which types of ingredient-categories are leading there.
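
To make the counting idea above a bit more concrete, here is a small Python sketch of the same kind of aggregation on an in-memory list. This is not the Cypher query from the gist: it is just an illustration of "group by area and ingredient category, count, and sort", and the area names, category names and the `category_counts` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical sample: one (area, ingredient_category) pair for every time
# an ingredient of that category is used by a recipe from that area.
uses = [
    ("NorthAmerican", "vegetable"),
    ("NorthAmerican", "vegetable"),
    ("NorthAmerican", "dairy"),
    ("WesternEuropean", "dairy"),
    ("WesternEuropean", "dairy"),
    ("WesternEuropean", "dairy"),
    ("WesternEuropean", "meat"),
]


def category_counts(pairs):
    """Count each (area, category) pair, most frequent first."""
    counts = Counter(pairs)
    # Sort by descending count, then alphabetically to keep ties stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = category_counts(uses)
for (area, category), total in ranking:
    print(area, category, total)
```

In the graph, the pairs would come from walking the pattern from an area down to the ingredient categories instead of from a hand-written list.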

Or another interesting example, zooming into the specific Cuisines: what are the most popular ingredient categories in Belgium and the Netherlands, two neighbouring countries with a lot in common? The cypher query would look something like:

and the results would look like this:

And then last but not least, let's look at some specific recipes based on actual ingredients that we like. For example, I am a big fan of a "salade Liègeoise", which is a lukewarm dish with bacon, green beans, potatoes and in some cases, hard boiled eggs. Let's see if we can find any other recipes in our database that use these ingredients. Chances are that we would like them, no? So here goes. The cypher query would go like this:

Note the use of the "collect" function to get all the ingredients of a recipe into one resultset column. And the result is actually quite interesting:


And also visually this gives us a pretty interesting picture:
Turns out there are quite a few similar dishes that I could choose from. Gotta do that some day :) ...
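
The "collect" idea mentioned above, gathering all the ingredients of a recipe into one column and then looking at how many of my favourite ingredients it shares, can be sketched in plain Python as well. Again, this is only an illustration under my own assumptions and not the Cypher query from the gist: the recipe names, the ingredient spellings, the `similar_recipes` helper and its `min_shared` threshold are all made up.

```python
from collections import defaultdict


def similar_recipes(pairs, wanted, min_shared=2):
    """Group ingredients per recipe, then keep the recipes that share at
    least `min_shared` ingredients with the `wanted` list."""
    ingredients_by_recipe = defaultdict(set)
    for recipe, ingredient in pairs:
        # The equivalent of "collect": one set of ingredients per recipe.
        ingredients_by_recipe[recipe].add(ingredient)

    wanted = set(wanted)
    matches = []
    for recipe, ingredients in ingredients_by_recipe.items():
        shared = ingredients & wanted
        if len(shared) >= min_shared:
            matches.append((recipe, sorted(ingredients), len(shared)))
    # Best overlap first, recipe name as tie-breaker.
    return sorted(matches, key=lambda match: (-match[2], match[0]))


# Hypothetical (recipe, ingredient) rows.
pairs = [
    ("r1", "bacon"), ("r1", "green_bean"), ("r1", "potato"), ("r1", "egg"),
    ("r2", "bacon"), ("r2", "potato"), ("r2", "onion"),
    ("r3", "tomato"), ("r3", "basil"),
]
favourites = ["bacon", "green_bean", "potato", "egg"]
matches = similar_recipes(pairs, favourites)
for recipe, ingredients, shared in matches:
    print(recipe, ingredients, shared)
```

Raising `min_shared` narrows the list to recipes that are closer to the original dish; lowering it gives more, but looser, suggestions.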

And now it's your turn

If you want to play around with this dataset yourself, there are multiple options:
  • start with the zipped import files and the import script as described above
  • download the zipped graph.db directory from over here.
  • pay a visit to our friends at Graphenedb.com, who have an extremely nice sandbox environment that you can play around with. Handle with care, of course!
If you do, you may also want to apply this grass-file so that you don't have to mess around with the default settings. 

I hope you thought this was as interesting as I found it - and as always, would love to get your feedback! In any case, I wish you and your families a Merry Christmas, and a Happy New Year!

Cheers

Rik

7 comments:

  2. What a blog it is. This is giving me very important information so thanks a lot for posting.

    Online Food Ordering

    Replies
    1. Pleasure! Hope you have fun with this - and happy holidays!

  3. Can I use this data for commercial use?

    Replies
    1. I think so - although it's really old by now and I doubt that there are really interesting commercial uses. What are you thinking about?

      You should of course check the license of the original data that I have used above...

      Rik

    2. Thanks Rik. I was thinking of making an app for myself first. Something to suggest recipes. Not sure right now this will ever become a commercial app, but I am asking just in case. Do you have more recent data? Thanks for making this available. (BTW: for future readers, it would be nice if the queries could be copied to the clipboard. With images, we need to retype them.)

    3. Stephane, all the queries are in the gist that is mentioned in the post: https://gist.github.com/rvanbruggen/8007697#file-foodqueries-cql ... Does that not work???
