Showing posts with label neo4j-shell-tools. Show all posts

Tuesday, 17 December 2013

Fascinating food networks, in neo4j

When you're passionate about graphs like I am, you start to see them everywhere. And as we are getting closer to the food-heavy season of the year, it's perhaps no coincidence that the graph I will be introducing in this blogpost is about food.

A couple of weeks ago, when I woke up early (!) Sunday morning to get "pistolets" and croissants for my family from our local bakery, I immediately took notice when I saw a graph behind the bakery counter. It was a "foodpairing" graph, sponsored by the people of Puratos - a wholesale provider of bakery products, grains, etc. So I get home and start googling, and before you know it I find some terribly interesting research by Yong-Yeol (YY) Ahn, featured in a Wired article, and in Scientific American, and in Nature. This researcher had done some fascinating work in understanding all 57k recipes from Epicurious, Allrecipes and Menupan, their composing ingredients and ingredient categories, their origin and - perhaps most fascinating of all - their chemical compounds.

And best of all: he made his datasets (this one and this one) available, so that I could spend some time trying to get it into neo4j and take it for a spin.

The dataset: some graph cleanup required

The dataset was there, but clearly wasn't perfect for import yet. I would have to do some work. And like always, that work starts with a model. Time to use Arrows again, and start drawing. I ended up with this:

The challenge really was in the recipes. As you can see from the screenshot below, that data is/was hugely denormalised in the dataset that I found, and logically so: some recipes will only have a very limited number of ingredients, others will have lots and lots:

So what do you do - especially when, like me, you're not a programmer? Indeed, MS Excel to the rescue!

It turned out to be a bit of manual work, but in the end I found it very easy to create the sheet that I needed. It was even less than 500k rows long in the end - so Excel didn't really blink. You can find the final excel file that I created over here.
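For those who would rather script this step than do it in Excel, here is a minimal Python sketch of the same normalisation. It assumes (this is an assumption, not a description of the actual file) that every input line looks like "cuisine,ingredient,ingredient,..." with a variable number of ingredients, and it invents a running recipe_id because such an input carries no identifier of its own. The function names and the sample rows are made up for illustration.

```python
import csv
import io


def normalise_recipes(lines):
    """Turn wide 'cuisine,ingredient,ingredient,...' rows into long rows.

    Returns a list of (recipe_id, cuisine, ingredient) tuples, one per
    recipe/ingredient pair. recipe_id is a made-up running number, since
    the assumed input has no recipe identifier of its own.
    """
    rows = []
    recipe_id = 0
    for record in csv.reader(lines):
        # Skip blank lines and comment lines starting with '#'.
        if not record or record[0].startswith("#"):
            continue
        recipe_id += 1
        cuisine, ingredients = record[0], record[1:]
        for ingredient in ingredients:
            rows.append((recipe_id, cuisine, ingredient))
    return rows


def to_csv(rows):
    """Serialise the long rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["recipe_id", "cuisine", "ingredient"])
    writer.writerows(rows)
    return out.getvalue()


# Invented sample input: two recipes with a different number of ingredients.
sample = [
    "# cuisine,ingredient,...",
    "Belgian,potato,bacon,green_bean",
    "Dutch,potato,cheese",
]
long_rows = normalise_recipes(sample)
print(to_csv(long_rows))
```

Each output row now describes exactly one recipe-ingredient pair, which is the shape a relationship import wants: one line per connection.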

Then it was really just a matter of exporting excel to CSV files, and getting it ready for import into neo4j with neo4j-shell-tools. Again: easy enough - I sort of went through this a couple of times before. You can find the zip file with all the csv files over here, and the neo4j-shell instructions are in this gist.

As you can see from the screenshot below, the dataset was well imported, without any issues, in a matter of minutes.


So then, the fun could begin! Interactive exploration, in the awesome neo4j browser.

Query fun on the foodnetwork

I have put all of the queries that I wrote on this gist over here - but I am sure you can come up with some more interesting ones.

Let's see if we can find out how many recipe-categories there would be in the different areas of the dataset. That would mean looking for the following pattern:

The cypher query would look something like this:


and that would yield the following result:
Clearly North America is leading the charts up here, but it's kind of interesting to compare the different continents/areas and see which types of ingredient-categories are leading in each.

Or another interesting example, zooming into the specific Cuisines: what are the most popular ingredient categories in Belgium and the Netherlands, two neighbouring countries with a lot in common? The cypher query would look something like:

and the results would look like this (click for larger view): 

And then last but not least, let's look at some specific recipes based on actual ingredients that we like. For example, I am a big fan of a "salade Liègeoise", which is a lukewarm dish with bacon, green beans, potatoes and in some cases, hard boiled eggs. Let's see if we can find any other recipes in our database that use these ingredients. Chances are that we would like them, no? So here goes. The cypher query would go like this:

Note the use of the "collect" function to get all the ingredients of a recipe into one resultset column. And the result is actually quite interesting:


And also visually this gives us a pretty interesting picture:
Turns out there's quite a few similar dishes that I could choose from. Gotta do that some day :) ...
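To make the "collect" idea a bit more tangible: the aggregation groups the rows by recipe and gathers each recipe's ingredients into one list. The Python sketch below is only an in-memory analogy of that grouping, not the cypher query itself. The recipe names and pairs are invented toy data, and ranking recipes by how many of the wanted ingredients they share is my own assumption about what "similar" could mean here.

```python
from collections import defaultdict


def collect_ingredients(pairs):
    """Group (recipe, ingredient) pairs into {recipe: [ingredients]}."""
    grouped = defaultdict(list)
    for recipe, ingredient in pairs:
        grouped[recipe].append(ingredient)
    return dict(grouped)


def rank_by_overlap(grouped, wanted):
    """Rank recipes by how many of the wanted ingredients they contain."""
    wanted = set(wanted)
    scored = [
        (len(wanted & set(ingredients)), recipe)
        for recipe, ingredients in grouped.items()
    ]
    # Highest overlap first; ties broken alphabetically by recipe name.
    ordered = sorted(scored, key=lambda item: (-item[0], item[1]))
    # Recipes sharing none of the wanted ingredients are left out.
    return [(recipe, score) for score, recipe in ordered if score > 0]


# Invented toy data, not taken from the real dataset.
pairs = [
    ("salade liegeoise", "bacon"),
    ("salade liegeoise", "green_bean"),
    ("salade liegeoise", "potato"),
    ("potato gratin", "potato"),
    ("potato gratin", "cheese"),
    ("fruit salad", "apple"),
]
grouped = collect_ingredients(pairs)
ranking = rank_by_overlap(grouped, ["bacon", "green_bean", "potato", "egg"])
print(ranking)
```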

And now it's your turn

If you want to play around with this dataset yourself, there are multiple options:
  • start with the zipped import files and the import script as described above
  • download the zipped graph.db directory from over here.
  • pay a visit to our friends at Graphenedb.com, who have an extremely nice sandbox environment that you can play around with. Handle with care, of course!
If you do, you may also want to apply this grass-file so that you don't have to mess around with the default settings. 

I hope you thought this was as interesting as I found it - and as always, would love to get your feedback! In any case, I wish you and your families a Merry Christmas, and a Happy New Year!

Cheers

Rik

Friday, 6 December 2013

Untying the Graph Database Import Knot

Working for Neo Technology has many, many upsides. I love my job, love my colleagues, love our product, love our market - I think you can pretty much say that I am a happy camper. But. There's always a but. At least a couple of times a week I am confronted with things that make me go "Oh no, not that again!" And "that" is usually about one particular topic: importing data into Neo4j. Many smart people are having trouble with it - and there are many reasons for this. So let's start zooming into this Gordian Knot - and see if we can untie it - without having to cut it ;-) ...

The Graph Database Import Knot

The first thing that everyone should understand is that, in a connected world, importing data is, by definition, more difficult to do. It is a true "knot" that is terribly difficult to untie, for many different reasons.

Just logically, the problem of importing "connected" data is technically more difficult than with "unconnected" data structures. Importing unconnected data (e.g. the nodes of your graph model) is always easy/easier. Just dump it all in there. But then you come to importing the connections, the relationships, and you find that there's no such thing as an "external entity" (aka "the database schema") that is going to be ensuring the consistency and connectedness of the import. You have to do that yourself, and explicitly, by importing the relationships between a) a start node that you have to find, and b) an end node that you have to look up. It's just ... more complicated. Especially at scale, of course.
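The two-phase nature of the problem can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions, not how any of the neo4j tools work internally: the external ids, the HAS_INGREDIENT relationship type and the function names are all made up. The point is only that nodes can be dumped in blindly, while every relationship needs both of its endpoints looked up, and that nothing but your own code will notice a dangling reference.

```python
def import_nodes(node_rows):
    """Load nodes into an index keyed by their external id."""
    index = {}
    for external_id, properties in node_rows:
        index[external_id] = dict(properties)
    return index


def import_relationships(index, rel_rows):
    """Resolve both endpoints of each relationship through the index.

    Rows whose start or end node cannot be found are returned separately,
    because no schema is going to flag them for us.
    """
    relationships, rejected = [], []
    for start_id, rel_type, end_id in rel_rows:
        if start_id in index and end_id in index:
            relationships.append((start_id, rel_type, end_id))
        else:
            rejected.append((start_id, rel_type, end_id))
    return relationships, rejected


# Invented sample data: one recipe, two ingredients, one dangling reference.
nodes = [
    ("r1", {"name": "salade liegeoise"}),
    ("i1", {"name": "bacon"}),
    ("i2", {"name": "potato"}),
]
rels = [
    ("r1", "HAS_INGREDIENT", "i1"),
    ("r1", "HAS_INGREDIENT", "i2"),
    ("r1", "HAS_INGREDIENT", "i9"),  # i9 was never imported as a node
]
index = import_nodes(nodes)
loaded, rejected = import_relationships(index, rels)
print(len(loaded), "relationships loaded,", len(rejected), "rejected")
```

At scale, the cost sits in that lookup: the index has to live somewhere (in memory, or in a database index), and it is consulted twice for every relationship.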

So how to untie this knot? I can really see two steps that everyone needs to take, in order to do so:
  1. Understand the import problem. Every import is different, just like every graph is different. There is little or no uniformity there, and even though many people would love to have a silver bullet solution to this problem, the fact of the matter is that there is none - at least not today. So we will have to "create" a more or less complex import solution for every use case - using one of the tools at hand. But like with any problem, understanding the import problem is often the key to choosing the right solution - so that's what I will focus on here as well.
  2. Pick the right tool. There are many tools out there, and we should not fall for the law of the instrument - we should use the right tool for the job. Maybe this article can help by bringing these different tools together and giving them some structure. I have not used all of the tools, but I have used a few, so I can also tell you about my experiences. That should allow us to make some kind of a mapping between the different types of import problems, and the different tools at hand.
So let's give it a shot.

YOUR import scenario

Like I said before: one import problem is different from the next one. Some people want to store the facebook social graph in neo4j, other people just want to import a couple of thousand proteins and their interactions. It's really very different. So what are the questions that you should ask yourself? Let me try and map that out for you:


This little mindmap should give you an insight into the types of questions you should ask yourself. Some of these are project related, others are size/scale related, others are format related, and then the final set of questions are related to the type of import that you are trying to do. 

The Tools Inventory

If you have ever visited the neo4j website, you have probably come across the import page. There's a wealth of information there around the different types of tools available, but I would like to try and help by providing a bit of structure to these tools:


So these tools range from using a spreadsheet - which most of us should be able to wield as a tool - to writing a custom piece of software to achieve the solution to the import problem at hand. The order in which I present these is probably very close to "from easy to difficult", and "from not so powerful to very powerful".

So let's start doing a little assessment on these tools. Note that this is by no means scientific - this is just "Rik's view of the world".

Spreadsheets
  • Pros: Very easy: all you need to do is write some formulas that concatenate strings with cell content - and compose cypher statements this way. These cypher statements can then just be copied into the neo4j-shell.
  • Cons: Only works at limited scale (< 5000 nodes/relationships at a time). Performance is not good: overhead of unparametrized cypher transactions. Quirks in copying/pasting the statements above a certain scale. Piping the statements in can work on OSX/Linux - but not on Windows.

Neo4j-shell: Cypher Statements
  • Pros: Native toolset - no need to install anything else. Neo4j-shell can be used to pipe to in OSX/Linux - which can be very handy.
  • Cons: You have to create the statements (see above). If they are not parametrized, they will be slow because of the parsing overhead.

Neo4j-shell: neo4j-shell-tools
  • Pros: Fantastic, rich functionality for importing .csv, geoff and graphml files.
  • Cons: Not a part of the product (yet). Requires a separate install.

Command line: batch importer
  • Pros: High-performance, easy to use (if you know maven).
  • Cons: Specific purpose, for CSV files. Currently does not have easy install procedures.

ETL tools: Talend
  • Pros: Out of the box, versatile, customizable, uses specific Neo4j connector - both in online and offline modes.
  • Cons: Requires you to learn Talend. Current connector not yet upgraded to neo4j 2.0.

ETL tools: Mulesoft
  • Pros: Out of the box, versatile, customizable, uses the JDBC connector in online mode.
  • Cons: Requires you to learn Mulesoft. No batch loading of offline database supported.

Custom Software: Java API, REST API, Spring Data Neo4j
  • Pros: High Performance, perfectly customizable, supports different input types specific for your use case!
  • Cons: You have to write the code!
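
The difference between the "spreadsheet" way of composing statements and a parametrized statement can be sketched as follows. This is a Python stand-in for the spreadsheet formula, under a few assumptions: the :Ingredient label and name property are hypothetical, the function names are made up, and the $name parameter syntax is the one used by current Neo4j versions (the Neo4j 2.0 of this post wrote parameters as {name}). Nothing here talks to a database - it only builds the statement text.

```python
def cypher_literal(value):
    """Quote a string as a Cypher string literal.

    Backslashes and double quotes are escaped so that the value cannot
    break out of the surrounding quotes.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def concatenated_statements(names):
    """One self-contained statement per row, spreadsheet-formula style.

    Every statement has different text, so each one has to be parsed
    separately by the database.
    """
    return [
        "CREATE (:Ingredient {name: " + cypher_literal(name) + "});"
        for name in names
    ]


def parameterized_statement(names):
    """One statement text plus a separate parameter map per row.

    The text never changes, so it only needs to be parsed once.
    """
    statement = "CREATE (:Ingredient {name: $name})"
    return statement, [{"name": name} for name in names]


names = ["bacon", "green_bean", 'so-called "potato"']
for line in concatenated_statements(names):
    print(line)

statement, parameter_sets = parameterized_statement(names)
print(statement, parameter_sets)
```

The concatenated variant is what a formula in a spreadsheet cell produces; the parametrized variant is what custom software would typically send, which is where the performance difference noted in the assessment comes from.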

So if this assessment is close enough, then how would we map the different import scenarios sketched above to these different tools? Let's make an attempt at that.

Mapping the scenario to the inventory

Here's my mapping:


So there is pretty much a use case for every one of the tools - it's not like you can discard any of them easily. But, if you ask for my subjective assessment, here's my personal recommendation:
  • the spreadsheet way is fantastic. It just works, and it gets something done in no time. I still use it regularly.
  • neo4j-shell-tools is my personal favourite in terms of versatility. Easy to use, different file format support, scales to large datasets - what's not to like?
  • for many real-world solutions which require regular updates of the database - you will need to write software. Just like you used to do with your relational database systems - nothing's changed there!

Hope this was a useful discussion - if you want you can download the entire mindmap that I used for this blogpost from over here.

All the best

Rik