Friday 6 December 2013

Untying the Graph Database Import Knot

Working for Neo Technology has many, many upsides. I love my job, love my colleagues, love our product, love our market - I think you can pretty much say that I am a happy camper. But. There's always a but. At least a couple of times a week I am confronted with things that make me go "Oh no, not that again!" And "that" is usually about one particular topic: importing data into Neo4j. Many smart people are having trouble with it - and there are many reasons for this. So let's start zooming into this Gordian Knot - and see if we can untie it - without having to cut it ;-) ...

The Graph Database Import Knot

The first thing that everyone should understand is that, in a connected world, importing data is, by definition, more difficult to do. It is a true "knot" that is terribly difficult to untie, for many different reasons.

Just logically, the problem of importing "connected" data is technically more difficult than with "unconnected" data structures. Importing unconnected data (e.g. the nodes of your graph model) is always easy/easier. Just dump it all in there. But then you come to importing the connections, the relationships, and you find that there's no such thing as an "external entity" (aka "the database schema") that is going to ensure the consistency and connectedness of the import. You have to do that yourself, and explicitly, by importing the relationships between a) a start node that you have to find, and b) an end node that you have to look up. It's just ... more complicated. Especially at scale, of course.
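
To make this concrete, here is a minimal Cypher sketch of the difference. It assumes a hypothetical dataset of Person nodes identified by a name property - the labels and properties are just illustrations, not a prescription:

  // The "unconnected" part: just dump the nodes in there.
  CREATE (:Person {name: 'Alice'});
  CREATE (:Person {name: 'Bob'});

  // The "connected" part: every relationship first needs a lookup of
  // its start node and its end node - there is no schema doing that for you.
  MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'})
  CREATE (a)-[:KNOWS]->(b);

At scale, those lookups are exactly where the pain starts - which is why the choice of tool matters so much.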

So how to untie this knot? I can really see two steps that everyone needs to take, in order to do so:
  1. Understand the import problem. Every import is different, just like every graph is different. There is little or no uniformity there, and while many people would love to just have a silver bullet solution to this problem, the fact of the matter is that there is none - at least not today. We will therefore have to "create" a more or less complex import solution for every use case - using one of the tools at hand. But like with any problem, understanding the import problem is often the key to choosing the right solution - so that's what I will focus on here as well.
  2. Pick the right tool. There are many tools out there, and we should not fall victim to the law of the instrument - we should use the right tool for the job. Maybe this article can help by bringing these different tools together and bringing some structure to them. And even though I have not used all of these tools, I have used a few - so I can also tell you about my experiences. That should allow us to make some kind of a mapping between the different types of import problems and the different tools at hand.
So let's give it a shot.

YOUR import scenario

Like I said before: one import problem is different from the next one. Some people want to store the Facebook social graph in Neo4j, other people just want to import a couple of thousand proteins and their interactions. It really is very different. So what are the questions that you should ask yourself? Let me try and map that out for you:


This little mindmap should give you an insight into the types of questions you should ask yourself. Some of these are project related, others are size/scale related, others are format related, and the final set of questions is related to the type of import that you are trying to do.

The Tools Inventory

If you have ever visited the neo4j website, you have probably come across the import page. There's a wealth of information there around the different types of tools available, but I would like to try and help by providing a bit of structure to these tools:


So these tools range from using a spreadsheet - which most of us should be able to wield as a tool - to writing a custom piece of software to solve the import problem at hand. The order in which I present these is probably very close to "from easy to difficult", and "from not so powerful to very powerful".

So let's start doing a little assessment on these tools. Note that this is by no means scientific - this is just "Rik's view of the world".

Spreadsheets
  Pros: Very easy: all you need to do is write some formulas that concatenate strings with cell content - and compose cypher statements this way (see the sketch just below this overview). These cypher statements can then just be copied into the neo4j-shell.
  Cons: Only works at limited scale (< 5000 nodes/relationships at a time). Performance is not good: overhead of unparametrized cypher transactions. Quirks in copying/pasting the statements above a certain scale. Piping the statements in can work on OSX/Linux - but not on Windows.

Neo4j-shell - Cypher statements
  Pros: Native toolset - no need to install anything else. Statements can be piped into neo4j-shell on OSX/Linux - which can be very handy.
  Cons: You have to create the statements (see above). If they are not parametrized, they will be slow because of the parsing overhead.

Neo4j-shell - neo4j-shell-tools
  Pros: Fantastic, rich functionality for importing .csv, geoff and graphml files.
  Cons: Not a part of the product (yet). Requires a separate install.

Command line - batch importer
  Pros: High performance, easy to use (if you know maven).
  Cons: Specific purpose, for CSV files. Currently does not have easy install procedures.

ETL tools - Talend
  Pros: Out of the box, versatile, customizable, uses a specific Neo4j connector - both in online and offline modes.
  Cons: Requires you to learn Talend. Current connector not yet upgraded to neo4j 2.0.

ETL tools - Mulesoft
  Pros: Out of the box, versatile, customizable, uses the JDBC connector in online mode.
  Cons: Requires you to learn Mulesoft. No batch loading of an offline database supported.

Custom software - Java API / REST API / Spring Data Neo4j
  Pros: High performance, perfectly customizable, supports different input types specific to your use case!
  Cons: You have to write the code!
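
To make the first two entries of that overview a bit more tangible, here is a small sketch of what the spreadsheet approach produces, and what the parametrized alternative looks like. The column references and property names are purely illustrative assumptions, not something prescribed by the tools:

  // A spreadsheet formula along the lines of
  //   ="CREATE (:Person {name: '" & A2 & "', born: " & B2 & "});"
  // generates one unparametrized cypher statement per row, e.g.:
  CREATE (:Person {name: 'Alice', born: 1978});

  // The parametrized form of the same statement - the parser only sees it once,
  // and the values are supplied separately by the calling code (e.g. over the REST API):
  CREATE (:Person {name: {name}, born: {born}});

The generated statements can then be pasted into the neo4j-shell, or - on OSX/Linux - piped into it, as mentioned above.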

So if this assessment is close enough, how would we map the different import scenarios sketched above to these different tools? Let's make an attempt at creating that mapping.

Mapping the scenario to the inventory

Here's my mapping:


So there is pretty much a use case for every one of the tools - it's not like you can easily discard any of them. But if you were to ask for my subjective assessment, here are my personal recommendations:
  • the spreadsheet way is fantastic. It just works, and you can get something done in no time. I still use it regularly.
  • neo4j-shell-tools is my personal favourite in terms of versatility. Easy to use, different file format support, scales to large datasets - what's not to like?
  • for many real-world solutions that require regular updates of the database, you will need to write software. Just like you used to do with your relational database systems - nothing's changed there!

Hope this was a useful discussion - if you want, you can download the entire mindmap that I used for this blog post from over here.

All the best

Rik
