Friday 20 December 2013

Graphs for Everyone!

Here are some of my thoughts on how to best promote innovative, wonderful, new technology like neo4j 2.0 - and get it to be used ubiquitously. These are just my own thoughts - but I was hoping they would be useful to our thousands of devs and architects out there who are struggling to sell graph database technology to their peers, their bosses, their business.

So turn up your sound (the Prezi has voice-over - a great new feature!)



Let me know if you have any feedback - would love to hear your thoughts.

Hope this is useful.

Cheers

Rik

Tuesday 17 December 2013

Fascinating food networks, in neo4j

When you're passionate about graphs like I am, you start to see them everywhere. And as we are getting closer to the food-heavy season of the year, it's perhaps no coincidence that the graph I will be introducing in this blogpost is about food.

A couple of weeks ago, when I woke up early (!) on a Sunday morning to get "pistolets" and croissants for my family from our local bakery, I immediately took notice when I saw a graph behind the bakery counter. It was a "foodpairing" graph, sponsored by the people of Puratos - a wholesale provider of bakery products, grains, etc. So I got home, started googling, and before I knew it I had found some terribly interesting research by Yong-Yeol (YY) Ahn, featured in a Wired article, in Scientific American, and in Nature. This researcher had done some fascinating work in understanding all 57k recipes from Epicurious, Allrecipes and Menupan: their constituent ingredients and ingredient categories, their origin and - perhaps most fascinating of all - their chemical compounds.

And best of all: he made his datasets (this one and this one) available, so that I could spend some time trying to get them into neo4j and take them for a spin.

The dataset: some graph cleanup required

The dataset was there, but clearly wasn't ready for import yet. I would have to do some work. And as always, that work starts with a model. Time to use Arrows again, and start drawing. I ended up with this:

The challenge really was in the recipes. As you can see from the screenshot below, that data is/was hugely denormalised in the dataset that I found, and logically so: some recipes will only have a very limited number of ingredients, others will have lots and lots:

So what do you do - especially when, like me, you're not a programmer? Indeed, MS Excel to the rescue!

It turned out to be a bit of manual work, but in the end I found it very easy to create the sheet that I needed. In the end it was even less than 500k rows long - so Excel didn't really blink. You can find the final Excel file that I created over here.

Then it was really just a matter of exporting Excel to CSV files, and getting them ready for import into neo4j with neo4j-shell-tools. Again: easy enough - I had sort of gone through this a couple of times before. You can find the zip file with all the csv files over here, and the neo4j-shell instructions are in this gist.
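To give you an idea of what such an import-cypher call looks like - the exact commands are in the gist, and the file name and column names below are just made up for illustration - it is roughly something like this:

import-cypher -d "," -i recipe_ingredients.csv MERGE (r:Recipe {name: {recipe}}) MERGE (i:Ingredient {name: {ingredient}}) MERGE (r)-[:CONTAINS]->(i)

The nice thing is that every column in the CSV header becomes a parameter that you can use in the Cypher statement.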

As you can see from the screenshot below, the dataset imported without any issues, in a matter of minutes.


So then, the fun could begin! Interactive exploration, in the awesome neo4j browser.

Query fun on the foodnetwork

I have put all of the queries that I wrote in this gist over here - but I am sure you can come up with some more interesting ones.

Let's see if we can find out how many recipe-categories there are in the different areas of the dataset. That would mean looking for the following pattern:

The cypher query would look something like this:
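Roughly - with my own shorthand for the labels and relationship types (:Area, :Cuisine, :Recipe, :Ingredient, :Category and so on); the real names are in the model above and in the gist - the pattern translates to a query like:

MATCH (a:Area)<-[:PART_OF]-(c:Cuisine)<-[:FROM]-(r:Recipe)-[:CONTAINS]->(i:Ingredient)-[:PART_OF]->(cat:Category)
RETURN a.name AS Area, cat.name AS Category, count(DISTINCT r) AS NrOfRecipes
ORDER BY Area, NrOfRecipes DESC;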


and that would yield the following result:
Clearly North America is leading the charts here, but it's interesting to compare the different continents/areas and see what types of ingredient-categories are leading there.

Or another interesting example, zooming in on specific Cuisines: what are the most popular ingredient categories in Belgium and the Netherlands, two neighbouring countries with a lot in common? The cypher query would look something like:
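Again a sketch - the cuisine names ("Belgian", "Netherlands") and labels are my assumptions, the exact query is in the gist:

MATCH (c:Cuisine)<-[:FROM]-(r:Recipe)-[:CONTAINS]->(i:Ingredient)-[:PART_OF]->(cat:Category)
WHERE c.name IN ["Belgian", "Netherlands"]
RETURN c.name AS Cuisine, cat.name AS Category, count(i) AS NrOfTimesUsed
ORDER BY Cuisine, NrOfTimesUsed DESC;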

and the results would look like this (click for larger view): 

And then last but not least, let's look at some specific recipes based on actual ingredients that we like. For example, I am a big fan of a "salade Liégeoise", which is a lukewarm dish with bacon, green beans, potatoes and, in some cases, hard-boiled eggs. Let's see if we can find any other recipes in our database that use these ingredients - chances are that we would like them, no? So here goes. The cypher query would go like this:
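Here's a sketch of that query - the ingredient names ("bacon", "green_bean", "potato") are written the way I guess they appear in the dataset, so check the gist for the exact version:

MATCH (r:Recipe)-[:CONTAINS]->(i:Ingredient)
WHERE i.name IN ["bacon", "green_bean", "potato"]
WITH r, count(DISTINCT i) AS matchingIngredients
WHERE matchingIngredients = 3
MATCH (r)-[:CONTAINS]->(ing:Ingredient)
RETURN r.name AS Recipe, collect(ing.name) AS Ingredients;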

Note the use of the "collect" function to get all the ingredients of a recipe into one resultset column. And the result is actually quite interesting:


And also visually this gives us a pretty interesting picture:
Turns out there's quite a few similar dishes that I could choose from. Gotta do that some day :) ...

And now it's your turn

If you want to play around with this dataset yourself, there are multiple options:
  • start with the zipped import files and the import script as described above
  • download the zipped graph.db directory from over here.
  • pay a visit to our friends at Graphenedb.com, who have an extremely nice sandbox environment that you can play around with. Handle with care, of course!
If you do, you may also want to apply this .grass file so that you don't have to mess around with the default settings.

I hope you thought this was as interesting as I found it - and as always, I would love to get your feedback! In any case, I wish you and your families a Merry Christmas, and a Happy New Year!

Cheers

Rik

Friday 13 December 2013

Business Continuity Management - a perfect fit for Graphs!

At one of our recent Graph-Cafe meetup events, I had the pleasure of spending some time with a lovely gentleman from a large corporation, who works in a profession that I had never heard of: Business Continuity Management. It's always interesting to learn new things, but it became even more interesting when this fine gentleman started explaining to me that BCM is actually all about graphs. Google defines it as
"Business Continuity Management is a holistic process that identifies both potential threats and the impacts to an organization of their normal business operations should those threats be realized."
But what does that mean? When you think about it some more, you quickly realise that it's all about the relationships between different parts of a business, and about understanding and managing those relationships in such a way that the business can run as continuously as possible. Seems obvious? Well - it's not. Because how do you define "a business"? What does "continuous" mean? And what does that have to do with graphs?

Understanding your business - creating a model

This courteous gentleman - I cannot name him for obvious reasons - was having a little trouble getting started with neo4j, and so we decided to work together. I would create a lovely neo4j dataset for him, and he would help us document and present the use case. So we started with the obvious question: how do we plan for Business Continuity? By understanding our business, right! We have to get a grip on how our processes, departments, applications, physical environments, etc. interact - and how we can model this as a graph.


Luckily, my "partner in crime" knew what he was doing. He had already thought of the model, and had created a set of MS Excel files that accurately represent how business processes, processes, business lines/departments, buildings and applications interact and depend on each other. And, since we are talking about assuring the continuity of the business, he even had a quantitative measure of the importance of business processes and processes: the recovery time objective (RTO). You can see from the model how easy it is to represent these intricate relationships as a graph. So how to go about importing this data into neo4j, so that we could ask some interesting questions?
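In Cypher pattern notation - reconstructing it from the queries further down in this post, so take the exact relationship type names with a grain of salt - the model boils down to something like this:

(:BusinessProcess)-[:CONTAINS]->(:Process)
(:Process)-[:USES]->(:Application)
(:Process)-[:USED_BY]->(:BusinessLine)
(:BusinessLine)-[:LOCATED_IN]->(:Building)
(:BusinessProcess)-[:BUSINESSPROCESS_HAS_RTO]->(:RTO)
(:Process)-[:PROCESS_HAS_RTO]->(:RTO)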

Loading the data: Spreadsheets rule!

As you can probably tell from some of my previous posts, there are many ways to import data into neo4j. But since the source data in this particular case was already in spreadsheet format, I decided to use the good old spreadsheet technique: just add a column to the Excel sheets, use string concatenation to generate Cypher statements based on cell contents, and then copy/paste the resulting Cypher queries into the neo4j-shell - and we're done. Easy!
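To give you an idea of what comes out of that spreadsheet, the generated statements would look something like the ones below. The names App_001 and BL_01 are made up for illustration - Loc_100 is one of the buildings that shows up later in this post:

// generated by an Excel formula along the lines of ="MERGE (a:Application {name:'" & A2 & "'});"
MERGE (a:Application {name:'App_001'});
MERGE (b:Building {name:'Loc_100'});
MATCH (bl:BusinessLine {name:'BL_01'}), (b:Building {name:'Loc_100'})
MERGE (bl)-[:LOCATED_IN]->(b);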




Once we have the data in neo4j, the fun can actually begin - and the neo4j browser is going to be a big part of that.

A first look at the BCM data

Let’s explore the newly created dataset a bit, by running a couple of simple queries. The first one actually is a standard query saved in the neo4j browser:

Show the data model: what is related to what, and how?

MATCH (a)-[r]->(b)
RETURN DISTINCT head(labels(a)) AS This, type(r) AS To, head(labels(b)) AS That
ORDER BY This
LIMIT 100




So this means that the import basically worked well :) …

Impact analysis: the complex what-if question

The real objective of the BCM use case for graph databases, however, is not just about playing around with the data - it's about understanding impact. Impact analysis is a broad field of business and scientific understanding, and a very active use case for neo4j. Essentially, what we are talking about here are complex, densely connected data structures in which we want to understand the effects of change. What happens to the rest of the graph if one element of the graph changes? What happens if it disappears? What happens if ... What if?
This kind of dependency analysis is not new. We have seen people discuss it with regard to source code analysis, web services, telecom, railway planning, and many other domains. But applying it to a business-as-a-whole was very new to me - and fascinating for sure.
Let’s look at a couple of examples.

Which Applications are used in which buildings

What would happen to specific employees located in specific buildings if a particular application were to "die"?

MATCH (n:Application)<-[:USES]-(m:Process)-[:USED_BY]->(l:BusinessLine)-[:LOCATED_IN]->(b:Building)
RETURN DISTINCT n, b
LIMIT 10;


Obviously this is quite a broad query, with a lot of different results. But by using LIMIT we can start looking into some specifics, and use a graphical visualisation to make it all less difficult to grasp.



Or another example:

What BusinessProcesses would be affected by a fire at location Loc_100

Let’s use a “shortestpath” calculation to find this:


MATCH p = shortestPath((b:Building {name:"Loc_100"})-[*..3]-(bp:BusinessProcess))
RETURN p;


and immediately we get a very easy-to-understand answer.

And maybe one more example:


Which applications used by a Business Process with an RTO of 0-2 hrs would be affected by a fire at Loc_100


MATCH (rto:RTO {name:"0-2 hrs"})<-[:BUSINESSPROCESS_HAS_RTO]-(bp:BusinessProcess),
p1 = shortestPath((bp)-[*..3]-(b:Building {name:"Loc_100"})),
p2 = shortestPath((bp)-[*..2]-(a:Application))
RETURN p1, p2, rto;


And then for some reasoning - sort of



As with any domain, understanding the meaning of the concepts expressed in it is very important. It allows us to do some "reasoning", and potentially spot holes in our data structures - things that do not really make sense and may need corrective action.


In this particular case, I stumbled upon the simple understanding that
  • if business processes have a recovery time objective,
  • and processes have a recovery time objective,
  • and business processes are made up of (atomic) sub-processes,
  • then it should follow that the RTO of a business process can never be smaller than, or even equal to, the RTO of its constituent processes.


So let's use the following query to see if there are any cases in our organisation that violate this simple reasoning:


MATCH triangle=((bp:BusinessProcess)-[r1:BUSINESSPROCESS_HAS_RTO]->(rto:RTO)<-[r2:PROCESS_HAS_RTO]-(p:Process)<-[:CONTAINS]-(bp))
RETURN triangle LIMIT 10;


which returns the following graph:

Conceptually, this is a very valuable query, as it starts to illustrate much more closely where the risk areas are in our BCM domain. This could really be a life-saving query!

Conclusion

I never thought of it this way before, but business processes, especially in larger corporations, are very intertwined and networked. So if you want to better understand and manage these processes, and better protect yourself from potential disruptions that may affect your entire business' continuity - then look no further: graphs can help. Some of the queries that I prepared for this use case are quite complex and interesting - you should definitely check them out and see what they could mean for your business.


You can find the dataset and the relevant queries in this gist - make sure you use neo4j 2.0 to run these.


As always, I hope this is useful.


Cheers


Rik

Thursday 12 December 2013

Saint Nicolas brought me a new Batch Importer!!!

After my previous blogpost about import strategies, the inimitable Michael Hunger decided to take my pros/cons to heart and created a new version of the batch importer - which is now even updated to the very latest GA version of neo4j 2.0. Previously you actually needed to use Maven to build the importer - which I did not have/know, and therefore I never used it. But now, it's supposed to be as easy as downloading a zip file, unzipping, and running - so of course I HAD to test it out. Here's what happened.

Yet another dataset

First: I wanted to create a "large-ish" dataset (Michael actually calls it "tiny") with 1 million nodes and 1 million relationships. So what do you do? MS Excel to the rescue. I created an Excel file with two worksheets, one for nodes and one for relationships. The "nodes sheet" has nodes arranged in the following model of persons and animals that are each other's friends (thanks again, Alistair, for the Arrows):

Creating the nodes sheet was easy; for the relationships sheet I actually used a randomization function to create random relationships:

=RANDBETWEEN(nodes!A$2;nodes!A$1048576)

The Excel file that I made is over here. That way I actually got a fairly random graph structure - if I could manage to import it into neo4j. In order to do so with the batch importer, I simply had to export the file to two .csv files: one for nodes, one for relationships. And then there was one more step: I had to replace the semicolons with tabs for the batch importer to like the files (I probably could have done without this step, by editing the batch.properties file as in these instructions). Easy enough in any text editor - done in 2 seconds.

Drumroll: would it work?

So I downloaded the zip file, unzipped it, and ran

./import.sh graph.db nodes.csv rels.csv

Then I waited 20 seconds (apparently this is going to get a lot faster very soon - but I was already impressed!) and: TADAAH!

Job done!! 

All I had to do then was copy the graph.db directory (download the zipped version from over here) to my shiny new 2.0 GA instance directory, fire up the server, and all was fun and games. Run a few queries in the neo4j browser, and you see a beautiful random animal-person social network. So cool!
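As a quick sanity check - I didn't want to assume anything about the labels or relationship types that ended up in the store - a couple of generic queries are enough to verify the import and pull up a first picture in the browser:

// count the imported nodes and relationships - should be about 1 million each
MATCH (n) RETURN count(n);
MATCH ()-[r]->() RETURN count(r);
// grab a small sample of the network to visualise in the browser
MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 100;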


What did I learn?

Thanks to Michael's work, the import process for large-ish datasets is now really easy. If I can do it, you can too. But. There was a but.

Turns out that the default neo4j install on my machine (with an outdated version of Java 7, I must admit) actually ran painfully slowly after a few queries. But as soon as I changed one little setting - the size of the initial/maximum Java heap, which I set to 4096 MB on my 8GB RAM machine - it was absolutely smoking hot fast. Look for the neo4j-wrapper.conf file in the conf directory of your neo4j install.
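Concretely - if I remember the property names correctly - these are the two settings to change in conf/neo4j-wrapper.conf (4096 worked for my 8GB machine, adjust for yours):

wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096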
I guess I just never played around with larger datasets in the past - this definitely made a HUGE difference on my machine.

UPDATE: I just updated my Java Virtual Machine to the latest version, and this problem has now gone away. You don't need the above step if you are on the latest version - just leave it with the default settings and it will work like a charm!

So: THANK YOU SAINT NICOLAS for bringing me these shiny new toys - I will try to continue to be a good boy!

Hope this was useful.

Rik

Friday 6 December 2013

Untying the Graph Database Import Knot

Working for Neo Technology has many, many upsides. I love my job, love my colleagues, love our product, love our market - I think you can pretty much say that I am a happy camper. But. There's always a but. At least a couple of times a week I am confronted with things that make me go "Oh no, not that again!" And "that" is usually about one particular topic: importing data into Neo4j. Many smart people are having trouble with it - and there are many reasons for this. So let's start zooming in on this Gordian Knot - and see if we can untie it, without having to cut it ;-) ...

The Graph Database Import Knot

The first thing that everyone should understand is that, in a connected world, importing data is, by definition, more difficult to do. It is a true "knot" that is terribly difficult to untie, for many different reasons.

Just logically, the problem of importing "connected" data is technically more difficult than with "unconnected" data structures. Importing unconnected data (e.g. the nodes of your graph model) is always easier. Just dump it all in there. But then you come to importing the connections, the relationships, and you find that there is no such thing as an "external entity" (aka "the database schema") that is going to ensure the consistency and connectedness of the import. You have to do that yourself, and explicitly, by importing the relationships between a) a start node that you have to find, and b) an end node that you have to look up. It's just ... more complicated. Especially at scale, of course.
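To make that concrete: in Cypher, creating a single relationship already implies two lookups - something like this (the labels and properties are just for illustration):

// find the start node and the end node first ...
MATCH (a:Person {id: 123}), (b:Person {id: 456})
// ... and only then can the relationship between them be created
CREATE (a)-[:KNOWS]->(b);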

So how to untie this knot? I can really see two steps that everyone needs to take, in order to do so:
  1. Understand the import problem. Every import is different, just like every graph is different. There is little or no uniformity there, and in spite of the fact that many people would love a silver-bullet solution to this problem, the fact of the matter is that there is none - at least not today. So we will have to "create" a more or less complex import solution for every use case, using one of the tools at hand. But as with any problem, understanding the import problem is often the key to choosing the right solution - so that's what I will focus on here as well.
  2. Pick the right tool. There are many tools out there, and we should not be defeated by the law of the instrument - we should use the right tool for the job. Maybe this article can help by bringing these different tools together, bringing some structure to them, and then - even though I have not used all of the tools, I have used a few - I can also tell you about my experiences. That should allow us to make some kind of mapping between the different types of import problems and the different tools at hand.
So let's give it a shot.

YOUR import scenario

As I said before: one import problem is different from the next. Some people want to store the Facebook social graph in neo4j, other people just want to import a couple of thousand proteins and their interactions. It's really very different. So what are the questions that you should ask yourself? Let me try and map that out for you:


This little mindmap should give you an insight into the types of questions you should ask yourself. Some of these are project related, others are size/scale related, others are format related, and the final set of questions is related to the type of import that you are trying to do.

The Tools Inventory

If you have ever visited the neo4j website, you have probably come across the import page. There's a wealth of information there about the different types of tools available, but I would like to help by providing a bit of structure to these tools:


So these tools range from using a spreadsheet - which most of us should be able to wield as a tool - to writing a custom piece of software to solve the import problem at hand. The order in which I present these is probably very close to "from easy to difficult", and "from not so powerful to very powerful".

So let's start doing a little assessment on these tools. Note that this is by no means scientific - this is just "Rik's view of the world".

  • Spreadsheets
    Pros: Very easy: all you need to do is write some formulas that concatenate strings with cell content - and compose cypher statements that way. These cypher statements can then just be copied into the neo4j-shell.
    Cons: Only works at limited scale (< 5000 nodes/relationships at a time). Performance is not good: overhead of unparametrized cypher transactions. Quirks in copying/pasting the statements above a certain scale. Piping the statements in can work on OSX/Linux - but not on Windows.
  • Neo4j-shell / Cypher statements
    Pros: Native toolset - no need to install anything else. Neo4j-shell can be piped to on OSX/Linux - which can be very handy.
    Cons: You have to create the statements (see above). If they are not parametrized, they will be slow because of the parsing overhead.
  • neo4j-shell-tools
    Pros: Fantastic, rich functionality for importing .csv, geoff and graphml files.
    Cons: Not a part of the product (yet). Requires a separate install.
  • Command line batch importer
    Pros: High performance, easy to use (if you know Maven).
    Cons: Specific purpose, for CSV files. Currently does not have easy install procedures.
  • ETL tools: Talend
    Pros: Out of the box, versatile, customizable, uses a specific Neo4j connector - both in online and offline modes.
    Cons: Requires you to learn Talend. The current connector has not yet been upgraded to neo4j 2.0.
  • ETL tools: Mulesoft
    Pros: Out of the box, versatile, customizable, uses the JDBC connector in online mode.
    Cons: Requires you to learn Mulesoft. No batch loading of an offline database supported.
  • Custom software: Java API, REST API, Spring Data Neo4j
    Pros: High performance, perfectly customizable, supports different input types specific to your use case!
    Cons: You have to write the code!

So if this assessment is close enough, how would we map the different import scenarios sketched above to these different tools? Let's make an attempt at creating that mapping.

Mapping the scenario to the inventory

Here's my mapping:


So there is pretty much a use case for every one of these tools - it's not like you can easily discard any of them. But if you ask for my subjective assessment, here's my personal recommendation:
  • the spreadsheet way is fantastic. It just works, and it's quick to get something done in no time. I still use it regularly.
  • neo4j-shell-tools is my personal favourite in terms of versatility. Easy to use, different file format support, scales to large datasets - what's not to like?
  • for many real-world solutions which require regular updates of the database, you will need to write software. Just like you used to do with your relational database systems - nothing's changed there!

Hope this was a useful discussion - if you want you can download the entire mindmap that I used for this blogpost from over here.

All the best

Rik