Friday 11 January 2013

Fun with Beer - and Graphs

I make no excuses: my name is Rik van Bruggen and I am a salesperson. I think it is one of the finest and nicest professions in the world, and I love what I do. I love it specifically because I get to sell great, awesome, fantastic products - and I get to work with fantastic people along the way. The point is that I am not a technical person - at all. But I do have a passion for technology, and I feel the urge to understand and taste the products that I sell. And that’s exactly what happened a couple of months ago when I joined Neo Technology, the makers and maintainers of the popular Neo4j open source graph database.

So I decided to get my hands dirty and dive in head first - and to have some fun along the way.


The fun part would come from something that I thoroughly enjoy: Belgian beer. Some of you may know that Stella Artois, Hoegaarden, Leffe and the like come from Belgium, but few of you know that this tiny little country in the lowlands around Brussels actually produces several thousand beers.


You can read about it on the Wikipedia page: Belgian beers are good, and numerous. So how would I go about putting Belgian beers into Neo4j? Interesting challenge.

Part 1: Getting Beer into Neo4j

First, I started with the data source. The Wikipedia page actually has a full listing of all Belgian beers, and it also states the brewery, beer type, and alcohol percentage - perfect! I can see a graph already emerging. But how to get it into Neo, without doing any programming? Well, turns out that it was not that difficult...

Read on, or watch the video:
 

The next step was to “clean” the Wikipedia data and structure it. I used a Google spreadsheet for this - I was actually amazed by its power and ease of use.


Then, after some more manipulation and spreadsheet wizardry, I managed to come up with two very simple files: one for the nodes (the BeerBrands, the AlcoholPercentages, the BeerTypes and the Breweries) and one for the relationships (a BeerBrand “has a” AlcoholPercentage and “isa” specific BeerType, and a Brewery “brews” a specific BeerBrand).
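
For anyone who would rather script this step than do it in a spreadsheet, the split into a nodes file and a relationships file could look roughly like the Python sketch below. It is not what was used for this post: the sample rows, the column headers and the function names are all made up for illustration, and the real files may be laid out differently.

```python
import csv
import io


def build_graph_tables(rows):
    """Split flat (brand, type, percentage, brewery) rows into nodes and relationships."""
    nodes = {}   # node id -> kind; a dict deduplicates and keeps first-seen order
    edges = []   # (source, target, relationship label)
    for brand, beertype, percentage, brewery in rows:
        # Assumption: names are unique across kinds (a beer and a brewery
        # sharing one name would collapse into a single node here).
        nodes.setdefault(brand, "BeerBrand")
        nodes.setdefault(beertype, "BeerType")
        nodes.setdefault(percentage, "AlcoholPercentage")
        nodes.setdefault(brewery, "Brewery")
        edges.append((brand, percentage, "has a"))
        edges.append((brand, beertype, "isa"))
        edges.append((brewery, brand, "brews"))
    return nodes, edges


def to_csv(header, records):
    """Render a header plus records as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical placeholder rows, not real beer data.
SAMPLE_ROWS = [
    ("Beer A", "Type X", "6.0", "Brewery P"),
    ("Beer B", "Type X", "8.5", "Brewery Q"),
]

nodes, edges = build_graph_tables(SAMPLE_ROWS)
nodes_csv = to_csv(["Id", "Kind"], nodes.items())
edges_csv = to_csv(["Source", "Target", "Label"], edges)
print(nodes_csv)
print(edges_csv)
```

The point of the two-file layout is that shared values such as a beer type appear once as a node and are then referenced by many relationships, which is what turns a flat table into a graph.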




So then I had to get these CSV files into a graph, and from there into the Neo4j graph database. I tried a number of things, but ended up going for a very simple tool called Gephi. The tool has a visualisation component and an analysis/processing component, but I was most interested in the Data Laboratory. This allowed me to import the CSV files above with two clicks, and create a wonderful visualisation immediately.


(Editor's note: you can also use this Neo4j-Batch-Importer to import CSV files directly into the graph, including indexing; see the ETL article by Max de Marzi.)

So now I had my Gephi project, but how to get it into Neo4j? Well, turns out there is a Gephi Neo4j plugin available that does exactly that. Just install the plugin, export the Gephi project, and it will generate the Neo4j store files that you can copy over your graph.db directory.

And now: my Neo4j database was up and running. And remember: NO PROGRAMMING INVOLVED. Love that.


To be honest: there was one tiny little hiccup at this point. To use the graph in a meaningful way, you really need to have indexes. Neo4j ships with Lucene for indexing of nodes, relationships and properties, and there is an auto-indexing capability in the product - but that only kicks in AFTER you start adding data to the database. So the initial import into the database is not indexed. Crap. Luckily, I have some very bright colleagues at Neo Technology, who have written some nifty utilities that do something about this. And that’s what I did: used the utility, repopulated the autoindex, and off we were. A bit hairy, but NO PROGRAMMING :)

Part 2: Getting Beer out of Neo4j

In the second part, I explore the beer graph visually via the Neo4j Web-interface and some nifty Cypher queries. So, enjoy the video and see the details below:



For example: let’s try to find all Belgian Trappist beers, based on one Trappist beer that I know and love, Orval. To do that, we need to write a query, using the Cypher query language. Here’s how this works:

Getting my starting point in the graph, through an index lookup

START orval=node:node_auto_index(name="Orval")

Then trying to find a pattern, with a MATCH clause

MATCH
orval<-[:Brews]-brewery,

I want another beer, with the same beertype as Orval

orval-[:isa]->beertype,

anotherbeer-[:isa]->beertype

I want to return the other beers

RETURN
anotherbeer.name AS name,
COLLECT(beertype.name) AS beertype
ORDER BY anotherbeer.name;


This would give me a very straightforward result set, very similar to what you would expect in traditional database systems:



Another type of query would be to try and find *paths* between two beers:

   START
       duvel=node:node_auto_index(name="Duvel"),
       orval=node:node_auto_index(name="Orval")
   MATCH p = allShortestPaths( duvel-[*]-orval )
   RETURN p;

In the example above, I am trying to see what connects two of my favorite beers: Orval and Duvel. I found that there are two beers that share either the AlcoholPercentage or the Beertype - very interesting! That is a great recommendation to receive and will require some tasting to be done!
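
To make the idea behind "all shortest paths" a bit more concrete, here is a minimal Python sketch of a breadth-first search that collects every shortest route between two nodes. It only illustrates the concept and says nothing about how Neo4j implements it; the toy graph uses placeholder names rather than real beers, and the function name is made up.

```python
from collections import defaultdict, deque


def all_shortest_paths(graph, start, goal):
    """Return every shortest path from start to goal in an undirected graph."""
    if start == goal:
        return [[start]]
    distance = {start: 0}
    parents = {start: []}   # node -> predecessors that lie on a shortest path
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break           # all predecessors of the goal are known by now
        for neighbour in sorted(graph[node]):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                parents[neighbour] = [node]
                queue.append(neighbour)
            elif distance[neighbour] == distance[node] + 1:
                parents[neighbour].append(node)   # an equally short route
    if goal not in parents:
        return []

    def walk(node):
        # Rebuild the paths by following predecessors back to the start.
        if node == start:
            return [[start]]
        return [path + [node] for parent in parents[node] for path in walk(parent)]

    return walk(goal)


# Hypothetical toy graph: beers linked through shared percentages and types.
EDGES = [
    ("Beer A", "8.5"), ("Beer C", "8.5"), ("Beer C", "Type X"), ("Beer B", "Type X"),
    ("Beer A", "Type Y"), ("Beer D", "Type Y"), ("Beer D", "6.0"), ("Beer B", "6.0"),
]
graph = defaultdict(set)
for left, right in EDGES:
    graph[left].add(right)   # relationships are followed in both
    graph[right].add(left)   # directions, as in duvel-[*]-orval

paths = all_shortest_paths(graph, "Beer A", "Beer B")
for path in paths:
    print(" - ".join(path))
```

In this toy graph the two beers are connected by two equally short routes, each passing through one other beer, which is the same kind of result as the Duvel/Orval query above.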



Last but not least, I also experimented a bit with updating the graph, using Cypher statements like these:

begin
START orval=node:node_auto_index(name="Orval")
CREATE (rik{name:"Rik"})-[:loves]->orval;
START duvel=node:node_auto_index(name="Duvel"), rik=node:node_auto_index(name="Rik")
CREATE rik-[:loves]->duvel;
commit


I was able to include a real, ACID-compliant transaction on the graph - adding a “Rik” node to the graph and adding two “loves” relationships, one for Duvel and one for Orval. It was really interesting to see how Neo4j does the commit/rollback, and how it isolates these updates from the rest of the users. You can test that really easily by having two “clients” talk to the database, executing the transaction in one client and querying the graph from the other.

All in all, this was a fantastic and very educational experience for me. I feel like Neo4j is a great tool for many database problems - beer-related or other - and it was really fun to learn how to piece things together and get to a functioning database - WITHOUT PROGRAMMING.

I hope you found this helpful. If you want to try it yourself, here is the zipped database directory and here the initial CSV files and the Cypher queries.

Enjoy your beer

Rik van Bruggen
