Bruggen Blog: November 2013

Friday, 22 November 2013

Meet this "Tubular" graph!

Many of us know London. Those of us who have visited London will know "the Tube", "the Underground" - simply the fastest and most efficient way to get around (although I must admit that Hailo has been quite a contender lately...). Beautiful city, lovely place to work, and since I started working for Neo, it feels a bit like my home away from home.

The Tube: A Great Graph

As you can easily imagine, or just plainly see from looking at any of the maps of the tube, the Underground really is a very sophisticated system, and it is best described as a graph. We always refer to it - in our Neo4j presentations - as the perfect example of how one-page-graphs can easily represent and provide *insight* into complex systems ... without having to have a PhD in maths. Literally: almost everyone can use the tube - almost everyone can use a graph.

Finding a nice "tubular" dataset

Since we talk about this example all the time, and since I am indeed an avid, non-native tube-user, I thought it would be interesting to look at how I could fit the Tube system into a neo4j database. It took me a while, but of course the data is out there: this page links to this spreadsheet that has a very nice starting point. It contains the Line, the Direction, the Stations, the Distance between stations, and then 3 different time measurements between the stations.

Importing this into a neo4j database is really, really easy.

Creating a neo4j Tube database

First things first: we would probably be best off transforming the above spreadsheet into a .csv file. Easy peasy in Excel: the result is over here. Once we have that, we can use the ever so awesome neo4j-shell-tools (the 2.0 version is over here, in case you can't find it!) to import the data into a nice little graph model:

Kudos to Alistair Jones for making Arrows - it's actually very usable these days :)) ...

In other words: Stations have to be unique, are connected by one or more "Lines" in two directions, and the "Lines" have a "Direction" property (east, west, north, south...), a "Time" between stations property (which can be different in opposite directions!), and a "Distance" between stations property.
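To make that model a bit more tangible, here is a minimal Python sketch (not the actual import script) of how rows of such a .csv could be turned into unique stations and directed line relationships. The column names and the two sample rows, including their distance and time values, are placeholders made up for illustration; the headers and figures in the real spreadsheet may well differ.

```python
import csv
import io

# Hypothetical sample rows: column names and values are placeholders,
# not taken from the real spreadsheet.
SAMPLE_CSV = """Line,Direction,StationFrom,StationTo,Distance,Time
Circle,East,Station A,Station B,1.0,2.0
Circle,West,Station B,Station A,1.0,2.5
"""


def load_tube_graph(csv_text):
    """Return (stations, relationships) parsed from CSV text."""
    stations = set()      # a set keeps every station unique
    relationships = []    # one directed relationship per row
    for row in csv.DictReader(io.StringIO(csv_text)):
        stations.add(row["StationFrom"])
        stations.add(row["StationTo"])
        relationships.append({
            "type": row["Line"],  # the line acts as the relationship type
            "from": row["StationFrom"],
            "to": row["StationTo"],
            "direction": row["Direction"],
            "distance": float(row["Distance"]),
            "time": float(row["Time"]),  # may differ per direction
        })
    return stations, relationships


stations, relationships = load_tube_graph(SAMPLE_CSV)
print(sorted(stations))
print(len(relationships))
```

The set plays the role that the uniqueness constraint plays in the database: a station that shows up in many rows still ends up as a single node.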

The import script for the .csv file is quite simple, as it completely leverages the new neo4j 2.0RC1 way of working:

it uses a schema constraint to ensure that the stations are unique
it uses the new Match-syntax (with property-matching in the pattern instead of in a where clause)

All in all it is very simple and effective. The resulting graph.db directory is over here.

Exploring the tube in the neo4j browser

Ever since its introduction at GraphConnect San Francisco, the neo4j browser has become my favourite place to play around with neo4j and cypher. One of its coolest features is the ability to apply stylesheets to your graph visualisations. So I wanted to apply this to my new tube-graph, and use the "official" tube-line colours in the browser.

Very quickly, I found the colours conveniently online:

LINE	TRUE HEXADECIMAL	WEB SAFE HEXADECIMAL
Bakerloo	#B36305	#996633
Central	#E32017	#CC3333
Circle	#FFD300	#FFCC00
District	#00782A	#006633
Hammersmith and City	#F3A9BB	#CC9999
Jubilee	#A0A5A9	#868F98
Metropolitan	#9B0056	#660066
Northern	#000000	#000000
Piccadilly	#003688	#000099
Victoria	#0098D4	#0099CC
Waterloo and City	#95CDBA	#66CCCC

So then all I had to do was to download the .grass file from the browser, and start editing the "relationship" sections. In the example below, the .Circle and .Central are the names of the relationship types "Circle" and "Central". Logical.

You can download the full .grass file that I created from over here.
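If you would rather generate the relationship rules than type them by hand, a small Python sketch along these lines could do it. The rule layout used here (a `relationship.<Type>` selector with a `color` property) is an assumption about the .grass format, so compare the output with the file you downloaded from the browser before pasting it in. Only two lines from the colour table are shown.

```python
# Official line colours, copied from the table above.
LINE_COLOURS = {
    "Central": "#E32017",
    "Circle": "#FFD300",
}


def grass_rule(rel_type, colour):
    """Build one stylesheet rule for a relationship type.

    The rule layout is an assumption, not a documented format:
    check it against an exported .grass file.
    """
    return "relationship.%s {\n  color: %s;\n}\n" % (rel_type, colour)


stylesheet = "\n".join(
    grass_rule(name, colour) for name, colour in sorted(LINE_COLOURS.items())
)
print(stylesheet)
```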

Nice: if I start exploring the surrounding tube network for "London Bridge" station, I quickly get a feel for the network:

But of course the real fun begins with the queries.

Exploring the Tube with Cypher

Obviously I don't have the technical skills - at all - to develop anything like a route planner for the London Underground. But: using the dataset that we just created, it's quite easy to see how it would be very doable to create something like that. Let's look at some of the queries that I created:

Show the different underground lines:

Show the most densely connected underground station

With "densely connected" meaning the most different underground lines passing through it.
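Outside the database, that definition boils down to counting the distinct lines that touch each station. A minimal Python sketch of the idea follows; the station names and relationships are hypothetical placeholders, not rows from the real dataset.

```python
from collections import defaultdict

# Hypothetical (station, line, station) triples, for illustration only.
RELATIONSHIPS = [
    ("Station A", "Circle", "Station B"),
    ("Station B", "Circle", "Station A"),
    ("Station B", "Central", "Station C"),
    ("Station C", "Central", "Station B"),
]


def lines_per_station(relationships):
    """Map each station to the set of distinct lines touching it."""
    lines = defaultdict(set)
    for start, line, end in relationships:
        lines[start].add(line)
        lines[end].add(line)
    return lines


def most_connected(relationships):
    """Return (station, number_of_lines) for the best-connected station."""
    lines = lines_per_station(relationships)
    station = max(lines, key=lambda name: len(lines[name]))
    return station, len(lines[station])


print(most_connected(RELATIONSHIPS))
```

Using a set per station means a line is counted once, no matter how many relationships of that line (one per direction, for instance) touch the station.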

And then you can drill into this really easily and explore some more:

And finally: pathfinding

Of course we can do some rudimentary pathfinding in Cypher. But it's rudimentary - and just included for fun. Let's say that I would want to go from Tower Hill to Southwark (one of the most tedious tube connections that I would take sometimes to get to our London office).

Anyone a bit familiar with London knows that this is "b*ll*cks", and no one would ever do that. The right thing (I think) to do is to take the District/Circle lines from Tower Hill to Blackfriars - and then just walk across the bridge to the office. Easy.

I have included some other pathfinding queries in the gist - but I am pretty sure that they would need work :) ...
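One direction that work could take is to weight routes by the "Time" property instead of counting stops. The Python sketch below shows the idea with Dijkstra's algorithm on a hypothetical four-station toy network; the station names and minutes are invented for illustration and are not real Tube data, and this is not one of the queries from the gist.

```python
import heapq

# Hypothetical toy network: (from, to, minutes). Not real Tube data.
EDGES = [
    ("A", "B", 2.0), ("B", "A", 2.0),
    ("B", "C", 2.0), ("C", "B", 2.0),
    ("C", "D", 2.0), ("D", "C", 2.0),
    ("A", "D", 9.0), ("D", "A", 9.0),
]


def fastest_route(edges, start, goal):
    """Dijkstra's algorithm: return (total_time, path) or None."""
    graph = {}
    for src, dst, minutes in edges:
        graph.setdefault(src, []).append((dst, minutes))
    queue = [(0.0, start, [start])]  # (elapsed time, station, path so far)
    visited = set()
    while queue:
        elapsed, station, path = heapq.heappop(queue)
        if station == goal:
            return elapsed, path
        if station in visited:
            continue
        visited.add(station)
        for neighbour, minutes in graph.get(station, []):
            if neighbour not in visited:
                heapq.heappush(
                    queue, (elapsed + minutes, neighbour, path + [neighbour])
                )
    return None


print(fastest_route(EDGES, "A", "D"))
```

In this toy network the direct hop from A to D takes 9 minutes, while the three-hop route via B and C takes 6, so a time-weighted search and a fewest-stops search give different answers.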

That's about it for now. I think I have demonstrated how easy it is to make the Great Tube Graph even greater by putting it into a graph database like neo4j - and how you could easily use something like the neo4j browser to find your way around one of the world's most complicated networks.

Hope you enjoy!

Rik

Tuesday, 12 November 2013

Presenting: Neo4j!

Last week, I had the great pleasure of spending some time working on a BI-centric presentation of neo4j. Tereza subtly re-introduced me to Prezi, a presentation format that I had already used a couple of years ago when I was still working for Imprivata. I found that a great tool had only gotten better, and that it was actually quite a lot of fun to store and present neo4j this way. After all - when you think of presentations in general, and prezi more specifically - it's actually quite easy to represent any presentation/process as a graph.

So I had this weird idea: what if I would create a prezi-presentation about neo4j, and a simple but complete little neo4j database that would essentially contain the same information as the prezi - so that people could explore neo4j - in neo4j. Neo4j presenting itself in a neo4j database. I know - it's a bit of a joke. But here goes anyway.

Creating the prezi

I spent a bit of time acquainting myself with the new prezi interface, and created this prezi:

I actually quite like the result. It's a nice overview presentation - and I must say I am particularly proud of the "elevator pitches" that I created.

The Elevator Pitches

An elevator pitch is supposed to explain a concept to another person, while you're in the elevator together. Depending on the size of your typical high-rises, that means that you would have between 10 and 20 seconds. Short. So how do you explain neo4j in that time? Well, I think the trick is to know - or at least make some assumptions about - who you are talking to, and tune the story to the audience.

Last week, in just a few days' time, I had multiple "graph database virgins" (people who had no idea what it was, and who sometimes also did not have a lot of technical baggage) ask me "what neo4j was". And even today, I still find it challenging. Here's what I came up with to explain neo4j to "mom & pop":

There's a couple of other "pitches" in the prezi, covering other audiences like developers, architects, project managers, CIOs and business managers. I am sure they are not perfect - I would love some feedback on these if you feel like it - but hey, Elevator Pitches are not meant to be perfect. They are meant to cause interest, so that the conversation can continue.

But then I wanted to have some fun and put the prezi into neo4j.

Creating the neo4j database presenting... neo4j

I know this is silly - but when you think about it, it really isn't that stupid. Prezi assumes that there is a "path" to present the prezi. Meaning: I, the presenter, will determine how I take you through the information presented. And of course, that process is a bit arbitrary for the attentive listener: every listener has his/her own personal background, knows more or less about technology, and of course, about graph databases in general and neo4j specifically. So it actually could make sense for someone to want to present the information in the prezi in a "freer" format that could be explored randomly by the audience. And that format would be: a neo4j database :) ...

I ended up creating the database with the spreadsheet method that I used before: take a look at the sheet over here. Run the cypher queries to create the nodes in a neo4j 2.0 instance, create the index, and then connect them up with some cypher queries to create the relationships. Or: just download the graph.db directory from over here, and copy it onto your neo4j server. Fire up the awesome neo4j browser, and you will soon be looking at something like this:

It's essentially the same thing as the prezi - just nicer :) ... Neo4j explaining itself to you - with neo4j! How cool is that?

I have also created a little graphgist that you can take a look at. Download the gist from over here.

That's about it. Hope you like it - as always feel free to comment or ping me if you want.

Cheers

Rik

Wednesday, 6 November 2013

Graph Databases & BI - a happy accident waiting to happen!

That's probably how I would sum up our talk at the Enterprise Data & Business Intelligence conference today in London. Big thanks to Tereza Gregorova for her wonderful work - and to Rick van der Lans for giving us the opportunity to talk.

Monday, 4 November 2013

Clickstreams are so much nicer in Neo4j!

Clickstreams are interesting. At least that's what I think. And my browser history is nothing but that - a record of my own personal clickstreams as I make my way through the web. So occasionally I do find myself wading through my browser history: it's as if I like being reminded of my surfing behaviour in the middle of the night after spending just *one* beer too long in my favourite bar in the world. Yeah right. But seriously: browsing behaviour really matters, both in the personal sense and from a business point of view (someone managing websites wanting to optimize the browsing behaviour). Browser histories make the difference between a nice experience on the web - and a cr*p experience.

And if you had not noticed: these behaviour patterns are not much more than ... a graph. So imagine what we could do if we would treat that data - your browser history, but in the general sense, your online clickstreams - as a graph. That's what I was curious about. Starting with my own browser history - but maybe someday one of you guys can actually expand this to the general case.

Getting to my clickstream data

Every browser has its own way of storing its clickstream (aka "browser history") data - so this part will be different for all of you. A quick browse on the interwebs taught me that there are plenty of tools out there - but

since I am an avid Chrome user, both for personal use with my personal gmail account, and professionally (Neo uses Google Apps - very happy with the overall experience there btw)
I already have a number of chrome add-ons installed (some of them, like Collabspot, changed my life)

I ended up looking for help on the Chrome Web Store. And I found Visual History, a nice little add-on that exposes your browsing history as ... a graph. No kidding. "Vertices" (aka Nodes) and "Arcs" (aka Edges, Relationships).

You can find Visual History yourself if you're a chrome user: take a look at the website and find the code. The output of Visual History is a nice little graphic, but guess what: they also allow you to export the data into a nice little text file.

So then we have a text file. With all of our history. But there's more: I actually use Chrome in a setup that has multiple users: I like the way it allows me to keep my "personal" browsing separate from my "professional" browsing. It takes a bit of getting used to, but if you are like me and you have more than one Google identity (Gmail account) then really I think it is the way to go. But for the purpose of this article that means that I actually have TWO sets of browser history data: one personal, one professional.

So I installed Visual History in both my Chrome users, let it analyse my history for the past 3 months, and exported that data into two text files. Now the real fun could start: getting that data into Neo4j.

Getting the data ready

Since Visual History actually already generates a graph out of the history data, the next bit is really easy. I had to do a little bit of text & file manipulation:

Splitting the files into the right parts for import:

A file for Personal Sites that I had visited. These are one set of nodes for my neo4j database.
A file for Personal "browse sessions", or relationships: every time you browse from one site to another within 20 minutes, Visual History adds an "edge" to our graph.
A file for Professional Sites. This is the second set of nodes for the neo4j database.
A file for Professional "browse sessions", or relationships for the neo4j database.

making the file "comma separated". A couple of finds and replaces is all it takes.
adding some "type" information. Not really necessary, but good to know which nodes and relationships are personal/professional for future manipulations.
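To illustrate the 20-minute rule behind those "browse sessions", here is a minimal Python sketch that turns a time-ordered list of visits into edges. How Visual History actually computes its edges is not something I have verified, so treat the logic (linking consecutive visits that are at most 20 minutes apart) as an assumption; the site names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=20)  # the window mentioned in the text

# Hypothetical visits (timestamp, site), already sorted by time.
VISITS = [
    (datetime(2013, 11, 1, 9, 0), "site-a.example"),
    (datetime(2013, 11, 1, 9, 5), "site-b.example"),
    (datetime(2013, 11, 1, 10, 0), "site-c.example"),
    (datetime(2013, 11, 1, 10, 10), "site-a.example"),
]


def browse_edges(visits, gap=SESSION_GAP):
    """Link consecutive visits that are at most `gap` apart."""
    edges = []
    for (t1, site1), (t2, site2) in zip(visits, visits[1:]):
        # Skip reloads of the same site and jumps across long pauses.
        if site1 != site2 and t2 - t1 <= gap:
            edges.append((site1, site2))
    return edges


print(browse_edges(VISITS))
```

The 55-minute pause between the second and third visit breaks the chain, so only two edges come out of these four visits.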

Importing the browser histories into neo4j

Ever since I learned about Michael's neo4j-shell-tools, I have been using it very often, and it just gets easier by the day.

The actual import script is on github (but the files are not - I am sure you understand that I like keeping my browser history a bit private):

What the import script does:

it starts with importing the personal browser history into neo4j nodes
then we create the indices on these nodes
then we import the "edges" of the graph, the actual browse patterns from site to site. We make these into "personal relationships"
then we use a very interesting new functionality of neo4j 2.0: when adding the "professional" website visits to the graph, we wanted to make sure that we would not duplicate website nodes, and are using the MERGE statement in our Cypher statement to do so.

//create professional nodes

import-cypher -d ; -i ./IMPORT/INPUT/ProfessionalNodes.csv -o ./IMPORT/OUTPUT/professionalnodeout.csv merge (n:url {name:{name}}) on match n set n.id2={id},n.type="Both" on create n set n.id2={id},n.type={type} return n.name as name

After that we can finish up by adding one remaining index, and importing the "professional" navigation relationships to the graph.
As a final step, we also introduced "Professional" and "Personal" labels to our graph to make future queries easier.
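For readers new to MERGE, the behaviour of that statement can be mimicked in a few lines of Python: look the site up by name, update it if it is already there, create it otherwise. This is only a sketch of the semantics, not how neo4j implements it; the `merge_url` helper, the site names and the ids are hypothetical.

```python
def merge_url(nodes, name, node_id, node_type):
    """Mimic the merge step: one node per site name.

    `nodes` maps a site name to its property dict.
    """
    if name in nodes:
        # "on match": the site was already imported, so mark it as shared
        nodes[name]["id2"] = node_id
        nodes[name]["type"] = "Both"
    else:
        # "on create": the site is new, so keep the supplied type
        nodes[name] = {"name": name, "id2": node_id, "type": node_type}
    return nodes[name]


# Hypothetical data: one site already present from the personal import.
nodes = {
    "site-a.example": {"name": "site-a.example", "id": 1, "type": "Personal"},
}
merge_url(nodes, "site-a.example", 10, "Professional")
merge_url(nodes, "site-b.example", 11, "Professional")
print(nodes["site-a.example"]["type"], nodes["site-b.example"]["type"])
```

A site seen in both histories ends up as a single node of type "Both", which is exactly what keeps the website nodes from being duplicated.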

Doing a couple of quality checks

After the import, I felt I needed to do a couple of quality checks on my data, and included a couple of cypher queries that allowed me to do that. I most importantly wanted to check if the data volumes that I had in mind added up.

Once I had that done, the fun part - Cypher querying - could start.

In love with the Neo4j Browser

I now had my dataset, generously provided by Visual History, in neo4j, and I knew the data was sound. So then all I needed to do was think about my model (very simple, as you can see below), and then think about some interesting interactive queries.

To do these queries, I could use the Browser. I found it to be a truly wonderful environment to experiment in, although there are still a couple of quirks (strange error messages that are much clearer in the console, a match-clause whose pattern has to start on the same line as "match", some slowness with larger resultsets) that will no doubt be ironed out by the fantastic NeoTech team.

Here are a couple of examples of items that I thought about, the full gist is over here.

So let's look at a couple of specific examples:

Number of Personal visits to a site:

This would look something like this:

Number of professional and personal visits to a site

Would yield an interesting table like this. Quite a bit of overlap - but some notable differences in my professional/personal browsing behaviour:

Number of visits to Sites that are visited PROFESSIONALLY and NOT PERSONALLY

I am guessing that you can guess from the result that I have been playing a lot with my own personal neo4j server on my localhost:

Number of visits to Sites that are visited PERSONALLY and NOT PROFESSIONALLY

This one:

Yields an interesting collection of travel, gadget, social network and other sites:

All in all I thought this was a very interesting exercise. It was a lot of fun to do, but it also confirmed my intuitive view that clickstream behaviour is in fact a very graphy kind of dataset. I hope we will be seeing a lot more web-analytics firms starting to use neo4j in the near future - I am sure it will be very interesting!

Hope this was useful.

Rik
