Monday 4 November 2013

Clickstreams are so much nicer in Neo4j!

Clickstreams are interesting. At least that's what I think. And my browser history is nothing but that: a record of my own personal clickstreams as I make my way through the web. So occasionally I do find myself wading through my browser history: it's as if I like being reminded of my surfing behaviour in the middle of the night after spending just *one* beer too long in my favourite bar in the world. Yeah right. But seriously: browsing behaviour matters, both in the personal sense and from a business point of view (someone managing websites and wanting to optimize how visitors browse them). Browser histories make the difference between a nice experience on the web and a cr*p experience.

And if you had not noticed: these behaviour patterns are not much more than ... a graph. So imagine what we could do if we treated that data (your browser history, but in the general sense, your online clickstreams) as a graph. That's what I was curious about. Starting with my own browser history - but maybe someday one of you guys can actually expand this to the general case.

Getting to my clickstream data

Every browser has its own way of storing its clickstream (aka "browser history") data - so this part will be different for all of you. A quick browse on the interwebs taught me that there are plenty of tools out there - but
  1. since I am an avid Chrome user, both for personal use with my personal Gmail account, and professionally (Neo uses Google Apps - very happy with the overall experience there btw)
  2. and since I already have a number of Chrome add-ons installed (some of them, like Collabspot, changed my life)
I ended up looking for help on the Chrome Web Store. And I found Visual History, a nice little add-on that exposes your browsing history as ... a graph. No kidding. "Vertices" (aka Nodes) and "Arcs" (aka Edges, Relationships).


You can find Visual History yourself if you're a Chrome user: take a look at the website and find the code. The output of Visual History is a nice little graphic, but guess what: it also allows you to export the data into a nice little text file.



So then we have a text file. With all of our history. But there's more: I actually use Chrome in a setup that has multiple users, because I like the way it allows me to keep my "personal" browsing separate from my "professional" browsing. It takes a bit of getting used to, but if you are like me and you have more than one Google identity (Gmail account) then I really think it is the way to go. For the purpose of this article, that means that I actually have TWO sets of browser history data: one personal, one professional.

So I installed Visual History in both my Chrome users, let it analyse my history for the past 3 months, and exported that data into two text files. Now the real fun could start: getting that data into Neo4j.

Getting the data ready

Since Visual History actually already generates a graph out of the history data, the next bit is really easy. I had to do a little bit of text & file manipulation:
  • Splitting the files into the right parts for import:
    1. A file for Personal Sites that I had visited. These are one set of nodes for my neo4j database.
    2. A file for Personal "browse sessions", or relationships: every time you browse from one site to another within 20 minutes, Visual History adds an "edge" to the graph.
    3. A file for Professional Sites. This is the second set of nodes for the neo4j database.
    4. A file for Professional "browse sessions", or relationships for the neo4j database.
  • Making the file "comma separated". A couple of finds and replaces is all it takes.
  • Adding some "type" information. Not really necessary, but good to know which nodes and relationships are personal/professional for future manipulations.
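
The find-and-replace steps above could also be scripted. Here is a minimal sketch, not the way I actually did it. It assumes the export lists one site per line as tab-separated `id` and `name` fields (the real Visual History layout may well differ). The function name, the sample data and the `;` delimiter are illustrative choices; the delimiter and the header names mirror the `-d ;` option and the `{id}`, `{name}` and `{type}` parameters of the import command shown further down.

```python
import csv
import io


def to_import_csv(raw_text, type_label, delimiter=";"):
    """Turn tab-separated 'id<TAB>name' lines into a delimited file with a type column.

    The input layout and the column names are assumptions for illustration.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["id", "name", "type"])  # header row naming the query parameters
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        node_id, name = line.split("\t", 1)
        writer.writerow([node_id, name, type_label])
    return out.getvalue()


# Made-up sample data standing in for an exported history file.
sample = "1\tneo4j.org\n2\tgithub.com\n"
personal_csv = to_import_csv(sample, "Personal")
print(personal_csv)
```

Calling the same function with "Professional" as the label would produce the second set of node files.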

Importing the browser histories into neo4j

Ever since I learned about Michael's neo4j-shell-tools, I have been using it very often, and it just gets easier by the day.

The actual import script is on GitHub (but the files are not - I am sure you understand that I like keeping my browser history a bit private):



What the import script does:
  • it starts with importing the personal browser history into neo4j nodes
  • then we create the indices on these nodes
  • then we import the "edges" of the graph, the actual browse patterns from site to site. We make these into "personal relationships"
  • then we use a very interesting new feature of neo4j 2.0: when adding the "professional" website visits to the graph, we want to make sure that we do not duplicate website nodes, so we use the MERGE statement in our Cypher query:
//create professional nodes
import-cypher -d ; -i ./IMPORT/INPUT/ProfessionalNodes.csv -o ./IMPORT/OUTPUT/professionalnodeout.csv merge (n:url {name:{name}}) on match n set n.id2={id},n.type="Both" on create n set n.id2={id},n.type={type} return n.name as name

  • After that we can finish up by adding one remaining index, and importing the "professional" navigation relationships to the graph.
  • As a final step, we also introduced "Professional" and "Personal" labels to our graph to make future queries easier.
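
To make the effect of the MERGE in the import command above a bit more tangible, here is a small in-memory sketch in Python. It is not Neo4j code: the `nodes` dict, the `merge_url` helper and the sample sites are made up, and I am assuming the personal import stored its identifier in an `id` property. It only mimics the two branches: a site that already exists gets `id2` and type "Both", a new site is created with the type from the file.

```python
def merge_url(nodes, name, node_id, node_type):
    """Mimic the MERGE above on a dict keyed by site name.

    `nodes` stands in for the set of :url nodes; it is not a Neo4j API.
    """
    if name in nodes:
        # "on match": a node with this name is already present
        # (typically created by the earlier personal import)
        nodes[name]["id2"] = node_id
        nodes[name]["type"] = "Both"
    else:
        # "on create": a site that only shows up in the professional history
        nodes[name] = {"name": name, "id2": node_id, "type": node_type}
    return nodes[name]


# Hypothetical state after the personal import.
nodes = {"github.com": {"name": "github.com", "id": "7", "type": "Personal"}}

merge_url(nodes, "github.com", "3", "Professional")
merge_url(nodes, "localhost", "4", "Professional")
print(sorted((n["name"], n["type"]) for n in nodes.values()))
```

The point is that the shared site ends up as one node marked "Both" rather than as two separate nodes.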

Doing a couple of quality checks

After the import, I felt I needed to do a couple of quality checks on my data, and included a couple of cypher queries that allowed me to do that. Most importantly, I wanted to check that the data volumes in the graph matched the volumes I expected.
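
One such volume check can be reasoned about before running any query. Because the professional import reuses existing site nodes, the expected number of site nodes is the number of personal sites plus the number of professional sites, minus the sites that appear in both. A small sketch with made-up site names (the helper name is illustrative too):

```python
def expected_node_count(personal_sites, professional_sites):
    """Expected number of site nodes when shared sites are merged into one node."""
    personal = set(personal_sites)
    professional = set(professional_sites)
    overlap = personal & professional  # sites that end up as a single shared node
    return len(personal) + len(professional) - len(overlap)


# Hypothetical inputs: one site is shared between the two histories.
personal = ["github.com", "twitter.com", "tripadvisor.com"]
professional = ["github.com", "localhost", "neo4j.org"]
print(expected_node_count(personal, professional))  # 5
```

If the node count in the database differs from this number, something went wrong during the import or the merge.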

Once I had that done, the fun part - Cypher querying - could start.

In love with the Neo4j Browser

I now had my dataset, generously provided by Visual History, in neo4j, and I knew the data was sound. So then all I needed to do was think about my model (very simple, as you can see below), and then think about some interesting interactive queries.

To do these queries, I could use the Neo4j Browser. I found it to be a truly wonderful environment to experiment in, although there are still a couple of quirks (strange error messages that are much clearer in the console, a match clause whose pattern has to start on the same line as the "match" keyword, some slowness with larger result sets) that will no doubt be ironed out by the fantastic NeoTech team.

Here are a couple of examples of items that I thought about; the full gist is over here.

So let's look at a couple of specific examples:

Number of Personal visits to a site:


This would look something like this:

Number of professional and personal visits to a site


Would yield an interesting table like this. Quite a bit of overlap - but some notable differences in my professional/personal browsing behaviour:


Number of visits to Sites that are visited PROFESSIONALLY and NOT PERSONALLY


You can probably guess from the result that I have been playing a lot with my own personal neo4j server on my localhost:


Number of visits to Sites that are visited PERSONALLY and NOT PROFESSIONALLY

This one:

Yields an interesting collection of travel, gadget, social network and other sites:
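
For readers who want to reason about these breakdowns without a database at hand, the logic can be approximated in a few lines of Python. This is only a sketch under assumptions: I treat every edge pointing at a site as one "visit", tag each edge as "Personal" or "Professional", and use made-up sample data. The real Cypher queries are the ones in the gist.

```python
from collections import Counter


def visit_counts(edges):
    """Count incoming edges per site, split by edge kind ('Personal' or 'Professional')."""
    counts = {"Personal": Counter(), "Professional": Counter()}
    for _source, target, kind in edges:
        counts[kind][target] += 1
    return counts


def only_in(counts, kind, other):
    """Sites visited under `kind` but never under `other`, with their visit counts."""
    return {site: n for site, n in counts[kind].items() if site not in counts[other]}


# Hypothetical (source, target, kind) browse edges.
edges = [
    ("google.com", "localhost", "Professional"),
    ("neo4j.org", "localhost", "Professional"),
    ("google.com", "github.com", "Professional"),
    ("google.com", "github.com", "Personal"),
    ("google.com", "tripadvisor.com", "Personal"),
]
counts = visit_counts(edges)
print(only_in(counts, "Professional", "Personal"))
print(only_in(counts, "Personal", "Professional"))
```

In this toy data, `github.com` drops out of both results because it is visited in both contexts.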


All in all I thought this was a very interesting exercise. It was a lot of fun to do, but it also confirmed my intuitive view that clickstream behaviour is in fact a very graphy kind of dataset. I hope we will be seeing a lot more web-analytics firms starting to use neo4j in the near future. I am sure it will be very interesting!

Hope this was useful.

Rik
