Bruggen Blog: graphs


Sunday, 23 March 2014

Media, Politics and Graphs

My dear friend and neo4j community member Ron recently pointed me to an amazing piece of work. Thomas Boeschoten, of the Utrecht Data School among many other things, published an analysis of the Dutch talk shows from different perspectives, using Gephi as one of his tools. Some of his results are nothing short of fascinating, and very cool to look at.

Now I think the world needs more people like Thomas: media have long abandoned political neutrality, and as citizens, we owe it to ourselves to understand the politics of the things that we see, read and hear in the media. Without it, democracy will be short lived and (to paraphrase de Maistre) we "will get the government we deserve". We need to understand this interplay between media and politics - and graphs can help there.

I will not try to help you understand the depths of Thomas' research (just visit his site: lots of cool stuff there, but mostly in Dutch). I would just like to take this dataset, which he kindly shared, for a spin using neo4j.

Importing the dataset

As Thomas visibly already is a graphista through and through, he shared his datasets with me as a Gephi file. So that made it really easy to get the initial stuff into neo4j: all I needed to do was use the recently updated neo4j-Gephi plugin and generate the data store files. Copy those over to my neo4j server's data directory, call it graph.db and boom - we're done!

However, when I fired up the server, I soon found out that I would have to do some work :) ... the graph that Thomas created did not really have a "database-like" model (it did not do any normalisation of the model, for instance) - and the neo4j browser looked a bit boring:

I needed to add some structure to this all, in order to be able to query it meaningfully.

Adding a model

After browsing around through the data, I decided that the model that I would be playing with would look something like this:
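Reading the labels and relationship types off the queries further down in this post, the model boils down to guests that have a gender, visit shows and are sometimes affiliated with a party. A minimal sketch of a query that walks that whole pattern could look like this. It assumes nothing beyond the names used in those queries, and the limit of 25 is an arbitrary choice:

```cypher
// GUEST, SHOW, GENDER, PARTY and the three relationship types come from the
// queries in this post; sh.id and gen.name are the properties they return.
match (gen:GENDER)<-[:HAS_GENDER]-(g:GUEST)-[:VISITED]->(sh:SHOW)
// not every guest has a political affiliation, hence the optional match
optional match (g)-[:AFFILIATED_WITH]->(p:PARTY)
return g, gen.name as Gender, sh.id as Show, p
limit 25;
```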

You can see that it is not a very big graph:

but it is quite densely connected - it has a lot of relationships between the nodes:

So now I can do some more interesting queries on the data, and see if - like in Thomas' research - I can find out some interesting stuff about this dataset.

Take it for a spin: CYPHER queries!

Let's start with some simple queries. Let's figure out how many people have visited the different shows:

match (g:GUEST)-[v:VISITED]->(sh:SHOW)
return sh.id as Show, count(v) as NrOfVisits
order by NrOfVisits desc;

And we immediately get a feel for the dominant talkshows:

But then let's see how many of these talkshow guests are politicians (or have political affiliations at least). Let's expand the query a bit:

match (g:GUEST)-[v:VISITED]->(sh:SHOW),
(g)-[:AFFILIATED_WITH]->(p:PARTY)
return sh.id as Show, count(v) as NrOfVisits
order by NrOfVisits desc;

And see if there is any difference in the way the shows are ranked:

Interesting. There are indeed some differences, as you can see.

Now let's look at another perspective in our dataset: Gender. Let's look at the distribution of male/female guests to all of these shows:

match (g:GUEST)-[:HAS_GENDER]->(gen:GENDER),
(g)-[v:VISITED]->(sh:SHOW)
return gen.name, count(v)
order by gen.name ASC;

We can clearly see that men still dominate these shows:

If we then add the political dimension again, and look at gender distribution for the political visitors to the shows:

match (g:GUEST)-[:HAS_GENDER]->(gen:GENDER),
(g)-[v:VISITED]->(sh:SHOW),
(g)-[:AFFILIATED_WITH]->(p:PARTY)
return gen.name, count(v)
order by gen.name ASC;

then we can see that the distribution is broadly the same:

I am sure there are plenty of other queries to think of, but let me do one more in this post: let's see what the overlap is - in terms of guests visiting them - between the different shows. To do that, all we need to do is calculate some paths between two shows: DWDD and P&W.

match p = AllShortestPaths((s1:SHOW {id:"DWDD"})-[*..2]-(s2:SHOW {id:"P&W"}))
return p
limit 100;

The result is exactly what you would expect: a HUGE amount of overlap - at least between these two (see above: largest) shows. Hence the "limit 100" in the query - so that my poor neo4j browser would survive:
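If a number is more useful than a picture, the same overlap can also be counted rather than drawn. The following is only a sketch, and it assumes, as the earlier queries do, that every VISITED relationship stands for one visit by a guest to a show:

```cypher
// guests who visited both shows; distinct keeps a guest with several
// visits to either show from being counted more than once
match (s1:SHOW {id:"DWDD"})<-[:VISITED]-(g:GUEST)-[:VISITED]->(s2:SHOW {id:"P&W"})
return count(distinct g) as SharedGuests;
```

Because this returns a single row, it does not need the "limit 100" safeguard.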

Wrap-up

That's about all I have at this point. You can download the database from over here. And the queries that I used above are all on github.

From my perspective, I think these kinds of datasets are extremely interesting and powerful. I would love to see more work like Thomas', from my own country or abroad, and look at this from an even broader perspective. In any case, I would like to thank and compliment Thomas on his work - and look forward to your feedback.

Hope this was useful.

Cheers

Rik

Monday, 4 November 2013

Clickstreams are so much nicer in Neo4j!

Clickstreams are interesting. At least that's what I think. And my browser history is nothing but that: a record of my own personal clickstreams as I make my way through the web. So occasionally I do find myself wading through my browser history: it's as if I like being reminded of my surfing behaviour in the middle of the night after spending just *one* beer too long in my favourite bar in the world. Yeah right. But seriously: browsing behaviour really matters, both in the personal sense and from a business point of view (someone managing websites and wanting to optimize how visitors browse them). It makes the difference between a nice experience on the web and a cr*p experience.

And if you had not noticed: these behaviour patterns are not much more than ... a graph. So imagine what we could do if we treated that data (your browser history, or more generally your online clickstreams) as a graph. That's what I was curious about. I am starting with my own browser history, but maybe someday one of you guys can expand this to the general case.

Getting to my clickstream data

Every browser has its own way of storing its clickstream (aka "browser history") data, so this part will be different for all of you. A quick browse on the interwebs taught me that there are plenty of tools out there.

But since I am an avid Chrome user, both for personal use with my personal Gmail account and professionally (Neo uses Google Apps - very happy with the overall experience there btw), and since I already have a number of Chrome add-ons installed (some of them, like Collabspot, changed my life), I ended up looking for help on the Chrome Web Store. And I found Visual History, a nice little add-on that exposes your browsing history as ... a graph. No kidding. "Vertices" (aka Nodes) and "Arcs" (aka Edges, Relationships).

You can find Visual History yourself if you're a chrome user: take a look at the website and find the code. The output of Visual History is a nice little graphic, but guess what: they also allow you to export the data into a nice little text file.

So then we have a text file. With all of our history. But there's more: I actually use Chrome in a setup that has multiple users: I like the way it allows me to keep my "personal" browsing separate from my "professional" browsing. It takes a bit of getting used to, but if you are like me and you have more than one Google identity (Gmail account) then I really think it is the way to go. For the purpose of this article, that means that I actually have TWO sets of browser history data: one personal, one professional.

So I installed Visual History in both my Chrome users, let it analyse my history for the past 3 months, and exported that data into two text files. Now the real fun could start: getting that data into Neo4j.

Getting the data ready

Since Visual History actually already generates a graph out of the history data, the next bit is really easy. I had to do a little bit of text & file manipulation:

Splitting the files into the right parts for import:

A file for Personal Sites that I had visited. These are one set of nodes for my neo4j database.
A file for Personal "browse sessions", or relationships: every time you browse from one site to another within 20 minutes, Visual History adds an "edge" to the graph.
A file for Professional Sites. This is the second set of nodes for the neo4j database.
A file for Professional "browse sessions", or relationships for the neo4j database.

making the file "comma separated". A couple of finds and replaces is all it takes.
adding some "type" information. Not really necessary, but good to know which nodes and relationships are personal/professional for future manipulations.

Importing the browser histories into neo4j

Ever since I learned about Michael's neo4j-shell-tools, I have been using it very often, and it just gets easier by the day.

The actual import script is on github (but the files are not - I am sure you understand that I like keeping my browser history a bit private).

What the import script does:

it starts with importing the personal browser history into neo4j nodes
then we create the indices on these nodes
then we import the "edges" of the graph, the actual browse patterns from site to site. We make these into "personal relationships"
then we use a very interesting new functionality of neo4j 2.0: when adding the "professional" website visits to the graph, we want to make sure that we do not duplicate website nodes, so we use MERGE in our Cypher statement.

//create professional nodes

import-cypher -d ; -i ./IMPORT/INPUT/ProfessionalNodes.csv -o ./IMPORT/OUTPUT/professionalnodeout.csv merge (n:url {name:{name}}) on match n set n.id2={id},n.type="Both" on create n set n.id2={id},n.type={type} return n.name as name

After that we can finish up by adding one remaining index, and importing the "professional" navigation relationships to the graph.
As a final step, we also introduced "Professional" and "Personal" labels to our graph to make future queries easier.
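To make the MERGE step and the final labelling step a bit more tangible, here is a sketch of both as plain Cypher. The literal values ("example.com", 42, "Professional") are made up and stand in for the CSV columns {name}, {id} and {type}. The labelling statements assume that the type column holds "Personal" or "Professional", which is a guess. Note that the import command above writes an identifier after "on match" and "on create"; the sketch uses the form without it, which is what current Cypher expects:

```cypher
// Hypothetical values stand in for the CSV columns {name}, {id}, {type}.
merge (n:url {name: "example.com"})
// the site already exists from the personal import: mark it as visited in both contexts
on match set n.id2 = 42, n.type = "Both"
// the site is new: it has only been visited professionally so far
on create set n.id2 = 42, n.type = "Professional"
return n.name as name;

// Labelling step, ASSUMING type is one of "Personal", "Professional" or "Both".
match (n:url) where n.type in ["Personal", "Both"] set n:Personal;
match (n:url) where n.type in ["Professional", "Both"] set n:Professional;
```

With this scheme a site visited in both contexts ends up carrying both labels, which is what makes the "visited professionally but not personally" style of question easy to ask.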

Doing a couple of quality checks

After the import, I felt I needed to do a couple of quality checks on my data, and included a couple of cypher queries that allowed me to do that. I most importantly wanted to check if the data volumes that I had in mind added up.
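The kind of check meant here could be as simple as the two counts below. This is a sketch rather than the queries from the import script: it only relies on the url label and the type property that the import sets, and the totals would then be compared by hand with the number of lines in the CSV files:

```cypher
// how many site nodes of each type were imported
match (n:url)
return n.type as Type, count(n) as Sites
order by Sites desc;

// how many navigation relationships of each relationship type were imported
match ()-[r]->()
return type(r) as RelType, count(r) as Navigations
order by Navigations desc;
```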

Once I had that done, the fun part - Cypher querying - could start.

In love with the Neo4j Browser

I now had my dataset, generously provided by Visual History, in neo4j, and I knew the data was sound. So then all I needed to do was think about my model (very simple, as you can see below), and then think about some interesting interactive queries.

To do these queries, I would be able to use the Browser. I found it to be a truly wonderful environment to experiment in, although there are still a couple of quirks (strange error messages that are much clearer in the console, a match clause whose pattern has to start on the same line as "match", some slowness with larger result sets) that will no doubt be ironed out by the fantastic NeoTech team.

Here are a couple of examples of items that I thought about; the full gist is over here.

So let's look at a couple of specific examples:

Number of Personal visits to a site:
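A sketch of what such a query could look like is given below. PERSONAL is a hypothetical relationship type (the post only says that the personal edges were imported as "personal relationships"), so substitute the real type name. The sketch also assumes that a visit to a site is a navigation arriving at it, i.e. an incoming relationship:

```cypher
// PERSONAL is a HYPOTHETICAL relationship type; replace it with the real one.
// Sites that were only ever the start of a session have no incoming
// relationship and will not show up in this count.
match (s:url)<-[v:PERSONAL]-()
return s.name as Site, count(v) as PersonalVisits
order by PersonalVisits desc
limit 10;
```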

This would look something like this:

Number of professional and personal visits to a site
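One way to get at this without knowing the relationship type names is to group on type(). Again this is only a sketch, counting incoming navigations per site and per relationship type:

```cypher
// one row per site and per relationship type (personal or professional)
match (s:url)<-[v]-()
return s.name as Site, type(v) as VisitType, count(v) as Visits
order by Visits desc;
```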

This would yield an interesting table like this. Quite a bit of overlap - but some notable differences in my professional/personal browsing behaviour:

Number of visits to Sites that are visited PROFESSIONALLY and NOT PERSONALLY
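With the "Professional" and "Personal" labels in place, this kind of question needs no relationship type at all. The sketch below assumes that a site visited in both contexts carries both labels, and it counts incoming navigations as visits:

```cypher
// sites that carry the Professional label but not the Personal one
match (s:Professional)<-[v]-()
where not s:Personal
return s.name as Site, count(v) as Visits
order by Visits desc;
```

Swapping the two labels gives the mirror-image question: sites visited personally and not professionally.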

You can probably guess from the result that I have been playing a lot with my own personal neo4j server on my localhost:

Number of visits to Sites that are visited PERSONALLY and NOT PROFESSIONALLY

This one:

Yields an interesting collection of travel, gadget, social network and other sites:

All in all I thought this was a very interesting exercise. It was a lot of fun to do, and it also confirmed my intuitive view that clickstream behaviour is in fact a very graphy kind of dataset. I hope we will be seeing a lot more web-analytics firms starting to use neo4j in the near future. I am sure it will be very interesting!

Hope this was useful.

Rik

Thursday, 12 September 2013

Lots of graphyness in Benelux happening in October

Just thought I would quickly summarize some of the great activities that are happening around neo4j and graphs in Belgium and the Netherlands in October:

We are organising some wonderful trainings together with Xebia in the Netherlands:

a Neo4j tutorial that is meant to introduce you to the wonderful world of neo4j
a Neo4j in Production training that will cover more advanced topics

Our community will also be meeting up:

together with Glimworm, we are hosting a graph modeling meetup in Amsterdam. You bring your model, I will bring beer. And some.
the day after, we will do exactly the same thing with our friends at Archimiddle.

For any of these, please register on the eventbrite/meetup links above, and let's make these wonderful events.
