Sunday 23 March 2014

Media, Politics and Graphs

My dear friend and neo4j community member Ron recently pointed me to an amazing piece of work. Thomas Boeschoten, of the Utrecht Data School among many other things, has published an analysis of the Dutch talk shows from different perspectives, using Gephi as one of his tools. Some of his results are nothing short of fascinating, and very cool to look at.
Now I think the world needs more people like Thomas: media have long abandoned political neutrality, and as citizens, we owe it to ourselves to understand the politics of the things that we see, read and hear in the media. Without that understanding, democracy will be short-lived and (to paraphrase de Maistre) we "will get the government we deserve". We need to understand this interplay between media and politics - and graphs can help there.

I will not try to help you understand the depths of Thomas' research (just visit his site - lots of cool stuff there, but mostly in Dutch). I would just like to take this dataset - which he kindly shared - for a spin using neo4j.

Importing the dataset

As Thomas visibly already is a graphista through and through, he shared his datasets with me as a Gephi file. So that made it really easy to get the initial stuff into neo4j: all I needed to do was use the recently updated neo4j-Gephi plugin and generate the data store files. Copy those over to my neo4j server's data directory, call it graph.db and boom - we're done!

However, when I fired up the server, I soon found out that I would have to do some work :) ... the graph that Thomas created did not really have a "database-like" model (it did not do any normalisation of the model, for instance) - and the neo4j browser looked a bit boring:

I needed to add some structure to all of this, in order to be able to query it meaningfully.
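To give an idea of what such a normalisation step could look like, here is a minimal sketch. It is not the exact set of statements I ran: the property names `gender` and `party` on the imported nodes are purely hypothetical (the Gephi export may well name them differently), and so is the `name` property on the PARTY nodes. The idea is simply to turn a repeated property value into a node of its own, and connect every node that carried that value to it.

```cypher
// HYPOTHETICAL property names: assumes imported nodes carry a `gender`
// and a `party` property. Adjust to whatever the import actually produced.

// turn each distinct gender value into a GENDER node and link to it
match (n)
where n.gender is not null
merge (gen:GENDER {name: n.gender})
merge (n)-[:HAS_GENDER]->(gen);

// same idea for political affiliation
match (n)
where n.party is not null
merge (p:PARTY {name: n.party})
merge (n)-[:AFFILIATED_WITH]->(p);
```

Because `merge` only creates what does not exist yet, running these statements a second time should not create duplicate GENDER or PARTY nodes.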

Adding a model

After browsing around through the data, I decided that the model that I would be playing with would look something like this:
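In text form, the parts of the model that the queries in this post rely on can be written down as three patterns. The sketch below lists them as comments, and then pulls a small sample of guests with everything attached to them. The labels and relationship types are the ones used in the queries further down; the `limit 25` is just an arbitrary choice to keep the result small.

```cypher
// Model, as used by the queries in this post:
//   (:GUEST)-[:VISITED]->(:SHOW)
//   (:GUEST)-[:AFFILIATED_WITH]->(:PARTY)
//   (:GUEST)-[:HAS_GENDER]->(:GENDER)

// sample a few guests together with their shows, party and gender;
// optional match keeps guests that have no party or gender attached
match (g:GUEST)-[:VISITED]->(sh:SHOW)
optional match (g)-[:AFFILIATED_WITH]->(p:PARTY)
optional match (g)-[:HAS_GENDER]->(gen:GENDER)
return g, sh, p, gen
limit 25;
```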


You can see that it is not a very big graph:

but it is quite densely connected - it has a lot of relationships between the nodes:
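If you want to check the size and density of the graph yourself, something along these lines should do. This is a generic sketch that makes no assumptions about the model: it just counts nodes per label combination and relationships per type.

```cypher
// number of nodes per label (combination)
match (n)
return labels(n) as Labels, count(n) as NrOfNodes
order by NrOfNodes desc;

// number of relationships per type
match ()-[r]->()
return type(r) as RelType, count(r) as NrOfRels
order by NrOfRels desc;
```

Dividing the total number of relationships by the total number of nodes gives a rough feel for how densely connected the graph is.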

So now I can do some more interesting queries on the data, and see if - like in Thomas' research - I can find out some interesting stuff about this dataset.

Take it for a spin: CYPHER queries!

Let's start with some simple queries. Let's figure out how many guest visits the different shows have had:

match (g:GUEST)-[v:VISITED]->(sh:SHOW)
return sh.id as Show, count(v) as NrOfVisits
order by NrOfVisits desc;

And we immediately get a feel for the dominant talkshows:
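Note that the query above counts visits, not people. A small variation returns both numbers side by side. Whether they differ depends on the data: this sketch assumes that a guest who appeared on a show several times has one VISITED relationship per appearance, which I have not verified here.

```cypher
match (g:GUEST)-[v:VISITED]->(sh:SHOW)
return sh.id as Show,
       count(v) as NrOfVisits,          // every VISITED relationship
       count(distinct g) as NrOfGuests  // every guest counted once per show
order by NrOfVisits desc;
```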


But then let's see how many of these talkshow guests are politicians (or have political affiliations at least). Let's expand the query a bit:

match (g:GUEST)-[v:VISITED]->(sh:SHOW),
(g)-[:AFFILIATED_WITH]->(p:PARTY)
return sh.id as Show, count(v) as NrOfVisits
order by NrOfVisits desc;

And see if there is any difference in the way the shows are ranked:


Interesting. There are indeed some differences, as you can see.
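To compare the two rankings without flipping between result tables, both counts can be put into one query. This is a sketch rather than the query I actually ran. It counts every visit once, even if a guest happens to be affiliated with more than one party, so the political numbers could come out slightly lower than in the query above if such guests exist.

```cypher
match (g:GUEST)-[v:VISITED]->(sh:SHOW)
optional match (g)-[:AFFILIATED_WITH]->(p:PARTY)
// one row per visit, with the number of parties of the visiting guest
with sh, v, count(p) as NrOfParties
return sh.id as Show,
       count(v) as NrOfVisits,
       sum(case when NrOfParties > 0 then 1 else 0 end) as NrOfPoliticalVisits
order by NrOfVisits desc;
```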

Now let's look at another perspective in our dataset: Gender. Let's look at the distribution of male/female guests to all of these shows:

match (g:GUEST)-[:HAS_GENDER]->(gen:GENDER),
(g)-[v:VISITED]->(sh:SHOW)
return gen.name, count(v)
order by gen.name ASC;

We can clearly still see the dominance of men in these shows:

If we then add the political dimension again, and look at gender distribution for the political visitors to the shows:

match (g:GUEST)-[:HAS_GENDER]->(gen:GENDER),
(g)-[v:VISITED]->(sh:SHOW),
(g)-[:AFFILIATED_WITH]->(p:PARTY)
return gen.name, count(v)
order by gen.name ASC;

then we can see that the distribution is broadly the same:
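The same gender query can also be broken down per show, to see whether some shows are more balanced than others. A minimal sketch, reusing only the labels and properties from the queries above:

```cypher
match (g:GUEST)-[:HAS_GENDER]->(gen:GENDER),
      (g)-[v:VISITED]->(sh:SHOW)
return sh.id as Show, gen.name as Gender, count(v) as NrOfVisits
order by Show asc, Gender asc;
```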


I am sure there are plenty of other queries to think of, but let me do one more in this post: let's see what the overlap is - in terms of guests visiting them - between the different shows. To do that, all we need to do is calculate some paths between two shows: DWDD and P&W.

match p = AllShortestPaths((s1:SHOW {id:"DWDD"})-[*..2]-(s2:SHOW {id:"P&W"}))
return p
limit 100;

The result is exactly what you would expect: a HUGE amount of overlap - at least between these two shows, which we saw above are the largest. Hence the "limit 100" in the query - so that my poor neo4j browser would survive:
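If you would rather have a number than a picture, the overlap can also be expressed as a count of shared guests. The sketch below first does this for DWDD and P&W, and then for every pair of shows. The second query assumes that `id` is unique per show, and uses `s1.id < s2.id` so that each pair is only listed once.

```cypher
// guests that visited both DWDD and P&W
match (s1:SHOW {id:"DWDD"})<-[:VISITED]-(g:GUEST)-[:VISITED]->(s2:SHOW {id:"P&W"})
return count(distinct g) as NrOfSharedGuests;

// shared guests for every pair of shows
match (s1:SHOW)<-[:VISITED]-(g:GUEST)-[:VISITED]->(s2:SHOW)
where s1.id < s2.id
return s1.id as ShowA, s2.id as ShowB, count(distinct g) as NrOfSharedGuests
order by NrOfSharedGuests desc;
```

Since these queries return counts rather than paths, there is no need for a limit to protect the browser.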

Wrap-up

That's about all I have at this point. You can download the database from over here. And the queries that I used above are all on github.

From my perspective, I think these kinds of datasets are extremely interesting and powerful. I would love to see more work like Thomas', from my own country or abroad, and look at this from an even broader perspective. In any case, I would like to thank and compliment Thomas on his work - and look forward to your feedback.

Hope this was useful.

Cheers

Rik

