Sunday 24 May 2015

Cycling Tweets Part 5: querying the CycleTweets

After parts 1 to 4 of this series, where we mostly loaded and tweaked the data for the Cycling Tweets, it's time to have some fun querying this dataset with Cypher.

All of the queries are on GitHub, but let me just walk you through a couple of the more interesting ones.

Warming up the "Cypher Muscles"

Let's start with something easy, just to get the "Cypher Muscles" going:
    //degree of handles
    match (h:Handle)-[:TWEETS]->(t:Tweet)
    return h.name, h.realname, count(t)
    order by count(t) DESC
    limit 10

This query gives us the "degree" (the number of "TWEETS" relationships of a "Handle" node) of the nodes in our dataset, and a first indication of what to look for further on:
Turns out that this is quite interesting. Very few of the "top gun" cyclists seem to be the top Tweeters. The only one that really stands out, I think, is Luca Paolini - the others are basically excellent riders, but not the "top guns". At least not in my opinion/experience of the sport.
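
If you want to see what this degree count boils down to without firing up Neo4j, here is a minimal Python sketch. The handles and tweet ids are invented toy data, not taken from the real dataset, and the function name is just something I made up for the illustration:

```python
from collections import Counter

# Hypothetical toy data: (handle, tweet_id) pairs standing in for
# (:Handle)-[:TWEETS]->(:Tweet) relationships. None of this is real data.
TWEETS = [
    ("@rider_a", 1), ("@rider_a", 2), ("@rider_a", 3),
    ("@rider_b", 4), ("@rider_b", 5),
    ("@rider_c", 6),
]


def top_tweeters(tweets, limit=10):
    """Count TWEETS relationships per handle, highest count first."""
    degree = Counter(handle for handle, _tweet in tweets)
    # most_common() plays the role of ORDER BY count(t) DESC LIMIT n
    return degree.most_common(limit)


print(top_tweeters(TWEETS, limit=2))
# [('@rider_a', 3), ('@rider_b', 2)]
```

The grouping is implicit in Cypher (every non-aggregated column in the RETURN becomes a grouping key), whereas here the Counter does that job explicitly.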

So let's take a look at the #hashtags. Which ones are most mentioned in Tweets?

    //most mentioned handles or hashtags
    //(assumes both handle and hashtag nodes carry a name property)
    match (h)-[:MENTIONED_IN]->(t:Tweet)
    return h.name, labels(h), count(t)
    order by count(t) DESC
    limit 10

And the result is kind of obvious:
The big races like Paris-Roubaix and the "Ronde van Vlaanderen" (#RVV, or "Tour of Flanders") are the top mentions.

Using the NodeRanks

As you may remember from part 4 of this blogpost series, we used the GraphAware framework to calculate the PageRank of the different nodes in our dataset. So let's take a look at that:
 You immediately see the "big guns" (like Tour de France winners Alberto Contador, Chris Froome, Cadel Evans) pop out. But this does not say a lot - so we want to do a bit more exploration of these top riders and see what they are really connected to.

"Impossible is Nothing" with the power of WITH

To paraphrase my dear friend Jim Webber, "Impossible is Nothing" with Cypher and Neo4j.

Because here's the thing. When I was starting to do this little project, I did not know what I was going to find. I really didn't. Of course I knew a bit about Neo4j, I am a fan of cycling, but still... it was all kind of an experiment, a jump into the unknown. So this is where I fell in love - again - with Neo4j, Cypher, and the way it allows you to interactively and iteratively explore your data - hunting out the hidden insights that you may not have thought of beforehand.

A key tool in this was the "WITH" clause in Cypher. From the manual:
The WITH clause allows query parts to be chained together, piping the results from one to be used as starting points or criteria in the next.
So that means that I can basically iteratively query my dataset, and use the result of one iteration as input for the next iteration. Which is very powerful in my opinion. So here's what I did with the "top ranked" nodes:

  • First I explored which other nodes are connected to these top-ranked ones, using WITH:
    //what is connected to the top NodeRanked handles
    match (h:Handle)
    where h.nodeRank is not null
    with h
    order by h.nodeRank DESC
    limit 1
    match (h)-[r*..2]-()
    return h,r
    limit 50

This gave me a nice little overview:
However, because I had to "LIMIT" the result, it felt as if I was artificially skewing the view. So let's take a second pass at this.

  • Second, I looked at the labels of the nodes that are directly connected to the single top-ranked node:

    //what is connected to the top NodeRanked handles at depth 1
    match (h:Handle)
    where h.nodeRank is not null
    with h
    order by h.nodeRank DESC
    limit 1
    match (h)--(connected)
    return labels(connected), count(connected)
    limit 25

And I can do something very similar by just tweaking the query to find out what is connected at depth 2 or 3...

    //what is connected to the top NodeRanked handles up to depth 2
    //(change *..2 to *..3 to go one hop further)
    match (h:Handle)
    where h.nodeRank is not null
    with h
    order by h.nodeRank DESC
    limit 1
    match (h)-[*..2]-(connected)
    return labels(connected), count(connected)
    order by count(connected) DESC

The order of the result is a bit different then:

So that gave me a good feel for the dataset. Again: I think it's mostly this interactive query capability that makes it so interesting.
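
To make the "pipe one result into the next" idea behind WITH a bit more tangible, here is a rough Python sketch of the same two stages: first pick the top-ranked handle, then look at what is connected to it. The graph, labels, and ranks are all invented for the illustration. One simplification to be aware of: the sketch counts every reachable node once, whereas the Cypher pattern returns one row per matching path, so the real counts can be higher.

```python
from collections import Counter

# Hypothetical toy graph. Every name, label, and rank below is invented.
NODE_RANK = {"@top": 9.0, "@mid": 4.0, "@low": 1.0}
LABELS = {
    "@top": "Handle", "@mid": "Handle", "@low": "Handle",
    "t1": "Tweet", "t2": "Tweet", "#race": "Hashtag",
}
EDGES = [
    ("@top", "t1"), ("@top", "t2"), ("@mid", "t1"),
    ("#race", "t2"), ("@low", "t2"),
]


def neighbours(node):
    """Undirected neighbours, like the (h)--(connected) pattern."""
    found = set()
    for a, b in EDGES:
        if a == node:
            found.add(b)
        elif b == node:
            found.add(a)
    return found


def top_ranked(limit=1):
    """Stage 1: everything up to WITH h ORDER BY h.nodeRank DESC LIMIT n."""
    return sorted(NODE_RANK, key=NODE_RANK.get, reverse=True)[:limit]


def connected_labels(start, max_depth=1):
    """Stage 2: label counts of distinct nodes within max_depth hops."""
    seen = {start}
    frontier = {start}
    for _ in range(max_depth):
        frontier = {n for f in frontier for n in neighbours(f)} - seen
        seen |= frontier
    return Counter(LABELS[n] for n in seen - {start})


# The output of stage 1 is the starting point of stage 2, just like WITH.
for handle in top_ranked(limit=1):
    print(handle, connected_labels(handle, max_depth=2))
```

Changing max_depth is the equivalent of tweaking the *..2 in the query, which is exactly the kind of quick iteration that made this exploration fun.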

Betweenness on a subgraph

So then I thought back to some work that I did last year to try and implement Betweenness Centrality in Cypher. The conclusion back then was that it was pretty easy to do, but... that it would be very expensive on a large dataset. I think this would be a prime candidate for another GraphAware component :) ... but let's see if we can use WITH to

  • first find a subgraph of interesting suspect nodes
  • then calculate the betweenness on these suspect nodes
Turns out that this was pretty straightforward. Here's the query:

    //betweenness centrality for the top ranked nodes - query using UNWIND
    //first we create the subgraph that we want to analyse
    match (h:Handle)
    where h.nodeRank is not null
    with h
    order by h.nodeRank DESC
    limit 50
    //we store all the nodes of the subgraph in a collection, and pass it to the next query
    WITH COLLECT(h) AS handles
    //then we unwind this collection TWICE so that we get a product of rows (2500 in total)
    UNWIND handles as source
    UNWIND handles as target
    //and then finally we calculate the betweenness on these rows
    MATCH p=allShortestPaths((source)-[:TWEETS|MENTIONED_IN*]-(target))
    WHERE id(source) < id(target) and length(p) > 1
    UNWIND nodes(p)[1..-1] as n
    WITH n.realname as Name, count(*) as betweenness
    WHERE Name is not null
    RETURN Name, betweenness
    ORDER BY betweenness DESC;

Here's the result:

As you can see - this is quite interesting. It's clear that there are a number of lesser-known riders that are very "between" the top guns (in terms of PageRank).
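
For those who want to see the mechanics of this without the database, here is a small Python sketch of the same idea: take a selection of nodes, find all shortest paths between every pair, and count how often each node shows up in the interior of such a path. Note that this is a raw path count, like in the query above, and not the normalised textbook definition of betweenness centrality. The toy graph and the function names are made up for the illustration.

```python
from collections import Counter, deque
from itertools import combinations

# Hypothetical undirected toy graph as adjacency sets (invented data).
GRAPH = {
    "a": {"x", "y"},
    "b": {"x", "y"},
    "c": {"x"},
    "x": {"a", "b", "c"},
    "y": {"a", "b"},
}


def all_shortest_paths(graph, source, target):
    """Return every shortest path from source to target as a node list."""
    dist = {source: 0}
    preds = {source: []}
    queue = deque([source])
    # Breadth-first search that remembers ALL predecessors on shortest paths.
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                preds[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                preds[nxt].append(node)
    if target not in dist:
        return []
    # Walk the predecessor lists back from the target to the source.
    paths = []
    stack = [[target]]
    while stack:
        partial = stack.pop()
        if partial[-1] == source:
            paths.append(partial[::-1])
        else:
            for pred in preds[partial[-1]]:
                stack.append(partial + [pred])
    return paths


def interior_counts(graph, selected):
    """Count appearances of nodes strictly inside shortest paths."""
    counts = Counter()
    # combinations() visits each pair once, like id(source) < id(target).
    for source, target in combinations(sorted(selected), 2):
        for path in all_shortest_paths(graph, source, target):
            # path[1:-1] mirrors nodes(p)[1..-1]; it is empty for direct
            # neighbours, which matches the length(p) > 1 filter.
            counts.update(path[1:-1])
    return counts


print(interior_counts(GRAPH, {"a", "b", "c"}).most_common())
```

Even on this tiny graph you can see why the approach gets expensive quickly: the number of pairs grows quadratically with the selection, which is why limiting the subgraph first matters so much.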

Wrapping it up with some pathfinding

So last but not least, we need to do some pathfinding on this dataset. In my experience, that always gives away some interesting insights.

So let's experiment with two very well-known riders, Tom Boonen (former world champ and multiple winner of the Tour of Flanders and Paris-Roubaix) and Alexander Kristoff (this year's winner of the Tour of Flanders). Here's the simple query:

    //the link between Boonen and Kristoff
    match (h1:Handle {name:"@kristoff87"}), (h2:Handle {realname:"BOONEN Tom"}),
    p = allshortestpaths ((h2)-[*]-(h1))
    return p

The result is:

But then my suspicion is that the Teams that these riders belong to are actually really important. So let's take a look:

    //the link between Boonen and Kristoff and their teams
    match (h1:Handle {name:"@kristoff87"}), (h2:Handle {realname:"BOONEN Tom"}),
    p = allshortestpaths ((h2)-[*]-(h1))
    with nodes(p) as Nodes
    unwind Nodes as Node
    match (Node)--(t:Team)
    return Node, t

As you can see we are using the same principle as above: WITH ties it all together.

That's about it, folks. There are so many other things that I would love to do with this dataset (Community detection is high on my wishlist) - but I think 5 parts to a blogpost series is probably enough :) ...

I guess you could tell from this series of blogposts that I am a bit into Cycling, and that I enjoy working on this stuff with Neo4j. It's been a lot of fun - and a bit of effort - to get all of this done, but overall... I am pretty happy with the result.

Please let me know what you thought of it too - would love to get feedback.


