Tuesday 12 May 2015

Cycling Tweets Part 1: Who's who on Twitter

I have been a Cycling fan for a long time. Got into it at University a long time ago through some crazy friends that were already into it. And I have been on the bike ever since. Whether it's just for some grocery shopping and/or bar hopping in Antwerp, or for a longer holiday trip - or just for watching the pro riders in  CycloCross or Road racing - I am game for all of it. I even got my boys really excited about - watching legendary races like the Tour of Flanders or Paris-Roubaix together religiously.

So last March, I was thinking of yet another crazy Neo4j-experiment to do, and I thought about doing that on some cycling topics. After all, April is a "Holy Month" for cycling enthousiasts - many of the legendary "classic races" are happening during that month. So that's what I did. I will be detailing this over the next couple of blogposts - but suffice to say that I got into a bit of a project here, a journey that I gladly want to share with you.

So here it goes. This is part 1 of (what I think will be) 5 blogposts around Neo4j, graph databases, Twitter and Cycling. Hoping you will enjoy it.

Starting with the riders

The obvious idea I had was to try and do some work with some social networking data for the top riders in the pro cycling peloton. I follow some of these guys myself on my Twitter feed, but how would I be able to get to all the interesting ones like Tom Boonen, Fabian Cancellara, and others. I googled around a bit and found this site: CQ Ranking. In their own words: they are a ranking of Pro cycling riders that try to rank riders based on the past 12 performance - a bit like the UCI ranking of cyclists. And they provide some really cool data: here's an example of a sheet that you can download from their website.

One of the most interesting data elements that I found on the CQ website was the list of Twittering riders. This was almost exactly what I was looking for for my experiment, a long list of all the riders, their teams, their countries... and their twitter handles. Obviously there were going to be some mistakes/problems in this list, but still - it looked pretty sweet. So there I went, downloading everything and putting it all into a google spreadsheet for some data cleanup so that I could prepare it for an upload into Neo4j.

The only real thing that I had to do was to match the CQ ranking sheet with the with twittering riders. That was easy enough once I had both data sets different tabs of the google sheet: a simple VLOOKUP was all it took:
And then we got this sheet:
Interesting. Now I have a list of very interesting twitter accounts - what to do with them. This is when the first part of my experiment really started to materialise, and when I decided that I would love to know what all these interesting online characters would be up to in a Holy (!) month like this one. I would love to know what they were tweeting about, who they were mentioning, how they would be grouped together etc... So I needed to get to that data... 

Getting to the Twitter data

This turned out to be a bit more difficult than I thought it would be. Sure, Twitter gives you this API access to read from their vast datasets, but frankly, for a newbie like me these "rate limiting" rules are pretty confusing and intransparant. And plus - I don't really know how to code :) - so that really limited my options. So I tried a few things and then decided that the easiest way to get to all of these "April tweets" would be to create a new twitter account  (CycleF0ll0w) and then follow all of the accounts that I wanted to follow (from the spreadsheet). So that's what I did: a ghost twitter ID appeared: all it does is follow people - so that I have access to the timeline that contains the information I want.

In order to easily create the list of people that "I" followed, I used a tool called Tweepi: it allows for bulk creation of "follow" links really easily. I decided to go with the top 500 (as per their CQ ranking) riders - that should be more than interesting enough.

Exporting and Cleaning the timeline

So now I have a timeline. How do I get that to be extracted so that I can work with it and get it ready to be imported into Neo4j? Again, I investigated multiple options, but ended up going for a paid service: Exporttweet. On a daily basis, this service automatically created an Excel spreadsheet containing all the tweets appearing on my CycleF0ll0w timeline.
The output was really simple: an Excel file a day, keeps the doctor away!
Now all there was left to do was to clean these sheets up a bit. I used my tried and tested Open Refine install to do that:
It was mostly taking out special characters, renaming column names, and removing duplicates, and I ended up with a very simple set of Replay Refine operations based on the saved json file:
That was it. Now I had everything ready to get started:
  • a google doc with a bunch of metadata about riders (names, teams, rankings, twitter handles)
  • a timeline with all the tweets of these riders, and a way to export that into daily XLS files
  • an OpenRefine process to create CSV files out of these tweets on a daily basis. 
In the next blog post I will go and get started with this - and start having some fun with the data.

Hope you enjoyed this so far - already looking forward to part 2. As always, feedback welcome.



No comments:

Post a Comment