So here it goes. This is part 1 of (what I think will be) 5 blogposts around Neo4j, graph databases, Twitter and Cycling. Hoping you will enjoy it.
Starting with the riders
The obvious idea I had was to try and do some work with some social networking data for the top riders in the pro cycling peloton. I follow some of these guys myself on my Twitter feed, but how would I be able to get to all the interesting ones like Tom Boonen, Fabian Cancellara, and others? I googled around a bit and found this site: CQ Ranking. In their own words, they rank pro cycling riders based on their performance over the past 12 months - a bit like the UCI ranking of cyclists. And they provide some really cool data: here's an example of a sheet that you can download from their website.
One of the most interesting data elements that I found on the CQ website was the list of Twittering riders. This was almost exactly what I was looking for in my experiment: a long list of all the riders, their teams, their countries... and their Twitter handles. Obviously there were going to be some mistakes/problems in this list, but still - it looked pretty sweet. So there I went, downloading everything and putting it all into a Google spreadsheet for some data cleanup so that I could prepare it for an upload into Neo4j.
The only real thing that I had to do was to match the CQ ranking sheet with the list of Twittering riders. That was easy enough once I had both data sets in different tabs of the Google sheet: a simple VLOOKUP was all it took.
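For readers who would rather script this step than do it in a spreadsheet, here is a minimal sketch of the same exact-match lookup in Python. It is not what I actually ran: the column names (`rider`, `cq_points`, `twitter_handle`) and the sample rows are made up for illustration, and the real CQ sheets will have different headers.

```python
import csv
import io

# Hypothetical sample data standing in for the two spreadsheet tabs.
RANKING_CSV = """rider,team,country,cq_points
Rider A,Team X,BEL,1200
Rider B,Team Y,SUI,1100
Rider C,Team Z,ESP,900
"""

TWITTER_CSV = """rider,twitter_handle
Rider A,@rider_a
Rider C,@rider_c
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def add_handles(ranking_rows, twitter_rows, key="rider"):
    """Attach a twitter_handle to each ranking row (exact match on the key)."""
    # Build the lookup table once; normalise names so "Rider A " == "rider a".
    handles = {
        row[key].strip().lower(): row["twitter_handle"] for row in twitter_rows
    }
    merged = []
    for row in ranking_rows:
        combined = dict(row)
        # An empty string marks riders without a known handle.
        combined["twitter_handle"] = handles.get(row[key].strip().lower(), "")
        merged.append(combined)
    return merged


merged = add_handles(load_rows(RANKING_CSV), load_rows(TWITTER_CSV))
for row in merged:
    print(row["rider"], row["twitter_handle"] or "(no handle found)")
```

Matching on the rider name is the weak spot here, just as it is with VLOOKUP: any spelling difference between the two lists results in a missing handle, so the unmatched rows are worth a manual check.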
Interesting. Now I have a list of very interesting Twitter accounts - what to do with them? This is when the first part of my experiment really started to materialise, and when I decided that I would love to know what all these interesting online characters would be up to in a Holy (!) month like this one. I would love to know what they were tweeting about, who they were mentioning, how they would be grouped together etc... So I needed to get to that data...
Getting to the Twitter data
This turned out to be a bit more difficult than I thought it would be. Sure, Twitter gives you API access to read from their vast datasets, but frankly, for a newbie like me these "rate limiting" rules are pretty confusing and opaque. And plus - I don't really know how to code :) - so that really limited my options. So I tried a few things and then decided that the easiest way to get to all of these "April tweets" would be to create a new Twitter account (CycleF0ll0w) and then follow all of the accounts that I wanted to follow (from the spreadsheet). So that's what I did: a ghost Twitter ID appeared. All it does is follow people, so that I have access to the timeline that contains the information I want.
In order to easily create the list of people that "I" followed, I used a tool called Tweepi: it allows for bulk creation of "follow" links really easily. I decided to go with the top 500 (as per their CQ ranking) riders - that should be more than interesting enough.
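Picking that top 500 is easy enough in the spreadsheet (sort and copy), but as a rough sketch it could also look like the Python below. The field names and sample rows are hypothetical, and I am assuming here that "top 500" means the 500 best-ranked riders among those that actually have a Twitter handle.

```python
def top_riders_with_handles(rows, limit=500):
    """Return up to `limit` riders that have a handle, highest CQ points first."""
    with_handle = [row for row in rows if row.get("twitter_handle")]
    ranked = sorted(with_handle, key=lambda row: int(row["cq_points"]), reverse=True)
    return ranked[:limit]


# Hypothetical rows, shaped like the merged rider sheet.
riders = [
    {"rider": "Rider A", "cq_points": "1200", "twitter_handle": "@rider_a"},
    {"rider": "Rider B", "cq_points": "1100", "twitter_handle": ""},
    {"rider": "Rider C", "cq_points": "900", "twitter_handle": "@rider_c"},
    {"rider": "Rider D", "cq_points": "1500", "twitter_handle": "@rider_d"},
]

# One handle per line, without the "@", ready to paste into a bulk-follow tool.
to_follow = [
    row["twitter_handle"].lstrip("@")
    for row in top_riders_with_handles(riders, limit=2)
]
print("\n".join(to_follow))
```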
Exporting and Cleaning the timeline
So now I have a timeline. How do I extract it so that I can work with it and get it ready to be imported into Neo4j? Again, I investigated multiple options, but ended up going for a paid service: Exporttweet. On a daily basis, this service automatically created an Excel spreadsheet containing all the tweets appearing on my CycleF0ll0w timeline.
Now all there was left to do was to clean these sheets up a bit. I used my tried and tested OpenRefine install to do that, with the operations saved as a JSON file.
That was it. Now I had everything ready to get started:
- a Google doc with a bunch of metadata about riders (names, teams, rankings, Twitter handles)
- a timeline with all the tweets of these riders, and a way to export that into daily XLS files
- an OpenRefine process to create CSV files out of these tweets on a daily basis.
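To give an idea of what the cleanup in that last step amounts to, here is a small Python sketch. It is an illustration only, not the actual OpenRefine operations: the `author` and `text` column names, the sample tweets and the idea of writing the mentions into their own column are all my assumptions.

```python
import csv
import io
import re

# Twitter handles are 1-15 characters: letters, digits and underscores.
MENTION_RE = re.compile(r"@(\w{1,15})", re.ASCII)


def clean_tweet(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return " ".join(text.split())


def extract_mentions(text):
    """Return the handles mentioned in a tweet, lowercased and without the @."""
    return [handle.lower() for handle in MENTION_RE.findall(text)]


# Hypothetical rows, standing in for one day's exported timeline.
raw_rows = [
    {"author": "rider_a", "text": "Great ride today with @Rider_B\nand @rider_c!"},
    {"author": "rider_b", "text": "Rest   day."},
]

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["author", "text", "mentions"])
for row in raw_rows:
    text = clean_tweet(row["text"])
    # Several mentions are joined with ";" so they fit in one CSV column.
    writer.writerow([row["author"], text, ";".join(extract_mentions(text))])

print(out.getvalue())
```

Note that the simple pattern would also pick up the part after the "@" in an email address, so a real cleanup would need a slightly stricter rule.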
In the next blog post I will go and get started with this - and start having some fun with the data.
Hope you enjoyed this so far - already looking forward to part 2. As always, feedback welcome.