So here it goes. This is part 1 of (what I think will be) 5 blog posts around Neo4j, graph databases, Twitter and cycling. I hope you'll enjoy it.
Starting with the riders
The obvious idea I had was to do some work with social networking data for the top riders in the pro cycling peloton. I follow some of these guys myself on my Twitter feed, but how would I get to all the interesting ones like Tom Boonen, Fabian Cancellara, and others? I googled around a bit and found this site: CQ Ranking. In their own words, they rank pro cycling riders based on their performance over the past 12 months - a bit like the UCI ranking of cyclists. And they provide some really cool data: here's an example of a sheet that you can download from their website.

One of the most interesting data elements that I found on the CQ website was the list of twittering riders. This was almost exactly what I was looking for for my experiment: a long list of all the riders, their teams, their countries... and their twitter handles. Obviously there were going to be some mistakes and problems in this list, but still - it looked pretty sweet. So there I went, downloading everything and putting it all into a google spreadsheet for some data cleanup, so that I could prepare it for an upload into Neo4j.
The only real thing I had to do was match the CQ ranking sheet with the list of twittering riders. That was easy enough once I had both data sets in different tabs of the google sheet: a simple VLOOKUP was all it took:
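For anyone who prefers scripting over spreadsheets, the same VLOOKUP-style matching could be sketched in a few lines of Python. Note that the column names and sample rows below are hypothetical stand-ins for the two tabs of the google sheet, assuming both were exported as CSV:

```python
import csv
import io

# Hypothetical sample data standing in for the two tabs:
# the CQ ranking export and the list of twittering riders.
ranking_csv = """rider,team,cq_rank
Tom Boonen,Omega Pharma-Quick Step,5
Fabian Cancellara,RadioShack-Leopard,2
"""

twitter_csv = """rider,handle
Tom Boonen,@tomboonen1
Fabian Cancellara,@f_cancellara
"""

def join_on_rider(ranking_file, twitter_file):
    """Emulate the VLOOKUP: look up each ranked rider's twitter handle by name."""
    handles = {row["rider"]: row["handle"]
               for row in csv.DictReader(twitter_file)}
    joined = []
    for row in csv.DictReader(ranking_file):
        # blank if no match, where VLOOKUP would give #N/A
        row["handle"] = handles.get(row["rider"], "")
        joined.append(row)
    return joined

riders = join_on_rider(io.StringIO(ranking_csv), io.StringIO(twitter_csv))
for r in riders:
    print(r["rider"], r["handle"])
```

Just like the spreadsheet lookup, this joins on the rider's name, so any spelling differences between the two lists would show up as empty handles.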
And then we got this sheet:
Interesting. Now I have a list of very interesting twitter accounts - but what to do with them? This is when the first part of my experiment really started to materialise, and when I decided that I would love to know what all these interesting online characters would be up to in a Holy (!) month like this one. I would love to know what they were tweeting about, who they were mentioning, how they would be grouped together, etc... So I needed to get to that data.
Getting to the Twitter data
This turned out to be a bit more difficult than I thought it would be. Sure, Twitter gives you API access to read from their vast datasets, but frankly, for a newbie like me the "rate limiting" rules are pretty confusing and opaque. Plus, I don't really know how to code :) - so that really limited my options. I tried a few things and then decided that the easiest way to get to all of these "April tweets" would be to create a new twitter account (CycleF0ll0w) and then follow all of the accounts that I wanted to track (from the spreadsheet). So that's what I did: a ghost twitter ID appeared. All it does is follow people, so that I have access to a timeline that contains the information I want.
In order to easily create the list of people that "I" followed, I used a tool called Tweepi: it makes bulk creation of "follow" links really easy. I decided to go with the top 500 riders (as per their CQ ranking) - that should be more than interesting enough.
Exporting and Cleaning the timeline
So now I have a timeline. How do I extract it so that I can work with it and get it ready to import into Neo4j? Again, I investigated multiple options, but ended up going for a paid service: Exporttweet. On a daily basis, this service automatically created an Excel spreadsheet containing all the tweets appearing on my CycleF0ll0w timeline.
The output was really simple: an Excel file a day keeps the doctor away!
Now all there was left to do was to clean these sheets up a bit. I used my tried and tested Open Refine install to do that:
It was mostly taking out special characters, renaming columns, and removing duplicates, and I ended up with a very simple, replayable set of Refine operations saved as a json file.
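For reference, the three cleanup steps could also be sketched in plain Python. This is not what the Refine json actually contains - the original column names and the exact character filter here are hypothetical, just to illustrate the kind of transformation involved:

```python
import re

def clean_tweets(rows):
    """Roughly what the Refine operations did: rename columns, strip
    special characters from the tweet text, and drop duplicate rows."""
    renames = {"Tweet Text": "text", "Screen Name": "handle"}  # hypothetical originals
    seen = set()
    cleaned = []
    for row in rows:
        # rename columns to simpler names
        row = {renames.get(k, k): v for k, v in row.items()}
        # keep only word characters, whitespace and common punctuation
        row["text"] = re.sub(r"[^\w\s@#:/.,!?'-]", "", row["text"])
        # remove duplicates based on who said what
        key = (row["handle"], row["text"])
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned

sample = [
    {"Screen Name": "@tomboonen1", "Tweet Text": "Great ride today! \u2600"},
    {"Screen Name": "@tomboonen1", "Tweet Text": "Great ride today! \u2600"},  # duplicate
]
print(clean_tweets(sample))
```

The nice thing about doing this in OpenRefine instead is exactly that saved json file: the same operations can be replayed on every new daily export without writing any code at all.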
That was it. Now I had everything ready to get started:
- a google doc with a bunch of metadata about riders (names, teams, rankings, twitter handles)
- a timeline with all the tweets of these riders, and a way to export that into daily XLS files
- an OpenRefine process to create CSV files out of these tweets on a daily basis.
In the next blog post I will go and get started with this - and start having some fun with the data.
Hope you enjoyed this so far - already looking forward to part 2. As always, feedback welcome.
Cheers
Rik