Thursday, 14 July 2016

Graphing the Tour de France - part 1/3

Alright, it's time to come out of the closet. I have to admit, over the past couple of years, I have turned into a bit of a cycling geek. I love watching the races in Flanders in spring, the legendary "ride through hell" from Paris to Roubaix, and of course, now, in summertime, the big tours of Italy, France and Spain. I have grown quite addicted to it - and have taken to riding my own bike a couple of times a week as well... it's a ton of fun. Last year I did a fun experiment in a series of 5 blog posts about the Professional Cycling twitterverse, but this year, I had something else thrown into my lap. Here's what happened.

A couple of weeks ago, my darling 11-year-old son, who is an avid cyclist too and a member of a local club, was leaving on a scouting camp. Now scouting camp means: no TV, no internet, no mobile, no tablet, no... nothing... just enjoy life with your scouting buddies and have fun... And while my 11-year-old was really looking forward to that, he was also in a bit of a panic, as it meant that he would have to miss the first 10 stages of the Tour. He made me promise that I would keep track of the results for him - which of course I dutifully took to heart. But not without doing something graphy with it :)) ... So here are a couple of things that I looked at and experienced.

Getting the Tour de France data

So my go-to source of Tour de France data is a local TV station's website: Sporza Tour. They have a very handy results section that has a bunch of data, and as it so happens, it's pretty easy to export that data into a Google Spreadsheet, which I have published over here. There are three tabs in the sheet:
  1. a "riders" tab which includes basic information about the 198 riders and their 22 teams. 
  2. a "sporza" tab, which actually imports data from the Sporza website on a daily basis, using the illustrious "ImportHTML" function. It's a bit of a hack, as I basically had to reverse engineer the URLs of the results pages on the Sporza website, and then use the function to get the data automatically. Works though.
  3. a "stages" tab that has a nice little overview of the different stages, their profile, the stage results, and the jersey holders after every stage.
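For the curious: ImportHTML takes a URL, a query ("table" or "list") and a 1-based index, and returns that table as rows of cells. Here is a rough Python sketch of the same idea, just to show what the function is doing for me. The class and function names and the sample page are made up for illustration (this is not the real Sporza markup, and these are not real results), and the sketch does not handle nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return the index-th table (1-based, like ImportHTML) as rows of strings."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# Made-up sample page, standing in for a results page
sample = """
<table><tr><th>Pos</th><th>Rider</th></tr>
<tr><td>1</td><td>Rider A</td></tr>
<tr><td>2</td><td>Rider B</td></tr></table>
"""
rows = import_html_table(sample, 1)
```

In the spreadsheet itself, all of this is a single formula per results page, which is exactly why it's so handy.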
All in all, the sheet is a pretty great summary of the current situation in the Tour. Now all I need to do is get that data into Neo4j to see what fun we can have with it in a graph. The good news is that Google Sheets lets you export each tab as a CSV file, and Neo4j loads CSV files really easily - so that should be easy enough.
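To make the CSV idea a bit more concrete, here is a minimal Python sketch. It assumes the sheet is shared publicly and uses the CSV export URL pattern as I understand it (spreadsheet id plus the tab's gid). The helper names, the "SHEET_ID" placeholder and the sample columns are hypothetical - they are not the actual layout of my sheet.

```python
import csv
import io


def sheet_csv_url(sheet_id, gid):
    """Build the CSV export URL for one tab of a link-shared Google Sheet.

    Assumption: the usual /export?format=csv&gid=... pattern applies.
    """
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}"


def parse_csv(text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical sample rows standing in for the "riders" tab
sample_csv = "rider,team\nRider A,Team X\nRider B,Team Y\n"
riders = parse_csv(sample_csv)

# "SHEET_ID" is a placeholder, not the id of the published sheet
url = sheet_csv_url("SHEET_ID", 0)
```

Each tab gets its own URL, so riders, results and stages can each be pulled in as a separate CSV file.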

That, of course, is what I will be doing in the next blog post.


