Monday 8 August 2016

The Great Olympian Graph - part 1/3


After my previous experiments with some sports data (most recently, the Tour de France 2016 results) in Neo4j, I saw the 2016 Olympic Games coming up, and thought: well, there MUST be some interesting datasets to find around that - especially now that one of my favourite bike-riders in the world, Greg Van Avermaet, has won the Gold Medal in the Cycling Road Race. Still so excited!!!




I did a bit of research and decided to settle on a combination of two datasets:
Just before the London Olympics in 2012,
So I got to work in my favourite tool, Google Sheets, and started to consolidate the data into one big sheet. Nothing terribly fancy:
  • I had to reformat some of the medallists' names 
  • I had to rework some of the country names 
  • I had to map the sports and disciplines into a fairly consistent structure - which is probably the most difficult part, and probably also the part where the new dataset still has the most issues - specifically for the 2012 medallists. 
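I did all of this by hand in Google Sheets, but the first two clean-up steps could also be sketched in a few lines of Python. This is only an illustration under assumptions: the "SURNAME, Given" input shape, the alias table, and the function names `normalise_name` and `normalise_country` are hypothetical and not taken from the actual sheets.

```python
# Hypothetical alias table: maps lower-cased spellings to one preferred country name.
COUNTRY_ALIASES = {
    "united kingdom": "Great Britain",
    "uk": "Great Britain",
}


def normalise_name(raw):
    """Turn 'SURNAME, Given' into 'Given Surname'; otherwise only tidy whitespace."""
    cleaned = " ".join(raw.split())  # collapse repeated spaces, trim the ends
    if "," not in cleaned:
        return cleaned
    surname, given = [part.strip() for part in cleaned.split(",", 1)]
    return f"{given} {surname.title()}"


def normalise_country(raw):
    """Map known aliases onto one spelling; leave unknown names untouched."""
    cleaned = " ".join(raw.split())
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)
```

For example, `normalise_name("VAN AVERMAET, Greg")` returns `"Greg Van Avermaet"`, and `normalise_country("United Kingdom")` returns `"Great Britain"`. The sports and disciplines mapping is harder to automate, which is why it stays a manual job here.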
The new sheet is kind of like an update to the old Guardian sheet - and can be found over here.



With the dataset in place, all I had to do was load it. Essentially I ended up with 4 .csv files:

  1. One for the countries, and their 3-letter country-codes 
  2. One for the hosting cities, and their mappings to the above countries 
  3. A two-level categorisation (let's call it a "sports taxonomy") containing Sports (e.g. Aquatics, Cycling, ...) and Disciplines (e.g. Swimming, Track Cycling, ...) 
  4. A much bigger and longer sheet / .csv file that contains the actual data about each and every one of the 30,000+ medallists. 
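To make the relationship between the four files a bit more concrete, here is a minimal Python sketch of how they fit together. The column names and the sample rows are placeholders that I made up for illustration (the real sheets may well be laid out differently), and `read_rows`, `build_lookups` and `enrich_medallists` are hypothetical helper names.

```python
import csv
import io

# Placeholder data standing in for the four .csv files; column names are assumptions.
COUNTRIES_CSV = """code,name
BEL,Belgium
GBR,Great Britain
"""

CITIES_CSV = """city,country_code
London,GBR
"""

TAXONOMY_CSV = """sport,discipline
Aquatics,Swimming
Cycling,Track Cycling
"""

MEDALLISTS_CSV = """year,city,discipline,athlete,country_code,medal
2012,London,Track Cycling,Example Athlete,GBR,Gold
2012,London,Swimming,Another Athlete,BEL,Bronze
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_lookups():
    """Build one dictionary per small reference file."""
    countries = {r["code"]: r["name"] for r in read_rows(COUNTRIES_CSV)}
    host_country = {r["city"]: r["country_code"] for r in read_rows(CITIES_CSV)}
    sport_of = {r["discipline"]: r["sport"] for r in read_rows(TAXONOMY_CSV)}
    return countries, host_country, sport_of


def enrich_medallists():
    """Attach country name, sport and host country to every medallist row."""
    countries, host_country, sport_of = build_lookups()
    enriched = []
    for row in read_rows(MEDALLISTS_CSV):
        enriched.append({
            **row,
            "country": countries[row["country_code"]],
            "sport": sport_of[row["discipline"]],
            "host_country": countries[host_country[row["city"]]],
        })
    return enriched
```

The three small files act as lookup tables, and the big medallists file refers to them by country code, city and discipline. Those same references are what would become relationships once the data sits in a graph.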
I think the new sheet is kind of nice and cool - but there are still some data issues left. If you explore it in any level of detail, you will soon find that there's a bit of an issue with the Event names: within each discipline of each sport, there are Events, and these events tell you much more about the specifics of each medal that was awarded. But the names of these events are... messy at best, especially for the most recent addition, the 2012 Olympics data, which seems to use quite different naming. In any case, I figured that I was not going to be asking too many Event-specific questions of the data anyway, so I left it as is.

Let’s explore how to load that data into Neo4j, and get down to some querying.

We'll do that in the next blogpost.

Cheers

Rik
