After my previous experiments with some sports data (most recently, the Tour de France 2016 results) in Neo4j, I recently saw the 2016 Olympic games coming up, and thought: well, there MUST be some interesting datasets to find around that - especially now that one of my favourite bike-riders in the world, Greg Van Avermaet, won the Gold Medal in the Cycling Road Race. Still so excited!!!
I did a bit of research and decided to settle on a combination of two datasets:
- Just before the London Olympics in 2012, The Guardian published a list of all summer Olympic medallists, from 1896 to 2008
- Just after the same 2012 games, The Guardian also published the list of the 2012 medallists
Combining the two into one dataset took a bit of work:
- I had to reformat some of the medallists' names
- I had to rework some of the country names
- I had to map the sports and disciplines into a fairly consistent structure. This was probably the most difficult part, and probably also the part where the new dataset still has the most issues, specifically for the 2012 medallists.
So there is the dataset, and all I had to do was load it. Essentially, I ended up with 4 .csv files:
- One for the countries, and their 3-letter country-codes
- One for the hosting cities, and their mappings to the above countries
- A two-level categorisation (let's call it a "sports taxonomy") containing Sports (e.g. Aquatics, Cycling, …) and Disciplines (e.g. Swimming, Track Cycling, …)
- A much bigger and longer sheet / .csv file that contains the actual data about each and every one of the 30,000+ medallists.
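To make the relationship between these four files a bit more concrete, here is a minimal Python sketch of how a medallist row could be resolved against the other three files. Note that the column names (`code`, `country_code`, `discipline`, and so on) and the handful of sample rows are assumptions made up for illustration; the real files may well be laid out differently.

```python
import csv
import io

# Hypothetical miniature versions of the four files. Column names and
# sample rows are illustrative assumptions, not the real file contents.
COUNTRIES = """code,country
USA,United States
GBR,Great Britain
CHN,China
"""

CITIES = """city,country_code
Beijing,CHN
"""

TAXONOMY = """sport,discipline
Aquatics,Swimming
Cycling,Track Cycling
"""

MEDALLISTS = """year,city,discipline,athlete,country_code,medal
2008,Beijing,Swimming,Michael Phelps,USA,Gold
2008,Beijing,Track Cycling,Chris Hoy,GBR,Gold
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Lookup tables built from the three small files.
countries = {r["code"]: r["country"] for r in read_rows(COUNTRIES)}
host_country = {r["city"]: r["country_code"] for r in read_rows(CITIES)}
sport_of = {r["discipline"]: r["sport"] for r in read_rows(TAXONOMY)}


def resolve(row):
    """Attach country, sport and host-country names to one medallist row."""
    return {
        "athlete": row["athlete"],
        "medal": row["medal"],
        "year": int(row["year"]),
        "country": countries[row["country_code"]],
        "sport": sport_of[row["discipline"]],
        "discipline": row["discipline"],
        "host_country": countries[host_country[row["city"]]],
    }


resolved = [resolve(r) for r in read_rows(MEDALLISTS)]
for r in resolved:
    print(r)
```

The three small files act as lookup tables, and the big medallists file points into them. That is pretty much the shape we will want in the graph as well: countries, cities, sports and disciplines as their own nodes, with the medallists connected to them.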
Let’s explore how to load that data into Neo4j, and get down to some querying.
We'll do that in the next blogpost.