Friday 12 August 2016

The Great Olympian Graph - part 3/3

In part 1 of this blogpost series, we created and prepared a dataset with all the modern Olympian medallists from 1896 to 2012. In part 2, we loaded all that data into Neo4j: here's that article. Now, we of course want to do some interesting queries on the dataset, and see if the Graph will yield any interesting insights - as it usually does.

Easy querying - number of sports per Olympic game

Let's start with something easy - doing some counting of the numbers of sports in every game since 1896. Here's how we do that:
//number of sports per game 
match (y:Year)<--(e:Event)-->(d:Discipline)-->(s:Sport) 
with distinct y.name as game, s.name as sport 
return game, count(sport)order by game ASC
Then you can see that the number of sports has not really changed that dramatically over the years: In the early days we immediately went from 9 sports at the first Olympic game in 1896, to 19 in 1900.


Then the number remained fairly constant for years - settling at 22 in 2012.
As an orienteer (our sport applied to be an Olympic sport fairly recently, but got rejected), I follow that with some interest, so I wanted to take a look at which sports were previously exercised at the Olympic. So instead of just counting the Sports, let's just collect and return them:
//number of sports per game 
match (y:Year)<--(e:Event)-->(d:Discipline)-->(s:Sport) 
with distinct y.name as game, s.name as sport 
return game, count(sport), collect(sport)order by game ASC
And then came a first surprise from the data: did you know that Tug of War was an Olympic sport from 1900 to 1920?



That looks just so unbelievably cool:
Wish we still had that!

The Medals Table

Next up, the "classic" tally board of Olympic medals per country.  Of course we could just "google it" and find the inevitable Wikipedia page, but let's do it with Cypher, shall we? Let's randomly pick the 1956 Olympics in Melbourne (which, notoriously, were the only Olympics that were held in two locations - read about it here).
//medals table 
match (y:Year {name:"1956"}),(c:Country)<-[:REPRESENTS]-(a:Athlete)-[:WINS]->(m:Medal)-[:REWARD_FOR]->(e:Event)-[:AT_GAMES]->(y)with y, c, count(m) as totalmedals 
match (c)<-[:REPRESENTS]-(a:Athlete)-[:WINS]->(m:Medal)-[:REWARD_FOR]->(e:Event)-[:AT_GAMES]->(y) 
return c.name, totalmedals, m.type, count(m) 
order by totalmedals desc 
limit 10;
The results are there in no time:

Now we can of course also look at the different other dimensions, and for example look at the number of medals per sport per country:
//medals per sport per country
match (c:Country)<--(a:Athlete)-->(m:Medal)-->(e:Event)-->()-->(s:Sport),
(e)--(y:Year {name:"1956"})
return c.name, s.name, count(s) as NrOfMedals
order by NrOfMedals desc;
And we get the following results:

Nice! Now we can go an explore another dimension.

The Home Advantage?

So then I thought of some interesting other questions - like for example "who actually won a medal when competing in an Olympic game hosted in their own home country?". It took me a couple of interations to get the query right, but then I realised it's just a matter of looking at the loop in the data model, which is easy enough. Here's the query:
//athletes with medals in games hosted in their home countries 
MATCH (y:Year)<--(e:Event)<--(m:Medal)<--(a:Athlete)-->(co:Country)<--(c:City)-->(y) 
return y.name, c.name, co.name, m.type, count(a) 
order by y.name desc
And then you get this:
Which I thought was interesting - because immediately I noticed that a small country like Greece, actually got a LOT of Silver medals in 2004. Why is that? Let's look around. So I immediately found this page on Wikipedia, which is giving me very different numbers:

So then I started looking at the details, and added a collection of Athlete names to the query:
//same query with names 
MATCH (y:Year)<--(e:Event)<--(m:Medal)<--(a:Athlete)-->(co:Country)<--(c:City)-->(y) 
return y.name, c.name, co.name, m.type, count(a), collect(a.name) 
order by y.name desc
And then you immediately see that the Women's waterpolo team was getting individual medals, and that was being counted differently in our dataset:
Last but not least, we can now start wrapping up our query examples, looking at the best of the best: the Top Olympian medallists.

The Top Olympians


Let's do that simple query first:
//top Olympians 
match (a:Athlete) 
with a, size((a)-[:WINS]->()) as countmedals 
where countmedals > 1 
return a.name, countmedals 
order by countmedals desc 
limit 10
and then we see the giants like Larisa Latynina and Michael Phelps:

We can of course also split that up by medal type:
//top Olympians and their types of medals 
match (a:Athlete) 
with a, size((a)-[:WINS]->()) as countmedals 
order by countmedals desc 
limit 10 
match (a)--(m:Medal) 
return a.name, m.type, count(m) 
order by a.name, count(m) desc;
and we see a bit more detail:

And then finally, I decided to split this up by different games:
//Olympians with medals in different games 
match (a:Athlete)--(m:Medal)--(e:Event)--(y:Year) 
return a.name as name, collect(y.name+": "+m.type) as Years 
order by size(Years) desclimit 10;
And then I start to see some more familiar names like Anky van Grunsven (a famous Dutch dressage champion):


Last but not least, I decided to try and find the different Athletes that had actually won an Olympic medal in more than one sport - and if so, when that happened. Here are those queries:
//Olympians with medals in different sports 
match (a:Athlete) 
with a, size((a)-[:WINS]->()) as countmedals 
limit 100 
where countmedals > 1 
match (s:Sport)--(:Discipline)--(:Event)--(m:Medal)--(a) 
with distinct a,s 
match (a)-->(c:Country) 
with a.name as Athlete, c.name as Country, collect(s.name) as Sports 
where size(Sports)>1 
return Athlete, Country, Sports;
Guess what: only 7 people ever succeeded in doing that:
That looks interesting! Now, of course I had to wonder that this probably did not happen very recently, as sports in general and Olympic sports more specifically have gotten SOOOOO much more professional - you just can't get any results there if you don't dedicate your life to it these days. So let's get the most recent Olympians with medals in different sports:
//Most recent Olympian with medals in different sports 
match (a:Athlete) 
with a, size((a)-[:WINS]->()) as countmedals 
limit 100 
where countmedals > 1 
match (s:Sport)--(:Discipline)--(:Event)--(m:Medal)--(a) 
with distinct a,s 
with a, collect(s.name) as Sports 
where size(Sports)>1 
with a 
match path=((s:Sport)<--(:Discipline)<--(e:Event)<--(:Medal)<--(a)-->(c:Country)), (e)-->(y:Year) 
return path,y 
order by y.name desc;
and then the graph immediately shows us that it has been more than a century since this last happened!




Just read the fascinating story of Otto Herschmann if you want to get a feel for how much the Olympics have changed - from a rich-man's passtime to a full-time profession.

All of the above queries are of course on github for you to explore and expand - I hope you enjoyed this series as much as I did.

Cheers

Rik

3 comments:

  1. Thanks Rik - great post. I used this last night to demonstrate Neo4j to the Pittsburgh Python Users Group. Lots of smart people. Have you updated the data since creating this? I get different results on a lot of these - especially the 'top olympians' query (Michael Phelps is #11 on my search). Also - on the last search - it looks like the results are being constrained to different events during the same year. There are actually lots of athletes who have medalled in different sports over different years: https://en.wikipedia.org/wiki/List_of_athletes_with_Olympic_medals_in_different_disciplines

    Looked for example for Sven Thofelt of Sweden who medaled in Fencing and Pentathalon - 1920-1928. Is this a data problem or is there a search that will get it?

    ReplyDelete
    Replies
    1. Ah - looks like you have caught a little query problem there. My original query was:

      //Olympians with medals in different sports
      match (a:Athlete)
      with a, size((a)-[:WINS]->()) as countmedals
      limit 100
      where countmedals > 1
      match (s:Sport)--(:Discipline)--(:Event)--(m:Medal)--(a)
      with distinct a,s
      match (a)-->(c:Country)
      with a.name as Athlete, c.name as Country, collect(s.name) as Sports
      where size(Sports)>1
      return Athlete, Country, Sports

      and *of course* that is wrong. The query really should be

      //Olympians with medals in different sports
      match (a:Athlete)
      with a, size((a)-[:WINS]->()) as countmedals
      where countmedals > 1
      match (s:Sport)--(:Discipline)--(:Event)--(m:Medal)--(a)
      with distinct a,s
      match (a)-->(c:Country)
      with a.name as Athlete, c.name as Country, collect(s.name) as Sports
      where size(Sports)>1
      return Athlete, Country, Sports

      This removes the LIMIT in the first part - and then you get 91 athletes. Thofelt is one of them. My bad.

      I did not update the data otherwise - so not sure what else could be playing. Please do note that the results for 2012 could be a bit weird as the Event naming from 1896-2008 was consistent - but it was not consistent with 2012. I did not have time / was too lazy to try and make it consistent.

      Delete
  2. Thank you! I copied your query (Ctrl-C) and got 749 rows back. This made a really nice example of the power of graphs. It was well-received live.

    ReplyDelete