Tuesday 3 March 2015

Challenge ACCEPTED: Belgian TV holds a "wiki wiki" challenge

My friend and future colleague Pieter alerted me to an interesting challenge hosted  on Belgian Television: a Wiki Wiki Challenge. The assignment is simple: find the link between two pages on the Dutch Wikipedia. How difficult can it be? Well, pretty difficult if you look at the assignments:
  1. Nicole Kidman -> TO -> Hugo Claus
  2. De ring van Brussel -> TO -> Thuis
  3. Okselhaar -> TO -> Postmodernisme
  4. Henk Rijckaert -> TO -> Chuck Norris
That could be interesting. So what to do? Waste my time browsing through thousands of wikipedia pages? Or use a sharp tool, and use it wisely? Could that tool be called, Neo4j, by any chance?


Importing nl.wikipedia.org into Neo4j

In order to calculate these paths between these pages, we first would have to import the Wikipedia articles into Neo4j. We had done this once before, using Graphipedia, a very easy piece of software to convert the downloaded wikipedia archive into a Neo4j database. It generates a 2.0 store format, but that's an easy upgrade to the 2.2M04 that I have running on my machine. The import was done in less than 30mins on this machine. So then we could start playing around...

Just quickly browsing through the "Pages" (a label in this database) and the "Links" (a relationship type in this database) gave us a feel for the assignment within Neo4j.


All we needed to do was to grab the two nodes in the Wiki Wiki challenge assignments, and run a "ShortestPath" algorithm on it - which is conveniently part of the Cypher query language. That would be it - so let's try it out.

Answering the Challenge Questions

Let's go through the 4 questions:

1. Nicole Kidman -> TO -> Hugo Claus

The query for this is pretty easy: 
 match (hc:Page {title:"Hugo Claus"}), (nk:Page {title:"Nicole Kidman"}),   
 p = allshortestpaths((hc)<-[*]-(nk))  
 return p  

which gives you this result. So apparently there are quite a few paths possible:


If we just limit it to one result

you quickly figure out that we are going from
Job done!

Similarly we can do the same with the next assignment:

2. De ring van Brussel -> TO -> Thuis

Same kind of query
 match (t:Page {title:"Thuis (televisieserie)"}), (rb:Page {title:"R0 (Belgiƫ)"}),   
 p = allshortestpaths((t)<-[*]-(rb))  
 return p  

and the result is in the city of Dilbeek, where there is a horseriding stable called "Hof ter Smissen" that is often featured in the TV show Thuis.


Again: job done! Onto the third (bizarre) assignment!

3. Okselhaar -> TO -> Postmodernisme

What is the link between armpit hair and postpodernism? We always wanted to know. A quick query later, we see the answer:

 match (o:Page {title:"Okselhaar"}), (pm:Page {title:"Postmodernisme"}),   
 p = allshortestpaths((o)-[*]->(pm))  
 return p  

This answer is obviously a lot less trivial:
Here you can see that result in the Neo4j browser:

And another one bites the dust! One more to go!

4. Henk Rijckaert -> TO -> Chuck Norris

Probably the most telling and funniest link is the one between Henk Rijckaert and Chuck Norris. We only need three hops:
Look at it over here:

That was it. Easy peasy, and very cute to do - even though there are so many very very serious use cases where pathfinding over a graph is actually a fantastic use case for Neo4j. People like TomTom, the Belgian Railroad, UGent, and many others use this capability for very serious use cases - and it is so so powerful.

If you want to learn a bit more about similar use cases on Wikipedia, in English, then please take a look at WikiDistrict, developed by Kernix using Neo4j.

Hope you found this as interesting as we did.

Cheers

Rik & Pieter

No comments:

Post a Comment