My friend and future colleague Pieter alerted me to an interesting challenge hosted on Belgian Television: a Wiki Wiki Challenge. The assignment is simple: find the link between two pages on the Dutch Wikipedia. How difficult can it be? Well, pretty difficult if you look at the assignments:
- Nicole Kidman -> TO -> Hugo Claus
- De ring van Brussel -> TO -> Thuis
- Okselhaar -> TO -> Postmodernisme
- Henk Rijckaert -> TO -> Chuck Norris
That could be interesting. So what to do? Waste my time browsing through thousands of Wikipedia pages? Or use a sharp tool, and use it wisely? Could that tool be called Neo4j, by any chance?
Importing nl.wikipedia.org into Neo4j
To calculate the paths between these pages, we first had to import the Wikipedia articles into Neo4j. We had done this once before using Graphipedia, an easy-to-use piece of software that converts the downloaded Wikipedia archive into a Neo4j database. It generates a 2.0 store format, but that's an easy upgrade to the 2.2M04 that I have running on my machine. The import took less than 30 minutes on this machine. So then we could start playing around...
Just quickly browsing through the "Pages" (a label in this database) and the "Links" (a relationship type in this database) gave us a feel for the assignment within Neo4j.
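A quick exploratory query along these lines gives a first look at the data. This is only a sketch: it assumes the Page label and title property that the queries in this post use, and it leaves the relationship type unspecified.

```cypher
// Sample a handful of pages and the pages they link to.
match (p:Page)-[r]->(q:Page)
return p.title, type(r), q.title
limit 25
```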
All we needed to do was grab the two nodes from each Wiki Wiki Challenge assignment and run a shortest path algorithm between them, which is conveniently part of the Cypher query language. That would be it, so let's try it out.
Answering the Challenge Questions
Let's go through the 4 questions:
1. Nicole Kidman -> TO -> Hugo Claus
The query for this is pretty easy:
match (hc:Page {title:"Hugo Claus"}), (nk:Page {title:"Nicole Kidman"}),
p = allshortestpaths((hc)<-[*]-(nk))
return p
This gives you the result, and apparently there are quite a few paths possible.
If we limit it to just one result, we quickly see that we are going from
- http://nl.wikipedia.org/wiki/Nicole_Kidman, to
- http://nl.wikipedia.org/wiki/20_juni (Nicole's birthday), to
- http://nl.wikipedia.org/wiki/1971 (Audie Murphy died on 20th of June 1971), to
- http://nl.wikipedia.org/wiki/Hugo_Claus (he published his novel "Schola Nostra" in that year).
Job done!
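For reference, limiting the output to a single path could look roughly like this. It is a sketch that reuses the query shown earlier and simply adds a limit clause:

```cypher
// Same match as before, but keep only one of the shortest paths.
match (hc:Page {title:"Hugo Claus"}), (nk:Page {title:"Nicole Kidman"}),
p = allshortestpaths((hc)<-[*]-(nk))
return p
limit 1
```

Replacing allshortestpaths with shortestpath should have a similar effect, since that function returns a single path.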
We can do the same with the next assignment:
2. De ring van Brussel -> TO -> Thuis
Same kind of query:
match (t:Page {title:"Thuis (televisieserie)"}), (rb:Page {title:"R0 (België)"}),
p = allshortestpaths((t)<-[*]-(rb))
return p
The link turns out to be Dilbeek, where there is a horse-riding stable called "Hof ter Smissen" that is often featured in the TV show Thuis.
Again: job done! Onto the third (bizarre) assignment!
3. Okselhaar -> TO -> Postmodernisme
What is the link between armpit hair and postmodernism? We always wanted to know. A quick query later, we see the answer:
match (o:Page {title:"Okselhaar"}), (pm:Page {title:"Postmodernisme"}),
p = allshortestpaths((o)-[*]->(pm))
return p
This answer is obviously a lot less trivial:
- start at http://nl.wikipedia.org/wiki/Okselhaar
- go to http://nl.wikipedia.org/wiki/1915 - the year when armpit shaving started in the Western World
- go to http://nl.wikipedia.org/wiki/1992, the year that Harry Prenen (who was born in 1915) died
- go to http://nl.wikipedia.org/wiki/Jeff_Koons, the page of the artist who created Puppy in 1992.
- end up at http://nl.wikipedia.org/wiki/Postmodernisme, the style under which Jeff Koons is categorized.
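To read a path like this one as plain text rather than as a picture, the query can return the page titles along each path. This is a sketch that assumes every Page node carries the title property used in the queries above:

```cypher
// List the page titles along each shortest path, plus the number of hops.
match (o:Page {title:"Okselhaar"}), (pm:Page {title:"Postmodernisme"}),
p = allshortestpaths((o)-[*]->(pm))
return [n in nodes(p) | n.title] as titles, length(p) as hops
```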
Here you can see that result in the Neo4j browser:
And another one bites the dust! One more to go!
4. Henk Rijckaert -> TO -> Chuck Norris
Probably the most telling and funniest link is the one between Henk Rijckaert and Chuck Norris. We only need three hops:
- start at http://nl.wikipedia.org/wiki/Henk_Rijckaert
- go to http://nl.wikipedia.org/wiki/16_augustus, his birthday
- go to http://nl.wikipedia.org/wiki/1940, the year that Henri Desgrange died on the 16th of August.
- go to http://nl.wikipedia.org/wiki/Chuck_Norris, who was born on the 10th of March 1940.
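The query for this last assignment is not shown, but following the pattern of the previous three it would presumably look something like this. The exact page titles are an assumption based on the URLs above:

```cypher
// Assumption: both pages are stored under exactly these titles.
match (hr:Page {title:"Henk Rijckaert"}), (cn:Page {title:"Chuck Norris"}),
p = allshortestpaths((hr)-[*]->(cn))
return p
```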
Look at it over here:
That was it. Easy peasy, and very cute to do, even though pathfinding over a graph also has many very serious use cases for which Neo4j is a fantastic fit. People like TomTom, the Belgian Railroad, UGent, and many others use this capability for serious work, and it is very powerful.
If you want to learn a bit more about similar use cases on Wikipedia, in English, then please take a look at WikiDistrict, developed by Kernix using Neo4j.
Hope you found this as interesting as we did.
Cheers
Rik & Pieter