Alright, here's a project that has been a long time in the making. As you may know from reading this blog, I have had an interest, a fascination even, with all the wonderful use cases that the graph ecosystem holds. To me, it continues to be such a fantastic thing to be able to work in - graphs are everywhere, and more and more people are waking up to the fact that they really should look at their data as a network, and leverage the important relationships that are often hidden from plain sight.
One of these use cases that has been intriguing me for years, literally, is clickstream analysis. In fact, I wrote about this already back in 2013, which is amazing when you think about it. Other people, like our friends at Snowplow Analytics, have been writing about this as well, but somehow the use case has been snowed under a little. With this blogpost, I want to illustrate why I think that this particular use case, which is really a typical pathfinding application when you think about it, is such a great fit for Neo4j.
A real dataset: Wikipedia clickstream data
This crazy journey obviously started with finding a good dataset. There are quite a few of them around, but I wanted to find something realistic, representative and useful. So after some digging around I found the fantastic site of Wikimedia, where they systematically make all aggregated clickstream data of Wikipedia's pages available. You can just download them from their website, and grab the latest zipped up files. In this blogpost, I worked with the February 2021 data, which you can find over here.
When you download that file, you will find a tab-separated text file that includes the following 4 fields:
- prev: the previous page that the navigation came from
- curr: the current page that the navigation came into
- type: the description of the type of navigation that was occurring. There are different possible values here:
  - link: a regular link between pages
  - external: a link from an external page to the current page
  - other: a different type, which can occur if people try to hide their navigation patterns
- n: the number of occurrences of the (prev, curr) pair, so the number of times this navigation took place.
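To make that layout a bit more tangible, here is a minimal Python sketch of how the four fields could be read. This is not part of my import process, just an illustration: the function name `read_clickstream` and the sample rows are made up for the example, and I am assuming the file has no header line.

```python
import csv
import io

# Field order as described above: prev, curr, type, n
FIELDS = ["prev", "curr", "type", "n"]


def read_clickstream(text):
    """Parse tab-separated clickstream rows into dicts, with n as an int."""
    # QUOTE_NONE: treat double quotes as ordinary characters, since page
    # titles in the data can contain them.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        if len(record) != len(FIELDS):
            continue  # skip lines that do not have exactly 4 fields
        row = dict(zip(FIELDS, record))
        row["n"] = int(row["n"])
        rows.append(row)
    return rows


# Made-up sample rows, only to show the shape of the data
sample = "other-search\tNeo4j\texternal\t1234\nGraph_database\tNeo4j\tlink\t567\n"
rows = read_clickstream(sample)
for row in rows:
    print(row["prev"], "->", row["curr"], row["type"], row["n"])
```

Every row is essentially a weighted, typed relationship from one page to another, which is why it maps so naturally onto a graph.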
So this is the dataset that we want to import into Neo4j. But we need to do one tiny little fix: we need to escape the " characters that are in the dataset. Doubling a quote is the standard CSV way of escaping it, so I just opened the file in a text editor (e.g. TextEdit on OS X) and did a simple Find/Replace of " with "". This takes care of it.
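If the file is too big to open comfortably in a text editor, the same Find/Replace could be scripted. Below is a small Python sketch of that idea, assuming the unzipped file is UTF-8 encoded. The function names and the file names in the usage comment are hypothetical, and this is not the approach I actually used (that was the text editor).

```python
def escape_quotes(line):
    """Double every double quote, the usual CSV way of escaping it."""
    return line.replace('"', '""')


def escape_file(src_path, dst_path):
    """Stream src_path to dst_path line by line, doubling all double quotes."""
    # newline="" keeps the original line endings untouched
    with open(src_path, encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(escape_quotes(line))


# Hypothetical usage, with placeholder file names:
# escape_file("clickstream.tsv", "clickstream-escaped.tsv")
```

Because it works line by line, this never has to hold the whole file in memory. Note that it should only be run once on a given file: running it again would double the quotes a second time.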