Monday 29 March 2021

Part 1/3: Wikipedia Clickstream analysis with Neo4j - the data import

Alright, here's a project that has been a long time in the making. As you may know from reading this blog, I have had an interest, a fascination even, with all the wonderful use cases that the graph ecosystem holds. To me, it continues to be such a fantastic thing to be able to work in - graphs are everywhere, and more and more people are waking up to the fact that they really should look at their data as a network, and leverage the important relationships that are often hidden from plain sight.

One of these use cases that has been intriguing me for years, literally, is clickstream analysis. In fact, I wrote about this already back in 2013 - amazing when you think about it. Other people, like our friends at Snowplow Analytics, have been writing about this as well, but somehow the use case has been snowed under a little maybe. With this blogpost, I want to illustrate why I think that this particular use case - which is really a typical pathfinding application when you think about it, is such a great fit for Neo4j.

A real dataset: Wikipedia clickstream data

This crazy journey obviously started with finding a good dataset. There's quite a few of them around, but I wanted to find something realistic, representative and useful. So after some digging around I found the fantastic site of Wikimedia, where they actually structurally make all aggregated clickstream data of Wikipedia's pages available. You can just download them from this their website, and grab the latest zipped up files. In this blogpost, I worked with the February 2021 data, which you can find over here.

When you dowload that fine, you will find a tab-separated text file that includes the following 4 fields
  • prev: the previous page that the navigation came from
  • curr: the current page that the navigation came into
  • type: the description of the type of navigation that was occuring. There's different possible values here
    • link: a regular link between pages
    • external: a link from an external page to the current page
    • other: a different type - which can occur if people try to hide their navigation patterns
  • n: the number of occurrences of the (prev, curr) pair - so the number of times this navigation took place.
So this is the dataset that we want to import into Neo4j. But - we need to do one tiny little fix: we need to escape the “ characters that are in the dataset. To do that, I just opened the file in a text editor (eg. TextEdit on OSX) and did a simple Find/Replace of " with "". This take care of it.

Once we have that, we are ready to fire up the Neo4j Desktop, and put the unzipped file into the <your neo4j server home directory>/import directory. 

Once we have that, we are almost ready for the data import. All we need to do before we fire up our empty Neo4j server, is to install the APOC plugin. This is just a few clicks in the Neo4j Desktop, and we should be ready to go in seconds.

The data model that we will be loading the data into with the script below is of course very, very simple, and typically recursive:

So we can kind of already suspect that the recursiveness of the data model is going to imply some very interesting graph querying. Lots of fun ahead!

Loading the Wikipedia Clickstream data

I have put the scripts onto github, of course. You can find it over here, but let's go through it together.

First thing we need to do is to create a uniqueness constraint on the Page titles, as that will greatly improve the speed of import (as we will be using MERGE for that).


Once we have that, we can run the import job. As per usual with larger imports like this, it's quite important to make sure that you make the batching of your write operations inline with the size of your memory configuration. On my laptop, I have the following memory settings for this database:


which allows for fairly large transactions. You should tweak this for your case, obviously.

Once you have that, we can use apoc.periodic.iterate for the batching of the import.

Call apoc.periodic.iterate('
Call apoc.load.csv("file:///clickstream-enwiki-2021-02.tsv", {sep:"TAB", header:false}) yield list as line return line',
'MERGE (p1:Page {title: line[0]})
MERGE (p2:Page {title: line[1]})
CREATE (p1)-[:LINKS_TO {type: line[2], quantity: toInteger(line[3])}]->(p2)
{batchSize:25000, iterateList:true, parallel:false});

Note that you may need to tweak the batchSize parameter for your environment, depending on the memory configuration. This takes some time to complete, but once it does I roughly had 4.6M nodes, and 31M relationships imported into my Neo4j database.

Cleaning up the data for clickstream analysis

As I started to look into this data, I found that there are actually quite a few nodes and relationships that really don't give me a lot of useful information. All of the links that have a type equal to "other" don't really mean much. All the links that start from a start node that is "other-<somethingsomething>" also does not help us understand much. So I decided to clean these out of the database, with a few household queries. Let's take a look.

Here's how I remove the "other" links

match ()-[r:LINKS_TO {type:"other"}]->()
delete r;

And here's how I removed all the "other-" source nodes and the links that they had into the various pages. First I wanted to make sure that no information was lost, so I stored the "other" link quantities in separate properties on the target page nodes, before deleting them.

match (source:Page {title:"other-empty"})-[r:LINKS_TO]->(target:Page)
set target.`additional_other-empty` = r.quantity;
match (source:Page {title:"other-external"})-[r:LINKS_TO]->(target:Page)
set target.`additional_other-external` = r.quantity;
match (source:Page {title:"other-internal"})-[r:LINKS_TO]->(target:Page)
set target.`additional_other-internal` = r.quantity;
match (source:Page {title:"other-other"})-[r:LINKS_TO]->(target:Page)
set target.`additional_other-other` = r.quantity;
match (source:Page {title:"other-search"})-[r:LINKS_TO]->(target:Page)
set target.`additional_other-search` = r.quantity;

Once I have that, I can then proceed to just remove all the "other-"relationships. Since there's a lot of them, I decided to batch this operation again with apoc.periodic.iterate:

Call apoc.periodic.iterate('
match (source:Page)-[r]-()
where source.title starts with "other-"
return r, source',
'Delete r',
{batchSize:10000, parallel:false});

And the finally I can delete the "other-" nodes - there's just a handful so that's super easy.

match (source:Page)
where source.title starts with "other-"
delete source;

The result is beautiful, shiny, clickstream database in Neo4j.

The scripts above are on github, so take a look and take it for a spin!

In the next part, part 2, of this blogpost series, I will be showing you what we can do with this data.



PS: all scripts are nicely put together on github.

No comments:

Post a Comment