Monday 29 March 2021

Part 3/3 - Wikipedia Clickstream analysis with Neo4j - some Graph Data Science & Graph Exploration

In the previous blogposts, I have tried to show
  • How easy it is to import the Wikipedia Clickstream data into Neo4j. You can find that post over here.
  • How you can start doing some interesting querying on that data, with some very simple but powerful Cypher querying. You can find that post over here.
In this final blogpost I want to try to add two more things to the mix. First, I want to see if I can do some useful "Graph Data Science" on this dataset. I will be using the Neo4j Graph Data Science Library for this, as well as the Neuler Graph App that plugs into the Neo4j Desktop. Next, I will be exposing some of the results of these Graph Data Science calculations in Neo4j's interactive graph exploration tool, Neo4j Bloom. So let's do that. 

Installing/Running the Graph Data Science Library

Thanks to Neo4j's plugin architecture, and the Neo4j Desktop tool around that, it is now super easy to install and run the Graph Data Science Library - it installs in a few clicks and then you are off to the races:
Next, we can fire up the Neuler graph app:
And start running some algorithms. Let's see what we can find out here. We can choose from quite a wide variety of different algorithms, and the first thing that I think we could explore here are the Centrality algorithms, as they will hopefully tell us something about the most important pages in the Wikipedia dataset that we have just loaded.
So let's start.
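Before any algorithm can run, the Graph Data Science library needs a named in-memory projection of the graph to operate on. Neuler creates one for you behind the scenes, but you can also create it by hand. A minimal sketch in Cypher, assuming the model from the import post - :Page nodes, a :LINKS_TO relationship and a "quantity" weight property (adjust these names to your own model):

CALL gds.graph.create(
    'clickstream',                          // name of the in-memory graph
    'Page',                                 // node projection
    {LINKS_TO: {properties: 'quantity'}}    // relationship projection, keeping the weight
);

The algorithm calls can then refer to this 'clickstream' graph by name.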

The most basic centrality algorithm: Degree

Probably the least sophisticated measure of centrality in a graph is simply the number of connections (i.e. the degree) that a particular node has. Yes, very basic indeed, but as Euler already illustrated with the 7 bridges of Königsberg, this can be a very important and indicative metric to take into account. So we quickly go about configuring the algorithm:
And we get the result very quickly afterwards:
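For reference, the Cypher that Neuler generates for this looks roughly like the following - the 'clickstream' graph name is an assumption, and depending on your GDS version the procedure may still live in the alpha tier as gds.alpha.degree:

CALL gds.degree.stream('clickstream')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).title AS page, score AS degree
ORDER BY degree DESC
LIMIT 10;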

Another way to understand the importance of the different pages in the dataset, is to use the algorithm pioneered by Google - Pagerank.

Calculating the Pagerank centrality metric

This metric is really interesting in many ways. Those of you old enough to remember Web Search before Google will testify that it was super sh!tty and did not work - and this at a time when the web was infinitely smaller than it is now. Google revolutionised Web Search, and built a subsequent software empire, by understanding how the different pages on the web were linked together, and assigning each page a higher score (a ranking) if it was well linked into the web, and a lower score if it was poorly linked. This is of course very applicable to this particular dataset. So let's configure the algorithm:
The results are actually available quite quickly:
And can also be visualised in Neuler:
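Under the hood, Neuler runs something like the following - here in write mode, so that every :Page node gets a "pagerank" property we can query later (the graph name, property name and parameter values are assumptions):

CALL gds.pageRank.write('clickstream', {
    maxIterations: 20,
    dampingFactor: 0.85,
    writeProperty: 'pagerank'
})
YIELD nodePropertiesWritten, ranIterations;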

Other centrality algorithms to be run

I won't bore you with many more details, but I did spend a little more time calculating a few more centrality metrics, specifically 
  • the approxBetweenness metric: this tries to understand how "between" a node is by calculating how often that node lies on the shortest path between two other nodes. It does so on a subset of the dataset - the "real" Betweenness score (which calculates the same metric on all pairs of nodes) would be way too costly to run on my little laptop. 
  • the HITS metric: this rates nodes based on two scores
    • the Hub score: how important is this node in the structure of the network
    • the Authority score: how relevant is this node in the network
Both algorithms give you some interesting additional background information, and are not that hard to run, at all.
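As a sketch, the sampled Betweenness run can be expressed in Cypher like this - the samplingSize parameter is what keeps the calculation tractable on a laptop (the graph name and the sample size are assumptions):

CALL gds.betweenness.stream('clickstream', {samplingSize: 1000})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).title AS page, score
ORDER BY score DESC
LIMIT 10;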

Community detection on this dataset
Using the Graph Data Science library, it's also possible and quite easy to run algorithms that try to understand how different parts of the graph belong together, how they form cliques or communities. This is really interesting conceptually, as you would of course expect visitors to the Wikipedia website to be interested in other pages that belong to the same community, for example.

Calculating the communities can be done in different ways, and in general I think it makes sense to try different approaches on a specific dataset to see what works best. In our example below, we will be working with the Louvain, and the Label Propagation method of understanding these communities.


The Louvain method is an algorithm for detecting communities in networks. It maximizes a modularity score for each community, where the modularity quantifies the quality of an assignment of nodes to communities. We can configure it quite easily, and it does not take a lot of time to run. Here's what the result looked like:
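In Cypher, the Louvain run in write mode looks roughly like this, writing the community number into a "louvain" property on every :Page node (the graph name is an assumption):

CALL gds.louvain.write('clickstream', {writeProperty: 'louvain'})
YIELD communityCount, modularity;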

The algorithm has now created a new "community number" property on our :Page nodes, so that we can find what pages belong together in the same theoretical community very easily. In order to use that property, however, it makes sense to create an index.

CREATE INDEX FOR (p:Page) ON (p.louvain);

And then we can proceed.

Label Propagation

The Label Propagation Algorithm is a fast algorithm for finding communities in a graph: it propagates labels through the network and forms communities based on that process. On initialisation, every node has its own label, but as labels propagate, densely connected groups of nodes quickly reach a consensus on a unique label. At the end of the propagation only a few labels remain.

Here's what the result looks like:
This algorithm has now created an "lpa" property on all the :Page nodes. Let's now try to get a better understanding of the results of these algorithms.
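The equivalent hand-written call, writing the community id into the "lpa" property, would look something like this (the graph name is an assumption):

CALL gds.labelPropagation.write('clickstream', {writeProperty: 'lpa'})
YIELD communityCount, ranIterations, didConverge;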

Understanding the communities

Here's a first interesting little query: let's find out how many pages there are in each Louvain community, starting with the smallest:

match (p:Page)
return p.louvain, count(p)
order by count(p) asc
limit 10;

We can also explore a community of beer pages:

CALL db.index.fulltext.queryNodes("pagetitleindex", "*beer*") YIELD node, score
RETURN node.louvain, collect(node.title), count(node.title)
ORDER BY count(node.title) DESC
LIMIT 10;

Or explore the community of Neo4j pages:

CALL db.index.fulltext.queryNodes("pagetitleindex", "*neo4j*") YIELD node, score
RETURN node.louvain, collect(node.title), count(node.title)
ORDER BY count(node.title) DESC
LIMIT 10;

All of these queries can be run in the Neo4j browser. If we want to actually give this type of query capability to less technical people, we may want to consider using a different tool for that.

Graph exploration in Bloom

One last quick thing I would like to mention here: we can of course now explore the results of all of the above algorithms, as well as the entire dataset that we already had, using a tool like Neo4j Bloom. We can add some specific stylings for the relationships, so that the links between the pages become thicker when the "quantity" properties on the links are higher:
The result then looks like this:

I have also created a few search phrases. This following one allows you to find the Louvain community of a certain node:

Here's the Louvain community of the Page with title "Neo4j":

Or you can create a search phrase that finds the paths between two Page nodes:
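A Bloom search phrase is essentially a parameterised Cypher query. A sketch of what such a path-finding phrase could look like, with $page1 and $page2 as the user-supplied parameters (the parameter names and the path-length cap are assumptions):

MATCH (p1:Page {title: $page1}), (p2:Page {title: $page2})
MATCH path = shortestPath((p1)-[*..5]-(p2))
RETURN path;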

If I run that for "Neo4j" and "Beer" I get the following:

If I run that for "Neo4j" and "Antwerp" I also get an interesting graph:

The Bloom perspective can be found on github, of course.

I hope this has illustrated some of the interesting parts of what you can do with graph data science, and with graph exploration, and that could be of use for you and your work.

Let me know if you have any comments or questions - I would be happy to hear!



PS: all scripts are nicely put together on github.
