Monday 29 March 2021

Part 2/3: Wikipedia Clickstream analysis with Neo4j - queries and exploration

In the previous blogpost, I showed you how easy it was to import data into Neo4j from the official Wikipedia clickstream data. I am sure you would agree that it was surprisingly easy to import a reasonably sized dataset like that, within a very reasonable timeframe. So now we can have some fun with that data, and start applying some graph queries to it. All of these queries are also on github, of course, and you can play around with them there as well.

So let's take a look at some of these queries. 

Some data profiling and exploration

Here's a very simple query to give you a feel for the dataset:

match (n)-[r:LINKS_TO]->(m)
return distinct r.type, count(r);
match (n) return count(n);

The results are telling:

And so now we can start taking a look at some specific links between pages. One place to investigate would be the Neo4j wikipedia page. Here's a query that looks at the source pages that are generating traffic into the Neo4j wikipedia page:

match (source:Page)-[sourcelink:LINKS_TO]->(neo:Page {title:"Neo4j"}) 
return source, sourcelink, neo;

That gives you a very simple graph:

So what would happen if we would take that one step further, and look at the source pages of the source pages for traffic to the Neo4j website. That's an easy enough variable length path query on Neo4j:

match path = (sourcepage:Page)-[:LINKS_TO*..2]->(neopage:Page {title: "Neo4j"})
return path
limit 100;

That is starting to give you an even richer view:
That's already very cool - but let's continue.

Understanding the importance of the links

Now, in order to understand the importance of these links and paths, we are going to start working with the relationship property quantity that is going to give us a feel for the relative importance of every link. Here's what we'll do:

match (source:Page)-[sourcelink:LINKS_TO]->(neo:Page {title:"Neo4j"}) 
return source.title, sourcelink.quantity as nrofclicks
order by nrofclicks desc;

This is the result:

And we can actually do that for the links two hops away:
match (source:Page)-[sourcelink:LINKS_TO*..2]->(neo:Page {title:"Neo4j"}) 
return source.title, REDUCE(sumofq = 0, r IN sourcelink | sumofq + r.quantity) AS total
order by total desc
limit 10;
Or if you want to do it a bit different and understand the actual full path a bit better, you could do this:

match (sourceofsource:Page)-[sourceofsourcelink:LINKS_TO]->(source:Page)-[sourcelink:LINKS_TO]->(neo:Page {title:"Neo4j"})
return sourceofsource.title+"==>"+source.title+"==> Neo4j" as pages, sourceofsourcelink.quantity+sourcelink.quantity as total
order by total desc
limit 10;

and get this result:

This is super interesting, but a bit of an ugly query. If we would want to do this at 4 hops deep, the cypher statement would become extremely tedious and ugly. That's why we actually can reformulate that query using pattern comprehension and the reduce function:

match path = (source:Page)-[sourcelink:LINKS_TO*..2]->(neo:Page {title:"Neo4j"}) 
RETURN [node IN nodes(path) | node.title] as titles
reduce(sumofq = 0, r in relationships(path) | sumofq + r.quantity) as total
order by total desc
limit 10;

This gives you the same result as above, but all of a sudden I can also run this for depts of 4 or more even. Here's the result at dept 2:
or at depth 4:
If you want to represent this as a graph, you would get:

Which highlights a very interesting link between the British royals and Neo4j :) ...

Of course we can also do the reverse, and look at where people are navigating TO from the Neo4j wikipedia page. The only we need to do is to reverse the direction of the query:

match path = (target:Page)<-[targetlink:LINKS_TO*..4]-(neo:Page {title:"Neo4j"}) 
return reverse([node IN nodes(path) | node.title]) as titles, 
reduce(sumofq = 0, r in relationships(path) | sumofq + r.quantity) as total
order by total desc
limit 10;

Which then again shows some very interesting outgoing links:
Clearly this is not very sophisticated yet in terms of clickstream analysis, but it is starting to give you a feel of how Neo4j's pathfinding capabilities can really help us here to better understand the navigation patterns to and from a wikipedia page. There's lots of potential here.

Now before we move to some graph data science exercises, I wanted to highlight one more interesting query problem that I had while doing this exercise - which is all about case sensitivity.

Tackling case sensitivity

One thing I noticed while I was working on the above examples is that it pretty confusing to do some querying that is case insensitive. Take the following example:

match (p:Page)
where p.title contains "beer"
return count(p);

This yields this results:

But this would actually only give me the case sensitive results. I can actually solve this by using regular expressions in my where clause:

match (p:Page)
where p.title =~ "(?i).*beer.*"
return count(p);

This gives me a much wider result set:

but is also quite a bit less efficient, as it cannot use the Neo4j native indexes to do this matching. This can be a real problem. One solution for this would be to use Neo4j's native support for Lucene indexes, aka the "fulltext indexes". You can actually really easily add these to the index providers, on top of the native ones. You can do this as follows:

CALL db.index.fulltext.createNodeIndex("pagetitleindex", ["Page"], ["title"]);

This adds the index to the database, as you can see below.

So there we have 2 different indexes, one native (used for the uniqueness constraint, as well), one fulltext, on the same Label/Property. 

Now, in order to do a case insensitive query, we can actually use that fulltext index to do the lookup, instead of the above regular expression based solution.

CALL db.index.fulltext.queryNodes("pagetitleindex", "*beer*") YIELD node return count(node);

The result is the same as the one we had above, but it is way more efficient!

Here's a sample of what the results actually look like:

CALL db.index.fulltext.queryNodes("pagetitleindex", "*beer*") YIELD node, score
RETURN node.title;

You can clearly see that this is no longer case sensitive! Great!

That covers part 2 of this blogpost series. In part 3 we will start looking at some graph data science on this dataset - which should be a ton of fun as well.

Hope this was useful - as always comments very welcome.



PS: all scripts are nicely put together on github.

No comments:

Post a Comment