Wednesday 13 January 2016

The GraphBlogGraph: 3rd blogpost out of 3

Querying the GraphBlogGraph

After creating the GraphBlogGraph in a Google Spreadsheet in part 1 and importing it into Neo4j in part 2, we can now have some fun analysing and querying that dataset. There are obviously a lot of things we could do here, but in this final blog post I am just going to explore a few initial queries that I am sure you could then elaborate and extend upon.

Let’s start with a simple query

// Which pages have the most links
match (b:Blog)--(p:Page)-[r:LINKS_TO]->(p2:Page)
return b.name, p.title, count(r)
order by count(r) desc

Run this in the Neo4j browser and we get:

or just return the graphical result with a slightly different query:

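// count the links per page first (keeping r and p2 in the WITH would make every count 1),
// then fetch the outgoing links of the 50 most-linking pages
match (b:Blog)--(p:Page)-[r:LINKS_TO]->(:Page)
with b, p, count(r) as links
order by links desc
limit 50
match (p)-[r:LINKS_TO]->(p2:Page)
return b,p,r,p2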

And then you start to see that Max De Marzi is actually the “king of linking”: he links his pages to other web pages a lot (which is actually very good for search engine optimization).

A quick visit to one of Max’s pages confirms that: there are a lot of cool, bizarre, but always interesting links in Max’s blogposts:

So let’s do another query. Let’s look at the links that exist between the blogposts of our blog authors. Are they actually quoting or referring to one another or not? Let’s do:

//links between blogposts
MATCH p=((b1:Blog)--(p1:Page)-[:LINKS_TO]->(p2:Page)--(b2:Blog))
RETURN p;

and then we actually find that there are some links - but not that many.
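
Note that the pattern above can also match two pages of the same blog linking to each other. Here is a minimal sketch of a variant that keeps only the links crossing from one blog to another and counts them per pair of blogs. It assumes, as the patterns in this post do, that each page hangs off its blog through a single relationship; the column names source, target and links are just ones I picked for illustration.

```cypher
// Count LINKS_TO relationships whose two pages belong to different blogs.
// Assumption: every :Page is attached to its :Blog by one relationship.
MATCH (b1:Blog)--(p1:Page)-[:LINKS_TO]->(p2:Page)--(b2:Blog)
WHERE b1 <> b2
RETURN b1.name AS source, b2.name AS target, count(*) AS links
ORDER BY links DESC;
```

Each row then reads as “blog source links to blog target this many times”, which makes it easy to see who actually refers to whom.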


Same thing if we look at this a different way: let’s do some pathfinding and check out the paths between different blogs, for example my blog and Michael’s

match (b1:Blog {name:"Bruggen"}),(b3:Blog {name:"JEXP Blog"}),
p2 = allshortestpaths((b1)-[*]-(b3))
return p2 as paths

Then we actually see some more interesting connections: we don’t refer to one another directly very often, but we both refer to the same pages, and those pages become the links between our blogs. At depth 4 we see these kinds of patterns:

Interesting, right? I think so, at least!
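
If you would rather read those connecting pages as a table than as a picture, something along these lines should work. It is only a sketch: it reuses the two blog names from the pathfinding query and assumes the pages carry the url property used elsewhere in this post; hops and pages are column names made up for the example.

```cypher
// List the length of each shortest path between the two blogs,
// together with the URLs of the pages it passes through.
MATCH (b1:Blog {name:"Bruggen"}), (b3:Blog {name:"JEXP Blog"}),
      p = allShortestPaths((b1)-[*]-(b3))
RETURN length(p) AS hops,
       [n IN nodes(p) WHERE n:Page | n.url] AS pages
LIMIT 25;
```

The LIMIT is only there to keep the result readable in the browser; drop it if you want every shortest path.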

Then let’s do some more playing around, looking at the most linked to pages:

//Which pages are being linked to most
match ()-[r:LINKS_TO]->(p:Page)
return p.url, count(r)
order by count(r) DESC
limit 10;

That quickly uncovers the true “spider in the web”, my friend, colleague and graphista-extraordinaire: Michael Hunger:

Last but not least, I wanted to revisit an old and interesting way of running PageRank on Neo4j using plain Cypher (so without the GraphAware NodeRank module). I blogged about it some time ago, and it’s actually really interesting and easy to do. Here’s the query:

UNWIND range(1,50) AS round
MATCH (n:Page)
WHERE rand() < 0.1
MATCH (n:Page)-[:LINKS_TO*..10]->(m:Page)
SET m.rank = coalesce(m.rank,0) + 1

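This runs 50 rounds of a sampling-based approximation of PageRank rather than the textbook algorithm: each round picks roughly 10% of the pages at random (the 0.1 threshold, which stands in for a damping factor) and adds 1 to the rank of every page reachable from them within at most 10 LINKS_TO hops, once per path found. Running it is surprisingly quick: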

If you do that a couple of times, or even do a few hundred iterations at once, you will quickly see the results emerge with the following simple query:

match (n:Page)
where n.rank is not null
return n.url, n.rank
order by n.rank desc
limit 10;

That confirms the “spider in the web” theory that I mentioned above. Michael rules the links!
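
Because the ranking query adds to whatever rank a page already has, the numbers keep accumulating across runs. Two small helper queries can be handy here, offered as a sketch rather than as part of the original recipe: the first puts the sampled rank next to the plain count of inbound links (inbound is a column name chosen for this example), the second wipes the ranks so you can start a fresh series of runs.

```cypher
// Compare the sampled rank with the number of direct inbound links.
MATCH (n:Page)
WHERE n.rank IS NOT NULL
OPTIONAL MATCH ()-[r:LINKS_TO]->(n)
RETURN n.url, n.rank, count(r) AS inbound
ORDER BY n.rank DESC
LIMIT 10;

// Run this one separately: remove all ranks before a fresh series of runs.
MATCH (n:Page)
WHERE n.rank IS NOT NULL
REMOVE n.rank;
```

If the two columns of the first query tell roughly the same story, the most-linked-to pages and the highest-ranked pages are the same ones.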


All of these queries are of course on GitHub for you to play around with. I would love to hear your thoughts on these three blogposts, and I hope that they were as fun for you to read as they were for me to write.

All the best.

Rik
