Querying the GraphBlogGraph
After having created the GraphBlogGraph in a Google Spreadsheet in part 1, and having imported it into Neo4j in part 2, we can now start having some fun and analysing and querying that dataset. There are obviously a lot of things we could do here, but in this final blog post I am just going to explore some initial things that I am sure you could then elaborate and extend upon.
Let’s start with a simple query
// Which pages have the most links
match (b:Blog)--(p:Page)-[r:LINKS_TO]->(p2:Page)
return b.name, p.title, count(r)
order by count(r) desc
Run this in the Neo4j browser and we get:
or just return the graphical result with a slightly different query:
match (b:Blog)--(p:Page)-[r:LINKS_TO]->(p2:Page)
with b,p,r,p2, count(r) as count
order by count DESC
limit 50
return b,p,r,p2
So let’s do another query. Let’s look at the different links that exist between blogposts of our blog-authors. Are they actually quoting/referring to one another or not? Let’s do
//links between blogposts
MATCH p=((n1:Blog)--(p1:Page)-[:LINKS_TO]-(p2:Page)--(b2:Blog))
RETURN p;
and then we actually find that there are some links - but not that many.
Same thing if we look at this a different way: let’s do some pathfinding and check out the paths between different blogs, for example my blog and Michael’s
match (b1:Blog {name:"Bruggen"}),(b3:Blog {name:"JEXP Blog"}),
p2 = allshortestpaths((b1)-[*]-(b3))
return p2 as paths
Then we actually see a bit more interesting connections: we don’t refer to one another directly very often, but we both refer to the same pages - and those pages become the links between our blogs. At depth 4 we see these kinds of patterns:
Interesting, right? I think so, at least!
Then let’s do some more playing around, looking at the most linked to pages:
//Which pages are being linked to most
match ()-[r:LINKS_TO]->(p:Page)
return p.url, count(r)
order by count(r) DESC
limit 10;
That quickly uncovers the true “spider in the web”, my friend, colleague and graphista-extraordinaire: Michael Hunger:
UNWIND range(1,50) AS round
MATCH (n:Page)
WHERE rand() < 0.1
MATCH (n:Page)-[:LINKS_TO*..10]->(m:Page)
SET m.rank = coalesce(m.rank,0) + 1
This does 50 iterations of PageRank, using a 0,1 damping factor and a maximum depth of 10. Running it is surprisingly quick:
If you do that a couple of times, and even do a few hundred iterations at once, you will quickly see the results emerge with the following simple query:
match (n:Page)
where n.rank is not null
return n.url, n.rank
order by n.rank desc
limit 10;
Confirming the “spider in the web” theory that I mentioned above. Michael rules the links!
All of these queries are of course on Github for you to play around with. Would love to hear your thoughts on these three blogposts, and hope that they were as fun for you to read as they were for me to write.
All the best.
Rik