Tuesday, 9 May 2017

Part 2/2: looking at the Web of Belgian Public Companies in Neo4j

Yesterday, I published part 1 of this short little blogpost on how we could load the dataset from a great newspaper article in De Tijd (our local financial/economic newspaper) into Neo4j. Of course, the whole point of that loading process (all of which is easily copied from GitHub, btw) is to be able to do some additional querying on the dataset - just because we can :) ... So let's do some simple queries here, and then you can of course explore this some more yourself!

Start with some simple queries

In the article mentioned above, one of the key figures in the web of public companies is Luc Bertrand, the CEO of Ackermans & Van Haaren - a former dredging company that turned into a holding company. Let's explore the network around him by walking all paths of up to three hops from his node.
//network around Luc Bertrand 
match path = (m:Male)-[r*..3]-(n) 
where m.name contains "Bertrand"
return path
That query gives us a nice little graph that we can explore:




Let's look at another interesting metric, by understanding the number of links that a particular person has - the degree of a node - and see which nodes bubble up to the top as "most connected" nodes:
// degree of the Person nodes 
match (p:Person)
return p.name, size( (p)--() ) as degree
order by degree desc
limit 10
Indeed, that gives us Luc Bertrand as one of the top members of this network - but there's some other interesting ones that bubble up:

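If you want to see what "degree" means without a running Neo4j instance, here is a minimal sketch in plain Python. It is not the Cypher query above and it does not use the De Tijd dataset: the edge list and all names in it are made up for illustration, and I am assuming that every relationship simply links one person to one company.

```python
# Toy undirected graph (hypothetical data, NOT the De Tijd dataset):
# each pair is one relationship between a person and a company.
edges = [
    ("Alice", "AcmeCo"), ("Alice", "BetaCorp"), ("Alice", "GammaNV"),
    ("Bob", "AcmeCo"), ("Bob", "BetaCorp"),
    ("Carol", "GammaNV"),
]


def degrees(edge_list):
    """Count how many relationships touch each node, ignoring direction."""
    counts = {}
    for a, b in edge_list:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts


def top_people(edge_list, limit=10):
    """People (left-hand side of each pair) ordered by degree, highest first."""
    people = {person for person, _ in edge_list}
    deg = degrees(edge_list)
    ranked = sorted(((p, deg[p]) for p in people), key=lambda x: (-x[1], x[0]))
    return ranked[:limit]


if __name__ == "__main__":
    for name, degree in top_people(edges):
        print(name, degree)
```

The sort key mirrors the `order by degree desc` plus `limit 10` of the query, with the name as a tie-breaker so the output is stable.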

So why don't we explore some of the paths between these highly connected nodes, starting with the connections between Mr. Bertrand and Mr. Vlerick (the other person with a degree of "5") - surely that would be interesting? Let's see:
//links between highly connected nodes 
match (vlerick:Person {name:"Philippe Vlerick"}), (bertrand:Person {name:"Luc Bertrand"}),
path = allshortestpaths ((vlerick)-[*]-(bertrand))
return path;
And indeed we can see that there are some other interesting nodes (Person and Company nodes) that sit in between:

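For those wondering what "all shortest paths" actually computes: the sketch below does a breadth-first search on a tiny made-up graph and collects every path of minimal length between two nodes. This is my own illustration in plain Python, under the assumption that relationships are unweighted and traversed in both directions. It says nothing about how Neo4j implements this internally, and the node names (`P1`, `C1`, ...) are hypothetical.

```python
from collections import deque


def all_shortest_paths(adj, start, goal):
    """Return every shortest path from start to goal in an unweighted graph."""
    dist = {start: 0}
    parents = {start: []}  # for each node: predecessors on some shortest path
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                parents[nxt].append(node)
    if goal not in dist:
        return []

    paths = []

    def walk(node, suffix):
        # Walk backwards from goal to start along the recorded predecessors.
        if node == start:
            paths.append([start] + suffix)
            return
        for parent in parents[node]:
            walk(parent, [node] + suffix)

    walk(goal, [])
    return sorted(paths)


# Hypothetical toy graph: two people (P1, P2) sharing two boards (C1, C2).
adj = {
    "P1": ["C1", "C2"],
    "C1": ["P1", "P2"],
    "C2": ["P1", "P2"],
    "P2": ["C1", "C2", "C3"],
    "C3": ["P2"],
}

if __name__ == "__main__":
    for path in all_shortest_paths(adj, "P1", "P2"):
        print(" - ".join(path))
```

In this toy graph the two people are linked through both companies, so two equally short paths come back, which is exactly the kind of "nodes that sit in between" picture that the query produces.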

We can do that trick again for a couple of other nodes:
match (degraeve:Person {name:"Bert De Graeve"}), (bertrand:Person {name:"Luc Bertrand"}),
path = allshortestpaths ((degraeve)-[*]-(bertrand))
return path;
Gives us this:

And then exploring further from there:
match (degraeve:Person {name:"Bert De Graeve"}), (donck:Person {name:"Frank Donck"}),
path = allshortestpaths ((degraeve)-[*]-(donck))
return path;
That gives us a longer, perhaps even more interesting path:

And we can do a similar trick to look at links between companies: let's look at the link between the largest Belgian bank and the largest beer producer :) ...
//links between companies 
match (kbc:Company {name:"KBC"}), (li:Company {name:"AB INBEV"}),
path = allshortestpaths ((kbc)-[*]-(li))
return path;
That gives us:

There's a ton of other queries that we could have done (like for example zooming in on the gender distribution of the boardroom representation, which has been the subject of a government-imposed quota for a few years now), but I will leave that up to you :) ... Let's now do some more graph querying.

Let's do some more Graphy analysis

There's a bunch of other interesting queries, but I think some of them are more interesting than others. I particularly wanted to try this one: the diameter of this network, meaning the longest of all the shortest paths between two Person nodes. This gives us an idea of the sparsity and density of the network, in a way, and I guess the outliers of the network would also be kind of interesting to explore further. Here's the query that will get us the diameter:
//maximum diameter as a graph 
MATCH (a:Person), (b:Person) WHERE id(a) > id(b)
MATCH p=shortestPath((a)-[:RELATED_TO*]-(b))
with length(p) AS len, p
ORDER BY len DESC LIMIT 1
return p
And as you can see below, it really is quite large.

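The idea behind the diameter query can be sketched in a few lines of plain Python: run a breadth-first search from every node, and keep the largest shortest-path length that you find. This is only an illustration on a made-up graph, assuming unweighted, undirected relationships. Like the Cypher query, it only considers pairs of nodes that can actually reach each other.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter(adj):
    """Longest shortest path over all pairs of mutually reachable nodes."""
    return max(max(bfs_distances(adj, node).values()) for node in adj)


# Hypothetical toy graph: a simple chain a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

if __name__ == "__main__":
    print(diameter(chain))
```

On a chain of four nodes the two ends are three hops apart, so that is the diameter. Doing a full search from every node is fine for a small graph like this one, but it gets expensive quickly on larger ones.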

So let's now go and get a handle on some of the more traditional graph metrics. For this we'll be using the Awesome Procedures on Cypher (aka APOC) for Neo4j, which include very simple and powerful ways to calculate these metrics.

Let's start with "Betweenness Centrality", which is defined as
In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such that either the number of edges that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex.
So it's a measure that will help us understand which nodes in the network connect different parts of the network. Here's how we calculate that using the APOC procedure:
//betweenness centrality 
MATCH (node:Person)
WHERE id(node) %2 = 0
WITH collect(node) AS nodes
CALL apoc.algo.betweenness(['RELATED_TO'],nodes,'BOTH') YIELD node, score
RETURN node.name, score
ORDER BY score DESC
And we see some interesting nodes pop up:

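To make the definition of betweenness a bit more tangible, here is a small plain-Python sketch that uses Brandes' algorithm on an unweighted, undirected toy graph. To be clear: this is my own illustration and I am not claiming that this is how the APOC procedure works internally. The graph and node names are made up, and the scores are unnormalised (each pair of nodes is counted once).

```python
from collections import deque


def betweenness(adj):
    """Unnormalised betweenness centrality for an undirected, unweighted graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) to every node.
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every pair was visited from both ends, so halve the totals.
    return {v: total / 2 for v, total in score.items()}


# Hypothetical toy graph: one hub connected to three leaves.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}

if __name__ == "__main__":
    for node, value in sorted(betweenness(star).items(), key=lambda x: -x[1]):
        print(node, value)
```

In the star graph every shortest path between two leaves runs through the hub, so the hub collects all of the score and the leaves get none. That is the "connecting different parts of the network" effect in its purest form.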

Next up: closeness centrality.
In a connected graph, the closeness centrality (or closeness) of a node is a measure of centrality in a network, calculated as the reciprocal of the sum of the length of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes.
Here's the query:
//closeness centrality 
MATCH (node)
WHERE id(node) %2 = 0
WITH collect(node) AS nodes
CALL apoc.algo.closeness(['RELATED_TO'],nodes,'INCOMING') YIELD node, score
RETURN node.name, score
ORDER BY score DESC
And then we can see that there's a bunch of Companies that will stand out:

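Here's a minimal plain-Python sketch of closeness on a made-up graph, using the common textbook definition: one divided by the sum of the shortest-path lengths from a node to all the other nodes it can reach. I am assuming an unweighted, undirected graph and no normalisation, so the APOC procedure may well scale or direct things differently. Only the ordering of the scores is the point here.

```python
from collections import deque


def closeness(adj):
    """Closeness per node: 1 / (sum of hop distances to all reachable nodes)."""
    result = {}
    for source in adj:
        # Breadth-first search gives the shortest hop count to each node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total = sum(dist.values())
        # An isolated node has no distances to sum, so give it a score of 0.
        result[source] = 1.0 / total if total > 0 else 0.0
    return result


# Hypothetical toy graph: a simple chain a - b - c.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

if __name__ == "__main__":
    for node, value in sorted(closeness(chain).items(), key=lambda x: -x[1]):
        print(node, round(value, 3))
```

The middle node of the chain is one hop away from everything, so it gets the highest score, while the two ends tie for last place.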

And then last but not least, we calculate the PageRank on our network, the basic algorithm that Google uses to rank search results:
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
It's something that we've used before on this blog, and that definitely could be interesting on this graph. Using APOC, it's so easy:
//pageRank for Companies 
MATCH (node:Company)
WHERE id(node) %2 = 0
WITH collect(node) AS nodes
// compute over relationships of all types
CALL apoc.algo.pageRank(nodes) YIELD node, score
RETURN node.name, score
ORDER BY score DESC
This gives me a set of companies that would be really interesting to explore further, of course:

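If you are curious about what PageRank does under the hood, the sketch below runs the classic power iteration on a tiny made-up directed graph. The damping factor of 0.85 and the 50 iterations are conventional choices that I picked for this illustration, not settings taken from APOC, and the node names are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; `links` maps each node to the nodes it points to."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over everyone.
        dangling = sum(rank[v] for v in nodes if not links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in links[v]:
                new_rank[target] += damping * rank[v] / len(links[v])
        rank = new_rank
    return rank


# Hypothetical toy graph: A, B and D all point to C, and C points back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

if __name__ == "__main__":
    for node, value in sorted(pagerank(links).items(), key=lambda x: -x[1]):
        print(node, round(value, 3))
```

C receives the most incoming links and ends up on top. A comes second even though only one node points to it, because that one node is the important C. That is the "quality of links" part of the definition at work.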

I think that's about it for now. All the queries are of course over here on GitHub - so please take a look and let me know what you think!

Hope this was useful.

All the best

Rik
