A long time ago, when dinosaurs roamed the earth and Neo4j was just a tiny cute little junior graph database ;-), I wrote a two-part blog post about a newspaper article that I had come across in De Tijd about the network of executives of Belgian public companies. You can find the articles over here: Part 1 and Part 2. It turns out - and I really was not aware of this until recently - that the newspaper has been running this type of publication on a yearly basis. Here's another article from 2018 on De Tijd's website.
So imagine my surprise a few weeks ago, when I was contacted by one of the authors of that article, Thomas Roelens, to verify some info for the 2020 edition of this analysis. We had a great chat, and Thomas basically asked me to double-check some of the analysis that he had already done himself. So, contrary to what happened in 2017 (when I had to dig into the HTML source to download the info from the website), Thomas just sent it to me and basically allowed me to take it for a spin :) ...
Meanwhile, Thomas' article has been published in the newspaper: you can find it over here or over here. But here's my update below too.
Data in a Google Sheet
Thomas sent me the data in two csv files, but to make the data easily available, I converted it into a Google spreadsheet. You can find it over here. There are two sheets in the workbook:
- one with nodes: find it over here.
- one with edges: find it over here.
As you can tell, you can easily download the data as .CSV files now as well - ready for import into Neo4j.
Import of the data into Neo4j
Importing the nodes
Since the data was already nicely structured, all we needed to do was to convert the nodes/relationship sheets into Neo4j nodes and relationships using LOAD CSV. First we import the Nodes:
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1T8vt1PvdJqvOTj_5kRLsjbOUDNId97foQkNB6U6o5_8/export?format=csv&id=1T8vt1PvdJqvOTj_5kRLsjbOUDNId97foQkNB6U6o5_8&gid=1067482365" as csv
create (n:Node)
set n = csv;
Then we convert the Nodes into specific Person and Company nodes, by assigning the right labels:
match (n:Node)
where n.type = "person"
set n:Person
remove n.type;
match (n:Node)
where n.type = "company"
set n:Company
remove n.type;
Importing the edges / relationships
Once we have the Nodes, we can proceed to importing the relationships. It's really simple: first we read the relationships from the sheet, then look up the start and end nodes that we just imported, and then create a generic "RELATED_TO" relationship. We could convert this generic relationship into specific relationship types if we wanted to, but that really is not necessary for the purposes of this exploration. Making the relationships more specifically typed would only complicate the model, so we won't go there:
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1T8vt1PvdJqvOTj_5kRLsjbOUDNId97foQkNB6U6o5_8/export?format=csv&id=1T8vt1PvdJqvOTj_5kRLsjbOUDNId97foQkNB6U6o5_8&gid=0" as csv
match (startnode:Node {id: csv.from}), (endnode:Node {id: csv.to})
create (startnode)-[:RELATED_TO {mandate: csv.mandate, weight: toInteger(csv.weight)}]->(endnode);
That was easy. Now for a little bit of cleanup, and then we can continue with some fun queries: I just need to remove the Node label and add some indexes for querying:
match (n:Node)
remove n:Node;
create index on :Person(id);
create index on :Person(label);
create index on :Company(id);
create index on :Company(label);
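As a quick sanity check on the import, we can count the nodes per label and the number of relationships (a generic check, nothing specific to this dataset):
// how many nodes did we import, per label?
match (n)
return labels(n) as labels, count(n) as nrOfNodes;
// and how many relationships?
match ()-[r:RELATED_TO]->()
return count(r) as nrOfRelationships;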
That's it! Let's do some querying.
Queries on the Belgian Public Company Executive graph
Graph Data Science on the Belgian Public Company Executive graph
Contrary to what we had in 2017, the graph theoretical functionality that Neo4j offers has changed a lot. In fact, we now have a proper name for it: we call this "Graph Data Science" these days, and there is a brand new Neo4j Graph Data Science library and the Neo4j Graph Data Science playground (Neuler) to help us get our arms around it. Take a look at the github page and the installation page for the Neo4j Desktop Graph Apps, and add the functionality to your Neo4j Desktop to get going with this.
Running Pagerank and Betweenness centrality algorithms
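Before running any of the algorithms, it's worth a quick check that the Graph Data Science library is actually installed in the database - gds.version() ships with the library:
// should return the installed version of the GDS library
RETURN gds.version();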
Using Neuler and the Graph Data Science library, we wanted to calculate the PageRank scores for the different parts of the graph. See this article about the PageRank algorithm for a bit more information. Essentially, it gives us a bit more insight into the structural importance of the nodes in the network - by understanding how the nodes are connected to one another. It's the same algorithm that Google uses for ranking webpage search results.
Neuler allows us to choose a few parameters, and then we can go ahead and run the algorithm and store the results as a property on the nodes:
:param limit => ( 50);
:param config => ({
nodeProjection: '*',
relationshipProjection: {
relType: {
type: '*',
orientation: 'UNDIRECTED',
properties: {}
}
},
relationshipWeightProperty: null,
dampingFactor: 0.85,
maxIterations: 20,
writeProperty: 'pagerank'
});
CALL gds.pageRank.write($config);
We can then do something very similar to run the Betweenness centrality algorithm. See this article for more info about Betweenness centrality: it tells us less about the individual importance of nodes in the network structure - but more about how information may flow BETWEEN different parts of the network. As a consequence, Betweenness centrality is really interesting when we want to understand how the network might evolve in the future - based on these information flows. Here's the code that Neuler generated for me:
:param limit => ( 50);
:param config => ({
nodeProjection: '*',
relationshipProjection: {
relType: {
type: '*',
orientation: 'UNDIRECTED',
properties: {}
}
},
writeProperty: 'betweenness'
});
CALL gds.alpha.betweenness.write($config);
Both of these centrality metrics - PageRank and Betweenness - are really interesting, but we have actually added another category of algorithms to our data science toolset: Community Detection algorithms.
Louvain community detection
Community Detection algorithms tell us something about the way that certain parts of the network "belong together". There are quite a few different techniques that we can use here, but one of the most frequently used seems to be the one based on "Louvain modularity". There is a lot of complicated math behind this - see this article about Louvain modularity, for example, if you want to understand it better.
Using Neuler, running the algorithm and storing the results is super easy:
:param limit => ( 50);
:param config => ({
nodeProjection: '*',
relationshipProjection: {
relType: {
type: '*',
orientation: 'UNDIRECTED',
properties: {}
}
},
relationshipWeightProperty: null,
includeIntermediateCommunities: false,
seedProperty: '',
writeProperty: 'louvain'
});
CALL gds.louvain.write($config);
Now that we have all of these network science scores in our database, we can start looking into them in more detail and explore them in our Neo4j database.
Is the network being dominated by a small group of executives?
Density of the network
One key metric to understand is "how dense is the network" - in other words, how many relationships there are between the different entities in the graph. Take a look at this article about graph density - it's a great indicator that is pretty easy to calculate in Neo4j:
match (n)-[r]->()
with count(distinct n) as nrofnodes,
count(distinct r) as nrofrels
return nrofnodes, nrofrels,
nrofrels/(nrofnodes * (nrofnodes - 1.0)) as density;
Degree of the Company and Person nodes
Let's look at the distribution of the degree - I added it as a histogram in the google spreadsheet. First, the degree of the Company nodes:
match (c:Company)
return c.label, apoc.node.degree(c,'<') as degree
order by degree desc;
This table is of course interesting - but not that interesting. If we want to understand how "centralised" the power in this network is, we need to look at the distribution of the degree a bit more closely. I created a histogram for this:
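If you would rather compute that distribution directly in Cypher instead of in the spreadsheet, a minimal sketch (reusing the same apoc.node.degree call as above) could look like this:
// bucket the companies by their incoming degree and count them
match (c:Company)
with apoc.node.degree(c,'<') as degree
return degree, count(*) as nrOfCompanies
order by degree;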
We can do the same thing for the Person nodes, as added to the google spreadsheet.
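For completeness, here is the analogous degree query for the Person nodes - assuming, as in the import above, that the RELATED_TO relationships point from Person to Company, so we count the outgoing relationships:
// outgoing degree of every person
match (p:Person)
return p.label, apoc.node.degree(p,'>') as degree
order by degree desc;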
Again, both indicators seem to suggest a very low degree of centralisation.
Look at the different communities and their numbers
What are the largest communities, and who is in there?
Let's run a few queries to see what the communities are that we detected with our graph algorithms.
match (p:Person)
with distinct p.louvain as louvains, count(p) as count
order by count desc
limit 10
unwind louvains as onelouvain
match (p:Person {louvain: onelouvain})--(c:Company)
return p.louvain, collect(distinct(p.label)), collect(distinct(c.label));
Now let's see what a particular community looks like
Explore a particular community
Here are some queries that we can use for that. First let's see how many members are in every community:
match (p:Person)
return distinct p.louvain as Community, count(p) as NumberOfMembers
order by count(p) desc;
That gives us:
Now if we want to take a look at a particular community and its surroundings, we can do this query:
match path = (p:Person {louvain: 993})-[*..2]-(conn)
return path;
Using Pagerank and Betweenness centrality
match (n)
return n.label, head(labels(n)), n.pagerank, n.betweenness, n.closeness, n.louvain
order by n.pagerank desc
limit 10;
match (n:Company)
return n.label, n.pagerank, n.betweenness, n.closeness, n.louvain
order by n.pagerank desc
limit 10;
We can also look at the person with the highest pagerank score in each community:
match (p:Person)
with p.louvain as community, p order by p.pagerank desc
return community, collect(p.label)[0] as person, collect(p.pagerank)[0] as pagerank
order by community;
Finally, we would also like to look at the Betweenness centrality metric: who are the most important people as per betweenness centrality?
match (p:Person)
return p.label, p.betweenness as betweenness
order by betweenness desc
limit 10;
One more thing: monopartite vs multipartite
Our graph is multipartite (it has both Person and Company nodes), and many graph algorithms work best on a monopartite graph. We can easily create a person-to-person projection by connecting two people whenever they are related to the same company:
match path = (p1:Person)-[:RELATED_TO]->(c:Company)<-[:RELATED_TO]-(p2:Person)
where p1 <> p2
merge (p1)-[:KNOW_EACHOTHER]-(p2);
The result of this can be quite interesting.
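Once that KNOW_EACHOTHER network exists, nothing stops us from running the Graph Data Science algorithms on the person-to-person graph directly. Here's a minimal sketch, using the same anonymous-projection syntax as above - the pagerank_persons property name is just an example, not something from the article:
:param config => ({
nodeProjection: 'Person',
relationshipProjection: {
KNOW_EACHOTHER: {
type: 'KNOW_EACHOTHER',
orientation: 'UNDIRECTED',
properties: {}
}
},
dampingFactor: 0.85,
maxIterations: 20,
writeProperty: 'pagerank_persons'
});
// pagerank over the person-to-person network only
CALL gds.pageRank.write($config);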