Monday, 29 June 2020

Executives of Belgian Public Companies - revisited!

Data in a Google Sheet

Thomas sent me the data in two .csv files, but to make the data easily available, I converted it into a Google spreadsheet. You can find it over here. There are two sheets in the workbook: one for the nodes, and one for the edges/relationships.

From there, you can easily download the data as .csv files as well - ready for import into Neo4j.

Import of the data into Neo4j

Importing the nodes

Since the data was already nicely structured, all we needed to do was to convert the nodes/relationship sheets into Neo4j nodes and relationships using LOAD CSV. First we import the Nodes:

load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1T8vt1PvdJqvOTj_5kRLsjbOUDNId97foQkNB6U6o5_8/export?format=csv&id=1T8vt1PvdJqvOTj_5kRLsjbOUDNId97foQkNB6U6o5_8&gid=1067482365" as csv
create (n:Node)
set n = csv;
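If you want a quick sanity check that the import worked, you can count the freshly created nodes (an optional extra, not part of the original import):

match (n:Node)
return count(n);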

Then we convert the Nodes into specific Person and Company nodes, by assigning the right labels:

match (n:Node)
where n.type = "person"
set n:Person
remove n.type;

match (n:Node)
where n.type = "company"
set n:Company
remove n.type;

Importing the edges / relationships

Once we have the Nodes, we can proceed to importing the relationships. It's really simple: first read the relationships from the sheet, then look up the start- and end-nodes that we just imported, and then create a generic "RELATED_TO" relationship. We could convert this generic relationship into specific relationship types if we wanted to, but that really is not necessary for the purposes of this exploration. Making the relationships more specifically typed would only complicate the model, so we won't go there:

load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1T8vt1PvdJqvOTj_5kRLsjbOUDNId97foQkNB6U6o5_8/export?format=csv&id=1T8vt1PvdJqvOTj_5kRLsjbOUDNId97foQkNB6U6o5_8&gid=0" as csv
match (startnode:Node {id: csv.from}), (endnode:Node {id: csv.to})
create (startnode)-[:RELATED_TO {mandate: csv.mandate, weight: toInteger(csv.weight)}]->(endnode);

That was easy. Now for a little bit of cleanup, and then we can continue with some fun queries: I just need to remove the Node label and add some indexes for querying:

match (n:Node)
remove n:Node;
create index on :Person(id);
create index on :Person(label);
create index on :Company(id);
create index on :Company(label);
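By the way, on Neo4j 4.x and later, the equivalent index creation syntax would look like the following - just a pointer for newer versions, the commands above work fine here:

create index for (p:Person) on (p.id);
create index for (p:Person) on (p.label);
create index for (c:Company) on (c.id);
create index for (c:Company) on (c.label);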

That's it! Let's do some querying.

Queries on the Belgian Public Company Executive graph

Based on the current model in our database, we can start running some queries on the dataset. The model is super simple: Person and Company nodes, connected by RELATED_TO relationships.

Thomas' question was mostly about the different graph theory metrics that we could find based on the model. That means calculating some centrality and related scores to give us a sense of what is important and what is not in our dataset.

Graph Data Science on the Belgian Public Company Executive graph

Contrary to what we had in 2017, we have made a lot of changes to the graph theoretical functionality that Neo4j offers today. In fact, we have a proper name for this now: we call it "Graph Data Science" these days, and we have a brand new Neo4j Graph Data Science library and the Neo4j Graph Data Science Playground to help us get our arms around it. Look at the GitHub page and the installation page for all the Neo4j Desktop graph apps, and add the functionality to your Neo4j Desktop to get running with this.

Running PageRank and Betweenness centrality algorithms

Using Neuler and the Graph Data Science library, we wanted to calculate the PageRank scores for the different parts of the graph. See this article about the PageRank algorithm for a bit more information. Essentially, this gives us a bit more information about the structural importance that the nodes have in the network structure - by understanding how the nodes are connected to one another. It's the same algorithm that Google uses for its ranking of webpage search results.

Neuler allows us to choose a few parameters, and then we can go ahead and run the algo and store the results as a property on the nodes:

:param limit => (50);
:param config => ({
  nodeProjection: '*',
  relationshipProjection: {
    relType: {
      type: '*',
      orientation: 'UNDIRECTED',
      properties: {}
    }
  },
  relationshipWeightProperty: null,
  dampingFactor: 0.85,
  maxIterations: 20,
  writeProperty: 'pagerank'
});
CALL gds.pageRank.write($config);

We can then do something very similar to run the Betweenness centrality algorithm. See this article for more info about Betweenness centrality: it tells us less about the individual importance of nodes in the network structure - but more about how information may flow BETWEEN different parts of the network. As a consequence, Betweenness centrality is really interesting when we want to understand how the network might evolve in the future - based on these information flows. Here's the code that Neuler generated for me:

:param limit => (50);
:param config => ({
  nodeProjection: '*',
  relationshipProjection: {
    relType: {
      type: '*',
      orientation: 'UNDIRECTED',
      properties: {}
    }
  },
  writeProperty: 'betweenness'
});
CALL gds.alpha.betweenness.write($config);

Both of these centrality metrics - PageRank and Betweenness - are really interesting, but we have actually added another category of algorithms to our data science toolset: Community Detection algorithms.

Louvain community detection

Community Detection algorithms tell us something about the way that certain parts of the network "belong together". There are quite a few different techniques that we can use here, but one of the most frequently used seems to be the one based on "Louvain modularity". There is a lot of complicated math behind this - see this article about Louvain modularity, for example, if you want to understand it better.

Using Neuler, running the algorithm and storing the results is super easy:

:param limit => (50);
:param config => ({
  nodeProjection: '*',
  relationshipProjection: {
    relType: {
      type: '*',
      orientation: 'UNDIRECTED',
      properties: {}
    }
  },
  relationshipWeightProperty: null,
  includeIntermediateCommunities: false,
  seedProperty: '',
  writeProperty: 'louvain'
});
CALL gds.louvain.write($config);

Now that we have all of these network science scores in our database, we can start exploring them in more detail in our Neo4j database.
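Before moving on, a quick inspection query can confirm that the scores were written - a simple sketch using the property names we chose above:

match (p:Person)
return p.label, p.pagerank, p.betweenness, p.louvain
order by p.pagerank desc
limit 5;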

Is the network being dominated by a small group of executives?

This was the key question that the journalists of De Tijd were trying to get their heads around. There are lots of different ways to approach this, and we did some pretty interesting analysis.

Density of the network

One key metric to understand is "how dense is the network" - meaning, how many relationships there are between the different entities in the graph, relative to the maximum possible. Take a look at this article about graph density - it's a great indicator that is pretty easy to calculate in Neo4j:

// count all nodes first, so that nodes without
// outgoing relationships are included as well
match (n)
with count(n) as nrofnodes
match ()-[r]->()
with nrofnodes, count(r) as nrofrels
return nrofnodes, nrofrels,
nrofrels/(nrofnodes * (nrofnodes - 1.0)) as density;

We seem to have a very low density: the maximum possible density is 1 (every node connected to every other node), and the minimum is 0 (no relationships at all).
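To make that concrete with some purely hypothetical numbers: a directed graph of 1,000 nodes and 2,000 relationships would have a density of 2000 / (1000 × 999) ≈ 0.002 - very far away from the maximum of 1.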

So that seems like a useful initial indicator.

Degree of the Company and Person nodes

Let's look at the distribution of the degree - I added a histogram to the Google spreadsheet for this.

match (c:Company)
return c.label, apoc.node.degree(c,'<') as degree
order by degree desc;

This gives us a table like this:

This table is of course interesting - but not that interesting. If we want to understand how "centralised" the power of the network is here, we want to understand the distribution of the degree a bit better. I created a histogram for this:
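Alternatively, if you'd rather compute the distribution directly in Cypher instead of in the spreadsheet, something along these lines should work - a sketch, using the same APOC degree function as above:

match (c:Company)
with apoc.node.degree(c,'<') as degree
return degree, count(*) as frequency
order by degree;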


We can do the same thing for the Person nodes - that histogram has been added to the Google spreadsheet as well.
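The query is essentially the same - note that, in our model, the RELATED_TO relationships point from Person to Company, so for Person nodes we look at the outgoing degree (again just a sketch):

match (p:Person)
return p.label, apoc.node.degree(p,'>') as degree
order by degree desc;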


Again, both indicators seem to suggest a very low degree of centralisation.

Look at the different communities and their numbers

Using the scores from the different graph algorithms that we calculated above, we can of course start to explore the network from this perspective.

What are the largest communities, and who is in there?

Let's run a few queries to see what the communities are that we detected with our graph algorithms.

match (p:Person)
with p.louvain as louvain, count(p) as count
order by count desc
limit 10
match (p:Person {louvain: louvain})--(c:Company)
return louvain, collect(distinct p.label), collect(distinct c.label);

This gives us a nice list of communities and their members.

Now let's see what a particular community looks like.

Explore a particular community

Here are some queries that we can use for that. First, let's see how many members are in every community:

match (p:Person)
return p.louvain as Community, count(p) as NumberOfMembers
order by NumberOfMembers desc;
That gives us:

Now if we want to take a look at a particular community and its surroundings, we can do this query:

match path = (p:Person {louvain: 993})-[*..2]-(conn)
return path;

And get a good view:

As we said before: community detection algorithms give us a good view of the way that specific parts of the graph stick together. But which parts of the graph are more interesting? To answer that, we will look at the PageRank and Betweenness centrality measures.

Using Pagerank and Betweenness centrality

Let's look at who or what is most important, as per PageRank:

match (n)
return n.label, head(labels(n)), n.pagerank, n.betweenness, n.closeness, n.louvain
order by n.pagerank desc
limit 10;

Then we get:

It's quite striking to see that the most important nodes in our graph (as per PageRank) are all people - which is to be expected because of the nature of our graph. But it's also interesting to look at the PageRank scores of just the Companies:

match (n:Company)
return n.label, n.pagerank, n.betweenness, n.closeness, n.louvain
order by n.pagerank desc
limit 10;

That gets us this result:
We can also look at the person in every community that has the highest PageRank score:

match (p:Person)
with p.louvain as community, p
order by p.pagerank desc
return community, collect(p.label)[0] as person, collect(p.pagerank)[0] as pagerank
order by community;

The result is definitely interesting:


Finally, we would also like to look at the Betweenness centrality metric: who are the most important people, as per Betweenness centrality?

match (p:Person)
return p.label, p.betweenness as betweenness
order by betweenness desc
limit 10;

That gives us the headline that we found in the newspaper article - Hilde Laga, a top lawyer in Belgium, is one of the best-connected people in the industry.
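If you want to see her network for yourself, a quick exploration query will do - a sketch, assuming her name is stored in the label property like all the other nodes:

match path = (p:Person {label: "Hilde Laga"})-[:RELATED_TO]->(c:Company)
return path;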

One more thing: monopartite vs multipartite

As part of this article, we also experimented with a different way of structuring the data. In the dataset that we worked with above, we found that the mix of people and companies was no doubt interesting, but it also made it semantically more difficult to reason about the importance of the structural elements of the graph. This is something that happens quite often in "multipartite" graphs, where we mix different types of entities in the same structure. For that reason, it can actually be quite interesting to make the graph monopartite, and to work with one specific type of entity. In the graph above, we can do that quite easily, by inferring a new kind of relationship between people.

Here's the idea: if two people are related to the same company, we can infer that they must know one another - as they are both executives of the same company, and must see one another with at least some frequency. So this is what we can do:

match (p1:Person)-[:RELATED_TO]->(c:Company)<-[:RELATED_TO]-(p2:Person)
where p1 <> p2
merge (p1)-[:KNOW_EACHOTHER]-(p2);

This adds quite a few new relationships to the graph:
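A quick way to count them:

match ()-[r:KNOW_EACHOTHER]->()
return count(r);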

The result of this can be quite interesting.

Note that the KNOW_EACHOTHER relationship is semantically undirected, but in Neo4j every relationship is stored with a direction - we will simply ignore that direction in all queries and algorithms.
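For example, here is a sketch of what re-running PageRank over just the person-to-person network could look like. The UNDIRECTED orientation makes the algorithm ignore the stored direction; the pagerank_person property name is just a hypothetical choice:

:param config => ({
  nodeProjection: 'Person',
  relationshipProjection: {
    KNOW_EACHOTHER: {
      type: 'KNOW_EACHOTHER',
      orientation: 'UNDIRECTED',
      properties: {}
    }
  },
  dampingFactor: 0.85,
  maxIterations: 20,
  writeProperty: 'pagerank_person'
});
CALL gds.pageRank.write($config);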

All of the queries in this blogpost have also been added to this gist on github - you can play around with the data at your own ease if you want.

Hope this was useful / fun - it sure was for me.

All the best

Rik

