Monday 16 December 2019

Part 2/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 2 of this blogpost.  Part 1 and Part 3 are also published.)

In the previous post about the emails of Hillary Clinton, we were able to import the data from a CSV file, and use some really cool graph refactoring tools to make the database a little easier to work with - bad data is bad data, and the less we have of that the better.

So we ended up in a reasonably stable state, where we could do some querying. In this post, we will do exactly that.

Exploring the graph with graph algos

It's fairly easy to get a good initial view of the structure and size of the graph. I just run a few queries like this:

//what nodes are in the db
match (n) return labels(n), count(n)


//what rels are in the db
MATCH p=()-[r]->() RETURN type(r), count(r)

and we very quickly see that, while this is clearly not a "big" dataset, it's still big enough to start losing some significant time sifting through data if you want to make some sense of it. This is where our fantastic graph algorithms come in. I installed the plugin into my database, restarted it, and then I also played around a bit with Neuler, a graph algo playground that basically allows you to quickly experiment with different algorithms. You can download Neuler and install it into your Neo4j Desktop really quickly.

As I wanted to figure out the more interesting parts of the graph, I was going to try and run two specific centrality algorithms: PageRank and betweenness centrality. Let's do that now.

Calculating the PageRank score

To get a feel for the importance of certain nodes in a network, there are quite a few different metrics that we could consider - but one of the most renowned ones is PageRank. Originally described, proposed and used by Google in the first incarnation of its web search engine, it offered a radical innovation in the way that these engines would rank the results that were generated by a specific keyword search. Here is a summary description:
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
Now, I am old enough to remember web search before Google, and I can tell you - it wasn't pretty. The web was only a fraction of what it is today, and still companies like Lycos (still out there!), Altavista (still surviving as Yahoo! search - it seems) and the likes were not able to provide users with quality results. Then came Google. It's safe to say that it overhauled the web in a few years.
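To make the idea a bit more tangible, here is a minimal sketch of PageRank as a plain power iteration in Python. This is not how the Neo4j plugin implements it: the function name, the toy email pairs and the handling of nodes without outgoing links are all my own assumptions, and the scores are normalised to sum to 1, so the absolute values need not match what the plugin writes to the database. The damping factor and iteration count are simply the values used in the config further down.

```python
def pagerank(edges, damping=0.85, iterations=25):
    """Power-iteration PageRank on a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start with an even split
    for _ in range(iterations):
        # Every node first receives the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                # A node passes its rank on, split evenly over its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a node without out-links spreads its rank over everyone.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy data: (sender, receiver) pairs, not taken from the real corpus.
emails = [
    ("alice", "hillary"),
    ("bob", "hillary"),
    ("carol", "hillary"),
    ("hillary", "alice"),
    ("carol", "bob"),
]

scores = pagerank(emails)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

In this toy graph the node that receives the most links from well-connected senders ends up on top, which is exactly the intuition from the description above.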

So could we run that algorithm on this email dataset, and get a feel for the most important email senders/receivers in our graph? Let's find out.

Here's how I ran the algo:

//first we set some algo config parameters
:param label => ('Person');
:param relationshipType => ('HAS_EMAILED');
:param limit => ( 100);
:param config => ({concurrency: 8, direction: 'Outgoing', weightProperty: null, defaultValue: 1, dampingFactor: 0.85, iterations: 25, writeProperty: 'pagerank'});

//then we calculate pagerank with one simple command
CALL algo.pageRank($label, $relationshipType, $config);

This algo runs really quickly - given also that we are doing "only" 25 iterations - the PageRank score would become more accurate if we did more. Still, it finishes quickly and gives solid results, with a PageRank score now written to the "pagerank" property of every Person node in the database.

That means that we can easily query that, and we should be able to see if the algo yielded some interesting results. Let's query these:

//query pagerank
MATCH (n:Person) 
RETURN n.alias, n.pagerank
order by n.pagerank desc;

The results are quick and easy to understand.

So that gives us already some indication of the more interesting people in the graph. But let's look at it in a different way.

Calculating the betweenness centrality score

Another way of looking at importance in a graph like this one, is by looking at a different measure called betweenness centrality. Here's what that means:
Betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of nodes in a connected graph, there exists at least one shortest path between the vertices such that either the number of edges that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality for each node is the number of these shortest paths that pass through the node.
As such this is an interesting measure as nodes with a high betweenness typically have great importance in the way information is shared in the network.
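As an illustration of that definition, here is a small Python sketch that computes betweenness on a toy directed, unweighted graph. It uses Brandes' algorithm, which I picked for the sketch; I am not claiming that this is what the plugin does internally. The function name and the sample pairs are hypothetical, and the sketch assumes there are no duplicate edges. When several shortest paths exist between a pair, each one counts for a fraction, so the score is a weighted count rather than a raw number of paths.

```python
from collections import deque


def betweenness(edges):
    """Brandes-style betweenness for a directed, unweighted graph of (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    neighbours = {node: [] for node in nodes}
    for source, target in edges:
        neighbours[source].append(target)

    score = {node: 0.0 for node in nodes}
    for source in nodes:
        # Breadth-first search: count the shortest paths from `source` to everyone.
        distance = {source: 0}
        path_count = {node: 0 for node in nodes}
        path_count[source] = 1
        predecessors = {node: [] for node in nodes}
        visit_order = []
        queue = deque([source])
        while queue:
            current = queue.popleft()
            visit_order.append(current)
            for nxt in neighbours[current]:
                if nxt not in distance:
                    distance[nxt] = distance[current] + 1
                    queue.append(nxt)
                if distance[nxt] == distance[current] + 1:
                    path_count[nxt] += path_count[current]
                    predecessors[nxt].append(current)

        # Walk back from the farthest nodes and accumulate the dependencies.
        dependency = {node: 0.0 for node in nodes}
        for node in reversed(visit_order):
            for pred in predecessors[node]:
                fraction = path_count[pred] / path_count[node]
                dependency[pred] += fraction * (1.0 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    return score


# Hypothetical toy data: everything has to flow through "hillary".
emails = [
    ("alice", "hillary"),
    ("carol", "hillary"),
    ("hillary", "bob"),
    ("hillary", "dave"),
]

result = betweenness(emails)
for name, value in sorted(result.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value}")
```

In this toy graph the only shortest paths with an intermediate node are alice to bob, alice to dave, carol to bob and carol to dave, and all four pass through the same person, so that node gets all the betweenness and everyone else gets none.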

So we can do something very similar to what we did in the PageRank example above, and calculate betweenness centrality using our algos. We use a two-step process, again: set the parameters first, and then run the algo:

//set the betweenness algo parameters
:param label => ('Person');
:param relationshipType => ('HAS_EMAILED');
:param limit => ( 100);
:param config => ({concurrency: 8, direction: 'Outgoing', writeProperty: 'betweenness'});

//calculate the betweenness scores
CALL algo.betweenness($label, $relationshipType, $config);

As with the previous PageRank example, this runs very quickly.

And since the algo writes back the resulting score to the Person nodes, we can query this in no time too:

//query betweenness
MATCH (n:Person) RETURN n.alias, n.betweenness
order by n.betweenness desc;

With these two metrics, we are starting to get a good picture of who's who and who's important in the graph - without having read a single email :) ... So much more we can do here, but there are two other aspects that I wanted to explore in this post - just for fun.

Looking for an email backchannel

One of the things that of course strikes you when you start looking at these emails, is the importance of Hillary herself - after all, it was her email server, right? So you would expect many of the metrics and much of the communication to revolve around her - and that's clearly what we saw above. Both the betweenness and the PageRank scores for Mrs. Clinton were very high, and that is very much to be expected - many of these emails will be either directed to or coming from her.

We can actually look at the numbers:

//how many people emailed Hillary, and how many people did she email?
match (p1:Person)-[in:HAS_EMAILED]->(p:Person)
where p.alias contains "Hillary"
and not (p1.alias contains "Hillary")
with p, count(in) as in
match (p)-[out:HAS_EMAILED]->(p2:Person)
where not (p2.alias contains "Hillary")
return in, count(out) as out;

This tells us that Hillary emailed 91 people, and received emails from 113 people. And they are clearly the core of our graph here: if we run this query

//how many mails did HRC send or receive?
match (:Email)-[to:TO]->(p:Person)
where p.alias contains "Hillary"
with p, count(to) as incoming
match (:Email)-[from:FROM]->(p:Person)
return incoming, count(from) as outgoing;

we quickly see that out of 7945 emails in this dataset, Hillary sent 1990 and received 5433 of them. But that means that there are over 500 emails in the dataset (7945 - 1990 - 5433 = 522) where Hillary was just not part of the communication. Interesting. Could we be having an email backchannel here, with some of Hillary's aides / staff / friends having conversations amongst themselves without involving Mrs. Clinton herself? Let's take a look at that.

We are looking for a pattern where Hillary has exchanged emails with two people, and these two people have then emailed each other separately without including Mrs. Clinton. The pattern would look something like this:

//emails backchannel query
match (p1:Person)-[:HAS_EMAILED]-(p:Person)-[:HAS_EMAILED]-(p2:Person)-[:HAS_EMAILED]-(p1)
where p.alias contains "Hillary" and p1 <> p2
with p1, p2
match path = (p1)--(e:Email)--(p2)
where not p1.alias contains "Hillary" and not p2.alias contains "Hillary"
return path
limit 10;

and yeah - there it is. 

Now, there's no way (yet) of telling if this backchannel was of any material importance - but we can definitely see that it existed.
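For readers who want to see the triangle part of that pattern outside of Cypher, here is a rough Python sketch on a handful of made-up pairs. The function name and the sample data are hypothetical. It only mirrors the first MATCH of the query (two contacts of the hub who are also connected to each other, ignoring direction) and does not look up the actual Email nodes. Unlike the Cypher pattern, it reports each pair only once.

```python
from itertools import combinations


def backchannel_pairs(has_emailed, hub):
    """Return pairs of the hub's contacts that are also directly connected to each other."""
    # Build an undirected view, like the direction-less pattern in the Cypher query.
    contacts = {}
    for sender, receiver in has_emailed:
        contacts.setdefault(sender, set()).add(receiver)
        contacts.setdefault(receiver, set()).add(sender)

    hub_contacts = contacts.get(hub, set()) - {hub}
    return [
        (first, second)
        for first, second in combinations(sorted(hub_contacts), 2)
        if second in contacts.get(first, set())
    ]


# Hypothetical toy data: (sender, receiver) pairs.
has_emailed = [
    ("hillary", "alice"),
    ("bob", "hillary"),
    ("alice", "bob"),     # alice and bob also talk to each other
    ("hillary", "carol"),
    ("dave", "carol"),    # dave never talks to the hub, so no triangle here
]

pairs = backchannel_pairs(has_emailed, "hillary")
print(pairs)
```

Only the pair that closes a triangle with the hub is reported; a contact who merely talks to an outsider is not.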

That is it for this blogpost - and for the initial structural analysis that we did on this email corpus. All of the above queries and scripts are on GitHub, of course, and I would love to hear your feedback.

Next: let's do some text analysis and natural language processing on this dataset - that should be even more fun.


