Bruggen Blog: HRC

Monday, 16 December 2019

Part 3/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 3 of this blogpost. Part 1 and Part 2 are also published.)

Alright this is going to be the third and final part of my work on the Hillary Clinton Email Corpus. There's two posts that came before this article:

in the first post we focused on importing and refactoring the data in Neo4j
in the second post we spent some time analysing the dataset with some algorithms and some specific pattern matching queries

Now we are going to spent some time with the "heart of the matter", the actual content of the emails. We are going to do that in two steps: first we will do some "full text" querying of some data, using Neo4j's specific full text indexing capabilities. Then we are going to go a step further and try to extract more knowledge from this dataset in an automated way, by running some Natural Language Processing (NLP) algorithms and processes on it.

Let's get right to it.

Fulltext querying of Emails

Those of you that have been following Neo4j for some time, may remember that we have always bundled Apache Lucene with Neo4j. For the longest time, Neo4j used Lucene for it's indexing capabilities. This turned out to be a great choice for many things, but also one that had its limitations and trade-offs. This is why Neo4j has gradually been switching away from Lucene for its core schema indexing capability, and has adopted a modular, pluggable indexing architecture that allows for different indexing techniques to be used for different data types. This is great news for many reasons, but one of the most important benefits has been a dramatic increase in write performance - as the newer indexes are much more optimized and leaner than the older Lucene based structures. Read more about indexing in the Neo4j manual.

So as I started to think about some text-oriented queries, I quickly realised that I would need an index on Email text. So I wanted to do

create index on :Email(text)

and query that index afterwards. But the result was pretty obvious:

Part 2/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 2 of this blogpost. Part 1 and Part 3 are also published.)

In the previous post around the emails of Hillary Clinton, we were able to import the data from a CSV file, and use some really cool graph refactoring tools to make the database a little more easy to work with - bad data is bad data, and the less we have of that the better.

So we ended up in a reasonably stable state, where we could do some querying. In this post, we will do exactly that.

Exploring the graph with graph algos

It's fairly easy to get a good initial view of the structure and size of the graph. I just run a few queries like this:

//what nodes are in the db

match (n) return labels(n), count(n)

and:

//what rels are in the db

MATCH p=()-[r]->() RETURN type(r), count(r)

and we very quickly see that, while this is clearly not a "big" dataset, it's still big enough to start loosing some significant time sifting through data if you want to make some sense of it. This is where our fantastic graph algorithms come in. I installed the plugin into my database, restarted it, and then I also played around a bit with Neuler, a graph algo playground that basically allows you to quickly experiment with different algorithms. You can download Neuler from https://install.graphapp.io/ and install it into your Neo4j Desktop really quickly.

Part 1/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 1 of this blogpost. Part 2 and Part 3 are also published.)

With lots of interesting political manoeuvring going on in the USA and in Europe, I somehow got into a rabbit hole where I came across the corpus of emails that were published in the aftermath of the 2016 US presidential elections. They have been analysed a number of times, both by citizens and the press: see the great site published by the Wall Street Journal and Ben Hamner's github repo (which is based on a Kattle dataset).

Some of my friends and colleagues have also done some work on this dataset in Neo4j - there's this graphgist, Linkurio.us' blogpost, as well as Ryan Boyd's older article on DeveloperAdvocate. But I decided I was interested enough to take it for a spin.

Importing the email corpus into Neo4j

I got the dataset from this url, and it looks pretty straightforward. There's a very simple datamodel that we can work with, which would look something like this:

Bruggen Blog

Pages