Bruggen Blog: Part 3/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

Monday 16 December 2019

Part 3/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 3 of this blogpost. Part 1 and Part 2 are also published.)

Alright, this is going to be the third and final part of my work on the Hillary Clinton Email Corpus. There are two posts that came before this article:

in the first post we focused on importing and refactoring the data in Neo4j
in the second post we spent some time analysing the dataset with some algorithms and some specific pattern matching queries

Now we are going to spend some time with the "heart of the matter", the actual content of the emails. We are going to do that in two steps: first we will do some "full text" querying of the data, using Neo4j's specific full text indexing capabilities. Then we are going to go a step further and try to extract more knowledge from this dataset in an automated way, by running some Natural Language Processing (NLP) algorithms and processes on it.

Let's get right to it.

Fulltext querying of Emails

Those of you who have been following Neo4j for some time may remember that we have always bundled Apache Lucene with Neo4j. For the longest time, Neo4j used Lucene for its indexing capabilities. This turned out to be a great choice for many things, but also one that had its limitations and trade-offs. This is why Neo4j has gradually been switching away from Lucene for its core schema indexing capability, and has adopted a modular, pluggable indexing architecture that allows for different indexing techniques to be used for different data types. This is great news for many reasons, but one of the most important benefits has been a dramatic increase in write performance - as the newer indexes are much more optimized and leaner than the older Lucene based structures. Read more about indexing in the Neo4j manual.

So as I started to think about some text-oriented queries, I quickly realised that I would need an index on Email text. So I wanted to do

create index on :Email(text)

and query that index afterwards. But the result was pretty obvious:

The creation of the index failed, as I found out that the size of the email text can be quite big:

match (e:Email)

return max(size(e.text));

told me that there are emails in there of 98196 characters (size() on a string counts characters, so the size in bytes is at least that). That's too big for our normal schema indexes, so I quickly dropped it again.
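A small aside on units: Cypher's size() on a string counts characters, while index key limits are usually expressed in bytes. Here is a minimal Python sketch of the difference, assuming UTF-8 encoding. The limit constant is a made-up number for illustration only; the real limit depends on the index provider and is documented in the Neo4j manual.

```python
# Hypothetical limit, for illustration only. Check the Neo4j manual for the
# real key-size limit of the index provider you are using.
ASSUMED_KEY_LIMIT_BYTES = 4000


def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_in_index(text: str, limit: int = ASSUMED_KEY_LIMIT_BYTES) -> bool:
    """True when the encoded text is small enough for the assumed limit."""
    return utf8_size(text) <= limit


# Invented sample values: a short subject, a string with accented characters,
# and a string as long as the largest email body reported above.
samples = ["FYI", "Déjà vu", "x" * 98196]
for s in samples:
    print(len(s), utf8_size(s), fits_in_index(s))
```

Accented characters take more than one byte in UTF-8, so the byte count can only be equal to or larger than the character count.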

For these larger text properties, Neo4j actually offers some specific, and still Lucene-based, indexing capabilities. You can read about these over here and learn how to configure them over here. Our friends at GraphAware also wrote a great article about it on their blog.

In order to set up a full text index, we need to actually call a procedure:

//create the indexes

CALL db.index.fulltext.createNodeIndex("fullEmails",["Email"],["text","subject"]);

Here's the cool part: this index actually contains indexing information for multiple properties combined - which is great in this case, as I don't have to separately index the subject and the text (body) of the emails.

Once I run this procedure, I get something like this in my schema:

This clearly indicates the separate nature of that fulltext index.

Now I can do some queries. This also requires me to use a procedure, like this

//do some queries

CALL db.index.fulltext.queryNodes("fullEmails", "trump") YIELD node, score

RETURN node;

This gives us a very simple graph of the emails concerned.

Textually, we can also extract the fulltext match scoring, and only return what Lucene thinks are the most relevant emails. To do that we can do

//query the fulltext index
CALL db.index.fulltext.queryNodes("fullEmails", "trump") YIELD node, score
RETURN node.text, score
order by score desc
limit 5;
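To get a feeling for what "score, order, limit" means, here is a toy relevance ranking in Python. It is only a sketch: it uses a naive TF-IDF-style score over the combined subject and text, which is an assumption made for illustration and not Lucene's actual scoring formula. The sample emails are invented.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_documents(docs, term):
    """Score every document containing `term`; docs maps id -> {'subject', 'text'}."""
    term = term.lower()
    tokens_by_doc = {
        doc_id: tokenize(d["subject"] + " " + d["text"]) for doc_id, d in docs.items()
    }
    matching = [doc_id for doc_id, toks in tokens_by_doc.items() if term in toks]
    if not matching:
        return []
    # Rarer terms get a higher weight (a simple inverse document frequency).
    idf = math.log(1 + len(docs) / len(matching))
    scored = []
    for doc_id in matching:
        toks = tokens_by_doc[doc_id]
        tf = toks.count(term) / len(toks)  # term frequency, normalised by length
        scored.append((doc_id, tf * idf))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top_hits(docs, term, limit=5):
    """Equivalent of ORDER BY score DESC LIMIT n in the query above."""
    return score_documents(docs, term)[:limit]


# Invented sample data, not taken from the corpus.
emails = {
    1: {"subject": "Trump", "text": "trump trump"},
    2: {"subject": "Schedule", "text": "Lunch with Trump next week"},
    3: {"subject": "Weather", "text": "Rain tomorrow"},
}
for doc_id, score in top_hits(emails, "trump"):
    print(doc_id, round(score, 3))
```

Note that indexing subject and text together, as the fulltext index above does, means a hit in either property contributes to the same score.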

The result of that is also interesting:

There are some very interesting and - on the surface of it - weird emails that you can find that way. Look for emails that say "CAN THE DUDE ABIDE?" or "gullibility would trump lofty rationality" in the text. Good stuff.

You can see how this could lead to more and better exploration already, but in order to kind of automate this a bit more, I decided to look for some NLP tools to help me out.

NLP with GraphAware's NLP plugins

Our friends at GraphAware have been working in the Neo4j ecosystem for a long time, and over time they have developed some amazing software. Look at GraphAware Hume for example, and take a look at their videos and presentations for more details.

Hume of course is really slick and definitely the thing to go to if you want to make progress fast, but Graphaware has also open sourced some of their basic NLP tooling as Neo4j plugins that everyone can use. So I decided to take their open source tools for a spin, and apply them to Mrs. Clinton's emails. I am sure she would approve.

So here's what I had to do in order to download and install the GraphAware NLP module on my humble Neo4j instance:

First, get the "GraphAware Framework", which the tools use for some of the infrastructure. Look at this download page to get the latest.
After that, you need to download the NLP modules. There are multiple parts to this: the basic NLP module, the Stanford NLP module, and the language model for the English language.

All of these files needed to be placed in the Neo4j Plugins directory of your Neo4j server:

Once that's done, you need to add a few lines to the neo4j.conf configuration file (in the Conf directory of the Neo4j server):

dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware

com.graphaware.runtime.enabled=true

com.graphaware.module.NLP.1=com.graphaware.nlp.module.NLPBootstrapper

dbms.security.procedures.whitelist=apoc.*,algo.*,ga.nlp.*

In addition to these config changes, I knew from some of my research that I would need a bit of memory for the next bit of the exercise (the Natural Language Processing of the emails), so I quickly moved to increase the heap of my Neo4j server:

dbms.memory.heap.initial_size=8G

dbms.memory.heap.max_size=8G

One more comment: in my environment (which is a mid-sized 2-year old MBP laptop with 16GB of memory) I was seemingly quite strapped for resources. So: I would really recommend that you start your Neo4j server OUTSIDE of the Neo4j Desktop environment, as we have seen in some cases that starting Neo4j inside the Neo4j Desktop can eat a lot more resources than necessary. Just a tip.

So: after having done all of the above, we (re)start the server, and get cracking.

Preparing the NLP

Once you start the server, you will see in the neo4j.log file that the GraphAware NLP components are getting loaded. So now we want to start using that, and in order to do so we first need to enable the database infrastructure. That means creating a Neo4j database "schema" (indexes and constraints) to support the NLP processes. This is quite easily achieved by calling a GraphAware procedure:

//Create the schema

CALL ga.nlp.createSchema();

Next we configure the language of the engine:
CALL ga.nlp.config.setDefaultLanguage('en');

And then we define my email text analysis pipeline that I will be using to process all the emails text properties:

CALL ga.nlp.processor.addPipeline({textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor', name: 'emailanalyser', processingSteps: {tokenize: true, ner: true, dependency: true}, stopWords: '+,result, all, during',

threadNumber: 20});

Now I won't pretend to fully understand what all these configuration options mean - but the tokenize, ner (named entity recognition) and dependency options, all set to true, are necessary for the next steps. You can figure out the other configuration options quite easily.

Next I will make the "emailanalyser" processing pipeline the default by calling this procedure:
CALL ga.nlp.processor.pipeline.default("emailanalyser");

And then I can verify the pipeline with this call:
CALL ga.nlp.processor.getPipelines();

After having done that, I must say that I got into a lot of trouble on my resource-constrained environment. The text processing (which I will explain below) kept failing, and I could not get it to work. So I reached out to Christophe from GraphAware, and asked him for some ideas - and of course he came through like a boss.

Here's the problem: on my memory constrained machine, I was having issues treating some of the larger emails - and this would just kill the processing pipeline - and with it the entire database instance. So I started checking the size of the email bodies:

//check email sizes

match (e:Email)

with size(e.text) as size

where size > 10000

with count(size) as LARGE_EMAILS

match (e:Email)

with LARGE_EMAILS, size(e.text) as size

where size <= 10000

with LARGE_EMAILS, count(size) as NORMAL_EMAILS

return LARGE_EMAILS, NORMAL_EMAILS

And I quickly found out that there's a small bunch of emails (254 of them) that are over 10000 characters, and these guys would be the culprits of my failing process.

All I needed to do is to take this into account in my processing instructions, and we would be fine.

Do the Natural Language text Processing

All the infrastructure was ready, so I could kickstart the process. Here's the query to do that:

CALL apoc.periodic.iterate(

"MATCH (e:Email)

where not (e)-[:HAS_ANNOTATED_TEXT]->()

return e", "

CALL ga.nlp.annotate({text: left(e.text,10000), id: id(e)})

YIELD result

MERGE (e)-[:HAS_ANNOTATED_TEXT]->(result)

RETURN result", {batchSize:1, iterateList:true});

There's a couple of interesting parts to this, right:

we use apoc.periodic.iterate, with a batchsize of 1. That means we will be processing 1 email at a time - using multiple threads, as configured in the pipeline above. This is necessary to avoid deadlocks in the processing.
we cut off the text of the emails after 10000 characters - as we saw above there's only a small number of these larger emails, and otherwise the process would not finish on my little laptop.
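The two points above (one email per batch, and a hard cut-off on the text length) can be sketched in plain Python. This is only an illustration of the control flow: fake_annotate is a hypothetical stand-in for the real NLP call, and the email dictionaries are invented.

```python
MAX_CHARS = 10000  # same cut-off as left(e.text, 10000) in the query above


def fake_annotate(text):
    """Hypothetical stand-in for the NLP call; it only counts tokens and characters."""
    return {"num_tokens": len(text.split()), "chars_seen": len(text)}


def annotate_pending(emails, annotate=fake_annotate, max_chars=MAX_CHARS):
    """Annotate each email that has no annotation yet, one email at a time."""
    processed = 0
    for email in emails:
        if email.get("annotation") is not None:
            continue  # mirrors: WHERE NOT (e)-[:HAS_ANNOTATED_TEXT]->()
        # Truncate before annotating so one huge email cannot exhaust memory.
        email["annotation"] = annotate(email["text"][:max_chars])
        processed += 1
    return processed


# Invented sample data: one short email and one that exceeds the cut-off.
emails = [
    {"id": 1, "text": "short note"},
    {"id": 2, "text": "word " * 5000},  # 25000 characters, will be truncated
]
print(annotate_pending(emails))  # first run handles both emails
print(annotate_pending(emails))  # second run finds nothing left to do
```

Because already-annotated emails are skipped, the job can simply be restarted after a crash and will pick up where it left off, which is the same property the WHERE NOT clause gives the Cypher version.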

Once I did that, all I needed to do was wait - it took about 65 minutes to finish the process:

1 row available after 3927228 ms, consumed after another 6 ms

And the schema of the database now looks a lot more complicated :) - see below.

The core of all of this work seems to be the AnnotatedText nodes - which are connected to the emails. However, these don't always provide a good basis for further analysis - we need to do one more step on top of this to make real use of it, and that is to extract keywords from these annotations. Let's do that next.

Do the keyword extraction

The process of keyword extraction is really well explained in this article from the GraphAware website. It explains a multi-step process, called TextRank, that works as follows:

Pre-select relevant words from the NLP annotated text.
Create a graph of tag co-occurrences.
Run undirected weighted PageRank on this graph.
Finally, save the top 1/3 of tags as keywords and identify key phrases.
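To make these steps a bit more tangible, here is a minimal Python sketch of the idea. It is not GraphAware's implementation: the window size, damping factor and iteration count are assumptions, the word list is invented and assumed to be already pre-selected (step 1), and the key phrase identification of the last step is left out.

```python
from collections import defaultdict


def cooccurrence_graph(words, window=2):
    """Undirected weighted graph: the weight counts how often two distinct
    words appear within `window` positions of each other."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            other = words[j]
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph


def weighted_pagerank(graph, damping=0.85, iterations=50):
    """Plain power iteration; each node shares its rank with its neighbours
    in proportion to the edge weights."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    strength = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            incoming = sum(rank[m] * w / strength[m] for m, w in graph[n].items())
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def top_third(rank):
    """Keep the best-ranked third of the words (at least one, if any exist)."""
    ordered = sorted(rank, key=lambda n: (-rank[n], n))
    keep = max(1, len(ordered) // 3)
    return ordered[:keep]


# Invented, already pre-selected words standing in for the tags of one email.
words = ["benghazi", "committee", "hearing", "benghazi",
         "committee", "report", "benghazi", "attack"]
rank = weighted_pagerank(cooccurrence_graph(words))
print(top_third(rank))
```

The word that co-occurs most often, and with the most neighbours, collects the most rank, which is why frequently discussed topics bubble up as keywords.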

That's what we want! Here's the super simple command to start this:

MATCH (a:AnnotatedText)

CALL ga.nlp.ml.textRank({annotatedText: a, stopwords: '+,other,email', useDependencies: true})

YIELD result RETURN result

This actually happens quite quickly, and after a short while we end up with some really interesting Keywords and Tags in our graph. Just run:

match (t:Tag) with count(t) as tags
match (k:Keyword) with tags, count(k) as keywords
return tags, keywords;

and see the result:

Now, I am not sure about your ideas here, but it seems to me like there are still quite a few too many Keywords here to really help me in understanding this email corpus a little better. Look at how many multi-word keywords we still have in this graph:

MATCH (k:Keyword)-[:DESCRIBES]->(a:AnnotatedText)
WHERE size(split(k.value, " "))>1
RETURN k.value AS Keyphrase, count(*) AS count
ORDER BY count DESC;

And we see that it's really quite a lot.

So that's why it makes sense to do some postprocessing on this, and reduce the number of Keywords by grouping them together.

Do some post-processing

The postprocessing is quite important, but it requires two additional indexes to be put in place in order to run efficiently:

CREATE INDEX ON :Keyword(numTerms);
CREATE INDEX ON :Keyword(value);

Then we just call the Textrank postprocessing process as follows:

CALL ga.nlp.ml.textRank.postprocess({keywordLabel: "Keyword", method: "subgroups"})
YIELD result
RETURN result;

This returns after a good 6-7 minutes.

And then we can actually really easily observe that the Keywords have now been linked together into groups:

MATCH p=()-[r:HAS_SUBGROUP]->() RETURN p LIMIT 25

Gives you this graph which is quite telling and impressive, in my opinion:
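I have not looked at how the "subgroups" method decides which Keywords belong together, so take the following Python sketch as one plausible reading only: it links two keyphrases when all the words of the shorter one also appear in the longer one. The containment rule, the function name and the sample keyphrases are all assumptions, and the sketch says nothing about the direction in which the plugin stores HAS_SUBGROUP.

```python
def find_subgroups(keyphrases):
    """Return (longer, shorter) pairs where every word of `shorter`
    also appears in `longer` (a strict subset of its words)."""
    term_sets = {k: set(k.split()) for k in keyphrases}
    pairs = []
    for longer in keyphrases:
        for shorter in keyphrases:
            if longer != shorter and term_sets[shorter] < term_sets[longer]:
                pairs.append((longer, shorter))
    return pairs


# Invented keyphrases, for illustration only.
keyphrases = [
    "benghazi committee",
    "benghazi",
    "select benghazi committee",
    "state department",
]
for longer, shorter in find_subgroups(keyphrases):
    print(longer, "<->", shorter)
```

Grouping like this is what lets us collapse many near-duplicate keyphrases into a handful of related clusters.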

If we now take a look at the keywords again and exclude the ones with subgroups:

MATCH (k:Keyword)-[:DESCRIBES]->(a:AnnotatedText)
WHERE k.numTerms > 1 AND NOT (k)-[:HAS_SUBGROUP]->(:Keyword)-[:DESCRIBES]->(a)
RETURN k.value as Keyphrase, count(*) AS count
ORDER BY count DESC
LIMIT 20

Then we see a more manageable set of keywords to explore

Look at the above: this immediately reveals some interesting stuff, right? The first three keyphrases are not bringing us anything material, but the fourth one immediately shows us that a bunch of these emails were about... the Benghazi committee. This is related to the 2012 Benghazi attack that killed a number of US diplomatic staff, and for which Mrs. Clinton was quite harshly treated and accused in the US congressional committees.

Wrapping up: Benghazi

As a wrap up to this post, I am going to explore the emails related to Benghazi a bit more. Let's take a look at who are the players in the email conversation on Benghazi. Here's a query for that:

match path=(k:Keyword)-[:DESCRIBES]->(a:AnnotatedText)<-[:HAS_ANNOTATED_TEXT]-(e:Email)--(p:Person) where k.value contains "benghazi" return path

and we get this result:

We can of course look at it a little differently, and look at the "relevance" of the Keyword relationships, and look at the most relevant emails first. Here's a query that does that:

match (k:Keyword)-[r]-(:AnnotatedText)--(e:Email)--(p:Person)
where k.value contains "benghazi"
return k.value, r.relevance, e.subject, collect(p.alias)
order by r.relevance desc

This gives us this result:

That's about it for now, and here's where I will wrap up this blogpost series. It's been one of the more interesting "assignments" that I have taken on, even knowing that I could do a lot more here (like filtering out some stopwords, filtering out less relevant keywords, or pruning the graph to only look at (groups of) keywords). But that will be something that you can do, maybe?

Hope this was a fun and useful read for you - it certainly was for me!

All the best

Rik
