Monday 16 December 2019

Part 1/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 1 of this blogpost. Part 2 and Part 3 are also published.)

With lots of interesting political manoeuvring going on in the USA and in Europe, I somehow got into a rabbit hole and came across the corpus of emails that was published in the aftermath of the 2016 US presidential elections. It has been analysed a number of times, both by citizens and the press: see the great site published by the Wall Street Journal and Ben Hamner's github repo (which is based on a Kaggle dataset).

Some of my friends and colleagues have also done some work on this dataset in Neo4j - there's this graphgist, Linkurio.us' blogpost, as well as Ryan Boyd's older article on DeveloperAdvocate. But I decided I was interested enough to take it for a spin.

Importing the email corpus into Neo4j

I got the dataset from this url, and it looks pretty straightforward. There's a very simple data model that we can work with, which looks something like this:


Running the import is really easy. First we prepare some indexes to make sure that we get decent import performance:

//setting up the indexes
create constraint on (p:Person)
assert p.alias is unique;
create constraint on (e:Email)
assert e.id is unique;
create index on :Email(subject);

And then we can run the import:

// Creating the graph
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://s3-us-west-2.amazonaws.com/neo4j-datasets-public/Emails-refined.csv" AS line
MERGE (fr:Person {alias: COALESCE(line.MetadataFrom, line.ExtractedFrom, '')})
MERGE (to:Person {alias: COALESCE(line.MetadataTo, line.ExtractedTo, '')})
MERGE (em:Email { id: line.Id })
ON CREATE SET em.foia_doc=line.DocNumber, em.subject=line.MetadataSubject, em.to=line.MetadataTo, em.from=line.MetadataFrom, em.text=line.RawText, em.ex_to=line.ExtractedTo, em.ex_from=line.ExtractedFrom
MERGE (to)<-[:TO]-(em)-[:FROM]->(fr)
MERGE (fr)-[r:HAS_EMAILED]->(to)
ON CREATE SET r.count = 1
ON MATCH SET r.count = r.count +1;

Note that the import is significantly faster if you can load the file from your local filesystem - which is what I did most of the time. Seven seconds or so later, the result should be loaded.



The script for the import can be found over here. However, soon after the import, I noticed some problems with data quality. For example, there were quite a few duplicate "Person" entities, and some nodes whose "aliases" were not aliases at all, but actual email addresses. I wanted to fix that, as it could really affect some of the analysis I wanted to do later on.
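Spotting those duplicates can be semi-automated. Here's a minimal Python sketch (my own illustration, not part of the original workflow) that groups aliases by a crude normalised key - the surname, i.e. the first alphabetic token, lowercased - so that variants like "Axelrod, David M" and "Axelrod_D" at least end up next to each other for manual review:

```python
import re
from collections import defaultdict

def surname_key(alias: str) -> str:
    """Crude normalisation: the first alphabetic token, lowercased.
    'Axelrod, David M' and 'Axelrod_D' both map to 'axelrod'."""
    match = re.match(r"[A-Za-z]+", alias)
    return match.group(0).lower() if match else alias.lower()

def duplicate_candidates(aliases):
    """Group aliases by surname key; keep only groups with more than one spelling."""
    groups = defaultdict(list)
    for alias in aliases:
        groups[surname_key(alias)].append(alias)
    return {key: spellings for key, spellings in groups.items() if len(spellings) > 1}

# Example with a few aliases from the dataset:
aliases = ["Axelrod, David M", "Axelrod_D", "Mills, Cheryl D", "Abedin, Huma"]
print(duplicate_candidates(aliases))
# {'axelrod': ['Axelrod, David M', 'Axelrod_D']}
```

Feeding it the full list of alias values (e.g. exported from a `match (p:Person) return p.alias` query) gives a shortlist of candidate groups - which still need eyeballing, since people with the same surname are not necessarily the same person.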


Refactoring the email corpus in Neo4j

To do something about these quality problems in the dataset, I decided to do some refactoring inside Neo4j. It's actually super easy to do that.

Refactoring email addresses

The first step is to review all the alias properties of the :Person nodes and see if the alias already contains an email address - and if so, copy it to a p.email_address property.

So we ran

match (p:Person) 
where p.alias contains ".gov"
return p.alias 
order by p.alias

and got the list of email addresses that needed fixing:

Then we ran this simple query to copy the email address to a property:

match (p:Person) where p.alias contains ".gov"
set p.email_address = p.alias;

And that fixed most of the problem. Now we needed to tackle the bigger issue: removing all the duplicate Person nodes.

Refactoring duplicate Person nodes

To illustrate the problem, I quickly ran this query

match path = (p:Person)-[r]-()
where p.alias contains "Axelrod"
return path

which shows you this:


The problem is pretty easy to see: we have multiple nodes for Mr. Axelrod, and they refer to different emails to Ms. Clinton. We need to come up with a way to "merge" these nodes and the relationships that connect to them. Not a trivial thing to do - so I turned to APOC. It turns out there's a procedure in this library that does exactly what we need: apoc.refactor.mergeNodes.

Here's how that works:

//first we create a list of "aliases"
with ["Axelrod, David M", "Axelrod_D"] as list
//then we look for all Person nodes that have an alias property in that list
match (a:Person)
where a.alias in list
//then we collect these nodes - and we grab the FIRST element of the above list as the "to be" future alias property.
with collect(a) as nodes, list[0] as tobeAlias
//Then we call the procedure:
call apoc.refactor.mergeNodes(nodes,{
properties:"discard",
mergeRels:true
})
yield node
//for the resulting node, we now set the alias property to what we want it to be
set node.alias = tobeAlias
return "Nodes merged.";

The important thing here is of course the order of the list of aliases that we provide to the query - at the very end, we set the alias property of the merged node to the first element of that list. So let's run the query:


And then we get the exact result that we want:

Now, all we need to do is to run a statement like that for every one of these duplicate nodes in our dataset. Call me stupid - or call me a non-developer, if you so wish - but I did not know of any easy solution to this, and ended up turning to a Google sheet to 
  • first document all of the duplicate alias names that I wanted to deduplicate - this of course is a very manual process, but it's not very complicated and did not take very long. I actually refined this over a couple of runs as I spotted more and more irregularities in my dataset. Note that this list of aliases is ORDERED - I want the first one in the list to become the future property value of my merged node.
  • then generate the statements that I needed using a very simple formula that took the list in column B and generated the cypher statement like this:
="with ["&B2&"] as list match (a:Person) where a.alias in list with collect(a) as nodes, list[0] as tobeAlias call apoc.refactor.mergeNodes(nodes,{ properties:""discard"", mergeRels:true }) yield node set node.alias = tobeAlias return ""Nodes merged."";"


You can find the Google sheet over here. It looks a bit like this:


All of the statements have been put together in this gist. Running these generated statements would clean up all of the duplicated nodes in no time. Great.
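For the record, the spreadsheet formula is easy to reproduce in a few lines of Python - a sketch that takes the same ordered groups of aliases (first element = the surviving alias) and prints one mergeNodes statement per group:

```python
def merge_statement(alias_group):
    """Build one apoc.refactor.mergeNodes statement for an ORDERED list of
    duplicate aliases; the first alias in the list survives on the merged node."""
    quoted = ", ".join('"%s"' % alias for alias in alias_group)
    return (
        f"with [{quoted}] as list "
        "match (a:Person) where a.alias in list "
        "with collect(a) as nodes, list[0] as tobeAlias "
        "call apoc.refactor.mergeNodes(nodes,{properties:\"discard\", mergeRels:true}) "
        "yield node set node.alias = tobeAlias "
        'return "Nodes merged.";'
    )

# One statement per group of duplicates, ready to paste into the browser:
groups = [["Axelrod, David M", "Axelrod_D"]]
for group in groups:
    print(merge_statement(group))
```

Same idea as the sheet, just without the double-double-quote escaping gymnastics.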

Final cleanup of some more alias problems

As so often happens, these graph refactorings turn out to be very iterative. As I was exploring the dataset on the hunt for some more fixes, I found this problem:


This is basically a Person whose alias property is wrongly set: it contains a string of aliases that should have been split up. To fix this, I had to do some manual wrangling, and ended up writing this query:

match (to:Person)<--(e:Email)-->(from:Person)-->(to)
where to.alias = "Mills, Cheryl D; Toiv, Nora F; Sullivan, Jacob 3; Ross, Alec J; Slaughter, Anne-Marie; ChoIlet, Derek H; Crowley, Philip"
with from, to, e
match (cheryl:Person {alias: "Mills, Cheryl D"})
match (nora:Person {alias: "Toiv, Nora F"})
match (jacob:Person {alias: "Sullivan, Jacob H"})
match (am:Person {alias: "Slaughter, Anne-Marie"})
match (derek:Person {alias:"Chollet, Derek H"})
match (philip:Person {alias:"Crowley, Philip J"})
merge (alec:Person {alias: "Ross, Alec J"})
create (from)-[:HAS_EMAILED]->(cheryl)
create (cheryl)<-[:TO]-(e)
create (from)-[:HAS_EMAILED]->(nora)
create (nora)<-[:TO]-(e)
create (from)-[:HAS_EMAILED]->(jacob)
create (jacob)<-[:TO]-(e)
create (from)-[:HAS_EMAILED]->(alec)
create (alec)<-[:TO]-(e)
create (from)-[:HAS_EMAILED]->(am)
create (am)<-[:TO]-(e)
create (from)-[:HAS_EMAILED]->(derek)
create (derek)<-[:TO]-(e)
create (from)-[:HAS_EMAILED]->(philip)
create (philip)<-[:TO]-(e)
detach delete to;
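Should more of these compound aliases turn up, the hand-written query above could also be generated. Here's a minimal Python sketch (my own helper, not from the original workflow) that takes the dirty compound alias plus the hand-corrected list of individual aliases - they do have to be corrected by hand, since the compound string contains typos like "ChoIlet" and "Jacob 3" - and emits the rewiring statement. It simplifies the query above by using merge throughout, which matches existing Person nodes or creates the missing ones:

```python
def split_alias_statement(compound_alias, clean_aliases):
    """Generate a Cypher statement that rewires every email addressed to the
    bogus compound Person onto the individual Persons, then deletes it.
    clean_aliases must be corrected by hand (the compound string has typos)."""
    lines = [
        "match (to:Person)<--(e:Email)-->(from:Person)-->(to)",
        'where to.alias = "%s"' % compound_alias,
        "with from, to, e",
    ]
    # merge = match-or-create each individual recipient
    for i, alias in enumerate(clean_aliases):
        lines.append('merge (p%d:Person {alias: "%s"})' % (i, alias))
    # rewire the email and the HAS_EMAILED shortcut to each of them
    for i in range(len(clean_aliases)):
        lines.append("create (from)-[:HAS_EMAILED]->(p%d)" % i)
        lines.append("create (p%d)<-[:TO]-(e)" % i)
    lines.append("detach delete to;")
    return "\n".join(lines)

print(split_alias_statement(
    "Mills, Cheryl D; Toiv, Nora F",
    ["Mills, Cheryl D", "Toiv, Nora F"],
))
```

Note that, unlike the Cypher above, this sketch does not maintain the r.count property on the generated HAS_EMAILED relationships - with only a handful of these fixes, that hardly matters.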

And this gave me this new, and more correct view of the above picture:


That's about it for now. We now have a great basic email database that we can start exploring, and that will be the basis for further queries and analytics. All of the above queries and scripts are on Github, of course, and I would love to hear your feedback.

Next: start playing around with the imported email corpus.

Cheers

Rik


