Bruggen Blog: analytics


Monday, 16 December 2019

Part 3/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 3 of this blogpost. Part 1 and Part 2 are also published.)

Alright, this is going to be the third and final part of my work on the Hillary Clinton Email Corpus. There are two posts that came before this article:

in the first post we focused on importing and refactoring the data in Neo4j
in the second post we spent some time analysing the dataset with some algorithms and some specific pattern matching queries

Now we are going to spend some time with the "heart of the matter", the actual content of the emails. We are going to do that in two steps: first we will do some "full text" querying of the data, using Neo4j's specific full text indexing capabilities. Then we are going to go a step further and try to extract more knowledge from this dataset in an automated way, by running some Natural Language Processing (NLP) algorithms and processes on it.

Let's get right to it.

Fulltext querying of Emails

Those of you who have been following Neo4j for some time may remember that we have always bundled Apache Lucene with Neo4j. For the longest time, Neo4j used Lucene for its indexing capabilities. This turned out to be a great choice for many things, but also one that had its limitations and trade-offs. This is why Neo4j has gradually been switching away from Lucene for its core schema indexing capability, and has adopted a modular, pluggable indexing architecture that allows for different indexing techniques to be used for different data types. This is great news for many reasons, but one of the most important benefits has been a dramatic increase in write performance - as the newer indexes are much more optimized and leaner than the older Lucene based structures. Read more about indexing in the Neo4j manual.

So as I started to think about some text-oriented queries, I quickly realised that I would need an index on Email text. So I wanted to do

create index on :Email(text)

and query that index afterwards. But the result was pretty obvious:

Part 2/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 2 of this blogpost. Part 1 and Part 3 are also published.)

In the previous post about the emails of Hillary Clinton, we were able to import the data from a CSV file, and use some really cool graph refactoring tools to make the database a little easier to work with - bad data is bad data, and the less we have of that the better.

So we ended up in a reasonably stable state, where we could do some querying. In this post, we will do exactly that.

Exploring the graph with graph algos

It's fairly easy to get a good initial view of the structure and size of the graph. I just run a few queries like this:

//what nodes are in the db

match (n) return labels(n), count(n)

and:

//what rels are in the db

MATCH p=()-[r]->() RETURN type(r), count(r)

and we very quickly see that, while this is clearly not a "big" dataset, it's still big enough to start losing some significant time sifting through data if you want to make some sense of it. This is where our fantastic graph algorithms come in. I installed the plugin into my database, restarted it, and then I also played around a bit with Neuler, a graph algo playground that basically allows you to quickly experiment with different algorithms. You can download Neuler from https://install.graphapp.io/ and install it into your Neo4j Desktop really quickly.

Part 1/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 1 of this blogpost. Part 2 and Part 3 are also published.)

With lots of interesting political manoeuvring going on in the USA and in Europe, I somehow got into a rabbit hole where I came across the corpus of emails that were published in the aftermath of the 2016 US presidential elections. They have been analysed a number of times, both by citizens and the press: see the great site published by the Wall Street Journal and Ben Hamner's github repo (which is based on a Kaggle dataset).

Some of my friends and colleagues have also done some work on this dataset in Neo4j - there's this graphgist, Linkurio.us' blogpost, as well as Ryan Boyd's older article on DeveloperAdvocate. But I decided I was interested enough to take it for a spin.

Importing the email corpus into Neo4j

I got the dataset from this url, and it looks pretty straightforward. There's a very simple datamodel that we can work with, which would look something like this:

Podcast Interview with Dilyan Damyanov, Snowplow Analytics

Here's another great podcast for you: I had a chat with Dilyan Damyanov of Snowplow Analytics about how you can use a graph database for enhancing your event analytics, specifically for clickstream analysis. I wrote about this myself a while back, but of course there is so much more to it - and Snowplow has really done a great job at enabling it with their toolset.

Here's our chat:

Here's the transcript of our conversation:

RVB: 00:00:14.000 Hello everyone. My name is Rik Van Bruggen from Neo4j and here I am recording another Graphistania Neo4j podcast. And today, I've got someone from London on the phone. That's Dilyan Damyanov. Hi, Dilyan.

Graph Karaoke using "Natural Language Analytics": Billie Jean

Last week, my friend and colleague Michael wrote a really interesting blogpost on natural language analytics using Neo4j. He used the One Ring poem as an example of how you could use Cypher to analyse a text file and put it into a Neo4j database for some advanced analytics. That immediately made me think about my Graph Karaoke Playlist, and how I could use this technique for some more Graph Karaoke generation. Wouldn't that be nice? More graph karaoke == good!

So in this post I will show you how easy it is to get this done. A couple of quick steps is all that is needed. Let's run through it and show you how it's done.

Loading a song

The first thing to do, as always, was picking a song. So this time, my kids picked it:

Billie Jean, by the King of Pop himself. Not wanting to sound pretentious, but I think it's great for my kids to be big fans of that kind of music - seems like all of our educational efforts are yielding some results :) ...

Then I picked up the lyrics of the song over here, and put them into a Google sheet. The reason is that I wanted to do one small manipulation to the file in order to be able to use it for Karaoke: I added the Songpart and the Songpartsentence in two additional columns. Plus: the Google sheet has a very easy conversion into a csv file that we can then point the Load CSV process to.

Customizing the query

With that CSV file available, I then proceeded to customize Michael's query. Here it is:

 //create the karaoke graph  
 load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE/export?format=csv&id=1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE&gid=0" as csv  
 with csv.Songpart as songpart, csv.Songpartsentence as songpartsentence, csv.Songsentence as row  
 unwind row as text  
 with songpart, songpartsentence, reduce(t=tolower(text), delim in [",",".","!","?",'"',":",";","'","-"] | replace(t,delim,"")) as normalized  
 with songpart, songpartsentence, [w in split(normalized," ") | trim(w)] as words  
 unwind range(0,size(words)-2) as idx  
 MERGE (w1:Word {name:words[idx]})  
 MERGE (w2:Word {name:words[idx+1]})  
 MERGE (w1)-[r:NEXT {songpart:toInt(songpart), songpartsentence:toInt(songpartsentence)}]->(w2)  
  ON CREATE SET r.count = 1 ON MATCH SET r.count = r.count +1

Let's run through this query to make it easier for you to digest. We start with the "load csv" statement. We point to the csv download link mentioned above, use the first row as headers, and refer to each row with an identifier called "csv".

 load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE/export?format=csv&id=1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE&gid=0" as csv

Then we pull the csv into three different sets that we can address separately with separate identifiers:

 with csv.Songpart as songpart, csv.Songpartsentence as songpartsentence, csv.Songsentence as row

Then we use the Cypher "unwind" operator to create separate rows out of the "row" collection, and call these rows containing lyrics "text".

 unwind row as text

Afterwards, we are going to use "reduce" to lowercase the text and remove punctuation marks, and then split the text into individual lyrical words:

 with songpart, songpartsentence, reduce(t=tolower(text), delim in [",",".","!","?",'"',":",";","'","-"] | replace(t,delim,"")) as normalized  
 with songpart, songpartsentence, [w in split(normalized," ") | trim(w)] as words

Lastly, we want to write these words into the graph. In order to do that, we are going to use "unwind" over a range of index positions, and then step through every sentence to generate the sequences of adjacent words. We do that with "Merge", first for the words, and then for the relationships. On every relationship, we will "karaoke-ize" the graph by assigning "songpart" and "songpartsentence" identifiers to every relationship.

 UNWIND range(0,size(words)-2) as idx  
 MERGE (w1:Word {name:words[idx]})  
 MERGE (w2:Word {name:words[idx+1]})  
 MERGE (w1)-[r:NEXT {songpart:toInt(songpart), songpartsentence:toInt(songpartsentence)}]->(w2)  
  ON CREATE SET r.count = 1 ON MATCH SET r.count = r.count +1
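If you want to check the logic outside of Neo4j, here is a rough plain-Python sketch of the same steps: lowercase, strip the punctuation, split on spaces, and count every pair of adjacent words per songpart and songpartsentence. The helper names and the sample line are made up for illustration, and the Counter only stands in for the NEXT relationships and their count property.

```python
from collections import Counter

# Same punctuation list as in the Cypher "reduce" step.
PUNCTUATION = [",", ".", "!", "?", '"', ":", ";", "'", "-"]


def normalize(text):
    """Lowercase the text and strip the punctuation marks."""
    normalized = text.lower()
    for delim in PUNCTUATION:
        normalized = normalized.replace(delim, "")
    return normalized


def word_pairs(text):
    """Return every (word, next_word) pair, like unwinding range(0, size(words)-2)."""
    words = [w.strip() for w in normalize(text).split(" ")]
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]


def count_next(rows):
    """Mimic the MERGE on :NEXT by counting pairs per songpart/songpartsentence."""
    counts = Counter()
    for row in rows:
        songpart = int(row["Songpart"])            # toInt(songpart)
        sentence = int(row["Songpartsentence"])    # toInt(songpartsentence)
        for w1, w2 in word_pairs(row["Songsentence"]):
            counts[(w1, w2, songpart, sentence)] += 1
    return counts


# Hypothetical sample row, not the real lyrics.
sample_rows = [
    {"Songpart": "1", "Songpartsentence": "1",
     "Songsentence": "The cat sat. The cat ran!"},
]
next_counts = count_next(sample_rows)
print(next_counts[("the", "cat", 1, 1)])
```

The pair ("the", "cat") occurs twice in the sample line, so its counter ends up at 2, just like r.count would after one ON CREATE and one ON MATCH.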

That was easy!

So where is the KARAOKE???

Hah! That's what you came here for huh? Well, here's the result.

I have put the queries on a gist so that you can take a look at it yourself. If you have any comments, then please let me know!

Cheers

Rik

Friday, 19 September 2014

Graphs for HR Analytics

Yesterday, I had the pleasure of doing a talk at the Brussels Data Science meetup. Some really cool people there, with interesting things to say. My talk was about how graph databases like Neo4j can contribute to HR Analytics. Here are the slides of the talk:

I truly had a lot of fun delivering the talk, but probably even more preparing for it.

The basic points that I wanted to get across were these:

The HR function could really benefit from a more real world understanding of how information flows in its organization. Information flows through the *real* social network of people in your organization - independent of your "official" hierarchical / matrix-shaped org chart. Therefore it follows logically that it would really benefit the HR function to understand and analyse this information flow, through social network analysis.
In recruitment, there is a lot to be said for integrating social network information into your recruitment process. This is logical: the social network will tell us something about the social, friendly ties between people - and that will tell us something about how likely they are to form good, performing teams. Several online recruitment platforms are starting to use this - e.g. Glassdoor uses Neo4j to store more than 70% of the Facebook sociogram - to really differentiate themselves. They want to suggest and recommend the jobs that people really want.
In competence management, large organizations can gain a lot by accurately understanding the different competencies that people have / want to have. When putting together multi-disciplinary, often global teams, this can be a huge time-saver for the project offices chartered to do this.

For all three of these points, a graph database like Neo4j can really help. So I put together a sample dataset and some queries that should explain this. Broadly speaking, these queries fall into three categories:

"Deep queries": these are the types of queries that perform complex pattern matches on the graph. As an example, that would be something like: "Find me a friend-of-a-friend of Mike that has the same competencies as Mike, has worked or is working at the same company as Mike, but is currently not working together with Mike." In Neo4j Cypher, that would be something like this:

 match (p1:Person {first_name:"Mike"})-[:HAS_COMPETENCY]->(c:Competency)<-[:HAS_COMPETENCY]-(p2:Person),  
 (p1)-[:WORKED_FOR|:WORKS_FOR]->(co:Company)<-[:WORKED_FOR]-(p2)  
 where not((p1)-[:WORKS_FOR]->(co)<-[:WORKS_FOR]-(p2))  
 with p1,p2,c,co  
 match (p1)-[:FRIEND_OF*2..2]-(p2)  
 return p1.first_name+' '+p1.last_name as Person1, p2.first_name+' '+p2.last_name as Person2, collect(distinct c.name), collect(distinct co.name) as Company;

"Pathfinding queries": these allow you to explore the paths from a certain person to other people - and see how they are connected to each other. For example, if I wanted to find paths between two people, I could do

 match p=AllShortestPaths((n:Person {first_name:"Mike"})-[*]-(m:Person {first_name:"Brandi"}))  
 return p;

and get this:

Which is a truly interesting and meaningful representation in many cases.
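To make clear what "all shortest paths" means here, this is a small plain-Python sketch of the idea, using a breadth-first search over an undirected friendship graph. It is not how Neo4j implements AllShortestPaths internally; the function names and the sample people (other than Mike and Brandi) are made up for illustration.

```python
from collections import deque


def undirected(edges):
    """Build an adjacency dict, ignoring relationship direction (like -[*]-)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency


def all_shortest_paths(adjacency, start, goal):
    """Return every path of minimal length between start and goal."""
    if start == goal:
        return [[start]]
    distance = {start: 0}
    parents = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break  # every parent of goal has already been processed
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                parents[neighbour] = [node]
                queue.append(neighbour)
            elif distance[neighbour] == distance[node] + 1:
                parents[neighbour].append(node)  # another equally short route
    if goal not in distance:
        return []

    def build(node):
        # Walk back from the goal through all recorded parents.
        if node == start:
            return [[start]]
        return [path + [node] for parent in parents[node] for path in build(parent)]

    return build(goal)


# Hypothetical sample network.
friends = undirected([
    ("Mike", "Ann"), ("Mike", "Bob"), ("Ann", "Brandi"), ("Bob", "Brandi"),
    ("Mike", "Carl"), ("Carl", "Dora"), ("Dora", "Brandi"),
])
for path in all_shortest_paths(friends, "Mike", "Brandi"):
    print(" - ".join(path))
```

In this sample there are two equally short routes (via Ann and via Bob), and the longer detour via Carl and Dora is left out.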

Graph Analysis queries: these are queries that look at some really interesting graph metrics that could help us better understand our HR network. There are some really interesting measures out there, like for example degree centrality, betweenness centrality, pagerank, and triadic closures. Below are some of the queries that implement these (note that I have done some of these also for the Dolphin Social Network). Please be aware that these queries are often times "graph global" queries that can consume quite a bit of time and resources. I would not do this on truly large datasets - but in the HR domain the datasets are often quite limited anyway, and we can consider them as valid examples.

 //Degree centrality  
 match (n:Person)-[r:FRIEND_OF]-(m:Person)  
 return n.first_name, n.last_name, count(r) as DegreeScore  
 order by DegreeScore desc  
 limit 10;  
   
 //Betweenness centrality  
 MATCH p=allShortestPaths((source:Person)-[:FRIEND_OF*]-(target:Person))  
 WHERE id(source) < id(target) and length(p) > 1  
 UNWIND nodes(p)[1..-1] as n  
 RETURN n.first_name, n.last_name, count(*) as betweenness  
 ORDER BY betweenness DESC  
   
 //Missing triadic closures  
 MATCH path1=(p1:Person)-[:FRIEND_OF*2..2]-(p2:Person)  
 where not((p1)-[:FRIEND_OF]-(p2))  
 return path1  
 limit 50;  
   
 //Calculate the pagerank  
 UNWIND range(1,10) AS round  
 MATCH (n:Person)  
 WHERE rand() < 0.1 // 10% probability  
 MATCH (n:Person)-[:FRIEND_OF*..10]->(m:Person)  
 SET m.rank = coalesce(m.rank,0) + 1;
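For readers who want to see what two of these metrics actually compute, here is a small plain-Python sketch of degree centrality and missing triadic closures on an undirected friendship graph. The function names and the sample network are made up for illustration. Note one difference with the Cypher above: the sketch reports every unconnected pair once, while the Cypher query returns the two-hop paths themselves.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected FRIEND_OF network as a dict of neighbour sets."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degree_scores(adjacency, limit=10):
    """Number of friends per person, highest first (ties ordered by name)."""
    scores = [(person, len(friends)) for person, friends in adjacency.items()]
    return sorted(scores, key=lambda item: (-item[1], item[0]))[:limit]


def missing_triadic_closures(adjacency):
    """Pairs of people who share a friend but are not friends themselves."""
    missing = set()
    for friends in adjacency.values():
        for p1, p2 in combinations(sorted(friends), 2):
            if p2 not in adjacency[p1]:
                missing.add((p1, p2))
    return sorted(missing)


# Hypothetical sample network.
network = build_adjacency([
    ("Mike", "Ann"), ("Mike", "Bob"), ("Ann", "Bob"), ("Bob", "Carl"),
])
print(degree_scores(network))
print(missing_triadic_closures(network))
```

In this sample Bob has the highest degree, and Carl could be introduced to both Ann and Mike to close the open triangles around Bob.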

I am sure you could come up with plenty of other examples. Just to make the point clear, I also made a short movie about it:

The queries for this entire demonstration are on Github. I hope you like it, and that everyone understands that Graph Databases can truly add value in an HR Analytics context.

Feedback, as always, much appreciated.

Rik
