Showing posts with label betweenness centrality. Show all posts
Showing posts with label betweenness centrality. Show all posts

Monday, 29 June 2020

Executives of Belgian Public Companies - revisited!

Monday, 16 December 2019

Part 3/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 3 of this blogpost.  Part 1 and Part 2 are also published.)

Alright this is going to be the third and final part of my work on the Hillary Clinton Email Corpus. There's two posts that came before this article:
Now we are going to spent some time with the "heart of the matter", the actual content of the emails. We are going to do that in two steps: first we will do some "full text" querying of some data, using Neo4j's specific full text indexing capabilities. Then we are going to go a step further and try to extract more knowledge from this dataset in an automated way, by running some Natural Language Processing (NLP) algorithms and processes on it.

Let's get right to it.

Fulltext querying of Emails

Those of you that have been following Neo4j for some time, may remember that we have always bundled Apache Lucene with Neo4j. For the longest time, Neo4j used Lucene for it's indexing capabilities. This turned out to be a great choice for many things, but also one that had its limitations and trade-offs. This is why Neo4j has gradually been switching away from Lucene for its core schema indexing capability, and has adopted a modular, pluggable indexing architecture that allows for different indexing techniques to be used for different data types. This is great news for many reasons, but one of the most important benefits has been a dramatic increase in write performance - as the newer indexes are much more optimized and leaner than the older Lucene based structures. Read more about indexing in the Neo4j manual.

So as I started to think about some text-oriented queries, I quickly realised that I would need an index on Email text. So I wanted to do

create index on :Email(text)

and query that index afterwards. But the result was pretty obvious:


Part 2/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 2 of this blogpost.  Part 1 and Part 3 are also published.)

In the previous post around the emails of Hillary Clinton, we were able to import the data from a CSV file, and use some really cool graph refactoring tools to make the database a little more easy to work with - bad data is bad data, and the less we have of that the better.

So we ended up in a reasonably stable state, where we could do some querying. In this post, we will do exactly that.

Exploring the graph with graph algos

It's fairly easy to get a good initial view of the structure and size of the graph. I just run a few queries like this:

//what nodes are in the db
match (n) return labels(n), count(n)

and: 

//what rels are in the db
MATCH p=()-[r]->() RETURN type(r), count(r)

and we very quickly see that, while this is clearly not a "big" dataset, it's still big enough to start loosing some significant time sifting through data if you want to make some sense of it. This is where our fantastic graph algorithms come in. I installed the plugin into my database, restarted it, and then I also played around a bit with Neuler, a graph algo playground that basically allows you to quickly experiment with different algorithms. You can download Neuler from https://install.graphapp.io/ and install it into your Neo4j Desktop really quickly.

Part 1/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 1 of this blogpost. Part 2 and Part 3 are also published.)

With lots of interesting political manoeuvring going on in the USA and in Europe, I somehow got into a rabbit hole where I came across the corpus of emails that were published in the aftermath of the 2016 US presidential elections. They have been analysed a number of times, both by citizens and the press: see the great site published by the Wall Street Journal and Ben Hamner's github repo (which is based on a Kattle dataset).

Some of my friends and colleagues have also done some work on this dataset in Neo4j - there's  this graphgistLinkurio.us' blogpost, as well as Ryan Boyd's older article on DeveloperAdvocate. But I decided I was interested enough to take it for a spin.

Importing the email corpus into Neo4j

I got the dataset from this url, and it looks pretty straightforward. There's a very simple datamodel that we can work with, which would look something like this:


Friday, 2 December 2016

Exploring the Paris Terrorist Attack network - part 3/3

Previously, on this blog, I had started writing about how we could get some of the data published by a local Belgian newspaper, De Standaard, on the Paris Terrorist Attack Network into Neo4j. In
  • Part 1, we talked about loading the raw JSON data into Neo4j, and then in
  • Part 2, we cleaned up some of the data for easy querying in Neo4j. 
So that's where we are. To wrap things up, I just wanted to illustrate some of the results and queries in Neo4j around some of the most interesting figures in this Terrorist network. I started some of my explorations around a widely reported terrorist, and Belgian national, called Salah Abdeslam.


So let's take a look at Salah in Neo4j.

Wednesday, 20 July 2016

Graphing the Tour de France - part 3/3

In the past two blogposts I have been creating and importing some nice Tour de France 2016 data. It's a small dataset, for sure, and this is by no means a realistic graph application - but perhaps we can still have some fun exploiting the data with some cypher queries. That's what we'll try now. I have put all of the example queries together in this gist, so please feel free to play around with it :) ... let's take you through it.

Is the model really there?

First and foremost, let's verify the model that we wanted to put in place, with yet another AAPOC (Awesome APOC). We thought we were going to get this model:

Friday, 19 September 2014

Graphs for HR Analytics

Yesterday, I had the pleasure of doing a talk at the Brussels Data Science meetup. Some really cool people there, with interesting things to say. My talk was about how graph databases like Neo4j can contribute to HR Analytics. Here are the slides of the talk:

I truly had a lot of fun delivering the talk, but probably even more preparing for it.

My basic points that I wanted to get across where these:
  • the HR function could really benefit from a more real world understanding of how information flows in its organization. Information flows through the *real* social network of people in your organization - independent of your "official" hierarchical / matrix-shaped org chart. Therefore it follows logically that it would really benefit the HR function to understand and analyse this information flow, through social network analysis.
  • In recruitment, there is a lot to be said to integrate social network information into your recruitment process. This is logical: the social network will tell us something about the social, friendly ties between people - and that will tell us something about how likely they are to form good, performing teams. Several online recruitment platforms are starting to use this - eg. Glassdoor uses Neo4j to store more than 70% of the Facebook sociogram - to really differentiate themselves. They want to suggest and recommend the jobs that people really want.
  • In competence management, large organizations can gain a lot by accurately understanding the different competencies that people have / want to have. When putting together multi-disciplinary, often times global teams, this can be a huge time-saver for the project offices chartered to do this. 
For all of these 3 points, a graph database like Neo4j can really help. So I put together a sample dataset that should explain this. Broadly speaking, these queries are in three categories:
  1. "Deep queries": these are the types of queries that perform complex pattern matches on the graph. As an example, that would something like: "Find me a friend-of-a-friend of Mike that has the same competencies as Mike, has worked or is working at the same company as Mike, but is currently not working together with Mike." In Neo4j cypher, that would something like this
 match (p1:Person {first_name:"Mike"})-[:HAS_COMPETENCY]->(c:Competency)<-[:HAS_COMPETENCY]-(p2:Person),  
 (p1)-[:WORKED_FOR|:WORKS_FOR]->(co:Company)<-[:WORKED_FOR]-(p2)  
 where not((p1)-[:WORKS_FOR]->(co)<-[:WORKS_FOR]-(p2))  
 with p1,p2,c,co  
 match (p1)-[:FRIEND_OF*2..2]-(p2)  
 return p1.first_name+' '+p1.last_name as Person1, p2.first_name+' '+p2.last_name as Person2, collect(distinct c.name), collect(distinct co.name) as Company;  

  1. "Pathfinding queries": this allows you to explore the paths from a certain person to other people - and see how they are connected to eachother. For example, if I wanted to find paths between two people, I could do
 match p=AllShortestPaths((n:Person {first_name:"Mike"})-[*]-(m:Person {first_name:"Brandi"}))  
 return p;  

and get this:
Which is a truly interesting and meaningful representation in many cases.
  1. Graph Analysis queries: these are queries that look at some really interesting graph metrics that could help us better understand our HR network. There are some really interesting measures out there, like for example degree centrality, betweenness centrality, pagerank, and triadic closures. Below are some of the queries that implement these (note that I have done some of these also for the Dolphin Social Network). Please be aware that these queries are often times "graph global" queries that can consume quite a bit of time and resources. I would not do this on truly large datasets - but in the HR domain the datasets are often quite limited anyway, and we can consider them as valid examples.
 //Degree centrality  
 match (n:Person)-[r:FRIEND_OF]-(m:Person)  
 return n.first_name, n.last_name, count(r) as DegreeScore  
 order by DegreeScore desc  
 limit 10;  
   
 //Betweenness centrality  
 MATCH p=allShortestPaths((source:Person)-[:FRIEND_OF*]-(target:Person))  
 WHERE id(source) < id(target) and length(p) > 1  
 UNWIND nodes(p)[1..-1] as n  
 RETURN n.first_name, n.last_name, count(*) as betweenness  
 ORDER BY betweenness DESC  
   
 //Missing triadic closures  
 MATCH path1=(p1:Person)-[:FRIEND_OF*2..2]-(p2:Person)  
 where not((p1)-[:FRIEND_OF]-(p2))  
 return path1  
 limit 50;  
   
 //Calculate the pagerank  
 UNWIND range(1,10) AS round  
 MATCH (n:Person)  
 WHERE rand() < 0.1 // 10% probability  
 MATCH (n:Person)-[:FRIEND_OF*..10]->(m:Person)  
 SET m.rank = coalesce(m.rank,0) + 1;  

I am sure you could come up with plenty of other examples. Just to make the point clear, I also made a short movie about it:

The queries for this entire demonstration are on Github. Hope you like it, and that everyone understands that Graph Databases can truly add value in an HR Analytics contect.

Feedback, as always, much appreciated.

Rik

Saturday, 7 June 2014

Betweenness Centrality: who's the Dolphin Daddy?

Just wrote a graphgist on Betweenness Centrality in a social network of dolphins in the Doubtful Sound, New Zealand. I won't repeat myself here in this blog - but it is pretty fantastic. Thanks to Michael for his help, again.