Showing posts with label pagerank. Show all posts

Monday, 29 June 2020

Executives of Belgian Public Companies - revisited!

Long time ago, when dinosaurs roamed the earth and Neo4j was just a tiny cute little junior graph database ;-), I wrote a 2 part blogpost about a newspaper article that I had come across in De Tijd about the network of executives of Belgian public companies. You can find the articles over here: Part 1 and Part 2. Turns out - and I really was not aware of this until recently - that the newspaper has been running this type of publication on a yearly basis. Here's another article from 2018 on De Tijd’s website.

So imagine my surprise a few weeks ago, when I was contacted by one of the authors of that article, Thomas Roelens, to verify some info for the 2020 edition of this analysis. We had a great chat, and Thomas basically asked me to double check some of the analysis that he had done himself already. So, contrary to what happened in 2017 (where I had to dig into the HTML source to download the info from the website - Thomas just sent it to me, and basically allowed me to take it for a spin :) ...

Meanwhile, Thomas' article has been published in the newspaper: you can find it over here or over here. But here's my update below too.

Part 3/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 3 of this blogpost. Part 1 and Part 2 are also published.)

Alright this is going to be the third and final part of my work on the Hillary Clinton Email Corpus. There's two posts that came before this article:

in the first post we focused on importing and refactoring the data in Neo4j
in the second post we spent some time analysing the dataset with some algorithms and some specific pattern matching queries

Now we are going to spent some time with the "heart of the matter", the actual content of the emails. We are going to do that in two steps: first we will do some "full text" querying of some data, using Neo4j's specific full text indexing capabilities. Then we are going to go a step further and try to extract more knowledge from this dataset in an automated way, by running some Natural Language Processing (NLP) algorithms and processes on it.

Let's get right to it.

Fulltext querying of Emails

Those of you that have been following Neo4j for some time, may remember that we have always bundled Apache Lucene with Neo4j. For the longest time, Neo4j used Lucene for it's indexing capabilities. This turned out to be a great choice for many things, but also one that had its limitations and trade-offs. This is why Neo4j has gradually been switching away from Lucene for its core schema indexing capability, and has adopted a modular, pluggable indexing architecture that allows for different indexing techniques to be used for different data types. This is great news for many reasons, but one of the most important benefits has been a dramatic increase in write performance - as the newer indexes are much more optimized and leaner than the older Lucene based structures. Read more about indexing in the Neo4j manual.

So as I started to think about some text-oriented queries, I quickly realised that I would need an index on Email text. So I wanted to do

create index on :Email(text)

and query that index afterwards. But the result was pretty obvious:

Part 2/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 2 of this blogpost. Part 1 and Part 3 are also published.)

In the previous post around the emails of Hillary Clinton, we were able to import the data from a CSV file, and use some really cool graph refactoring tools to make the database a little more easy to work with - bad data is bad data, and the less we have of that the better.

So we ended up in a reasonably stable state, where we could do some querying. In this post, we will do exactly that.

Exploring the graph with graph algos

It's fairly easy to get a good initial view of the structure and size of the graph. I just run a few queries like this:

//what nodes are in the db

match (n) return labels(n), count(n)

and:

//what rels are in the db

MATCH p=()-[r]->() RETURN type(r), count(r)

and we very quickly see that, while this is clearly not a "big" dataset, it's still big enough to start loosing some significant time sifting through data if you want to make some sense of it. This is where our fantastic graph algorithms come in. I installed the plugin into my database, restarted it, and then I also played around a bit with Neuler, a graph algo playground that basically allows you to quickly experiment with different algorithms. You can download Neuler from https://install.graphapp.io/ and install it into your Neo4j Desktop really quickly.

Part 1/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

(Note: this is Part 1 of this blogpost. Part 2 and Part 3 are also published.)

With lots of interesting political manoeuvring going on in the USA and in Europe, I somehow got into a rabbit hole where I came across the corpus of emails that were published in the aftermath of the 2016 US presidential elections. They have been analysed a number of times, both by citizens and the press: see the great site published by the Wall Street Journal and Ben Hamner's github repo (which is based on a Kattle dataset).

Some of my friends and colleagues have also done some work on this dataset in Neo4j - there's this graphgist, Linkurio.us' blogpost, as well as Ryan Boyd's older article on DeveloperAdvocate. But I decided I was interested enough to take it for a spin.

Importing the email corpus into Neo4j

I got the dataset from this url, and it looks pretty straightforward. There's a very simple datamodel that we can work with, which would look something like this:

Exploring the Paris Terrorist Attack network - part 3/3

Previously, on this blog, I had started writing about how we could get some of the data published by a local Belgian newspaper, De Standaard, on the Paris Terrorist Attack Network into Neo4j. In

Part 1, we talked about loading the raw JSON data into Neo4j, and then in
Part 2, we cleaned up some of the data for easy querying in Neo4j.

So that's where we are. To wrap things up, I just wanted to illustrate some of the results and queries in Neo4j around some of the most interesting figures in this Terrorist network. I started some of my explorations around a widely reported terrorist, and Belgian national, called Salah Abdeslam.

So let's take a look at Salah in Neo4j.

Graphing the Tour de France - part 3/3

In the past two blogposts I have been creating and importing some nice Tour de France 2016 data. It's a small dataset, for sure, and this is by no means a realistic graph application - but perhaps we can still have some fun exploiting the data with some cypher queries. That's what we'll try now. I have put all of the example queries together in this gist, so please feel free to play around with it :) ... let's take you through it.

Is the model really there?

First and foremost, let's verify the model that we wanted to put in place, with yet another AAPOC (Awesome APOC). We thought we were going to get this model:

The GraphBlogGraph: 3rd blogpost out of 3

Querying the GraphBlogGraph

After having created the GraphBlogGraph in a Google Spreadsheet in part 1, and having imported it into Neo4j in part 2, we can now start having some fun and analysing and querying that dataset. There are obviously a lot of things we could do here, but in this final blog post I am just going to explore some initial things that I am sure you could then elaborate and extend upon.

Let’s start with a simple query

// Which pages have the most links
match (b:Blog)--(p:Page)-[r:LINKS_TO]->(p2:Page)

return b.name, p.title, count(r)

order by count(r) desc

Run this in the Neo4j browser and we get:

or just return the graphical result with a slightly different query:

match (b:Blog)--(p:Page)-[r:LINKS_TO]->(p2:Page)

with b,p,r,p2, count(r) as count

order by count DESC

limit 50

return b,p,r,p2

And then you start to see that Max De Marzi is actually the “king of linking”: he links his pages to other web pages a lot (which is actually very good for search-engine-optimization) .

A quick visit to one of Max’ pages does actually confirm that: there’s a lot of cool, bizarre, but always interesting links on Max’ blogposts:

So let’s do another query. Let’s look at the different links that exist between blogposts of our blog-authors. Are they actually quoting/referring to one another or not? Let’s do

//links between blogposts

MATCH p=((n1:Blog)--(p1:Page)-[:LINKS_TO]-(p2:Page)--(b2:Blog))

RETURN p;

and then we actually find that there are some links - but not that many.

Same thing if we look at this a different way: let’s do some pathfinding and check out the paths between different blogs, for example my blog and Michael’s

match (b1:Blog {name:"Bruggen"}),(b3:Blog {name:"JEXP Blog"}),

p2 = allshortestpaths((b1)-[*]-(b3))

return p2 as paths

Then we actually see a bit more interesting connections: we don’t refer to one another directly very often, but we both refer to the same pages - and those pages become the links between our blogs. At depth 4 we see these kinds of patterns:

Interesting, right? I think so, at least!

Then let’s do some more playing around, looking at the most linked to pages:

//Which pages are being linked to most

match ()-[r:LINKS_TO]->(p:Page)

return p.url, count(r)

order by count(r) DESC

limit 10;

That quickly uncovers the true “spider in the web”, my friend, colleague and graphista-extraordinaire: Michael Hunger:

Last but not least, I wanted to revisit an old and interesting way of running PageRank on Neo4j using Cypher (not using the Graphaware NodeRank module, therefore). I blogged about some time ago, and it’s actually really interesting and easy to do. Here’s the query:

UNWIND range(1,50) AS round
MATCH (n:Page)
WHERE rand() < 0.1
MATCH (n:Page)-[:LINKS_TO*..10]->(m:Page)
SET m.rank = coalesce(m.rank,0) + 1

This does 50 iterations of PageRank, using a 0,1 damping factor and a maximum depth of 10. Running it is surprisingly quick:

If you do that a couple of times, and even do a few hundred iterations at once, you will quickly see the results emerge with the following simple query:

match (n:Page)
where n.rank is not null
return n.url, n.rank
order by n.rank desc
limit 10;

Confirming the “spider in the web” theory that I mentioned above. Michael rules the links!

All of these queries are of course on Github for you to play around with. Would love to hear your thoughts on these three blogposts, and hope that they were as fun for you to read as they were for me to write.

All the best.

Rik

Friday, 19 September 2014

Graphs for HR Analytics

Yesterday, I had the pleasure of doing a talk at the Brussels Data Science meetup. Some really cool people there, with interesting things to say. My talk was about how graph databases like Neo4j can contribute to HR Analytics. Here are the slides of the talk:

I truly had a lot of fun delivering the talk, but probably even more preparing for it.

My basic points that I wanted to get across where these:

the HR function could really benefit from a more real world understanding of how information flows in its organization. Information flows through the *real* social network of people in your organization - independent of your "official" hierarchical / matrix-shaped org chart. Therefore it follows logically that it would really benefit the HR function to understand and analyse this information flow, through social network analysis.
In recruitment, there is a lot to be said to integrate social network information into your recruitment process. This is logical: the social network will tell us something about the social, friendly ties between people - and that will tell us something about how likely they are to form good, performing teams. Several online recruitment platforms are starting to use this - eg. Glassdoor uses Neo4j to store more than 70% of the Facebook sociogram - to really differentiate themselves. They want to suggest and recommend the jobs that people really want.
In competence management, large organizations can gain a lot by accurately understanding the different competencies that people have / want to have. When putting together multi-disciplinary, often times global teams, this can be a huge time-saver for the project offices chartered to do this.

For all of these 3 points, a graph database like Neo4j can really help. So I put together a sample dataset that should explain this. Broadly speaking, these queries are in three categories:

"Deep queries": these are the types of queries that perform complex pattern matches on the graph. As an example, that would something like: "Find me a friend-of-a-friend of Mike that has the same competencies as Mike, has worked or is working at the same company as Mike, but is currently not working together with Mike." In Neo4j cypher, that would something like this

 match (p1:Person {first_name:"Mike"})-[:HAS_COMPETENCY]->(c:Competency)<-[:HAS_COMPETENCY]-(p2:Person),  
 (p1)-[:WORKED_FOR|:WORKS_FOR]->(co:Company)<-[:WORKED_FOR]-(p2)  
 where not((p1)-[:WORKS_FOR]->(co)<-[:WORKS_FOR]-(p2))  
 with p1,p2,c,co  
 match (p1)-[:FRIEND_OF*2..2]-(p2)  
 return p1.first_name+' '+p1.last_name as Person1, p2.first_name+' '+p2.last_name as Person2, collect(distinct c.name), collect(distinct co.name) as Company;

"Pathfinding queries": this allows you to explore the paths from a certain person to other people - and see how they are connected to eachother. For example, if I wanted to find paths between two people, I could do

 match p=AllShortestPaths((n:Person {first_name:"Mike"})-[*]-(m:Person {first_name:"Brandi"}))  
 return p;

and get this:

Which is a truly interesting and meaningful representation in many cases.

Graph Analysis queries: these are queries that look at some really interesting graph metrics that could help us better understand our HR network. There are some really interesting measures out there, like for example degree centrality, betweenness centrality, pagerank, and triadic closures. Below are some of the queries that implement these (note that I have done some of these also for the Dolphin Social Network). Please be aware that these queries are often times "graph global" queries that can consume quite a bit of time and resources. I would not do this on truly large datasets - but in the HR domain the datasets are often quite limited anyway, and we can consider them as valid examples.

 //Degree centrality  
 match (n:Person)-[r:FRIEND_OF]-(m:Person)  
 return n.first_name, n.last_name, count(r) as DegreeScore  
 order by DegreeScore desc  
 limit 10;  
   
 //Betweenness centrality  
 MATCH p=allShortestPaths((source:Person)-[:FRIEND_OF*]-(target:Person))  
 WHERE id(source) < id(target) and length(p) > 1  
 UNWIND nodes(p)[1..-1] as n  
 RETURN n.first_name, n.last_name, count(*) as betweenness  
 ORDER BY betweenness DESC  
   
 //Missing triadic closures  
 MATCH path1=(p1:Person)-[:FRIEND_OF*2..2]-(p2:Person)  
 where not((p1)-[:FRIEND_OF]-(p2))  
 return path1  
 limit 50;  
   
 //Calculate the pagerank  
 UNWIND range(1,10) AS round  
 MATCH (n:Person)  
 WHERE rand() < 0.1 // 10% probability  
 MATCH (n:Person)-[:FRIEND_OF*..10]->(m:Person)  
 SET m.rank = coalesce(m.rank,0) + 1;

I am sure you could come up with plenty of other examples. Just to make the point clear, I also made a short movie about it:

The queries for this entire demonstration are on Github. Hope you like it, and that everyone understands that Graph Databases can truly add value in an HR Analytics contect.

Feedback, as always, much appreciated.

Rik

Bruggen Blog

Pages

Monday, 29 June 2020

Executives of Belgian Public Companies - revisited!

Monday, 16 December 2019

Part 3/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

Fulltext querying of Emails

Part 2/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

Exploring the graph with graph algos

Part 1/3: Revisiting Hillary Clinton's email corpus with graph algos and NLP

Importing the email corpus into Neo4j

Friday, 2 December 2016

Exploring the Paris Terrorist Attack network - part 3/3

Wednesday, 20 July 2016

Graphing the Tour de France - part 3/3

Is the model really there?

Wednesday, 13 January 2016

The GraphBlogGraph: 3rd blogpost out of 3

Querying the GraphBlogGraph

Friday, 19 September 2014

Graphs for HR Analytics

Labels

Blogarchive

Metricool