Thursday 26 March 2015

Data Innovation Survey for Belgium - in Neo4j

Today, I had loads of fun at the Data Innovation Summit in Brussels, Belgium. Hosted in the beautiful Axa Belgium offices, it was a great opportunity to meet 500 (!!) data-minded professionals. I was also able to do an Ignite Talk there, which was quite an experience: 15 seconds for every slide, and no way to advance the slides yourself and set the "rhythm" - very different. Here are the slides:

But that was not the coolest thing. They also did a "Data Innovation Survey", which was super cool. The data is all open (find it in this gist), and I of course took it from Excel, created a graph model out of it, and then loaded it into Neo4j using this load script.

You will need to tweak the file locations in the LOAD CSV statements, but after that: just download Neo4j 2.2, fire up the neo4j-shell, and paste all the commands into it. Loading the data should take about half a minute.

Then we have the data in Neo4j, and we can start doing some queries. Now, I must admit that I am not a huge fan of working the data this way, as there are very few intricate relationships that we can use meaningfully. Nevertheless, here are a few queries:

 //respondents and techniques with PhDs  
 MATCH (dl:DegreeLevel {name:"PhD"})--(r:Respondent)--(t:Technique)  
 return dl,r,t  

That's easy. Let's make it a bit more sophisticated:

 //respondents and techniques at level 5 with PhDs and their DegreeFields  
 MATCH (dl:DegreeLevel {name:"PhD"})--(r:Respondent)-[ht:HAS_TECHNIQUE {level:'5'}]--(t:Technique),  
 (r)--(df:DegreeField)  
 return dl,r,t,df  
 limit 10  

You can see how that would make the visualisation a bit more complicated. 
And then finally, here is a first attempt at doing something a bit more "graphy". Let's see which "DegreeFields" are the most important in our graph. In other words: the ones that sit most "between" the other nodes of the graph, because they lie on the most shortest paths between respondents. We do that with a query like this:

 //betweenness centrality of the "DegreeFields"  
 MATCH p=allShortestPaths((r1:Respondent)-[*]-(r2:Respondent))  
  WHERE id(r1) < id(r2) and length(p) > 1  
  UNWIND nodes(p)[1..-1] as n  
  WITH n, count(*) as betweenness, labels(n) as labels  
  WHERE "DegreeField" in labels  
  RETURN n.name, betweenness  
  order by betweenness desc;  

And then we see this result:


There's a lot of importance to Science/Mathematics, ICT and Engineering. Who would have thought?
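
If you want to see what that query is actually counting, here is a minimal Python sketch of the same logic on a tiny toy graph. All node names and edges below are made up for illustration and are not taken from the survey data. Note that, like the Cypher above, it simply counts how often a DegreeField shows up in the interior of any shortest path between two respondents. Textbook betweenness centrality would additionally divide by the number of shortest paths per pair, so treat this as a rough approximation rather than the exact measure.

```python
from collections import Counter, deque
from itertools import combinations

# Hypothetical toy graph: names and edges are invented for illustration only.
LABELS = {
    "r1": "Respondent", "r2": "Respondent", "r3": "Respondent",
    "r4": "Respondent", "r5": "Respondent",
    "ICT": "DegreeField", "Engineering": "DegreeField",
    "Clustering": "Technique", "Regression": "Technique",
}
EDGES = [
    ("r1", "ICT"), ("r2", "ICT"), ("r5", "ICT"),
    ("r3", "Engineering"), ("r4", "Engineering"),
    ("r2", "Clustering"), ("r3", "Clustering"),
    ("r2", "Regression"), ("r3", "Regression"),
]


def build_adjacency(edges):
    """Undirected adjacency lists, mirroring the direction-less -[*]- pattern."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def all_shortest_paths(adj, source, target):
    """Return every shortest path from source to target as a list of nodes."""
    dist = {source: 0}
    preds = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in dist:                # first visit: record the distance
                dist[nxt] = dist[node] + 1
                preds[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:  # another equally short route
                preds[nxt].append(node)
    if target not in dist:
        return []

    def walk_back(node):
        # Rebuild the paths by following predecessors back to the source.
        if node == source:
            return [[source]]
        return [path + [node] for p in preds[node] for path in walk_back(p)]

    return walk_back(target)


def degree_field_counts(labels, edges):
    """Count DegreeField nodes inside shortest paths between respondent pairs."""
    adj = build_adjacency(edges)
    respondents = sorted(n for n, label in labels.items() if label == "Respondent")
    counts = Counter()
    for r1, r2 in combinations(respondents, 2):  # each pair once, like id(r1) < id(r2)
        for path in all_shortest_paths(adj, r1, r2):
            for node in path[1:-1]:              # interior nodes, like nodes(p)[1..-1]
                if labels[node] == "DegreeField":
                    counts[node] += 1
    return counts.most_common()                  # highest count first


print(degree_field_counts(LABELS, EDGES))  # [('ICT', 11), ('Engineering', 7)]
```

In this toy graph r2 and r3 are connected through two techniques, so every shortest path that crosses between them is counted twice. That is exactly the kind of inflation you get from a raw count over allShortestPaths, and worth keeping in mind when reading the ranking above.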

You can of course apply these techniques much more generically to other problems, and that is mostly why I share it here. I hope others find it interesting, and as always...

... Feedback welcome!

Cheers

Rik
