Wednesday, 21 August 2013

Finding the shortest path - through the park

As you may know by now, I like beer. A lot - why else would I keep writing and talking about it? But there’s more to life than sweet beverages, and one of the things that I have been doing for as long as I can remember is Orienteering. I have been practicing the sport in Belgium since 1984 - I was 11 years old. My dad used to take me to races all across the continent - we truly had a blast. And we still do: I still orienteer almost every week, and so does my dad. Now I take my 8 and 10-year old kids with me to the races, and their granddad cheers them on every step of the way. It’s a fantastic family sport.


One of the reasons why it is so fantastic is that - Orienteering is a “thinking sport”. You have to concentrate to navigate. You have to run to have the best time (it’s a race), but if you run too fast, you are sure to make navigation mistakes. You have to find the balance between physical and mental fitness - which is hard but completely awesome when you succeed. And: it’s outdoors - in the woods and fields. What’s not to like?



So what does that have to do with neo4j? Well, orienteering is all about “finding the shortest path”: the *fastest* route from start to finish. Fast can be short. Fast can also mean that it is better to take a detour: if it is easier to run the longer route, than to walk the shorter route, you are better off choosing the longer route. In essence, every orienteering race is … a graph problem waiting to be solved in the middle of nature.


Orienteering = a green graph problem

In case you don’t know: orienteering races are a bit like an obstacle race. Every participant gets assigned a course, out there in the green forests and fields, and along that course are sequences of beacons that one needs to get to in order. Such a sequence is … a path on a graph - you have to choose how to navigate from obstacle to obstacle, from node to node.


Essentially, the orienteer has to navigate and choose the fastest route. Finding the fastest route effectively this boils down to a “weighted shortest path” calculation. You calculate the shortest path using
  • distance: shorter = better
  • runnability: higher = better. Runnability can be affected by the type of terrain (running on a road? through a field? through a forest? through a forest with soil covered with plants? over a hill? through a valley? …)
as your parameters. For every “leg” of the race, you estimate the presumable “best route” based on the assumption that distance / runnability will be the best indicator for your likely speed.

Example: a 2 control race in Antwerp, Belgium

This is the map of a training race that I did with my kids in a beautiful Antwerp park.



As you can see - the race assignment is a graph.




If we then look at every leg separately, you can see that for every leg, there are a number of route options.


The red route is the safe choice - running along the roads - but takes quite a detour. The blue route cuts straight across the field - but then requires me to go straight through the forest for a short distance.
The red route is the shortest - but requires me to run straight through the forest. The blue route just races along the forest road.
The red route just goes straight to the finish line. The blue route cuts through the forest and then follows the road. The green route safely hurries along the roads.

So 3 controls, and different routes with different characteristics. As you can see in the schematic representations, every route has different “waypoints” - specific points of interest that I can identify on the map, and recognize in the “field”. These waypoints are extremely important for the navigation exercise that we are doing - they allow us to break the problem up in smaller pieces and evaluate our options.



Intuitively, all of us will have a “feeling” about what would be the best route choice, but now let’s use graph algorithms to do this for us!

Graph database model to navigate

In order to apply a graph algorithm, we first need to create a graph. These are my nodes:
  • Control nodes: the race beacons that I have to pass by
  • The alternative route choices, decomposed in waypoints.




Then, let’s create the relationships between these nodes. We will have “COURSE_TO” relationships between controls, and “NAVIGATE_TO” relationships between waypoints. Effectively, these will become “paths” on my graph, hopping from node to node along the relationships.
  • From the start to control 1: I have 0->0.11->0.12->0.13->0.14->0.15 as one route and 0->0.21->0.22->1 as another route
  • From control 1 to control 2 I have 1->1.11->1.12->1.13->2 as one route option, and 1->0.15->1.21->2 as another option.
  • From control 2 to the finish I have 3 options: 2->2.11->3, 2->2.21->2.22->3 and 2->1.21->2.31->2.32->3.



As you can see, I have immediately added “distance” (in meters) and “runnability” (in %) properties to my relationships.


When I then generate a neo4j database using the spreadsheet method, I get a nice little database - ready to be queried and ready for my algorithms.




Graphs algorithms to win the race!

In order to calculate the best route to win the race, I need to calculate the shortest path across the graph - which is standard functionality of neo4j. But because there’s more to it than running the course in straight lines between controls, I need to incorporate weights (distance, runnability) to get a realistic estimate of what would be the best route choice. To do so I am using a technique so well demonstrated by Ian Robinson on his blog last June.


Let’s go through two versions of this calculation:


  1. find the shortest path by distance only:


To do this, we will be using ‘reduce’ to sum the distance properties.


START  startNode=node:node_auto_index(name="Start"),
      endNode=node:node_auto_index(name="Finish")
MATCH  p=(startNode)-[:NAVIGATE_TO*]->(endNode)
RETURN p AS shortestPath,
      reduce(distance=0, r in relationships(p) : distance+r.distance) AS totalDistance
      ORDER BY totalDistance ASC
      LIMIT 1;



  1. find the best estimate of the fastest path, as a function of distance/runnability. In a real race this would probably be the route that I would choose - as it would give me the best chance of winning the race.


To do this, we will be using ‘reduce’ to sum the distance divided by the runnability: longer distance with superior runnability, is possibly faster than shorter distance with lower runnability:


START  startNode=node:node_auto_index(name="Start"),
      endNode=node:node_auto_index(name="Finish")
MATCH  p=(startNode)-[:NAVIGATE_TO*]->(endNode)
RETURN p AS shortestPath,
      reduce(EstimatedTime=0, r in relationships(p) : EstimatedTime+(r.distance/r.runnability)) AS TotalEstimatedTime
      ORDER BY TotalEstimatedTime ASC
      LIMIT 1;

As you can see, the first and second queries take the same (shortest) path for start to control 1 and from control 2 to finish, but recommends a clearly different path from control 1 to control 2 (following the forest road instead of cutting through the forest).



Many applications for weighted shortest paths!

Obviously Orienteering is not a business application, but in logistics, planning, impact analysis and many other applications, weighted shortest path algorithms will have a great potential. Whether it is to find out how things are related to each other, determining the most efficient way to get something from point A to point B, or finding out who would be affected by a particular type of capacity outage - the approach that I used for my orienteering problem would work just as nicely!

Friday, 9 August 2013

Exploring LinkedIn in Neo4j

Ever since I have been working for Neo, we have been trying to give our audience as many powerful examples of places where graph databases could really shine. And one of the obvious places has always been: social networks. That’s why I’ve written a post about facebook, and why many other graphistas have been looking at facebook and others to explain what could be done.  


But while Facebook is probably the best-known social network, the one I use professionally the most is: LinkedIn. Some call it the creepiest network, but the fact of the matter is that professional network is, and has always been, a very useful way to get and stay in contact with other people from other organisations. And guess what: they do some fantastic stuff with their own, custom-developed graphs. One of these things is InMaps - a fantastic visualisation and colour coded analysis of your professional network. That’s where this blogpost got its inspiration from.













An interactive InMap

The thing is: the InMap above is a “static” picture of your network. You can’t really *do* anything with it. You can’t browse through it. You can’t query it. So there began my quest for a way to get the data out of the InMap, and into Neo4j. Something that I expected to take days or weeks - but from the first google search to publishing this post was literally just 2-3 hours of work. It’s dead easy.


Well I should qualify that. You of course have to have somepleace to start - a place that can help you get the data from LinkedIn, into a format that I could work with to import into neo4j. So after 5 minutes of googling, I came across this Dataiku blog post by Thomas Cabrol that had written a couple of simple python scripts to get me going.

Step 1: access linkedin API

Thomas’ scripts run against the LinkedIn developer API. This API requires you to authenticate, and therefore you actually need to register an application (in this case: the python scripts) in your configuration. Easy to do: just go to LinkedIn's developer site and just register an app.
The important thing here is that this process will generate a number of OAuth keys, tokens and secrets that you will need to insert into your scripts. Easy peasy - but you need to do so before proceeding.


Thomas’ work was based on 3 scripts (one to get the OAuth credentials, one to extract the data from LinkedIn, and one to clean up some duplicates), but for the purpose of this blog post (and because of some changes in LinkedIn’s OAuth policies) you really only need 1 script. I have therefore had to fork Thomas’ script and have put my version over here. Note that you will need to insert your own credentials and name for this script to make sense. ProTip: make sure that you have installed oauth2 and urlparse, the imported modules - otherwise the python script won’t work.


Once you have the script, just open up a Terminal, run the python script (python linkedin-query.py) - and then wait a couple of minutes. If all goes well, it will generate a linked.csv file that holds all the connections (name pairs) that you need - in your 1st degree network. That means:
  • connections from yourself to people in your network
  • connections from people in your network, to other people in your network.

Exactly what InMaps shows. In my particular case, that meant 1516 nodes (I admit, I am a proficient user of LinkedIn ;)), and 5345 connections/relationships. 


Important note: LinkedIn actually limits the amount of API calls that you can make to their servers. You can read more about it on their developer pages. For my network, it meant that I could only run the python script once per day. So beware!

Step 2 is the easy part: importing the .csv into Neo4j

Now all we need to do is get the CSV file into neo4j - and as some of you probably know, there’s more than one way to do that. Whichever way you choose, you would always need to have a simple data-model, and in this case, it could not be simpler:

Because this dataset is still quite small, I chose the spreadsheet way to import the data. All I had to do was import the csv into a spreadsheet, dedupe some of the nodes, and create the cypher statements to inject the data into Neo4j.


Make sure to configure the Neo4j auto-index for the name property for this to work.(Set node_auto_indexing=true and node_keys_indexable=name in conf/neo4j.properties).

The resulting statements look like this:

create ({name:'Rik Van Bruggen'});

To create the nodes. And:

start 
       n=node:node_auto_index(name='Tareq Abedrabbo'), 
       m=node:node_auto_index(name='Yuri Bukhan') 
create unique n-[:CONNECTED_TO]->m;

to create the relationships. Again: easy.


Start the Neo4j server. Take the spreadsheet, copy/paste the cypher queries into the console that connected to the server (bin/neo4j-shell) and you are all set. If you want to skip all that you could also just download the imported neo4j graph database and take a look at my professional network - but to be honest it’s a lot more fun if you do it with your own network.


You can then just browse to the webadmin console, and there we have our interactive InMap equivalent. No color codes (yet) - but plenty of interactivity to go around.




Step 3: Querying the Interactive InMap

The nice thing of having this dataset into Neo4j now, is of course that we can interact with it. I personally found it very nice and cool - as is almost always the case with graphs - to “take a walk on the data”. Just grab a node in the webadmin, select some of its connections, and just interactively browse from node to node - and find out stuff about your LinkedIn network that you may not have known before.


And then of course, you can also “programmatically” interact with the data - and Cypher is the prime tool for that. Here’s a couple of queries that I made up - but I am sure that you can make up some more.

If you're into it, you can also put your own interactive visualization in front of the Neo4j graph database, something like Max de Marzi's Neovigator would be interesting. 

If you have any questions regarding use-cases for Neo4j or how to use Neo4j in your project, don't hesitate to contact me.