Monday 29 August 2016

Orienteering with Neo4j - solving the 1000-control race with the Dijkstra-APOC - part 2/2

In part 1 of this blogpost, I explained how you can use the Awesome Neo4j APOCs to calculate a weighted shortest path on a graph with a more optimized and more efficient algorithm, based on Dijkstra's work. In this second and last blogpost on this topic, I would love to explain a bit why I think this is pretty much a very big deal. APOCs give you access, from Cypher, to a whole slew of graph algorithms, many of them very useful for all kinds of different graph operations.

Orienteering - a bit more complicated in the real world

One reason why I wanted to write this second post, is of course because my lovely sport - Orienteering - is of course a bit more complicated in the real world than what you have seen in that little park run that I talked about in the previous two posts. To give you a feel for it:

  • Here's an excerpt of my run in the World Masters Orienteering Champs a few weeks back in Estonia. More details over here - but I can tell you that for each and every one of these legs there's at least half a dozen different route options - and of course my course had 20+ control points too. So a bit more of a bigger graph anyway!
  • And here's another example: actually being run today (August 25th) is the actual elite's World Orienteering Champs on the long distance. Just. Look. At. This. Map.

It's just amazing, and if you are an orienteer like me, pretty darn SCARY. Crazy race, physically and mentally.

Point being: even in orienteering, the "shortest weighted pathfinding" is WAY more complicated than the previous example, and in real Enterprise applications, it will be even more complicated. That's why I wanted to try the old way (using the reduce-based Cypher query) and the new way (using the Dijkstra-based APOC) on a larger graph.

So here's what I did:

  1. First: I applied the same methods that I used on the previous blogpost, to a "more realistic" orienteering graph that would feature a 10-control race. For each "leg" of the race, I would have 3 different route choices, and for each route, I would have 5 different waypoints. 
  2. Second: I created a (hypothetical, of course) 1000-control orienteering graph with the same characteristics: 3 routes for every leg, and 5 waypoints for every route.
And then I could look at the difference between the two query strategies. Let's start with the first.

A 10 control orienteering race: very different performance

Here's the query to create that 10-control orienteering race, also available on github, of course:
create index on :Control(seq); 
unwind range(0,10) as range
merge (c2:Control {seq: range});
match (n1:Control {seq:0}),(n2:Control {seq:10})
set = "Start"
set = "Finish"; 
match (c1:Control), (c2:Control)
where c2.seq=c1.seq+1
merge (c1)-[:COURSE_TO]->(c2); 
match (c1:Control)-->(c2:Control)
where c2.seq=c1.seq+1
with distinct c1,c2
create (c1)-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w1:Route1Waypoint {name: "Waypoint 1.1"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w2:Route1Waypoint {name: "Waypoint 1.2"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w3:Route1Waypoint {name: "Waypoint 1.3"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w4:Route1Waypoint {name: "Waypoint 1.4"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w5:Route1Waypoint {name: "Waypoint 1.5"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(c2); 
match (c1:Control)-->(c2:Control)
where c2.seq=c1.seq+1
with distinct c1,c2
create (c1)-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w1:Route2Waypoint {name: "Waypoint 2.1"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w2:Route2Waypoint {name: "Waypoint 2.2"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w3:Route2Waypoint {name: "Waypoint 2.3"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w4:Route2Waypoint {name: "Waypoint 2.4"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w5:Route2Waypoint {name: "Waypoint 2.5"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(c2); 
match (c1:Control)-->(c2:Control)
where c2.seq=c1.seq+1
with distinct c1,c2
create (c1)-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w1:Route3Waypoint {name: "Waypoint 3.1"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w2:Route3Waypoint {name: "Waypoint 3.2"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w3:Route3Waypoint {name: "Waypoint 3.3"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w4:Route3Waypoint {name: "Waypoint 3.4"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(w5:Route3Waypoint {name: "Waypoint 3.5"})-[:NAVIGATE_TO {distance: toInt(rand()*1000), runnability: ceil(rand()*10)/10}]->(c2);
Here's a sample of what the result looks like:

Then, I can very quickly "call a friend" and get the solution to my 10-control Orienteering race by doing one of two things:

1. Run the Cypher Shortest path query:
MATCH  (startNode:Control {name:"Start"}),
      (endNode:Control {name:"Finish"}),
RETURN p AS shortestPath,
      reduce(EstimatedTime=0, r in relationships(p) | EstimatedTime+(r.distance/r.runnability)) AS TotalEstimatedTime
      ORDER BY TotalEstimatedTime ASC
      LIMIT 1;
The result just about fits into my Neo4j Brower:

But what you probably did not see (unless you tried it) was that the:
Man! That's SLOW! It's becoming very clear now, that the combinatory explosion of paths that this query is using (intentionally), is just not a feasible option on larger graphs. If the result is already that bad on this small of a graph, then we'd better quickly turn to

2. Run the Dijkstra-APOC-based query:
MATCH  (startNode:Control {name:"Start"}), (endNode:Control {name:"Finish"})
call apoc.algo.dijkstra(startNode, endNode, 'NAVIGATE_TO', 'calculated') YIELD path, weight
return path;
Which gives me exactly the same result: just look at the colours of the waypoints to see that the route choices are identical for the different legs of the course:
And: BONUS! Thanks to the the AWESOME APOC's power, the 
Now, I guess it would become pretty clear that doing this on an even larger graph would be ... interesting, right? Let's try.

The 1000-control Orienteering Race

Edsger W. Dijkstra
Using exactly the same mechanism's as outlined above, I have created an additional set of queries that would create a much larger Orienteering graph: an imaginary race of 1000 controls, with (again) 3 route choices for every leg, and 5 waypoints for every route choice. The entire script is also on github, of course. It's still a fairly small graph if you compare it to enterprise graphs, of course, but if you start trying the two approaches for finding the weighted shortest path, you will very quickly realise that the "Cypher-reduce" approach no longer works - at least not on my little laptop. There's just too many combinations that can be explored, and it ends up taking too much time... I never waited for it to finish.

However, if you give these 1000 controls to our dear friend Mr. Dijkstra, and within a very short wait you get:

It of course does not correctly visualise the entire solution to the problem in the Browser (what would be the point!), but the important thing is that

Which is super great. Again - this is why APOCs are Awesome - you can run these complex algorithms, straight on the database, straight from Cypher, without any additional complications. 

Try THAT in your relational, document, key-value, or column-family store!

Hoping this exercise was useful for you, and that you had as much fun with this as I did.



No comments:

Post a Comment