Thursday, 12 November 2015

Querying GTFS data - using Neo4j 2.3 - part 2/2

So let's see what we can do with this data that we loaded into Neo4j in the previous blogpost. In these query examples, I will try to illustrate some of the key differences between older versions of Neo4j, and the newer, shinier Neo4j 2.3. There's a bunch of new features in that wonderful release, and some of these we will illustrate using the (Belgian) GTFS dataset. Basically there's two interesting ones that I really like:

  • using simple "full-text" indexing with the "starts with" where clause
  • using ranges in your where clauses
Both of these were formerly very cumbersome, and very easy and powerful in Neo4j 2.3. So let's explore.

Finding a stop using "STARTS WITH"

In this first set of queries we will be using some of the newer functionalities in Neo4j 2.3, which allow you to use the underlying full-text search capabilities of Lucene to quickly and efficiently find starting points for your traversals. The first examples start with the "START WITH" string matching function - let's consider this query:

 match (s:Stop)  
 where starts with "Turn"  
 return s  

In the new shiny 2.3 version of Neo4j, we generate the following query plan:
This, as you can see, is a super efficient query with only 5 "db hits" - so a wonderful example of using the Neo4j indexing system (see the NodeIndexSeekByRange step at the top). Great - this is a super useful new feature of Neo4j 2.3, which really helps to speed up (simple) fulltext queries. Now, let me tell you about a very easy and un-intuitive way to mess this up. Consider the following variation to the query:

 match (s:Stop)  
 where upper( starts with "TURN"  
 return s  

All I am doing here is using the "UPPER" function to enable case-insensitive querying - but as you can probably predict, the query plan then all of a sudden looks like this:
and it generates 2785 db hits. So that is terribly inefficient: the first step (NodeByLabelScan) basically sucks in all of the nodes that have a particular Label ("Stop") and then does all of the filtering on that set. On a smallish dataset like this one it may not really matter, but on a larger one (or on a deeper traversal) this would absolutely matter. The only way to avoid this in the current product is to have a second property that would hold the lower() or upper() of the original property, and then index/query on that property. It's a reasonable workaround for most cases.

So cool - learned something.

Range queries in 2.3

I would like to get to know a little more about Neo4j 2.3's range query capabilities. I will do that by , but limiting the desired departure and arrival times. (ie. Stoptimes) by their departure_time and/or arrival_time.  Let's try that with the following simple query to start with:

 match (st:Stoptime)  
 where st.departure_time < "07:45:00"  
 return st.departure_time;  

If I run that query without an index on :Stoptime(departure_time) I get a query plan like this:
As you can see the plan starts with a "NodeByLabelScan". Very inefficient.

If however we put the index in place, and run the same query again, we get the following plan:
Which stars with a "NodeIndexSeekByRange". Very efficient. So that's good.

Now let's see how we can apply that in a realistic route finding query.

Route finding on the GTFS dataset

The obvious application for a GTFS dataset it to use it for some real-world route planning. Let's start with the following simple query, which looks up two "Stops", Antwerp (where I live) and Turnhout (where I am from):

 match (ant:Stop), (tu:Stop)  
 where starts with "Antw"  
 AND starts with "Turn"  
 return distinct tu,ant;  

This gives me all the stops for "Antwerpen" and my hometown "Turnhout". Now I can narrow this down a bit and only look at the "top-level" stops (remember that stops can have parent stops), and calculate some shortestpaths between them. Let's use this query:

 match (t:Stop)<-[:PART_OF]-(:Stop),  
 where starts with "Turn"  
 with t,a  
 match p = allshortestpaths((t)-[*]-(a))  
 return p  
 limit 10;  

This gives me the following result (note that I have "limited") the number of paths, as there are quite a number of trains running between the two cities):

The important thing to note here is that there is a DIRECT ROUTE between Antwerp and Turnhout and that this really makes the route-finding a lot easier.

Querying for direct routes

A real-world route planning query would look something like this:

 match (tu:Stop {name: "Turnhout"})--(tu_st:Stoptime)  
 where tu_st.departure_time > "07:00:00"  
 AND tu_st.departure_time < "09:00:00"  
 with tu, tu_st  
 match (ant:Stop {name:"Antwerpen-Centraal"})--(ant_st:Stoptime)  
 where ant_st.arrival_time < "09:00:00"  
 AND ant_st.arrival_time > "07:00:00"  
 and ant_st.arrival_time > tu_st.departure_time  
 with ant,ant_st,tu, tu_st  
 match p = allshortestpaths((tu_st)-[*]->(ant_st))  
 with nodes(p) as n  
 unwind n as nodes  
 match (nodes)-[r]-()  
 return nodes,r  

which would give me a result like this:

The interesting thing here is that you can immediately see from this graph visualization that there is a "fast train" (the pink "Trip" at the bottom) and a "slow train" (the pink "Trip" at the top) between origin and destination. The slow train actually makes three additional stops.

Querying for indirect routes

Now let's look at a route-planning query for an indirect route between Turnhout and Arlon (the Southern most city in Belgium, close to the border with Luxemburg). Running this query will show me that I can only get from origin to destination by transferring from one train to another midway:

 match (t:Stop),(a:Stop)  
 where = "Turnhout"  
 with t,a  
 match p = allshortestpaths((t)-[*]-(a))  
 where NONE (x in relationships(p) where type(x)="OPERATES")  
 return p  
 limit 10  

This is what I get back then:

You can clearly see that I can get from Turnhout to Brussels, but then need to transfer to one of the Brussels-to-Arlon trains on the right. So... which one would that be? Let's run the following query:

 MATCH (tu:Stop {name:"Turnhout"})--(st_tu:Stoptime),  
 (ar:Stop {name:"Arlon"})--(st_ar:Stoptime),  
 st_tu.departure_time > "10:00:00"  
 AND st_tu.departure_time < "11:00:00"  
 AND st_midway_arr.arrival_time > st_tu.departure_time  
 AND st_midway_dep.departure_time > st_midway_arr.arrival_time  
 AND st_ar.arrival_time > st_midway_dep.departure_time  
 order by (st_ar.arrival_time_int-st_tu.departure_time_int) ASC  
 limit 1  

You can tell that this is a bit of a more complicated. It definitely comes back with a correct result:

At the top is the Trip from Turnhout to Brussels, and at the bottom is the Trip from Brussels to Arlon. You can also see that there's a bit of a wait there, so it may actually make more sense to take a later train from Turnhout to Brussels.

The problem with this approach is of course that it would not work for a journey that involved more than one stopover. If I would, for example, want to travel from "Leopoldsburg" to "Arlon", I would need two stopovers (in Hasselt, and then in Brussels):
and therefore the query above would become even more complicated.

My conclusion here is that

  1. it's actually pretty simple to represent GTFS data in Neo4j - and very nice to navigate through the data this way. Of course.
  2. direct routes are very easily queries with Cypher.
  3. indirect routes would require a bit more tweaking to the model and/or the use of a different API in Neo4j. That's currently beyond my scope of these blogposts, but I am very confident that it could be done.
I really hope you enjoyed these two blogposts, and that you will also apply it to your own local GTFS dataset - there's so many of them available. All of the queries above are on github as well of course - I hope you can use them as a baseline.



No comments:

Post a Comment