Bruggen Blog: Querying GTFS data - using Neo4j 2.3

Thursday, 12 November 2015

Querying GTFS data - using Neo4j 2.3 - part 2/2

So let's see what we can do with this data that we loaded into Neo4j in the previous blogpost. In these query examples, I will try to illustrate some of the key differences between older versions of Neo4j, and the newer, shinier Neo4j 2.3. There's a bunch of new features in that wonderful release, and some of these we will illustrate using the (Belgian) GTFS dataset. Basically there's two interesting ones that I really like:

using simple "full-text" indexing with the "starts with" where clause
using ranges in your where clauses

Both of these were formerly very cumbersome, and very easy and powerful in Neo4j 2.3. So let's explore.

Finding a stop using "STARTS WITH"

In this first set of queries we will be using some of the newer functionalities in Neo4j 2.3, which allow you to use the underlying full-text search capabilities of Lucene to quickly and efficiently find starting points for your traversals. The first examples start with the "START WITH" string matching function - let's consider this query:

 match (s:Stop)  
 where s.name starts with "Turn"  
 return s

In the new shiny 2.3 version of Neo4j, we generate the following query plan:

This, as you can see, is a super efficient query with only 5 "db hits" - so a wonderful example of using the Neo4j indexing system (see the NodeIndexSeekByRange step at the top). Great - this is a super useful new feature of Neo4j 2.3, which really helps to speed up (simple) fulltext queries. Now, let me tell you about a very easy and un-intuitive way to mess this up. Consider the following variation to the query:

 match (s:Stop)  
 where upper(s.name) starts with "TURN"  
 return s

All I am doing here is using the "UPPER" function to enable case-insensitive querying - but as you can probably predict, the query plan then all of a sudden looks like this:

and it generates 2785 db hits. So that is terribly inefficient: the first step (NodeByLabelScan) basically sucks in all of the nodes that have a particular Label ("Stop") and then does all of the filtering on that set. On a smallish dataset like this one it may not really matter, but on a larger one (or on a deeper traversal) this would absolutely matter. The only way to avoid this in the current product is to have a second property that would hold the lower() or upper() of the original property, and then index/query on that property. It's a reasonable workaround for most cases.

So cool - learned something.

Range queries in 2.3

I would like to get to know a little more about Neo4j 2.3's range query capabilities. I will do that by , but limiting the desired departure and arrival times. (ie. Stoptimes) by their departure_time and/or arrival_time. Let's try that with the following simple query to start with:

 match (st:Stoptime)  
 where st.departure_time < "07:45:00"  
 return st.departure_time;

If I run that query without an index on :Stoptime(departure_time) I get a query plan like this:

As you can see the plan starts with a "NodeByLabelScan". Very inefficient.

If however we put the index in place, and run the same query again, we get the following plan:

Which stars with a "NodeIndexSeekByRange". Very efficient. So that's good.

Now let's see how we can apply that in a realistic route finding query.

Route finding on the GTFS dataset

The obvious application for a GTFS dataset it to use it for some real-world route planning. Let's start with the following simple query, which looks up two "Stops", Antwerp (where I live) and Turnhout (where I am from):

 match (ant:Stop), (tu:Stop)  
 where ant.name starts with "Antw"  
 AND tu.name starts with "Turn"  
 return distinct tu,ant;

This gives me all the stops for "Antwerpen" and my hometown "Turnhout". Now I can narrow this down a bit and only look at the "top-level" stops (remember that stops can have parent stops), and calculate some shortestpaths between them. Let's use this query:

 match (t:Stop)<-[:PART_OF]-(:Stop),  
 (a:Stop)<-[:PART_OF]-(:Stop)  
 where t.name starts with "Turn"  
 AND a.name="Antwerpen-Centraal"  
 with t,a  
 match p = allshortestpaths((t)-[*]-(a))  
 return p  
 limit 10;

This gives me the following result (note that I have "limited") the number of paths, as there are quite a number of trains running between the two cities):

The important thing to note here is that there is a DIRECT ROUTE between Antwerp and Turnhout and that this really makes the route-finding a lot easier.

Querying for direct routes

A real-world route planning query would look something like this:

 match (tu:Stop {name: "Turnhout"})--(tu_st:Stoptime)  
 where tu_st.departure_time > "07:00:00"  
 AND tu_st.departure_time < "09:00:00"  
 with tu, tu_st  
 match (ant:Stop {name:"Antwerpen-Centraal"})--(ant_st:Stoptime)  
 where ant_st.arrival_time < "09:00:00"  
 AND ant_st.arrival_time > "07:00:00"  
 and ant_st.arrival_time > tu_st.departure_time  
 with ant,ant_st,tu, tu_st  
 match p = allshortestpaths((tu_st)-[*]->(ant_st))  
 with nodes(p) as n  
 unwind n as nodes  
 match (nodes)-[r]-()  
 return nodes,r

which would give me a result like this:

The interesting thing here is that you can immediately see from this graph visualization that there is a "fast train" (the pink "Trip" at the bottom) and a "slow train" (the pink "Trip" at the top) between origin and destination. The slow train actually makes three additional stops.

Querying for indirect routes

Now let's look at a route-planning query for an indirect route between Turnhout and Arlon (the Southern most city in Belgium, close to the border with Luxemburg). Running this query will show me that I can only get from origin to destination by transferring from one train to another midway:

 match (t:Stop),(a:Stop)  
 where t.name = "Turnhout"  
 AND a.name="Arlon"  
 with t,a  
 match p = allshortestpaths((t)-[*]-(a))  
 where NONE (x in relationships(p) where type(x)="OPERATES")  
 return p  
 limit 10

This is what I get back then:

You can clearly see that I can get from Turnhout to Brussels, but then need to transfer to one of the Brussels-to-Arlon trains on the right. So... which one would that be? Let's run the following query:

 MATCH (tu:Stop {name:"Turnhout"})--(st_tu:Stoptime),  
 (ar:Stop {name:"Arlon"})--(st_ar:Stoptime),  
 p1=((st_tu)-[:PRECEDES*]->(st_midway_arr:Stoptime)),  
 (st_midway_arr)--(midway:Stop),  
 (midway)--(st_midway_dep:Stoptime),  
 p2=((st_midway_dep)-[:PRECEDES*]->(st_ar))  
 WHERE  
 st_tu.departure_time > "10:00:00"  
 AND st_tu.departure_time < "11:00:00"  
 AND st_midway_arr.arrival_time > st_tu.departure_time  
 AND st_midway_dep.departure_time > st_midway_arr.arrival_time  
 AND st_ar.arrival_time > st_midway_dep.departure_time  
 RETURN  
 tu,st_tu,ar,st_ar,p1,p2,midway  
 order by (st_ar.arrival_time_int-st_tu.departure_time_int) ASC  
 limit 1

You can tell that this is a bit of a more complicated. It definitely comes back with a correct result:

At the top is the Trip from Turnhout to Brussels, and at the bottom is the Trip from Brussels to Arlon. You can also see that there's a bit of a wait there, so it may actually make more sense to take a later train from Turnhout to Brussels.

The problem with this approach is of course that it would not work for a journey that involved more than one stopover. If I would, for example, want to travel from "Leopoldsburg" to "Arlon", I would need two stopovers (in Hasselt, and then in Brussels):

and therefore the query above would become even more complicated.

My conclusion here is that

it's actually pretty simple to represent GTFS data in Neo4j - and very nice to navigate through the data this way. Of course.
direct routes are very easily queries with Cypher.
indirect routes would require a bit more tweaking to the model and/or the use of a different API in Neo4j. That's currently beyond my scope of these blogposts, but I am very confident that it could be done.

I really hope you enjoyed these two blogposts, and that you will also apply it to your own local GTFS dataset - there's so many of them available. All of the queries above are on github as well of course - I hope you can use them as a baseline.

Cheers

Rik

Bruggen Blog

Pages

Thursday, 12 November 2015

Querying GTFS data - using Neo4j 2.3 - part 2/2

Finding a stop using "STARTS WITH"

Range queries in 2.3

Route finding on the GTFS dataset

Querying for direct routes

Querying for indirect routes

No comments:

Post a Comment

Labels

Blogarchive

Metricool