- using simple "full-text" indexing with the "starts with" where clause
- using ranges in your where clauses
Both of these were formerly very cumbersome, and very easy and powerful in Neo4j 2.3. So let's explore.
Finding a stop using "STARTS WITH"
In this first set of queries we will be using some of the newer functionalities in Neo4j 2.3, which allow you to use the underlying full-text search capabilities of Lucene to quickly and efficiently find starting points for your traversals. The first examples start with the "START WITH" string matching function - let's consider this query: match (s:Stop)
where s.name starts with "Turn"
return s
In the new shiny 2.3 version of Neo4j, we generate the following query plan:
This, as you can see, is a super efficient query with only 5 "db hits" - so a wonderful example of using the Neo4j indexing system (see the NodeIndexSeekByRange step at the top). Great - this is a super useful new feature of Neo4j 2.3, which really helps to speed up (simple) fulltext queries. Now, let me tell you about a very easy and un-intuitive way to mess this up. Consider the following variation to the query:
match (s:Stop)
where upper(s.name) starts with "TURN"
return s
All I am doing here is using the "UPPER" function to enable case-insensitive querying - but as you can probably predict, the query plan then all of a sudden looks like this:
and it generates 2785 db hits. So that is terribly inefficient: the first step (NodeByLabelScan) basically sucks in all of the nodes that have a particular Label ("Stop") and then does all of the filtering on that set. On a smallish dataset like this one it may not really matter, but on a larger one (or on a deeper traversal) this would absolutely matter. The only way to avoid this in the current product is to have a second property that would hold the lower() or upper() of the original property, and then index/query on that property. It's a reasonable workaround for most cases.
So cool - learned something.
Range queries in 2.3
I would like to get to know a little more about Neo4j 2.3's range query capabilities. I will do that by , but limiting the desired departure and arrival times. (ie. Stoptimes) by their departure_time and/or arrival_time. Let's try that with the following simple query to start with: match (st:Stoptime)
where st.departure_time < "07:45:00"
return st.departure_time;
If I run that query without an index on :Stoptime(departure_time) I get a query plan like this:
As you can see the plan starts with a "NodeByLabelScan". Very inefficient.
If however we put the index in place, and run the same query again, we get the following plan:
Which stars with a "NodeIndexSeekByRange". Very efficient. So that's good.
Now let's see how we can apply that in a realistic route finding query.
Route finding on the GTFS dataset
The obvious application for a GTFS dataset it to use it for some real-world route planning. Let's start with the following simple query, which looks up two "Stops", Antwerp (where I live) and Turnhout (where I am from): match (ant:Stop), (tu:Stop)
where ant.name starts with "Antw"
AND tu.name starts with "Turn"
return distinct tu,ant;
This gives me all the stops for "Antwerpen" and my hometown "Turnhout". Now I can narrow this down a bit and only look at the "top-level" stops (remember that stops can have parent stops), and calculate some shortestpaths between them. Let's use this query:
match (t:Stop)<-[:PART_OF]-(:Stop),
(a:Stop)<-[:PART_OF]-(:Stop)
where t.name starts with "Turn"
AND a.name="Antwerpen-Centraal"
with t,a
match p = allshortestpaths((t)-[*]-(a))
return p
limit 10;
This gives me the following result (note that I have "limited") the number of paths, as there are quite a number of trains running between the two cities):
The important thing to note here is that there is a DIRECT ROUTE between Antwerp and Turnhout and that this really makes the route-finding a lot easier.
Querying for direct routes
A real-world route planning query would look something like this: match (tu:Stop {name: "Turnhout"})--(tu_st:Stoptime)
where tu_st.departure_time > "07:00:00"
AND tu_st.departure_time < "09:00:00"
with tu, tu_st
match (ant:Stop {name:"Antwerpen-Centraal"})--(ant_st:Stoptime)
where ant_st.arrival_time < "09:00:00"
AND ant_st.arrival_time > "07:00:00"
and ant_st.arrival_time > tu_st.departure_time
with ant,ant_st,tu, tu_st
match p = allshortestpaths((tu_st)-[*]->(ant_st))
with nodes(p) as n
unwind n as nodes
match (nodes)-[r]-()
return nodes,r
which would give me a result like this:
The interesting thing here is that you can immediately see from this graph visualization that there is a "fast train" (the pink "Trip" at the bottom) and a "slow train" (the pink "Trip" at the top) between origin and destination. The slow train actually makes three additional stops.
Querying for indirect routes
Now let's look at a route-planning query for an indirect route between Turnhout and Arlon (the Southern most city in Belgium, close to the border with Luxemburg). Running this query will show me that I can only get from origin to destination by transferring from one train to another midway: match (t:Stop),(a:Stop)
where t.name = "Turnhout"
AND a.name="Arlon"
with t,a
match p = allshortestpaths((t)-[*]-(a))
where NONE (x in relationships(p) where type(x)="OPERATES")
return p
limit 10
This is what I get back then:
You can clearly see that I can get from Turnhout to Brussels, but then need to transfer to one of the Brussels-to-Arlon trains on the right. So... which one would that be? Let's run the following query:
MATCH (tu:Stop {name:"Turnhout"})--(st_tu:Stoptime),
(ar:Stop {name:"Arlon"})--(st_ar:Stoptime),
p1=((st_tu)-[:PRECEDES*]->(st_midway_arr:Stoptime)),
(st_midway_arr)--(midway:Stop),
(midway)--(st_midway_dep:Stoptime),
p2=((st_midway_dep)-[:PRECEDES*]->(st_ar))
WHERE
st_tu.departure_time > "10:00:00"
AND st_tu.departure_time < "11:00:00"
AND st_midway_arr.arrival_time > st_tu.departure_time
AND st_midway_dep.departure_time > st_midway_arr.arrival_time
AND st_ar.arrival_time > st_midway_dep.departure_time
RETURN
tu,st_tu,ar,st_ar,p1,p2,midway
order by (st_ar.arrival_time_int-st_tu.departure_time_int) ASC
limit 1
You can tell that this is a bit of a more complicated. It definitely comes back with a correct result:
At the top is the Trip from Turnhout to Brussels, and at the bottom is the Trip from Brussels to Arlon. You can also see that there's a bit of a wait there, so it may actually make more sense to take a later train from Turnhout to Brussels.
The problem with this approach is of course that it would not work for a journey that involved more than one stopover. If I would, for example, want to travel from "Leopoldsburg" to "Arlon", I would need two stopovers (in Hasselt, and then in Brussels):
and therefore the query above would become even more complicated.
My conclusion here is that
- it's actually pretty simple to represent GTFS data in Neo4j - and very nice to navigate through the data this way. Of course.
- direct routes are very easily queries with Cypher.
- indirect routes would require a bit more tweaking to the model and/or the use of a different API in Neo4j. That's currently beyond my scope of these blogposts, but I am very confident that it could be done.
I really hope you enjoyed these two blogposts, and that you will also apply it to your own local GTFS dataset - there's so many of them available. All of the queries above are on github as well of course - I hope you can use them as a baseline.
Cheers
Rik
No comments:
Post a Comment