Monday 9 November 2015

Loading General Transport Feed Spec (GTFS) files into Neo4j - part 1/2

Lately I have been having a lot of fun with a pretty simple but interesting type of data: transport system data. That is: any kind of schedule data that a transportation network (bus, rail, tram, tube, metro, ...) would publish to it's users. This is super interesting data for a graph, right, as you could easily see that "shortestpath" operations over a larger transportation network would be super useful and quick.

The General Transport Feed Specification

Turns out that there is a very, very nice and easy spec for that kind of data. It was originally developed by Google as the "Google Transport Feed Specification" in cooperation with Portland Trimet, and is now known as the "General Transport Feed Specification". Here's a bit more detail from Wikipedia:
A GTFS feed is a collection of CSV files (with extension .txt) contained within a .zip file. Together, the related CSV tables describe a transit system's scheduled operations. The specification is designed to be sufficient to provide trip planning functionality, but is also useful for other applications such as analysis of service levels and some general performance measures. GTFS only includes scheduled operations, and does not include real-time information. However real-time information can be related to GTFS schedules according to the related GTFS-realtime specification.
More info on the Google Developer site. I believe that Google originally developed this to integrate transport information into Maps - which really worked very well I think. But since that time, the spec has been standardized - and now it turns out there are LOTS and lots of datasets like that.  Most of them are on the GTFS Exchange, it seems - and I have downloaded a few of them:
and there's many, many more.

Converting the files to a graph

The nice thing about these .zip files is that - once unzipped - they contain a bunch of comma-separated value files (.txt extension though), and that thee files all have a similar structure:

So I took a look at some of these files, and while I found that there are a few differences between the structures here and there (some of the GTFS data elements appear to be optional), but that generally I had a structure that looked like this:

You can see that there are a few "keys" in there (color coded) that link one file to the next. So then I could quite easily translate this to a graph model:

So now that we have that model, we should be able to import our data into Neo4j quite easily. Let's give that a go.

Loading GTFS data

Here's a couple of Cypher statements that I have used to load the data into the model. First we create some indexes and schema constraints (for uniqueness):

 create constraint on (a:Agency) assert is unique;  
 create constraint on (r:Route) assert is unique;  
 create constraint on (t:Trip) assert is unique;  
 create index on :Trip(service_id);  
 create constraint on (s:Stop) assert is unique;  
 create index on :Stoptime(stop_sequence);  
 create index on :Stop(name);  

Then we add the Agency, Routes and Trips:
 //add the agency  
 load csv with headers from  
 'file:///delijn/agency.txt' as csv  
 create (a:Agency {id: toInt(csv.agency_id), name: csv.agency_name, url: csv.agency_url, timezone: csv.agency_timezone});  
// add the routes  
 load csv with headers from  
 'file:///ns/routes.txt' as csv  
 match (a:Agency {id: toInt(csv.agency_id)})  
 create (a)-[:OPERATES]->(r:Route {id: csv.route_id, short_name: csv.route_short_name, long_name: csv.route_long_name, type: toInt(csv.route_type)});  
 // add the trips  
 load csv with headers from  
 'file:///ns/trips.txt' as csv  
 match (r:Route {id: csv.route_id})  
 create (r)<-[:USES]-(t:Trip {id: csv.trip_id, service_id: csv.service_id, headsign: csv.trip_headsign, direction_id: csv.direction_id, short_name: csv.trip_short_name, block_id: csv.block_id, bikes_allowed: csv.bikes_allowed, shape_id: csv.shape_id});  

Next we first load the "stops" without connecting them to the graph, including the parent/child relationships that can exist between specific stops:
 //add the stops  
 load csv with headers from  
 'file:///ns/stops.txt' as csv  
 create (s:Stop {id: csv.stop_id, name: csv.stop_name, lat: toFloat(csv.stop_lat), lon: toFloat(csv.stop_lon), platform_code: csv.platform_code, parent_station: csv.parent_station, location_type: csv.location_type, timezone: csv.stop_timezone, code: csv.stop_code});  
//connect parent/child relationships to stops  
 load csv with headers from  
 'file:///ns/stops.txt' as csv  
 with csv  
 where not (csv.parent_station is null)  
 match (ps:Stop {id: csv.parent_station}), (s:Stop {id: csv.stop_id})  
 create (ps)<-[:PART_OF]-(s);  

Then, finally, we add the Stoptimes which connect the Trips to the Stops:
 //add the stoptimes  
 using periodic commit  
 load csv with headers from  
 'file:///ns/stop_times.txt' as csv  
 match (t:Trip {id: csv.trip_id}), (s:Stop {id: csv.stop_id})  
 create (t)<-[:PART_OF_TRIP]-(st:Stoptime {arrival_time: csv.arrival_time, departure_time: csv.departure_time, stop_sequence: toInt(csv.stop_sequence)})-[:LOCATED_AT]->(s);  
This query/load operation has been a bit trickier for me when experimenting with various example GTFS files: because there can be a LOT of stoptimes for large transportation networks like bus networks, they can take a long time to complete and should be treated with care. On some occasions, I have had to split the Stoptimes.txt file into multiple parts to make it work.

Finally, we will connect the stoptimes to one another, forming a sequence of stops that constitute a trip:
 //connect the stoptime sequences  
 match (s1:Stoptime)-[:PART_OF_TRIP]->(t:Trip),  
 where s2.stop_sequence=s1.stop_sequence+1  
 create (s1)-[:PRECEDES]->(s2);  

That's it, really. When I generate the meta-graph for this data, I get something like this:

Which is exactly the Model that we outlined above :) ... Good!

The entire load script can be found on github, so you can try it yourself. All you need to do is chance the load csv file/directory. Also, don't forget that load csv now takes its import files from the local directory that you configure in
That's about it for now. In a next blogpost, I will take Neo4j 2.3 for a spin on a GTFS dataset, and see what we can find out. Check back soon to read up on that.

Hope this was interesting for you.




  1. Thank you for a so well explained example, it has helped me a lot.
    However I'm having some issues with the very last step of connecting the stoptimes to one another. After some minutes it always gives me an error (Error: undefined - undefined). Did you experience something similar, or any hints?
    Many thanks in advance

  2. Mmm that's weird. Never had that. What version of Neo4j? What GTFS dataset are you using? Maybe I can take a look at it next week...

  3. Hi, it seems something to do with the JVM. Sometimes it gives me the following message: "Disconnected from Neo4j. Please check if the cord is unplugged."

    I've tried with a different smaller GTFS dataset and everything went ok.

  4. Mmm possible... have you tried giving Neo4j more memory?

    1. Hi, I'm having the same issue with harregui. How can I give Neo4j more memory? I tried to do something like this but that also doesn't work:

      match (t1:Trip) with collect( AS trips
      FOREACH( trip in trips |
      match (s1:Stoptime)-[:PART_OF_TRIP]->(t:Trip),
      where s2.stop_sequence=s1.stop_sequence+1
      create (s1)-[:PRECEDES]->(s2);

    2. Usually you want to change the HEAP settings in neo4j-wrapper.conf . You should have something like

      # Java Heap Size: by default the Java heap size is dynamically
      # calculated based on available system resources.
      # Uncomment these lines to set specific initial and maximum
      # heap size in MB.

      Uncomment the last two and put something like
      # Java Heap Size: by default the Java heap size is dynamically
      # calculated based on available system resources.
      # Uncomment these lines to set specific initial and maximum
      # heap size in MB.

      So that the heap gets allocated from the start and does not need to get resized.

      Hope that helps.