The General Transit Feed Specification
Turns out that there is a very, very nice and easy spec for that kind of data. It was originally developed by Google as the "Google Transit Feed Specification", in cooperation with TriMet in Portland, and is now known as the "General Transit Feed Specification". Here's a bit more detail from Wikipedia:

A GTFS feed is a collection of CSV files (with extension .txt) contained within a .zip file. Together, the related CSV tables describe a transit system's scheduled operations. The specification is designed to be sufficient to provide trip planning functionality, but is also useful for other applications such as analysis of service levels and some general performance measures. GTFS only includes scheduled operations, and does not include real-time information. However, real-time information can be related to GTFS schedules according to the related GTFS-realtime specification.

There's more info on the Google Developer site. I believe that Google originally developed this to integrate transit information into Maps - which really worked very well, I think. But since that time, the spec has been standardized - and now it turns out there are LOTS and lots of datasets like that. Most of them are on the GTFS Data Exchange, it seems - and I have downloaded a few of them:
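To get a feel for the format: a feed is literally just a zip of CSV tables. Here's a small Python sketch that builds a tiny synthetic two-file feed in memory and reads it back - the file names are real GTFS file names, but the data rows are made up for illustration:

```python
import csv
import io
import zipfile

# A GTFS feed is a .zip of CSV files with .txt extensions. Build a tiny
# synthetic feed in memory (real feeds also contain trips.txt, stops.txt,
# stop_times.txt, calendar.txt, and a few optional files).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("agency.txt", "agency_id,agency_name\n1,Demo Transit\n")
    z.writestr("routes.txt", "route_id,agency_id,route_short_name\nR1,1,1\n")

# Reading any table back is plain csv.DictReader work:
with zipfile.ZipFile(buf) as z:
    with z.open("routes.txt") as f:
        rows = list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig")))

print(rows[0]["route_id"])  # each row is a dict keyed by the CSV header
```

Note the `agency_id` column in routes.txt - those shared key columns are exactly what will become relationships in the graph.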
- the Belgian rail network (actually a CUSTOMER of Neo4j - so yay!): .zip file
- the Flemish bus and tram network: .zip file
- the Dutch rail network: .zip file
- the British rail network: .zip file
and there are many, many more.
Converting the files to a graph
The nice thing about these .zip files is that - once unzipped - they contain a bunch of comma-separated value files (with a .txt extension, though), and that these files all have a similar structure. So I took a look at some of these files, and while I found a few differences between the structures here and there (some of the GTFS data elements appear to be optional), generally I had a structure that looked like this:
You can see that there are a few "keys" in there (color coded) that link one file to the next. So then I could quite easily translate this to a graph model:
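Spelled out as a Cypher-style path pattern (using the relationship types created by the load statements further down), the model reads roughly like this - a sketch, not a runnable query:

```cypher
// the graph model as a Cypher path pattern
(a:Agency)-[:OPERATES]->(r:Route)<-[:USES]-(t:Trip),
(t)<-[:PART_OF_TRIP]-(st1:Stoptime)-[:LOCATED_AT]->(s:Stop),
(st1)-[:PRECEDES]->(st2:Stoptime),
(s)-[:PART_OF]->(parent:Stop)
```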
So now that we have that model, we should be able to import our data into Neo4j quite easily. Let's give that a go.
Loading GTFS data
Here are a couple of Cypher statements that I have used to load the data into the model. First we create some indexes and schema constraints (for uniqueness):
create constraint on (a:Agency) assert a.id is unique;
create constraint on (r:Route) assert r.id is unique;
create constraint on (t:Trip) assert t.id is unique;
create index on :Trip(service_id);
create constraint on (s:Stop) assert s.id is unique;
create index on :Stoptime(stop_sequence);
create index on :Stop(name);
Then we add the Agency, Routes and Trips:
//add the agency
load csv with headers from
'file:///delijn/agency.txt' as csv
create (a:Agency {id: toInt(csv.agency_id), name: csv.agency_name, url: csv.agency_url, timezone: csv.agency_timezone});
// add the routes
load csv with headers from
'file:///ns/routes.txt' as csv
match (a:Agency {id: toInt(csv.agency_id)})
create (a)-[:OPERATES]->(r:Route {id: csv.route_id, short_name: csv.route_short_name, long_name: csv.route_long_name, type: toInt(csv.route_type)});
// add the trips
load csv with headers from
'file:///ns/trips.txt' as csv
match (r:Route {id: csv.route_id})
create (r)<-[:USES]-(t:Trip {id: csv.trip_id, service_id: csv.service_id, headsign: csv.trip_headsign, direction_id: csv.direction_id, short_name: csv.trip_short_name, block_id: csv.block_id, bikes_allowed: csv.bikes_allowed, shape_id: csv.shape_id});
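At this point it's worth a quick sanity check that the three loads above actually created nodes - the counts will of course depend on your dataset:

```cypher
// count the nodes created so far, per label
match (n)
return labels(n) as label, count(*) as nodes
order by label;
```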
Next we load the "stops" without connecting them to the graph, and then add the parent/child relationships that can exist between specific stops:
//add the stops
load csv with headers from
'file:///ns/stops.txt' as csv
create (s:Stop {id: csv.stop_id, name: csv.stop_name, lat: toFloat(csv.stop_lat), lon: toFloat(csv.stop_lon), platform_code: csv.platform_code, parent_station: csv.parent_station, location_type: csv.location_type, timezone: csv.stop_timezone, code: csv.stop_code});
//connect parent/child relationships to stops
load csv with headers from
'file:///ns/stops.txt' as csv
with csv
where not (csv.parent_station is null)
match (ps:Stop {id: csv.parent_station}), (s:Stop {id: csv.stop_id})
create (ps)<-[:PART_OF]-(s);
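To verify the parent/child wiring, a small query like this (illustrative, not from the original post) lists a few parent stations together with their child stops:

```cypher
// parent stations and the child stops that are part of them
match (parent:Stop)<-[:PART_OF]-(child:Stop)
return parent.name, collect(child.name) as children
limit 10;
```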
Then, finally, we add the Stoptimes which connect the Trips to the Stops:
//add the stoptimes
using periodic commit
load csv with headers from
'file:///ns/stop_times.txt' as csv
match (t:Trip {id: csv.trip_id}), (s:Stop {id: csv.stop_id})
create (t)<-[:PART_OF_TRIP]-(st:Stoptime {arrival_time: csv.arrival_time, departure_time: csv.departure_time, stop_sequence: toInt(csv.stop_sequence)})-[:LOCATED_AT]->(s);
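As noted below, stop_times.txt can run to millions of rows and may need to be split into parts. One way to do that with standard Unix tools, repeating the CSV header at the top of each chunk so every part stays loadable on its own (file names and chunk size here are illustrative - a real file would use something like 500000 lines per chunk):

```shell
# Toy stand-in for a large stop_times.txt (real feeds have millions of rows)
printf 'trip_id,arrival_time,departure_time,stop_id,stop_sequence\n' > stop_times.txt
printf 't1,08:00,08:00,s1,1\nt1,08:10,08:10,s2,2\n' >> stop_times.txt
printf 't1,08:20,08:20,s3,3\nt1,08:30,08:30,s4,4\n' >> stop_times.txt

# Split the data rows into chunks, then prepend the header to each chunk
head -n 1 stop_times.txt > header.txt
tail -n +2 stop_times.txt | split -l 2 - stop_times_part_
for f in stop_times_part_*; do
  cat header.txt "$f" > "$f.txt" && rm "$f"
done
```

Each resulting stop_times_part_*.txt file can then be loaded with the same load csv statement as above.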
This query/load operation has been a bit trickier for me when experimenting with various example GTFS files: because there can be a LOT of stoptimes for large transportation networks like bus networks, the load can take a long time to complete and should be treated with care. On some occasions, I have had to split the stop_times.txt file into multiple parts to make it work.

Finally, we connect the stoptimes to one another, forming the sequence of stops that constitutes a trip:
//connect the stoptime sequences
match (s1:Stoptime)-[:PART_OF_TRIP]->(t:Trip),
(s2:Stoptime)-[:PART_OF_TRIP]->(t)
where s2.stop_sequence=s1.stop_sequence+1
create (s1)-[:PRECEDES]->(s2);
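With the PRECEDES relationships in place, walking a single trip in stop order becomes trivial. A sketch - "some-trip-id" is a placeholder you'd replace with an id from your own dataset:

```cypher
// list one trip's stops in sequence order
match (t:Trip {id: "some-trip-id"})<-[:PART_OF_TRIP]-(st:Stoptime)-[:LOCATED_AT]->(s:Stop)
return st.stop_sequence, st.arrival_time, st.departure_time, s.name
order by st.stop_sequence;
```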
That's it, really. When I generate the meta-graph for this data, I get something like this:
Which is exactly the Model that we outlined above :) ... Good!
The entire load script can be found on github, so you can try it yourself. All you need to do is change the load csv file/directory. Also, don't forget that load csv now takes its import files from the local directory that you configure in neo4j.properties.
That's about it for now. In my next blog post, I will take Neo4j 2.3 for a spin on a GTFS dataset, and see what we can find out. Check back soon to read up on that.
Hope this was interesting for you.
Cheers
Rik
Thank you for such a well-explained example, it has helped me a lot.
However I'm having some issues with the very last step of connecting the stoptimes to one another. After some minutes it always gives me an error (Error: undefined - undefined). Did you experience something similar, or any hints?
Many thanks in advance
Mmm that's weird. Never had that. What version of Neo4j? What GTFS dataset are you using? Maybe I can take a look at it next week...
Hi, it seems to be something to do with the JVM. Sometimes it gives me the following message: "Disconnected from Neo4j. Please check if the cord is unplugged."
I've tried with a different smaller GTFS dataset and everything went ok.
Mmm possible... have you tried giving Neo4j more memory?
Hi, I'm having the same issue as harregui. How can I give Neo4j more memory? I tried to do something like this but that also doesn't work:
match (t1:Trip) with collect(t1.id) AS trips
FOREACH( trip in trips |
match (s1:Stoptime)-[:PART_OF_TRIP]->(t:Trip),
(s2:Stoptime)-[:PART_OF_TRIP]->(t)
where s2.stop_sequence=s1.stop_sequence+1
create (s1)-[:PRECEDES]->(s2);
)
Usually you want to change the HEAP settings in neo4j-wrapper.conf. You should have something like
# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
#dbms.memory.heap.initial_size=512
#dbms.memory.heap.max_size=512
Uncomment the last two and put something like
# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
dbms.memory.heap.initial_size=2046
dbms.memory.heap.max_size=2046
So that the heap gets allocated from the start and does not need to get resized.
Hope that helps.
Rik