I actually have already published an GraphGist about this already. But this post is more about the making of that database - just because I - AGAIN - learnt something interesting while doing it.
The Source Data
My dear Marketing colleague Claudia gave me a source spreadsheet with the schedule. But of course that was a bit too... Marketing-y. I cleaned it up into a very simple sheet that allowed me to generate a very simple CSV file:
I have shared the CSV file on Github. Nothing really special about it. But let me quickly explain what I did with it.Choosing a model
Before importing data, you need to think a bit about the model you want to import into. I chose this model:The right hand part is probably pretty easy to understand. But of course I had to do something special with the days and the timeslots.
- The days are part of the conference, and they are connected:
- And the timeslots within a day are also connected:
So how to import into that from that simple CSV file. Let's explore.
The LOAD CSV scripts
You can find the full load script - which actually loads from the dataset mentioned above - on github too. It's pretty straightforward: most commands just read a specific column from the csv file and do MERGEs to the graph. Like for exampleload csv with headers from "https://gist.githubusercontent.com/rvanbruggen/ff44b7dc37bb4534df2e/raw/aed34a149f04798e351f508a18492237fcccfb62/schedule.csv" as csvNice and easy. There's a couple of commands that are a bit more special, as they have to check for NULLs before you can do a MERGE. But nothing really complicated. There's two sets of import commands - one for each day - that is a bit more interesting: how do you import the timeslots and create a structure like the one above, where all timeslots are nicely ordered and connected in an in-graph index. That's not that trivial:
merge (v:Venue {name: csv.Venue})
merge (r:Room {name: csv.Room})
merge (r)-[:LOCATED_IN]->(v)
merge (d:Day {date: toInt(csv.Date)})
merge (tr:Track {name: csv.Track});
- loading the timeslots is easy with MERGE
- sorting them is of course also easy
- but creating the appropriate FOLLOWED_BY relationships between the timeslots to create the in-graph index/timeline, is not that easy.
Luckily I found these two blogposts by Mark Needham that shows me how to do it. Here's the query:
match (t:Time)--(d:Day {date: 20150506})What this does is the following:
with t
order by t.time ASC
with collect(t) as times
foreach (i in range(0,length(times)-2) |
foreach (t1 in [times[i]] |
foreach (t2 in [times[i+1]] |
merge (t1)-[:FOLLOWED_BY]->(t2))));
- you match all the timeslots that are part of the specific day that you want to order.
- you pass the ordered timeslots to the next part of the query
- you collect the ordered timeslots into a collection
- you iterate X times (where X is the length of the collection -2, so excluding the start and end position) through the collection with FOREACH
- you iterate through the starting positions (i) and the ending positions (i+1) in the collection
- every time you iterate, you MERGE a FOLLOWED_BY relationship between the starting position and the ending position
And that's it. Job done. We do this for the second day ("20150507") as well, of course.
Hope you enjoyed this as much as I did, and hope to see you at the conference!
Cheers
Rik
No comments:
Post a Comment