Sunday 19 April 2015

The making of The GraphConnect Graph

Next month is GraphConnect London, our industry's yearly High Mass of Graphiness. It's going to be a wonderful event, and of course I wanted to chip in to promote it in any way that I can. So I did the same thing that I did for Øredev and QCon before: I imported the schedule into Neo4j.

I have actually already published a GraphGist about this. But this post is more about the making of that database - just because I - AGAIN - learnt something interesting while doing it.

The Source Data

My dear Marketing colleague Claudia gave me a source spreadsheet with the schedule. But of course that was a bit too... Marketing-y. I cleaned it up into a very simple sheet that allowed me to generate a very simple CSV file:
I have shared the CSV file on GitHub. Nothing really special about it, but let me quickly explain what I did with it.

Choosing a model

Before importing data, you need to think a bit about the model you want to import into. I chose this model:
The right-hand part is probably pretty easy to understand. But of course I had to do something special with the days and the timeslots.



  • The days are part of the conference, and they are connected:
  • And the timeslots within a day are also connected:
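In Cypher, that day structure might be created with something like the sketch below. Note that the `Conference` label and the `PART_OF` relationship name are my assumptions here - the GraphGist has the real model:

```cypher
// Sketch only: label and relationship names are assumptions,
// not necessarily the ones used in the actual GraphGist.
MERGE (c:Conference {name: "GraphConnect London"})
MERGE (d1:Day {date: 20150506})
MERGE (d2:Day {date: 20150507})
MERGE (d1)-[:PART_OF]->(c)
MERGE (d2)-[:PART_OF]->(c)
// the two days are connected to one another:
MERGE (d1)-[:FOLLOWED_BY]->(d2);
```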
So how do we import from that simple CSV file into that model? Let's explore.

The LOAD CSV scripts

You can find the full load script - which actually loads from the dataset mentioned above - on GitHub too. It's pretty straightforward: most commands just read a specific column from the CSV file and MERGE the values into the graph. For example:
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/rvanbruggen/ff44b7dc37bb4534df2e/raw/aed34a149f04798e351f508a18492237fcccfb62/schedule.csv" AS csv
MERGE (v:Venue {name: csv.Venue})
MERGE (r:Room {name: csv.Room})
MERGE (r)-[:LOCATED_IN]->(v)
MERGE (d:Day {date: toInt(csv.Date)})
MERGE (tr:Track {name: csv.Track});
Nice and easy. A couple of commands are a bit more special, as they have to check for NULLs before doing a MERGE - but nothing really complicated. The two sets of import commands - one for each day - are a bit more interesting: how do you import the timeslots and create a structure like the one above, where all timeslots are nicely ordered and connected in an in-graph index? That's not that trivial:


  • loading the timeslots is easy with MERGE
  • sorting them is of course also easy
  • but creating the appropriate FOLLOWED_BY relationships between the timeslots to create the in-graph index/timeline, is not that easy.
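As an aside, the NULL checks mentioned above need a little trick, since Cypher has no conditional MERGE: a FOREACH over a CASE expression that yields either a one-element or an empty collection. A sketch of that idiom - the `Speaker` column name is an assumption on my part:

```cypher
// Sketch of the NULL-guard idiom. "Speaker" is an assumed column
// name; MERGE only runs when the CSV cell is not NULL.
LOAD CSV WITH HEADERS FROM "https://gist.githubusercontent.com/rvanbruggen/ff44b7dc37bb4534df2e/raw/aed34a149f04798e351f508a18492237fcccfb62/schedule.csv" AS csv
FOREACH (ignore IN CASE WHEN csv.Speaker IS NOT NULL THEN [1] ELSE [] END |
  MERGE (s:Speaker {name: csv.Speaker}));
```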
Luckily I found these two blogposts by Mark Needham that showed me how to do it. Here's the query:

MATCH (t:Time)--(d:Day {date: 20150506})
WITH t
ORDER BY t.time ASC
WITH collect(t) AS times
  FOREACH (i IN range(0,length(times)-2) |
    FOREACH (t1 IN [times[i]] |
      FOREACH (t2 IN [times[i+1]] |
        MERGE (t1)-[:FOLLOWED_BY]->(t2))));
What this does is the following:
  • you match all the timeslots that are part of the specific day that you want to order.
  • you pass the ordered timeslots to the next part of the query
  • you collect the ordered timeslots into a collection
  • you iterate with FOREACH over the starting indices, from 0 up to the length of the collection minus 2 (range() is inclusive, so the last pair is the second-to-last and last timeslot)
  • for each index i, the two nested FOREACHes are just a trick to bind the timeslots at positions i and i+1 to the variables t1 and t2
  • for every such pair, you MERGE a FOLLOWED_BY relationship from t1 to t2
And that's it. Job done. We do this for the second day ("20150507") as well, of course.
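Once a day is linked up, a quick sanity check is to walk the chain from start to finish. Something like the query below should return that day's timeslots in order - a sketch, assuming the `time` property and the FOLLOWED_BY timeline built above:

```cypher
// Sketch: walk the in-graph timeline for the first day.
// Assumes the Time/Day model and FOLLOWED_BY chain from above.
MATCH (d:Day {date: 20150506})--(first:Time)
WHERE NOT ()-[:FOLLOWED_BY]->(first)
MATCH p = (first)-[:FOLLOWED_BY*]->(last)
WHERE NOT (last)-[:FOLLOWED_BY]->()
RETURN [t IN nodes(p) | t.time] AS timeline;
```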

Hope you enjoyed this as much as I did, and hope to see you at the conference!

Cheers

Rik
