Wednesday 26 April 2017

Graphconnect Europe 2017 schedule graph

Countdown has begun! Two weeks from now we'll be bringing together the entire European Graph Community in London again, for the annual Graphconnect Conference. Every year, it's something to really live up to, to rally our customers and users to attend as we really believe in the "power of relationships" that are formed and strengthened at conferences like this.


So of course, we had to pull out the old trick (started at Oredev 2014 actually - so quite some time ago!) of creating a "Conference Schedule Graph" for everyone to explore.

The Mother of all Schedules: a Google Sheet

Of course we needed to structure the data a bit before we could create a graph from it. The go-to tool for that for me is a Google Spreadsheet, and our marketing team helped me pull this one together:

It's got two tabs: one for the speakers, and one for the talks. Really simple.

Creating a graph model

From that spreadsheet, I of course had to derive a more "graphy" data model - nothing fancy, just something that would allow me to query the dataset in an intuitive way. This is what I came up with, using the Arrows tool:

Nothing really special about this - just a highly normalised model where the timeline at the bottom is probably a bit more counterintuitive to what you would do in most databases.

So let's see how we would import the data into that model.

Importing the schedule into my model

So the important is nothing really special, in the sense that I use the standard Load CSV tools that are part of Neo4j's Cypher to pull the data from the publicly available version of the two CSV file versions of the spreadsheet:

I use these two URLs in different parts of an Import script that I have also put on Github over here.

There's different parts to the script, which are clearly highlighted and commented:
  1. add the speakers and the companies: this picks up the first CSV file, and then creates (actually: MERGES so as not to create duplicates) the Person and Company nodes, and once that is done, it also passes the CSV file to a second query that also MERGEs the relationships between the Person and Company nodes.
  2. then I MERGE the Floor, Room and Track nodes that I find in the second session CSV file
  3. then I add the Time nodes and use a nested FOREACH query to connect them up to one another and create the timeline
  4. then I get to the bulk of the session data and create the Session nodes and connect these to the Person, Room, Time and Track nodes. Note that I also add the comma-separate list of tags that is in the CSV file as a property on the session. This is a temporary step - see next operation.
  5. In the final import/update query I look at all these tag properties on the Session nodes, and extract these into Tag nodes: I grab the property, split it based on the commas, trim it, and then iterate over the collection using UNWIND. I also remove the tag property from the Session nodes once this is done.
And that's it. I get 277 nodes and 452 relationships to be added to the database, and get the following model representation when I call db.schema().



So now we can query the dataset really easily. You can just browse through it and look at the timeline:

 Or start exploring a bit and just grab a session that you are interested in and explore its connections:
But of course that's not how you would normally proceed.

Querying the GraphConnect Europe 2017 schedule

Here's a couple of example queries that you could run really easily.

Query 1: looking for the sessions and connected information along the timeline

This query would look like this:

match (t:Time)<--(s:Session)--(connections)
return t,s,connections
limit 50

and you would get a result like this to then further explore.


Query 2: Look at the links between two people

One of my favourite queries for graph databases is the "pathfinding" between different nodes in the graph. So let's assume that you would want to understand the links between Axel Morgner (of Structr fame) and Jim Webber (of Science fame), then you would do something like:

match (p1:Person), (p2:Person),
path = allshortestpaths( (p1)-[*]-(p2) )
where p1.name contains "Morgner"
and p2.name contains "Webber"
return path

and the result would show the linked path immediately:


Query 3: look at the links between a company and a person

Very similar to the above, of course:

match (c:Company {name:"GraphAware"}), (p:Person {name:"Jim Webber"}),
path = allshortestpaths( (c)-[*]-(p) )
return path

but the path seems to be a bit more elaborate:


And then last but not least:

Query 4: look at sessions with more than one speaker

I always find this interesting:

match (s:Session)-[r:SPEAKS_IN]-(p:Person)
with s, collect(p) as person, count(p) as count
where count > 1
return s,person

And the result is something worth exploring, always.



Of course there are PLENTY of other queries that you come up with - but that's for you to do in the next two weeks as you get ready for the conference :) ...

Also available as a Gist or Guide

Before we wrap up this blogpost, I would also like to add that you can look at the above data in a Graphgist and a Browser Guide:

  • The Graphgist is available over here.
  • and if you just type

    :play http://portal.graphgist.org/graph_gists/graphconnect-europe-2017-schedule-graph/graph_guide

    in the Neo4j Browser you will see it inside the browser immediately, including the above queries to add the dataset to your own Neo4j Instance and everything. Note that you do need to allow the browser to fetch guides from other, trusted URLs, and therefore you need to add this property to the neo4j.conf file:

    browser.remote_content_hostname_whitelist=http://portal.graphgist.org
    and then you get this:

All of the above is hosted on Github - so please head over there if you want to play around with it some more.

Hope this was useful and interesting.

Cheers

Rik

No comments:

Post a Comment