Wednesday, 26 April 2017

GraphConnect Europe 2017 schedule graph

The countdown has begun! Two weeks from now we'll be bringing together the entire European graph community in London again, for the annual GraphConnect conference. Every year it's something we really look forward to, and we rally our customers and users to attend, because we really believe in the "power of relationships" that are formed and strengthened at conferences like this.


So of course, we had to pull out the old trick (started at Oredev 2014 actually - so quite some time ago!) of creating a "Conference Schedule Graph" for everyone to explore.

The Mother of all Schedules: a Google Sheet

Of course we needed to structure the data a bit before we could create a graph from it. The go-to tool for that for me is a Google Spreadsheet, and our marketing team helped me pull this one together:

It's got two tabs: one for the speakers, and one for the talks. Really simple.

Creating a graph model

From that spreadsheet, I of course had to derive a more "graphy" data model - nothing fancy, just something that would allow me to query the dataset in an intuitive way. This is what I came up with, using the Arrows tool:

Nothing really special about this - just a highly normalised model, where the timeline at the bottom is probably a bit counterintuitive compared to what you would do in most databases.
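
Once the data is imported (see the next section), a query that walks right across this model could look something like the sketch below. Fair warning: apart from SPEAKS_IN, which shows up in the queries further down, all the relationship type names here are my own illustrative guesses:

MATCH (p:Person)-[:WORKS_FOR]->(c:Company),
      (p)-[:SPEAKS_IN]->(s:Session),
      (s)-[:IN_ROOM]->(room:Room),
      (s)-[:IN_TRACK]->(track:Track),
      (s)-[:AT_TIME]->(t:Time)
RETURN p, c, s, room, track, t
LIMIT 5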

So let's see how we would import the data into that model.

Importing the schedule into my model

So the import is nothing really special, in the sense that I use the standard LOAD CSV tooling that is part of Neo4j's Cypher to pull the data from the publicly available CSV versions of the two tabs of the spreadsheet:

I use these two URLs in different parts of an import script that I have also put on GitHub over here.
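
For reference, the shape of such a LOAD CSV call against a Google Sheet is sketched below - the spreadsheet id, the gid of the tab and the "Name"/"Company" column headers are placeholders, not the real values from my sheet:

LOAD CSV WITH HEADERS
FROM "https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv&gid=<tab-gid>" AS csv
MERGE (p:Person {name: csv.Name})
MERGE (c:Company {name: csv.Company})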

There are different parts to the script, which are clearly highlighted and commented:
  1. add the speakers and the companies: this picks up the first CSV file, and then creates (actually: MERGEs, so as not to create duplicates) the Person and Company nodes. Once that is done, it also passes the CSV file to a second query that MERGEs the relationships between the Person and Company nodes.
  2. then I MERGE the Floor, Room and Track nodes that I find in the second CSV file, the one with the sessions
  3. then I add the Time nodes and use a nested FOREACH query to connect them up to one another and create the timeline
  4. then I get to the bulk of the session data and create the Session nodes and connect these to the Person, Room, Time and Track nodes. Note that I also add the comma-separated list of tags that is in the CSV file as a property on the session. This is a temporary step - see the next operation.
  5. in the final import/update query I look at all these tag properties on the Session nodes, and extract them into Tag nodes: I grab the property, split it based on the commas, trim it, and then iterate over the collection using UNWIND. I also remove the tag property from the Session nodes once this is done - see the sketch right after this list.
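
To make that a bit more concrete, here is a minimal sketch of what steps 3 and 5 could look like in Cypher. Note that the time and tags property names, as well as the NEXT and HAS_TAG relationship types, are my own illustrative assumptions - the actual script on GitHub is the authoritative version:

// Step 3, sketched: order the Time nodes and link each one to the next
// with a nested FOREACH to build the timeline.
MATCH (t:Time)
WITH t ORDER BY t.time
WITH collect(t) AS times
FOREACH (i IN range(0, size(times) - 2) |
  FOREACH (t1 IN [times[i]] |
    FOREACH (t2 IN [times[i + 1]] |
      MERGE (t1)-[:NEXT]->(t2))));

// Step 5, sketched: split the comma-separated tag property into Tag
// nodes, connect them, and remove the temporary property afterwards.
MATCH (s:Session)
WHERE exists(s.tags)
UNWIND split(s.tags, ",") AS rawTag
WITH s, trim(rawTag) AS tagName
MERGE (tag:Tag {name: tagName})
MERGE (s)-[:HAS_TAG]->(tag)
WITH DISTINCT s
REMOVE s.tags;
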
And that's it. I get 277 nodes and 452 relationships added to the database, and get the following model representation when I call db.schema().



So now we can query the dataset really easily. You can just browse through it and look at the timeline:

Or you can start exploring a bit: just grab a session that you are interested in and look at its connections:
But of course that's not how you would normally proceed.

Querying the GraphConnect Europe 2017 schedule

Here are a couple of example queries that you could run really easily.

Query 1: looking for the sessions and connected information along the timeline

This query would look like this:

MATCH (t:Time)<--(s:Session)--(connections)
RETURN t, s, connections
LIMIT 50

and you would get a result like this, which you can then explore further.


Query 2: Look at the links between two people

One of my favourite queries for graph databases is the "pathfinding" between different nodes in the graph. So let's say you want to understand the links between Axel Morgner (of Structr fame) and Jim Webber (of Science fame) - you would do something like:

MATCH (p1:Person), (p2:Person),
  path = allShortestPaths( (p1)-[*]-(p2) )
WHERE p1.name CONTAINS "Morgner"
  AND p2.name CONTAINS "Webber"
RETURN path

and the result would show the linked path immediately:


Query 3: look at the links between a company and a person

Very similar to the above, of course:

MATCH (c:Company {name: "GraphAware"}), (p:Person {name: "Jim Webber"}),
  path = allShortestPaths( (c)-[*]-(p) )
RETURN path

but the path seems to be a bit more elaborate:


And then last but not least:

Query 4: look at sessions with more than one speaker

I always find this interesting:

MATCH (s:Session)-[:SPEAKS_IN]-(p:Person)
WITH s, collect(p) AS persons, count(p) AS count
WHERE count > 1
RETURN s, persons

And the result is something worth exploring, always.



Of course there are PLENTY of other queries that you could come up with - but that's for you to do in the next two weeks as you get ready for the conference :) ...

Also available as a Gist or Guide

Before we wrap up this blogpost, I would also like to add that you can look at the above data in a Graphgist and a Browser Guide:

  • The Graphgist is available over here.
  • and if you just type

    :play http://portal.graphgist.org/graph_gists/graphconnect-europe-2017-schedule-graph/graph_guide

    in the Neo4j Browser you will see it inside the browser immediately, including the above queries to add the dataset to your own Neo4j instance and everything. Note that you do need to allow the browser to fetch guides from other, trusted URLs, and therefore you need to add this property to the neo4j.conf file:

    browser.remote_content_hostname_whitelist=http://portal.graphgist.org

    and then you get this:

All of the above is hosted on GitHub - so please head over there if you want to play around with it some more.

Hope this was useful and interesting.

Cheers

Rik

Tuesday, 25 April 2017

Autocompleting Neo4j - part 4/4 of a Googly Q&A

In the first, second and third posts in this series, I got round to finally answering some of the more interesting "frequently asked questions" that Google seems to be getting on the topic of Neo4j.
Today, we'll wrap up with the last part of that Q&A, and answer two more questions which - funnily enough - are kind of related. They both deal with the query language that people use to interact with their graph database. Neo4j has been pioneering openCypher of course, but clearly there are alternatives out there - and people need to make an informed choice between query languages.

Monday, 24 April 2017

Autocompleting Neo4j - part 3/4 of a Googly Q&A

In the first and second posts in this series, I explained and started to explore some of the more interesting "frequently asked questions" that seem to surround Neo4j on the interwebs.
Today, we'll continue that journey, and talk about Lucene, transaction support, and SOLR. Should be fun!

2. Does Neo4j use Lucene

This one is a lot simpler to answer - luckily - than the scale question that we tackled in the previous post. The answer is: YES, Neo4j does indeed leverage the (full-text) indexing capabilities of Lucene to create "graph indexes" on specific node-label-property combinations.
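
As a quick illustration - with an arbitrary label/property combination of my own choosing, not one taken from the post - creating such an index in Neo4j 3.x looks like this, and exact-match lookups on that combination can then use it:

CREATE INDEX ON :Person(name);

MATCH (p:Person)
WHERE p.name = "Jim Webber"
RETURN p;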

Friday, 21 April 2017

Autocompleting Neo4j - part 2/4 of a Googly Q&A

So in the previous post, I explained my plan of doing a series of blogposts around the most frequently asked Google questions as recorded and suggested by Google's Autocomplete feature.
We'll start this week with the most asked question of all - which I get all the time from users and customers - and it's the inevitable "scale" question. Let's do this.

1. Does Neo4j Scale

Let's start at the beginning, with the first question that lots of people ask: "Does Neo4j scale?" Interesting. Should not surprise anyone in an age of "big data", right? Let's tackle that one.


To me, this is one of the trickiest and most difficult things to answer - for the simple reason that "to scale" can mean many different things to many different people. However, I think there are a couple of distinct things that people mean by the question - at least that's my experience. So let's try to go through those - noting that this is by no means an exhaustive discussion on "scalability" - just my €0.02.

Thursday, 20 April 2017

Autocompleting Neo4j - part 1/4 of a Googly Q&A

As you can probably tell from this blog, I have been working in the wonderful world of graphs for quite some time now - Neo4j remains one of the coolest and most inspiring products I have ever seen in my 20-odd years in the IT industry, and it has certainly been a thrill to be part of so many commercial and community projects around the technology in the past 5 years. Not to mention the wonderful friends and colleagues that I have found along the way.

One thing that does keep on amazing me in working with Neo4j is the never-ending
  • stream of use cases, industries and functional domains where graphs and graph databases can be useful
  • stream of new audiences that we continue to educate and inform on the topic. Every time we do a meetup or an event, we seem to tap a new source of people that are just starting their journey into the wonderful world of graphs - and that we get to talk to and work with along the way.
When dealing with these new audiences, it's also pretty clear that we ... keep on having the same types of conversations time and time again. Every new graphista that joins the community is asking the same or similar kinds of questions... and most likely, they are going to google for answers.

This leads me to the topic of this blogpost, which is both fun and serious at the same time: we are going to try and autocomplete neo4j :) ...

Autocompleting? What's that again?

When we talk about autocomplete, we talk about this amazing technology that Google has built into its search functionality, which completes your search query as you type - oftentimes "guessing" what you are most likely looking for before you have even thought about it... it can be pretty interesting, even eerily scary sometimes...