Wednesday, 26 April 2017

Graphconnect Europe 2017 schedule graph

Countdown has begun! Two weeks from now we'll be bringing together the entire European Graph Community in London again, for the annual Graphconnect Conference. Every year, it's something to really live up to, to rally our customers and users to attend as we really believe in the "power of relationships" that are formed and strengthened at conferences like this.

So of course, we had to pull out the old trick (started at Oredev 2014 actually - so quite some time ago!) of creating a "Conference Schedule Graph" for everyone to explore.

The Mother of all Schedules: a Google Sheet

Of course we needed to structure the data a bit before we could create a graph from it. The go-to tool for that for me is a Google Spreadsheet, and our marketing team helped me pull this one together:

It's got two tabs: one for the speakers, and one for the talks. Really simple.

Creating a graph model

From that spreadsheet, I of course had to derive a more "graphy" data model - nothing fancy, just something that would allow me to query the dataset in an intuitive way. This is what I came up with, using the Arrows tool:

Nothing really special about this - just a highly normalised model where the timeline at the bottom is probably a bit more counterintuitive to what you would do in most databases.

So let's see how we would import the data into that model.

Importing the schedule into my model

So the important is nothing really special, in the sense that I use the standard Load CSV tools that are part of Neo4j's Cypher to pull the data from the publicly available version of the two CSV file versions of the spreadsheet:

I use these two URLs in different parts of an Import script that I have also put on Github over here.

There's different parts to the script, which are clearly highlighted and commented:
  1. add the speakers and the companies: this picks up the first CSV file, and then creates (actually: MERGES so as not to create duplicates) the Person and Company nodes, and once that is done, it also passes the CSV file to a second query that also MERGEs the relationships between the Person and Company nodes.
  2. then I MERGE the Floor, Room and Track nodes that I find in the second session CSV file
  3. then I add the Time nodes and use a nested FOREACH query to connect them up to one another and create the timeline
  4. then I get to the bulk of the session data and create the Session nodes and connect these to the Person, Room, Time and Track nodes. Note that I also add the comma-separate list of tags that is in the CSV file as a property on the session. This is a temporary step - see next operation.
  5. In the final import/update query I look at all these tag properties on the Session nodes, and extract these into Tag nodes: I grab the property, split it based on the commas, trim it, and then iterate over the collection using UNWIND. I also remove the tag property from the Session nodes once this is done.
And that's it. I get 277 nodes and 452 relationships to be added to the database, and get the following model representation when I call db.schema().

So now we can query the dataset really easily. You can just browse through it and look at the timeline:

 Or start exploring a bit and just grab a session that you are interested in and explore its connections:
But of course that's not how you would normally proceed.

Querying the GraphConnect Europe 2017 schedule

Here's a couple of example queries that you could run really easily.

Query 1: looking for the sessions and connected information along the timeline

This query would look like this:

match (t:Time)<--(s:Session)--(connections)
return t,s,connections
limit 50

and you would get a result like this to then further explore.

Query 2: Look at the links between two people

One of my favourite queries for graph databases is the "pathfinding" between different nodes in the graph. So let's assume that you would want to understand the links between Axel Morgner (of Structr fame) and Jim Webber (of Science fame), then you would do something like:

match (p1:Person), (p2:Person),
path = allshortestpaths( (p1)-[*]-(p2) )
where contains "Morgner"
and contains "Webber"
return path

and the result would show the linked path immediately:

Query 3: look at the links between a company and a person

Very similar to the above, of course:

match (c:Company {name:"GraphAware"}), (p:Person {name:"Jim Webber"}),
path = allshortestpaths( (c)-[*]-(p) )
return path

but the path seems to be a bit more elaborate:

And then last but not least:

Query 4: look at sessions with more than one speaker

I always find this interesting:

match (s:Session)-[r:SPEAKS_IN]-(p:Person)
with s, collect(p) as person, count(p) as count
where count > 1
return s,person

And the result is something worth exploring, always.

Of course there are PLENTY of other queries that you come up with - but that's for you to do in the next two weeks as you get ready for the conference :) ...

Also available as a Gist or Guide

Before we wrap up this blogpost, I would also like to add that you can look at the above data in a Graphgist and a Browser Guide:

  • The Graphgist is available over here.
  • and if you just type


    in the Neo4j Browser you will see it inside the browser immediately, including the above queries to add the dataset to your own Neo4j Instance and everything. Note that you do need to allow the browser to fetch guides from other, trusted URLs, and therefore you need to add this property to the neo4j.conf file:

    and then you get this:

All of the above is hosted on Github - so please head over there if you want to play around with it some more.

Hope this was useful and interesting.



Tuesday, 25 April 2017

Autocompleting Neo4j - part 4/4 of a Googly Q&A

In the firstsecond and third posts in this series, I got round to finally aswering some of the more interesting "frequently asked questions" that Google seems to be getting on the topic of Neo4j.
Today, we'll continue the last part of that Q&A, and answer two more questions which - funnily enough - are kind of related. They both deal with the query language that people use to interact with their graph database. Neo4j has been pioneering openCypher of course, but clearly there are alternatives out there - and people need to make an informed choice between query languages, of course.

Monday, 24 April 2017

Autocompleting Neo4j - part 3/4 of a Googly Q&A

In the first and second post in this series, I explained and started to explore some of the more interesting "frequently asked questions" that seem to surround Neo4j on the interwebs.
Today, we'll continue that journey, and talk about Lucene, transaction support, and SOLR. Should be fun!

2. Does Neo4j use Lucene

This one is a lot simpler to answer - luckily - than the scale question that we tackled in the previous post. The answer is: YES, Neo4j does indeed leverage the (full-text) indexing capabilities of Lucene to create "graph indexes" on specific node-label-property combinations.

Friday, 21 April 2017

Autocompleting Neo4j - part 2/4 of a Googly Q&A

So in the previous post, I explained my plan of doing a series of blogposts around the most frequently asked Google questions as recorded and suggested by Google's Autocomplete feature.
We'll start this week with the most asked question of all - which I get all the time from users and customers - and it's the inevitable "scale" question. Let's do this.

1. Does Neo4j Scale

Let's start at the beginning, with the first question that lots of people ask is: "Does Neo4j scale?" Interesting. Should not surprise anyone in an age of "big data" right? Let's tackle that one.

To me, this is one of the trickiest and most difficult things to answer - for the simple reason that "to scale" can mean many different things to many different people. However, I think there are a couple of distinct things that people mean with the question - it least that's my experience. So let's try to go through those - noting that this is by no means an exhaustive discussion on "scalability" - just a my 0,02 Euros.

Thursday, 20 April 2017

Autocompleting Neo4j - part 1/4 of a Googly Q&A

As you can probably tell from this blog, I have been working in the wonderful world of Graphs for quite some time now - Neo4j remains to be one of the coolest and inspiring products I have ever seen in my 20 odd years in the IT industry, and it certainly has been a thrill to be part of so many commercial and community projects around the technology in the past 5 years. Not to mention the wonderful friends and colleagues that I have found along the way.

One thing that does keep on amazing me in working with Neo4j, is the never ending
  • stream of use cases, industries and functional domains where graphs and graph databases can be useful
  • stream of new audiences that we continue to educate and inform on the topic. Every time we do a meetup or an event, we seem to tap a new source of people that are just starting their journey into the wonderful world of graphs - and that we get to talk to and work with along the way.
When dealing with these new audiences, it's also pretty clear that we ... keep on having the same types of conversations time and time again. Every new graphista that gets added to the community, is asking the same or similar kinds of questions... and most likely, they are going to google for answers.

This leads me to the topic of this blogpost, which is both fun and serious at the same time: we are going to try and autocomplete neo4j :) ...

Autocompleting? What's that again?

When we talk about autocomplete, we talk about this amazing technology that Google has built into it's search functionality, that completes your search query as you type - often times "guessing" what you will be looking for most likely before you even thought about it... it can be pretty interesting, even eerily scary sometimes...

Friday, 17 March 2017

(Another) Podcast Interview with Alistair Jones, Neo Technology

Just before ending the week, I thought I would publish another great episode on our Graphistania podcast. Ever since the launch of Neo4j 3.1, I had been wanting to do an episode about the new Neo4j clustering architecture. It's so innovative, new and a great piece of engineering - we just had to sing its praise :) ... So who better to invite back to the podcast than Alistair Jones, who was one of the lead engineers at Neo Technology to pursue the effort. Here's our chat:

Here's the transcript of our conversation:
RVB: 00:02.563 Hello everyone. My name is Rik, Rik Van Bruggen from Neo Technology. And here I am again recording the second podcast of this year. I know it's only two months into the year, so I've been slacking, but I [laughter]--
AJ: 00:15.346 You've been picking up the pace again.
RVB: 00:16.082 Yeah, picking up the pace again. And for the second episode, I have invited a returning guest to our podcast, and that's my friend and colleague, Alistair Jones, from the Neo Technology engineering team. Hi, Alistair.
AJ: 00:28.380 Hi, Rik.
RVB: 00:29.089 Hey, thank you for making the time. I know you're a busy man these days, so thanks for taking the time. Alistair, the reason why I invited you back is because I know you've been hard at work in the engineering team, on some of the really big, new features in Neo4j. 3.1 was released at GraphConnect San Francisco last year. Or, no, it was actually announced and was released a little bit later, but one of the biggest new features in Neo4j 3.1 was the new clustering architecture, right?
AJ: 01:03.036 Yep.
RVB: 01:03.119 And that was what you and your team were working on?
AJ: 01:05.054 Yeah, it was a really big thing for us, actually. So I've been working on this area for nearly two years, actually, on this new clustering architecture. And as you know, Neo4J is a clustered database designed to run over multiple servers. And we've had clustering in place for six or seven years in Neo. This is the biggest change we've ever made, by miles. It's a huge, huge upgrade of all the technology around the clustering.
RVB: 01:40.760 Wow. I remember like in version 1.8 it was like Zookeeper that was doing some of the work.
AJ: 01:46.629 Yeah, we had a small change in the 1.9 release back in the day.
RVB: 01:54.312 Back in the day, yes.
AJ: 01:55.544 This is a much bigger release in 3.1.
RVB: 02:00.387 So what's it all about?
AJ: 02:01.882 So the first part of it is getting up to date. So the world around us has moved on, and one of the great things about Neo is that we can take research from academia and actually apply it. So reasonably recent stuff that, if you read the academic papers and blog about it, we read all those, and some of those things we can put them fairly quickly into the product. So, for us, this time, it was doing the Raft protocol, which is a consensus algorithm. So what that means is getting agreement between participants, so computers in this case. We--

RVB: 02:52.316 Members of the cluster, right?
AJ: 02:53.396 Yeah. So, in this case, it can be different services in the class are getting consensus between those servers, when the servers themselves and the communication between servers is potentially unreliable. So you need to account for the unreliability in the design. Now, we know a little bit about consensus algorithms because previously, back in that 1.9 release, we implemented Paxos. And, at the time, that was the kind-of state-of-the-art thing to do. Raft, you could argue, at some theoretical level, is the same thing, but it's much more clearly structured. Raft is--
RVB: 03:38.960 You mean a consensus protocol, right?
AJ: 03:40.209 Yeah, yeah, exactly. So it's from Diego Ongaro, who's the lead researcher in this area, and it's really impressive how it's described.
NOTE: Diego reacted to this part of the podcast with a super-cool tweet:
It's actually aimed to be simple to understand and to explain. And that makes it really good to implement because you can be very clear about what you've done. You can see the direction that you've gone in. So we've changed from one consensus algorithm to another.
RVB: 04:13.411 Yep. Which is a big change [crosstalk].
AJ: 04:14.871 Which is a big change, but architecturally it's totally different, because previously we were using Paxos to agree on membership of the cluster. So actually a very small amount of data. Not that many servers. They don't go that often. Now what we're doing is we're using Raft, and we're using it for every single transaction in the database. So every single node, relationship, property you create in the database it goes through the Raft protocol. You've got consensus across the cluster. And what that means is that every single change is agreed to by a majority of the cluster, so no matter what happens in terms of loss of connectivity or failure of the minority of the servers, still, the cluster as a whole agrees on what the state is as you move forward, so--
RVB: 05:06.995 Sounds a bit like open heart surgery to me.
AJ: 05:09.230 Yeah, it's quite a major change, but it's actually really nice. Once you've got that super solid foundation, you can build a whole load of things on top of it. So it's extremely solid for-- it's like the most reliable we could make it, and it stores every single transaction in this replicated log across all members of the cluster. And also as the membership changes, that's agreed to with protocol as well, so you know every time who the people were, who the servers were. People were allowed to [inaudible] transactions and to get them committed. So the whole thing's very tightly integrated into the core of the clustering.
RVB: 05:52.520 So I would never claim that I understand everything about it, but what I've read is that it's very different architecturally in terms of-- previously we had masters and slaves, now we talk about cores and edges, right?
AJ: 06:05.636 Yeah. The second part of this is that we were aiming to have much larger clusters than people had previously been running in Neo. Neo's been around for a long time. And, previously, people used to think of having 3, 5, 10 servers being a large database cluster. Now people want to run hundreds of servers, and we have customers and users running 200 servers in a database cluster. We want to be able to get higher than that, and the consensus algorithm that we were using before, the design of it, or perhaps the membership, yeah, it had a sort of limit on the-- or do we say kind of--? It was hard to get to that scale.
AJ: 06:55.530 And the reason is that all of the servers had to be aware of each other and what they were doing at any stage to basically make sure that they hadn't disappeared. So that led to heartbeats going from every server to every other server, and that ultimately gets very expensive when you have a large number of servers. It also gets very difficult when you're committing across the majority of the servers because you have to wait for a large number of them to come back before you can say, "Yes, this is now safely committed."
AJ: 07:30.229 So just having one huge cluster of Raft servers is not a good design for that kind of hundreds of servers or thousands of servers. So we came up with a new architecture. And what we do now is we divide the cluster into two groups. We mark some of the servers as being in what we call a call. Call servers participate in Raft and they are about safety. They're about storing your data durably. Secondly, we have a lot of potentially much larger group of read replicas. And these are servers that are for running your queries on, and--
RVB: 08:14.895 Read queries, not write queries.
AJ: 08:16.298 Yeah, yeah, read queries. So you don't have to worry about safety here, and the idea is these are about-- they're disposable, where you can scale them up and down; when your web traffic is high a certain time of day, have more and more of them.
RVB: 08:30.725 Just have more of them, yep.
AJ: 08:30.721 [inaudible] your cloud instances when it's quieter, and you can adapt to the shape of your traffic with the read replicas. What's interesting is that the name read is that we're doing more service than reading. Why does that make sense in a--? How does that help you in a database, have more read only things? Surely you need them more so to write. Well, that's because of the shape of graph data. It's because, actually, when we look at the-- I'll show you, because it's a nice slide [laughter] with audio only. You're looking at a slide that shows kind of how we see people do stuff with graphs, and what you notice is that the right [inaudible] updates tend to be quite small.
RVB: 09:17.172 Local [crosstalk]?
AJ: 09:17.605 Yeah, very, very local. Like, two or three nodes in relationships, up to maybe 100 things in a transaction, whereas on the read side - the whole point of graphs is to really fast, and people go a long way - they traverse along the graph in a read transaction. So they're doing hundreds of thousands of relationships in one transaction. Now, that's very fast, but it still takes resources. It takes memory bandwidths, it takes CPU to run these queries. And that's what people are really hammering their graph with, thousands of these, each very big, queries. And that's an enormous amount of computational load. We want to spread that across a lot of servers, and this is a way to do it - have loads of re-replicas that can handle that traffic for you. So it is really helping you in the kind of [inaudible] applications. It's a very specific architecture to the type of system that we're building.
RVB: 10:13.028 Pretty cool. And so, as I understand it, the core is-- so they're all about the safety, and about writing to the graphs, and the age servers are all about reading. Is there any downside to this? Is this good news show all the way around, or are there some things that we should take care with?
AJ: 10:34.656 So there's one thing that's just like-- a challenge here for people when they're deploying these type of applications, is that the transaction's being pushed out from the core, out to the B replicates, and there's some delay in that happening. It's very small, but there is some delay. So people call this eventual consistency, and this is something that we're aware of. And lots of modern sort of web systems that you get into this kind of eventual consistency situation. An example of this that could kind of catch you out is, say you're a user, you create an account, or you make a booking, that's a right transaction. It updates the graph. Then when you come to refresh your page, you try another operation and it's a read only operation, maybe you hit a read replica that hasn't quite seen your update, so, as a user, it almost appears like the thing you just did has disappeared, like you've gone back in time. There's a bit of a--
RVB: 11:45.015 It's [crosstalk] read your own writes problem.
AJ: 11:46.385 Yeah, I can't really-- so what we did at the same time as this, is we actually added a whole new feature that became the name of the whole clustering architecture. So this is what I like to call causal clustering, because we added in a feature of causal consistency.
RVB: 12:08.446 Tell me more about that, because I don't know what that means [laughter].
AJ: 12:10.836 Okay, Rik. So causal consistency. So it's actually something that's been-- again, from research, there's some academic and industry research in this area, but it's not very commonly implemented. There are only a handful of other implementations out there, and what it's about is trying to represent what causally has happened in the user's application. So the cause and effects of the changes that you've made.
AJ: 12:45.088 Practically, it's very easy to use. What happens is that when you update the graph or when you touch the graph in any way, the database can give you a bookmark. And this bookmark represents the latest thing that you've changed or the latest thing that you've seen in the database. And then when you make another request to any other server in the cluster, you can supply that bookmark that's saying bookmark, and the database will make sure that it has at least as up-to-date a state as the bookmark represents. So the bookmark is just a little string and it comes back to your database driver into your application code. You can store it in your application server, or you can hold onto it temporarily while you make another inquiry, or you can send it all the way back to the client. You can send it back to your web browser or your mobile device, and route it back, ultimately, to the database.
RVB: 13:46.373 So that basically assures that the client of the database always takes into consideration everything that it calls [crosstalk]?
AJ: 13:54.484 Yeah, it prevents you from going back in time--
RVB: 13:56.751 Ah, yeah, that's it.
AJ: 13:56.890 --is what it does. And it supports a totally stateless architecture - everything between the user and the database. The database is storing state. Why should you need to store it anywhere else? So this is [inaudible]. Your sessions, you don't need to worry about sophisticated routing. Just have stateless application servers, pass your bookmark around, and you get causal consistency. That's the idea.
RVB: 14:28.017 Wow.
AJ: 14:28.699 And we've tried to make this even easier to use by building some of the primitives. The kind of passing backwards and forwards keeping track of things is built into the database drivers. So in 3.0, we introduced--
RVB: 14:42.424 BOLT drivers, right?
AJ: 14:42.490 Yeah, the BOLT drivers. So they initially supported native language drivers in your [crosstalk]--
RVB: 14:48.561 Right. And so the new version of the driver supports this bookmarking--
AJ: 14:51.515 Exactly, yeah.
RVB: 14:52.721 --and that gives us the causal consistency.
AJ: 14:54.338 The causal consistency, yeah. Exactly.
RVB: 14:56.476 So let's talk a little bit about the future. What's coming up? What are you working on now, and what keeps you up at night, and [laughter]---?
AJ: 15:03.404 Yeah. Well, [crosstalk]. I mean, it's kind of following on logically from where we are now, so the next stage of this is to be-- it's that kind of how people actually deploy this stuff. And these days, not just a cluster of servers that are using it to run a database. It's also servers across multiple data centres and multiple regions around the world. Around the country, all around the world. So that's what the cloud environment's been very easy to do, to have geographic distribution. And we are taking account of that feature in the product, or that server usage in the product. So what we're going to do is make the clustering aware of data centres and how they're organised, and allow the client to give hints about how might be the best way to serve it. So that means that you can do your reads from a server that's very close to you, with a low latency, and you can support fault tolerance across data centres when one of them goes away, or explicitly recover in a disaster recovery zone. All of these different operational scenarios. So--
RVB: 16:24.512 Is that something that's coming up in the next couple of versions of Neo4j or--?
AJ: 16:27.024 Yeah, yeah. So in the next couple of versions, that's the stuff that's going on. And, again, it's to be seamless all the way through the driver, so you write your application once for Neo4j on your laptop, and then it should move forward [inaudible].
RVB: 16:46.117 That's very cool. I have one more question. Don't you miss the visualisation stuff that you were doing before [laughter]?
AJ: 16:52.896 Yeah. So I always miss the visualisation. I try to devote my spare time to get back into it every now and then, so--
RVB: 17:03.147 Very cool. Well, thank you so much for spending your time, Alistair. I mean, we want to keep these podcasts fairly short, but I'm sure we'll include a bunch of links to the documentation and the blog post that we wrote about this topic. I really appreciate you making the time, and look forward to seeing what's up next.
AJ: 17:21.907 Thanks very much.
RVB: 17:23.060 Thank you. Bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best


Monday, 6 March 2017

Podcast Interview with Kristof Van Tomme, Pronovix

Last month I had one of those cool encounters of the graph kind at the Belgian Beerfest that we have been organising a couple of times in the the last few years at the occasion of Fosdem - the amazing open source conference that's taking place in Brussels every year. This year, I got talking to a fellow countryman that has been doing some amazing work on integrating the Drupal content management system with Neo4j - something that has a lot of potential in a lot of areas, I think. So - we just HAD TO have a chat :) ...

Here's the transcript of our conversation:
RVB: 00:03.346 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo Technology. And here I am again the third time in two days, this is wonderful, I'm on a roll here, recording another podcast for our Neo4j Graphistania podcast. And today I have a fellow Belgian on the other side of this Skype call, and that's Kristof Van Tomme from Pronovix. Hi, Kristof. 
KVT: 00:27.466 Good morning Rik. How are you? 
RVB: 00:29.593 I'm really well, and I hope the Skype gods bear with us, because we've had some trouble in the past couple of minutes, but I'm sure it will fine. Hey, Kristof, we met each other at the FOSDEM conference, which was a great experience, and I loved the Beer Fest afterwards [laughter]. But yeah, you told me about some really great stuff that you guys are doing with graph databases. So, first of all, let's start from the beginning, who are you, what do you do and what's your relationship to the wonderful world of graphs? 
KVT: 01:07.202 So I'm a bit of a weird duck because I'm actually a bioengineer who ended up in IT through a biotech startup that did research in schizophrenia. It's a whole other life. But I got involved in the Drupal community a little over 10 years ago when we started making websites for biotech companies. 
RVB: 01:35.332 Okay. Drupal is like a content management system, right? 
KVT: 01:38.557 Yes, Drupal the open source content management system. The other really good Belgian product after beer and chocolates [laughter]. And I got really strongly involved in that community 10 years ago. I helped organise one of the big European conferences, and then we built a consultancy around that. Then, about five years ago, I got really excited about documentation, and reuse of documentation specifically, and how to deliver it and reuse bits and pieces so that you could build deliverables that can easily reuse between different channels. And that's how I got excited about graph databases, and Neo in specifically. 
RVB: 02:32.949 When you say documentation, you mean technical recommendation for software, right? 
KVT: 02:35.667 Yes. Yes, I do. The thing that everybody's like, "Ooh, documentation." 
RVB: 02:41.017 Ah, damn it. Yeah, exactly. 
KVT: 02:44.417 So that's how I got involved in-- because we had one of our colleagues, a long time ago, I think six years ago or something, started playing with graph databases, and actually, he built a first connector for Drupal for Neo. And he's like, "Kristof, I did this thing, and I'm really excited about graph databases, and I think it's cool. Can we do something with this?" And I was like, "I have no idea." So that was the first connector for Neo for Drupal, and then that kind of died because there was-- technically it was there, but then there were no further implementations, and I was not sold, and people didn't figure out how to use it. But then because of the documentation thing, I actually started seeing what you would use a graph database for and that's when I got really excited. 
RVB: 03:46.370 Super cool. Because documentation, I don't know if you notice, but this is where Neo4J started as well, as an open source project, 15 years ago, Viking hackers in a garage. They were all about content management at the time as well because they were working for a media company that was managing digital assets. So it's funny that there's this convergence or link between the two worlds, right? What is the use case all about? How does it work? 
KVT: 04:20.696 So I've been thinking-- I've got this DITA, which is another of those words. It's a standard that's fairly popular in the technical writing community for writing reusable documentation. It's like an XML standard. Some people scratch their heads when they hear about it, and other people are raving mad about it. So in the DITA community, I've been doing talks about consult management systems and open source and things like that. I think two years ago, I started thinking about personalisation and embedding information. What I dream about is this; instead of having a manual that the documentation system knows who you are and serves you the right information when you need it. I did a talk about that at the DITA conference here, I think it was in Europe, and I was thinking, "So how would you do that?" And then I started thinking yeah, actually, probably it wouldn't really work with a relational database because you need to start collecting a whole lot of information and start analysing for patterns. And that's how I started thinking about Neo and graph databases more in general. 
RVB: 05:48.382 So as a personalisation engine for documentation, right? So you wouldn't need to search for documentation as much, but you would have a recommended set of documentations that would be served to you semi-automatically. 
KVT: 06:04.195 Yeah. So it's the idea that, for example, you're in an application, you're in a web app, and you can't find that one damn button that you know is somewhere-- 
RVB: 06:16.996 We've all been there.
KVT: 06:18.043 Yeah, we've all been there. So you're clicking around, and you're going through settings, and I don't know, connections, so you keep going circles and circles and circles because you can't find the damn button. And at that point, the system would say, "This looks a lot like what people do when they're looking for this thing," and then you would get a little pop-up saying, "Are you maybe looking for this?" And similarly, if you're using a certain feature and you're doing something really weird and other people have done that, and then they went through the documentation and found some other feature, then you could shortcut that and skip a few jumps in that graph and immediately serve them the information that they're looking for. So it's kind of like analysing patterns of behaviour that people have inside of a web application and then serving them-- that's patterns of behaviour that they normally do just before going to documentation sites and then serving them that documentation that people normally will find when they go to documentation site after they've done a certain thing, and then serving that information to them. So that's one of the really cool things that I would like to do. 
RVB: 07:32.834 Yeah, I understand. So why is that such a good use case for a graph database? Is that because of the pattern recognition, or what's the secret sauce? 
KVT: 07:44.973 So it's the pattern recognition. So I think CMSs are really good at storing data in a-- storing similarly structured information because most of CMSs use SQL databases and they're pretty good at that, just building up a content model and then reusing that over and over again. But being able to recognise behaviour-- well, that's not something that we are normally doing in the CMS space. We have some very basic things, like there's some recommendation based on the content and shared keywords and things like that, but behaviour analysis is not one of the things that you normally find in the CMS. So for that, we need different technology because in a SQL database you would have to do so many joints to even figure out what's going on, yeah, that I don't think that it would make sense to do it that way. And ideally, it would be a system that you don't have to program everything but that it can start looking for patterns on its own eventually. And that you build this graph of interactions and content and kind of like a graph that combines those two to do things with that. So yeah. 
RVB: 09:04.602 So where are you guys with this? How far along that path are you? I know you've done some prototyping already, right? 
KVT: 09:11.567 Yeah. So we are very, very early. So our main business right now is developer portals. So two years ago we started working-- well, a year and a half ago we started working with APG, that's now part of Google, and they have a developer portal that we are customising for their customers. And we built this whole business around documentation, specifically about APIs, so that's where our core focus is right now. And so the AI and personalised documentation is something that we're doing research on. So the thing we've done currently is we've built a connector for Drupal for Neo - I did a talk about that at FOSDEM - and that was-- 
RVB: 10:02.991 I went to that one, yeah. 
KVT: 10:04.291 Yeah. So that talk was not just about this use case. It was about what could you do if you combine a CMS and a graph database and looking at it from an added-value perspective, rather than a replacement perspective. Because I know that in the DO community people are like, "Just get rid of the stupid SQL databases [laughter]." They're worthless and graph databases can do everything so much better. I think--
RVB: 10:37.056 That's a pipe dream in my opinion. 
KVT: 10:38.368 Probably. You could build a CMS graph database, and I think that could work. But I think that there's so much existing technology already where it's a large amount of extensions and huge communities that it would make more sense to create an add-on instead of a replacement because if you replace it then you have to rewrite everything. 
RVB: 11:05.120 I couldn't agree more. 
KVT: 11:06.124 Yeah. So that's why I think that's their sweet spot for Neo in the CMS community but I think there's two stress facts to this. One is the sweet spot for neo in the CMS community, and that could be recommendation and pattern finding. But then there's also the inverse that you could think about and that's what if you were to put an open source CMS like Drupal in front of a graph database and we use it as an interface to manipulate the graph and to add, maybe, some structured objects into your graph? And then use the CMS to build reports about those objects and the graph to find out which ones you're going to put into your reports. So that was my talk about. 
RVB: 11:57.922 Well, you've already touched on my last question, which is what does the future hold [laughter]? What could we do in the future? And I know that we'll be doing some meet-ups together and I'm really looking forward to those, but where does this go, Kristof? What's in your crystal ball? 
KVT: 12:21.174 So I love thinking about a future. I really love Kevin Kelly's book, The Inevitable. And in that book, he talked about-- I think this is the basic pattern that got me thinking about this, also. He talks about flowing and it's a very, very interesting concept that we're moving from an Internet where we used to have documents to an Internet where we have pages today, where we'll have flows of information tomorrow. And this idea of going from having an object that's structures and it has a context-- has a manual context, or a book context, or a document's context where you put all the information in context of the rest of the book into a very rigid structure. That's how we used to do things. That's how books and manuals were built, even when printing press-- even before the printing press was invited. And what the Internet has been doing, and what search engines have been doing, is that we've been moving towards pages where you can just dive into any object-- sorry, any document, any book, and just find out one page where a certain concept is explained. So you can just jump in. You don't have to read the whole book to be able to understand something. And that's where we are today. But I think that's the next step in this process, and it's also what Kevin Kelly talks about is flows, where you have a flow of information that's much more personalised, and we're just constantly dipping in and out of these information flows around us that are serving us the documentation that we need at a certain time to be able to do what we need to do and that are aware of our contexts so that we don't have to adjust to the context of the documentation, but the documentation adjusts to our own personal context, and I think-- yeah? 
RVB: 14:31.872 So what I'm hearing is you see this graph database integration and everything that you guys are building as a means to that end, to get there somewhere, somehow, to get closer to it. 

KVT: 14:44.560 Yeah. So we have a first customer where I've been talking about this concept, and-- they're an SaaS company. So what I imagine is that we could track users, the administrators as interacting with the software, and then basically serve them the contents this way where you look at their whole experience inside of your tool, and then you serve them the information they need to be able to interact better and get more value out of your system. So it's kind of like the idea-- the way that I describe it going from the context of the manual to the context of the one, like the one person, one single user and how they are interacting with the system. This is very, very-- there's a lot of work to get here [laughter]. But I think that we can take baby steps, start with first implementation. Start with building a graph of the behaviour and how people interact with documentation and with the tools that are documented by the documentation and then use that to start recommending content. And yeah, I'm really excited about it. We started a mailing list about it at one of the meet-ups where I was presenting. We actually had one of the people that worked on the Clippy years and years ago at Microsoft who was also really excited about the idea. Because I think this is actually what Clippy wanted to do, or wanted to be, but it was not possible. And I think that graph databases could be the piece of technology that enables the dream of Clippy [laughter]. 
RVB: 16:40.452 Well, I think on that bombshell [laughter], I think that's a great time to kind of wrap up this podcast. Thank you so much for coming online, Kristof, and we'll be publishing some more details around your work and also the talks that you've been doing with the transcription of the podcast so people can read up about it. And I look forward to seeing you at one of our meet-ups, right? Because we'll be doing some community work together in the next couple of months as well. So really looking forward to that. 
KVT: 17:12.653 Likewise. 
RVB: 17:13.552 Thank you so much. Have a nice day, Kristof. 
KVT: 17:16.253 Yeah, you too. 
RVB: 17:16.990 Bye. 
KVT: 17:17.383 Bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best


Thursday, 23 February 2017

Podcast Interview with Gábor Szárnyas, Budapest University of Technology and Economics

Waw. That was probably the longest stretch that I went without publishing blogposts or podcasts over here. I have no real excuse - the start of 2017 has just been super busy and interesting - with a lot of travel that does not really help with quiet "writing" time. But it's all great fun - I just need to get back into the rhythm - and today is the start of that.

Today's podcast is actually super cool. It started at a beautiful Brussels bar after Fosdem. At this conference, there have been "graph devrooms" hosted for the past couple of years - and this year it was a really nice lineup.  One of the speakers, Gábor, did this really interesting talk about "Graph Incremental Queries with OpenCypher", which is really cool. So after the conference, it turned out we share a passion for cycling too - and we decided to get together for a nice recording. Here it is:

Here's the transcript of our conversation:
RVB: 00:04.202 Hello everyone. My name is Rik, Rik Van Bruggen from Neo Technology and I must confess I feel very, very guilty now because this is the first time that I'll be recording a podcast in 2017, so happy new year. In spite of the fact that it's Valentine's Day. But yeah, I was slacking a little bit but I want to bring the podcast back to life and I've lined up a bunch of people to help me with that. And today I've invited someone who I've who only met like two weeks ago at the FOSDEM Conference in Brussels. And that's Gábor Szárnyas from Budapest. Hi Gábor. 
GS: 00:42.680 Hi Rik. Nice to be here. 
RVB: 00:43.500 Hey. Thank you for joining me. It was a great time meeting you in Brussels over some Brussels beer, but yeah we talked to each other about your work and I thought it would be great to have you on the podcast. So my first question is going to be who are you, and what do you do? What's your relationship to the wonderful world of graphs? 
GS: 01:10.158 Okay. So I'm a researcher at Budapest University of Technology and Economics. And also visiting researcher at McGill University in Canada. Now I'm working on finalizing my PhD, so hopefully I will be finish it within a year or a half. And I worked basically on graph- related topics in my PhD. 
RVB: 01:33.134 Oh, very cool. And don't forget you share another passion with me. 
GS: 01:38.380 Yeah, I'm also a cyclist. 
RVB: 01:40.152 Yes, exactly. 
GS: 01:40.729 So I started road cycling three years ago and it absolutely wondered me. I really like cycling-- 
RVB: 01:49.279 Same for me...
GS: 01:50.351 --and that's my main passion. 
RVB: 01:51.948 Same for me. We have a couple of other graphistas that are super passionate about cycling so we'll have to do a ride sometime. But tell us-- 
GS: 01:59.412 I agree. 
RVB: 01:59.558 --a little bit more about your work with graphs. What's it all about, what's your PhD about, and what are you working on? 
GS: 02:07.503 Okay. So my PhD revolves around three topics that are related to graphs. The first one is how to incrementally query graphs. So imagine that you have a complex query and you have a huge graph. Now obviously, it's very difficult to evaluate a query on the graph at a very short amount of time. So basically, as a workaround, we do incremental queries, which means that if your graph changes slightly then we maintain the result sets. And this is useful for a number of scenarios. You can use it for static analysis of code bases, you can use it for runtime modelling, you can use it for fraud detection, and so on. There are many use cases that present this scenario. 
GS: 02:52.025 The second topic of my PhD is how to benchmark an incremental graph query engine. Because, obviously, once you have an incremental graph query engine, you would like to have some feedback on its performance. And you would like to use that to continuously improve your query engine. So, with my research group, we designed and implemented a framework that allows users to do just that. Compare incremental graph query solutions to each other and to other competitors. 
GS: 03:22.765 And the third one-- yes? 
RVB: 03:22.870 Is that related to the LDBC work, the Linked Data Benchmarking Council, is that related to that? 
GS: 03:30.529 So basically they have similar goals. I was actually at Walldorf last week at LDBC Technical User Community Meeting. And LDBC has a couple of benchmarks, but currently none of those covers incremental graph queries and complex graph pattern matching. I talked to the LDBC guys and also attended the talks, and it seemed that there will be a new LDBC benchmark, which will have similar goal than my benchmark. And that will be called the Business Intelligence workload for the Social Network Benchmark. And the problem with that is that it's not yet ready. So I talked to it's core developer, Alex Averbuch, and he said that it will be ready within half a year but they are still heavily working on it. 
RVB: 04:29.082 Okay. But you had said that you had three goals, right? You had the incremental queries and then the benchmarking and what was the third one? 
GS: 04:34.976 The third one is closely related to network theories. A network theory is something that came up in the late '90s in the early nodes when people started to analyze graphs. So they took a graph of people where the nodes were the people in a community and the relationships were if they were friends or not. Or they took the graph of the World Wide Web where the nodes were the web pages and the relationships were the links between the web pages. So they took all these graphs and started to analyze them, and they derived very interesting properties, chief among which was the scale-free property of graphs. There are many papers on scale-free networks, and they discovered that this is very common in biology, in sociology, also in physics and other sciences. 
RVB: 05:28.488 What does that mean, scale-free networks? What does that mean?
GS: 05:30.744 So basically scale-free network means that the degree of distribution of the nodes follow the so-called power law. So you have very few central hubs. And basically, if you remove these hubs from the network then your network will break down to smaller components. And they discovered that this is how societies are organized, this is how citation networks work, and this is how power grids work as well. 
RVB: 06:00.783 Oh wow. Just like a universal structural characteristic of lots of networks. 
GS: 06:06.958 Yes, lots of networks. Obviously you cannot apply to all of the networks but it was a very big surprise to the scientists who worked on it that a lot of networks exhibited this property. So how does my PhD research relate to that? Well interestingly, there wasn't much work performed on tide graphs. So if you see Neo4j graphs, you obviously see that you don't only have people and websites and books, but you have all these inner single graphs. So you have tide graph, and they also have different relationships between them. And only in the last five to ten years have been there research about how to characterise these graphs. These have many interesting names. Some people call them the multiplex networks, others call them the multidimensional networks or multilayered networks. Analysing these is very tricky because obviously you have another dimension of complexity by having to deal with all the types of the nodes and the relationships in the graphs, but it's kind of a green area and you can do a lot of interesting work in it. I actually applied it to engineering models, so my research group works in model driven engineering. And there are engineering models for software, hardware, state machines, system design and so on. And basically we took all these models and analyzed them and we looked for some interesting properties. 
RVB: 07:58.123 Wow. 
GS: 07:59.168 We didn't find any huge results so we didn't find that these models are scale-free or they follow some very famous distribution. But we did have some interesting results on how to characterize these models. 
RVB: 08:18.190 Wow, very cool. So could you tell us a little bit more about how you got into the graph business, or the graph science if I may call it that way? How did you get into it, and why did you get into? 
GS: 08:35.661 Okay. Well, that's an interesting question. I think it started in 2011 when I had to pick my first individual research topic at my university, and my roommate
suggested that I should give a try to node secure databases. I was already very interested in anything that's related to databases, relational or not. So I started to work on node secure databases. And then I soon discovered Neo4j and the property graph data model. And I think what really struck me is how intuitive the graph data model is. There is actually a paper by Marko Rodriguez, who was the implementer of the TinkerPop framework, and he said that graphs are very intuitive because they describe the way that people use when thinking about the world. So people tend to abstract the world as things that are somehow connected. And you can perfectly describe this with graph nodes and graph relationships. So this is something I really like about graphs. And that's something that you also mentioned in this podcast, I think a couple of times, that you can use a whiteboard and then just start brainstorming, and having ideas, and drawing a graph. And you can use pretty much the same graph in your applications as well. So that's my favourite thing. 
RVB: 10:07.046 Jokingly, I always talk about my own acronym, which is WYDIWYS, what you draw is what you store. 
GS: 10:14.439 Yeah, that's a catchy acronym actually. 
RVB: 10:18.913 It's been repeated so many times on this podcast but it is a very big strength of graphs, right? The model is so intuitive and so descriptive, so rich, really. That makes a whole lot of difference, right? So I'm reading that that's also how you got into it, right? That's also why you think it's very valuable? Is that right? 
GS: 10:43.860 Yes. So basically after I got a bit familiar with the topic, I started my master's at university. And already during my master's I was working on the incremental query engine that I'm still working on today. So it's quite a long project. I've been doing this for five-plus years. And I really liked my experience during the master's so I joined the PhD and I just finished PhD school three weeks ago. So now it's only-- 
RVB: 11:11.500 Congratulations [laughter]. 
GS: 11:13.087 Thank you. So it's only up to me to publish some more papers and polish a dissertation. 
RVB: 11:21.283 So what does the future hold, Gabor? Where is it going for you personally? Where is your research taking you, but also how do you look at this taking ground in the broader industry? What's the future hold if you had a crystal ball? 
GS: 11:36.571 So, I would really like to be an academic. I really enjoy working at university because you have so many positive experiences with students. You can pretty much follow your own dreams and do research in almost whatever interests you the most. Obviously you have to fit within your grant proposals and your funding but this still gives you a lot of way to be creative and I would like to be a university lecturer and researcher in the future. So that's my kind of dream career. And-- yes? 
RVB: 12:17.317 And is it lecturing and teaching about graphs then or is it on a broader topic or is it computer science or what will be the topic then? Or topics? 
GS: 12:26.893 Well, I'm pretty much happy to teach anything relates to computer science, so I've taught topics from database theory to automata theory, system modelling, and software engineering topics, and also some laboratories on actual technologies. So our university is a bit of a mix between computer science and computer engineering. So we teach both theoretical and practical stuff and this is something that I also really enjoy. 
RVB: 13:01.647 Super. And what about the wonderful world of graphs and graph databases, is there anything like that in your future you think? 
GS: 13:10.251 Yes. So I really would like to get a version of my graph query engine that can be used by other researchers. I obviously understand that implementing production-grade software is not really possible within the limits of a PhD. But I would like to release a system that can be used at least by other researchers, both in academia and both in industry. I talked to a lot of people about this and it seemed that people would actually be interested in trying such a system, or benchmarking such system, and see how it works for their use cases. 
RVB: 13:49.818 Super. So final question, what's your favourite cycling destination? 
GS: 13:54.706 Ooh, that's a tricky question [laughter]. 
RVB: 13:56.737 Curveball for you. 
GS: 13:56.958 But actually, it's not a very common answer. I live next to the Hungarian-Austrian border, so I do go a lot to Austria because Austria has the best roads in Europe, and also most of the country is the Alps. So I live next to the lower Alps section, but even there you have very nice hills, and drivers are really polite, and you have these super flat tarmac all over the country. And that's what I really enjoy and I'm really looking forward to the summer. So I just usually disappear from the university for a couple of weeks and then go home and cycle. 
RVB: 14:38.375 Excellent. So no cobblestones for you? Unlike Flanders Classics or something like that? 
GS: 14:44.387 I actually really like riding the [inaudible], so I live in the inner historical district of Budapest and we still have a lot of cobblestone roads. And when I just started cycling in Budapest just to get to work and commute I usually tended to avoid those sections. But since I'm more into cycling I just go for the most cobblestoney sections [laughter]. This is something that you learn to enjoy or at least you think you enjoy it. 
RVB: 15:16.963 Yeah, yeah. Exactly. Very, very cool. All right. Well, I hope we get to ride one day together, that would be great. I really enjoyed this conversation. Thank you for taking the time. And I look forward to meeting you again someday, at FOSDEM or somewhere else. 
GS: 15:32.360 Thank you, for an invitation and we should definitely go for a ride. 
RVB: 15:36.138 Absolutely. Thank you, Gábor. 
GS: 15:38.717 Thanks. Bye

Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best