Tuesday 28 April 2015

Podcast Interview with Lorenzo Speranzoni, Larus Business Automation

Here's another great conversation with another great citizen of Graphistania: Lorenzo Speranzoni from Larus Business Automation in Italy. I got to know Lorenzo from his unbelievably detailed work on the "Van Gogh" graph: take a look at it over here and go "WOW!". Both of us like cycling a lot - so we probably could have talked a lot longer - but let's start with this podcast:

Here's the transcription of our conversation:
RVB: Good morning everyone. My name is Rik. Rik Van Bruggen from Neo Technology, and here we are again recording a session of our Neo4j graph database podcast. And today I've got a guest joining me remotely from lovely Venice in Italy, and that’s Lorenzo Speranzoni from LARUS. Hi, Lorenzo. 
LS: Hi, Rik. Yes, good morning everybody. 
RVB: Hey, good morning. Good to have you on the podcast. We got to know each other a little bit through some of the graphgists, I think. Right, Lorenzo? You wrote some really interesting graphgists, and some tweets, and all that wonderful stuff. Do you mind introducing yourself a little bit to our audience? 
LS: Yes, sure. My name is Lorenzo Speranzoni. I come from Venice, and I'm the CEO and founder of a small company which is called LARUS. It's the Latin name for seagull. We are experts in developing custom software, and we really love graphs. 
RVB: Okay. Cool. How did you get into graphs? I've seen some of your work on Van Gogh's Journey, and the cycling graphs, and all those wonderful things, but how did you get into graphs, Lorenzo? 
LS: Well, to be honest, it all started with frustration about SQL. We were tired of writing those very big queries, and of whole afternoons spent optimizing those SQL queries. So we started looking at the NOSQL landscape. Apart from graphs, we also started looking at all the technologies that the NOSQL landscape offers. But at a certain point, we just focused on graph databases, because we really love the fact that it's a fully ACID-compliant database, and the model is also normalized, so we love to model that way. 
RVB: Okay. One of your experiments has been with Van Gogh's journey, right? I'm going to put it in as a link in the podcast for sure. How did that go about? 
LS: Thank you. 
RVB: Tell me about that a little bit. 
LS: I really love art, also because my uncle was an art history professor, so I can remember the time we spent together talking a lot about art. When I started to look at Neo4j, I decided to write a use case based on art, especially about Van Gogh, which is my favorite artist. What I wanted to do was model his journey across Europe in a Neo4j graph database. I wanted to understand, during his trip, the persons he met, and the other artists he met, and what influenced his work. 
RVB: That's fantastic. I've been reading it. It's a very long and lavish graphgist. But I really love it. It's very well done. Thank you for doing that. 
LS: It's a pleasure. 
RVB: What kind of use cases are you seeing for graph databases in your environment? What kinds of things are you working on? And also maybe you could elaborate a little bit on where do you think it's going in the future? 
LS: Well, we are actually really enthusiastic, because we are running lots of demos, especially for banking, insurance, and telco companies, which is our main field of expertise. And what we are observing is also a great enthusiasm from the other side. People that attend these demos are really excited about starting to use Neo. 
RVB: And then what do they want to do with it? What excites them most you think? 
LS: Well, lots of them are looking at the typical use cases you often talk about. I mean looking at recommendation engines, looking at fraud detection - typical use cases for banking and insurance. 
RVB: Okay. Cool, very cool. What do you think is coming? Where is this going, you think? Both for you guys personally and for Neo and graph databases in general. What does the future hold? 
LS: Well, let me say, people feel the desire to have some more powerful technologies to solve the pressing business problems they have. Everybody is trying to solve their problems with a relational database, but they know it's too hard and too complex. I know they need something else, and when we show Neo4j during our demos, people start feeling more comfortable about the opportunity to have something that can really help them. 
RVB: Yeah, I think [inaudible]. As I said in the beginning, it's like really, really, really hard, but then it gets easier and easier after that, right? Is that also what you are seeing? 
LS: Yes, absolutely. That's absolutely true. 
RVB: Yeah. Very cool. Thank you for coming on the podcast, Lorenzo. It was great having you here. I'll put some links on the blog post with the podcast as well, of course. It's been great having you in our community and I hope to see a lot more of your wonderful work in the next couple of months. 
LS: Thank you so much, Rik, and to everybody. 
RVB: Cheers. Bye.
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best


Friday 24 April 2015

Podcast interview with Michal Bachman, GraphAware

Here's another great conversation for you in our Neo4j Graph Database podcast series. I met up with Michal Bachman of GraphAware, one of the awesome Neo4j partners out there. Michal and I have been working together on different projects, presentations and beer-tastings - and I am happy to say that his capabilities, visions and strategies when it comes to graphs and Neo4j are WAY better than his taste in beer :) ... Listen to the interview and find out why:

Here's the transcript of our conversation:
RVB: Hello, everyone. This is Rik. Welcome, again, to one of our Neo4j graph database podcasts. It's another remote session. I'm joined today by Michal Bachman of GraphAware. Hi Michal. 
MB: Hi Rik, thanks very much for inviting me. 
RVB: Yeah, absolutely. It's great to have you on the podcast. Michal maybe people don't know you yet, so why don't you introduce yourself? Who are you? 
MB: Sure. My name is Michal Bachman and I'm the founder and managing director of a company called GraphAware, which is a London based company dedicated to Neo4j consultancy, training and development. Being based in London we are in a great position to travel around the whole world pretty much and help people succeed with Neo4j. That's what we do for a living. 
RVB: Absolutely. I can hear the London police in the background [laughter]. That's absolutely great. Thanks, Michal. How did you get to graphs and how did you get to Neo4j? Tell us a little bit about that and what attracted you. What do you love about graphs? 
MB: I started with Neo as a user, pretty much, about four or five years ago. I was involved in a few projects, in fact, that used Neo. One was a recommendation engine, and another one was an impact analysis solution for one of the large telcos. And I really liked the experience as a user, and I then went on and took a bit of a break, and did a master's degree at Imperial College, London where I wrote a thesis on graph databases. 
RVB: Oh, yeah? 
MB: Yeah. Quite inspired by Jim Webber and his idea. That was great, and I loved it. I loved the experience as a user. I loved doing research about it, so the natural next step was to start my own company that will focus only on Neo4j. That's how I pretty much started, and it's been everyday [chuckles]. 
RVB: Absolutely. What attracted you most? What did you like most about working with graphs and Neo4j, specifically? 
MB: The actual thing that I liked the most is, surprisingly not a technical thing. It's the fact that when you introduce people to graphs, and we are doing that every day, you can see the moment - the "Ah" moment - in their eyes. 
RVB: The lights come on [chuckles]. 
MB: Yeah. The lights come on, and then they're like, why haven't I used this before? This is not just like another 10% better way of storing data. This is a complete game changer, and people seemed to get it immediately, and it's applicable to every domain out there. There's a huge potential, and I just liked the fact that you know when people get it. They just fall in love with it. 
RVB: Just to follow onto that, one of my Dutch community members, or community members in the Dutch graph database community, once told me, "Once you start working with graphs, relational databases feel like a youthful sin," [chuckles]. 
MB: [laughter] Yeah. And it makes so much sense if you think about it. Most people work with object-oriented languages, and objects are graphs. Everything is a graph, so it just feels so natural after you've made that transition. 
RVB: Tell me a little bit more about GraphAware now. You guys have a wonderful graph framework these days, right? The GraphAware Framework. Tell me a little bit more about it. 
MB: We've been doing two things, really. We've been doing consultancy, as you know. We are involved in projects, very hands-on, helping customers develop software with Neo4j. And as we gain more experience about what the use cases are and what people need, we're distilling some of those ideas and experiences into open source extensions for Neo4j. That's the two things, and the third one of course is training. We're running community events, but we also run public trainings. In the future, we see ourselves doing more of the actual extension development and open source software built on top of Neo as the way to go. 
RVB: What are some of the features of the framework, in just two minutes? 
MB: One that we recently released, and that's getting quite popular - we're running meetups around it as well - is a recommendation engine extension that allows people to build quite complex, high-performance engines on top of Neo. That's one. And the other ones are quite technical. We've got modules for representing time series data in Neo4j as a tree for easy querying, and there are loads of other modules for domain-specific use cases. 
RVB: I'll put a link to the repo on the blogpost to go with the podcast so people can take a look at that. Let's maybe move on a little bit. Where is it going, Michal? Where are you guys going as GraphAware, but also where do you see the industry going? Any perspectives that you want to share? 
MB: Absolutely. I think we're going to see, and we are going to see quite soon, this technology being adopted by large enterprises at a massive scale. And as that's happening, I'm seeing more of the enterprise features being developed, whether as part of the core product or as extensions, so that companies like banks and so on find it easier to use. I'm talking about security, auditing and things like that. And I see people starting to build whole platforms around the graph use cases - including graph-compute engines, including other great software - to build whole data analytics platforms where the graph is the center of the game. And extensions for impact analysis, fraud detection, recommendations - complete solutions - that's what I think we're going to be seeing in the near future. 
RVB: Very cool. Okay. One more question for you, and it's the most important one. What do you prefer, Belgian beer or Czech beer? 
MB: [laughter] I have to be honest with you, I prefer Czech beer [laughter]. 
RVB: Oh my God, I can't believe that! All right, thank you so much for coming on the podcast Michal, it was great having you. 
MB: Thank you, Rik, for inviting me. I want to say one last thing. We're of course going to be present at GraphConnect. We're sponsoring the conference - 7th of May, we're going to be there - so if anyone's interested in having a chat with us, please come to GraphConnect in London, and we'll see you there. 
RVB: Yeah. Absolutely. Thanks a lot, Michal. Talk to you soon, man. 
MB: Thanks, Rik. Bye-bye.

All the best


Tuesday 21 April 2015

Podcast Interview with Peter Neubauer, Mapillary (and co-founder of Neo4j)

Today is a special podcast episode. I got a chance to talk to one of Neo4j's founders, Peter Neubauer, again - which is always fun. I remember one of the first conversations I had with Peter, where he was explaining something to me over a skype call - and he was at the same time pushing a core Neo4j bugfix to github. Just to say that he is pretty awesome and incredible at multi-tasking :)) ... Peter left Neo last year as an active team-member, to start a new project that "looks just as impossible", Mapillary. Take a look at it for sure - but listen to the podcast first:

And here's the transcription of the chat:
RVB: Good morning everyone. My name is Rik, Rik Van Bruggen, from Neo Technology, and here we are again recording a remote session for our graph database podcast. I'm joined today from Sweden. Peter Neubauer is on the other side of the line. Hi, Peter 
PN: Hi, Rik. Nice to meet you. 
RVB: Yes, good to be on the phone with you again. It's been a while. Peter, if you don't mind - most people will know you - but, would you mind introducing yourself a little bit for people that don't know you yet? 
PN: Yes. Regarding Neo4j, I'm one of the three founders of Neo4j, together with Emil and Johan, who are currently working at Neo Technology. Actually, it was us three who came up with the first version back in 2002 and wrote the first version that went into production. 
RVB: That's a long time ago, huh? 
PN: That's long time ago, yes - 13 years. 
RVB: Absolutely. It's been quite a ride. How did it start Peter? How did you guys get into Neo and where did it come from? 
PN: It started with us having written content management systems. We were at that point managing images, and one of the major problems there was that every photographer, every picture agency, had their own rights management for when an image could be licensed, to whom, in which country, and so on - so there was a lot of business logic around that. We were modelling that in what then was Informix, a server database that was one of the most capable database engines at the time - an object-relational database engine-- 
RVB: Didn't they get acquired by IBM, Informix? They did, right? 
PN: Yes, they did. And this was the last version, 9.14, that came out before they were acquired. We were trying to model our business logic in that engine and it just didn't scale. We got to like five, six joins, and even with the, at the time, modest data in that database, it was just taking ages - like minutes - to get answers back, and that's not good enough for a backend that needs to serve the web. At that point, we examined our system architecture and found out that the database was the bottleneck. We also at that point had some of the DataBlades, or plugins to Informix, and one of them was dealing with semantic words for translation, namely WordNet, a semantic initiative to structure the English language. That projected a network model of these connected words - hyponyms and synonyms and concepts - into the database. And we saw that and thought, "This is a very interesting approach to model data." It's very close to UML diagrams and so on, if you translate it to our domain. We tested that plugin for our domain just to see if the model fit, and it did. However, it was still slow. So since it was such a beautiful match, we then went ahead and wrote, at that point, an Enterprise JavaBeans 1.0 implementation that modeled that kind of structure - and that actually had the first Java version of what is now the Neo4j API, all fleshed out. 
RVB: Did you call it a graph API at the time, or do you call [crosstalk]? 
PN: No, no, no. 
RVB: What did you call it at the time? 
PN: We called it a network database or network engine, and that's where Neo partly comes from. Of course, The Matrix was very popular, but it also stands for network engine, of course, so we had to make it work [laughter]. 
RVB: Fantastic, okay. I don't know if I've ever told you that, but one of the projects that I've first worked on was a project for DHL which was also using Informix Datablades. 
PN: It was a very good database. 
RVB: Super. Where are you now, Peter? What are you doing now? You're working for a new startup, right, Mapillary? 
PN: Yes. I left Neo Technology last year, mostly because I found-- I'm an early-startup guy, and Neo4j has a big group of followers now and there's so much activity around it, so my feeling was that I could now invest my time in something that is, again, almost impossible. So I joined Mapillary as a co-founder, and the vision there is to do a visual representation of the whole earth, possibly even with a 3D model connected to it. So that's what we're doing. People are submitting basically thousands and thousands of images taken by their smartphones or action cameras and so on. And we in the background do a lot of computer vision and analytics on this data, and we connect the images into what could be described as a big, global, giant graph of visual information. So Neo4j is an essential part of the architecture there. 
RVB: Oh, is it? What do you use it for? What do you use Neo for? 
PN: We use it for connecting the analyzed images, both in space - so you actually have this connection between one image and the nearest images in different directions - and then we even connect computed visual connections. For instance, one image overlapping another image: if two images look at the same view of the Turning Torso, then we will know it and we will actually create a connection in that image graph, so that from one image you can translate the object - the Turning Torso - into something that can merge into the others. So we know how to project, and we even store the 3D point cloud of these objects, with references to it, in Neo4j [inaudible]. The whole navigational logic - if you then want to construct, for instance, a street view from these millions of images - is done in Neo4j, because it's a perfect use case for Neo4j: basically fetch all the connected images in the vicinity - say 3 or 4, or 30 if you are going to fast-forward - then prefetch these into a local graph in JavaScript, and do that along certain rules while you are traversing the backend graph: for instance, time filtering, or filtering just certain colors, shades, or certain directions, or what not. 
RVB: Super. That's really interesting. I think people can get involved with the Mapillary project as well, right? There's like an app that they can download and then you can participate in the project, right? 
PN: Yes. Anyone can submit pictures and anyone can help improve the data. It's like OpenStreetMap or Wikipedia, so you can even improve, for instance, the street-sign detections and object detections that we do in the images, and feed back to, for instance, the OpenStreetMap project and to Wikimedia. 
RVB: Cool. Peter, maybe one more question because we keep these podcasts quite snappy. Where is this going? Where is Mapillary going? Where are graphs going? Any vision on that? Do you mind sharing that? 
PN: I think the concept of connected data is growing a lot, and people are expecting and willing to put in much more effort into making data connected. That is not just on the global linked-data-initiative level, but even on a pragmatic, in-system level. So where I see graphs going is that they approach the enterprise - connected data even in normal installations. And as we see with a lot of developments in virtualizing hardware and so on, you can partly build bigger, monolithic kinds of graph blobs. In Mapillary we now have over 1 billion properties in the database within one year, and that is one thing: the hardware is letting you scale up these installations quite a lot, so you can scale up quite easily. And the other thing is that sharding graphs will be the forefront of data science. That is one of the remaining challenges with graphs: they're very easy to query and so on, but sharding them is not trivial. 
RVB: So difficult. Yeah, yeah. 
PN: It's difficult, yeah. 
RVB: I'm sure you've heard of the work that Jim Webber and Co have been doing on that, and we are really in the middle of starting that project again and making some good progress there. 
PN: Yeah, I'm really excited about it. Right now, in Mapillary, we go for bounding boxes and shard by geography, and if you have a domain that lets you [shard?] it in an interesting way, then you can do this already now - but auto-sharding would be awesome. 
RVB: Super. Peter, thank you so much for coming on the podcast. It was a pleasure to talk to you again. I really appreciate it. I wish you lots of luck and pleasure and drive at Mapillary, and thanks again. I look forward to speaking to you soon. 
PN: No problem. Nice to talk to you too, Rik. 
RVB: Cheers.


All the best


Sunday 19 April 2015

The making of The GraphConnect Graph

Next month is GraphConnect London, our industry's yearly High Mass of Graphiness. It's going to be a wonderful event, and of course I wanted to chip in to promote it in any way that I can. So I did the same thing that I did for Øredev and Qcon before: importing the schedule into Neo4j.

I have actually already published a GraphGist about this. But this post is more about the making of that database - just because I - AGAIN - learnt something interesting while doing it.

The Source Data

My dear Marketing colleague Claudia gave me a source spreadsheet with the schedule. But of course that was a bit too... Marketing-y. I cleaned it up into a very simple sheet that allowed me to generate a very simple CSV file:
I have shared the CSV file on Github. Nothing really special about it. But let me quickly explain what I did with it.

Choosing a model

Before importing data, you need to think a bit about the model you want to import into. I chose this model:
The right hand part is probably pretty easy to understand. But of course I had to do something special with the days and the timeslots.

  • The days are part of the conference, and they are connected:
  • And the timeslots within a day are also connected:
So how do we import into that model from that simple CSV file? Let's explore.

The LOAD CSV scripts

You can find the full load script - which actually loads from the dataset mentioned above - on GitHub too. It's pretty straightforward: most commands just read a specific column from the csv file and do MERGEs into the graph. Like for example:
load csv with headers from "https://gist.githubusercontent.com/rvanbruggen/ff44b7dc37bb4534df2e/raw/aed34a149f04798e351f508a18492237fcccfb62/schedule.csv" as csv
merge (v:Venue {name: csv.Venue})
merge (r:Room {name: csv.Room})
merge (r)-[:LOCATED_IN]->(v)
merge (d:Day {date: toInt(csv.Date)})
merge (tr:Track {name: csv.Track});
Nice and easy. There are a couple of commands that are a bit more special, as they have to check for NULLs before you can do a MERGE. But nothing really complicated. Then there are two sets of import commands - one for each day - that are a bit more interesting: how do you import the timeslots and create a structure like the one above, where all timeslots are nicely ordered and connected in an in-graph index? That's not that trivial:

  • loading the timeslots is easy with MERGE
  • sorting them is of course also easy
  • but creating the appropriate FOLLOWED_BY relationships between the timeslots to create the in-graph index/timeline, is not that easy.
Luckily I found these two blogposts by Mark Needham that showed me how to do it. Here's the query:

match (t:Time)--(d:Day {date: 20150506})
with t
order by t.time ASC
with collect(t) as times
  foreach (i in range(0,length(times)-2) |
    foreach (t1 in [times[i]] |
      foreach (t2 in [times[i+1]] |
        merge (t1)-[:FOLLOWED_BY]->(t2))));
What this does is the following:
  • you match all the timeslots that are part of the specific day that you want to order.
  • you pass the ordered timeslots to the next part of the query
  • you collect the ordered timeslots into a collection
  • you iterate through the collection with FOREACH, over the indices 0 to (the length of the collection minus 2) - one iteration for every consecutive pair of timeslots
  • you iterate through the starting positions (i) and the ending positions (i+1) in the collection
  • every time you iterate, you MERGE a FOLLOWED_BY relationship between the starting position and the ending position
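With those FOLLOWED_BY relationships in place, you can sanity-check the resulting in-graph timeline by walking it from the first timeslot to the last. Here's a quick sketch of such a query (my own addition, assuming the model above):

```cypher
//walk the timeline of the first day, from the first to the last timeslot
match (first:Time)--(:Day {date: 20150506})
where not ()-[:FOLLOWED_BY]->(first)
match path = (first)-[:FOLLOWED_BY*]->(last:Time)
where not (last)-[:FOLLOWED_BY]->()
return [t in nodes(path) | t.time] as ordered_timeslots;
```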
And that's it. Job done. We do this for the second day ("20150507") as well, of course.
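As an aside: the NULL checks mentioned earlier boil down to filtering out the empty CSV cells before the MERGE, because MERGE would fail on a NULL property. A minimal sketch - the optional `Speaker` column is hypothetical:

```cypher
load csv with headers from "https://gist.githubusercontent.com/rvanbruggen/ff44b7dc37bb4534df2e/raw/aed34a149f04798e351f508a18492237fcccfb62/schedule.csv" as csv
with csv
//skip the rows where the optional column is empty
where csv.Speaker is not null
merge (p:Person {name: csv.Speaker});
```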

Hope you enjoyed this as much as I did, and hope to see you at the conference!



Friday 17 April 2015

Querying the SNAP Beeradvocate dataset in Neo4j - part 3

In the previous part of this blog post series of three posts, I imported the SNAP Beeradvocate dataset into Neo4j. All good and well, and we now have the following meta-graph:

So now we can start querying the dataset. It's a bit different from the "Belgian Beer" dataset that I worked on previously - this one is a lot bigger, and also a bit more US-focused. But still - we can do some nice queries on it. Let's start with something nice and simple:

//where is Duvel 
match (b:Beer {name:"Duvel"}) return b 
//where is Duvel and surroundings 
match (b:Beer {name:"Duvel"})-[r]-() return b,r limit 50
The result is interesting:

Then we try the same for Orval.
match (b:Beer {name:"Orval"}) return b 
does not return anything. So let's see if we can find it some other way:

The following query:
match (b:Beer)
where left(b.name,5) = "Orval"
return b  
tries to find Beers that have their name START with the word Orval. And yes indeed, we find it immediately.
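The same prefix search can also be done with a regular expression, which has the added benefit that you can make it case-insensitive with the `(?i)` flag - a sketch of that variant:

```cypher
//case-insensitive prefix match on the beer name
match (b:Beer)
where b.name =~ "(?i)orval.*"
return b;
```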

And I am very happy to report that this query actually taught me something new about Orval. Even though I am a big fan and have been to the brewery multiple times - I had never drunk the "Petite Orval", a lighter trappist beer that is only brewed for the monks. See wikipedia for more details. 

So let's take a look at some interesting paths. Here's a path between Duvel and Orval:
match (d:Beer {name:"Duvel"}), (o:Beer {name: "Orval Trappist Ale"}),  path = shortestpath((d)-[*]-(o))  return path  
that gives an (*one*) interesting path:
I really need to find that "Duvel Single" beer and taste it.

Or look at some additional paths, and run this query:
match (d:Beer {name:"Duvel"}), (o:Beer {name: "Orval Trappist Ale"}),
path = allshortestpaths((d)-[*]-(o))
return path
limit 10  
I have put in place the LIMIT to not make the browser blow up. The result is interesting:

There seem to be quite a few REVIEWERS that are reviewing both beers. That's interesting of course, but let's say that I would not want the reviewers/reviews to be part of the path? Well, it turns out that "excluding" nodes from a shortestpath function is not that easy in Cypher - you are better off including the relationship types that you want to have included. Like this:

//link between Duvel and Orval discarding the reviewers and reviews 
match (d:Beer {name:"Duvel"}), (o:Beer {name: "Orval Trappist Ale"}), path = allshortestpaths((d)-[:BREWS | HAS_STYLE*]-(o))  return path  
This query gives yet another interesting result:
Turns out the brewery seems to be brewing a number of beers similar to Orval! Not sure how true this is - but worth an investigation!

Last but not least is the search for my favourite beer style - the Trappist beers. Now, this is kind of tricky, as the dataset that we are working with here is kind of American-focused - and not all the beer brands or styles are as I would expect them to be. On top of that, we currently don't have "fulltext" search capabilities in the wonderful new "schema indexes" (we do have them in the legacy indexes, but I am not using those here), so we have to work around that with a regular expression in the query. Let me show you:
//Find the Trappist beers, their breweries and styles 
match (br:Brewery)--(b:Beer)--(s:Style)
where b.name =~ ".*\\bTrappist\\b.*"
OR s.name =~ ".*\\bTrappist\\b.*"
return b,br,s;  
gives us all the beers, breweries and styles that have the word "Trappist" in their beer or style names. It gives us a really interesting subgraph to take a look at:

Seems like the good news is that I have quite a bit of beer exploration to do!
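If you want a quick overview rather than a subgraph, the same predicate can feed an aggregation - here's a sketch (my own addition) that counts the matching beers per style:

```cypher
//count the "Trappist" beers per style
match (:Brewery)--(b:Beer)--(s:Style)
where b.name =~ ".*\\bTrappist\\b.*"
OR s.name =~ ".*\\bTrappist\\b.*"
return s.name as style, count(distinct b) as beers
order by beers desc;
```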

That concludes this third part of the Beeradvocate network dataset exploration in Neo4j. There's a lot of other stuff that we could do with this dataset - but I hope you already found it as interesting as I did - and as always, please send me your feedback!



PS: Here are the links to

Thursday 16 April 2015

Podcast Interview with Chris Gioran, Neo Technology

In this podcast I have been able to have lots of great conversations with lots of great, interesting people. This episode is another one of those conversations to remember: I chatted with Chris Gioran, one of Neo's software engineers, working from Athens, Greece. Chris is special, as you will learn in this podcast - not least because he was one of the lead engineers developing Neo4j's HA system that is part of the Enterprise Edition of Neo4j. Here's the conversation:

And here's the transcription of the chat:
RVB: Hello, everyone. This is Rik from Neo and here we are again recording another episode for our podcast. Tonight I've got a wonderful guest from all the way over in Greece. It's the ever so lovely, Chris Gioran. Hi, Chris. 
CG: Hi, Rik. 
RVB: Good to have you on the podcast. 
CG: It's good to be here. 
RVB: Chris, most people won't know you, at least not-- they might know your work but they may not know you as a person, so would you mind introducing yourself? 
CG: Sure. I'm Chris Gioran. I come from Athens, Greece. I'm a software engineer and I have been in the boiler room of Neo4j, working in the kernel, deep within - very close to the disk - for the past four years. Four years and something now. I've been in the kernel team since pretty much the beginning, and I've also moved over occasionally to the HA component, and I'm currently one of the two primary authors of the current HA offering, from when we moved from Zookeeper to Paxos, basically. 
RVB: Chris, how did you get to Neo4j? Can you tell us a little bit of the history there? 
CG: Sure. When I was fresh out of university - databases had been one of my primary research interests as an undergraduate - I worked a bit in the industry, as a database optimizer, I guess. I was working with relational databases, trying to make them work faster, especially because most of the ORM (Object Relational Mapping) solutions that people use to create websites are not that efficient with the SQL they produce. And being into databases, and really into software engineering and Java, I started looking into the NOSQL solutions that were popping up at that time, and Neo4j drew my attention because it was written in Java and it was - and it still is, of course - a true ACID database. I really wanted to get into the actual implementation of the thing instead of just being a user. So I started reading the source code, and I wrote pages describing how Neo4j manages its transactional aspects, how it stores information to disk, how it ensures locking, isolation - all the good things that we've come to expect from databases. 
RVB: You were not working for Neo at the time, right? 
CG: No, no. 
RVB: You were just a community member, right? 
CG: I was just a community member - if even that. I was just a guy interested in how it works, basically. 
RVB: Very cool. 
CG: I wrote those articles and Peter Neubauer picked them up, along with the rest of the team - which was very young at the time - and we got to talking about it. I got the opportunity to write some code. Basically, it was around getting external transaction managers to work with Neo4j, so that you can have real two-phase commit between, for example, Neo4j and another database like MySQL. And we integrated that into the kernel and, as they say, everything is history after that. 
RVB: Yes, the rest is history, absolutely. And then you started working for Neo as a software engineer. What were some of the main things that you worked on? You mentioned the HA implementation, right? 
CG: Right. Like I said, I started off working in the kernel, basically. So my first big task was moving into the new property store that we use right now which is more compressed than the original versions, taking up less space, and it's also more efficient because it takes up less memory and you can read more from disk with one go. After that though, I started moving to HA. Initially I tried to optimize the way that we used Zookeeper to make big cluster offerings work more efficiently, but then we saw the shortcomings of that approach. And me and Rickard Øberg, we got down and we rewrote the way that HA works and we moved away from Zookeeper which, great software as it may be, it wasn't fit for our purpose. And we wrote, from scratch, a Paxos implementation which does pretty much the same thing but in a much more controlled fashion in a way that we can debug it and maintain it and making it finally performant. 
RVB: And Paxos, Chris? So just for our listeners - Paxos, that's a protocol, right? It's a high availability protocol--? 
CG: Basically, it's a distributed consensus mechanism. It's one of the primary protocols used for atomic broadcast, and in simple words it means that it makes sure that all the machines in the cluster know exactly the same things, even in the face of partial failures or complete failures. And that's what we use. 
RVB: That's what Neo4j, the current version - 2.2 - uses, right? 
CG: Yes, that's exactly right. Since 1.9, basically. 1.9 had both Zookeeper and HA as an offering. You could switch between the two, or close anyway. But since 2.0 HA, the current Paxos offering has been the only thing that we use. 
RVB: Maybe I can sort of quickly zoom out a little bit. You mentioned that you were interested in databases already, but was there anything specific about the graph model or the graph database model that attracted you to Neo? What did you like about Neo at the time when you started using it? 
CG: My first interest in Neo was mostly the technology behind it. It was a grassroots database that had ACID guarantees, and that's what drew me to it. It wasn't the model, to be honest. But very soon - getting involved in the ecosystem and seeing how people used it, both as community members and in the large deployments that we had at the time - I came to realize that even though it was such a small code base, and it really wasn't as mature as most of the relational offerings, it offered very similar guarantees but insanely faster performance. That was the thing that struck me first. So it was the lack of joins, basically. 
CG: The other was the lack of impedance mismatch between the object-oriented paradigm of programming and the way that you store stuff in a graph. Because when you use a relational database, you have a round peg that you try to fit in a square hole, basically. But when you have a graph, you can map pretty much one to one all your domain objects onto the disk and you will never know the difference. And a testament to that has been the Spring framework, which was essentially the effort of just one person, Michael Hunger, who singlehandedly provided an ORM from Spring onto Neo4j. Whereas, if you look at solutions for the corresponding relational problem, they are insanely complicated and have lots of shortcomings. 
RVB: Chris, where is it going? Maybe we can zoom in on that one a little bit. I know that you are taking on some new adventures personally. You can talk about that if you want, but where do you think the graph space or the graph database space is going as well? What's your opinion on that? 
CG: Well, judging from what we have seen that the market wants - from our customers, as well as from our own research interests and the way that we want to take things - I can see two trends. One is graph processing - global graph processing. That's something that looks really, really interesting: whereas most of the NOSQL solutions right now are perhaps better suited for online transaction processing, we also want to move into global graph queries and do batch processing of very, very, very big graphs. Provide functionality like a graph compute engine for huge graphs that you can process really fast, and do data mining or do interesting calculations. 
RVB: Like distribution of queries and all those types of things? That's what you're thinking of? 
CG: Exactly. And that actually leads us nicely to the second thing that I'd like to see which is really, really big graphs. Right now, there is no real offering for having graphs that are practically unlimited in size. This is something that we are looking into for Neo4j. We've been doing so for a long time - semi-publicly - and we really want to move into that direction. I'd really like to see clusters of thousands of machines processing huge amounts of data with the ease that we've come to know from Neo4j when it comes to single instance data. So it's not only performance, it's also the ability that you gain, the kinds of stuff that you can do very easily when you have that technology. 
RVB: Chris, you personally, you are going to do some new interesting adventures, right? Do you want to talk about that? 
CG: Yeah, sure. 
RVB: Or do you want to mention that? 
CG: I can talk briefly about it. Apart from my software engineering interests and graphs in particular, I'm also very, very interested in doing some work in journalism. For the past two months, I think, I've been working as a data journalist in a new venture, like an NGO in Athens, Greece, where we try to do that sort of work. Currently I'm the only junior data journalist on staff. 
RVB: Does it have a name already? Does the agency have a name already? 
CG: Yeah. The name is The Aeneosis, which sounds like a Greek word, but it really isn't - it's a portmanteau, a concatenation of two words from Greek. 
RVB: I'll put a link to it on the blog post that goes with the podcast, maybe that's [crosstalk]-- 
CG: Sure, when we launch because we don't have a site right now. We're launching mid May. But that's one of the things that-- data journalism is also a domain that can gain from graphs, by the way, and we already have projects starting that will be using Neo4j basically for ontology processing to start with. 
RVB: Chris, I think we're going to wrap up. We want to keep these podcasts reasonably short. Thank you so much for spending time with me. I really appreciate it. And good luck with your ventures both at Neo and with your agency. Thank you for coming online, Chris. I appreciate it. 
CG: Thank you for having me, Rik. Thank you for doing this, and thank you for everything. 
RVB: Cheers, bye. 
CG: Cheers.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best


Wednesday 15 April 2015

Importing the SNAP Beeradvocate dataset into Neo4j - part 2

After the previous post on the SNAP Beeradvocate dataset, we were ready to import the dataset into Neo4j. We had 15 .csv files, perfect for a couple of runs of Load CSV.

The first thing I needed to do was to create a graph model out of my CSV files. Here's what I picked:
So then I would need to create a series of Load CSV commands to import these. And this is where it got interesting. I created the Cypher queries myself, and found that they worked fine - except for one part: the part where I had to add the reviews to the graph. This was my query:
 using periodic commit  
 load csv with headers  
 from "file:/Users/rvanbruggen/Dropbox/Neo Technology/Demo/BEER/BeerAdvocate/real/ba7.csv" as csv  
 fieldterminator ';'  
 with csv  
 where csv.review_profileName is not null  
 match (b:Beer {name: csv.beer_name}), (p:Profile {name: csv.review_profileName})  
 create (p)-[:CREATES_REVIEW]->(r:Review {taste: toFloat(csv.review_taste), appearance: toFloat(csv.review_appearance), text: csv.review_text, time: toInt(csv.review_time), aroma: toFloat(csv.review_aroma), palate: toFloat(csv.review_palate), overall: toFloat(csv.review_overall)})-[:REVIEW_COVERS]->(b);  

On some of the import files (remember, I had 15) this query would fail - it would run out of heap space. This is a very tricky thing to troubleshoot in Neo4j, so I had to call for help. My colleagues all immediately volunteered, and of course within the hour Michael had reengineered everything.

The first thing Michael asked for was my query plan (using the EXPLAIN command): this was particularly interesting. Michael saw that there was a step in there called "Eager". Mark has blogged about this elsewhere already, and it was clear that we had to get rid of it.
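If you want to reproduce this, you can prefix the statement with EXPLAIN to compile it and show the plan without touching any data. Here's a sketch using the failing query from this post (property list trimmed for brevity; if your Neo4j version refuses to combine EXPLAIN with USING PERIODIC COMMIT, just drop that line for the inspection):

```cypher
// Show the plan without executing anything: look for an "Eager" operator,
// which forces Cypher to read all CSV rows before writing - pulling the
// whole file into memory and blowing the heap on bigger inputs.
explain
load csv with headers
from "file:/Users/rvanbruggen/Dropbox/Neo Technology/Demo/BEER/BeerAdvocate/real/ba7.csv" as csv
fieldterminator ';'
with csv
where csv.review_profileName is not null
match (b:Beer {name: csv.beer_name}), (p:Profile {name: csv.review_profileName})
create (p)-[:CREATES_REVIEW]->(r:Review {taste: toFloat(csv.review_taste)})-[:REVIEW_COVERS]->(b);
```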

Here's the query that Michael suggested:
 //the query below is NO LONGER PROBLEMATIC  
 using periodic commit  
 load csv with headers from "file:/Users/rvanbruggen/Dropbox/Neo Technology/Demo/BEER/BeerAdvocate/real/ba15.csv" as csv fieldterminator ';'  
 with csv where csv.review_profileName is not null  
 create (r:Review {taste: toFloat(csv.review_taste), appearance: toFloat(csv.review_appearance), text: csv.review_text, time: toInt(csv.review_time), aroma: toFloat(csv.review_aroma), palate: toFloat(csv.review_palate), overall: toFloat(csv.review_overall)})  
 with r,csv  
 match (b:Beer {name: csv.beer_name})  
 match (p:Profile {name: csv.review_profileName})  
 create (p)-[:CREATES_REVIEW]->(r)  
 create (r)-[:REVIEW_COVERS]->(b);  
 // takes 13s  

You can find the two import scripts on github:
  • this is my old version (which DID NOT WORK, at least not always)
    UPDATE: in the original version of this blogpost, I was working with version 2.2.0 of Neo4j. Recently, 2.2.1 was released - and guess what: the queries run just fine. Apparently the team made some changes to how Neo4j handles composite merge updates - and it now just flies through all the queries, even with my old, sub-optimal version. Kudos!
  • this is Michael's version (which, of course, WORKS)
    UPDATE: I would still recommend using this version of the queries :) 
Let's explore some of the differences.
  1. Michael's version included the same indexes as mine - but also included a UNIQUENESS CONSTRAINT. This seems to be a good idea because it makes the MERGE-ing of the data unnecessary - you can just CREATE instead.
  2. Michael's version does "one MERGE at a time". Rather than merging in entire patterns like
merge (b)-[:HAS_STYLE]->(s:Style {name: csv.beer_style})

you instead do

merge (s:Style {name: csv.beer_style})
and then merge (b)-[:HAS_STYLE]->(s)
  3. Michael's version reorders certain parts of the query to come earlier in the sequence. I noticed that he did the CREATE of the review first, then transferred that result into the next part of the query with WITH, and then did two MATCHes (for Beers and Profiles) to connect the Review to the appropriate Beer and Profile. To be honest, this seems to have been a bit of a trial-and-error search - but after talking to the awesome devteam, we found out that this should no longer be necessary as of the forthcoming version 2.2.1 of Neo4j. 
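Concretely, for points 1 and 2 above, the Beer/Style part of the import could be sketched like this (the labels and property names come from the queries in this post; the statements themselves are illustrative, using Neo4j 2.x syntax):

```cypher
// 1. A uniqueness constraint (which also gives you an index) guarantees
//    plain CREATE can never produce duplicate Style nodes:
create constraint on (s:Style) assert s.name is unique;

// 2. MERGE one element at a time - first each node, then the relationship -
//    instead of merging the whole (b)-[:HAS_STYLE]->(s) pattern in one go:
load csv with headers
from "file:/Users/rvanbruggen/Dropbox/Neo Technology/Demo/BEER/BeerAdvocate/real/ba7.csv" as csv
fieldterminator ';'
merge (b:Beer {name: csv.beer_name})
merge (s:Style {name: csv.beer_style})
merge (b)-[:HAS_STYLE]->(s);
```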
The result is pretty awesome. After having run the entire import (which means running the same import 15 times - see over here for the complete script) I got a pretty shiny new database to play around with:
In the last part of this blog-series, I will be doing some fancy queries. Really looking forward to it :))

Hope you found this useful.



PS: Here are the links to

Tuesday 14 April 2015

Podcast Interview with Brian Underwood, Neo Technology

I recently got the chance to talk to Brian Underwood, currently working as a Developer Evangelist at Neo. Brian is a long time member of the Neo4j ecosystem, and has contributed a lot to the Ruby gems for Neo4j.  He's currently travelling the world with his family, but found some time between flights to talk to me about his work. Here is the conversation:

Here's the transcript of what we talked about:
RVB: Good morning everyone. My name is Rik, Rik Van Bruggen. I'm from Neo Technology, and here we are again recording another episode for our graph database podcast. I'm joined today by Brian Underwood, welcome Brian. 
BU: Yeah, thank you. 
RVB: Great to have you on the podcast. So, these series are quite short and snappy, right, Brian. Do you mind introducing yourself here, what's your background? 
BU: My background is mainly as a full stack web developer. A lot of rails, a little node. And I'm currently working as a developer evangelist for Neo4j. I'm also one of the maintainers of the Neo4j.rb project, the ruby gems for Neo4j. 
RVB: Wow, you do all that while traveling the world, right? 
BU: I do, yes. I'm currently traveling the world with my family. 
RVB: That's fantastic. Brian is going to be in Stockholm—I’m in Antwerp, so it's great to have you in the podcast. What got you into the world of graphs, Brian? Do you mind telling a little bit about the history, and how and why you got into it? 
BU: Yeah, for sure. A couple of years ago I was working for Couchsurfing.com, and I had one of my colleagues there, sort of had heard about Neo4j, and told me about it, and it sounded interesting. While I was there-- it's a major social networking site, and so I thought I would play with graph database and see if I could import a little bit of data. I don't think I made too much progress at the time, I was doing other things, and working on the side. But I played with it, I really liked it. I like Cypher. I remember being really excited about the 2.0 beta, and I was like, "I have to"-- being excited about labels particularly, and so I had to use the beta, not the 1.9 [chuckles]. 
RVB: At your own risk, right [chuckles]? 
BU: Yeah, exactly. There was-- I think I might get too. But it worked quite well. So I did that for a little while, that fell to the way side, and six months later-ish I was looking for an open source project to contribute to, or something to spend my time with. So I thought about the Neo4j gem, which was at the time maintained by Andreas Ronge. So I contacted him and said, "Hey, can I help out?" So I've been helping out ever since. 
RVB: Okay, fantastic indeed. It's a very active project, I think. The ruby wrapper for Neo4j? 
BU: Yeah, definitely. We're very responsive on Github and Stackoverflow to people's questions. We really like helping people out, because we are really excited about the project. And we are actually-- right now we're trying to sort of push for a new release for the gem. So I have been putting together a list of issues that-- things that we want to see in there. 
RVB: So what attracted you to Neo4j? What made it interesting for you? What was the key thing that you loved about it? 
BU: I really-- I don't know if you're familiar with active record? In the Ruby world… 
RVB: That's not my forte for sure. 
BU: Just very briefly, active record is a wrapper around SQL databases, so it works with Postgres, MySQL or SQLite. And it offers a higher-level abstraction on top of those - a modeling abstraction - and it's very, very powerful, and very, very deep. I spent years getting used to it, and getting into it. I considered myself an active record expert, and very into databases. And so using Neo4j I was like, "Oh, I really wish there was something like active record for Neo4j." But also, to me it seemed like active record was smoothing away a lot of the awkwardness of SQL, whereas with Neo4j you didn't really have to work against the database, as far as modeling and data abstraction. That's really, I think, what attracted me to Neo4j - the smoothness. And that's also why I wanted to work on the gems, because I think you could not only provide that same abstraction as active record, but do even more than… it was really exciting. 
RVB: I'm going to quote you on that one. I don't have to fight the database any more. That's a great quote. I like that one. So-- 
BU: Go ahead. 
RVB: No, no, no. I said it's a great way of putting things. If you don't mind, where do you see the ruby wrapper going, and where do you see graphs going? What's your vision for the future, if you don't mind talking about that? 
BU: Yeah, I sort of see ruby and rails as providing a framework, where you don't have to do a lot of the busy work that you'd normally need to do to get things done. I see Neo4j as sort of being in that same vein, where it lets you work at a higher level, and you don't have to think about the details as much. And so that you can get things done faster, but you can also get things done that you might not have considered before, that might have been really difficult before. Rather than just doing web applications, where it's just like, "Update this object, and create this new object", it's like, let's have a complex data structure that we're working on that makes a really cool web application that you could do some things that you couldn't do before. 
RVB: Is that going to be more possible in the future, you think? Is that the type of evolution that you see coming in the future? 
BU: I think so, yeah. I think we - certainly in the gem - I have my mind vaguely on this, even though we're not quite there yet, but I definitely want to make it really easy to load a complex structure of data in one go, to make it easy to create an API endpoint, or a web page generated from that data, a lot easier than before. 
RVB: Cool. Well, it will be great to keep following that, I'm looking forward to that. We're going to keep it at that, if that's okay for you, Brian? Unless there's anything else that you want to communicate to the rest of the world? 
BU: I don't think so. That was great. 
RVB: Fantastic. Thank you for coming on the podcast, I really appreciate it, and I'll put the links to the Github repo, and everything on the blog post that goes with this. Thank you, Brian. 
BU: Great, thank you. 
RVB: Thank you.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best


Monday 13 April 2015

Importing the SNAP Beeradvocate dataset into Neo4j - part 1

As you may or may not know, I am a big fan of Beer. And I am a big fan of Neo4j. I did a talk about this at GraphConnect two years ago ...
- and I have been doing that demo a lot with lots of users, customers and meetups. Up to the point where I was really working up a reputation as a "Beer Guy". Not sure if that is good or bad...

So some time ago I learned about the SNAP: Stanford Network Analysis Project, who mention an example dataset that is really interesting: 1.5 million Beer-reviews from Beeradvocate between January 1998 and November 2011. Here are some stats:
And then at the bottom of the page it says:
Disappointment!!! But then one simple Google search led me to this blog, and a 445 Mbyte download later we were in business. Unzip the thing, and we have a 1.5 Gbyte text file to play with.

Importing the data? Not quite, yet!

Once I had downloaded the data, I found that there were two big "bears" on the road. Two problems that I would have to solve.
  • Problem 1: the size of the data. I would not really call this file huge, but on my little laptop (I "only" have 8 Gbyte of RAM on my machine) it can get kind of tricky to work with biggish files like that... Based on experience, I just knew I would get into trouble and would need to work around that.
  • Problem 2: the structure of the data. Here's a look at the file:

This definitely does not look very Neo4j-import friendly. I need to do some transformations there to make the thing easy to import.

So how did I go about solving this? Read on.

Solution 1: splitting the text file

The first thing I set out to do was to split the text file into different parts, so that it would be easier to handle and transform along the way. I looked around for a while, and then found the split bash command - by far the simplest, most performant and straightforward option. I played around with different options (different sizes for the splits - I first had 6 files, then 3, and ended up choosing to split into fifteen 100 Mbyte files), and eventually used this simple command:

split -b 100m beeradvocate.txt

This was the result
Nice - 15 perfectly manageable files! Once that was done, I needed to make sure that the split was not completely arbitrary, and that "whole records" were included in every file. Easy peasy with some simple copying and pasting from the end of each file to the beginning of the next - done in less than 5 minutes! Job done!

Solution 2: transforming the .txt files

Then I needed to get these manageable files into a .csv format that I could import into Neo4j. This was more complicated. I needed to go 
  • from a structure that had records in blocks of 14 lines, with a property field on every line
  • to a .csv file that had a row per record, and fields in comma-separated columns
That's quite a transformation, for which I would need some tooling. I decided on trying OpenRefine. I had heard about this tool a couple of years ago already, but never made any serious use of it. Now I thought it would come in handy, as I would have to go through every step of the transformation 15 times - once for every .txt file that we generated above.

So I fired it up, and created the first of 15 "projects" in Refine. This is what it looked like before the transformation :

After one parsing operation using the "line-based text file" transformation, I already got a preview that looked like this:
Already looks kind of right - at least I already have a row per record now.

Now I would need to do some manipulations before the text file became usable. OpenRefine has this really wonderful transformation tool that allows you to create manipulation steps that you can execute and process step after step. The main steps were:
  • extracting the "field name" from every cell, leaving just the data in there. Every cell was currently structured as "<<field name>>: <<field value>>", and I wanted every cell to contain just the "<<field value>>".
  • renaming the columns so that my .csv file would have nice workable headers. This is not mandatory - but I found that easier.
When you do this manually, it can be a bit cumbersome and repetitive - so you definitely don't want to do this 15 times. That's where Refine is so nice: you can extract a .json file that specifies all the operations, and then apply these operations to forthcoming "projects" afterwards time and time again.
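As an aside: the first of those manipulations (stripping the field names) can also be sketched in a few lines of awk, for those who prefer the command line. This is illustrative only - it assumes each record is a blank-line-separated block of "field name: value" lines, and the sample values are made up:

```shell
# Sketch of the field-name-stripping step (assuming awk instead of OpenRefine):
# paragraph mode (RS="") treats each blank-line-separated block as one record;
# we keep only the values and join them with ";" into one CSV row per record.
to_csv() {
  awk -v RS= -v FS="\n" -v OFS=";" '{
    out = ""
    for (i = 1; i <= NF; i++) {
      line = $i
      sub(/^[^:]*:[[:space:]]*/, "", line)   # strip the "field name: " prefix
      out = (i == 1 ? line : out OFS line)
    }
    print out
  }'
}

# Tiny illustrative two-record sample in the same shape as the split files:
printf 'beer/name: Sausa Weizen\nreview/overall: 1.5\n\nbeer/name: Red Moon\nreview/overall: 3.0\n' | to_csv
# -> Sausa Weizen;1.5
# -> Red Moon;3.0
```

On the real chunks that would be something like `to_csv < xaa > xaa.csv` (xaa being one of the files produced by split). Note it does not emit a header row - OpenRefine's column-renaming step has no equivalent here, so you would add the header line yourself.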

So you do it once, and then you can extract that .json so that you can use it later on the other 14 files - which is exactly what I wanted.

That .json file is also on github.

The result was exactly what we wanted: a comma-separated file that would be ready for import into Neo4j.
Download the file and we are done with the import preparations for this file. Of course you would need to do this 15 times, but thanks to the .json file with the operation steps, that was really easy.

You can find the sample files (the .txt file, the Refine operations .json, and the resulting .csv file) on github.

That was it for part 1. We now have everything ready to do an import. In part 2 we will do the actual import into Neo4j and start playing around with it.

Hope you enjoyed this.



PS: Here are the links to

Friday 10 April 2015

Podcast interview with Philip Rathle, VP of Products at Neo Technology

As you may have heard here and there, Neo Technology released this amazing new version of Neo4j - version 2.2, about 2 weeks ago. Perfect time for me to go to Philip Rathle, the VP of Products of Neo, to talk about where we are and where we are going. Great conversation, as you might expect - and also a bit longer, as you also might expect. We had so much to talk about:
Here's the transcript of what we talked about:
RVB: Hello everyone. My name is Rik, Rik Van Bruggen from Neo Technology, and here we are again recording another great session for our Neo4j Graph Database podcast. It's a remote session again, over Skype, and all the way across the Atlantic this time, because I've got Philip, Philip Rathle, VP of Products for Neo, on the other side of this Skype call. Hi, Philip.

PPR: Hi Rik and hello to your lovely listeners.

RVB: Yes indeed. There's a little bit of a delay on the line, I think, but we'll manage. Thanks for joining us and do you mind introducing yourself a little bit, Philip, so that our listeners know who you are exactly.

PPR: Sure. So I'm Philip Rathle. I work out of San Mateo, California headquarters here in Silicon Valley and I've been with Neo for just almost three years and essentially I do all things product management.

RVB: Yeah, absolutely. As I recall, you started about the same time as me at Neo, didn't you?

PPR: I think so. I think we were within a couple months of each other.

RVB: Absolutely, yeah. So Philip, this podcast is always nice and short and sweet. There's two big topics that I want to cover with you. The first one really is, what do you love about graphs, and what attracted you to graphs? You've got a long career in databases behind you already. What brought you to the graph, and what do you love about them?

PPR: You're right. I had spent most of my career - I guess close to 20 years at that point - working with data, and working with databases, and of course all of that was relational for the most part. And some of that was doing consulting, some of that was doing TPA and data modeling work and some of that was actually doing product management around tooling for data modeling, and database administration, database development long before it was popular. Thankfully, it's gotten popular these days. I've been fascinated with this idea that there's all this information and that there's an opportunity to better interact with the world, be more effective, use time better, give people what they need, what they want faster by leveraging information in good ways.

PPR: And there's this one distinction we'd always talked about: the difference between data and information. Data is this raw stuff that's "use at your own risk", and then you synthesize it and come to understand it and model it, and use it, and it becomes information and therefore valuable. And I was fascinated with this idea that you could have a data model where the logical and physical model were the same, meaning that the way that a business person viewed data and viewed information could actually be much, much closer - if not almost identical - to the technologist's view. And one of the big disconnects that's always happened with projects, and one of the big costs, and a lot of the frustrations, come out of the fact that business and IT are misaligned.

PPR: But I don't think that's the root cause. I think that's a symptom of the fact that business and IT were viewing the same thing through a very different lens, and so it became hard to communicate. That was the thing that really hooked me to start, and as I started digging more and seeing the kinds of other things that people can do with it, and the kinds of performance you can get out of a native graph database, and the kinds of schema flexibility - not having to spend months and wait for migration windows, where you would do this huge all-or-nothing thing, and maybe spend half an evening rolling it forward and half an evening rolling it back - those were pretty nice side benefits, you could say.

RVB: Wow. Yeah, absolutely. I mean, the model is something that has been coming over this podcast time and time again. It's something that people really love about the graph. Maybe we can turn a little bit to what about Neo4j specifically, right? We've released this beautiful new version 2.2 this week, congratulations.

PPR: Super happy about that.

RVB: Yeah, absolutely. Everyone is, I think, and the feedback has been super. But what do you love about that? What's so great about 2.2, maybe a couple of minutes on that?

PPR: Well at the time I joined Neo-- and by the way I felt so great about the stuff, I actually joined on my birthday, can you believe that? At the time, and this was in mid 2012, I saw that this had amazing potential, not only potential it was actually being used for really serious stuff by some big companies, by some cool startups. And the observation though is that it's really, really amazing, but it takes a little bit of hacking and wiring and working around things to get it to work. So, it was an amazing technology if you invested some amount of time getting it working.

RVB: Right. That's where I get my gray hair, by the way [laughter].

PPR: Yeah. I believe you [laughter].

RVB: The early versions-- I mean the early versions of Neo were difficult, right? I mean, they were much more difficult than these than we have right now.

PPR: Yeah, this is what happens when you have brilliant engineers who are focused on the really, really hard stuff, which is building a database engine that is reliable and fast and scalable, and that's such a gargantuan task that it can be easy over time to forget the easy stuff. And it's not easy actually, user experience and defining the right surface and access methods and tooling isn't easy, but it's a very different mindset. And so we-- around the time I joined, all of the work-- well, it's an ongoing thing, but there have been so much work done to create a database that was solid and fast and could scale, that there hadn't been very much investment and actually taking that technology and making it more broadly accessible and easily usable. And so since the time I've joined, it's been an ongoing journey of - and we can talk about the different release themes and how we shift from released to release - what our focus is on, but it's steadily evolved to become something that's not only approachable, but I think really pleasant in a lot of ways. I mean, geez, graph-- karaoke, how many databases do you see doing that?

RVB: Absolutely, I'm a big fan [laughter]. No absolutely. So I mean it's been a fantastic journey I think both in terms of usability the 2.X series of Neo4j, but in 2.2 I think it's amazing what we're seeing in terms of performance right?
PPR: Yeah. So with 2.0 the focus was-- let's see, I joined before we just started working on 1.9. I think 1.8 had just come out, and 1.9 was all about improving the infrastructure used to do the clustering, so you didn't have to run a Zookeeper cluster alongside a Neo4j cluster. And then with 2.0, the shift with the major version was: we're going to focus on the Cypher query language - as opposed to essentially native Java APIs, which, if you're not a Java developer or you're not into writing lots of imperative code, is nowhere near as approachable and convenient as writing a declarative query, particularly one that has these characteristics of compactness and readability, with your nodes enclosed in parentheses, and your relationships with your arrows and so on.

PPR: And to do that, we found we actually needed to change the fundamental model and add this thing called labels. We also had this observation that the user interface we'd had up until then - which we, or at least I, considered really, really limiting, because I came from a tooling background - turned out to be something that people really, really loved and appreciated. And that told me, anyway, that what they appreciated was the power of being able to actually visualize the graphs. That's a unique aspect of the model. We focused on those three areas and came out with a release that was much more consumable. As you often do with these things, you swivel the chair: you work on features, and then you swivel the chair back and you say, "Okay, I'm going to take the whole thing, particularly the ensemble including these new features, and I'm going to make it perform even better in every way." And perform means latency, i.e. response time. It means response time under high load, because the response time of one query at a time is maybe what you notice when you're trying the technology out, but that's not how it works in production - you're going to have lots of things happening all at the same time.

PPR: And actually, one of the things that wasn't headlined was a huge investment in quality. We have, on any given day, dozens to hundreds of Neo4j instances on cloud hardware, physical hardware, clusters, non-clusters, big clusters, small clusters, doing all sorts of tests: long-running tests, stress tests, load tests, and let's-pull-the-plug tests. That's ultimately what a database needs to be. It needs to be resilient across a whole range of edge cases, because any given person is going to be dealing with hundreds or thousands of those edge cases throughout the life of their application. So there are hundreds of thousands of tests that run internally on a daily basis, just hammering the database. That's always happened, but it's happened even more -- significantly, significantly more -- with 2.2, so I feel really, really great about that.

RVB: Yeah, absolutely. I think the initial feedback has been absolutely fantastic. It's been a really proud moment this week. Maybe I can switch gears a little bit and ask you one last question, Philip, if you don't mind. And that's this: you're VP of Products, so what does the future hold? Where do you see this going, maybe short term, but primarily also long term? Where do you see Neo and graph databases going?

PPR: Let me answer that, maybe starting from a different place than you might expect, which is to talk about where the market is going, because the product needs to reflect where the market wants to go, and where the market can go even though it doesn't realize it yet [chuckles] and doesn't know it. It's the old secret of--

RVB: The Steve Jobs approach, right? [chuckles]

PPR: I might see an application and say, "Okay, it'd be really convenient to have a big red button here." But actually, the best solution may not be to give me a big red button; it may be to address something two or three levels back, so that that screen or that interaction doesn't even need to happen. And so, where the market seems to be headed is that there's a wider range of use cases where businesses are finding it valuable not just to look at data as things in isolation, and not just to view joining data as reconstituting something you needed to break apart because the relational model required it, but to actually understand the causality and the relationships and the effects between related things. That's the world we live in, and we've oversimplified it for a long time -- I think maybe because we didn't know better, but actually more because of the technology limitations we've had. You can't have a high-performing native graph database without very fast random IO, because you're hopping. You're doing pointer chasing. And in the days when you had very little memory and spinning disk, tape, and punch cards or whatnot, that just wasn't feasible, and so--

RVB: That's probably why the old CODASYL databases failed, isn't it? [chuckles]

PPR: Well, I think there are a few reasons for that. I think the other is that you didn't have the model flexibility either. You still put things in buckets, so rather than having an individual data item have a relationship with another individual data item -- Rik is a colleague of Philip -- you were actually creating structures, generic buckets, into which you'd throw things, and that's not dissimilar from relational databases. Of course you have an equivalent on the logical side: there's a conceptual meta model with a graph, but then the data itself actually looks just like the meta model. You're physically relating individual things, not buckets of things. And so the technology has evolved to make more things possible, and it's now just a question of how fast, and in what directions, that wave will grow across different industries and different use cases, as people appreciate and discover what new things they can do, or what existing things they can maybe do in real time instead of as a batch pre-compute.

RVB: So what might be your personal favorite use case? Do you mind pulling one out?

PPR: Yeah. Geez, there are so many. People talk a lot about the Internet of Things these days, but what would that be without the connections? I think of it as the Internet of Connected Things. That's certainly one. Identity and access management is one that maybe isn't immediately intuitive to people, but it's-- I have a content hierarchy on one side and a person-to-group-to-group-to-group hierarchy on the other side. And of course these things aren't always strict hierarchies, where you have just one top-to-bottom or side-to-side or--

RVB: They're multidimensional, right?

PPR: Yeah, then it becomes a graph effectively, and then connecting these two hierarchies adds another dimension. That's actually a really good one. Sometimes we see ones that are really unexpected and fun, and that's part of what has made this whole journey really interesting. I love going to the graph gist page -- I think it's gist.neo4j.com, or maybe it's neo4j.com/gist.
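As a sketch of the access-management pattern Philip describes -- a user resolved through nested groups down into a content hierarchy -- a hypothetical Cypher query might look like this (all labels and relationship types here are invented):

```cypher
// Walk up the user's group memberships (any depth), then down the
// content tree, to answer "which documents can this user access?"
MATCH (u:User {name: 'Rik'})-[:MEMBER_OF*1..]->(g:Group)
      -[:CAN_ACCESS]->(top:Content)<-[:CHILD_OF*0..]-(doc:Content)
RETURN DISTINCT doc.title
```

Once the two hierarchies are connected in one graph, the whole question becomes a single variable-length traversal rather than a series of recursive joins.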

RVB: graphgist.neo4j.com, I believe.

PPR: Where people will just come up with wild and crazy ideas, and oftentimes, of course, the graph turns out to be a really good fit. Weighing an airplane was a surprising one. I didn't expect that. Turns out you can do it much faster with a graph.

RVB: Cool. What I'm hearing is that there are lots more beautiful use cases to come, right? So that's the most important thing you see coming up?

PPR: So that then drives the features, and I think, "Where does it create demand?" It creates demand for more convenience, so there'll be more of that: improvements in the developer experience, the ops experience, and so on, as well as continued improvements in scale and performance. Those are really the themes we track, with quality and reliability underlying all of that.

RVB: Cool, Philip. We've already gone 18 minutes, so I'm going to wrap up if you don't mind, because we want to keep these reasonably short.

PPR: Let's do it.

RVB: Thank you so much for coming on the podcast. I really appreciate it. If people want to know more about Neo4j, there's only one place to go: neo4j.com, or @neo4j on Twitter. If you want to reach out to us, I'll put the email addresses on the blog post with the podcast. Thank you so much, Philip. It was great talking to you.

PPR: Bye Rik.

RVB: Thank you, bye.
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best