Wednesday 30 September 2015

Podcast Interview with Tim Nash, independent "Wordpress & more" developer

Yeah - I love doing these podcasts. I really do - even though I sometimes need to scramble to find the time to do them - but it's totally worth it. Because very often, I get to talk to some really, really interesting people. This conversation is another one of those occasions - the conversation with Tim Nash. Tim has been working in the WordPress world for a long time, and seems to be infected/affected by the Graph Bug - and is preaching the graph gospel to the WordPress aficionados. Super cool - here's the conversation we had:

And of course here's the transcript of our conversation:
RVB: 00:02 Hello everyone, my name is Rik, Rik Van Bruggen from Neo Technology, and here we are again recording our Neo4j Graph Database podcast. Today I'm joined by someone who is probably in a much nicer location than I am, out in the Yorkshire Dales, Tim Nash. Hi Tim.
TN: 00:20 Hi Rik.
RVB: 00:21 Hey.
TN: 00:21 Thank you for letting me come on your show.
RVB: 00:23 Well, thank you for joining us. It's always great when people make the time. I know everyone's busy. Thanks for coming on. Tim, the first question I always ask everyone is who are you, what do you do, and what's your relationship to graph databases? Do you mind telling us a little bit more about that?
TN: 00:39 My name's Tim Nash. I primarily run training and consulting development services on timnash.co.uk, get the plug in early. I work-- [crosstalk] [chuckles]. I'm primarily an e-commerce and security consultant. However in the last few years I've been drawn into the world of WordPress, which is a content management system, and isn't my natural home, and isn't the natural home for many developers, so I've been working to try and change that and show the development opportunities there are within WordPress and its community. I do stuff, is what I describe to people who ask me what I do. And in reality, I used to write code and now I tell people mainly how to write code.
RVB: 01:27 Very very cool. How did you get into graph databases? What's your relationship to that wonderful part of the industry?
TN: 01:34 About six months ago, I was sitting watching a video from a conference about in-game economics, and it was all being modelled using Neo4j. I watched it and thought, "Do you know what? That's really cool, and I want to have a play with it." So I had a little side project that I was mucking about with, and started to just play around with Neo4j, and thought, "This is really really useful and really interesting," and it sort of developed from there. The more I played with it, the more I thought there are opportunities to use this on a wider platform, not just inside my little projects, and started thinking about how we can use it potentially alongside WordPress projects.
RVB: 02:19 That video that you watched, that's from Yan Cui from Gamesys I think. He's been doing that talk at a number of conferences, among which QCon in London this spring, and he was actually one of the first guys that came on the podcast, so there's an interesting link there.
TN: 02:38 In fact, it was that conference video that I was watching.
RVB: 02:42 Was it [chuckles]?
TN: 02:43 A nice full circle.
RVB: 02:45 Fantastic. So where's the link with WordPress then? How do you see that link?
TN: 02:52 For those people who don't know WordPress: as I say, it's a content management system. It's built on a pretty standard LAMP stack for the most part. In particular it's using MySQL and PHP. Now, MySQL, while lovely as a relational database, bizarrely isn't really used as a relational database in WordPress. They have this idea of posts and pages and custom post types, except they all sit in the same tables, posts and post meta. And individual posts, pages and custom post types have no natural way within WordPress to actually model relationships.
TN: 03:31 So you're sitting there going, "Right, I need to link post A with post B and say what that relationship is." There's no way to do that within WordPress. Lots of developers have built their own hacks around it, but as you can imagine, when you've got pretty much a flat structure like that, any attempt to develop this is going to result in a hack, and you're going to end up with a huge pile of extra data sitting in the database that really doesn't need to be there. And it's just really CPU-intensive to try and get this data out.
RVB: 04:04 Especially on larger sites, I suppose, right?
TN: 04:07 Yeah. If you're talking a few hundred posts, it's fine. But if you're talking companies like Wired and the New York Times and similar, who are using it as a traditional publishing platform, they have tens of thousands of articles. Then you've got sites that are using it as just a pure database storage system. One of the projects-- a side project from one of the Human Genome people, actually published their data through WordPress, so they have not tens of thousands, but hundreds of thousands of posts. And very quickly, those databases creak.
TN: 04:41 So I started looking at the graph side of things and thought, "Well, what if we keep the relational database, but just push this data into the graph database, and then go-- we can query the graph database for the relationships and then pull that content back out of the relational database." And found that worked really well. And as I'd already started playing with Neo, I stuck with it.
RVB: 05:05 Very cool. I think it's a super interesting way of doing things, right? We see that all the time: people actually use the graph database as a complement to their existing relational database systems. It's a really nice fit, right? So you started this about six months ago, and then where is that project now? Is it going anywhere?
TN: 05:28 Six months ago I started playing with it, and at the time I didn't have a big project to hoist this upon, so it was very much my own playing around. And when you don't have a project that you can actually immediately use it on, what you do is you jump up and down and shout a lot about something until someone gives you a project that you can use it on. So I spent the last six months or so, mainly at conferences, going around and just generally talking about this idea of graph databases.
TN: 06:01 I've been quite lucky in some respects that - while perhaps to boos and hisses from the Neo community - obviously Facebook's announcements with their own graph database bits have piqued the interest of some people. So I've been able to piggyback on that, and say, "But look, here's something that's actually working and already exists." So at the moment, it's been very much about looking at how we can do this sort of thing inside WordPress, how we can push that data across, and getting some of the larger projects to at least consider this.
TN: 06:35 Now, there is a lot of precedent, because two years ago if you uttered the word ElasticSearch at a WordPress conference, they would just look at you and go, "What sort of black magic is this?" And now, things like ElasticSearch are considered an almost mainstream complement to a WordPress site. So we're hoping to be able to just start introducing graph databases in the same sort of way that we interact with ElasticSearch, bringing them in as a complement. And hopefully that means that Neo will be sitting at the top of the list, as it's the one that's first through the door, if you like.
RVB: 07:14 So how would it work then? Every time you update something on the WordPress site, it would automatically propagate into the graph database? Or would it be kept in sync in some way? What's the idea there?
TN: 07:27 WordPress uses actions and filters. So basically, every time you do an action on the site, there's probably some sort of event listener that's attached to it. So with that event listener, we can push that data across. You'd probably have it as a-- in what I've been testing, it's always been a one-way push. So we're pushing data into Neo, and then querying Neo. So when you hit publish on a post or even when you hit save on a post, that data and associated post meta - so think: important bits and pieces that you want to query - get pushed across into Neo. And then rather than querying WordPress via its normal search and loop system, you actually make the queries directly to Neo, which returns you a little bit of the information that you've already given it. But more importantly, it will give you back the post ID, which you can then use to call through to your relational database.
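To make Tim's one-way push a bit more concrete, here's a minimal Cypher sketch of the kind of statement such an event listener might send across on save - the labels, property keys and relationship type are made up for illustration, not taken from an actual plugin:

merge (p:Post {wp_id: 42})
set p.title = "Hello graphs", p.author = "tim"
merge (t:Tag {name: "graphs"})
merge (p)-[:TAGGED_WITH]->(t);

The query side then only has to return matching wp_id values, which WordPress resolves against MySQL as usual.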
RVB: 08:22 I think there's also some interesting applications afterwards, right? If you would, for example, use the data in Neo to figure out the relationships between the documents, recommend new documents, new pages. All that kind of stuff becomes possible once it's in Neo, right?
TN: 08:39 Yeah. The very first example I did of this within WordPress was for related posts. And to be able to define related posts and let users define what they considered to be a related post. So some people might think that because the posts are written by the same author, that means that they're related, which obviously it could be. But others could be looking for more complicated relationships between various tags, or even whether these posts are part of a series - which actually is something that you'd think should be really obvious and easy to build into WordPress. Even the ability to say, these four posts belong to this series and should always be grouped together, is something you can't easily do. But when you push this into a graph database, saying these things are in a group just becomes super easy.
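For illustration, here's a hedged sketch of how such a series grouping could look in Cypher - again with made-up labels and relationship names:

merge (s:Series {name: "Graphs meet WordPress"})
with s
match (p:Post) where p.wp_id in [101, 102, 103, 104]
merge (p)-[:PART_OF]->(s);

match (p:Post {wp_id: 101})-[:PART_OF]->(s:Series)<-[:PART_OF]-(sibling:Post)
return sibling.wp_id;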
RVB: 09:33 Cool. So let me ask you a little bit about the future. Where do you think this is going then? Where do you see this evolve within the Wordpress world, or maybe even beyond that?
TN: 09:46 Hopefully within the WordPress world, we'll start seeing some actual production uses of graph databases taking data from WordPress and manipulating it. There are some limited experiments going on at the moment. There are a couple of small production sites that are making use of it, but it's on a very small scale at the moment, and we haven't got our big publishers doing this yet, but there's a lot of interest in it. So, one of the things that really helps adoption, particularly in the WordPress world, is the use of plug-ins. So the next step is to get a group of us together to actually build a common framework for plug-ins, so that we are not basically having to recreate the push and the sync aspect of this every single time. Again, there are good examples of this from the work that's been done with ElasticSearch through several companies who operate inside WordPress communities. So hopefully we'll see that develop and we'll be seeing many more sites that are making use of graph databases in general, but in particular Neo, over the next few months, and into the next year. As for after that, who knows.
RVB: 10:57 [chuckles] Exactly. Well you know, maybe we can get you to talk about your work at one of the meetups as well. That would be lovely, if you can find the time at some point.
TN: 11:08 That'd be great.
RVB: 11:08 Sure. Cool Tim. We're 11 minutes into the podcast so I think we're going to wrap up. We want to keep these things short and snappy. So thank you for taking the time, really appreciate it.
TN: 11:20 No worries.
RVB: 11:21 I hope to meet you at some point in the future.
TN: 11:25 Okay. Lovely to meet you.
RVB: 11:27 Thank you, bye.
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

Monday 28 September 2015

Part 3/3: Querying the Global Terrorism Database (aka my POLE database) in Neo4j

In part 1 and part 2 of this blog series, I talked about how we at Neo4j are seeing more and more customers using graphs to model, store and manage their Person-Object-Location-Event relationships. It's a really great use case, and I have been toying with the Global Terrorism Database (GTD) to try and illustrate what graphs bring to the table there. In the first post I explained some of the concepts, and in the second post I was able to import the GTD into Neo4j without too many hiccups. So now, in this third and final post, we are ready to do some POLE querying. Excited to see what we can find out.

Start Simple - in Antwerp

You will find all of these queries on GitHub, of course. But let me take you through some of my experiments, one by one.

We start out by trying to find my hometown, Antwerp, Belgium, in the GTD. In order to do that, we just want to make sure that we have all the indexes in place on our database:
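In the neo4j-shell, that's a quick "schema" to list what's there - and a create statement for anything that's missing. A minimal sketch, assuming the labels from our model:

neo4j-sh (?)$ schema
neo4j-sh (?)$ create index on :City(name);
neo4j-sh (?)$ schema await;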

That looks ok. We have indexes on :City(name) in place, so we can do something like this:

match (c:City {name:"Antwerp"})-[r]-()
return c,r;


Looks simple enough, but the result was a bit disturbing. Look at the graph below:
Seems like there are multiple "Antwerps" in the database, and that is a bit of a pity.

This caused me to look a bit into some of the data quality aspects of the GTD, and I did uncover a bit of an issue. If I run this query:

match (c1:City)-->(ps1:ProvState)-->(c:Country),
(c2:City)-->(ps2:ProvState)-->(c)
where c1.name = c2.name
and ps1.name <> ps2.name
return distinct c1.name, c2.name, ps1.name, ps2.name, c.name;

Then I unfortunately get this HUGE result set with 27274 rows of essentially badly entered data.

As part of this blog series I did not have the time or inclination to try and correct the data in any way, but it feels like we could do some data cleansing there.

So let's proceed to find some more terrorist events in my home country:

match (e:Event)-->(ci:City)-->(ps:ProvState)-->(c:Country {name:"Belgium"})
return c.name, e.id;


Good! Only 21 terrorist events in 44 years of history - that's ok. Happy to live in a safe place like that, especially if I compare it to some of the other countries. So let's group them and count the number of events per country:

match p=(e:Event)-->(ci:City)-->(ps:ProvState)-->(co:Country)
return distinct co.name as Country, count(e) as NRofEvents
order by NRofEvents DESC

Clearly Iraq and Pakistan are not the safest places, but I was actually surprised to see the UK, the USA and the Philippines in the "top 10". Interesting.

Then of course you can also look at some interesting stats.

Terrible Terrorism stats - the graph does not lie

Let's say we would like to know more about specific dates, and find out which dates had the MOST terrorist events, or which had the deadliest events. Querying for that is pretty easy, if we understand that the event-ID that is part of the event-nodes is structured as YYYYMMDDxxxx, where xxxx is a sequentially incremented event id... Let's look at a query to find the deadliest day in this GTD:

match (e:Event)
return left(str(e.id),8), sum(e.nkill), sum(e.nwound), count(e)
order by sum(e.nkill) desc
limit 10

The result is:

Or if I just want to look at the sheer number of events, then I just change the ordering:


So it's pretty darn clear that June 14th and June 15th of 2014 were very special days - in the worst possible sense. Let's take a closer look.

match (e:Event)
where left(str(e.id),8)="20140615" and e.nkill is not null
return e
order by e.nkill desc
limit 1

Exploring this a bit further in the Browser gives me:

Now I don't want to bore you with too many details, but... this one is quite amazing: the Camp Speicher massacre, committed by ISIS in Tikrit, Iraq. The story is unbelievable and sad, as told by one of its survivors here and here. Warning: shocking stuff.

After that depressing find, I actually started to dig around with some specific attacks in mind. So let's look at these.

Digging into some specific terrorist events

So the first one I decided to take a look at - not sure why, but anyway - was the Oklahoma City Bombing. I don't know that much about that event, but I do vividly remember the pictures on TV and the unbelievable story behind it - as if it was taken from a road movie. So let's look that one up.

First we try to find the city:

match (c:City {name:"Oklahoma City"}) return c;

By then using the Browser we can very quickly identify the event (which has an id of 199504190004) and look at some of the details:

match (e:Event {id: 199504190004}) return e;

And then we can start to zoom into the surrounding graph:

Interesting, and intuitive.

So then, of course, we can take a look at another event - one of the most hideous and cruel attacks ever: 9/11. To do this, we need to take another look at the data and find the event. Using the same mechanism as above, I will take a look at the specific events on that sad day:

match (e:Event)
where left(str(e.id),8)="20010911"
return e

So very quickly, we then find out that there were a number of different attacks on that day:
So let's focus on the ones that happened in the USA by simply expanding our pattern:

match (e2:Event)
where left(str(e2.id),8)="20010911"
with e2 as Event
match p = allshortestpaths ((Event)-[*..3]-(c:Country {name:"United States"}))
return p

Immediately we see the 4 related attacks on the same day:
Now last but not least, let's see if we can find interesting connections. Because of the data quality issues that we uncovered above, the connections highlighted here may not be perfect or all that meaningful, but the principle, I think, could be super interesting in a true, well-maintained POLE database.

match (e2:Event)
using scan e2:Event
where left(str(e2.id),8)="20010911"
with e2 as Event
match p = allshortestpaths ((Event)-[*..3]-(c:Country {name:"United States"}))
with Event
limit 1
match p2 = allshortestpaths((e1:Event {id: 199504190004})-[*]-(Event))
return p2
limit 10

As you can see below, you do get an interesting view and a number of "suggested" links between different events - look at the "similar" events in the middle for example.
No doubt that in a true POLE database application, that would be a great source of inspiration for further exploration.

That's about all I have to share on my journey with POLE databases, the Global Terrorism Database and Neo4j. I hope it was useful for you - I certainly found it a very interesting exercise.

As always - feedback and comments would be very welcome.

Cheers

Rik

Friday 25 September 2015

Part 2/3: loading the Global Terrorism Database into Neo4j

In the previous post in this series, I explained what we were trying to do by taking the Global Terrorism Database and using it to create a Neo4j graph database based POLE database.

As you may remember, the GTD is a really big Excel file. So the first thing that I did was a minor clean-up operation on the Excel file (there were a few columns in the file that did not make sense to import, and that were really causing me a lot of pain during the import process), and then I basically converted it into a nice CSV file that I could use with Cypher's LOAD CSV commands.

Creating the import script

Now, I have done this sort of thing before, and I can tell you - importing a dataset like this is never trivial. One thing that I have learned over the years, with lots of trial and error, is that it is usually a good idea to import data in lots of smaller incremental steps. Rather than trying to do everything at once, I created 50 queries for this particular GTD import task: 50 iterations through the csv file, piecing together the different parts of my graph model from different columns of my CSV file.
The entire import statement is on GitHub; you can run through it pretty easily.
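Just to give you a flavour of those incremental steps, here's a hedged sketch of two of them - the column names come from the GTD codebook, the relationship name is my own shorthand, and the file path is a placeholder (the real statements are in the script on GitHub):

using periodic commit
load csv with headers from "file:/path/to/gtd.csv" as line
merge (co:Country {name: line.country_txt});

load csv with headers from "file:/path/to/gtd.csv" as line
match (co:Country {name: line.country_txt})
merge (ps:ProvState {name: line.provstate})
merge (ps)-[:PART_OF]->(co);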

Running the import script

Now, you could run this import script very nicely in the Neo4j browser. There's one problem though: the browser only accepts one statement at a time, and that would mean a lot of copying and pasting. That's why I still very often revert to the good old neo4j-shell. Copying and pasting the entire import statement works like a charm:
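(As an aside: rather than pasting, you can also feed the whole file to the shell in one go - something like this, with an illustrative file name:

./bin/neo4j-shell -file gtd-import.cql
)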
Although I really should mention one thing that I learned during this exercise. The thing is that because I import the data piece by piece, run by run, I had originally also created the relevant Neo4j indexes right before every query. Wow. That turned out to be kind of a problem.

Schema await!!!

The thing is that the Neo4j Cypher query planner relies on these indexes to create the most efficient query execution plan for an operation. And what I noticed in this exercise is that my import queries were sometimes literally taking F-O-R-E-V-E-R, in simple situations where they really should not. I really had to do a bit of digging there and ask a little help from my friends, but I ended up finding out that the problem was simply that the INDEXES, which I had created in the statement right before the import operation, were not online yet at the moment when I was doing the import. The Cypher query planner, in the absence of index information, then proceeds to do the import in the most inefficient way possible, doing full graph scans time and time again for the simplest of operations.

Two really easy solutions to this:

  • create your indexes all the way at the top of your import script, not between import statements. Probably a best practice - see the sketch below.
  • add a very simple command after you create the indexes, to wait for them to be "ONLINE" before the next statement is executed: 
neo4j-sh (?)$ schema await;
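Putting those two fixes together, the top of my import script then looks something like this sketch:

create index on :Country(name);
create index on :ProvState(name);
create index on :City(name);
create index on :Event(id);
schema await;
// only now do the load csv statements follow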

That was a very useful trick to learn - I hope it will be useful for you too.

The result of the Import


After running this script, we end up loading data into the model that we sketched earlier.
After a few minutes of importing, the meta-graph is exactly the same as the model we set out to populate, and is now ready for querying.
That's what we will do in the next and final blogpost of this series: run some real POLE database queries on the Global Terrorism Database. Should be fun. Already looking forward to it!

Hope this was useful for you.

Cheers

Rik

Monday 21 September 2015

Part 1/3: Experimenting with a POLE, the Global Terrorism Database, and Neo4j

In the past couple of weeks and months, I have been having a lot of fun at Neo4j working with different clients. One thing struck me, however (maybe it's a coincidence, but still): we have come across an impressive number of customers that all had very similar requirements: they were looking to use Neo4j as the foundation architecture for a next-generation POLE database. A what? A P-O-L-E database.

What is a POLE, exactly?

I guess everyone has their own definition and wants to create yet-another-vague-acronym, but the common case seems to be that it's like a "case management" tool for specific types of government agencies that want to look at the links between Persons, Objects, Locations, and Events. Some of the cases are to be found in police forces, government (tax / social service) agencies, immigration authorities, etc ... They all have that same requirement of being able to analyse and link different entities together, like so (or similar):
Naturally, most of these clients are not about to share their privacy-sensitive data with us. But I would still want to have some kind of story and demonstration to explain how we could help. So I went looking for some interesting datasets, and ... before I knew it, I found something really interesting.

The Global Terrorism Database

As mentioned above, one of the key areas where people will try to understand the connections between the POLEs, is in police/intelligence work. In fact, we have noticed that many of the Neo4j use cases that we have worked on are in this domain. So where to find interesting data around topics like that...

Like in so many cases I can't exactly reconstruct how I got there, but in the end I found the Global Terrorism Database (GTD). They seem to be very strict about their ownership of the data, so here's some legalese for you:
the data was provided by the National Consortium for the Study of Terrorism and Responses to Terrorism (START). (2015). Global Terrorism Database [Data file]. Retrieved from http://www.start.umd.edu/gtd.
And I must say: they did an unbelievable job. The interface below is super interesting to play around with in the first place.

Then after some playing around I quickly noticed I could actually download the dataset from this page over here.



As you can see, it provides a couple of different documents. The most important ones are:

  • a big, tall and wide Excel file. 
  • a Codebook that explains the meaning of the different data elements in the Excel file.

Opening up the file takes a bit longer than average, but works fine on my machine. It's about 140000 lines long, and I-don't-know-how-many columns (a lot) wide.

So that's when I started to take a few good looks at the data, and found that actually it is a pretty great example of a POLE database. It contains information about

  • Events: the 140000 terrorist attacks from 1970 until 2014.
  • Objects: the weapons / systems / objects used during these attacks
  • Persons: the persons or groups of persons (usually) performing the attacks
  • Locations: the locations of the attacks (by region, country, province/state, city, gps-coordinates)
And actually a bit more than that: the data is more than a "simple" POLE, which I thought made it an even better fit for a potential Graph Model.

Creating a GTD POLE model for Neo4j

So after a bit of examination and experimentation in Excel, I ended up drawing out the following Graph Model for the Global Terrorism Database:
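In Cypher pattern notation, the backbone of that model looks roughly like this - the relationship names here are just my shorthand for the drawing:

(e:Event)-[:OCCURRED_IN]->(ci:City)-[:PART_OF]->(ps:ProvState)-[:PART_OF]->(co:Country)-[:PART_OF]->(r:Region)
(e:Event)-[:USED]->(w:Weapon)
(e:Event)-[:PERFORMED_BY]->(g:Group)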



As usual, with a graph, it feels like a very natural and simple way to talk about the data. So then all I needed to do was to convert the Excel file into a Neo4j database. That should be interesting. So in part 2, we will attempt to load this data into Neo4j.

Hope this was interesting so far!

Cheers

Rik

Friday 18 September 2015

Podcast Interview with Axel Morgner, Structr

Here's a podcast episode that was long overdue: I finally got to speak to a long-time evangelist and community member, Axel Morgner. We met a long time ago in front of an illustrious London pub, and have been working together on a number of projects, activities and ... stuff. Over the years, Axel always impressed me with his unbelievable achievements on Structr, his fantastic enthusiasm and just sheer impressive expertise on all things Neo4j. Listen to the story here:

Here's the transcript of our conversation:
RVB: 00:02 Hello everyone. My name is Rik, Rik Van Bruggen from Neo Technology, and here we are again recording yet another episode for our Graph Database Podcast. Today I'm joined by someone that I should have invited a long time ago. He's probably one of the first people that I met in our wonderful community: Axel Morgner from Germany. Hi Axel. 
AM: 00:24 Hi Rik, how are you [chuckles]? 
RVB: 00:25 Hey [chuckles]. Again, I need to apologise. I should have invited you a lot earlier, but you know how these things go, right [chuckles]? 
AM: 00:32 No worries. I think now it's the perfect time for a podcast. I have some news and I'm relaxed after vacation. 
RVB: 00:42 [chuckles] Super. Axel, we've known each other since I think FOSDEM in Brussels two or three years ago. That's the first time we met each other. But many people might not know you, so do you want to introduce yourself a little bit and tell us who you are and what do you do? 
AM: 00:58 Yes, sure. But I think we met in front of a pub in London after a meetup. 
RVB: 01:05 [chuckles] That sounds likely as well [chuckles]. 
AM: 01:10 [chuckles] Yeah. I'm Axel Morgner from Germany. I'm living and working in Frankfurt, and I'm the founder of Structr, our software project and also our company. 
RVB: 01:25 And Structr, that's been around for quite some time. I remember the days when people thought it was a content management system. But it's much more than that now, isn't it? 
AM: 01:37 Yes. Yes, it has become more of a graph application platform. That's what we call it now. It started out as a content management system when we first had the idea of creating something new based on a graph database, on Neo4j, in 2010. The first attempts were made back then. But over time, we saw the potential in graph databases and Neo4j in particular to create much more. So the short story of the evolution of Structr: we wanted to build a content management system in the first place. Then we saw that we could do a lot more stuff if we made the back-end very flexible. 
AM: 02:32 And then we came up with this flexible schema or data-modeling tool, which was just added for our own projects - to gain speed in the projects - and then we just generalised that into a-- yeah, we called it Data CMS and now we call it Graph Application Platform, because you can do a lot of things in terms of application programming without having to code as much. So you just put the data model in the graph as well, and there are components in Structr which create a [restful?] API and all the stuff out of it. So we have a lot of things that make your lives easier. 
RVB: 03:22 Well, I've seen you do the demo, and there are a couple of recordings even, aren't there? How to build an application in a couple of minutes. It's really impressive actually. I'll put the link to that on the publication of the podcast as well.


AM: 03:38 Thanks. 
RVB: 03:39 So how did you get into the graph story, Axel? And why did you get into it? Can you tell us a little bit more about that? 
AM: 03:48 Sure. The history or the-- my background is I'm a physicist, and for me, software is a tool to solve problems. And after my studies, I started working at Oracle, doing all this relational-database stuff and so on. That was in 2000, and just for two years. After two years, I was kind of fed up with all this, let's say, heavyweight-proprietary-enterprise-database stuff. But nonetheless, we had a very smart team in one of these projects and we decided to create an enterprise content management system based on Oracle, and we founded a little company. And after some years, I thought it's too heavy and too boring and too proprietary, and I wanted to do something new. 
AM: 04:52 And there were some NoSQL databases around and I started looking around and thought, "Okay. If I want to do something with content management where we have trees - so page trees, file hierarchies, organisational trees - let's try a graph database." And Neo4j in Version 1.0 - this is the version I started with - was mature and stable enough to give it a try. And it was embeddable in Java. So I'm kind of a Java hacker. And that was the story. It was a perfect fit to just map hierarchies and trees in a graph database. 
RVB: 05:41 But as I recall it from the content management days - if I can call it that - it was also about the performance, right, that you were able to get out of it. Because I remember you telling me that in old-style content management systems you needed caches and all that wonderful stuff and [chuckles] that basically, if you just store it as a graph, you don't need that anymore. 
AM: 06:05 Exactly, exactly. It has turned out that it was the best decision I or we ever made, in technical terms, to go with a graph database and with Neo4j as well. Not only in technical terms; the communities and the people are wonderful too. But the performance is a very important thing. So we store everything in the graph. For example, take the page tree. Normally, if you want to render a webpage, you store HTML, basically HTML, in your database. If you have a content management system, you split up the HTML page into small pieces. And the more flexibility you want to have, the smaller the pieces have to be. But if you then have very many small pieces of HTML and have to join them together for rendering dynamic pages, you have to do a lot of joins in a classic or relational database. 
AM: 07:14 And we all know that this gets slower and slower the more joins you have to do and [the?] more data you have in your database. If you do it in a graph, you just start at the page node and just traverse the page tree over some relationships, and you just take the [wave?] through the page tree - your request parameters tell you - and then you just render the page output of this together, and you can do that in a couple of milliseconds. In a classic content management system you can't do that. You can only cache portions of your page to get a reasonable speed, but if you for example have a protected page - so you have users who log in and everyone sees different content - then you can't do that anymore; you can't cache the dynamic content for each person. 
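To visualise what Axel describes here, a minimal sketch of such a page-tree traversal in Cypher - the label and relationship type are illustrative, not Structr's actual schema:

// start at the page node, traverse the whole tree of content elements - no joins needed
match p = (page:Page {name: "home"})-[:CONTAINS*]->(element)
return p;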
RVB: 08:13 Yeah, that makes sense... 
AM: 08:15 So it's much quicker than a classic content management system. 
RVB: 08:19 Super interesting. And I encourage everyone to take a good look at Structr if you're doing content and web application development; it's really lovely. So Axel, where is this going? What's this story both from a general graph-industry perspective and from a structure perspective? How do you see the future? 
AM: 08:43 I personally see a very bright future, not only for Structr and Neo4j - which I think is the best graph database on the market - but also for the graph database space in general. Because it's my belief that graphs or graph databases are the best-- not tool, but technology to really map reality into an electronic structure, if you want ... 
RVB: 09:23 Reality is a graph [chuckles]. 
AM: 09:24 Yeah, reality is a graph. The best abstraction of reality is graphs. And so the best tool to map this abstracted reality to software or to memory - that's basically what we're doing - and calculate on that data is a graph database. So we have an interesting roadmap. We're about to announce our upcoming release, 2.0, which will contain very interesting features. On one hand, we're expanding a little bit into the enterprise content management market, so we're creating a much better file interface: a completely revamped files and folder management user interface. And it comes with the possibility to use SCP or SSH to just log into Structr and do file operations in it. We are currently implementing CMIS: Content Management Interoperability Services, I think it's called. It's a very broad industry standard for interacting with content management repositories. We are implementing that based on our layers. So that will put us into the-- let's say we're approaching the larger ECM vendors or systems like Documentum and Alfresco, and so on. 
AM: 11:07 And on the other hand, our mission, our long-term goal, is to make application development much easier. So we want to reduce the friction you have as a creative person or as a manager in your company to create an application with the knowledge and the data you have. The friction is introduced by development hurdles: you have to choose a programming language, you have to set up the thing. Or maybe you can't program or you don't like to program; then you have to find developers, you have to pay them, and so on. It takes time and it's expensive. We want to fill the gap between the content management system and the development framework with Structr. So make it easier to create applications just by drag and drop, and put in some data, so that everyone can create mobile and web applications. That's our long-term vision, I would say. 
RVB: 12:16 I mean, that's like a wet dream [laughter]. 
AM: 12:19 [chuckles] Yes, it is. I know, but I mean-- 
RVB: 12:22 It sounds really great. And you know what? I can't-- 
AM: 12:23 --you have to have ideas like this to keep you motivated over such a long time. So we won't stop [chuckles]. 
RVB: 12:31 I agree. I agree. That's great. That's great. Wonderful. Well, Axel, it's been really great talking to you, and as you know, we want to keep these podcasts digestible length so I'm going to wrap up now. I really want to thank you for coming online and talking to us. And I look forward to seeing you at one of the future events and community events, right? 
AM: 12:58 Thank you, Rik, for this wonderful podcast series and the opportunity to talk to you. I think we will see each other at the latest at GraphConnect in San Francisco, I hope. 
RVB: 13:08 Absolutely. We have to meet up there [chuckles]. 
AM: 13:12 Yes. [crosstalk] 
RVB: 13:13 All right. Thank you, Axel. Have a nice day. 
AM: 13:16 Thank you too, Rik. Bye bye.
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

Monday 14 September 2015

Graphs are awesome!

Some time earlier this year, one of my buddies posted this crazy video on Facebook or wherever: People are Awesome!


I really enjoyed watching it. Turns out there's a whole collection of videos like that, and you can look at most of these collections on this website. Hours of fun :) ...

In massive anticipation of GraphConnect next month in San Francisco, I thought I would use one of these songs (Heroes, by Alesso ft. Tove Lo) for some Graph Karaoke. So here we go. Play it loud!


Hope you enjoyed it as much as I did :)

Cheers

Rik

Thursday 10 September 2015

Podcast Interview with Petra Selmer, Neo Technology

A couple of years ago, we were organising meetups in London for Neo, and we had a really nice audience of people that showed up at every other meetup. One of these people was a really interesting lady with a bit of a funny accent, who told me that she was doing a PhD on graph databases. Interesting. And now, a couple of years later, we have her on our team - Petra Selmer works on our Cypher team in our Engineering department, advancing the world's greatest declarative query language :)) ... So I decided to invite her to the podcast, and lo and behold, after some interesting "summer planning", we had a great chat. Here it is:

Here's the transcript of our conversation:
RVB: 00:03 Hello everyone. My name is Rik, Rik van Bruggen from Neo Technology, and I'm very excited to be doing another podcast recording today.  My guest on the podcast today is Petra Selmer. Hi Petra.
PS: 00:17 Hi Rik.
RVB: 00:17 Hey, hey, it's good to have you on the podcast. Thank you for making--
PS: 00:20 Thank you.
RVB: 00:20 --the time. So Petra, we've known each other for a couple of years. I think I got to know you first in the London community, right?
PS: 00:30 Yes, that's right. That's right, it's been quite a while, yeah.
RVB: 00:33 It's been quite a while, but maybe you can take some time to introduce yourself to our listeners. That might be useful.
PS: 00:40 Sure. Well, as you said, my name is Petra Selmer, and I'm actually a developer at Neo Technology, specifically working with the Cypher team. That's the team that actually develops, implements, designs Cypher, the query language. I'm also a member of the Cypher language group. This is something that we started about six months ago, to ensure that we kept up the momentum of adding new features, new operators, new keywords, new semantics to Cypher - keeping things rolling forward in that way. So I'm also a member of that group, and we essentially just try and make sure that we move the language forward. I also do a little bit of work as well in the biotechnology community. That is just trying to make contacts with scientists and other people in the biology, chemistry, physics communities, to try and get them enthused about graphs - that it's a really great tool to help solve their problems, because they've got really, really complex domains, which are very well aligned with graphs. This is just all very much in the beginning stages, but that's something that we're hoping to see grow in the future.
RVB: 01:48 Did I get that right? You have an academic background in graph query languages, right? Is that right?
PS: 01:54 That's right. So I'm actually towards the end of my PhD in the flexible querying of graph-structured data. So essentially, I've developed a query language which allows users to pose queries which do not exactly match the structure of the graph, but which nevertheless get answers back to them in a ranked order, depending on how closely their query matches the actual graph. So if you don't know your graph very well, you'll still get answers back, and essentially in this way, you actually get to know your data. So it's more of an exploratory thing. If you like, it's a "fuzzification" of queries, and it also does inferencing as well. For example, if you ask for things relating to cats, and somewhere you've stated that cats are related to lions because they're both felines, you also get data related to lions as well. So in this way, it's quite a powerful mechanism. This was motivated actually by biological domains, where they've got incredibly complex data that changes all the time. So there was the situation where scientists were just kind of stuck, sometimes not knowing what paths to query in the graphs. So some "fuzzification" and approximation of queries was necessary. That's my PhD.
RVB: 03:11 Wow, that sounds extremely interesting here. Is it related to visualization technologies in any way because that's where people tend to do those approximations, or finding those patterns with visual tools quite often, or is not related at all?
PS: 03:28 It initially was meant to be, because when I began there were many, many avenues to explore, and in fact, visualization is incredibly important. Everybody from developers new to the scene through to, as I say, very experienced physicists - they all find visualization very powerful. But actually, it turned out that there was so much to explore in this area that I concentrated rather on the theoretical proofs of the constructs required to undertake these approximations. But I believe there are other PhDs going on using this, and then applying visualization techniques on that as well. So yeah, visualization is very important, but alas not something I [crosstalk].
RVB: 04:09 Cool. Petra, how did you get into graphs and why did you get into graphs? Could we go into that a little bit more? I've asked this question to lots of the people on the podcasts, and I wonder if you've got a perspective on that?
PS: 04:21 Sure. So I've actually been an applications developer since 1997, in loads of vertical markets and loads of different companies, from places like IBM through to Internet solutions providers and those sorts of things. And I think it was about five, six years ago that I simultaneously began my PhD, and I fell into it quite by accident. I was supposed to undertake a PhD in description logics, which is just some mathematical logical thing that's quite arcane. After three months, I found it really was just not for me, and I actually went to speak to the dean of the university, and he said, "Well, actually we've got this other project that we'd like a PhD student for," and it actually happened to be this flexible querying of graph data. And as soon as I read the brief, I just fell in love with it. It was absolutely kind of awesome.
PS: 05:11 So I fell into it via that route, and also at that time, I was working for a medical research company, and the type of data we were trying to store and allow users to query was incredibly complex. In fact, the use case was to represent the entire NHS hierarchy of top-range consultants and administrators across all the trusts and networks and strategic health authorities - all these organizations. It was very much a graph. Same people, they'd have different roles in different contexts. So at the time, I was obviously fighting, as many other people were, in doing this in a relational database. There was a lot of pain around that. It was at that time, actually, I thought, "Hang on, this is definitely a perfect sort of graph problem that a graph database would usually be able to solve." But at that time, there was very little around. I think this was around about 2007 or so. It was just before, obviously, graph databases as such sprang out into industry. So certainly when I came across Neo4j, that was a light-bulb moment. Finally - very, very happy that industry had also seen the light as such, because of course, graph models had been around for decades in academia, but they had just never really taken off in industry. So very glad that's actually changed now.
RVB: 06:31 What was it that actually attracted you most, was it the modelling side of things, or what was it that attracted you most when you sort of found this matching technology?
PS: 06:42 With me, it was the modelling. It was basically so we didn't have that impedance mismatch, and all the-- sort of the 90% of the time spent on writing stupid stored procedures hundreds of lines long, and obviously the increased amount of testing and everything around that, and just ending up with code that was not maintainable. But also it felt to me - to use an analogy - as if a surgeon was trying to perform surgery using oven gloves - big heavy oven gloves. It was just really awful. Whereas-- I actually first was introduced to SPARQL, that's the semantic web graph query language, which was miles and miles better. But then, when I came across Cypher, I thought, "My goodness, this is brilliant. This is actually now like a surgeon using a scalpel to perform surgery," which is like it should be. It's very precise, very expressive, and you immediately can just fall in there and actually start doing very complex things. I think it's the only way in which you can write really intelligent systems - really advanced systems. I think at some point, you're just so bogged down by relational technology that you reach a limit as to what you can do. Whereas a graph just opens it up for you, and you start off from a very strong position, and then, as you see what you can do, you just go ever more advanced.
RVB: 08:04 Absolutely. It's great to hear you talk about that. I love that analogy, by the way. That will stick, I think [chuckles]. Presumably on Cypher, is there anything particular that you think is the main reason why you think it's going be conquering the world or stuff like that? What is it specifically that you like so much about Cypher?
PS: 08:30 It's just that it absolutely, completely reflects the graph model without any cognitive overload. So, i.e., you don't really need to think too much about it. It just fits so beautifully well. In particular, I love the MATCH query, and the way you can actually describe your pattern in a very, very natural and easy way. Whereas trying to start with something like SQL, and trying to make it graph-like, it just doesn't have the elegance or the expressivity that Cypher does. So that's certainly something that gripped me immediately: the pattern matching capabilities.
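As a tiny illustration of that expressiveness: a friend-of-a-friend pattern in Cypher reads almost like the sentence that describes it (a generic example, not taken from Petra's work):

match (me:Person {name: "Petra"})-[:KNOWS]->()-[:KNOWS]->(foaf)
where not (me)-[:KNOWS]->(foaf) and foaf <> me
return distinct foaf.name;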
RVB: 09:05 Yes, super, super. Cool, very cool. Maybe one more final question, if you don't mind, Petra? I ask this same question to everybody. It's about what does the future hold? Where do you think this is going? How you see, how do you hope that this will change the industry?
PS: 09:25 I think really since I knew about Neo4j - I think it was now about five years ago or something - it's amazing actually how many more people, how many more developers and others in industry, now know what a graph database is. So when speaking with people, I don't have to start right from the beginning, and sort of, "This is a graph," and spend a lot of time talking about that. They immediately already know. So what does the future hold? I think it's actually limitless. As I said before, I think it will be the only way in which we can actually solve some problems. In particular, I'm thinking of my background, in what I've worked on before, which was in the sciences and in the medical research arena. I think we'll be able to do so much more there - we'll be able to get better applications out there much faster, to domains like medical healthcare and places like that, in order to be able to leverage all these wonderful scientific discoveries that are going on at the same time, and therefore get wonderful research being undertaken and performed. So I think actually, it's hard to say where this will end up, but I think it will be really, really big.
RVB: 10:35 Absolutely, very cool. Maybe one more question and a little bit more personal. Do you still speak Afrikaans, or do you speak any Afrikaans?
PS: 10:42 I've been in the UK now for 16 years, but yes, I do try and speak it with any South Africans - those who know it. So I do try and keep up-to-date and in practice.
RVB: 10:57 Well, you know, Flemish my mother tongue, and Afrikaans are very much related, right? So next time we can practice together maybe [chuckles]?
PS: 11:03 Indeed, indeed. That'd be good.
RVB: 11:06 Very cool. Thank you so much for coming on the podcast, Petra. I really appreciate it, and I look forward to seeing you soon.
PS: 11:12 Thank you. You're welcome.
RVB: 11:14 Bye.
PS: 11:15 Bye-bye.
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

Wednesday 2 September 2015

Podcast Interview with Eddy Wong, Wanderu

I have said it before and I will say it again: the nicest thing about doing these podcasts is that I get to talk to so many cool people. Here's another one: Eddy Wong is the CTO and co-founder of Wanderu, and has been active in the Neo4j community for a very long time. As you will see/read/watch below, they have done some very cool stuff in combining Neo4j with MongoDB, and that's actually something super interesting for many users. So read on, listen, and try to enjoy it as much as I did :) ...

Here's the transcript of our conversation:
RVB: 00:02 Hello everyone. My name is Rik. Rik Van Bruggen from Neo Technology and here we are again recording another episode for the Neo4j Graph Database podcast. And today I'm very excited to have another overseas guest on this episode, that's Eddy Wong from Wanderu. Hi, Eddy. 
EW: 00:22 Hi, how are you doing? 
RVB: 00:23 I'm doing very well, and you? 
EW: 00:25 Good. 
RVB: 00:26 Excellent. Excellent. So Eddy, I've been reading all of the blogs and watching some of your videos from GraphConnect (see below) about your work with Neo4j at Wanderu. But most people won't have done that yet, so maybe you  can introduce yourself very quickly if that's okay? 
EW: 00:44 Okay. I'm a co-founder of Wanderu. Wanderu is a search engine for buses and trains, and the reason we were looking for a graph solution is we wanted to model the network of transportation so that you could route two buses, or two trains, or a bus and a train very easily. And that led us to using Neo4j.
RVB: 01:19 Super. How long have you guys been doing that already? 
EW: 01:22 Since 2012. We were one of the early users of Neo4j. I actually learned about Neo4j that year and I attended the first ever GraphConnect conference in San Francisco. 

RVB: 01:43 Yeah, that must be it. 2012, somewhere around that time. We're doing a fourth conference, a fourth year this year, I believe. So, it's going to be exciting. So, a search engine for trains and buses. How is it different from things like what I do? Route planning on Google Maps and stuff like that - is it very different? 
EW: 02:04 Well, what you do with Google Maps is mostly local, within a city. So we at Wanderu, we focus on intercity travel. 
RVB: 02:22 Okay. Super. Route planning does seem like a real good use case for graphs. But I also notice in some of the videos and the blogs that you guys are actually combining it with document storage right? Can you tell us a little bit more about that? 
EW: 02:39 Yeah. That was kind of a unique approach that we ended up taking, because most of our data was in JSON and we were using MongoDB to store that. It was very convenient to just dump the JSON there and it would be indexed. 
RVB: 03:04 That makes sense. 
EW: 03:06 And at that time, there was no easy way to upload data in JSON format to Neo4j. Usually you have to use the CSV loader or write some custom code or write Cypher. It wasn't very convenient to upload JSON into Neo4j, so we came up with a unique approach of using an open source project called Mongo Connector. This is a piece of software that lets you add a trigger to MongoDB, so whenever you add data to Mongo - you insert or delete something - Mongo automatically makes a callback. Inside that callback we populate Neo4j. 
RVB: 04:19 Do you also write directly to Neo4j or always through that connector? 
EW: 04:24 Through that connector. Yeah, always through that connector. 
RVB: 04:28 Super interesting. 
EW: 04:30 Sorry. That way we extracted the metadata. We didn't copy everything, but just the metadata that is useful for routing. 
RVB: 04:47 Yes, makes a lot of sense, right? Just the keys and stuff like that, right?
EW: 04:52 Yeah, the keys and the edges. I mean the edges that the trip takes from point A to point B. 
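To picture what such a callback might write, here's a hedged sketch in Cypher - the labels, property keys and relationship type are invented for illustration, not Wanderu's actual schema:

merge (a:Station {code: "BOS"})
merge (b:Station {code: "NYC"})
merge (a)-[t:TRIP {mongo_id: "55f1a2b3c4"}]->(b)
set t.price = 25.0, t.departs = "2015-10-01T08:00";

Only the routing metadata and the Mongo document id cross over; the full JSON document stays behind in MongoDB.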
RVB: 05:02 So how did you guys get to Neo4j? I mean what was the attraction and why did you end up using Neo specifically? Can you elaborate on that? 
EW: 05:13 Yeah. So I looked at other graph databases. And at that time, the only ones that I could find or read about were proprietary solutions. They were expensive and they were closed source and they were not-- so Neo4j was the only one that was open source and had a vibrant community at that time. 
RVB: 05:44 And was it a good fit you think? Do you still think it's a good fit? 
EW: 05:48 Yeah, definitely. 
RVB: 05:51 Super. And how successful is Wanderu these days? Are you guys making good traction? Are you attracting good communities? 
EW: 06:00 Yeah. So we've grown from zero to now over a million users per month. About sales: we sell several thousand tickets every day, and the architecture since day one hasn't changed that much. So, our solution has remained scalable. 
RVB: 06:27 That's very impressive, very cool. So, what does the future hold, Eddy? How is this going to evolve going forward, both from a graph perspective and, maybe also a little bit, where is Wanderu going? 
EW: 06:44 We like to think of ourselves as travel for the next generation, so graphs enable us to model data in a way that hasn't been done before. So in the future-- you look at social networks and all the information that is stored in graphs. So, eventually you have a social network connecting with a transportation network, and you can imagine all the cool stuff you can do with that. 
RVB: 07:30 So things like recommendations and stuff like that, is that the type of stuff you're thinking about? 
EW: 07:35 Yeah. 
RVB: 07:37 And that's stuff that's in the pipeline already? Or is that just thoughtware? 
EW: 07:43 Well, I can't tell you about it. 
RVB: 07:47 Oh man, I was looking for a scoop there [laughter]. Okay, well that sounds super interesting. And then what about Neo4j? How do you think that's going to evolve? And what are you hoping for, or stuff that you think will be super-useful? 
EW: 08:06 Well, there's one feature that we are really hoping happens this year: the new geospatial plug-in. That would allow us to make our results even more interesting. 
RVB: 08:30 Okay. Are you guys using geodata already right now? 
EW: 08:33 Yes. 
RVB: 08:36 Okay, all right. But that's in Mongo these days? 
EW: 08:37 Yeah, that's in Mongo. 
RVB: 08:40 Super. Thank you for sharing your perspective. I don't know if you have any final words or final considerations for our listeners. Anything in particular? 
EW: 08:53 Yeah, that Neo is a great product, and the community is great. I mean from day one going to the first GraphConnect it was great to interact with the community. The community's very enthusiastic and it's great to interact with them. 
RVB: 09:23 Super. I'm hoping that you can make it to GraphConnect as well this year. 
EW: 09:26 Yeah. 
RVB: 09:29 Super. I'll hope to see you there then. Thank you for coming on the podcast. I really appreciate it and, when I'm in the US in October, I'll have to give Wanderu a try myself. 
EW: 09:39 All right. Definitely. 
RVB: 09:42 Thank you, Eddy. Have a nice night. Bye. 
EW: 09:45 Bye.
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik