Friday, 12 February 2016

Podcast Interview with Iian Neill, The Codex

A couple of months ago someone pointed me to this great Neo4j application called The Codex - a semantic application mapping out an "atlas for history". Its author, Iian Neill, recorded a great video about it, which triggered me to want to learn more. There have been some other Neo4j projects in this domain (like Historiana, which Paul worked on), and as you will see from the interview below - there's a lot to be said about it. So let's get cracking!

Here's the transcript of our conversation:
RVB: 00:02 Hello everyone. My name is Rik, Rik Van Bruggen from Neo Technology, and here we are again recording a long-distance podcast all the way from Australia. This is actually the second week in two weeks-- the second episode in two weeks that I'm recording with [chuckles] someone in Australia. And tonight I have Iian Neill on the Skype here. Iian Neill in Brisbane. Hi Iian. 
IN: 00:25 G'day Rik. 
RVB: 00:26 Hey, thanks for coming online. I know it's early for you and it's late for me, but this is a great time for us to chat, right [chuckles]? 
IN: 00:34 Absolutely. It's certainly my pleasure. 
RVB: 00:36 Very good. Iian, we got to know each other a couple of weeks, months ago. At least I started following your projects a little bit, but it would be good for you to introduce yourself to our podcast listeners because most people probably don't know who you are yet. 
IN: 00:53 Okay. My name is Iian Neill. I'm an ASP.NET developer. I have a bit of a background in computers and arts. I've got a Bachelor of Arts in Art History, but I work in IT. I also work for a non-profit art foundation called the Art Renewal Center. But basically, yeah, I've been passionate about art history and been looking for a way to data mine it and hence Neo4j. 
RVB: 01:23 And then we should also immediately mention one of your coolest projects, I think. This is how I got to know you: the Codex, right [chuckles]? 
IN: 01:31 Yes. That's right. Yes. The Codex is something I've been working on for a few years. It's kind of evolved a bit, but it's basically a way-- it's a project I built out of ASP.NET and Neo4j using the C# Neo4j client, and it's a tool that I'm building to sort of-- I call it an atlas of history. It's sort of trying to map history out, and the connections between people and events and places and things like that. 
RVB: 02:03 Okay. And then tell us a little bit more about that. I saw there's a lot of information about like Italian Renaissance, Leonardo da Vinci, Michelangelo, and stuff like that that you're trying to map out what they're doing or what they did, right? 
IN: 02:18 Absolutely. I kind of think of it as being a bit like a Facebook of the past or in some ways even a little bit like a time machine. There was a TED talk about someone doing a project a little bit like that on Venetian history.
But what I really wanted to do was to be able to put myself back in the past and say, "What was happening on a certain day?" So if I saw a certain painting, and what's the context around this painting? Who were the people? What was going on in Florence when this painting was being made? And from that, I started to build the data structure and say, "What else can we find out about this? Can we use the system to abstract out some information? Can we see connections that we might not see if we were just reading a book in a linear way?" And that's kind of what's attracted me to Neo4j. 
RVB: 03:18 Tell us a little bit more about that. What's the relationship between the Codex and Neo4j? How do you use it? 
IN: 03:24 Oh, it's completely dependent on it. A few years ago I had an idea for breaking down a person's biography or life into a series of events, and you can think of it as being a verb phrase. So X meets Y at Place Z, for example. And that's just a data structure. I mean, you have two people, you have a place, and you have a time. And that data structure can be quite powerful for representing sequences of events and connections. And you could then use Cypher to sort of query that, and say, "If I know that X was at this place, at Florence, who else was there at the same time? If X had these friends, do these friends know the other person's friends?" You know? And you can sort of-- once you start down that road, you can sort of keep expanding that with the graph, basically. 
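(As a quick aside: here is a tiny sketch in Cypher of the structure Iian describes. The labels, relationship types and names are my own assumptions for illustration - not necessarily the Codex schema.)

// "X meets Y at place Z" as a small event graph
CREATE (x:Person {name: "Lorenzo"}),
       (y:Person {name: "Sandro"}),
       (z:Place {name: "Florence"}),
       (e:Event {date: "1478-04-26"}),
       (x)-[:PRESENT_AT]->(e),
       (y)-[:PRESENT_AT]->(e),
       (e)-[:OCCURRED_AT]->(z);

// "if X was at Florence, who else was there at the same time?"
MATCH (x:Person {name: "Lorenzo"})-[:PRESENT_AT]->(e:Event)-[:OCCURRED_AT]->(:Place {name: "Florence"}),
      (other:Person)-[:PRESENT_AT]->(e)
WHERE other <> x
RETURN other.name;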
IN: 04:18 So I started in that fashion, but then I found that it was a little bit restrictive and a little bit time consuming to take written text and break it down into that kind of atomic way. So instead, I put a different model on top of that, so I put in the event, you know, somebody's diary event for a day in 1478, let's say. And then I could annotate who was there and what was-- the places, and everything that was mentioned. Those are all nodes in Neo4j. And then I put, if you like, subject tags on top of that. So it's a little-- like you would tag a photo or a Twitter post, a hashtag, you might tag it with a description of what's happening. So if you can sort of forgive the macabre example, a popular pastime in the Renaissance was hanging people. 
IN: 05:08 So for example, you might read somewhere that somebody was taken to the public square and they were hung that day. So I started by saying, "Let's put that in there." So I would create a tag for hanging and associate it with that event on that day at that place. And then I thought, "Why not bring a taxonomy to that tag?" So what I mean by that is putting that tag in a hierarchy. So I'd ask the question, "Well, what is a hanging? Well, that's a kind of public execution, and that's a kind of death," or something like that. And I thought, "Well, that could be an interesting scholarly tool for understanding history." So you've got the text of the event, you know who was there, what they were doing, and then you can use the graph and step out by sort of degrees of separation. 
IN: 06:00 You can say, "I'll start with a specific subject like hanging and then I'll go to all kinds of executions, which could be--" they were very creative back then, so you're bringing back lots of events. And then I have followed this procedure for every tag in the system where I can. And probably the last extension I've done to that is I thought, "When you put a tag in the system, why not record a numerical quantity with that?" So if three people were hung, you could put "hanging three." And then I thought, "That gives you chartable information for free." So you have an event, you have all the people there, you have the subject of the activities, and then if you have numbers, you have information that can be visualised as charts. So it occurred to me to bring all these things together. That's [crosstalk]-- 
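(Another aside: a small Cypher sketch of the tag taxonomy and the numerical quantities Iian mentions - again, all labels and property names are my own guesses, not the Codex schema.)

// tag an event, place the tag in a taxonomy, and record a quantity
MATCH (e:Event {date: "1478-04-26"})
MERGE (h:Tag {name: "hanging"})
MERGE (x:Tag {name: "public execution"})
MERGE (d:Tag {name: "death"})
MERGE (h)-[:IS_A]->(x)
MERGE (x)-[:IS_A]->(d)
CREATE (e)-[:TAGGED {quantity: 3}]->(h);

// step out by degrees of separation: every event tagged with any kind of "death"
MATCH (e:Event)-[t:TAGGED]->(:Tag)-[:IS_A*0..]->(:Tag {name: "death"})
RETURN e.date, t.quantity;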
RVB: 06:52 It sounds a little bit-- it sounds a little bit like a semantic application, doesn't it? You know, like-- 
IN: 06:57 Yes. 
RVB: 06:56 --triples and those types of things. Is it related to that in any way? 
IN: 07:01 Yes, absolutely. Many years ago when I did my postgraduate IT degree I did a course called "Ontology and the Semantic Web", and that's kind of where it all came from. It was about ten years ago and we used a language called OWL - I think O-W-L - as a modelling language. And I thought it was amazingly powerful for expressing real relationships. And then I was really disappointed to see that there was no practical database out there that could do that kind of thing. It was just sort of SQL. And I sort of failed to translate the OWL model into SQL in an efficient way, and I kind of put it aside. But then a few years ago I came across Neo4j and that seemed a good time to pick it up again. 
RVB: 07:49 Well, that's a perfect segue for my second question. It's, why Neo4j? Why did you use a graph database for this particular project? And then what's so good about it? Any comments on that? 
IN: 08:06 Well, I mean, originally it just started as a side project. As I said, I started with that sort of data structure, that X meets Y at a place. And originally, I just wrote it as a kind of MapReduce-style thing in JavaScript just using JSON, and just querying it through lambdas and so on. And it was always going to be temporary; it was just an in-memory JavaScript thing. And I started looking around, thinking, "Is there a database that can do this?" And I heard about NoSQL document databases, and I looked into Mongo and RavenDB. But what I found when I looked into Mongo - I read an interesting post; I will try and dig up the link later - was a post by somebody who had used Mongo extensively, and I think they thought that Mongo would be a relational kind of system for them, that it would have some of the power of-- the relational ability of SQL databases. And they realised that it didn't really have that. And I thought, "That's great. I won't go down that road." And then somebody in the comments recommended Neo4j, so I started looking into Neo4j. And it seemed to me the perfect intersection of the power of representing things in a document style and a graph style, and then having the relationships as well that make it incredibly fast to query and update. 
RVB: 09:28 Very cool. So it's kind of like what I've heard many people on the podcast say: it's a combination of good modelling fit and then on the other hand, there's also just query power, right? Query possibilities that match this domain really well. 
IN: 09:45 Absolutely. And just to quickly round that up, as I was saying before, I was lucky enough to sit in on a talk that Jim Webber gave in Brisbane that was related to the YOW Conference in, I think, 2013. And I already knew about Neo4j at that point, but going to the talk really convinced me. Jim gave a great description, gave lots of examples from Doctor Who (dataset is over here), which is wonderful [chuckles]. [crosstalk] you'd think.
RVB: 10:15 [chuckles] Yeah. Yeah.
IN: 10:17 And then he gave me a copy of the book as well, on graph databases, and it really went from there. It was absolutely decided that I was going to do that with Neo4j. 
RVB: 10:28 It's so funny. I mean, two weeks ago I spoke to two fellow Australians from Melbourne, and they as well got inspired by that tour that Jim did in-- 
IN: 10:40 Yes [chuckles]. 
RVB: 10:40 --2013 in Australia [chuckles], so it's been a productive visit, that one [chuckles]. Very [good?] [crosstalk]. 
IN: 10:47 Absolutely. 
RVB: 10:48 So the last question I always ask people, Iian, is what does the future hold? Where do you think this is going? Where is your project going and where do you see graph databases as part of that project going? Any perspectives? 
IN: 11:05 Sure. I've got a few plans with Codex. I want to continue-- I want to add the ability to put in more, what you might call, arbitrary data sets. So rather than just having events - you know, what people were doing - I want to be able to put in things like if somebody gave me a record set of births and deaths, or disease, epidemiology figures, or something like the spread of a plague or something, I think it would be possible to integrate that into the system so you could switch between data sets, you could be looking at somebody's life story but then also looking at more official statistics as well. So that's kind of where I'll be taking it in the next few months. One thing I've discovered working on Codex is that-- one thing I didn't expect from Neo4j was that it's such a good tool for modelling that in a way, you can almost-- in most domains, you have one database for one domain. 
IN: 12:12 You have a shopping cart and you have an art gallery collection or something like that, and you sort of think about them as being two separate databases. But with Neo4j, I've found that you can think about it as being one database. You can have multiple domains that if you define points of where they interface - certain commonalities like time or space or location - you can easily take the domain you started with and add other domains to it, so it becomes kind of what I think it was being, like an integral or universal database in a way. I don't know if that would be appropriate for every solution, but I think it's something that Neo4j offers that I think would be very difficult to do with another database. 
RVB: 12:58 Very cool, very cool. Well, thank you so much for talking about all of this. I really appreciate it. As you know, I try to keep these podcasts quite short so that they are digestible on everyone's commutes, you know what I mean? So we're going to wrap up here, but I really want to thank you again for coming online. Good luck with the Codex and all of your projects, and hopefully we'll get a chance to meet each other at some point. That would be great. 
IN: 13:28 That would be fantastic. And thank you, Rik. 
RVB: 13:31 Thank you. Bye-bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

Friday, 5 February 2016

Podcast Interview with Ben Butler-Cole, Neo Technology

Here's another lovely conversation with a dear colleague of mine working in our London office, Ben Butler-Cole. Ben has been a part of the London team for well over 2 years now, and has been working together with the "Thoughtworks crew" (Jim, Ian, Alistair, Mark - and probably some more :)) for a very long time. So he really is part of the family. He became a hero of mine by summing up the corporate culture at Neo very succinctly a while back: "Neo is such a great place to work because we have no ass-holes over here". Or something of that nature :) ... In any case we had a great chat, and I would love to share it:

Here's the transcript of our conversation:

RVB: 00:02 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo Technology, and here I am recording a podcast episode with one of my dear British colleagues, all the way from across the Channel, Ben, Ben Butler-Cole. Hi, Ben. Thanks for joining me. 
BBC: 00:16 Hi, there. 
RVB: 00:18 Thanks a lot for coming online. I appreciate it. 
BBC: 00:20 Not at all. 
RVB: 00:21 Very good. Ben, I invited you because I know you've been doing some really cool stuff on Neo4j in the engineering department, and I think it's always cool for people to sort of have a little bit of a view on how these things work. But before I get into that, why don't you introduce yourself and people might learn a little bit more about who you are and what you do at Neo? 
BBC: 00:45 I'm Ben. I am one of the developers at Neo. I've been at Neo for two-and-a-half years. In that time, I spent some time working on the core product. I've also worked on improving our internal development infrastructure, build systems and release processes and so on. I've spent some time working on building testing tools to allow us to automate and improve the testing of Neo4j, particularly testing in sort of real-world scenarios, testing end-to-end in the same way that our users use it, so we have the chance to beef the product up a bit before we release it to the real world. 
RVB: 01:45 Absolutely. 
BBC: 01:46 Most recently I started a new stream of work to improve the operations surface of Neo4j. So we're looking at logging, configuration, packaging, and things like that, trying to really improve those and trying to make the product as easy and nice to use for systems operators as it is for the end users. 
RVB: 02:19 Yeah. Could you tell us a little bit more about some of these build processes and how they work? Just from a really high level. I know everything we do is open source, right, so people can look at the source code on GitHub and stuff like that, but how does it actually work from a developer's perspective? What are some of the big bricks? 
BBC: 02:40 Yes. We use TeamCity, which is a build server, and we take the source that's in the GitHub repo, which, as you say, people can see. We build that, run tests on that. Then we have quite a long, complex pipeline of builds that follow on from that. Some of the testing we do is in the public source code repository. We also have a number of internal private repositories which have testing tools and so on. For each build we probably run 10 or 15 different kinds of tests that test different aspects of the application. Some very simple ones, which we call tyre kicking, which just start the application - install it, start it, run it, make sure that we can write data and read data. And then varying levels of sophistication beyond that. Low-level tests for components where we want to stress-test or performance-test an individual component, and then larger end-to-end tests. Some of the most useful tests we've got are for our testing of our clustering, where we do sort of fuzzy testing: we stand up a cluster and read and write data to it. And while it's running, we knock over individual instances - deliberately crash them or shut them down cleanly - and make sure that the cluster stays up and is resilient to that. That's been a very useful aspect of the testing that we do. 
RVB: 04:30 Wow. Some of those tests, do they take a long time, or is it like instantaneous, or how does that work? Some of these things must take quite some time, no? 
BBC: 04:41 They do, yes. We have tests that run for several hours for that reason, because we're standing up real clusters of servers, and we want to run them over time and make sure that they're stable over a long period of time. 
RVB: 04:58 Very cool, very cool. And you're now starting some new work on the operability, you said, right? There's a lot of people looking at that, I think. Some of our primary users are the administrators, aren't they? 
BBC: 05:11 Yes, exactly. Historically it's an area of the product that hasn't got quite as much love as it might have done. We've been focussing on end-user features of the product, and stability and the reliability of the data storage. But we've taken a decision to make an investment in trying to improve the operability of the product as well now. So we're improving the configuration and the way that works, and the packaging particularly. It's not glamorous stuff, but because of my interests I'm very excited that we're doing it. So we're changing the directory structure of the application and its tools so that it's more sympathetic towards the standard ways of doing things on the platform, particularly on Linux, where the majority of our production installs are running. 
RVB: 06:18 Is that work going to be visible in 3.0 or in the 3.x series, or...? 
BBC: 06:23 Yes. Some of that work - the first milestone steps - has already gone into the code base and will be in 3.0. 
RVB: 06:32 Super cool. 
BBC: 06:33 We're trying to get as much of it as possible into 3.0 as a major release because we're happy to make some backwards incompatible changes for 3.0 because it's a major release, and then hopefully things will settle down for the minor releases that come after it. 
RVB: 06:52 Totally, yeah. I always ask a couple of questions on this podcast, right? One of the things that I'm always interested in is what attracted you to Neo, and how did you get to Neo, and why do you think it's a cool product to be working on. You can give me the real answers, Ben. I know there's a very boring answer to this one [laughter]. 
BBC: 07:15 I spent nearly ten years before I came to Neo working as a consultant. So I saw a huge range of different systems and applications, and built quite a lot of them. One of the really attractive things that I see is what I think is the superiority of the graph model as a way of modelling the real world and the shape of data in the real world, over SQL or key-value stores. I find that very appealing. I think we're effectively making life easier for all those developers who I've been working with over the years, who are struggling with the impedance mismatch between SQL in particular and the applications they're trying to write. 
RVB: 08:18 Very cool. 
BBC: 08:18 On a more personal level, I knew a bunch of people who are here working at Neo Technology, and they were people who I knew and respected, and I was keen to come and work with them. So that was what really sucked me in. 
RVB: 08:32 Exactly. There's a bunch of people that come from the same background at Neo, right, people like Ian and Jim and Alistair and all of those folks? 
BBC: 08:38 Yeah, exactly. And they're all people who I'd worked with before, who I was keen to work with again. 
RVB: 08:43 Very cool. Maybe one more question, Ben, if you don't mind. Where do you think this is going now from a product perspective, from an industry perspective? Anything that you aspire or think we should be doing or believe we should be doing, or that type of stuff? Look into the future crystal ball. 
BBC: 09:09 I think there's a lot of work for me to do, carrying on the work I'm doing at the moment for the product. As you know, there are initiatives and exciting new features being built across the product at the moment, for 3.0 and beyond. I'm keen to kind of stick with the boring stuff. I really want to make Neo4j a very easy product to operate. So beyond just cleaning out what's effectively debt that we're working on at the moment, I have ambitions for monitoring, particularly of live systems: to make the software able to explain to people what's going on inside it, to integrate it with standard monitoring systems, and once we've reached a level where we're happy that it's good enough, I've then got ambitions to start pushing on the state of the art, and improving on the state of the art for monitoring particularly - turning monitoring into a kind of feed of events so that it's really easy to understand and interpret the behaviour of the system, for the people operating it to be able to more or less leave it to tick away on its own, and then help build up the systems that can interact with it, and fix problems as they come up. 
RVB: 10:59 Very cool. Well, thank you so much, Ben, for coming online and sharing that with us. 
BBC: 11:04 Awesome. 
RVB: 11:05 It's always great to get an inside peek into how things work in Neo's engineering world. Really appreciate it. And I would say: You know what? Let's make Neo4j boringly fantastic, right [laughter]? That would be such a great achievement. Thank you so much, Ben. I appreciate it.
BBC: 11:26 All right. Thanks. 
RVB: 11:28 Cheers, man. Bye. 
BBC: 11:29 Bye-bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

Friday, 29 January 2016

Podcast Interview with Stuart Begg and Matt Byrne, Independent Contractors for Sensis

This week I was able to connect with two lovely gentlemen who have been doing some truly inspiring work with Neo4j on a project in Australia. Stuart and Matt have been talking and writing about the project for a while now, and have been very forthcoming about sharing their lessons learnt with the Neo4j community. That's always a great thing to watch, so I literally could not wait to chat with them - and so we did. Here's the recording:

Here's the transcript of our conversation:
RVB: 00:02 Hello everyone, my name is Rik. Rik Van Bruggen from Neo Technology and here I am recording the longest distance podcast recording ever in the history of this podcast, and that's all the way in Australia, Stuart and Matt. Thank you guys for joining us. 
SB 00:18 Thanks Rik. 
RVB: 00:19 It's great to have you on. I know it's early for you guys, it's late for me but it's great that you guys are making the time. I've read some of the blog posts that you guys have been posting on Neo, but it might be a good idea for you guys to introduce yourself, if you don't mind? 
SB: 00:36 Hi Rik, it's Stuart here, Stuart Begg. I'm a contractor developing software for Sensis. We've been using Neo for a while now, and I'm here with my colleague Matt Byrne. 
MB: 00:47 I'm Matt Byrne, also contracting out of Sensis in Melbourne. Been on the same project as Stu. 
RVB: 00:54 Absolutely, very cool. And you guys have been using Neo for a long time? How long have you been working with Neo4j by now? 
SB: 01:01 Well the project kicked off about two years ago but we've been actively using Neo for the last 18 months after a selection process at the very beginning of the project to look at various technologies and Neo was the technology that was chosen. 
RVB: 01:16 And I suspect that my dear friend Jim Webber was part of that process, wasn't he? 
SB: 01:23 Yes, Jim was part of that process in the very early stages. He was the one that kind of seeded the idea in my mind in particular when he came to Sensis quite a few years ago actually and talked about the benefits of Neo Technology as an alternative to other more traditional database technologies. And I was very interested in what he had to say. Then when we started looking at this project, Neo was a very good fit for the style of application that we were building. 
RVB: 01:49 Well that gives me a perfect segue into the first question that I typically ask on this podcast, which is why? Why did you guys get into this? And what made it such a good fit? 
SB: 02:03 It's a good fit for a couple of different reasons. The project that we were working on is around content management within the organisation. We're a directories company, so we have a lot of content that's supplied by the various customers. But in many ways it's kind of a configuration and management system: the company sells products and those products are backed by the content. We've gone through a number of iterations of trying to work out a better and different way of managing the content within the backend system, and have looked at various things, from relational database systems - there was some early work done on that, but we ran into some modelling and performance issues - to hierarchical data systems. But when we got down to it and looked at the data in more detail, it was more of a network model. We were trying to reduce duplication within the data. We were seeing that there were various pieces of data that we used in multiple different ways, the same piece of data used in multiple different contexts, and it's just that the graph made everything so much easier to model in the first instance, and also for the business to understand how we were actually using the data. And the visualisation tools that come with Neo as a standard feature make it really easy to get your message across, not only to the end users but also to the developers that are building the system itself. Anything you'd like to add to that? 
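(A rough Cypher sketch of the shape Stuart describes - one piece of content reused across product contexts. The labels are my guesses, not the actual Sensis model.)

// one piece of content, backing two different products
CREATE (c:Content {text: "Opening hours: 9-5"})
CREATE (:Product {name: "Print listing"})-[:BACKED_BY]->(c)
CREATE (:Product {name: "Online listing"})-[:BACKED_BY]->(c);

// deduplication means one piece of data serves every context it appears in
MATCH (p:Product)-[:BACKED_BY]->(c:Content)
RETURN c.text, collect(p.name);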
MB: 03:29 No, you were part of that selection [?] [chuckles]. 
RVB: 03:32 [chuckles] Well it's so cool right, because I don't know if you know these guys, but the way Neo got founded in the late 90s, early 2000s was for a content management system as well. So it's history repeating itself so to speak, and exactly what you were saying right? It's a modelling fit, it's a performance fit. All of those types of things seem to come back quite regularly, which is super cool. So tell us a little bit more about these performance advantages that you guys wrote about in the blog post. There were some significant improvements there I gather, no? 
MB: 04:14 Yeah. So basically we're retrieving a sub-graph of information at a time. So we've got all this dynamically specified amount of data, and for us to be able to model that in relational would just kill us. So in the blog post the motivation was really to show how you could tune Neo and to show all the tools available to you. So we put some stuff in there as an example of what we've done, but it was really to help people get started and have a bit more confidence in it. I think we only found Michael Hunger's article while we were writing ours, because it was just a Google Doc at the time, I think. Then he published his and that helped us. It really helped get things going for us. 
RVB: 05:06 Your article was really good because it's also illustrating a common problem with time versioning and all of that stuff. You were inspired by Ian Robinson right? One of Jim's mates. 
MB: 05:20 Yeah. I came onto the project a bit after-- Stu did the hard work, him and another guy, to select the technology, and obviously a bold move to do it with such a core system. And then when we came on, the time-based thing of Ian Robinson's was great, it fit us. But it also challenges you. You have to bake that in at a low level and it can affect performance. So we were tackling that. 
RVB: 05:53 I think everyone should read your blog post. There's some very valuable experiences and lessons to be learned from that I think. Thanks for writing it up. It's really helping our community I think. So guys, where is this going? Where is your project going? And where do you think the technology's going to be going in the next couple of years? Any perspectives on that? 
SB: 06:15 I think from the project's perspective, it's been particularly successful. We've met the business requirements and also, I guess, the user requirements of providing a much richer experience for the users and a much easier and faster turn-around from concept to delivery of the sort of products that we're looking at delivering for Sensis as an organisation. But it would be good to have other companies and other individuals understand some more about the graph technology. And that's one of the reasons why Matt and I are keen to talk about our experience with the product, and I think that there is a good fit for this kind of technology in a lot of different areas that people may not have thought about. Even in these times of the NoSQL database, I think there are still a lot of people who are very familiar with SQL-style databases and maybe don't think beyond that. So it would be good for people to have this as another option within their tool kit for providing different kinds of solutions to the sorts of problems that they're dealing with. 
MB: 07:24 Just to add to that, we've put the article out, and a guy, Mark, in Melbourne has organised the Melbourne meet-ups. We're trying to put out Sensis as a strong example of how it can work. But it'd be nice to see other examples in Australia come out, because it gives companies in Australia more confidence to adopt these technologies. And we're just contractors, so we'd like to help the next company if that came up. But in terms of Neo, to have confidence in the technology we also wanted to highlight any drawbacks that there are, just so that you know what you're in for. But Cypher is a fantastic query language, it's very easy to use. But it's young compared to SQL. So that's still evolving. A lot more sugar in that, a lot more power coming, and there's always things that we hear. So I'm always keen to see the enhancements to that coming in new versions. And more indexing power. It'd be nice to have indexes on relationships, because there are examples where we do want to query things that way. So some of those things. But all the work that's come in - it's just fantastic to see it improve so much between releases. 
SB: 08:49 And the-- I was just going to say, the feedback that we get from Neo for the sorts of requests that we make, and just the ways that we're using the technology - the feedback from the Neo guys has been terrific all the way through this process as well. 
RVB: 09:07 That's super great to hear. Just to address what you were just saying there, Matt: there are so many things that we want to do in the product, it is a very long list of things. The cool thing is, with Neo, we've actually now hired a very significant engineering team. We had an all-hands call earlier today with all the Neo employees, and our engineering team is like over 40 people now, which is super super great, because that allows us to accelerate with those new feature developments as well, right? There's wonderful stuff coming. I can tell you that [chuckles]. 
MB: 09:46 And the good thing about dealing with the company Neo, compared to our traditional database vendors or any other - what people love to call - enterprises, is that we do get this great feedback and we do feel like our feedback's taken into account. We realise there's only so much that can be done in releases, so we definitely don't feel hard done by. It's great. 
RVB: 10:11 Super. Cool. Well guys, thank you so much for coming on the podcast. I'm going to wrap it up here because I want to keep these episodes digestible and short for people's commutes. So thank you again, I really appreciate it, and I look forward to meeting you guys in person one day. 
SB: 10:32 That will be great, thanks-- 
MB: 10:32 That'll be great. Cheers-- 
SB: 10:34 Look forward to it. 
RVB: 10:34 Thank you guys. Cheers, bye. 
MB: 10:36 Thanks, bye. 
SB: 10:37 Cheers.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

Wednesday, 20 January 2016

Podcast Interview with Paul Jongsma, Webtic

NO! The Graphistania Podcast is NOT DEAD, in spite of what you may have thought. We have lots of cool interviews coming up (and are always looking for more - so please contact me if you have more ideas), and we are going to start the "not so new year" with a nice conversation with a dear community member from the Netherlands, Paul Jongsma. Paul has been around the block a few times: he is a seasoned web developer who runs his own consultancy, doing projects with lots of cool and innovative technology - among which, of course, Neo4j. So we got together and talked, and here's the new episode:

Here's the transcript of our conversation:
RVB: 00:02 Hello everyone. My name is Rik, Rik Van Bruggen. I am very happy to start the New Year with another beautiful Neo4j podcast. Graphistania needs to be populated, right, so we're going to do another chat here today. I've invited a very, very dear community member from the Netherlands for this chat, and that is Paul Jongsma. Hi Paul.
PJ: 00:26 Hi.
RVB: 00:27 Hey. Thanks for coming online. I really appreciate it.
PJ: 00:31 Glad to be here.
RVB: 00:32 Yeah, cool. Paul, we always start the podcast with just very simply, who are you, what do you do, what's the relationship between you and the wonderful world of graphs?
PJ: 00:43 Try to do that in a couple of seconds. I'm Paul. In a previous life I was co-founder of XS4ALL, one of the first public Internet providers in Holland, and after doing that for a couple of years, I left that company to create my own, called WebTic, in which I have, ever since it was founded, tried to disclose information on the web via websites, web applications, whatever you want to name it. And I've been doing that since 1996, so I've been on the road for quite a bit.
RVB: 01:22 Yeah, absolutely. So when you say disclosing information on the web, that means building websites? What types of tools do you typically use for building your web applications?
PJ: 01:35 Until recently, mainly-- back in 96, there was only a couple of ways of building a website--
RVB: 01:46 CGI Scripts!
PJ: 01:48 CGI Scripts and no databases. So I was one of the first to actually try to hook up databases onto the web, and there were few tools at that time - one which comes to mind is Mini-SQL, which is not even a predecessor of MySQL, at least timeline-wise. We've been doing projects involving databases almost exclusively. We weren't building websites which were static. There was always a technical challenge behind it.
RVB: 02:24 Yeah, an interactive component.
PJ: 02:26 Interactive components, or automation components. For instance, we've been working on the website for the Department of...
RVB: 02:38 Looking for the English word [laughter].
PJ: 02:40 The English word, exactly [laughter]. The Statistics Department of Amsterdam and they obviously have loads of information within their systems and they want to publish those on the web. What we did was build a system where basically all the information which they have already can be collected, a script can be run on that and then a website comes out on the other end and updates are all done automatically so they just add a few files which they want to publicize. They push the button and they're on the web and there is no need for internal knowledge of how to convert stuff for the web or it's all automatic. That kind of engine of converting it to the web and making that engine work is the area of interest we do mainly--
RVB: 03:32 Cover. So how did you get to graphs. Tell us about that.
PJ: 03:36 Yeah, that's quite an interesting story. Obviously we've been working with SQL databases for quite a bit of time. Back in the day, Mini-SQL. We've been using MySQL for ages now. For a couple of projects we also used Postgres instead of MySQL, but mostly SQL-based. And for a larger project we've done recently, called Historiana - which is a website about history in a European context, trying to create new and effective ways of teaching history to both students and teachers, to facilitate the teacher and give tools to both - we ran into some modelling problems. Because as you can imagine, there are a lot of relations to be documented when you talk about history. You've got persons. You've got events. You've got locations. And as with anything in the human world, you could connect anything to anything basically, and that requirement was also there. So we built a system where you actually could connect basically everything to everything within the context of SQL. And that gave--
RVB: 04:57 That must have been fun [laughter].
PJ: 04:59 That gave very interesting results, and after a couple of hundred records in various parts of the database, the interface came to a grinding halt. Obviously, because if you want to relate 100 items to 100 items to 100 items, the queries which you have to build are horrendous, the user interface becomes horrendous, and even worse, the timing becomes horrendous.
RVB: 05:28 Now was that the project that you were using Neo4j for then [crosstalk]?
PJ: 05:31 Oh yeah, I did not know Neo4j at that time, but my antennae picked up the keyword graph database so I started looking into those because obviously there should be a solution for a problem like the history problem. Graph databases seemed like it might be a solution.
RVB: 06:00 You're not alone there, right. I mean recently there was this beautiful video from the Codex example. I don't remember the guy's name but it was beautiful I thought.
PJ: 06:09 Yeah, which-- the Australian guy. You should add that link to the--
RVB: 06:15 I will. I will.

PJ: 06:17 -- because interestingly enough, there was a lot of overlap between the two, at least in the words used and the concepts used within both projects. It's amazing how history--
RVB: 06:32 So what was the main benefit then? Was it modeling then or also performance? What was the main driver then for looking at the graph databases?
PJ: 06:41 The main driver was the ability to connect stuff to stuff. So a person should be able to connect to an event, and when you look at that person you should see the events, but when you are at the event, you should be able to ask which persons are connected to this event, and you will see that relation. Doing that in SQL is sort of possible within certain limits, but the ease with which a graph database allows you to do that is very much better than a SQL database.
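(In Cypher, the two directions Paul describes are equally easy to express - a minimal sketch with hypothetical labels and names.)

// from the person to their events...
MATCH (p:Person {name: "Rembrandt"})-[:CONNECTED_TO]->(e:Event)
RETURN e;

// ...and from the event back to the people, over the very same relationships
MATCH (e:Event {name: "Night Watch commission"})<-[:CONNECTED_TO]-(p:Person)
RETURN p;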
RVB: 07:21 When I first got to know you in the Amsterdam community - and I will always quote you on this - you said that once you've worked with Neo4j, SQL databases feel like a useful sin [laughter]. I think that's a fantastic way of putting it, to be honest.
PJ: 07:36 They are. They are. Obviously there is room for other systems as well, but the relation between the web and a graph database is so much more of a logical model than any other model. For any page you're at on a website which is explorative, it's always about: OK, there is this thing, and how does it relate to other things? And that question can be answered by Neo within a couple of milliseconds, and you'll be able to render the results of that page in real time, instead of doing queries up front and caching the results and stuff like that. There are all kinds of technical tricks to make it work with SQL, but your life becomes so much easier when you use the right technology for the right job.
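(That "explorative page" pattern is a one-liner in Cypher - a sketch, with a hypothetical Thing label standing in for whatever the page is about.)

// one thing, and everything directly related to it, in a single cheap query
MATCH (t:Thing {slug: "some-page"})--(related)
RETURN t, related;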
RVB: 08:30 Agreed. So last but not least, where do you think it's going Paul? Where do you think the world of graph databases is headed and how do you plan to use it in the future?
PJ: 08:44 I think more and more people will discover the world of graph databases--
RVB: 08:51 See the light [laughter]?
PJ: 08:53 See the light, and make their life easier by using them. That is one thing I think is going to happen. I've always been a forerunner with technology. People always ask, "Why are you picking that?" Just because I thought it was a good tool. So I think behind me, there are a lot of people picking it up, and I hope that the new versions of Neo will take web development even more into account. There are a couple of things you might want to make easier, like relating to files in a file system and stuff like that, that would make life easier, but I think we're getting there [crosstalk]. There are interesting developments, like the binary interface and stuff like that.
RVB: 09:51 Absolutely. Well, thank you so much Paul. It was a joy to talk to you again. And I wish you lots of luck and happiness in 2016, and hopefully we'll see this adoption of graphs take off together, right? And I'll see you in the Amsterdam community very, very soon.
PJ: 10:11 Yes.
RVB: 10:12 Thank you, Paul.
PJ: 10:13 Okay. No problem.
RVB: 10:14 Thank you. Bye bye.
PJ: 10:14 Bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

Wednesday, 13 January 2016

The GraphBlogGraph: 3rd blogpost out of 3

Querying the GraphBlogGraph

After having created the GraphBlogGraph in a Google Spreadsheet in part 1, and having imported it into Neo4j in part 2, we can now start having some fun analysing and querying that dataset. There are obviously a lot of things we could do here, but in this final blog post I am just going to explore some initial things that I am sure you could then elaborate and extend upon.

Let’s start with a simple query

// Which pages have the most links
match (b:Blog)--(p:Page)-[r:LINKS_TO]->(p2:Page)
return b.name, p.title, count(r)
order by count(r) desc
Run this in the Neo4j browser and we get:

or just return the graphical result with a slightly different query:

match (b:Blog)--(p:Page)-[r:LINKS_TO]->(p2:Page)
with b,p,r,p2, count(r) as count
order by count DESC
limit 50
return b,p,r,p2

And then you start to see that Max De Marzi is actually the “king of linking”: he links his pages to other web pages a lot (which is actually very good for search-engine-optimization).

A quick visit to one of Max’ pages does actually confirm that: there are a lot of cool, bizarre, but always interesting links on Max’ blogposts:
So let’s do another query. Let’s look at the different links that exist between blogposts of our blog-authors. Are they actually quoting/referring to one another or not? Let’s do

//links between blogposts
MATCH p=((n1:Blog)--(p1:Page)-[:LINKS_TO]-(p2:Page)--(b2:Blog))
RETURN p;

and then we actually find that there are some links - but not that many.


Same thing if we look at this a different way: let’s do some pathfinding and check out the paths between different blogs, for example my blog and Michael’s

match (b1:Blog {name:"Bruggen"}),(b3:Blog {name:"JEXP Blog"}),
p2 = allshortestpaths((b1)-[*]-(b3))
return p2 as paths

Then we actually see some more interesting connections: we don’t refer to one another directly very often, but we both refer to the same pages - and those pages become the links between our blogs. At depth 4 we see these kinds of patterns:

Interesting, right? I think so, at least!

Then let’s do some more playing around, looking at the most linked to pages:

//Which pages are being linked to most
match ()-[r:LINKS_TO]->(p:Page)
return p.url, count(r)
order by count(r) DESC
limit 10;

That quickly uncovers the true “spider in the web”, my friend, colleague and graphista-extraordinaire: Michael Hunger:

Last but not least, I wanted to revisit an old and interesting way of running PageRank on Neo4j using Cypher (therefore not using the GraphAware NodeRank module). I blogged about it some time ago, and it’s actually really interesting and easy to do. Here’s the query:

// 50 rounds: from a random sample of pages (rand() < 0.1, roughly 10%),
// follow outgoing links up to 10 hops and increment the rank of each page reached
UNWIND range(1,50) AS round
MATCH (n:Page)
WHERE rand() < 0.1
MATCH (n:Page)-[:LINKS_TO*..10]->(m:Page)
SET m.rank = coalesce(m.rank,0) + 1

This does 50 iterations of PageRank, using a 0.1 damping factor and a maximum depth of 10. Running it is surprisingly quick:
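One practical note worth adding here (my addition, not part of the original approach): the query accumulates into the rank property, so if you want a clean re-run of the iterations, reset it first:

// reset the accumulated rank before re-running the iterations
match (n:Page)
where n.rank is not null
remove n.rank;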

If you do that a couple of times, and even do a few hundred iterations at once, you will quickly see the results emerge with the following simple query:
match (n:Page)
where n.rank is not null
return n.url, n.rank
order by n.rank desc
limit 10;
Confirming the “spider in the web” theory that I mentioned above. Michael rules the links!


All of these queries are of course on GitHub for you to play around with. I would love to hear your thoughts on these three blogposts, and I hope that they were as much fun for you to read as they were for me to write.

All the best.

Rik

Monday, 11 January 2016

The GraphBlogGraph: 2nd blogpost out of 3

Importing the GraphBlogGraph into Neo4j

In the previous part of this blog-series about the GraphBlogGraph, I talked a lot about creating the dataset for what I wanted: a graph of blogs about graphs. I was able to read the blog-feeds of several cool graphblogs with a Google spreadsheet function called “ImportFEED”, and scrape their pages using another function, “ImportXML”. So now I have the sheet ready to go, and we also know that with a Google spreadsheet, it is really easy to download that as a CSV file:

You then basically get a URL for the CSV file (from your browser’s download history):

and that gets you ready to start working with the CSV file:

I can work with that CSV file in Cypher’s LOAD CSV command, as we know. All we really need is to come up with a solid Graph Model to do what we want to do. So I went to Alistair’s Arrows, and drew out a very simple graph model:



So that basically gets me ready to start working with the CSV files in Cypher. Let’s run through the different import commands that I ran to do the imports. All of those are on GitHub of course, but I will take you through them here too...

First create the indexes

create index on :Blog(name);
create constraint on (p:Page) assert p.url is unique;

Then manually create the blog-nodes:

create (b:Blog {name:"Bruggen", url:"http://blog.bruggen.com"});
create (n:Blog {name:"Neo4j Blog", url:"http://neo4j.com/blog"});
create (n:Blog {name:"JEXP Blog", url:"http://jexp.de/blog/"});
create (n:Blog {name:"Armbruster-IT Blog", url:"http://blog.armbruster-it.de/"});
create (n:Blog {name:"Max De Marzi's Blog", url:"http://maxdemarzi.com/"});
create (n:Blog {name:"Will Lyon's Blog", url:"http://lyonwj.com/"});

I could have done that from a CSV file as well, of course. But hey - I have no excuse - I was lazy :) … Again…

Then I can start with importing the pages and links for the first (my own) blog, which is at blog.bruggen.com and has a feed at blog.bruggen.com/feeds/posts/default:

//create the Bruggen blog entries
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w/export?format=csv&id=1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w&gid=0" as csv
match (b:Blog {name:"Bruggen", url:"http://blog.bruggen.com"})
create (p:Page {url: csv.URL, title: csv.Title, created: csv.Date})-[:PART_OF]->(b);

This just creates the 20 leaf nodes from the Blog node. The fancy stuff happens next, when I read from the “Links” column, holding the “****”-separated links to other pages, split them up into individual links, and merge the pages and create the links to them. I use some fancy Cypher magic that I have also used before for Graph Karaoke: I read the cell, then split it into parts and put them into a collection, and then unwind the collection and iterate through it using an index:

//create the link graph
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w/export?format=csv&id=1LAQarqQ-id74-zxV6R4SdG7mCq_24xACXO5WNOP-2_w&gid=0" as csv
with csv.URL as URL, csv.Links as row
unwind row as linklist
//split the "****"-separated cell into a collection of trimmed links
with URL, [l in split(linklist,"****") | trim(l)] as links
//iterate by index over all but the last element of the collection
unwind range(0,size(links)-2) as idx
MERGE (l:Page {url:links[idx]})
WITH l, URL
MATCH (p:Page {url: URL})
MERGE (p)-[:LINKS_TO]->(l);

So this first MERGEs the new pages (finds them if they already exist, creates them if they do not yet exist) and then MERGEs the links to those pages. This creates a LOT of pages and links, because of course - like with every blog - there are a lot of hyperlinks that are the same on every page of the blog (essentially the “template” links that are used over and over again).
And as you can see it looks a little bit like a hairball when you look at it in the Neo4j Browser:
So in order to make the rest of our GraphBlogGraph explorations a bit more interesting, I decided that it would be useful to do a bit of cleanup on this graph. I wrote a couple of Cypher queries that remove the “uninteresting”, redundant links from the Graph:

//remove the redundant links
//linking to pages with same url (eg. archive pages, label pages...)
match (b:Blog {name:"Bruggen"})<-[:PART_OF]-(p1:Page)-[:LINKS_TO]->(p2:Page)
where p2.url starts with "http://blog.bruggen.com"
and not ((b)<-[:PART_OF]-(p2))
detach delete p2;
//linking to other posts of the same blog
match (p1:Page)-[:PART_OF]->(b:Blog {name:"Bruggen"})<-[:PART_OF]-(p2:Page),
(p1)-[lt:LINKS_TO]-(p2)
delete lt;

//linking to itself
match (p1:Page)-[:PART_OF]->(b:Blog {name:"Bruggen"}),
(p1)-[lt:LINKS_TO]-(p1)
delete lt;

//linking to the blog provider (Blogger)
match (p:Page)
where p.url contains "//www.blogger.com"
detach delete p;

Which turned out to be pretty effective. When I run these queries I weed out a lot of “not so very useful” links between nodes in the graph.
And the cleaned-up store looks a lot better and is much more workable.

If you take a look at the import script on GitHub, you will see that there’s a similar script to the one above for every one of the blogs that we set out to import. Copy and paste them into the browser one by one, use the neo4j shell, or use LazyWebCypher, and have fun:
So that’s it for the import part. Now there’s only one thing left to do, in Part 3/3 of this blogpost series, and that is to start playing around with some cool queries. Look out for that post in the next few days.

Hope this was interesting for you.

Cheers

Rik