Thursday 24 December 2015

Podcast Interview with Andreas B. Kollegger, Neo Technology

Yey! The festive season is upon us! And: here's the 50th (!!!) podcast episode in the Graphistania recordings. What a journey it has been! To celebrate, I got to talk to my friend and colleague Andreas B. Kollegger. Andreas was probably one of the first people that I talked to when I met the Neo4j team for the first time in the summer of 2012, and he always impressed me with his calm and creative mindset. To give you an example, here's a short demo that Andreas created showing how you can create the essence of a recommendation engine in Neo4j in 2 minutes:

Super sweet. Nowadays, "ABK" is part of the product management team at Neo4j, and he has plenty of interesting things to talk about when it comes to past, present and future of Neo4j. So let's listen to the episode:

Here's the transcript of our conversation:
RVB: 00:02 Hello everyone. My name is Rik, Rik Van Bruggen from Neo Technology. Tonight, I am going to record a podcast episode that I've been looking forward to for a very long time, with my dear friend in dark, rainy, beautiful Portland, Andreas Kollegger. Hi Andreas.
ABK: 00:20 Hello, Rik. Thank you for having me on the podcast today. 
RVB: 00:23 Yeah, I know. It's such a joy, thanks for coming on. I am going to call you ABK for short, if you don't mind; lots of people know you as ABK. Why don't you introduce yourself, Andreas? You've been part of the Neo4j ecosystem for a very long time, but maybe some people don't know you yet.
ABK: 00:42 Sure. My name is Andreas B. Kollegger. I'll let the B be a mystery for the moment. I work for Neo Technology and I've been a part of the Neo4j community for-- it feels like as long as I can remember, at least as far back as the epic 0.9 release of Neo4j.
RVB: 01:01 Oh no way. 
ABK: 01:03 Yeah, it's been quite a while, and I've grown from community member all the way through to now being a product manager or product designer, depending on who you talk to and the time of day.
RVB: 01:14 [laughter] That's very funny. And you've been a great stimulator of the Neo4j community. I remember talking to people on the East coast, West coast, South, wherever, that said, ABK made me start up a community over here or meet up over here. You've been part of it for such a long time, right? 
ABK: 01:34 Yeah, that's right. From when we were truly just trying to get things going and get excitement through meet-ups and lots of events, and actually trying to continue that trend. When I was on the East coast it was my duty and honor and pleasure to travel up and down almost all the way from Boston down to around Washington, having meet-ups, meeting people, talking whenever I could about Neo4j and spreading the great love of graphs that I'd found. And I'm doing the same now here in Portland. Actually, tonight will be my very first community meet-up in Portland. I'm very excited about that.
RVB: 02:11 Super cool, super cool. When you say product manager, product designer, what does that mean? What do you do as a day job? 
ABK: 02:21 Yeah, that's why the sort of manager versus designer split is interesting. When I moved into this role, it was thought of more as doing product design, which I guess is certainly a bit of a vague term. If we were building physical products like phones or something, then I would very clearly be doing graphic design for the phones, like in 3D or something; sculpting physical objects would be part of the product design. But as I'm involved in a software product, product design ranges a bit more, obviously starting from the front-end user experience elements of the product. But we also thought about it as paying a bit more attention to the experience of using Neo4j generally. Any of the parts of Neo4j that you touch, whether it's an API or the documentation, on to the website: it would be good to have somebody just try to connect all the different parts and have them all make sense, so that when you read something on the website it reflected how that part actually behaved, and messages and things had the same voice and tone as maybe some of our blog posts. Obviously, that was only really something that made sense when we were a smaller organization. Now that we've grown substantially, we have people who are superb at each one of these things. I've been doing less and less of that and thinking much more in terms of purely product management type of stuff, which is taking on different features of Neo4j: taking a look, I guess, at the giant list of things we would love for Neo4j to become, and what it is today, and where we think we are going, and figuring out what to do next, and how much of it to do next.
RVB: 04:03 But you're figuring in more big things, right? I mean, you're the movie star in the Neo4j trainings, right?

ABK: 04:11 [laughter] I suppose that's true. I do forget that, and every now and again, I'll still meet people and they'll look at me and they'll say, "Wait I know you. Aren’t you the guy from the videos?" "[laughter] That's right. That is me." So, if you’ve had the pleasure of using our online tutorials and watching the short video clips, that is me in the video clips. 
RVB: 04:30 Exactly, in a beautiful tie and [laughter]... So why graphs, ABK? What attracted you to the graphs in the first place? And what fires you up every morning to keep on working on this stuff? 
ABK: 04:48 I have to say that my motivation hasn't changed since the early days of when I first went and tracked down Neo4j. I was in that generation of people who started looking for a graph before we knew we were looking for a graph. I'd been doing international non-profit work. Actually, with this wonderful organization doing work in sub-Saharan Africa, effectively doing medical informatics work, right? We were doing patient care and disease surveillance and things like that, and with so many of the data models we were working with, we had the sort of classic realization that our traditional, good old relational database models were really perfect and awesome for the reporting we needed to do, but maybe not so great for actually doing any analysis and trying to understand public health concerns like, why did this pattern of disease progress in the way that it did? And, sort of collectively, I think, we had an understanding that what we were doing was a graph problem. We didn't think about it, I think, in that way. Except that I happened to be lucky enough to be, at the time, living in Baltimore, and one of my neighbors was heavy into ontology databases. And I was looking at his database and I thought, "Oh wow, that's brilliant. That's maybe exactly what I want." He talked to me about it, he said, "Well, maybe, but this may be more than you actually need. There could be something in between that has the flexibility and expressiveness of an ontology database but without being entirely prescriptive, so it could be a bit more flexible for the application and easy to use." And he actually introduced me to Neo4j. He said, "Why don't you go check out this project. It looks like it might be perfect for what you're trying to get done." And I fell in love. It was exactly what I wanted. It thought about data the way that I wanted to think about data.
And so, I used it for a few projects, I tried to, you know, get involved with the community and make some contributions of my own to the code base, and that's what began my lifelong sort of journey with Neo4j and the organisation and community.
RVB: 06:56 Do you remember what was the killer feature that attracted you to sort of get started with it, what was that? Was it the domain model? Or what was it exactly that attracted you so much? 
ABK: 07:07 Honestly, it was this whole-- it's simple to say a thing like that: it's all about relationships [chuckles]. It's almost trite, but that simple shift in thinking, from looking at and caring about the individual records to thinking about how these records relate. That's where all the value was in all the data modeling I was doing, all the applications I was doing. That was so much more powerful than the individual records themselves, because that's where you see patterns and progressions of things, and Neo4j elevated it and made it an actual concern you dealt with as part of normal modeling, rather than maybe later on adding in some foreign key constraints or something.
RVB: 07:51 Yeah, totally. Is that something that you still think is a core thing to the product? This relationship-centric view on things, is that still one of the core things? 
ABK: 08:03 I do think that it really is-- I think that's really where the long-term value of the graph model is; that's where the power is, in the relationships. One of the near-term challenges we have is the challenge of getting started with and being introduced to Neo4j. I happened to have my own epiphany: I realized that this is what I wanted, and it felt perfect and it was awesome. But until you think in that way, it can seem weird, right? I feel like we are in this place where we've done a really great job of making graphs awesome, but we can actually do a little bit more to make graphs easy to use as well. As it is right now, it's great that you think about relationships, but if you're always thinking about relationships, then for some amount of structuring, and just for getting started, it's like you have to think too much. And we'd like to find a nice balance where you have to think only a little bit. If you're doing something simple, then you don't have to think too much. The simple things are really easy to do, but you don't get caught in a corner where, because we've made it too simple, it's hard to do the more expressive and richer things. So, that's the balance I think that we're trying to move towards in the next-- actually, certainly in the next release as well. We're starting to put in some bits of capability that will make that, I think, a nicer interaction.
RVB: 09:31 You know what, you're setting yourself up for my final question [laughter]. Where is it all going in this? Where do you see the industry, but also the product, a couple of years from now? What does the future hold? 
ABK: 09:48 Yeah, I think that certainly in the industry - that's the broader database industry - and we're part of the NoSQL segment of it, which I think people have finally realized isn't really a separate segment. It's just people trying to deal with lots of data and figure out what the best way is to work with all that data. And from each of our different starting points, whether it's graph databases, the column stores, the key-value stores, or anything else, we're all, of course, iterating on our world view and slowly progressing towards a common understanding that we want to be able to do all the things really well. Of course, I still think that at the end of the day, graphs are going to be the best way always to think about everything, but as I was saying, I guess, maybe they have a little bit of extra thinking you've got to do just before you start structuring things. So there are things we can do to improve that, but I feel like within the next couple of years we'll see other databases realize that they want to do graph stuff and they'll start adding graph features, and you'll see us making it easier to do stuff that isn't strictly graph stuff. Simple things like, let's say, my favorite is always to say: if you want to manage a list of things, it's very easy to conceive of, but you've got to do a little bit of work if what you're doing is always managing relationships, connecting and disconnecting things. That should be dead simple to do, and I think you're going to see in the next couple of years that we have an easy way of expressing that as well.
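An aside that is not part of the conversation: the "list of things" point can be made concrete with a small sketch. It assumes a list is modelled as items chained by a single "next" relationship, held here in a plain Python dictionary rather than in Neo4j; the names `insert_after` and `walk` are made up for illustration. Inserting one item means disconnecting one relationship and connecting two new ones, which is the bit of extra work ABK describes.

```python
def insert_after(next_of, anchor, new_item):
    """Insert new_item right after anchor in a chain of 'next' links."""
    # Disconnect the existing anchor -> following link, if there is one.
    following = next_of.pop(anchor, None)
    # Connect anchor -> new_item.
    next_of[anchor] = new_item
    # Reconnect new_item -> following so the rest of the list is kept.
    if following is not None:
        next_of[new_item] = following


def walk(next_of, head):
    """Follow the 'next' links from head and return the items in order."""
    items = [head]
    while items[-1] in next_of:
        items.append(next_of[items[-1]])
    return items


# Hypothetical list a -> b -> c, stored as one link per item.
chain = {"a": "b", "b": "c"}
insert_after(chain, "a", "x")
print(walk(chain, "a"))  # ['a', 'x', 'b', 'c']
```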
RVB: 11:22 So what's your favorite feature in Neo4j 3.0? Sorry, trick question [laughter]. 
ABK: 11:36 So there are two things I'm excited about in Neo4j 3.0. One is actually just a very simple change to how we present what's currently called Neo4j Browser, our user client for accessing the database. We're taking just some practical steps there to actually separate development of that from development of the database. And coupling that with the new protocol that we have, this Bolt protocol for connecting with Neo4j, gives us the opportunity to do something that was awkward to do previously, which is that you can run the Neo4j client separate from Neo4j, and it can connect to any Neo4j database that happens to be out there. It doesn't have to be tied to the database that started it up, right?
RVB: 12:17 Yup. 
ABK: 12:18 I think that's going to be brilliant. For just day-to-day use of Neo4j, it will make things much more pleasurable, and also we'll be able to deliver the client separately from the database and have more frequent updates and feature requests going in. So, that'll be pretty exciting.
RVB: 12:36 You know, I think there's so many nice things we could talk about. But as you know, I want to keep these podcasts digestible and short, so that people can listen to it on their commute, so I'm going to thank you so much for coming online Andreas. It's been a very nice conversation. I really appreciate it. Thanks again, and I look forward to seeing you soon. 
ABK: 13:00 Thank you, Rik, this is great fun. We'll have a beer soon.
RVB: 13:02 Absolutely, no doubt. Cheers bye bye. 
ABK: 13:05 Cheers.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best


Thursday 10 December 2015

Podcast Interview with Ashley Sun, LendingClub

At the last GraphConnect in San Francisco, we had a wonderful Neo4j user on stage presenting their usage of Neo4j in a very modern and insightful way: to manage and automate some of their software development processes. Ashley Sun can tell you all about this in more detail, so without further ado, here's this week's Graphistania episode:

As always, here's the transcript of our conversation:
RVB: 00:01 Hello everyone, my name is Rik, Rik Van Bruggen from Neo, and here I am recording a podcast episode together with a wonderful guest on the podcast episode, all the way from California, Ashley Sun. Hi Ashley. 
AS: 00:17 Hi Rik, thanks for having me. 
RVB: 00:22 Thanks for coming on the podcast. Ashley, you work for-- 
AS: 00:23 Of course. 
RVB: 00:23 --Lending Club, right? 
AS: 00:26 Yes. 
RVB: 00:26 I've seen some of the videos of GraphConnect and I've read a little bit about what you guys are doing, but why don't you introduce yourself to our listeners?
AS: 00:37 Okay. Hi, I'm Ashley. I work on the DevOps team at Lending Club based in San Francisco. I work a lot on deployment and release automation, and I use Neo4j to do it. 
RVB: 00:54 Wow, that's great. How long have you been doing work with Neo4j, Ashley? It must have been for a long time already or-- 
AS: 01:01 Only a little over a year, I'd say, so my manager first introduced it to me. I think he stumbled upon graph databases on Twitter or something and he's like, "Hey, check out this new thing called Neo4j." And so, we started playing around with it, and it quickly evolved from just a side project to being a really critical part of a lot of our release and deployment automation, infrastructure mapping, app auto-discovery, and a lot of other things, actually. 
RVB: 01:37 That's a great segue into what do you guys use it for exactly? Why don't you tell us a little bit more about that?
AS: 01:43 Sure. So, we use it for a lot, a lot of things, actually. So, as I was saying, I guess it was very opportunistic when we started using Neo4j. We had a lot of problems in DevOps and growth pains. So, we started with maybe like five micro-services and a couple years later, we're almost at 150, and so it was getting really difficult to manage and keep track of all these services, and so what we did is we used Neo4j to keep track of all these instances. We had them radio home-- we have this internal app called MacGyver. So, we had them radio home every minute to MacGyver, and MacGyver would save all these app instances in Neo4j, and so already, immediately, we just gained a lot of visibility into what services were out there, where they were running, a lot of info like that. And it was really low maintenance, it was easy to scale, we didn't have to do any work because these new instances would just keep reporting back to MacGyver and get saved into Neo4j.
So, from there, we were like, "Oh, this is really useful, so we're going to take this a step further." And so, at Lending Club, we use blue-green deployments. Basically, this just means that we have two pools for every app, and at any given time, only one pool is live. We didn't have a good way before to track what pool is live and what pool is dark, and so we started using Neo4j. We were already mapping our app check-ins, and so we took that and then within Neo4j, created the server nodes, which we then mapped to belong to pool nodes, which we then mapped to belong to service groups. By keeping track of what servers existed in what pool, and whether that pool was live or dark, we were able to automate our releases. Whereas before, releases were very manual. We'd have to go into this GUI and check mark all these boxes; it was just very tedious and very time-consuming. Now with Neo4j, we are keeping track of this info, and so it was just really quick. It was like a flip of a switch and we could make a pool live or dark or even really easily look up what pool is live. And also we used it to track the health of our instances and our apps. So, that was also really, really important and that's what we're using now for deployments.
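An aside that is not part of the conversation: the server, pool and service group structure Ashley describes can be sketched in a few lines. This is a minimal illustration in plain Python, not Lending Club's MacGyver code and not a Neo4j query; the class names, the "checkout" service and the server names are all hypothetical. It shows why a release becomes "a flip of a switch" once the live or dark state is stored on the pool.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str                                     # e.g. "blue" or "green"
    live: bool = False                            # True = takes traffic, False = dark
    servers: list = field(default_factory=list)   # servers that belong to this pool


@dataclass
class ServiceGroup:
    name: str
    pools: dict = field(default_factory=dict)     # pool name -> Pool

    def live_pool(self):
        """Look up the single pool that is currently live."""
        live = [p for p in self.pools.values() if p.live]
        if len(live) != 1:
            raise RuntimeError(f"{self.name}: expected one live pool, found {len(live)}")
        return live[0]

    def flip(self):
        """Make the dark pool live and the live pool dark (two-pool setup)."""
        for pool in self.pools.values():
            pool.live = not pool.live
        return self.live_pool()


def example_group():
    """Build a hypothetical service group with one live and one dark pool."""
    blue = Pool("blue", live=True, servers=["app-01", "app-02"])
    green = Pool("green", live=False, servers=["app-03", "app-04"])
    return ServiceGroup("checkout", {"blue": blue, "green": green})


group = example_group()
print(group.live_pool().name)  # blue
print(group.flip().name)       # green
```

In a graph database the same shape would be server nodes related to pool nodes related to service group nodes, with the live flag as a property on the pool.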
RVB: 04:18 Super cool. By the way, I love the naming. I'm a big fan of MacGyver [laughter]. 
AS: 04:24 Awesome. Actually, my manager came up with the idea of the name and I had no idea what MacGyver was and just-- my time of-- I'm like, "Is that like MacGruber from SNL?" So, I had to watch an episode of MacGyver to-- 
RVB: 04:41 I guess I'm showing my age here a little bit [laughter]. Ashley, so it's basically what you're using for dependencies between all the micro-services, is that what I'm hearing? You're basically tracking everything with these automated ping backs, but then you're mapping it onto like a model of all your micro-services, is that what I'm hearing? 
AS: 05:05 Yes, that's correct. So, taking the instances and then arranging that data in a way that becomes useful, so that we know where our apps are and what's live at any time and what's dark. And also even-- so, we're mapping dependencies from services. So, we map them onto-- for example, vCenter instances and vCenter hosts, and then we take those vCenter arrays, then map those to our storage arrays and our storage volumes. So, what we get is like this huge mapping of our infrastructure. For example, if we want to find a single point of failure: say we have an app called ABC and all of its instances reside on one vCenter host, and if that host goes down, then our entire app is wiped out. That's a single point of failure. So, we use Neo4j to keep track of things like that, to avoid these-- it would be a huge disaster if that were to happen.
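An aside that is not part of the conversation: the single point of failure check Ashley describes boils down to "find every app whose instances all sit on the same host". Here is a minimal sketch in plain Python, assuming the app-to-host placements are available as simple tuples; the function name, instance names and host names are invented for illustration, and in Neo4j this would be a query over the infrastructure graph instead.

```python
from collections import defaultdict


def single_points_of_failure(placements):
    """Return {app: host} for every app whose instances all run on one host.

    placements is an iterable of (app, instance, host) tuples.
    """
    hosts_by_app = defaultdict(set)
    for app, _instance, host in placements:
        hosts_by_app[app].add(host)
    # An app with exactly one distinct host goes down when that host goes down.
    return {
        app: next(iter(hosts))
        for app, hosts in hosts_by_app.items()
        if len(hosts) == 1
    }


# Hypothetical placements: ABC is entirely on one host, XYZ is spread over two.
PLACEMENTS = [
    ("ABC", "abc-1", "vhost-07"),
    ("ABC", "abc-2", "vhost-07"),
    ("XYZ", "xyz-1", "vhost-07"),
    ("XYZ", "xyz-2", "vhost-09"),
]

print(single_points_of_failure(PLACEMENTS))  # {'ABC': 'vhost-07'}
```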
RVB: 06:12 Yeah. And I seem to recall from one of your talks that you also use this tooling to help you guys do more stuff with Amazon web services. Can you tell us a little bit more about that or did I get that completely wrong? 
AS: 06:28 No, no. You're totally right. One of our multi-year projects is we're moving into AWS, and so we're just starting that process now, but already we are mapping a ton of AWS stuff into Neo4j. For example, like our VPCs, subnets, availability zones, our RDS instances and EC2 instances, those map to load balancers, auto-scaling groups, launch configs. You can tell already, it's like a huge, a huge map in Neo4j. So, there's all these different parts, but we are able to make sense of it by mapping relationships together in Neo4j, and also as we move into AWS, we'll start using CodeDeploy. But again, using Neo4j to automate that and put it into MacGyver, so that developers at any time can say, "Hey, I need an instance and I want to launch this app onto it." It'll just be really simple and we'll use MacGyver and Neo4j to do that.
RVB: 07:33 Super cool. So, that brings me to the question that I ask everyone on this podcast, why Neo4j? Why a graph database to do what you're doing? Was there any specific reason for that or is there anything you want to call out that you really like about it in your current environment? 
AS: 07:54 Definitely. I guess, first off, the low latency, and really the ad-hoc querying is super, super useful. I think another thing that really stands out to me is how flexible and scalable Neo4j is. So, we started small just with app instances, but it's really easy to build new layers and new relationships on top of already existing ones, and so where we started with just app instances, now it's become this huge infrastructure mapping of so many different types of nodes. It's really cool how with Neo4j, your data set can really easily evolve and grow in terms of complexity or structure. It's just so easy to use and that's why we've been able to keep using it. And also, obviously, it's just really good at graphing relationships between things and mappings. That's where I really--
RVB: 08:52 Yeah, it makes total sense. Do you guys use Cypher at all? Do you do interactive querying or maybe--? 
AS: 08:58 Yeah. 
RVB: 08:57 Yeah, you do. 
AS: 08:59 Actually, within MacGyver, we have a web interface for Neo4j and people enter Cypher queries to look up stuff.
RVB: 09:06 Super cool. Very good. So, where is it going, Ashley? Where are you guys going to take Neo4j in the future? Any perspectives on that? I'd love to know more about that. 
AS: 09:17 Well, it's already a really integral part of MacGyver and we're just using it to hold everything together and-- MacGyver also has become a central point of information. As I said, as we move into AWS, we're going to keep putting all that stuff into Neo4j. We also are using it to track our asset management, and even as I was saying before, the infrastructure mapping. We could add network and database components into that and get-- just keep building the infrastructure map, keep building our AWS map. Another thing that's on the road map is to utilize those app instance check-ins to create a service registry for all of our apps. That way, we can keep track of who owns this app, or maybe if we map it to get [?] what repo is it? What Jenkins job does this correspond to? Is this out public? So, we'll have a service registry of all our apps, where people can go and just find out info that otherwise would be difficult to pin down.
RVB: 10:22 Super cool. I think it's a great use case for Neo4j, and I'm so happy that you guys found your way to it and are getting good use out of Neo4j. So, it's really-- 
AS: 10:34 Yeah, me too. 
RVB: 10:34 Yeah, it's really great and I think we'll wrap up the podcast for now. I really want to thank you for coming online and talking to us about it. 
AS: 10:45 Of course. 
RVB: 10:45 If you don't mind, I'll share a couple of links to your video from GraphConnect-- 
AS: 10:51 Yeah, of course. 
RVB: 10:51 --on the blog post as well. And I look forward to hearing more about you guys as you guys expand, and as you guys grow the service with Neo4j. 
AS: 11:04 Awesome. Thanks, Rik. Thanks for having me. 
RVB: 11:06 Thank you so much, Ashley. Bye. 
AS: 11:07 Bye.

All the best


Friday 4 December 2015

My new "Intro to Graphs" prezi

This week I finally got round to updating and "voice-overing" my Introductory talk about Neo4j and Graph Databases. As you would expect, it's a bit different from some of the early introductory talks, and has a lot more examples and use cases mentioned in it. I would be curious to learn what you think about it - so PRESS PLAY below, sit back and RELAX.

Hope this is useful for you - let me know if you have any comments.



Thursday 3 December 2015

Podcast Interview with Will Lyon, Neo Technology

At the last GraphConnect, I spent some time at the GraphClinic helping lots of interested attendees get the most out of Neo4j. I really enjoyed it, also because for a good part of the time, I shared the "clinic" with one of my colleagues, Will Lyon, who is working in our Developer Evangelism team. Will has been working on lots of cool stuff with Neo4j for the longest time, and has plenty of stuff to share and discuss. So we got on a Skype call - and ... chatted away... here's the result:

Here's the transcript of our conversation:
RVB: 00:00 Hello everyone, my name is Rik, Rik Van Bruggen from Neo Technology and here we are again recording a Neo4j graph database podcast. It's been a while since we've been doing recordings, and tonight I'm joined by Will Lyon, all the way from California. Hi Will.
WL: 00:19 Hi Rik, thanks for having me.
RVB: 00:20 Hey, good to have you on the call. I thank you for joining us. Will, I've read a bunch of your blog posts and I've seen a bunch of your work but many people may not have seen it yet, so why don't you introduce yourself to get us going?
WL: 00:36 Sure, thanks. I'm Will Lyon, I'm on the developer relations team at Neo. That means that it's my job to help encourage awareness and drive adoption of Neo4j and also graph databases in general. So, I do this by writing blog posts that talk about Neo4j and graph databases, building cool demo apps, integrating with other technologies, proving out new use cases. For example, earlier this week I was at QCon Conference in San Francisco talking to our users there. Tomorrow, I'll be giving a webinar about using Neo4j and MongoDB together.
RVB: 01:18 Wow! Super cool. And then, how long have you been working with Neo, just as a community member, Will? Quite some time, right?
WL: 01:25 Quite some time, I joined the company just in September. So, I have been with Neo Technology for about two months now. Prior to that, I was working as a software developer for a couple of start ups and always trying to work Neo into the job.
RVB: 01:43 That's very cool. Well, that also immediately begs the question, why, right? Why were you trying to work Neo into your job all the time? What attracted you to it, I suppose?
WL: 01:55 Sure. The first time I was exposed to Neo was a few years ago at a hackathon over the weekend, and the team I was working with, we needed a project. We had read a blog post about building recommender systems with Neo4j, this graph database thing. I didn't know anything about graph databases or collaborative filtering recommender systems, but I thought it sounded interesting. So, we tackled this project over the course of the hackathon and we were able to build a GitHub repository recommender system. So, it looked at your previous activity on GitHub as an open source contributor and recommended other repositories that you might be interested in. It was a really fun project to put together, and I was amazed at how sort of easy it was to get going with Neo4j and Cypher, the query language, and actually build this application. At the end of the weekend, it worked and we went on to actually win the hackathon. So, I was sort of--
RVB: 03:02 Wow! That's cool.
WL: 03:03 Yeah. I was hooked from that point on. What I really liked about Neo is the way that you think about the data model with graph data is very close to how we think about data in the real world. So you have this very close mental map. It seems very intuitive when we're thinking about our data model. For example, Rik is my co-worker, I'm at a conference, the conference is in San Francisco. These are all entity nodes and relationships, and so it [crosstalk]-- so, it seems very easy to express very complex data models. We don't have this weird transformation that we have to go through.
RVB: 03:52 Absolutely. What made it so productive then to implement that recommender system? What was it that made-- is it just the model or is it also Cypher? What made it so easy to develop with, in that particular case?
WL: 04:05 Sure, I think, really, Cypher was the biggest thing for us, and just being able to define the problem that we were trying to solve as a traversal through this graph, and being able to very clearly define that pattern in a Cypher query and get that back right away. It was actually very easy to build something that was not quite trivial.
RVB: 04:36 Yeah, I know. I understand. Well, I've seen some of your other hackathon work, like for example, that thing that you built to fire multiple Cypher queries, and now you're working on something really interesting to import CSVs, you told me, right?
WL: 04:51 Sure. So, on the developer relations team, one thing that we're focused on is the new user experience. So, for users seeing Neo4j for the first time, what's the first thing they want to do? Well, a lot of times that's play with their own data. And so, we are trying to make that process of importing your data into Neo4j much easier. So, one of the projects I'm working on is a web application that guides the user through the process of converting their CSV files into a graph data model, and then allows you to quickly execute those against the Neo4j instance to import your data.
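An aside that is not part of the conversation: the traversal Will describes (from you, to the repositories you contributed to, to other contributors, to their other repositories) is the core of a simple collaborative filtering recommender. The sketch below is plain Python over an in-memory dictionary, not the hackathon project's code and not Cypher; the user and repository names are invented, and scoring by the number of shared repositories is one assumed ranking among many.

```python
from collections import Counter


def recommend(contributions, user, limit=3):
    """Recommend repositories that similar contributors work on.

    contributions maps a user name to the set of repositories they contributed to.
    """
    mine = contributions.get(user, set())
    scores = Counter()
    for other, repos in contributions.items():
        if other == user:
            continue
        overlap = len(mine & repos)      # shared repositories = similarity
        if overlap == 0:
            continue                     # no path from user to this contributor
        for repo in repos - mine:        # only repositories the user has not touched
            scores[repo] += overlap
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [repo for repo, _score in ranked[:limit]]


# Hypothetical contribution data.
CONTRIBUTIONS = {
    "will": {"graph-utils", "csv-loader"},
    "ann": {"graph-utils", "csv-loader", "viz-kit"},
    "bob": {"graph-utils", "chart-lib"},
    "cat": {"web-app"},
}

print(recommend(CONTRIBUTIONS, "will"))  # ['viz-kit', 'chart-lib']
```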
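An aside that is not part of the conversation: to give a feel for what "converting CSV files into a graph data model" involves, here is a minimal sketch using only Python's standard `csv` module. It is not the web application Will mentions; the function name `csv_to_graph`, the column names and the `WORKS_AT` relationship type are assumptions made for the example. Each row becomes two nodes (one per chosen column) and one relationship between them.

```python
import csv
import io


def csv_to_graph(text, source_col, target_col, rel_type):
    """Turn CSV rows into a set of (label, value) nodes and a list of relationships."""
    nodes = set()
    rels = []
    for row in csv.DictReader(io.StringIO(text)):
        source = row[source_col].strip()
        target = row[target_col].strip()
        # The column name doubles as the node label; the set removes duplicates.
        nodes.add((source_col, source))
        nodes.add((target_col, target))
        rels.append((source, rel_type, target))
    return nodes, rels


# Hypothetical input file with one header line and two data rows.
SAMPLE = "person,company\nRik,Neo Technology\nWill,Neo Technology\n"

nodes, rels = csv_to_graph(SAMPLE, "person", "company", "WORKS_AT")
print(len(nodes), len(rels))  # 3 2
```

The company appears in two rows but ends up as a single node, which is the main difference between the tabular and the graph view of the same data.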
RVB: 05:29 You mean it's going to be even easier than with Load CSV, then?
WL: 05:33 That's right [laughter], exactly.
RVB: 05:34 That's super cool. I mean, I've been with Neo a couple of years now, and when I started it was a brutal experience [chuckles]. It's gotten so much easier, and it's going to get even easier. So, that's great to hear. Thanks for that.
WL: 05:51 Yeah. Absolutely.
RVB: 05:52 Very cool. So, Will, one of the topics that we always cover on this podcast is, where is it going? What are the big things that you see coming up and you would love to see happen in Graphistania, as we call it sometimes [chuckles]. Where do you see this going? What's your perspective on that?
WL: 06:13 Sure. I think we're at a really interesting time now where we're seeing lots of improvements in the technology - Neo4j, graph databases in general, around performance - but also around the APIs that we're using to interact with graph data. So things like Cypher, it's becoming much cleaner, much easier to work with. And I think this investment in the technology is really indicative of a larger trend in applications in general. Users are expecting more from our applications. So, let's take e-commerce as an example. Browsing and searching and filtering are great, but users are really expecting things like personalized recommendations in their e-commerce platform, and a great way to generate those is with a graph database. Same with things like content delivery, we expect personalized content recommendations. So, I really think we're seeing the case where going forward, we're going to see graph databases used in more and more applications, used alongside more and more technologies, and it will feel very natural and easy to use Neo4j in your modern application stack.
RVB: 07:31 Does that mean things like availability of Neo4j to other development platforms as well? Not just Java, .NET, and all those types of things as well. Is that part of that?
WL: 07:46 Sure, absolutely. I think with Cypher, that's becoming much easier now. It's very easy to shoot a Cypher script to Neo4j server from a .NET environment, from a Python application. We're really seeing a standardization around the API there.
RVB: 08:06 Well, I'm really looking forward to it, as you are, I imagine [chuckles]. What I'll do is when we write up the podcast and transcribe it, we'll put a bunch of links to some of your work and all the other developer evangelists' work in the article so that people can find the way around even more easily. So, thank you so much Will for coming on the podcast, really appreciate it. I'll wrap up here and I look forward to seeing you at an event very soon.
WL: 08:37 Great. Thanks a lot, Rik.
RVB: 08:38 Thank you. Bye bye.
WL: 08:40 Bye bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best


Thursday 26 November 2015

Podcast Interview with Karl Urich, DataFoxtrot

Been a hectic couple of weeks, which is why I am lagging behind a little bit in publishing lovely podcast episodes that I actually recorded over a month ago. Here's a wonderful and super-interesting chat with Karl Urich of DataFoxtrot, who wrote about graphs, spatial applications and visualisations recently on our blog and on LinkedIn Pulse. Lovely chat - hope you will enjoy as much as I did:

Here's the transcript of our conversation:
RVB: 00:01 Hello, everyone. My name is Rik. Rik Van Bruggen from Neo Technology, and here I am, again, recording another episode of the Graph Database podcast. Today, I've got a guest all the way from the US, Karl Urich. Hi, Karl.
KF: 00:15 Rik, very nice to speak with you. 
RVB: 00:17 Thank you for joining us. It's always great when people make the time to come on these podcasts and share their experience and their knowledge with the community, I really appreciate it. Karl, why don't you introduce yourself? Many people won't know you yet. You might want to change that. 
KF: 00:35 Yeah, absolutely. So, again, thanks for having me on this podcast. It's really great to be able to talk about the things I have experimented with and see if it resonates with people. I own a small consulting business called DataFoxtrot, started under a year ago. Primary focus of the business is on data monetisation. If a company has content or data, how can we help those companies make money or get new value from that content or data? They could be collecting data as a by-product of their business, or they could be using data internally in their business and then they realise that someone outside the company can use that as well. So, that's the primary focus of my business, but like any good consulting company, I have a few other explorations and really this intersection of the world of graph and spatial analytics or location intelligence is what interests me. So, talking a little bit about those explorations is what will hopefully interest your listeners.
RVB: 01:38 Yeah, absolutely. Well, so, that's interesting, right? I mean, what's the background to your relationship to the wonderful world of graphs then, you know? How did you get into it? 
KF: 01:45 Yeah, so going all the way back to college, I did take a good Introduction to Graph Theory as a mathematics elective, but then really got into the world of spatial and data analytics. For 20 years working with all things data: demographic data, spatial data, vertical industry data, along the way building some routing products, late 1990s or late 2000s products, that did point to point routing, drive time calculations, multi-point routing. Really kind of that original intersection of graph and spatial. But, data junkie, very interested in data: graph, spatial, data modelling et cetera.
RVB: 02:28 Yeah. Cool. I understand that this spatial component is like your unique focus area, or at least one of your focus areas these days, right? Tell us more about that.
KF: 02:39 Yeah, absolutely. And it's certainly what resonates when I think about the graph side, spatial data really should define-- spatial data could be any sort of business problems related to proximity location or driving things because you know where something is, your competitors, your customers, the people that you serve. And that's where it resonated to me when, as I start to look at graph and spatial, I was really excited back in April. I walked in, just very coincidentally, at a big data conference to a presentation being put on by Cambridge Intelligence--
RVB: 03:24 Oh, yeah. 
KF: 03:26 And so they were introducing spatial elements to their graph visualization. 
RVB: 03:31 That's really-- they just released a new product, I think. Right?
KF: 03:34 Just released the new product, at the time had gone beta. So, that really got me thinking about how could you combine graph and spatial together to solve a problem. Looking at Cambridge Intelligence's technology, looking at some spatial plugins for Neo, and again, my company is a consulting company and if there is a need for that expertise at the intersection of graph and spatial, we want to explore that.
RVB: 04:05 Very cool. Did you do some experiments around this as well, Karl? Did you, sort of, try to prove out the goals just a little bit? 
KF: 04:11 Yeah. Absolutely. Let me talk a little bit about that. I had this concept of a combined spatial and graph problem that looked at outliers, outliers just meaning things that are exceptional, extraordinary, and the thinking, in my mind, was that businesses and organisations can get value from identifying outliers and acting on those outliers. So, maybe an outlier can represent an opportunity for growth by capitalising on outliers, or bottom-line savings by eliminating outliers. Let me give an example of an outlier. If you look at a graph of all major North American airports, and their flight patterns, and put it on a map, you could visualise that Honolulu and Anchorage airports are outliers. There are just few other airports that "look the same", meaning same location, same incoming and outgoing flight patterns. And that's really relatively easy if you have a very small graph to visualise outliers, but if you want to look at a larger graph, hundreds of thousands, millions of nodes, what would you do? So, that really started the experiment. I was looking around for test data. Wikipedia is fantastic. You can download--
RVB: 05:28 [chuckles] It is. 
KF: 05:29 Wikipedia data-- I love Wikipedia. Anyway, it seemed very natural. And the great thing is that there are probably around a million or so records that have some sort of geographic tagging.
RVB: 05:42 Oh, do they? 
KF: 05:44 Yep, so a page-- London, England has a latitude longitude. Tower of London has a latitude and longitude. An airport has a latitude longitude. 
RVB: 05:54 Of course. 
KF: 05:54 So, you can tease out all of the records that have latitude longitude tagging, preserve the relationships and shove that all into a graph. So, you have a spatially enabled graph, every XY has a-- every page has a latitude longitude or XY. So, really the hard work started, which was taking a look at outliers. So, quick explanation of outliers, so, you think of a Wikipedia page for London, England, a Wikipedia page for Sydney, Australia, they cross reference each other. Pretty unusual, two locations on the other side of the world, but would you call those outliers? Not really, because there's also a relationship between the London page and the Melbourne, Australia Wikipedia page. So, you really wouldn't call those anything exceptional. And so, what I built was a system, or just a very brief explanation is that I looked at relationships in the graph, looked only at the bi-directional or bilateral relationships where pages cross-referenced each other. Then I really identified how close every relationship was to another relationship, or looked for the most spatially similar relationship. You can score them then, and you can kind of rank outliers. So, let me just give one quick example. It's actually my favorite outlier that I've found--
RVB: 07:30 Which category? 
KF: 07:31 Unusual thing to say. There's a small town in Australia called El Arish. I think I'm pronouncing that right, that has a relationship with the town in the Sinai Peninsula called Arish, and El Arish in Australia is named after Arish, Egypt because Australian soldiers were based there in World War One--
RVB: 07:51 No way! 
KF: 07:53 Yep! And most importantly, this relationship from a spatial perspective, looks like no other relationship. So, that's the kind of thing, when you are able to look at relationships, try to rate them in terms of spatial outliers-- 
RVB: 08:10 Yeah, sure. 
KF: 08:12 You can find things that lead to additional discovery as well. 
RVB: 08:18 Super cool. 
KF: 08:19 As a Wikipedia junkie, that's pretty fascinating. 
RVB: 08:21 [laughter] Very cool. Well, I read your blog post about-- outliers made me think of security aspects actually. I don't know if you know the book Liars and Outliers. It's a really great book by Bruce Schneier. I also have to think about-- we recently did a Wiki Wiki challenge, which is, you know, finding the connections between Wikipedia pages. You know, how are two randomly chosen Wikipedia pages linked together, which is always super fun to do. 
KF: 09:00 It was even in my original posting and I didn't want to say that, "Hey, this could be used for security type applications." So, I think I talked in code and said, "You could use this to identify red flag events," but I like to think of it as both the positive opportunity and the negative opportunity when you're able to identify outliers and-- 
RVB: 09:26 Yeah, identifying outliers has lots of business applications, right? I mean, those outliers are typically very interesting, whether it's in terms of unexpected knowledge, or fraudulent transactions, suspect transactions. Outliers tend to be really interesting, right? 
KF: 09:43 Absolutely, absolutely. 
RVB: 09:45 Super cool. So, where is this going, Karl? What do you think-- what's the next step for you and DataFoxtrot, but also graph knowledge in general? Any perspectives on that?
KF: 09:56 Yeah. So, there's more of a tactical thing, which is as we record a week from now we have GraphConnect probably-- 
RVB: 10:04 I am so looking forward to it. 
KF: 10:06 Which will be fantastic and being able to test this out with people. It's always great to bounce ideas off people. In terms of our next experiments, the one that interests me is almost the opposite of outliers and let me explain. So, I have some background in demographics, analytics, and segmentation, so, what interests me a lot is looking at clustering of relationships of the graph. Think of clustering as grouping things that are similar into bins or clusters, so that you can really make overarching statements or predictions about each cluster. You can use techniques like k-means to do the clustering. So, what interests me about graph and spatial for clustering is you can use both elements. The relationships of the graph, spatial location of the nodes, together to drive the clustering. I've started some of the work on this and, again, using Wikipedia data and maybe the outcome, using Wikipedia, if you did your clustering based on spatial location of the nodes, plus strength of the connection, plus the importance of the nodes, plus maybe some other qualifiers, like if a node is a Wikipedia page for a city or a man-made feature, a natural feature, you might end up with clusters that have labels to them. One cluster might be all relationships connecting cities in South America and Western Europe, or relationships between sports teams around the world. So, it's kind of the opposite, if outliers is finding the outliers, the exceptional things, clustering is finding the patterns.
RVB: 11:42 Commonalities. 
KF: 11:44 A real-world example might be an eCommerce company is looking at the distribution network, and they want to do clustering based on shipments, who shipped what to whom, where the shipper and recipient are, package type, value, other factors, and they could create a clustering system that categorises their distribution network and they can look at business performance by cluster, impact of marketing on clusters and sometimes just the basic visualisation of clustering just often yields those Eureka moments of insight. That's kind of the next interesting project that's out there. I'd say, ask me in six to eight weeks [laughter].
RVB: 12:29 We'll definitely do that. Cool. Karl, I think we're going to wrap up here. It's been a great pleasure talking to you. Thank you for taking the time, and I really look forward to seeing you at GraphConnect. I wish you lots of fun and success with your project.
KF: 12:49 Excellent. Thank you very much Rik, really appreciate it. 
RVB: 12:51 Thank you, bye bye.


Wednesday 18 November 2015

Podcast Interview with Felienne Hermans, TU Delft

Almost two years ago, my colleagues at Neo alerted me to the fact that there was this really cool lady in the Netherlands that had an even cooler story about Neo4j and Spreadsheets. Curious as I am, I invited Felienne to come and talk about this at our Amsterdam meetup, and we have been in touch with her and her department ever since. So naturally, we also invited her to come on our Podcast, and as usual :), we had a great conversation. Hope you enjoy:

Here's the transcript of our conversation:
RVB: 00:01 Hello everyone. This is Rik, Rik Van Bruggen from Neo Technology. Today, we're recording another podcast episode that I've been looking forward to for a long time, actually, because I'm a true fan, and not a groupie yet, but a fan of Felienne Hermans from Delft University. Hi, Felienne.
FH: 00:19 Hi, Rik. 
RVB: 00:20 Hey. Great to have you on the podcast. It's fantastic. We've seen you talk at a number of conferences and at the meet-up before. But most people won't have seen that yet, so maybe you can introduce yourself?
FH: 00:33 Sure. I'm Felienne Hermans. I'm assistant professor at Delft University of Technology where I run the Spreadsheet Lab. That's a research group of the university that researches spreadsheets, obviously. 
RVB: 00:45 Obviously, exactly. That also sort of hints at the relationship with graphs, I think, right? 
FH: 00:53 Yeah. People that have seen my talk or maybe googled me know that the title of my talk that I usually give is "Spreadsheets are graphs".

So, we use Neo4j to do graph analysis on spreadsheets. We do all sorts of analysis, actually. The whole idea of our research is that spreadsheets are actually "code". And then if it's code, you need an IDE. So you need to analyze all sorts of constructions within your code so you can see, maybe, do you have code smells in your spreadsheets? Maybe does your spreadsheet need refactoring? All the things you typically do on source code for analysis, you should also do on your spreadsheet, and this is where we use Neo4j. 
RVB: 01:34 If I recall, you actually did some work on proving that spreadsheets are Turing complete, right? 
FH: 01:41 Yeah, I did. To make my point that spreadsheets are code because then, people think it's funny. If I say, "Hey, I do research on spreadsheets and I wrote a PhD dissertation on spreadsheets," people laugh in my face often [chuckles]. "Really, can you do a dissertation on that?" It's software engineering. I'd say, "Yeah." But actually, people don't believe me. To prove my point, I indeed implemented a Turing machine in a spreadsheet using only formulas, to show that they are Turing complete and that makes them as powerful as any other programming language. That should stop people from laughing at me.
RVB: 02:17 [chuckles] Exactly. It's funny, but it's also really, really interesting. I think fundamentally, it's a very interesting approach and that's why I think people also love your talks. It's a very interesting topic. So, why did you get into graphs then, Felienne? Tell us more about that. How did you get into that relationship between spreadsheets and graphs? 
FH: 02:40 As I said, we do smell detection, as well. And for that, we initially stored information in a database and we stored information, for instance, what cell relates to what other cell? Because if you want to calculate a smell like "feature envy", in source code feature envy would be a method that uses a lot of fields from a different class. So you can see in a similar way that in a spreadsheet, a formula that uses a lot of cells from a different worksheet has the feature envy smell. It should actually go in the other worksheet. So in order to save that type of information, you need to store what cell relates to what other cell. And initially, I never thought about what database do I use.
FH: 03:26 In my mind, a few years ago, database was just a synonym for a SQL server. I was in Microsoft world where we make plugins for Excel, so database is just SQL. Same thing. I didn't think about it. I just dropped all my stuff in database, aka SQL. And initially, that worked fine. Some analyses you can really easily do but at one point, you want to really deeply understand how all the cells relate to each other because you want to measure the health of the spreadsheet. So we got horrible queries. SQL queries of like one A4 sheet of paper. Very, very complicated. But still I thought databases are just hard. I didn't really think about it until I saw a talk from one of your colleagues, Amanda, at Build Stuff in Vilnius. I saw her talk, and then it was really like a light bulb above my head. Bing. This is what I need. 
FH: 04:23 And a few weeks later, I was onsite at a customer for weeks so I wasn't bothered by students or colleagues so I could really program for a while. I thought, "Okay. This is my chance. Let's try and get my data into Neo4j and see how it will improve, what type of analysis that it would make easier." So that's what I did. When I tried, lots of the analyses - how many hops are there between these two cells or what is the biggest span within the graph - obviously are very easy to answer in Neo4j. So that's how we changed some of our analyses queries to run on the Neo4j database because it was so much easier to explore the data in a graph way because spreadsheets are graphs. There's a whole graph layer underneath the grid-like interface. That was really easy to analyze.
RVB: 05:16 That is such a great story and such a great summary of why it's such a great fit. I guess most people don't think of it that way. But effectively, what you're doing is like dependency analysis, right? 
FH: 05:28 Yeah. That's what we're doing. How do cells depend on each other?
RVB: 05:31 Super interesting. Is that something that you currently already use? I know you've been developing software on top of Neo4j, right? Is that already something that people can look at? 
FH: 05:45 No. Currently we only look at it within our own research group. The analyses we do are for us as researchers. It's not user-facing, so we have a smell detection tool that is somewhat user-facing where spreadsheet users can upload their spreadsheet and they get some analysis in the browser, but that is less advanced than the analysis we use. They're still using the SQL back-end because users typically don't really want to explore their spreadsheet in a way that we want to explore spreadsheets if we're researching. A spreadsheet user is not going to ask himself the question, "Hey, how widely are my cells connected?" That's really more a research tool.
RVB: 06:31 I understand. Totally. What are your plans around that, Felienne? Are you still expanding that work or is that something that is still under development then? Where is this going, you think? 
FH: 06:42 Obviously, if you say smells and you say refactoring. We've done lots of work on the smells. Even though we keep adding new smells, we feel that we have covered the smells area pretty nicely. Then, the next step of course would be refactoring the smells. If I know that this cell suffers from feature envy, it is jealous of that nice worksheet where all the cells are-- that he is using that formula. You want to move the formula to the other worksheet so that it's nicely close together to the cells that he's using. So these type of refactorings - moving cells in order to improve the graph structure - is something that we are looking at. 
FH: 07:21 One of my PhD students is currently looking at comparing the underlying graph, so where are the cells connected to each other? Compare that look on the spreadsheet to where are the cells in the worksheet? If you have a big cluster of cells, they're all referring to each other but they are physically located on two different worksheets, that's probably not ideal for the user because then you have to switch back and forth. And the other way around is true, as well. If you have a worksheet where there are two clusters of cells relating to each other, maybe it would be better to give each of these clusters their own worksheets. So these are the type of refactorings that we are looking into. If you have a big disparity between how your spreadsheet is laid out, and how your graph connections are, then this is very smelly and you should do something to improve the structure. There are still a lot of graphs also, in the refactoring future that we see.
RVB: 08:23 That, again, sounds so interesting. I think we could have a lot of joy - spreadsheet joy - because of that. I would love to see that. Very cool. Any other topics that you think would be relevant for our podcast listeners, Felienne?  Or anything else that comes to mind? 
FH: 08:44 Yes. One other thing, one final thing. I like to pitch my research a little bit if-- 
RVB: 08:49 Of course. Yeah. 
FH: 08:51 One of the things that we're also looking at is looking at what in a spreadsheet are labels and what in a spreadsheet is data? And especially how do they relate to each other? If you want to, for instance, generate documentation from a spreadsheet, or help a user understand a spreadsheet, it's very important to know if you have-- you take a random formula from a spreadsheet, what is it calculating? Is this the turnover of January, or is this the sales of blue shoes? And sometimes it's easy because, again, your layout matches the formulas. Sometimes you can just walk up the column or down the row to get the label. Sometimes the layout is a little bit more complicated. One of the things that we are working on is trying to make an algorithm - semiautomatic, and maybe with some user assistance, or entirely automated - where you can pick a random cell and then it will give you what it is, what is semantically happening in that cell. Can I add links to your story, as well? 
RVB: 09:57 Yes, you can. Yeah. 
FH: 09:59 Okay. So let's share a link of-- we did an online game where we gave people a random spreadsheet and a random cell, and they had to click the labels. We used that in an online course that I taught and we got 150,000 data points out of that game. We're currently analyzing that data to see what patterns are there in labeling. What usually is described by users as the labels of cells, and we hope that from that, we can generate or synthesize an algorithm that can do that for us. 
RVB: 10:28 Super cool. Let's put together a couple of links to your talks, but also to your research on the blog post that goes with this podcast. And then I'm sure people will love reading about it and will also love to hear about your future work. 
FH: 10:45 Thanks. 
RVB: 10:45 It will be great to keep in touch. Super. Thank you so much for coming on the Podcast, Felienne. I really appreciate it, and I look forward to seeing one of your talks, blog posts, whatever, in the near future. 
FH: 10:59 No problem. 
RVB: 11:00 All right. Have a nice day. 
FH: 11:01 Bye. 
RVB: 11:01 Bye.


Friday 13 November 2015

Podcast Interview with Chris Daly

The wonderful Neo4j Blog is a great source of conversations for this podcast - I really have met some wonderful people over the past couple of months, big kudos to Bryce for making the introductions! So today we will have another one of those sessions: an interview with Chris Daly, who has been doing some fascinating things with Neo4j in his home after hours. Let's listen to the conversation:

Of course, we also have a transcript of our conversation available for you guys:
RVB: 00:02 Hello everyone. My name is Rik, Rik van Bruggen from Neo Technology. Here we are again recording another Neo4j graph database podcast. Tonight I'm joined by Chris Daly, all the way from Oregon. Hi Chris.

CD: 00:13 Hello.

RVB: 00:15 Hey, good to have you on the podcast. Thanks for taking the time.

Thursday 12 November 2015

Querying GTFS data - using Neo4j 2.3 - part 2/2

So let's see what we can do with this data that we loaded into Neo4j in the previous blogpost. In these query examples, I will try to illustrate some of the key differences between older versions of Neo4j, and the newer, shinier Neo4j 2.3. There's a bunch of new features in that wonderful release, and some of these we will illustrate using the (Belgian) GTFS dataset. Basically there's two interesting ones that I really like:

  • using simple "full-text" indexing with the "starts with" where clause
  • using ranges in your where clauses
Both of these were formerly very cumbersome, and are now very easy and powerful in Neo4j 2.3. So let's explore.

Finding a stop using "STARTS WITH"

In this first set of queries we will be using some of the newer functionalities in Neo4j 2.3, which allow you to use the underlying full-text search capabilities of Lucene to quickly and efficiently find starting points for your traversals. The first examples start with the "STARTS WITH" string matching operator - let's consider this query:

 match (s:Stop)
 where s.name starts with "Turn"
 return s

In the new shiny 2.3 version of Neo4j, we generate the following query plan:
This, as you can see, is a super efficient query with only 5 "db hits" - so a wonderful example of using the Neo4j indexing system (see the NodeIndexSeekByRange step at the top). Great - this is a super useful new feature of Neo4j 2.3, which really helps to speed up (simple) fulltext queries. Now, let me tell you about a very easy and un-intuitive way to mess this up. Consider the following variation to the query:

 match (s:Stop)
 where upper(s.name) starts with "TURN"
 return s

All I am doing here is using the "UPPER" function to enable case-insensitive querying - but as you can probably predict, the query plan then all of a sudden looks like this:
and it generates 2785 db hits. So that is terribly inefficient: the first step (NodeByLabelScan) basically sucks in all of the nodes that have a particular Label ("Stop") and then does all of the filtering on that set. On a smallish dataset like this one it may not really matter, but on a larger one (or on a deeper traversal) this would absolutely matter. The only way to avoid this in the current product is to have a second property that would hold the lower() or upper() of the original property, and then index/query on that property. It's a reasonable workaround for most cases.
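To make the difference between the two plans a bit more tangible, here is a minimal Python sketch of the idea. It is not how Neo4j implements its indexes: it simply assumes that an index behaves like a sorted list of property values, and the stop names and the helper names (prefix_seek, prefix_scan) are made up for illustration. A prefix lookup on the sorted values can jump straight to the right place, while wrapping the value in a function forces a look at every value. The last call mimics the workaround of keeping a second, lower-cased property.

```python
from bisect import bisect_left

# Hypothetical stop names; a real feed has thousands of them.
stop_names = ["Antwerpen-Centraal", "Arlon", "Hasselt", "Tielen",
              "Turnhout", "turnhout markt"]

# Toy "index": the property values kept in sorted order.
name_index = sorted(stop_names)
# Workaround from the text: a second, lower-cased property with its own index.
lower_index = sorted(name.lower() for name in stop_names)


def prefix_seek(index, prefix):
    """Range seek: jump to the first candidate, stop at the first non-match."""
    hits = []
    pos = bisect_left(index, prefix)
    while pos < len(index) and index[pos].startswith(prefix):
        hits.append(index[pos])
        pos += 1
    return hits


def prefix_scan(values, prefix, transform):
    """Full scan: every value is transformed first, so sort order is no help."""
    return [value for value in values if transform(value).startswith(prefix)]


print(prefix_seek(name_index, "Turn"))             # only looks at nearby entries
print(prefix_scan(stop_names, "TURN", str.upper))  # looks at every value
print(prefix_seek(lower_index, "turn"))            # case-insensitive, still a seek
```

The trade-off of the workaround is the usual one: an extra property to store and keep in sync, in exchange for a lookup that stays cheap as the dataset grows.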

So cool - learned something.

Range queries in 2.3

I would like to get to know a little more about Neo4j 2.3's range query capabilities. I will do that by limiting the nodes we look at (i.e. Stoptimes) by their departure_time and/or arrival_time. Let's try that with the following simple query to start with:

 match (st:Stoptime)  
 where st.departure_time < "07:45:00"  
 return st.departure_time;  

If I run that query without an index on :Stoptime(departure_time) I get a query plan like this:
As you can see the plan starts with a "NodeByLabelScan". Very inefficient.

If however we put the index in place, and run the same query again, we get the following plan:
Which starts with a "NodeIndexSeekByRange". Very efficient. So that's good.
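One thing worth keeping in mind with these range queries is that the times are compared as strings. That works as long as every value is zero-padded to "HH:MM:SS", but the GTFS spec also accepts "H:MM:SS" and hours above 23 for trips that run past midnight. The small Python sketch below shows where plain string comparison holds and where it does not; the helper name gtfs_time_to_seconds is my own, and storing the result as an integer property is just one possible way to sidestep the issue.

```python
def gtfs_time_to_seconds(value):
    """Convert a GTFS "H:MM:SS" or "HH:MM:SS" string to seconds after midnight."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Zero-padded strings sort the same way as the times they represent.
print("07:30:00" < "07:45:00")  # True

# An unpadded hour breaks the string comparison: "7:30:00" is earlier than
# "07:45:00" as a time, but later as a string because "7" sorts after "0".
print("7:30:00" < "07:45:00")                                              # False
print(gtfs_time_to_seconds("7:30:00") < gtfs_time_to_seconds("07:45:00"))  # True

# Trips that run past midnight may use hours above 23; these convert cleanly.
print(gtfs_time_to_seconds("25:10:00"))  # 90600
```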

Now let's see how we can apply that in a realistic route finding query.

Route finding on the GTFS dataset

The obvious application for a GTFS dataset is to use it for some real-world route planning. Let's start with the following simple query, which looks up two "Stops", Antwerp (where I live) and Turnhout (where I am from):

 match (ant:Stop), (tu:Stop)
 where ant.name starts with "Antw"
 AND tu.name starts with "Turn"
 return distinct tu,ant;

This gives me all the stops for "Antwerpen" and my hometown "Turnhout". Now I can narrow this down a bit and only look at the "top-level" stops (remember that stops can have parent stops), and calculate some shortestpaths between them. Let's use this query:

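 match (t:Stop)<-[:PART_OF]-(:Stop),
 (a:Stop)<-[:PART_OF]-(:Stop)  // second pattern reconstructed; it was missing after the comma
 where t.name starts with "Turn"
 AND a.name starts with "Antw"  // assumed, mirroring the previous query
 with t,a
 match p = allshortestpaths((t)-[*]-(a))
 return p
 limit 10;

This gives me the following result (note that I have "limited" the number of paths, as there are quite a number of trains running between the two cities):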

The important thing to note here is that there is a DIRECT ROUTE between Antwerp and Turnhout and that this really makes the route-finding a lot easier.
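If you are wondering what an "all shortest paths" search does conceptually, here is a small Python sketch using a breadth-first search on a toy graph. This is only an illustration of the idea, not Neo4j's implementation, and the graph (two stops connected through a fast and a slow trip) is invented for the example.

```python
from collections import deque


def all_shortest_paths(graph, start, goal):
    """Return every minimum-length path from start to goal in an undirected graph."""
    distance = {start: 0}
    parents = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                # First visit: breadth-first order guarantees this is the shortest distance.
                distance[neighbour] = distance[node] + 1
                parents[neighbour] = [node]
                queue.append(neighbour)
            elif distance[neighbour] == distance[node] + 1:
                # Another equally short way to reach the same node.
                parents[neighbour].append(node)

    if goal not in distance:
        return []

    def unwind(node):
        # Rebuild the paths by walking the recorded parents back to the start.
        if node == start:
            return [[start]]
        return [path + [node] for parent in parents[node] for path in unwind(parent)]

    return unwind(goal)


# Hypothetical mini network: adjacency lists must list each link in both directions.
toy_graph = {
    "Turnhout": ["trip_fast", "trip_slow"],
    "trip_fast": ["Turnhout", "Antwerpen"],
    "trip_slow": ["Turnhout", "Antwerpen", "Tielen"],
    "Tielen": ["trip_slow"],
    "Antwerpen": ["trip_fast", "trip_slow"],
}

paths = all_shortest_paths(toy_graph, "Turnhout", "Antwerpen")
print(paths)
```

Both trips show up because they are equally short in number of hops, which is also why the query above returns several paths for the same pair of stops.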

Querying for direct routes

A real-world route planning query would look something like this:

 match (tu:Stop {name: "Turnhout"})--(tu_st:Stoptime)  
 where tu_st.departure_time > "07:00:00"  
 AND tu_st.departure_time < "09:00:00"  
 with tu, tu_st  
 match (ant:Stop {name:"Antwerpen-Centraal"})--(ant_st:Stoptime)  
 where ant_st.arrival_time < "09:00:00"  
 AND ant_st.arrival_time > "07:00:00"  
 and ant_st.arrival_time > tu_st.departure_time  
 with ant,ant_st,tu, tu_st  
 match p = allshortestpaths((tu_st)-[*]->(ant_st))  
 with nodes(p) as n  
 unwind n as nodes  
 match (nodes)-[r]-()  
 return nodes,r  

which would give me a result like this:

The interesting thing here is that you can immediately see from this graph visualization that there is a "fast train" (the pink "Trip" at the bottom) and a "slow train" (the pink "Trip" at the top) between origin and destination. The slow train actually makes three additional stops.
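The logic of that direct-route query can also be sketched in plain Python, which may help when reading the Cypher. Everything here is an assumption for illustration: the timetable is made up, and the function name direct_trips is mine. The idea is the same as in the query: pick departures from the origin inside a time window, then look only at later stop times of the same trip for an arrival at the destination.

```python
# Hypothetical timetable: each trip is an ordered list of (stop, arrival, departure).
trips = {
    "fast": [("Turnhout", "07:10:00", "07:12:00"),
             ("Antwerpen-Centraal", "07:55:00", "07:58:00")],
    "slow": [("Turnhout", "07:30:00", "07:32:00"),
             ("Tielen", "07:40:00", "07:41:00"),
             ("Herentals", "07:52:00", "07:54:00"),
             ("Lier", "08:10:00", "08:12:00"),
             ("Antwerpen-Centraal", "08:35:00", "08:38:00")],
    "late": [("Turnhout", "09:30:00", "09:32:00"),
             ("Antwerpen-Centraal", "10:15:00", "10:18:00")],
}


def direct_trips(trips, origin, destination, earliest, latest):
    """Trips that leave origin and reach destination inside the time window."""
    matches = []
    for trip_id, stoptimes in trips.items():
        for i, (stop, _, departure) in enumerate(stoptimes):
            if stop != origin or not (earliest < departure < latest):
                continue
            # Only later stops of the same trip count, mirroring the directed path.
            for later_stop, arrival, _ in stoptimes[i + 1:]:
                if later_stop == destination and departure < arrival < latest:
                    matches.append((trip_id, departure, arrival))
    return matches


print(direct_trips(trips, "Turnhout", "Antwerpen-Centraal", "07:00:00", "09:00:00"))
```

Note that the comparison relies on zero-padded time strings, just like the Cypher version does.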

Querying for indirect routes

Now let's look at a route-planning query for an indirect route between Turnhout and Arlon (the southernmost city in Belgium, close to the border with Luxembourg). Running this query will show me that I can only get from origin to destination by transferring from one train to another midway:

 match (t:Stop),(a:Stop)
 where t.name = "Turnhout"
 AND a.name = "Arlon"  // second condition reconstructed from the surrounding text
 with t,a
 match p = allshortestpaths((t)-[*]-(a))
 where NONE (x in relationships(p) where type(x)="OPERATES")
 return p
 limit 10

This is what I get back then:

You can clearly see that I can get from Turnhout to Brussels, but then need to transfer to one of the Brussels-to-Arlon trains on the right. So... which one would that be? Let's run the following query:

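 MATCH (tu:Stop {name:"Turnhout"})--(st_tu:Stoptime),
 (ar:Stop {name:"Arlon"})--(st_ar:Stoptime),
 // The pattern lines binding the midway stop are missing from this copy of the query.
 // The next two lines are a reconstruction and an assumption about the model:
 (st_tu)--(trip1:Trip)--(st_midway_arr:Stoptime)--(midway:Stop),
 (midway)--(st_midway_dep:Stoptime)--(trip2:Trip)--(st_ar)
 WHERE st_tu.departure_time > "10:00:00"
 AND st_tu.departure_time < "11:00:00"
 AND st_midway_arr.arrival_time > st_tu.departure_time
 AND st_midway_dep.departure_time > st_midway_arr.arrival_time
 AND st_ar.arrival_time > st_midway_dep.departure_time
 // RETURN clause reconstructed as well
 RETURN tu, st_tu, trip1, st_midway_arr, midway, st_midway_dep, trip2, st_ar
 order by (st_ar.arrival_time_int-st_tu.departure_time_int) ASC
 limit 1

You can tell that this is a bit more complicated. It definitely comes back with a correct result: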

At the top is the Trip from Turnhout to Brussels, and at the bottom is the Trip from Brussels to Arlon. You can also see that there's a bit of a wait there, so it may actually make more sense to take a later train from Turnhout to Brussels.

The problem with this approach is of course that it would not work for a journey that involved more than one stopover. If I would, for example, want to travel from "Leopoldsburg" to "Arlon", I would need two stopovers (in Hasselt, and then in Brussels):
and therefore the query above would become even more complicated.
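To give an idea of what a more general approach could look like, here is a Python sketch that handles any number of stopovers. Note that this is a different technique from the Cypher queries in this post: a single time-ordered scan over connections, in the style of the Connection Scan Algorithm. The function name earliest_arrival and the timetable are invented for illustration, and it assumes zero-padded "HH:MM:SS" times and zero transfer time.

```python
def earliest_arrival(connections, origin, destination, depart_after):
    """Earliest arrival time at destination, allowing any number of transfers.

    connections: (from_stop, departure, to_stop, arrival) tuples with
    zero-padded "HH:MM:SS" times, so string comparison matches time order.
    """
    best = {origin: depart_after}   # earliest known arrival per stop
    previous = {}                   # stop -> connection used to reach it
    for conn in sorted(connections, key=lambda c: c[1]):  # scan by departure time
        from_stop, departure, to_stop, arrival = conn
        reachable = from_stop in best and best[from_stop] <= departure
        if reachable and arrival < best.get(to_stop, "99:99:99"):
            best[to_stop] = arrival
            previous[to_stop] = conn
    if destination not in previous:
        return None, []
    # Walk back from the destination to list the connections taken.
    journey, stop = [], destination
    while stop != origin:
        journey.append(previous[stop])
        stop = previous[stop][0]
    return best[destination], journey[::-1]


# Hypothetical connections for the Leopoldsburg to Arlon example.
connections = [
    ("Leopoldsburg", "08:05:00", "Hasselt", "08:40:00"),
    ("Hasselt", "08:35:00", "Brussels", "09:45:00"),  # leaves before we get there
    ("Hasselt", "08:50:00", "Brussels", "10:00:00"),
    ("Brussels", "09:50:00", "Arlon", "12:35:00"),    # leaves before we get there
    ("Brussels", "10:20:00", "Arlon", "13:05:00"),
]

arrival, journey = earliest_arrival(connections, "Leopoldsburg", "Arlon", "08:00:00")
print(arrival)
for leg in journey:
    print(leg)
```

The nice property of this kind of scan is that the number of stopovers does not change the code at all, whereas the Cypher pattern above needs an extra set of midway nodes for every additional transfer.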

My conclusion here is that

  1. it's actually pretty simple to represent GTFS data in Neo4j - and very nice to navigate through the data this way. Of course.
  2. direct routes are very easily queried with Cypher.
  3. indirect routes would require a bit more tweaking to the model and/or the use of a different API in Neo4j. That's currently beyond my scope of these blogposts, but I am very confident that it could be done.
I really hope you enjoyed these two blogposts, and that you will also apply it to your own local GTFS dataset - there's so many of them available. All of the queries above are on github as well of course - I hope you can use them as a baseline.



Monday 9 November 2015

Loading General Transport Feed Spec (GTFS) files into Neo4j - part 1/2

Lately I have been having a lot of fun with a pretty simple but interesting type of data: transport system data. That is: any kind of schedule data that a transportation network (bus, rail, tram, tube, metro, ...) would publish to its users. This is super interesting data for a graph, as you can easily see that "shortestpath" operations over a larger transportation network would be super useful and quick.

The General Transit Feed Specification

Turns out that there is a very, very nice and easy spec for that kind of data. It was originally developed by Google as the "Google Transit Feed Specification" in cooperation with Portland's TriMet, and is now known as the "General Transit Feed Specification". Here's a bit more detail from Wikipedia:
A GTFS feed is a collection of CSV files (with extension .txt) contained within a .zip file. Together, the related CSV tables describe a transit system's scheduled operations. The specification is designed to be sufficient to provide trip planning functionality, but is also useful for other applications such as analysis of service levels and some general performance measures. GTFS only includes scheduled operations, and does not include real-time information. However real-time information can be related to GTFS schedules according to the related GTFS-realtime specification.
More info on the Google Developer site. I believe that Google originally developed this to integrate transport information into Maps - which really worked very well I think. But since that time, the spec has been standardized - and now it turns out there are LOTS and lots of datasets like that.  Most of them are on the GTFS Exchange, it seems - and I have downloaded a few of them:
and there are many, many more.
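Since a feed is just a .zip of CSV tables, it is easy to peek inside one before importing anything. Here is a minimal Python sketch that lists every .txt table in a feed together with its column names, which is handy for spotting which optional columns a given feed actually provides. The function name is my own, and it assumes the files are UTF-8 encoded (possibly with a byte-order mark).

```python
import csv
import io
import zipfile


def feed_columns(feed):
    """Map each .txt table in a GTFS zip (path or file object) to its column names."""
    columns = {}
    with zipfile.ZipFile(feed) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(".txt"):
                continue
            with archive.open(name) as raw:
                # utf-8-sig drops the byte-order mark some feeds put before the header
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                columns[name] = next(csv.reader(text), [])
    return columns


# Build a tiny in-memory feed with made-up content to try it out
demo_feed = io.BytesIO()
with zipfile.ZipFile(demo_feed, "w") as archive:
    archive.writestr("agency.txt", "\ufeffagency_id,agency_name\n1,Demo\n")
    archive.writestr("stops.txt", "stop_id,stop_name\nS1,Somewhere\n")
demo_feed.seek(0)
demo_columns = feed_columns(demo_feed)
```

For a real download you would pass the path of the .zip file instead of the in-memory demo feed.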

Converting the files to a graph

The nice thing about these .zip files is that - once unzipped - they contain a bunch of comma-separated value files (.txt extension though), and that these files all have a similar structure:

So I took a look at some of these files, and while I found a few differences between the structures here and there (some of the GTFS data elements appear to be optional), I generally had a structure that looked like this:

You can see that there are a few "keys" in there (color coded) that link one file to the next. So then I could quite easily translate this to a graph model:

So now that we have that model, we should be able to import our data into Neo4j quite easily. Let's give that a go.

Loading GTFS data

Here's a couple of Cypher statements that I have used to load the data into the model. First we create some indexes and schema constraints (for uniqueness):

 create constraint on (a:Agency) assert a.id is unique;
 create constraint on (r:Route) assert r.id is unique;
 create constraint on (t:Trip) assert t.id is unique;
 create index on :Trip(service_id);
 create constraint on (s:Stop) assert s.id is unique;
 create index on :Stoptime(stop_sequence);
 create index on :Stop(name);

Then we add the Agency, Routes and Trips:
 //add the agency  
 load csv with headers from  
 'file:///ns/agency.txt' as csv  
 create (a:Agency {id: toInt(csv.agency_id), name: csv.agency_name, url: csv.agency_url, timezone: csv.agency_timezone});  
// add the routes  
 load csv with headers from  
 'file:///ns/routes.txt' as csv  
 match (a:Agency {id: toInt(csv.agency_id)})  
 create (a)-[:OPERATES]->(r:Route {id: csv.route_id, short_name: csv.route_short_name, long_name: csv.route_long_name, type: toInt(csv.route_type)});  
 // add the trips  
 load csv with headers from  
 'file:///ns/trips.txt' as csv  
 match (r:Route {id: csv.route_id})  
 create (r)<-[:USES]-(t:Trip {id: csv.trip_id, service_id: csv.service_id, headsign: csv.trip_headsign, direction_id: csv.direction_id, short_name: csv.trip_short_name, block_id: csv.block_id, bikes_allowed: csv.bikes_allowed, shape_id: csv.shape_id});  

Next we load the "stops", at first without connecting them to the rest of the graph, and then add the parent/child relationships that can exist between specific stops:
 //add the stops  
 load csv with headers from  
 'file:///ns/stops.txt' as csv  
 create (s:Stop {id: csv.stop_id, name: csv.stop_name, lat: toFloat(csv.stop_lat), lon: toFloat(csv.stop_lon), platform_code: csv.platform_code, parent_station: csv.parent_station, location_type: csv.location_type, timezone: csv.stop_timezone, code: csv.stop_code});  
//connect parent/child relationships to stops  
 load csv with headers from  
 'file:///ns/stops.txt' as csv  
 with csv  
 where not (csv.parent_station is null)  
 match (ps:Stop {id: csv.parent_station}), (s:Stop {id: csv.stop_id})  
 create (ps)<-[:PART_OF]-(s);  

Then, finally, we add the Stoptimes which connect the Trips to the Stops:
 //add the stoptimes  
 using periodic commit  
 load csv with headers from  
 'file:///ns/stop_times.txt' as csv  
 match (t:Trip {id: csv.trip_id}), (s:Stop {id: csv.stop_id})  
 create (t)<-[:PART_OF_TRIP]-(st:Stoptime {arrival_time: csv.arrival_time, departure_time: csv.departure_time, stop_sequence: toInt(csv.stop_sequence)})-[:LOCATED_AT]->(s);  
This load operation has been a bit trickier for me when experimenting with various example GTFS files: large transportation networks like bus networks can have a LOT of stoptimes, so the load can take a long time to complete and should be treated with care. On some occasions, I have had to split the stop_times.txt file into multiple parts to make it work.
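Splitting the file can be done with any tool you like. As one possible way, here is a small Python sketch that cuts a CSV file into numbered parts and repeats the header row in each part, so that every part can be loaded with the same "load csv with headers" statement. The function name, the part naming scheme and the demo rows are all made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def split_csv(source, rows_per_part, out_dir):
    """Split a CSV file into numbered parts that each repeat the header row."""
    source, out_dir = Path(source), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = writer = None
    with source.open(newline="", encoding="utf-8-sig") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:  # empty file: nothing to split
            return parts
        try:
            for index, row in enumerate(reader):
                if index % rows_per_part == 0:  # time to start a new part
                    if out is not None:
                        out.close()
                    path = out_dir / f"{source.stem}_{len(parts) + 1:03d}.txt"
                    out = path.open("w", newline="", encoding="utf-8")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    parts.append(path)
                writer.writerow(row)
        finally:
            if out is not None:
                out.close()
    return parts


# Tiny demo on a throwaway file with made-up rows
with tempfile.TemporaryDirectory() as tmp:
    feed_file = Path(tmp) / "stop_times.txt"
    feed_file.write_text(
        "trip_id,stop_id,stop_sequence\n"
        + "".join(f"T1,S{i},{i}\n" for i in range(1, 6)),
        encoding="utf-8",
    )
    parts = split_csv(feed_file, rows_per_part=2, out_dir=Path(tmp) / "parts")
    part_names = [p.name for p in parts]
    rows_per_file = [
        len(p.read_text(encoding="utf-8").splitlines()) - 1 for p in parts
    ]
```

Each resulting part can then be loaded by pointing the stoptimes statement above at one part file after the other.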

Finally, we will connect the stoptimes to one another, forming a sequence of stops that constitute a trip:
 //connect the stoptime sequences  
 match (s1:Stoptime)-[:PART_OF_TRIP]->(t:Trip),  
 (s2:Stoptime)-[:PART_OF_TRIP]->(t)  
 where s2.stop_sequence=s1.stop_sequence+1  
 create (s1)-[:PRECEDES]->(s2);  
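One caveat with the "+1" join: the GTFS reference only requires stop_sequence values to increase along a trip, not to be consecutive, so a feed that numbers its stops 1, 5, 10 would end up with no PRECEDES relationships at all. Here is a small Python sketch of the alternative, pairing each stoptime with the next one in sorted order per trip. The function name and the sample rows are made up for illustration; this is a way to check or pre-process a feed, not part of the load script.

```python
from collections import defaultdict


def consecutive_stop_pairs(stop_times):
    """Return (trip_id, from_stop, to_stop) for neighbouring stops of each trip.

    stop_times is an iterable of dicts with the GTFS columns trip_id, stop_id
    and stop_sequence, for example rows read with csv.DictReader.
    """
    by_trip = defaultdict(list)
    for row in stop_times:
        by_trip[row["trip_id"]].append((int(row["stop_sequence"]), row["stop_id"]))
    pairs = []
    for trip_id, stops in by_trip.items():
        stops.sort()  # order by stop_sequence, whatever the gaps are
        for (_, from_stop), (_, to_stop) in zip(stops, stops[1:]):
            pairs.append((trip_id, from_stop, to_stop))
    return pairs


# Made-up rows with gaps in stop_sequence, given out of order on purpose
rows = [
    {"trip_id": "T1", "stop_id": "B", "stop_sequence": "5"},
    {"trip_id": "T1", "stop_id": "A", "stop_sequence": "1"},
    {"trip_id": "T1", "stop_id": "C", "stop_sequence": "10"},
    {"trip_id": "T2", "stop_id": "C", "stop_sequence": "1"},
    {"trip_id": "T2", "stop_id": "A", "stop_sequence": "2"},
]
pairs = consecutive_stop_pairs(rows)
```

If a feed does use gapless numbering, the Cypher statement above gives the same pairs.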

That's it, really. When I generate the meta-graph for this data, I get something like this:

Which is exactly the Model that we outlined above :) ... Good!

The entire load script can be found on github, so you can try it yourself. All you need to do is change the load csv file/directory. Also, don't forget that load csv now takes its import files from the local directory that you configure in
That's about it for now. In the next blogpost, I will take Neo4j 2.3 for a spin on a GTFS dataset, and see what we can find out. Check back soon to read up on that.

Hope this was interesting for you.