Thursday 26 November 2015

Podcast Interview with Karl Urich, DataFoxtrot

It's been a hectic couple of weeks, which is why I am lagging behind a little bit in publishing lovely podcast episodes that I actually recorded over a month ago. Here's a wonderful and super-interesting chat with Karl Urich of DataFoxtrot, who wrote about graphs, spatial applications and visualisations recently on our blog and on LinkedIn Pulse. Lovely chat - hope you will enjoy it as much as I did:

Here's the transcript of our conversation:
RVB: 00:01 Hello, everyone. My name is Rik. Rik Van Bruggen from Neo Technology, and here I am, again, recording another episode of the Graph Database podcast. Today, I've got a guest all the way from the US, Karl Urich. Hi, Karl. 
KF: 00:15 Rik, very nice to speak with you. 
RVB: 00:17 Thank you for joining us. It's always great when people make the time to come on these podcasts and share their experience and their knowledge with the community, I really appreciate it. Karl, why don't you introduce yourself? Many people won't know you yet. You might want to change that. 
KF: 00:35 Yeah, absolutely. So, again, thanks for having me on this podcast. It's really great to be able to talk about the things I have experimented with and see if it resonates with people. I own a small consulting business called DataFoxtrot, started under a year ago. Primary focus of the business is on data monetisation. If a company has content or data, how can we help those companies make money or get new value from that content or data if they could be collecting data as a by-product of their business or they could be using data internally in their business and then they realise that someone outside the company can use that as well? So, that's the primary focus of my business, but like any good consulting company, I have a few other explorations and really this intersection of the world of graph and spatial analytics or location intelligence is what interests me. So, talking a little bit about those explorations is what will hopefully interest your listeners. 
RVB: 01:38 Yeah, absolutely. Well, so, that's interesting, right? I mean, what's the background to your relationship to the wonderful world of graphs then, you know? How did you get into it? 
KF: 01:45 Yeah, so going all the way back to college, I did take a good Introduction to Graph Theory as a mathematics elective, but then really got into the world of spatial and data analytics.  For 20 years working with all things data: demographic data, spatial data, vertical industry data, along the way building some routing products, late 1990's or late 2000's products, that did point to point routing, drive time calculations, multi-point routing. Really kind of that original intersection of graph and spatial. But, data junky, very interested in data: graph, spatial, data modelling et cetera. 
RVB: 02:28 Yeah. Cool. I understand that this spatial component is like your unique focus area, or at least one of your focus areas these days, right? Tell us more about that. 
KF: 02:39 Yeah, absolutely. And it's certainly what resonates when I think about the graph side, spatial data really should define-- spatial data could be any sort of business problems related to proximity location or driving things because you know where something is, your competitors, your customers, the people that you serve. And that's where it resonated to me when, as I start to look at graph and spatial, I was really excited back in April. I walked in, just very coincidentally, in a big data conference to a presentation being put on by Cambridge Intelligence-- 
RVB: 03:24 Oh, yeah. 
KF: 03:26 And so they were introducing spatial elements to their graph visualization. 
RVB: 03:31 That's really-- they just released a new product, I think. Right?
KF: 03:34 Just released the new product, at the time had gone beta. So, that really got me thinking about how could you combine graph and spatial together to solve a problem. Looking at Cambridge Intelligence's technology, or looking at some spatial plugins for Neo, and again, my company is a consulting company and if there is a need for that expertise at the intersection of graph and spatial, we want to explore that. 
RVB: 04:05 Very cool. Did you do some experiments around this as well, Karl? Did you, sort of, try to prove out the goals just a little bit? 
KF: 04:11 Yeah. Absolutely. Let me talk a little bit about that. I had this concept of a combined spatial and graph problem that looked at outliers, outliers just meaning things that are exceptional, extraordinary, and the thinking, in my mind, was that businesses and organisations can get value from identifying outliers and acting on those outliers. So, maybe an outlier can represent an opportunity for growth by capitalising on outliers, or bottom-line savings by eliminating outliers. Let me give an example of an outlier. If you look at a graph of all major North American airports, and their flight patterns, and put it on a map, you could visualise that Honolulu and Anchorage airports are outliers. There are just few other airports that "look the same", meaning same location, same incoming and outgoing flight patterns. And that's really relatively easy if you have a very small graph to visualise outliers, but if you want to look at a larger graph, hundreds of thousands, millions of nodes, what would you do? So, that really started the experiment. I was looking around for test data. Wikipedia is fantastic. You can download-- 
RVB: 05:28 [chuckles] It is. 
KF: 05:29 Wikipedia data-- I love Wikipedia. Anyway, it seemed very natural. And the great thing is that there are probably around a million or so records that have some sort of geographic tagging.
RVB: 05:42 Oh, do they? 
KF: 05:44 Yep, so a page-- London, England has a latitude longitude. Tower of London has a latitude and longitude. An airport has a latitude longitude. 
RVB: 05:54 Of course. 
KF: 05:54 So, you can tease out all of the records that have latitude longitude tagging, preserve the relationships and shove that all into a graph. So, you have a spatially enabled graph, every XY has a-- every page has a latitude longitude or XY. So, really the hard work started, which was taking a look at outliers. So, quick explanation of outliers, so, you think of a Wikipedia page for London, England, a Wikipedia page for Sydney, Australia, they cross reference each other. Pretty unusual, two locations on the other side of the world, but would you call those outliers? Not really, because there's also a relationship between the London page and the Melbourne, Australia Wikipedia page. So, you really wouldn't call those anything exceptional. And so, what I built was a system, or just a very brief explanation is that I looked at relationships in the graph, looked only at the bi-directional or bilateral relationships where pages cross-referenced each other. Then I identified how close every relationship was to another relationship, or looked for the most spatially similar relationship. You can score them then, and you can kind of rank outliers. So, let me just give one quick example. It's actually my favorite outlier that I've found-- 
RVB: 07:30 Which category? 
KF: 07:31 Unusual thing to say. There's a small town in Australia called El Arish. I think I'm pronouncing that right, that has a relationship with the town in the Sinai Peninsula called Arish, and El Arish in Australia is named after Arish, Egypt because Australian soldiers were based there in World War One-- 
RVB: 07:51 No way! 
KF: 07:53 Yep! And most importantly, this relationship from a spatial perspective, looks like no other relationship. So, that's the kind of thing, when you are able to look at relationships, try to rate them in terms of spatial outliers-- 
RVB: 08:10 Yeah, sure. 
KF: 08:12 You can find things that lead to additional discovery as well. 
RVB: 08:18 Super cool. 
KF: 08:19 As a Wikipedia junkie, that's pretty fascinating. 
RVB: 08:21 [laughter] Very cool. Well, I read your blog post about-- outliers made me think of security aspects actually. I don't know if you know the book Liars and Outliers. It's a really great book by Bruce Schneier. I also have to think about-- we recently did a Wiki Wiki challenge, which is, you know, finding the connections between Wikipedia pages. You know, how are two randomly chosen Wikipedia pages linked together, which is always super fun to do. 
KF: 09:00 It was even in my original posting and I didn't want to say that, "Hey, this could be used for security type applications." So, I think I talked in code and said, "You could use this to identify red flag events," but I like to think of it as both the positive opportunity and the negative opportunity when you're able to identify outliers and-- 
RVB: 09:26 Yeah, identifying outliers has lots of business applications, right? I mean, those outliers are typically very interesting, whether it's in terms of unexpected knowledge, or fraudulent transactions, suspect transactions. Outliers tend to be really interesting, right? 
KF: 09:43 Absolutely, absolutely. 
RVB: 09:45 Super cool. So, where is this going, Karl? What do you think-- what's the next step for you and DataFoxtrot, but also graph knowledge in general? Any perspectives on that? 
KF: 09:56 Yeah. So, there's more of a tactical thing, which is as we record a week from now we have GraphConnect probably-- 
RVB: 10:04 I am so looking forward to it. 
KF: 10:06 Which will be fantastic and being able to test this out with people. It's always great to bounce ideas off people. In terms of our next experiments, the one that interests me is almost the opposite of outliers and let me explain. So, I have some background in demographics, analytics, and segmentation, so, what interests me a lot is looking at clustering of relationships of the graph. Think of clustering as grouping things that are similar into bins or clusters, so that you can really make overarching statements or predictions about each cluster. You can use techniques like k-means to do the clustering. So, what interests me about graph and spatial for clustering is you can use both elements. The relationships of the graph, spatial location of the nodes, together to drive the clustering. I've started some of the work on this and, again, using Wikipedia data and maybe the outcome, using Wikipedia, if you did your clustering based on spatial location of the nodes, plus strength of the connection, plus the importance of the nodes, plus maybe some other qualifiers, like if a node is a Wikipedia page for a city or a man-made feature, a natural feature, you might end up with clusters that have labels to them. One cluster might be all relationships connecting cities in South America and Western Europe, or relationships between sports teams around the world. So, it's kind of the opposite, if outliers is finding the outliers, the exceptional things, clustering is finding the patterns. 
RVB: 11:42 Commonalities. 
KF: 11:44 A real-world example might be an eCommerce company is looking at the distribution network, and they want to do clustering based on shipments, who shipped what to whom, where the shipper and recipient are, package type, value, other factors, and they could create a clustering system that categorises their distribution network and they can look at business performance by cluster, impact of marketing on clusters and sometimes just the basic visualisation of clustering just often yields those Eureka moments of insight. That's kind of the next interesting project that's out there. I'd say, ask me in six to eight weeks [laughter]. 
RVB: 12:29 We'll definitely do that. Cool. Karl, I think we're going to wrap up here. It's been a great pleasure talking to you. Thank you for taking the time, and I really look forward to seeing you at GraphConnect. I wish you lots of fun and success with your project. 
KF: 12:49 Excellent. Thank you very much Rik, really appreciate it. 
RVB: 12:51 Thank you, bye bye.
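
For readers who want to play with the outlier idea from the conversation, here is a minimal sketch. It is not Karl's actual system: the scoring rule (distance to the most spatially similar other relationship, using summed great-circle distances between endpoints), the function names and the rounded coordinates are all assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def edge_distance(e, f):
    """Spatial dissimilarity of two bilateral relationships: match the
    endpoints in both orientations and keep the cheaper pairing."""
    same = haversine_km(e[0], f[0]) + haversine_km(e[1], f[1])
    flipped = haversine_km(e[0], f[1]) + haversine_km(e[1], f[0])
    return min(same, flipped)

def outlier_scores(edges):
    """Score every relationship by its distance to the most similar other one."""
    return {
        name: min(edge_distance(e, f) for other, f in edges.items() if other != name)
        for name, e in edges.items()
    }

# Approximate coordinates, rounded by hand for this example only.
LONDON, SYDNEY, MELBOURNE = (51.5, -0.1), (-33.9, 151.2), (-37.8, 145.0)
EL_ARISH_AU, ARISH_EG = (-17.8, 146.0), (31.1, 33.8)

edges = {
    "London-Sydney": (LONDON, SYDNEY),
    "London-Melbourne": (LONDON, MELBOURNE),
    "El Arish-Arish": (EL_ARISH_AU, ARISH_EG),
}
scores = outlier_scores(edges)
ranked = sorted(scores, key=scores.get, reverse=True)  # most unusual first
```

With these three relationships, London-Sydney and London-Melbourne sit close to each other and score low, while El Arish-Arish has no spatially similar neighbour and ranks first. Comparing every pair is quadratic, so a graph with millions of relationships would need a spatial index rather than this brute-force loop.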
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best


Wednesday 18 November 2015

Podcast Interview with Felienne Hermans, TU Delft

Almost two years ago, my colleagues at Neo alerted me to the fact that there was this really cool lady in the Netherlands that had an even cooler story about Neo4j and Spreadsheets. Curious as I am, I invited Felienne to come and talk about this at our Amsterdam meetup, and we have been in touch with her and her department ever since. So naturally, we also invited her to come on our Podcast, and as usual :), we had a great conversation. Hope you enjoy:

Here's the transcript of our conversation:
RVB: 00:01 Hello everyone. This is Rik, Rik Van Bruggen from Neo Technology. Today, we're recording another podcast episode that I've been looking forward to for a long time, actually, because I'm a true fan, and not a groupie yet, but a fan of Felienne Hermans from Delft University. Hi, Felienne.
FH: 00:19 Hi, Rik. 
RVB: 00:20 Hey. Great to have you on the podcast. It's fantastic. We've seen you talk at a number of conferences and at the meet-up before. But most people won't have seen that yet, so maybe you can introduce yourself?
FH: 00:33 Sure. I'm Felienne Hermans. I'm assistant professor at Delft University of Technology where I run the Spreadsheet Lab. That's a research group of the university that researches spreadsheets, obviously. 
RVB: 00:45 Obviously, exactly. That also sort of hints at the relationship with graphs, I think, right? 
FH: 00:53 Yeah. People that have seen my talk or maybe googled me know that the title of my talk that I usually give is "Spreadsheets are graphs".

So, we use Neo4j to do graph analysis on spreadsheets. We do all sorts of analysis, actually. The whole idea of our research is that spreadsheets are actually "code". And then if it's code, you need an IDE. So you need to analyze all sorts of constructions within your code so you can see, maybe, do you have code smells in your spreadsheets? Maybe does your spreadsheet need refactoring? All the things you typically do on source code for analysis, you should also do on your spreadsheet, and this is where we use Neo4j. 
RVB: 01:34 If I recall, you actually did some work on proving that spreadsheets are Turing complete, right? 
FH: 01:41 Yeah, I did. To make my point that spreadsheets are code because then, people think it's funny. If I say, "Hey, I do research on spreadsheets and I wrote a PhD dissertation on spreadsheets," people laugh in my face often [chuckles]. "Really, can you do a dissertation on that?" It's software engineering. I'd say, "Yeah." But actually, people don't believe me. To prove my point, I indeed implemented a Turing machine in a spreadsheet using only formulas, to show that they are Turing complete and that makes them as powerful as any other programming language. That should stop people from laughing at me. 
RVB: 02:17 [chuckles] Exactly. It's funny, but it's also really, really interesting. I think fundamentally, it's a very interesting approach and that's why I think people also love your talks. It's a very interesting topic. So, why did you get into graphs then, Felienne? Tell us more about that. How did you get into that relationship between spreadsheets and graphs? 
FH: 02:40 As I said, we do smell detection, as well. And for that, we initially stored information in a database and we stored information, for instance, what cell relates to what other cell? Because if you want to calculate a smell like "feature envy": in source code, feature envy would be a method that uses a lot of fields from a different class. So you can see in a similar way that in a spreadsheet, a formula that uses a lot of cells from a different worksheet has the feature envy smell. It should actually go in the other worksheet. So in order to save that type of information, you need to store what cell relates to what other cell. And initially, I never thought about what database do I use. 
FH: 03:26 In my mind, a few years ago, database was just a synonym for a SQL server. I was in Microsoft world where we make plugins for Excel, so database is just SQL. Same thing. I didn't think about it. I just dropped all my stuff in database, aka SQL. And initially, that worked fine. Some analyses you can really easily do but at one point, you want to really deeply understand how all the cells relate to each other because you want to measure the health of the spreadsheet. So we got horrible queries. SQL queries of like one A4 sheet of paper. Very, very complicated. But still I thought databases are just hard. I didn't really think about it until I saw a talk from one of your colleagues, Amanda, at Build Stuff in Vilnius. I saw her talk, and then it was really like a light bulb above my head. Bing. This is what I need. 
FH: 04:23 And a few weeks later, I was onsite at a customer for weeks so I wasn't bothered by students or colleagues so I could really program for a while. I thought, "Okay. This is my chance. Let's try and get my data into Neo4j and see how it will improve, what type of analysis that it would make easier." So that's what I did. When I tried, lots of the analyses - how many hops are there between these two cells or what is the biggest span within the graph - obviously are very easy to answer in Neo4j. So that's how we changed some of our analysis queries to run on the Neo4j database because it was so much easier to explore the data in a graph way because spreadsheets are graphs. There's a whole graph layer underneath the grid-like interface. That was really easy to analyze. 
RVB: 05:16 That is such a great story and such a great summary of why it's such a great fit. I guess most people don't think of it that way. But effectively, what you're doing is like dependency analysis, right? 
FH: 05:28 Yeah. That's what we're doing. How do cells depend on each other? 
RVB: 05:31 Super interesting. Is that something that you currently already use? I know you've been developing software on top of Neo4j, right? Is that already something that people can look at? 
FH: 05:45 No. Currently we only look at it within our own research group. The analyses we do are for us as researchers. It's not user-facing, so we have a smell detection tool that is somewhat user-facing where spreadsheet users can upload their spreadsheet and they get some analysis in the browser, but that is less advanced than the analysis we use. They're still using the SQL back-end because users typically don't really want to explore their spreadsheet in a way that we want to explore spreadsheets if we're researching. A spreadsheet user is not going to ask himself the question, "Hey, how widely are my cells connected?" That's really more a research tool. 
RVB: 06:31 I understand. Totally. What are your plans around that, Felienne? Are you still expanding that work or is that something that is still under development then? Where is this going, you think? 
FH: 06:42 Obviously, if you say smells and you say refactoring. We've done lots of work on the smells. Even though we keep adding new smells, we feel that we have covered the smells area pretty nicely. Then, the next step of course would be refactoring the smells. If I know that this cell suffers from feature envy, it is jealous of that nice worksheet where all the cells are-- that he is using that formula. You want to move the formula to the other worksheet so that it's nicely close together to the cells that he's using. So these type of refactorings - moving cells in order to improve the graph structure - is something that we are looking at. 
FH: 07:21 One of my PhD students is currently looking at comparing the underlying graph, so where are the cells connected to each other? Compare that look on the spreadsheet to where are the cells in the worksheet? If you have a big cluster of cells, they're all referring to each other but they are physically located on two different worksheets, that's probably not ideal for the user because then you have to switch back and forth. And the other way around is true, as well. If you have a worksheet where there are two clusters of cells relating to each other, maybe it would be better to give each of these clusters their own worksheets. So these are the type of refactorings that we are looking into. If you have a big disparity between how your spreadsheet is laid out, and how your graph connections are, then this is very smelly and you should do something to improve the structure. There are still a lot of graphs also, in the refactoring future that we see. 
RVB: 08:23 That, again, sounds so interesting. I think we could have a lot of joy - spreadsheet joy - because of that. I would love to see that. Very cool. Any other topics that you think would be relevant for our podcast listeners, Felienne?  Or anything else that comes to mind? 
FH: 08:44 Yes. One other thing, one final thing. I like to pitch my research a little bit if-- 
RVB: 08:49 Of course. Yeah. 
FH: 08:51 One of the things that we're also looking at is looking at what in a spreadsheet are labels and what in a spreadsheet is data? And especially how do they relate to each other? If you want to, for instance, generate documentation from a spreadsheet, or help a user understand a spreadsheet, it's very important to know if you have-- you take a random formula from a spreadsheet, what is it calculating? Is this the turnover of January, or is this the sales of blue shoes? And sometimes it's easy because, again, your layout matches the formulas. Sometimes you can just walk up the column or down the row to get the label. Sometimes the layout is a little bit more complicated. One of the things that we are working on is trying to make an algorithm - semiautomatic, and maybe with some user assistance, or entirely automated - where you can pick a random cell and then it will give you what it is, what is semantically happening in that cell. Can I add links to your story, as well? 
RVB: 09:57 Yes, you can. Yeah. 
FH: 09:59 Okay. So let's share a link of-- we did an online game where we gave people a random spreadsheet and a random cell, and they had to click the labels. We used that in an online course that I taught and we got 150,000 data points out of that game. We're currently analyzing that data to see what patterns are there in labeling. What usually is described by users as the labels of cells, and we hope that from that, we can generate or synthesize an algorithm that can do that for us. 
RVB: 10:28 Super cool. Let's put together a couple of links to your talks, but also to your research on the blog post that goes with this podcast. And then I'm sure people will love reading about it and will also love to hear about your future work. 
FH: 10:45 Thanks. 
RVB: 10:45 It will be great to keep in touch. Super. Thank you so much for coming on the Podcast, Felienne. I really appreciate it, and I look forward to seeing one of your talks, blog posts, whatever, in the near future. 
FH: 10:59 No problem. 
RVB: 11:00 All right. Have a nice day. 
FH: 11:01 Bye. 
RVB: 11:01 Bye.
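
To make the "spreadsheets are graphs" idea from the conversation a bit more concrete, here is a small sketch. It is not the Spreadsheet Lab's tooling: the toy workbook, the simplified reference regex, the threshold of three foreign cells and all function names are assumptions for illustration. It turns formulas into dependency edges, flags a formula with "feature envy", and counts hops between two cells.

```python
import re
from collections import deque

# Hypothetical miniature workbook: {(sheet, cell): formula or value}.
workbook = {
    ("Report", "A1"): "=Data!A1+Data!A2+Data!A3",
    ("Report", "A2"): "=A1*2",
    ("Data", "A1"): "10",
    ("Data", "A2"): "20",
    ("Data", "A3"): "=A1+A2",
}

# Simplified reference syntax: an optional "Sheet!" prefix plus a cell like "A1".
REF = re.compile(r"(?:([A-Za-z_]\w*)!)?([A-Z]+[0-9]+)")

def dependencies(wb):
    """Map each formula cell to the cells it references (the graph's edges)."""
    graph = {}
    for (sheet, cell), content in wb.items():
        if content.startswith("="):
            graph[(sheet, cell)] = [
                (m.group(1) or sheet, m.group(2)) for m in REF.finditer(content)
            ]
    return graph

def feature_envy(graph, threshold=3):
    """Formula cells that use at least `threshold` cells of one other worksheet."""
    envious = []
    for (sheet, cell), refs in graph.items():
        foreign = {}
        for ref_sheet, _ in refs:
            if ref_sheet != sheet:
                foreign[ref_sheet] = foreign.get(ref_sheet, 0) + 1
        if any(count >= threshold for count in foreign.values()):
            envious.append((sheet, cell))
    return envious

def hops(graph, start, goal):
    """Fewest dependency edges from start to goal, or None if unreachable."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

graph = dependencies(workbook)
```

Here `feature_envy(graph)` flags `Report!A1`, which leans entirely on the `Data` sheet, and `hops(graph, ("Report", "A2"), ("Data", "A1"))` walks two edges. The breadth-first search is the kind of traversal that becomes a short path query once the cells live in a graph database.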
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best


Friday 13 November 2015

Podcast Interview with Chris Daly

The wonderful Neo4j Blog is a great source of conversations for this podcast - I really have met some wonderful people over the past couple of months, big kudos to Bryce for making the introductions! So today we will have another one of those sessions: an interview with Chris Daly, who has been doing some fascinating things with Neo4j in his home after hours. Let's listen to the conversation:

Of course, we also have a transcript of our conversation available for you guys:
RVB: 00:02 Hello everyone. My name is Rik, Rik van Bruggen from Neo Technology. Here we are again recording another Neo4j graph database podcast. Tonight I'm joined by Chris Daly, all the way from Oregon. Hi Chris.

CD: 00:13 Hello.

RVB: 00:15 Hey, good to have you on the podcast. Thanks for taking the time.

Thursday 12 November 2015

Querying GTFS data - using Neo4j 2.3 - part 2/2

So let's see what we can do with this data that we loaded into Neo4j in the previous blogpost. In these query examples, I will try to illustrate some of the key differences between older versions of Neo4j, and the newer, shinier Neo4j 2.3. There's a bunch of new features in that wonderful release, and some of these we will illustrate using the (Belgian) GTFS dataset. Basically there's two interesting ones that I really like:

  • using simple "full-text" indexing with the "starts with" where clause
  • using ranges in your where clauses
Both of these were formerly very cumbersome, and are now very easy and powerful in Neo4j 2.3. So let's explore.

Finding a stop using "STARTS WITH"

In this first set of queries we will be using some of the newer functionalities in Neo4j 2.3, which allow you to use the underlying full-text search capabilities of Lucene to quickly and efficiently find starting points for your traversals. The first examples start with the "STARTS WITH" string matching operator - let's consider this query:

 match (s:Stop)  
 where s.name starts with "Turn"  
 return s  

In the new shiny 2.3 version of Neo4j, we generate the following query plan:
This, as you can see, is a super efficient query with only 5 "db hits" - so a wonderful example of using the Neo4j indexing system (see the NodeIndexSeekByRange step at the top). Great - this is a super useful new feature of Neo4j 2.3, which really helps to speed up (simple) fulltext queries. Now, let me tell you about a very easy and un-intuitive way to mess this up. Consider the following variation to the query:

 match (s:Stop)  
 where upper(s.name) starts with "TURN"  
 return s  

All I am doing here is using the "UPPER" function to enable case-insensitive querying - but as you can probably predict, the query plan then all of a sudden looks like this:
and it generates 2785 db hits. So that is terribly inefficient: the first step (NodeByLabelScan) basically sucks in all of the nodes that have a particular Label ("Stop") and then does all of the filtering on that set. On a smallish dataset like this one it may not really matter, but on a larger one (or on a deeper traversal) this would absolutely matter. The only way to avoid this in the current product is to have a second property that would hold the lower() or upper() of the original property, and then index/query on that property. It's a reasonable workaround for most cases.
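
To see why the function call defeats the index and why the extra-property workaround helps, here is a small stand-alone sketch. It is plain Python, not Neo4j internals: a sorted list stands in for an ordered property index, and the helper names and the extra stop names are made up for the example.

```python
from bisect import bisect_left

def build_index(names, key=lambda s: s):
    """Sorted (key, name) pairs, standing in for an ordered property index."""
    return sorted((key(n), n) for n in names)

def prefix_seek(index, prefix):
    """Range seek: jump to the first key >= prefix, then read while keys match."""
    i = bisect_left(index, (prefix,))
    hits = []
    while i < len(index) and index[i][0].startswith(prefix):
        hits.append(index[i][1])
        i += 1
    return hits

def prefix_scan(names, prefix):
    """What happens without a usable index: transform and test every value."""
    return [n for n in names if n.upper().startswith(prefix)]

# Illustrative stop names; the mixed-case variant is invented for the demo.
stops = ["Antwerpen-Centraal", "Turnhout", "TURNHOUT Markt", "Arlon", "Tielen"]

name_index = build_index(stops)                  # index on the raw name
upper_index = build_index(stops, key=str.upper)  # index on a stored upper-cased copy

exact_case = prefix_seek(name_index, "Turn")   # seek on the original property
any_case = prefix_seek(upper_index, "TURN")    # seek on the normalised property
scanned = prefix_scan(stops, "TURN")           # same answer, but touches every name
```

The index is ordered by the stored value, so a prefix is a contiguous range that a seek can jump to. `upper(s.name)` is computed per node at query time, so no ordered structure exists for it unless the upper-cased value is stored and indexed as its own property.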

So cool - learned something.

Range queries in 2.3

I would like to get to know a little more about Neo4j 2.3's range query capabilities. I will do that by filtering the stop times (i.e. Stoptimes) by their departure_time and/or arrival_time. Let's try that with the following simple query to start with:

 match (st:Stoptime)  
 where st.departure_time < "07:45:00"  
 return st.departure_time;  

If I run that query without an index on :Stoptime(departure_time) I get a query plan like this:
As you can see the plan starts with a "NodeByLabelScan". Very inefficient.

If however we put the index in place, and run the same query again, we get the following plan:
Which starts with a "NodeIndexSeekByRange". Very efficient. So that's good.
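
A side note on why comparing times as strings works here: zero-padded "HH:MM:SS" values sort lexicographically in the same order as they do chronologically, so a range predicate maps onto a contiguous slice of an ordered index. The sketch below shows that in plain Python with made-up departure times; the helper names are assumptions, and `to_seconds` mirrors the idea behind integer properties such as arrival_time_int used further down.

```python
from bisect import bisect_left

# Illustrative departure times, stored as zero-padded "HH:MM:SS" strings.
departures = sorted(["07:50:00", "06:15:00", "10:05:00", "07:44:59", "09:30:00"])

def before(sorted_times, limit):
    """All times strictly earlier than limit, located with one binary search."""
    return sorted_times[:bisect_left(sorted_times, limit)]

def to_seconds(hhmmss):
    """Convert "HH:MM:SS" to seconds since midnight, for doing arithmetic."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds

early = before(departures, "07:45:00")

# Without zero padding the string order breaks: "10:05:00" sorts before "7:45:00".
unpadded_is_misleading = "10:05:00" < "7:45:00"
```

String comparison is fine for filtering, but durations (arrival minus departure) need numbers, which is where a conversion like `to_seconds` comes in.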

Now let's see how we can apply that in a realistic route finding query.

Route finding on the GTFS dataset

The obvious application for a GTFS dataset is to use it for some real-world route planning. Let's start with the following simple query, which looks up two "Stops", Antwerp (where I live) and Turnhout (where I am from):

 match (ant:Stop), (tu:Stop)  
 where ant.name starts with "Antw"  
 AND tu.name starts with "Turn"  
 return distinct tu,ant;  

This gives me all the stops for "Antwerpen" and my hometown "Turnhout". Now I can narrow this down a bit and only look at the "top-level" stops (remember that stops can have parent stops), and calculate some shortestpaths between them. Let's use this query:

 match (t:Stop)<-[:PART_OF]-(:Stop),  
 (a:Stop)<-[:PART_OF]-(:Stop)  
 where t.name starts with "Turn"  
 AND a.name starts with "Antw"  
 with t,a  
 match p = allshortestpaths((t)-[*]-(a))  
 return p  
 limit 10;  

This gives me the following result (note that I have "limited" the number of paths, as there are quite a number of trains running between the two cities):

The important thing to note here is that there is a DIRECT ROUTE between Antwerp and Turnhout and that this really makes the route-finding a lot easier.

Querying for direct routes

A real-world route planning query would look something like this:

 match (tu:Stop {name: "Turnhout"})--(tu_st:Stoptime)  
 where tu_st.departure_time > "07:00:00"  
 AND tu_st.departure_time < "09:00:00"  
 with tu, tu_st  
 match (ant:Stop {name:"Antwerpen-Centraal"})--(ant_st:Stoptime)  
 where ant_st.arrival_time < "09:00:00"  
 AND ant_st.arrival_time > "07:00:00"  
 and ant_st.arrival_time > tu_st.departure_time  
 with ant,ant_st,tu, tu_st  
 match p = allshortestpaths((tu_st)-[*]->(ant_st))  
 with nodes(p) as n  
 unwind n as nodes  
 match (nodes)-[r]-()  
 return nodes,r  

which would give me a result like this:

The interesting thing here is that you can immediately see from this graph visualization that there is a "fast train" (the pink "Trip" at the bottom) and a "slow train" (the pink "Trip" at the top) between origin and destination. The slow train actually makes three additional stops.
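
Outside the database, the same direct-route question can be sketched over plain stop_times rows. This is only an illustration of the logic, not a translation of the Cypher above: the trip ids, the times and the intermediate stop are invented, and `direct_trips` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical rows: (trip_id, stop_name, arrival, departure, stop_sequence)
STOP_TIMES = [
    ("fast", "Turnhout", "07:15:00", "07:15:00", 1),
    ("fast", "Antwerpen-Centraal", "07:55:00", "07:55:00", 2),
    ("slow", "Turnhout", "07:30:00", "07:30:00", 1),
    ("slow", "Herentals", "07:50:00", "07:51:00", 2),
    ("slow", "Antwerpen-Centraal", "08:25:00", "08:25:00", 3),
    ("late", "Turnhout", "09:30:00", "09:30:00", 1),
    ("late", "Antwerpen-Centraal", "10:10:00", "10:10:00", 2),
]

def direct_trips(rows, origin, destination, earliest, latest):
    """Trips that call at origin and later at destination inside the window.

    Returns (trip_id, departure, arrival, intermediate_stops) sorted by arrival.
    """
    by_trip = defaultdict(list)
    for row in rows:
        by_trip[row[0]].append(row)
    found = []
    for trip_id, calls in by_trip.items():
        calls.sort(key=lambda r: r[4])  # order the calls along the trip
        for i, dep in enumerate(calls):
            if dep[1] != origin or not (earliest < dep[3] < latest):
                continue
            for j in range(i + 1, len(calls)):
                arr = calls[j]
                if arr[1] == destination and arr[2] < latest:
                    found.append((trip_id, dep[3], arr[2], j - i - 1))
    return sorted(found, key=lambda t: t[2])

direct = direct_trips(STOP_TIMES, "Turnhout", "Antwerpen-Centraal", "07:00:00", "09:00:00")
```

The last element of each result is the number of intermediate stops, which is what separates the "fast" from the "slow" trip in this toy data.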

Querying for indirect routes

Now let's look at a route-planning query for an indirect route between Turnhout and Arlon (the southernmost city in Belgium, close to the border with Luxembourg). Running this query will show me that I can only get from origin to destination by transferring from one train to another midway:

 match (t:Stop),(a:Stop)  
 where t.name = "Turnhout"  
 AND a.name = "Arlon"  
 with t,a  
 match p = allshortestpaths((t)-[*]-(a))  
 where NONE (x in relationships(p) where type(x)="OPERATES")  
 return p  
 limit 10  

This is what I get back then:

You can clearly see that I can get from Turnhout to Brussels, but then need to transfer to one of the Brussels-to-Arlon trains on the right. So... which one would that be? Let's run the following query:

 MATCH (tu:Stop {name:"Turnhout"})--(st_tu:Stoptime),  
 (ar:Stop {name:"Arlon"})--(st_ar:Stoptime),  
 st_tu.departure_time > "10:00:00"  
 AND st_tu.departure_time < "11:00:00"  
 AND st_midway_arr.arrival_time > st_tu.departure_time  
 AND st_midway_dep.departure_time > st_midway_arr.arrival_time  
 AND st_ar.arrival_time > st_midway_dep.departure_time  
 order by (st_ar.arrival_time_int-st_tu.departure_time_int) ASC  
 limit 1  

You can tell that this is a bit more complicated. It definitely comes back with a correct result:

At the top is the Trip from Turnhout to Brussels, and at the bottom is the Trip from Brussels to Arlon. You can also see that there's a bit of a wait there, so it may actually make more sense to take a later train from Turnhout to Brussels.

The problem with this approach is of course that it would not work for a journey that involved more than one stopover. If I would, for example, want to travel from "Leopoldsburg" to "Arlon", I would need two stopovers (in Hasselt, and then in Brussels):
and therefore the query above would become even more complicated.
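
One way to handle any number of stopovers, outside Cypher, is a scan over all timetable connections in departure order (often called a Connection Scan). The sketch below is a different technique from the queries above, not something taken from these blogposts: the connections and times are invented, transfers are assumed to take no time, and `earliest_arrival` is a hypothetical helper.

```python
def earliest_arrival(connections, origin, destination, depart_after):
    """One pass over connections sorted by departure time.

    Each connection is (from_stop, departure, to_stop, arrival, trip_id),
    with times as zero-padded "HH:MM:SS" strings.
    """
    best = {origin: depart_after}  # earliest known arrival time per stop
    came_from = {}                 # stop -> connection used to reach it
    for conn in sorted(connections, key=lambda c: c[1]):
        from_stop, departure, to_stop, arrival, _trip = conn
        # Usable only if we can already be at from_stop when it leaves.
        if from_stop in best and best[from_stop] <= departure:
            if arrival < best.get(to_stop, "99:99:99"):
                best[to_stop] = arrival
                came_from[to_stop] = conn
    if destination not in came_from:
        return None
    legs, stop = [], destination
    while stop != origin:          # walk back from the destination
        conn = came_from[stop]
        legs.append(conn)
        stop = conn[0]
    return list(reversed(legs))

# Invented timetable for the Leopoldsburg to Arlon example.
CONNECTIONS = [
    ("Leopoldsburg", "08:05:00", "Hasselt", "08:35:00", "trip-1"),
    ("Hasselt", "08:45:00", "Brussels", "09:50:00", "trip-2"),
    ("Brussels", "09:40:00", "Arlon", "12:20:00", "trip-3"),  # leaves before we arrive
    ("Brussels", "10:10:00", "Arlon", "12:50:00", "trip-4"),
]

route = earliest_arrival(CONNECTIONS, "Leopoldsburg", "Arlon", "08:00:00")
```

Because the scan never needs to know how many transfers a journey has, the two-stopover case costs the same single pass as the one-stopover case. A realistic version would add a minimum transfer time and let a passenger stay seated on the same trip.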

My conclusion here is that

  1. it's actually pretty simple to represent GTFS data in Neo4j - and very nice to navigate through the data this way. Of course.
  2. direct routes are very easily queried with Cypher.
  3. indirect routes would require a bit more tweaking to the model and/or the use of a different API in Neo4j. That's currently beyond my scope of these blogposts, but I am very confident that it could be done.
I really hope you enjoyed these two blogposts, and that you will also apply it to your own local GTFS dataset - there's so many of them available. All of the queries above are on github as well of course - I hope you can use them as a baseline.



Monday 9 November 2015

Loading General Transit Feed Spec (GTFS) files into Neo4j - part 1/2

Lately I have been having a lot of fun with a pretty simple but interesting type of data: transport system data. That is: any kind of schedule data that a transportation network (bus, rail, tram, tube, metro, ...) would publish to its users. This is super interesting data for a graph, right, as you could easily see that "shortestpath" operations over a larger transportation network would be super useful and quick.

The General Transit Feed Specification

Turns out that there is a very, very nice and easy spec for that kind of data. It was originally developed by Google as the "Google Transit Feed Specification" in cooperation with Portland TriMet, and is now known as the "General Transit Feed Specification". Here's a bit more detail from Wikipedia:
A GTFS feed is a collection of CSV files (with extension .txt) contained within a .zip file. Together, the related CSV tables describe a transit system's scheduled operations. The specification is designed to be sufficient to provide trip planning functionality, but is also useful for other applications such as analysis of service levels and some general performance measures. GTFS only includes scheduled operations, and does not include real-time information. However real-time information can be related to GTFS schedules according to the related GTFS-realtime specification.
More info on the Google Developer site. I believe that Google originally developed this to integrate transport information into Maps - which really worked very well I think. But since that time, the spec has been standardized - and now it turns out there are LOTS and lots of datasets like that.  Most of them are on the GTFS Exchange, it seems - and I have downloaded a few of them:
and there are many, many more.
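To make the "collection of CSV files inside a .zip" idea a bit more concrete, here is a minimal Python sketch (not part of the load script for this post) that opens a feed and lists the tables it contains, together with their column headers. The function names and the tiny in-memory feed are made up for illustration; a real feed will have more files and many more columns.

```python
import csv
import io
import zipfile


def list_gtfs_tables(feed_bytes):
    """Return {file name: list of column headers} for every .txt table in a GTFS zip."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(feed_bytes)) as feed:
        for name in sorted(feed.namelist()):
            if not name.endswith(".txt"):
                continue  # GTFS tables are CSV files with a .txt extension
            with feed.open(name) as raw:
                # utf-8-sig drops the byte-order mark that some feeds include
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                header = next(csv.reader(text), [])
            tables[name] = header
    return tables


def build_example_feed():
    """Build a tiny, made-up feed in memory so the sketch runs without any download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as feed:
        feed.writestr("agency.txt", "agency_id,agency_name\n1,Example Rail\n")
        feed.writestr("stops.txt", "stop_id,stop_name\nS1,Turnhout\n")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, columns in list_gtfs_tables(build_example_feed()).items():
        print(name, columns)
```

For a downloaded feed you would read the .zip file from disk as bytes and pass those in. Looking at the headers first is a quick way to spot which of the optional columns a particular feed actually provides.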

Converting the files to a graph

The nice thing about these .zip files is that - once unzipped - they contain a bunch of comma-separated value files (.txt extension though), and that these files all have a similar structure:

So I took a look at some of these files, and while I found that there are a few differences between the structures here and there (some of the GTFS data elements appear to be optional), I generally had a structure that looked like this:

You can see that there are a few "keys" in there (color coded) that link one file to the next. So then I could quite easily translate this to a graph model:

So now that we have that model, we should be able to import our data into Neo4j quite easily. Let's give that a go.

Loading GTFS data

Here's a couple of Cypher statements that I have used to load the data into the model. First we create some indexes and schema constraints (for uniqueness):

 create constraint on (a:Agency) assert a.id is unique;
 create constraint on (r:Route) assert r.id is unique;
 create constraint on (t:Trip) assert t.id is unique;
 create index on :Trip(service_id);
 create constraint on (s:Stop) assert s.id is unique;
 create index on :Stoptime(stop_sequence);
 create index on :Stop(name);

Then we add the Agency, Routes and Trips:
 //add the agency
 load csv with headers from
 'file:///ns/agency.txt' as csv
 create (a:Agency {id: toInt(csv.agency_id), name: csv.agency_name, url: csv.agency_url, timezone: csv.agency_timezone});
 // add the routes
 load csv with headers from  
 'file:///ns/routes.txt' as csv  
 match (a:Agency {id: toInt(csv.agency_id)})  
 create (a)-[:OPERATES]->(r:Route {id: csv.route_id, short_name: csv.route_short_name, long_name: csv.route_long_name, type: toInt(csv.route_type)});  
 // add the trips  
 load csv with headers from  
 'file:///ns/trips.txt' as csv  
 match (r:Route {id: csv.route_id})  
 create (r)<-[:USES]-(t:Trip {id: csv.trip_id, service_id: csv.service_id, headsign: csv.trip_headsign, direction_id: csv.direction_id, short_name: csv.trip_short_name, block_id: csv.block_id, bikes_allowed: csv.bikes_allowed, shape_id: csv.shape_id});  

Next we first load the "stops" without connecting them to the graph, including the parent/child relationships that can exist between specific stops:
 //add the stops  
 load csv with headers from  
 'file:///ns/stops.txt' as csv  
 create (s:Stop {id: csv.stop_id, name: csv.stop_name, lat: toFloat(csv.stop_lat), lon: toFloat(csv.stop_lon), platform_code: csv.platform_code, parent_station: csv.parent_station, location_type: csv.location_type, timezone: csv.stop_timezone, code: csv.stop_code});  
 //connect parent/child relationships to stops
 load csv with headers from  
 'file:///ns/stops.txt' as csv  
 with csv  
 where not (csv.parent_station is null)  
 match (ps:Stop {id: csv.parent_station}), (s:Stop {id: csv.stop_id})  
 create (ps)<-[:PART_OF]-(s);  

Then, finally, we add the Stoptimes which connect the Trips to the Stops:
 //add the stoptimes  
 using periodic commit  
 load csv with headers from  
 'file:///ns/stop_times.txt' as csv  
 match (t:Trip {id: csv.trip_id}), (s:Stop {id: csv.stop_id})  
 create (t)<-[:PART_OF_TRIP]-(st:Stoptime {arrival_time: csv.arrival_time, departure_time: csv.departure_time, stop_sequence: toInt(csv.stop_sequence)})-[:LOCATED_AT]->(s);  
This query/load operation has been a bit trickier for me when experimenting with various example GTFS files: because there can be a LOT of stoptimes for large transportation networks like bus networks, the load can take a long time to complete and should be treated with care. On some occasions, I have had to split the stop_times.txt file into multiple parts to make it work.
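Splitting a big file by hand gets tedious quickly, so here is a minimal Python sketch of how it could be scripted. It is not the way the files for this post were split; the function name, the part file names and the chunk size are all made up for illustration. The one thing that matters is that every part repeats the header row, so that each part can be loaded with "load csv with headers" on its own.

```python
import csv
import os
import tempfile


def split_csv_file(source_path, rows_per_part, output_dir):
    """Split a CSV file into numbered parts, repeating the header row in each part.

    Returns the list of part file paths, in order.
    """
    part_paths = []
    out = None
    writer = None
    with open(source_path, newline="", encoding="utf-8") as source:
        reader = csv.reader(source)
        header = next(reader)
        for index, row in enumerate(reader):
            if index % rows_per_part == 0:
                # Start a new part: close the previous one and write the header again
                if out is not None:
                    out.close()
                name = "stop_times_part%d.txt" % (len(part_paths) + 1)
                path = os.path.join(output_dir, name)
                out = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                part_paths.append(path)
            writer.writerow(row)
    if out is not None:
        out.close()
    return part_paths


if __name__ == "__main__":
    # Tiny made-up stop_times file, split into parts of two rows each
    with tempfile.TemporaryDirectory() as folder:
        source = os.path.join(folder, "stop_times.txt")
        with open(source, "w", newline="", encoding="utf-8") as handle:
            handle.write("trip_id,stop_id,stop_sequence\n")
            for sequence in range(1, 6):
                handle.write("T1,S%d,%d\n" % (sequence, sequence))
        for path in split_csv_file(source, 2, folder):
            print(os.path.basename(path))
```

Each part can then be loaded with the same stoptimes statement as above, pointing the file name at one part at a time.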

Finally, we will connect the stoptimes to one another, forming a sequence of stops that constitute a trip:
 //connect the stoptime sequences
 match (s1:Stoptime)-[:PART_OF_TRIP]->(t:Trip),
 (s2:Stoptime)-[:PART_OF_TRIP]->(t)
 where s2.stop_sequence=s1.stop_sequence+1
 create (s1)-[:PRECEDES]->(s2);
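One thing to keep in mind with the statement above is that it only links stoptimes whose stop_sequence values differ by exactly one. The GTFS spec only requires the values to increase along a trip, so a feed that numbers its stops 10, 20, 30, ... would end up without any PRECEDES relationships. As a minimal Python sketch of the alternative, this is what pairing each stoptime with the next one in sorted order could look like; the function name and the sample data are made up for illustration.

```python
from collections import defaultdict


def precedes_pairs(stop_times):
    """Return (trip_id, earlier_sequence, later_sequence) for neighbouring stoptimes.

    stop_times is an iterable of (trip_id, stop_sequence) tuples in any order.
    """
    by_trip = defaultdict(list)
    for trip_id, stop_sequence in stop_times:
        by_trip[trip_id].append(stop_sequence)
    pairs = []
    for trip_id in sorted(by_trip):
        ordered = sorted(by_trip[trip_id])
        # Pair each stoptime with its successor in sorted order, gaps or not
        for earlier, later in zip(ordered, ordered[1:]):
            pairs.append((trip_id, earlier, later))
    return pairs


if __name__ == "__main__":
    sample = [("t1", 3), ("t1", 1), ("t2", 10), ("t1", 2), ("t2", 20)]
    for pair in precedes_pairs(sample):
        print(pair)
```

For feeds that do number their stops 1, 2, 3, ... both approaches give the same result.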

That's it, really. When I generate the meta-graph for this data, I get something like this:

Which is exactly the Model that we outlined above :) ... Good!

The entire load script can be found on github, so you can try it yourself. All you need to do is change the load csv file/directory. Also, don't forget that load csv now takes its import files from the local directory that you configure in
That's about it for now. In a next blogpost, I will take Neo4j 2.3 for a spin on a GTFS dataset, and see what we can find out. Check back soon to read up on that.

Hope this was interesting for you.



Wednesday 4 November 2015

Podcast Interview with David Meza, NASA

Every engineer or informatician dreams about their "output", the stuff they work on day by day, to achieve something BIG. Something that will change the world... With Neo4j, I am lucky enough to experience something similar - even though I am not an engineer. On many occasions, I have had the good fortune - on this podcast and elsewhere - to talk to people that are truly revolutionising an industry. Whether it's in the transportation industry, telecoms, pharmaceuticals, or... getting to Mars - I have talked to people that are actually using Graphs to do so.

So today's podcast was a particularly nice and insightful talk with David Meza, of NASA. David is using Neo4j at a small scale today to enable true Knowledge Management on their "Lessons Learnt" data. He can tell you all about that himself - but rest assured that it was truly interesting:
Here's the transcript of our conversation:
RVB: 00:02 Hello everyone. My name is Rik, Rik van Bruggen from Neo Technology and here we are again recording a wonderful session for the Neo4j graph database podcast. I'm joined today all the way from Texas in the U.S.A. by David Meza from NASA. Hello David. 
DM: 00:18 Hello Rik, glad to be here. 
RVB: 00:19 It's wonderful to have you on the podcast, thanks for making the time. 
DM: 00:22 Oh, my pleasure. 
RVB: 00:23 Well, it's not every day that we have a space-related guest on the podcast, so it's particularly exciting especially because you've done some wonderful things with Neo4j. But, David, would you mind introducing yourself a little bit to our audience? 
DM: 00:41 Sure, again, my name is David Meza.  I am the Chief Knowledge Architect at NASA, stationed out of Johnson Space Center in Houston. And as a Chief Knowledge Architect, my primary role is to look at the technological road map for our knowledge services out of our Chief Knowledge Office. Basically, they're doing this by merging information architecture and knowledge management in such a way they can provide our users with all the tools, processes, and applications to really extract the golden nuggets of knowledge from our critical data. 
RVB: 01:17 Wow, that sounds really exciting. I can imagine that NASA has a lot of knowledge. You must have a lot of interesting things that you've been working on. I read your blog post from the lessons learned database, David, can you tell us a little bit more about that? 
DM: 01:32 Sure, what I was looking at doing with our lessons learned database is most folks while they find lessons learned very important, and they want to make sure  that they can get the lessons out of our projects and programs from the past in order to implement them into our future programs and projects, I found that most people tend not to really look through them because they find it very difficult to find information inside of the lessons learned. So I needed to find another way of looking at these lessons learned that would give users, one, a better way to search through the lessons, and two, a better way to connect the dots between lessons to try to make sure they find all the relevant lessons for what they're looking for. 
RVB: 02:20 How did you get into graph databases or how did that connect with the wonderful world of graphs? 
DM: 02:30 Well, recently, about two, three years ago, I had taken a course on social network analysis because I was really looking at how to develop expert finders within our organizations and how to make those connections between people. And when I was posed with this question from an engineer on our lessons learned on how to make it easier to find information and how to connect lessons together, it just dawned on me that a graph database, while it works for people, it should work for documents also because they're also related in many different ways. 
RVB: 03:02 So basically, you're looking at a graph of documentation of knowledge base, is that what I'm hearing? 
DM: 03:10 Correct, we're taking the lessons learned and after applying various topic model algorithms to them, we can build relationships and connections between the documents based on the unstructured data inside each document. So for example, by using a topic model algorithm, I can create groups of topics for each of the lessons and eventually correlate them together based on self-assigned categories. And in doing so, I build nodes and relationships between these documents. 
RVB: 03:48 I read your blog post and what struck me was that you actually use some other tools together with Neo4j as well. You use R for the topic modelling, I believe, is that true? 
DM: 03:59 That's correct, I'm a statistician at heart and R has been one of my tools for many years, and I utilize R to do the Latent Dirichlet Allocation or topic modeling algorithm against the documents. But that only gets me to the point of understanding my documents for the analyst. I needed something more for my end user as I go forward. 
RVB: 04:25 And that's where you also looked at things like visualization tools I think, the Linkurious and that type of technology? 
DM: 04:33 Correct, having read through the Neo4j books and tried to get as much information as I can on graph databases, there was a section in one of the books that talked about different types of visualization tools. And so I did an evaluation of various applications and I found Linkurious was one that really showed, at least for the end user, a very easy way to walk through the graph and the relationships as you're trying to find information. 
RVB: 05:04 They're from Europe, they're from Paris, and we've been doing a lot of projects with them recently. Actually, it's a very nice tool set. 
DM: 05:14 Definitely. 
RVB: 05:15 How many users are using this solution right now, David? Is it in production or is this just experimental right now? 
DM: 05:23 It started out experimental, just to showcase the technology and what it could do so that I could secure a little bit of budget to expand, but at this point I'm probably up to 150 to 200 different users that are utilizing the tool sets to look through the information. But I hope as I move forward and as I start showcasing this to some of our centers, that I can expand it up to several thousand of them in the next year or so. 
RVB: 05:54 That's by adding more topic models, more project information, more documents that you'll expand them or how should I see that? 
DM: 06:03 Correct, what I'm doing here is showcasing how we can utilize these types of machine learning algorithms and visualization tools against our critical data to make it easier for users to find that information. So more and more groups are coming forward and asking me to help them visualize their data based on the article that I have written there. 
RVB: 06:26 That seamlessly sort of brings me to my last question, if you don't mind, and that's, where is this going? What's in store in the future, David, any insights or perspectives on that? 
DM: 06:42 My ultimate goal, of course, is to expand how we visualize our data for our end users. I see a fairly decent connection between Neo4j and maybe some other NoSQL databases such as MongoDB or other document databases to help with two things: one, to capture all the documentation or information in a document database, and two, to have Neo4j communicate with MongoDB in such a way that it automatically creates nodes and relationships based on how the information is input into the databases. 
RVB: 07:23 So that would be the entire topic modeling phase would be automated, sort of or is that something different? 
DM: 07:32 Correct, it would be automated based on the fact that a user would just have to input their information or upload their information based on certain criteria. But then looking at the topic modeling algorithm, and of course playing with those algorithms and trying to find the best algorithm for the data in order to be able to visualize it correctly. 
RVB: 07:54 Super exciting, and I'm assuming that this will then get us to Mars, right [chuckles]? 
DM: 07:59 I'm hoping that it at least gets us the information necessary to get us to Mars a lot quicker. My goal is to get that document, that information, to our engineers and our scientists faster, because finding information is difficult not just for NASA but for many organizations. There have been different surveys and research analyses showing that it generally takes an engineer anywhere from eight to 13 different searches to find the information they want, and that takes time. I want to shorten that time frame. 
RVB: 08:31 That makes so much sense. It's a very popular use case for Neo4j, this graph-based search as we call it. When you're talking about engineering documents or legal documents or published articles, it's a very popular use case and I'm very excited that you guys are using it as well. This is really cool. Thank you for sharing that with us. 
DM: 08:54 You're quite welcome. 
RVB: 08:55 And I'm hoping that maybe you're attending GraphConnect in the next couple of weeks or? 
DM: 09:00 I wish I could. It's on my bucket list of things to do. I may have to wait until next year to get out there. Unfortunately, just too many other things I'm involved in. 
RVB: 09:10 I understand. David, we look forward to seeing you at  some point and I want to thank you so much for coming on the podcast. It's very cool and thank you also for writing the blog post. It's very much appreciated. 
DM: 09:22 You're quite welcome, it's my pleasure, and I look forward to other topics in the future. 
RVB: 09:28 Thank you.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best