Thursday 30 July 2015

Hierarchies and the Google Product Taxonomy in Neo4j

Quite some time ago, I wrote a blogpost about using Neo4j for managing and calculating hierarchies. That post was then also later used in my book as it proved very useful for explaining one of the key use-cases for Neo4j, Impact Analysis and Simulation. So it should be pretty clear by now that HIERARCHIES ARE GRAPHS right? I think so :) ...

Hierarchical Product Taxonomy

Recently, I was preparing for a very cool brown-bag session at a client's offices, and I wanted to include a demonstration around product taxonomies. These structures are typically presented as some kind of hierarchy/tree on many eCommerce websites - and are very well known to online users. So I wanted to find a taxonomy, and here, Google immediately came to the rescue: I found this page on the Google Merchant Center.

You can follow the link to the Excel file, and boom - there's your product Taxonomy for you.
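Getting that taxonomy into Neo4j then boils down to creating a node per category and linking each one to its parent. Here's a minimal, hedged Cypher sketch - it assumes you export the Excel file to a plain CSV with one full category path per line (e.g. "Animals & Pet Supplies > Pet Supplies"), and the file name, label and relationship type are just illustrative (you may also need a different field delimiter if category names contain commas):

// hypothetical sketch: build the taxonomy tree from full category paths
load csv from "file:/<path>/taxonomy.csv" as line
with split(line[0], " > ") as path
unwind range(1, length(path)-1) as idx
merge (child:Category {name: path[idx]})
merge (parent:Category {name: path[idx-1]})
merge (child)-[:PART_OF]->(parent);

Note that merging on the name alone will conflate same-named subcategories that live under different parents - fine for a quick demo, but for anything real you would merge on the full path instead.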

Tuesday 28 July 2015

Podcast Interview with Max De Marzi, Neo Technology

When I started working for Neo4j 3+ years ago, the graph database universe was in a different place. Sure, Neo4j had great documentation, but we did not have books, graphgists, training, or anything like that. We have truly come a long way. At that time, however, if you wanted to Learn Neo4j, you had to dig and keep on digging through lots of "typical open source" information out there. It was all there - but it was just not as structured. Cathedral and the bazaar, etc. And one of my biggest sources for quality information at the time, was a blog written by this funky guy out in the Windy City that was doing all of this super cool but also very understandable stuff with Neo4j. I followed his blog religiously - still do - and recently I had the opportunity to interview Max De Marzi for our podcast. Here's our conversation:


Here's the transcript of our conversation:
RVB: 00:02 Hello, everyone. My name is Rik, Rik van Bruggen from Neo Technology, and here I am again recording another episode for our graph database podcast. Tonight I am joined by a very tired and worn down consultant of Neo Technology, all the way over in Chicago, Max De Marzi. I've been looking forward to this. Hi Max, how are you? 
MDM: 00:24 I'm doing well, Rik. How are you, man? 
RVB: 00:25 I'm doing very well. Thanks for coming on the podcast, I really appreciate it. 
MDM: 00:30 No worries. 
RVB: 00:30 So Max, I've been a big fan of your blog and all the work you've been doing with Neo long before you joined Neo, but most people might not know you, so why don't you introduce yourself? 
MDM: 00:44 Sure. So, Max De Marzi, I am a field engineer for Neo4j. What does that mean? That means I get to go around the country and help people build Neo4j solutions. So I don't work on the product at all, I don't work on the back-end whatsoever, instead I use Neo4j to build real world apps that the customers use for things. My specialty is building proof-of-concepts in about a week or two, that way you have something tangible to show to upper management because the reality is, we sell you an empty database and about three or six months worth of work, and it makes everyone's life easier if you can see the light at the end of the tunnel before you even get started by having a POC. That is pretty much what you want it to do. 
RVB: 01:30 Couldn't agree more. So Max, how did you get to Neo? Tell us about that because you were working with Neo long before you joined Neo right? 
MDM: 01:39 Yeah, I was taking vacation time off to go do Neo4j consulting gigs. So I was-- 
RVB: 01:47 Are you serious [chuckles]? 
MDM: 01:49 I was obsessed is the right word. I fell in love with the idea of graphs, it just clicked in my head. I'd been a SQL developer for a decade before I found Neo4j and I was like, "Why wasn't I shown this ten years ago?" It just gave me a much easier way to think about the problems I was dealing with - they're better off in graph format … And the biggest kick I get out of Neo is that sometimes you're doing things that no one has ever done before. You can't really say the same thing if you're building a CRUD application on MySQL - you've done that a hundred times - but this time it's something no-one's ever done before, and that gives me that developer-high that I crave. 
RVB: 02:29 I can imagine absolutely, yeah. And you've started writing about that as well, right? You've had a very interesting and - at least I think - popular blog about it as well, right? 
MDM: 02:41 Yeah, my strength is that I can go to a customer’s site on Monday knowing nothing and on Friday I can build something that makes sense with their domain, whatever it may be. My weakness is that I don't remember anything. So to counter that weakness, what I do is I blog about the things I do and then I set everything up on GitHub so the code is there as well. That way people can learn from my experience and I upload my memory to the web, and I don't have to remember that stuff because honestly, this job is so fast-paced that I have no time to even remember what I did three months ago or six months ago or a year ago. 
RVB: 03:17 And Google will remember everything, right? 
MDM: 03:19 Exactly. 
RVB: 03:20 That's how it goes, absolutely. How did you actually come about Neo and why did you fall in love with it? What's so good about it? 
MDM: 03:31 I watched some video oddly enough on Windy City TV, they were showing a video of how to model comic book data, and the guy tried to do it in a relational database and really failed, tried to do it in a document database and it just wasn't working, and then he turned to the graph and he was like, "Okay, this can handle anything." And oddly enough, Marvel Comics ended up actually building the real Marvel Comic graph using Neo4j. 
RVB: 03:58 Absolutely, that guy talked about it a lot as well, right? I don't know his name anymore, but I'll find it. He did some presentations about that, right?
The Marvel Universe? 

MDM: 04:08 Yeah, and so going in four years from someone's idea to a real company really doing it, and actually using it in production, is amazing. 
RVB: 04:17 Very cool. What do you think is the big strength then? What's the big strength of a graph database if you compare it to other data stores? Why do you think it's so powerful? 
MDM: 04:29 For me it's not just a graph database, it's Neo4j itself. It's super extensible. You can do whatever you want with it. And sometimes people forget that it's not just a database where you say, "Here's the thing that you talk to through a language and that's all you can do with it." Neo4j encourages you, wants you to play with it, wants you to get inside the guts of it and say, "Build an extension. Build a kernel extension. Build a plug-in for Neo4j. Modify the actual kernel itself and use a custom [jar?] if you want." All those things are possible in Neo and I show in the blog how to do them and how painless it is, versus trying to do that with Postgres or MySQL or Redis or anything else that's out there, your head would explode. So it's not just "here are nodes, here are relationships and I can model anything", it's that plus everything else under the sun you can do, and do with Neo4j. 
RVB: 05:20 So what you're talking about is actually the implementation as well, it's not just the model that's great but it's also the way we use it or the way Neo uses it internally, you think it's really powerful. 
MDM: 05:32 Yeah, absolutely. And everything is changeable. You don't like the way the indexes are handled in Neo4j, you can change them. If you don't like the way the interface or the REST API works, you can have your own custom REST API. You don't like the way logging is done, you can change the logging. Permissions, you can change. Anything you want, you can change. The whole thing is open source which is really nice. 
RVB: 05:52 Yeah, super great, absolutely. So what's the most exciting application that you ever built for Neo?  Anything that comes to mind? I know I could go to Google but you know [chuckles]. Can you remember anything? 
MDM: 06:07 No, I mean there's lots of things that stick out. One of my first projects in Neo4j was this boolean logic search basically, you express a set of conditions in boolean logic and the graph would go and match them and find them and get your results, which is kind of hard to do with just about anything else but a graph does it marvelously and very, very fast. Then there's the Facebook graph search example
RVB: 06:31 Oh wow, yeah, I remember that one [chuckles]. 
MDM: 06:35 In three days I was able to replicate what Facebook was doing using some custom Cypher queries and a little bit of trickery from Ruby. 
RVB: 06:43 Didn't Facebook actually forbid that application or something like that? I heard something that you weren't able to do it anymore. 
MDM: 06:50 Yeah, I mean they were happy with it for about a year and then I got a cease and desist type letter saying that I was replicating their core functionality and they had me shut it down. And I was like, "That was the whole point - to replicate your functionality" [laughter]. With one guy and a few hours instead of a whole team for six months or a year. How long would it take them to build it? My POC was a lot smaller but hey. 
RVB: 07:14 It was pretty sophisticated. It had like natural language processing in it and everything. It was pretty sophisticated, I thought. 
MDM: 07:21 It's a lot of smoke and mirrors, but as far as the users of it are concerned it was magic, and that's really all we care about. You can do things the hard way or you can do things the easy way. The user doesn't care as long as he gets his results. 
RVB: 07:35 Yeah, I thought that was fantastic. I'll put a link to that. I don’t know if it's still on your blog, but I'll try and look for it and include it in the transcript and everything. So Max where is this going man? Where do you think we'll end up with graph technology in Neo4j in 10 years from now, what's your view of the future? 
MDM: 07:57 Oh man. I wish more people got into the graph and started thinking of their data initially as a graph, instead of as a set of tables and join tables and indexes and so on. I think we're not even close to that point yet. People still think about their data as tables and it's going to be a long time to get people off of that and onto a different model. And there are so many different competing models as well: 250 different vendors that let you install software on site, then another 250 that do hosted databases. It's a jungle and there's going to be some consolidation at some point. Hopefully some nice standards come out of it and we learn as a group how to do things better. My hope is that SQL dies a horrible death at some point and that we move on to a better language. I loathe with a great passion all the people who put SQL on these new technologies like SQL on Hadoop and SQL on Spark and SQL on Storm, I want to kill those people. They're going backwards. They're damaging the movement of "let's get better databases, let's get better query languages for data". Let's move the industry forward, not backwards. 
RVB: 09:07 I need to ask this. What is the number one thing that bothers you about SQL? 
MDM: 09:14 It's just the way we've been doing things for 30 years. At some point we have to evolve and say, "Okay, we tried this first and it was cool but we're better than that. We've grown as an industry. We can do more. We can think of our data in different ways, in alternate ways. Not everyone has to be shoehorned into the same way of doing things." 
RVB: 09:34 Yeah, absolutely, cool. Well man, I think, unless you have any flaming points left to make, we'll wrap up here. We want to keep these podcasts quite short. I think people will want to come to one of your talks maybe, right? You talk about Neo4j quite often. OSCON next, isn't it? 
MDM: 09:57 Yeah, I have Stampede in a week and then Oscon after that and then who knows. Also, just look at the blog MaxDiMarzi.com and follow me on Twitter for all kinds of craziness. 
RVB: 10:08 I do want to wish you lots of good luck with your new puppy. I hope to talk to you soon, Max. 
MDM: 10:16 All right, I'll go get some sleep. Thanks. 
RVB: 10:18 Cheers, man. Bye-bye. 
MDM: 10:19 Bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

Friday 24 July 2015

Loading the Belgian Corporate Registry into Neo4j - part 4

In this fourth and final blog post (parts 1, 2 and 3 were published before), I would like to try and summarize my experience in loading the Belgian Corporate Registry into Neo4j. Here's a couple of meaningful points, that I hope will benefit everyone.

1. Size matters: doing import at scale is totally different than doing it for a few hundred/thousand nodes and relationships. More memory is good. Tweaking the settings is good. In a real production environment it would probably have been a better idea to do this import offline. Read some of the documentation and my previous article for tips.


2. Complexity matters: the more connected the graph is, the more you will need to think about the import process in detail. Bulk loading stuff is easy, but connecting it up can be hard and needs to be thought through. The magic happens in the query plan. So take a look at a small import first to understand what is happening in every step of the plan - and make sure you avoid "expensive" steps that take a lot of resources. Oftentimes that will mean splitting up operations into smaller parts, like for example creating nodes first, and then adding the relationships - instead of writing the pattern in one go (see the sketch right after this list).

3. A fool with a tool: there are a range of different import tools at your disposal - but if you don't understand what they do, you may still fail. In part 2 I was super convinced that my funky bash+python wizardry was going to do the trick - but it didn't. I should have looked at the query plan in more detail, and thought about how to get around it. In hindsight, it would probably have been a good idea to look at offline import in more detail.

4. Dungeons and Dragons: down in the bowels of Neo4j there are still some nasty dungeons and dragons, like the Eager Pipe that we tackled in part 3. Our engineers are fighting these day and night, and know how to beat them. So the number one thing to do if you are struggling - is to reach out and talk to us. Otherwise it's all too easy to get lost.
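To make point 2 above concrete, here is a minimal sketch of the "split it up" approach, with made-up labels and a placeholder file (so not the actual KBO statements): first create the nodes, each in their own pass, and only then add the relationships.

// pass 1: create the Person nodes only
using periodic commit
load csv with headers from
"file:/<path>/people.csv" as csv
merge (:Person {name: csv.name});

// pass 2: create the Company nodes only
using periodic commit
load csv with headers from
"file:/<path>/people.csv" as csv
merge (:Company {name: csv.company});

// pass 3: only now connect the existing nodes
using periodic commit
load csv with headers from
"file:/<path>/people.csv" as csv
match (p:Person {name: csv.name}), (c:Company {name: csv.company})
merge (p)-[:WORKS_FOR]->(c);

Each statement touches a single label or pattern, which keeps every step of the query plan cheap and lets "using periodic commit" do its batching work.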

That's about it, for now. Please don't forget to look at the following links next time you want to do real import magic:



Hope this was useful. Feedback always welcome.

Cheers

Rik

Wednesday 22 July 2015

Podcast Interview with Stefan Armbruster, Neo Technology

Just over three years ago, me and a couple of other weirdos started working for this even weirder company :) called Neo Technology. Stefan Armbruster and I started on the same day, and over the years we have worked together a lot. Stefan is a superb guy, with a great taste in beers :), and a big vision for Neo4j - as you will learn in this interview:


Here's the transcript of our conversation (now including timestamps thanks to the TranscribeMe service):
RVB @ 00:02 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo, and here we are again recording a podcast session for our Neo4j graph database podcast. This morning, actually, on the show I have Stefan Armbruster with me from Munich. Hi, Stefan. 
SA @  00:20 Hi, Rik. How are you? 
RVB @ 00:21 I'm doing really well. How are you? 
SA @  00:24 Perfectly fine. 
RVB @ 00:24 Super. Is the weather as nice over there as it is over here? 
SA @  00:28 Yeah, we have more than 30 degrees today, so I already have the shorts and t-shirt on. 
RVB @ 00:36 [chuckles] Same here, same here. Hey, Stefan, this is a podcast where we talk to people about their relationship to graphs and graph databases, so why don't you introduce yourself and tell us a little bit about yourself, your work, and how you got to graph databases? 
SA @  00:53 Yeah, as you have mentioned, my name is Stefan. I'm from Munich in Germany. I'm one of the field engineers with Neo Technology, so it's basically the interface between the customers and the engineers... so we work with a very deep technical background, and the ultimate goal is to make the customer successful, make sure that projects run well, and make them happy so that they spend more bucks on us, of course. 
RVB @ 01:24 If I recall correctly, Stefan, you and me started together about a year ago, didn't we? 
SA @  01:29 About three years ago, I-- 
RVB @ 01:31 Three years ago, I mean, yes [chuckles]. 
SA @  01:33 At this nice company meeting in Sweden they served such good food that I finally decided to sign the contract [chuckles]. But to be more honest, my story with Neo started already more than seven years ago, at the first Gr8conf conference in Copenhagen. At the time, I was working as a freelancer, did a little work with Grails and built some websites and some backend applications for my customers, and it was the first conference on the Grails framework. There was another guy attending from Sweden, Emil, and after some beers in the evening, I got in touch with him, he talked some crazy things about graph databases - some fancy stuff - and, at first, I said, "Okay, it sounds interesting, but I don't see a use case for me here." So I put it on my list of technologies to take a look at. After some months, I really looked at Neo4j, so this was way before the 1.0 version, it was a 0.something whatever. I was really pleased by the cleanness of the Java API. At the time, there was no Cypher, there was no server, so Neo4j was just a kind of embedded graph database for Java. That's [inaudible] Neo4j. 
RVB @ 03:05 And then so you wrote the groovy driver? Is that true or...? 
SA @  03:10 It's not [inaudible]. For Groovy you don't need a driver because Groovy is basically Java, so I wrote the Grails driver, and this was then the next step: a couple of months later I was asked to deliver a project where you can place comments on the events of a football match, so you can express your opinion - well, was the red card in that game in minute 45 justified, or who would be the next champion? Will the trainer of that team be fired? So they tried to aggregate the opinions of the fans together, and since you can put a comment on everything and everything can potentially be connected with everything else in the data set, it's natural as a graph. This was my eye-opener. I said, "Okay, we should use Neo4j for our project." I also wanted to use Grails because that had been my usual framework over the years, and then I decided to bring the two things together and wrote the first version of the Grails driver. 
RVB @ 04:17 And nowadays you still use that a lot, or do you use more Java or--? 
SA @  04:22 To be honest, I'm still maintaining the Grails driver, but it's a little bit rotten, so I'm spending some time to deliver a version 2.0 of the Grails driver which will then be based only on Cypher and JDBC, and therefore it's also future-proof already. If we look a little bit into the future, there are plans to merge efforts with the Spring Data Neo4j project and reuse their mapper, because that's a little bit more powerful than what I did. 
RVB @ 04:58 Super cool. So, Stefan, now you mentioned sort of how you got into it, but what do you love about it? What do you think is so powerful about a graph and a graph database? Is there something that jumps out? 
SA @  05:09 What I like, of course, I'm a big fan of open source software so I really appreciate that we're completely open source. I think it's very easy to make more sense out of data basically. So you can organize your data into a graph and then you can have immediately some new insights which you didn't know beforehand. 
RVB @ 05:37 What kind of insights would that be? Things like unknown connections or stuff like that? 
SA @  05:41 Unknown connections, so you can easily find connections. As a simple example, we both work for Neo so we have a [?] relationship to Neo as a company, and as an indirection, since Neo is a rather small [?] - 100 people - there is an implicit relationship between you and me because we know each other. So that can be found without having it explicitly as data, so you can infer hidden knowledge basically. In my opinion, it's more or less kind of a philosophical thing. The interesting part from a pure IT perspective, the cool thing, is that you can [create?] a graph database independent of data size. So as long as your queries stay local they don't get slower, just because of a growing data set. 
RVB @ 06:41 Yeah, that's a really powerful feature. Every customer that I meet, or user that I meet, loves that graphical kind of end piece of it. 
SA @  06:49 Yeah, that's super cool. 
RVB @ 06:52 We actually have a hidden connection in common as well - our love for fantastic beers, right [chuckles]? 
SA @  06:58 Oh, yes, oh, yes. Not the common kind of beer but a different kind of beer, but yes. 
RVB @ 07:03 Absolutely. Now, you gave me some really nice beers a while back. That's really nice of you. 
SA @  07:09 That was the Unertl, right? 
RVB @ 07:10 Yes, I think so. 
SA @  07:11 I love that one. 
RVB @ 07:12 Very cool. So what does the future hold, Stefan? Where is this thing going? What do you think is the big things coming up for you and for us, for the industry? Get your perspective on that. 
SA @  07:26 I think the adoption will grow. Since graph databases are, from a mental perspective, rather close to relational databases, it's easy for people to move over so they don't need to change too much, but of course they need to change something. I think we will see graph databases being as commonly used in the industry as Oracle and DB2 are today. 
RVB @ 07:57 That's an aggressive statement [chuckles]. 
SA @  08:00 It's in fact going to happen. Next year - well, probably not next year, but if you look at the larger scope, four, five years - that is probably where I see the future. 
RVB @ 08:10 It's a very natural revolution. So many people, it's such a natural way for people to deal with data, I suppose. 
SA @  08:18 Yeah, yeah, exactly. And if you look more at the short term, what I'm really looking forward to is the binary protocol-- we'll see it most likely at the end of the year, for version 3.0-- 
RVB @ 08:29 Yeah, yeah. 
SA @  08:31 That makes the interactions with client drivers much easier and much more unified-- 
RVB @ 08:40 That's the replacement protocol for REST, or complementary to REST. 
SA @  08:43 Complementary, so we will have the REST protocol forever, I guess, but the binary one on the side provides kind of a completely redesigned interface. There is already a kind of spike on the Java driver that uses the protocol and, on top of the Java driver, Rickard is doing a little bit of spiking on JDBC for that. Because I think JDBC is the key integration technology that allows anyone to use Neo4j within their existing infrastructure. All of the BI tools, all the reporting tools, everyone has a JDBC interface, and by that we could just easily plug the graph database into existing infrastructures. 
RVB @ 09:32 Yeah, because the current JDBC driver uses REST and you can replace that REST layer with a binary protocol-- 
SA @  09:38 Yeah, even people using the current REST-based JDBC driver, migrating that over is a complete no-brainer because they don't have to rewrite a single-- they just have to change the URL and the driver, but they don't have to change a single line of code. 
RVB @ 09:56 Super interesting. Thank you, Stefan. We want to keep these podcasts reasonably short so thank you for coming online and talking to me. It was really interesting, and I'm hoping we'll see each other again very soon, probably this summer. 
SA @  10:14 Hopefully. 
RVB @ 10:15 Thanks [chuckles]. 
SA @  10:15 With some beers. 
RVB @ 10:16 With some beers, exactly. Thank you, Stefan. Have a nice day. 
SA @  10:20 Thank you so much, Rik. Have a good day. Thank you. Good-bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

Monday 20 July 2015

Loading the Belgian Corporate Registry into Neo4j - part 3

In this third part of the blogpost series around the Belgian Corporate Registry, we're going to get some REAL success. After all the trouble that we had before in part 1 (with LOAD CSV) and part 2 (with lots of smaller CSV files, bash and python scripts), we're now going to get somewhere.

The thing is that, after having split the files into smaller chunks and iterated over them with Python, I still was not getting the performance I needed. Why oh why was that? I looked at the profile of one of the problematic load scripts, and saw this:
I checked all of my setup multiple times, read and re-read Michael Hunger's fantastic Load CSV summary, and still was hitting problems that I should not be hitting. This is where I started looking at the query plan in more detail, and spotted the "Problem with Eager". I remembered reading one of Mark Needham's blogposts about "avoiding the Eager", and not fully understanding it as usual - but realizing that this must be what is causing the trouble. Let's drill into this a little more.

Trying to understand the "Eager Operation"

I had read about this before, but did not really understand it until Andres explained it to me again: in all normal operations, Cypher loads data lazily. See for example this page in the manual - it basically just loads as little as possible into memory when doing an operation. This laziness is usually a really good thing. But it can get you into a lot of trouble as well - as Michael explained it to me:
"Cypher tries to honor the contract that the different operations within a statement are not affecting each other. Otherwise you might up with non-deterministic behavior or endless loops. Imagine a statement like this: 
MATCH (n:Foo) WHERE n.value > 100 CREATE (m:Foo {value: n.value + 100}); 
If the two statements would not be isolated, then each node the CREATE generates would cause the MATCH to match again etc. an endless loop. That's why in such cases, Cypher eagerly runs all MATCH statements to exhaustion so that all the intermediate results are accumulated and kept (in memory). 
Usually with most operations that's not an issue as we mostly match only a few hundred thousand elements max. With data imports using LOAD CSV, however,  this operation will pull in ALL the rows of the CSV (which might be millions), execute all operations eagerly (which might be millions of creates/merges/matches) and also keeps the intermediate results in memory to feed the next operations in line. This also disables PERIODIC COMMIT effectively because when we get to the end of the statement execution all create operations will already have happened and the gigantic tx-state has accumulated."
So that's what was going on in my load csv queries. MATCH/MERGE/CREATE caused an eager pipe to be added to the execution plan, and it effectively disabled the batching of my operations "using periodic commit". Apparently quite a few users run into this issue even with seemingly simple LOAD CSV statements. Very often you can avoid it, but sometimes you can't.
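One practical tip (from Neo4j 2.2 onwards): you can check whether a statement will run into this before executing it, by prefixing it with EXPLAIN - the plan is shown without touching the data, and you can look for an Eager operator in it. A hedged sketch with placeholder labels and file:

explain
load csv with headers from
"file:/<path>/somefile.csv" as csv
merge (a:Thing {id: csv.id})
merge (b:OtherThing {id: csv.otherId})
merge (a)-[:RELATES_TO]->(b);

If an Eager step shows up in the plan, that's your cue to split the statement up - or, as below, to switch to a tool that only feeds it one row at a time.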

Try something different: neo4j-shell-tools

So I was wondering if there were any other ways to avoid eager, or if there would be any way for the individual cypher statement to "touch" less of the graph. That's when I thought back to a couple of years back, when we did not have an easy and convenient tool like LOAD CSV yet. In those early days of import (it's actually hard to believe that this is just a few years back - man have we made a lot of progress since that time!!!) we used completely different tools. One of those tools was basically a plugin for the neo4j-shell, called the neo4j-shell-tools.

These tools still offer a lot of functionality that is terribly useful at times - among which a cypher-based import command, the import-cypher command. Similar to LOAD CSV, the command has a batching option that will "execute each statement individually (per csv-line) and then batch statements on the outside, so that (unintentionally, because they were written long before load csv) they circumvent the eager problem by only having one row of input per execution". Nice - so this could actually solve it! Exciting.

So then I spent about 30 mins rewriting the load csv statements as shell-tools commands. Here's an example:
//connect the Establishments to the addresses 
import-cypher -i /<path>/sourcecsv/address.csv -b 10000 -d , -q with distinct toUpper({Zipcode}) as Zipcode, toUpper({StreetNL}) as StreetNL, toUpper({HouseNumber}) as HouseNumber, {EntityNumber} as EntityNumber match (e:Establishment {EstablishmentNumber: EntityNumber}), (street:Street {name: StreetNL, zip:Zipcode})<-[:PART_OF]-(h:HouseNumber {houseNumber: HouseNumber}) create (e)-[:HAS_ADDRESS]->(h);
In this command the -i indicates the source file, -b the REAL batch size of the outside commit, -d the delimiter, and finally -q the fact that the source file is quoted. Executing this in the shell was dead easy of course, and immediately also provides nice feedback of the progress:

Just a few minutes later, everything was processed.

So this allowed us to quickly and conveniently execute all of the import statements in one go. Once we had connected all the Enterprises and Establishments to the addresses, the model looked like this:


So then all that was left to do was to connect the Enterprises and Establishments to the activities:
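As an illustration, the activity link looks much like the address one above. This is a hedged sketch - it assumes activity.csv exposes EntityNumber and NaceCode columns (check the actual headers) and uses a hypothetical HAS_ACTIVITY relationship into the code hierarchy:

//connect the Enterprises to their activity codes
import-cypher -i /<path>/sourcecsv/activity.csv -b 10000 -d , -q with {EntityNumber} as EntityNumber, {NaceCode} as NaceCode match (e:Enterprise {EnterpriseNumber: EntityNumber}), (c:Code {name: NaceCode}) create (e)-[:HAS_ACTIVITY]->(c);

The Establishment variant is the same, just matching on EstablishmentNumber instead.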



The total import time of this entire dataset - on my MacBook Air 11 - was about 3 hours, without any hiccups whatsoever.

So that was a very interesting experience. I had to try lots of different approaches - but I managed to get the job done.

As with the previous parts of this blog series, you can find all of the scripts etc on this gist.

In the last section of this series, we will try to summarize our lessons learnt. In any case I hope this has been a learning experience for you as well as it was for me.

Cheers

Rik

Thursday 16 July 2015

Loading the Belgian Corporate Registry into Neo4j - part 2

In the previous blogpost of this series, I was trying to import the Belgian Corporate Registry dataset into Neo4j - all in preparation of some interesting querying. Unfortunately, as you could read in part 1, this was not that easy. Load CSV queries were starting to take a very long time, and my initial reaction was that the problem must be down to the size of the CSV file. In this blogpost, I will take you through that experience - and show you what I did, and how I - again - only partially succeeded.

As a refresher, here's the dataset that we are looking at:

As you can see, the address.csv file is quite big - and that was already starting to be a problem in part 1. But I quickly realised that if I then would want to connect the Enterprises and Establishments to the respective Activities by loading the 669 MB activity.csv file, the problems would just get even bigger. I needed a solution.

Bash and Python to the rescue

So here was my first idea on how to solve this:

  • I would figure out a way to split the address.csv and/or activity.csv file into multiple smaller csv files
  • I would then create a python script that would iterate over all of the generated files, and execute the loading transactions over these much smaller CSV files.

Sounded like a good idea to me, so I explored it and actually partially succeeded - and learned some more bash and python along the way :) ... what's not to like?

Here's the Bash script to split the csv file, in several distinct steps:

1. First create the split files with 25000 lines each

tail -n +2 ./sourcecsv/address.csv | split -l 25000 - ./splitcsv/splitaddress2/splitaddress_
This command takes the "tail" of the address.csv file starting from line 2, pipes that into the split command, and generates a separate file for every 25000 lines that it encounters. The output looks like this:

Then of course I also needed to copy the header row of the original address.csv file to each of the splitaddress_ files. That one took some time for me to figure out, but I managed it with this simple script:
for file in ./splitcsv/splitaddress_*
do
    head -n 1 ./sourcecsv/address.csv > tmp_file
    cat $file >> tmp_file
    mv -f tmp_file $file
done
What this does is simple: it loops through all the splitaddress_* files, takes the first line of the address.csv file, copies that into a tmp_file, then concatenates the splitaddress_* file with the tmp_file and renames it to the splitaddress_* file... easy! So then you get a bunch of smaller, 25000-line csv files looking like this one:

The last step in my Shell wizardry was to then rename the files to have numeric increments instead of alpha ones - just so that we can process them in a python script and iterate over the list of files. This turned out to be a bit trickier and I had to do some significant googling and then copying and pasting :) ... Here's what I ended up with:

ls -trU ./splitcsv/splitaddress_*| awk 'BEGIN{ a=0 }{ printf "mv %s ./splitcsv/splitaddress_%d\n", $0, a++ }' | bash
So basically three commands:

  • listing the files splitaddress_*
  • passing them to awk and letting it iterate over it and renaming the files one by one
  • piping that to bash
Not trivial - but it works:
So that gave me a bunch of smaller, 25k line files that I could work with in python. So let's see how I did that.

Iterating over CSV files with Python

I created a simple python script to iterate over the files - here it is:
import datetime
from py2neo import Graph
from py2neo.packages.httpstream import http
http.socket_timeout = 9999

graph = Graph()

print "Starting to process links between Enterprises and Addresses..."

for filenr in range(0,113):
    tx1 = graph.cypher.begin()
    statement1 = """
        load csv with headers from
        "file:/<path>/splitcsv/splitaddress/splitaddress_"""+str(filenr)+"""" as csv
        with distinct toUpper(csv.Zipcode) as Zipcode, toUpper(csv.StreetNL) as StreetNL, toUpper(csv.HouseNumber) as HouseNumber, csv.EntityNumber as EntityNumber
        match (e:Enterprise {EnterpriseNumber: EntityNumber}),
        (street:Street {name: StreetNL, zip:Zipcode})<-[:PART_OF]-(h:HouseNumber {houseNumber: HouseNumber})
        create (e)-[:HAS_ADDRESS]->(h);
        """

    tx1.append(statement1)

    tx1.process()
    tx1.commit()
    print "Enterprise Filenr: "+str(filenr)+" processed, at "+str(datetime.datetime.now())

print "Starting to process links between Establishments and Addresses..."

for filenr in range(0,113):
    tx2 = graph.cypher.begin()
    statement2 = """
        load csv with headers from
        "file:/<path>/splitcsv/splitaddress/splitaddress_"""+str(filenr)+"""" as csv
        with distinct toUpper(csv.Zipcode) as Zipcode, toUpper(csv.StreetNL) as StreetNL, toUpper(csv.HouseNumber) as HouseNumber, csv.EntityNumber as EntityNumber
        match (e:Establishment {EstablishmentNumber: EntityNumber}),
        (street:Street {name: StreetNL, zip:Zipcode})<-[:PART_OF]-(h:HouseNumber {houseNumber: HouseNumber})
        create (e)-[:HAS_ADDRESS]->(h);
        """

    tx2.append(statement2)

    tx2.process()
    tx2.commit()
    print "Establishment Filenr: "+str(filenr)+" processed, at "+str(datetime.datetime.now())

Yey! This actually worked! Here's how the script started running over the files:


And then as you can see below, 22 minutes later all the Enterprises and Establishments were correctly linked to their addresses! Now we are getting somewhere!

But... there is a but. The last step of this exercise is to connect the Enterprises to their "Activities", which are part of the Code-tree in our model. And although I actually created a Python script to do that, and that script actually worked quite well - it was just too slow.

So that meant going back to the drawing board and figuring out another way to do this in a reasonable amount of time. In hindsight, everything I wrote about in this blogpost was not really used for the actual import - but I wanted to show you everything that I did and all of the stuff that I learned about bash and python and Neo4j...

All of the material mentioned on this blog series is on github if you want to take a look at it.

Hope this was still useful.

Cheers

Rik

Monday 13 July 2015

Loading the Belgian Corporate Registry into Neo4j - part 1

Every now and again, my graph-nerves get itchy. It feels like I need to get my hands dirty again, and do some playing around with the latest and greatest version of Neo4j. Now that I have a bit of a bigger team in Europe working on making Neo4j the best thing since sliced bread, it seems to become more and more difficult to find the time to do that – but every now and again I just “get down on it” and take it for another spin.

So recently I was thinking about how Neo4j can actually help us with some of the fraud analysis and fraud detection use cases. This one has been getting a lot of attention recently, with the coming out of the Swissleaks papers from the International Consortium of Investigative Journalists (ICIJ). Our friends at Linkurio.us did some amazing work there. And we also have some other users that are doing cool stuff with Neo4j, OpenCorporates to name just one. So I wanted to do something in that “area” and started looking for a cool dataset.

The KBO dataset

I ended up downloading a dataset from the Belgian Ministry of Economics, who run the “Crossroads Database for Corporations” (“Kruispuntbank voor Ondernemingen”, in Dutch – aka “the KBO”). Turns out that all of the Ministry’s data on corporations is publicly available. All you need to do is register, and then you can download a ZIP file with all of the publicly registered organisations out there.
The full 200MB zip file contains a set of CSV files, holding all the data that we would possibly want for this exercise. Unzipped it’s about 1GB of data, and there’s quite a lot of it as you can see from this overview:

So you can see that this is somewhat of a larger dataset: about 22 million CSV lines that would need to get processed – so surely that would require some thinking … So I said “Challenge Accepted” and got going.

The KBO Model

The first thing I would need in any kind of import exercise would be a solid datamodel. So I thought a bit about what I wanted to do with the data afterwards, and I decided that it would be really interesting to look at two specific aspects in the data:
  • The activity types of the organisations in the dataset. The dataset has a lot of data about activity categorisations – definitely something to explore.
  • The addresses/locations of the organisations in the dataset. The thinking would be that I would want to understand interesting clusters of locations where lots of organisations are located.
So I created a model that could accommodate that. Here’s what I ended up with.

So as you can see, there’s quite  a few entities here, and they essentially form 3 distinct but interconnected hierarchies:
  • The “orange” hierarchy has all the addresses in Belgium where corporations are located. 
  • The “green” hierarchy has the different “Codes” used in the dataset, specifically the activity codes that use the NACE taxonomy. 
  • The “blue” hierarchy gives us a view of links between corporations/enterprises and establishments. 
So there we had it. A sizeable dataset and a pretty much real-world model that is somewhat complicated. Now we could start thinking about the Import operations themselves.

Preparing Neo4j for the Import

Now, I have done some interesting “import” jobs before. I know that I need to take care when writing to a graph, as we are effectively doing write operations that have a lot more intricate work going on – we are writing data AND structure at the same time. So that means that we need to have some specific settings adjusted in Neo4j that would really work in our favour. Here’s a couple of things that I tweaked:
  • In neo4j.properties, I adjusted the cache settings. Caching typically just introduces overhead, and when you are writing to the graph these caches really don’t help at all. So I added cache_type=weak to the configuration file.
  • In neo4j-wrapper.conf, I adjusted the Java heap settings. While Neo4j is making great strides towards making memory management less dependent on the Java heap, today you should still assign a large enough heap for import operations. Now, my machine only has 8GB of RAM, so I had to leave it at a low-ish 4GB. The way to force that heap size is to have the initial memory assignment be equal to the maximum memory assignment, by adding two lines:
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096

to the configuration file.
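For completeness: the two heap lines above go into conf/neo4j-wrapper.conf, and the cache setting mentioned in the first bullet is a single line in conf/neo4j.properties:

# conf/neo4j.properties - disable the object cache during the import
cache_type=weak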

That’s the preparatory work done. Now onto the actual import job.

Importing the data using Cypher’s Load CSV

The default tool for loading CSV files into Neo4j of course is Cypher’s “LOAD CSV” command. So of course that is what I used at first. I looked at the model, looked at the CSV files, and wrote the following Cypher statements to load the “green” part of the datamodel first – the code hierarchy. Here’s what I did:

create index on :CodeCategory(name); 
using periodic commit 1000
load csv with headers from
"file:/…/code.csv" as csv
with distinct csv.Category as Category
merge (:CodeCategory {name: Category});

Note that the “with distinct” clause in this is really just geared to making the “merge” operation easier, as we will be ensuring uniqueness before doing the merge. The “periodic commit” allows us to batch update operations together for increased throughput. 

So then we could continue with the rest of the code.csv file. Note that I am trying to make the import operations as simple as possible on every run – rather than trying to do everything in one go. This is just to make sure that we don’t run out of memory during the operation.
create index on :Code(name); 
using periodic commit
load csv with headers from
"file:/…/code.csv" as csv
with distinct csv.Code as Code
merge (c:Code {name: Code}); 

using periodic commit
load csv with headers from
"file:/…/code.csv" as csv
with distinct csv.Category as Category, csv.Code as Code
match (cc:CodeCategory {name: Category}), (c:Code {name: Code})
merge (cc)<-[:PART_OF]-(c);
create index on :CodeMeaning(description); 
using periodic commit
load csv with headers from
"file:/…/code.csv" as csv
merge (cm:CodeMeaning {language: csv.Language, description: csv.Description}); 

using periodic commit
load csv with headers from
"file:/…/code.csv" as csv
match (cc:CodeCategory {name: csv.Category})<-[:PART_OF]-(c:Code {name: csv.Code}), (cm:CodeMeaning {language: csv.Language, description: csv.Description})
merge (c)<-[:MEANS]-(cm);
As you can see from the screenshot below, this takes a while. 

If we profile the query that is taking the time, then we see that it’s probably related to the CodeMeaning query – where we add Code-meanings to the bottom of the hierarchy. We see the “evil Eager” pipe come in, where we basically know that Cypher’s natural laziness is being overridden by a transactional integrity concern. It basically needs to pull everything into memory, taking a long time to do – even on this small data-file. 

This obviously already caused me some concerns. But I continued to add the enterprises and establishments to the database in pretty much the same manner:

//load the enterprises
create constraint on (e:Enterprise)
assert e.EnterpriseNumber is unique;

using periodic commit 5000
load csv with headers from
"file:/…/enterprise.csv" as csv
create (e:Enterprise {EnterpriseNumber: csv.EnterpriseNumber, Status: csv.Status, JuridicalSituation: csv.JuridicalSituation, TypeOfEnterprise: csv.TypeOfEnterprise, JuridicalForm: csv.JuridicalForm, StartDate: toInt(substring(csv.StartDate,0,2)) + toInt(substring(csv.StartDate,3,2))*100 + toInt(substring(csv.StartDate,6,4))*10000});

Note that I used a nice trick (described in this blogpost) to convert the date information in the csv file to a number. 
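To make the trick explicit: it recomposes a dd-mm-yyyy string into a sortable yyyymmdd integer. A quick illustrative check (the date value is made up):

// "09-07-2015" becomes 20150709: day + month*100 + year*10000
return toInt(substring("09-07-2015",0,2))
  + toInt(substring("09-07-2015",3,2))*100
  + toInt(substring("09-07-2015",6,4))*10000;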

//load the establishments
create constraint on (eb:Establishment)
assert eb.EstablishmentNumber is unique;

using periodic commit
load csv with headers from
"file:/…/establishment.csv" as csv
create (es:Establishment {EstablishmentNumber: csv.EstablishmentNumber, StartDate: toInt(substring(csv.StartDate,0,2))+toInt(substring(csv.StartDate,3,2))*100+toInt(substring(csv.StartDate,6,4))*10000}); 

using periodic commit
load csv with headers from
"file:/…/establishment.csv" as csv
match (e:Enterprise {EnterpriseNumber: csv.EnterpriseNumber}), (es:Establishment {EstablishmentNumber: csv.EstablishmentNumber})
create (es)-[:PART_OF]->(e);
Interestingly, all of this is really fast. 
Especially if you look at what was happening above with the Code-meanings. The reason for this is of course the fact that we are doing a lot of simpler operations here while adding the data. The entire execution time on my little laptop was 7 minutes and 10 seconds to add 3180356 nodes and 1602575 relationships. Not bad at all.

At this point the model looks like this:

Then we start working with the address data, and start adding this to the database. It works very well for the Cities and Zipcodes:

create constraint on (c:City)
assert c.name is unique;
 
create constraint on (z:Zip)
assert z.name is unique;
 
create index on :Street(name); 
//adding the cities
using periodic commit 100000
load csv with headers from
"file:/…/address.csv" as csv
with distinct toUpper(csv.MunicipalityNL) as MunicipalityNL
merge (city:City {name: MunicipalityNL});
 
//adding the zip-codes
using periodic commit 100000
load csv with headers from
"file:/…/address.csv" as csv
with distinct toUpper(csv.Zipcode) as Zipcode
merge (zip:Zip {name: Zipcode});
 
// connect the zips to the cities
using periodic commit 100000
load csv with headers from
"file:/…/address.csv" as csv
with distinct toUpper(csv.Zipcode) as Zipcode, toUpper(csv.MunicipalityNL) as MunicipalityNL
match (city:City {name: MunicipalityNL}), (zip:Zip {name: Zipcode})
create unique (city)-[:HAS_ZIP_CODE]->(zip);
Once we have this, we then wanted to add the streets to the Zip - not to the city, because we can have duplicate streetnames in different cities. And this is a problem:


It takes a very long time. The plan looks like this:

And unfortunately, it gets worse for adding the  HouseNumbers to every street. It still works – but it’s painfully slow.



I tried a number of different things, called in the help of a friend, and finally got it to work by replacing “Merge” by “Create Unique”. That operation does a lot less checking on the total pattern that you are adding, and can therefore be more efficient. So oof. That worked. 
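To illustrate the difference, here is a sketch of the two variants of the street-loading statement. This is illustrative only, not the exact statements from the gist - the PART_OF relationship between Street and Zip and the Street properties are assumptions based on the model used elsewhere in this series:

// slow: merging on the whole pattern forces an expensive uniqueness check per row
using periodic commit 100000
load csv with headers from
"file:/…/address.csv" as csv
with distinct toUpper(csv.Zipcode) as Zipcode, toUpper(csv.StreetNL) as StreetNL
match (zip:Zip {name: Zipcode})
merge (zip)<-[:PART_OF]-(street:Street {name: StreetNL, zip: Zipcode});

// faster: create unique does far less checking on the pattern
using periodic commit 100000
load csv with headers from
"file:/…/address.csv" as csv
with distinct toUpper(csv.Zipcode) as Zipcode, toUpper(csv.StreetNL) as StreetNL
match (zip:Zip {name: Zipcode})
create unique (zip)<-[:PART_OF]-(street:Street {name: StreetNL, zip: Zipcode});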

Unfortunately we aren’t done yet. We still need to connect the Enterprises and Establishments to their addresses. And that’s where the proverbial sh*t hit the air rotating device – and when things stopped working. So we needed to address that, and that’s why there will be a part 2 to this blogpost very soon explaining what happened. 

All of the material mentioned on this blog series is on github if you want to take a look at it.

Hope you already found this useful. As always, comments are very welcome.

Cheers

Rik

Thursday 9 July 2015

Podcast Interview with Jaroslaw Palka, Allegro Group

Wow. Since we started doing these Neo4j Graph Database podcasts, I have spoken to 30 (!!) different people. Ah-may-zhing!!! They have been truly wonderful conversations, all of them, and I have truly enjoyed this ride :) ...

Today I am publishing the 30th episode, which is a great conversation with Jaroslaw Palka, of Allegro Group. Jaroslaw is a long time member of the Neo4j ecosystem, with a lot of interesting perspectives on it. Here's the recording:

Here's the transcript of our conversation:
RVB: Hello, everyone. My name is Rik, Rik Van Bruggen from Neo Technology, and here I am again recording a podcast for our graph database, Neo4j podcast series. Today I have a guest joining me on Skype, all the way from Poland. Hello, Jaroslaw, Jaroslaw Palka.
JP: Hello.
RVB: Hi, and thanks for joining me. I appreciate it. Very cool. Jaroslaw, you've been in the Neo4j ecosystem for a while. Do you mind introducing yourself, and what's your relationship to the wonderful world of graph databases?
JP: I work in Krakow and I've been with the JVM and Java since, I think, '99. I worked many years as an architect and coach, doing executive trainings for people in various organizations. My journey with graphs, I think it started in 2005 or '06 - it's hard to remember all the dates [chuckles] - when at one of the organizations we were trying to migrate a large database which was supporting online flight shopping - so basically searching for flights, the best flights, the shortest or the cheapest flights - and we were trying to migrate from the relational database to graphs. Because what we found out is that basically the structure we were working with is a graph, and the problems we were solving are typical graph problems: finding the shortest, the cheapest - but it is not always the shortest path. Sometimes you are looking for the quickest flight so you don't have a lot of stops, or you are looking for the cheapest flight, or you are looking for a flight with specific airlines or--
RVB: So that use case sort of got you going in the world of graph databases. You've done some other use cases as well, right? You were telling me about recommendations, access control, all that wonderful stuff as well.
JP: Yes. The problem is that when you start with it, from my perspective, after first project, I started to see graphs everywhere, and everything in my life started to be either a node or an edge [chuckles].
RVB: I know the feeling, Jaroslaw.
JP: Yeah, I think this is one of the so-called dangers of graph databases and thinking in graphs: it is pretty easy to translate your problem into the structure of a graph. I think this is the most appealing thing for me - you don't need specialized training and you don't need to read tons of books about the model, because it is so natural to think about things this way.
RVB: That sort of leads me into the second topic that I always ask on this podcast series. What do you like about graphs? What is so good about it in your opinion? I'm hearing the modeling advantages that you just mentioned right there. Want to give us your perspective there?
JP: Yeah, sure. So first, modelling: it is important but also quite easy. If you don't get into too much detail about directed and undirected graphs, hypergraphs and all this stuff, and just focus on graphs, you can explain to a nontechnical person how it works. It's pretty easy. When you work with business people, you soon get a common vocabulary. It is really easy to explain. You don't need advanced modelling tools - just a whiteboard and a brain to start drawing and planning the graph.
JP: The second thing, which is really close to my heart: I truly believe in emergent architecture, so I don't believe we can plan everything ahead of time and know how it will work when the requirements change fast and the customer sometimes doesn't really know what he needs - we discover what he needs. The wonderful thing is that I can build my initial structure of connections and nodes, and over time I can evolve the structure of the graph. Especially - the one thing I like is that graphs really start pretty simple, but as I start to write queries and add Cypher to it, I start to see that there are actually connections I hadn't seen, so I can then materialize those connections and enrich my graph with additional connections or additional nodes, just to make sure that my Cypher query is the fastest possible query I can have.
RVB: There's actually a really good match with things like Agile development methodologies and those types of things. Is that what I'm hearing?
JP: Yes, this is for me really important, that I don't have to plan everything ahead and I can build the queries, build the database, as I need, as the product changes.
RVB: I think that's a really cool perspective. I think you're totally right about it so I really appreciate that. Where do you think this is going, Jaroslaw? Do you have any wishes or ideas about where this technology should be going in the next couple of years? Anything you want to throw in there?
JP: Yeah, sure. I think the biggest challenge - and it's not only a Neo4j problem because, let's be honest [chuckles], it's not the only graph database engine in the world, but for me it's the best, because I really like Cypher and that was one of the best decisions, building a language like that to query a database - I think the problem with the graph model, and we all need to think really hard about it, is the size of the data sets we are dealing with. As you know, we don't have a good way to split a graph into sub-graphs and have those on separate machines, and because of the connected nature of the data, we need to be able to squeeze our data set onto one machine.
RVB: You're talking about graph partitioning, right? That's [crosstalk].
JP: Yeah, yeah. We still don't have a good approach developed for it, so [chuckles] no strong theoretical foundation. I think the guys from academia need to meet with the people that work with graphs...
RVB: Well, a lot of work has been done around partitioning specifically graphs so in a general case it's extremely difficult, and you can actually almost prove that it's impossible. But there has been a lot of work and also at Neo4j on coming up with partitioning algorithms that will be specific to your domain. So if you would tell us more about your data, then we would be able to make much more sensible judgments about where the data should go on which machine and as you probably know we've done our homework already and we're hoping that will lead into a product in a future version of Neo4j. But it is early days still. It's a very complicated [crosstalk]--
JP: You know, it is pretty easy to partition if you treat your graph database, the Neo4j database, as your usual data source for a separate application, so I think that would be easier - but if you have partitioned data and you want to run the shortest possible path over the whole graph, that can be tricky [chuckles].
RVB: Yeah, as soon as you hit the machine boundary you have a problem, right? So it's a very difficult problem to solve. We are trying to make a solid dent into that problem and we're-- there's a lot of work going into that. People like Jim Webber are really actively involved with that. But I think it's a difficult one from multiple perspectives. This is just my personal perspective, but on the one hand, there's this hugely complicated problem, and on the other hand you have a situation where the vast majority of users and clients don't really need that. You know what I mean?
JP: Yes, that's true. And one important thing: for example, on my first Neo4j project, we pushed all of the data we had in SQL - because it was basically a migration from SQL to Neo4j - so we pushed everything to Neo4j, and I think it was one of the biggest single mistakes in my life, because actually you don't need everything. So I truly believe in polyglot persistence: you push to the graph only the data you will need for doing the queries. And for all the additional, heavy things, you can have a separate store that you can manage. Fortunately, we are in a place where people start to think that having two, three different databases in a single system is not a bad thing, so that's what we see at the moment with Neo4j.
RVB: That's completely along the lines of what we're thinking: combine different data stores for different problem sets and have a much more task-oriented setup. That's very much a recurring design pattern, I think.
JP: Yeah, so basically this is the pattern I see with this: Neo4j is quite often used as a supplemental database - kind of a tricky index, if you like [chuckles] - so people keep their SQL databases and they add one or more Neo4j instances, and ask different questions because of the different structures in place. So at the moment this is where I see organizations playing with Neo4j: as an engine you can ask really tricky questions of. So if you are asking about the future, I think the next step is to push organizations to think that actually your graph database can be your master data. Because at the moment it's mostly SQL and the database - the name we shouldn't use [chuckles] - that is comfortable in this space, and all the things like Neo4j, Cassandra, are just supplementary to the SQL model.
RVB: Very good. Thank you so much for sharing your thoughts on that. I think that was very, very interesting and useful. I really appreciate it, and I think we're going to wrap up the podcast now. I look forward to speaking to you again at one of the future events. Thank you, Jaroslaw.
JP: Yeah, thank you. Bye.
RVB: Have a nice day, bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

Saturday 4 July 2015

AC/DC is on Spotify - need to celebrate with Graph Karaoke!

As reported this week in the media, AC/DC is now available on different streaming sites. Finally. So we need to celebrate that with some GRAPH KARAOKE of course. Here's one of my all-time favourite songs:



Hope you play it LOUD and on REPEAT :))

Cheers

Rik