Thursday 12 December 2013

Saint Nicolas brought me a new Batch Importer!!!

After my previous blog post about import strategies, the inimitable Michael Hunger took my pros and cons to heart and created a new version of the batch importer, which has even been updated for the latest GA version of neo4j 2.0. Previously you needed Maven to build the importer, which I neither had nor knew, so I never used it. But now it's supposed to be as easy as downloading a zip file, unzipping it and running it, so of course I HAD to test it out. Here's what happened.

Yet another dataset

First: I wanted to create a "large-ish" dataset (Michael actually calls it "tiny") with 1 million nodes and 1 million relationships. So what do you do? MS Excel to the rescue. I created an Excel file with two worksheets, one for nodes and one for relationships. The "nodes sheet" holds nodes arranged in the following model of persons and animals that are each other's friends (thanks again to Alistair for the Arrows):

Creating the nodes sheet was easy. For the relationships sheet, I used a randomization function to create random relationships:

=RANDBETWEEN(nodes!A$2;nodes!A$1048576)
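For readers without Excel, the same randomization can be sketched in Python. This is only an illustration of the idea, not how the file for this post was made: it assumes node ids run from 1 to the node count, and the function names, the `start`/`end`/`type` header and the `IS_FRIEND_OF` relationship type are all hypothetical choices of mine.

```python
import csv
import random


def random_relationships(node_count, rel_count, rel_type="IS_FRIEND_OF", seed=None):
    """Yield (start, end, type) rows with both endpoints picked at random.

    Assumes node ids are the integers 1..node_count (an assumption, not
    something the batch importer requires). rel_type is a hypothetical name.
    """
    rng = random.Random(seed)
    for _ in range(rel_count):
        # randint includes both bounds, just like Excel's RANDBETWEEN
        yield rng.randint(1, node_count), rng.randint(1, node_count), rel_type


def write_relationships(path, rows):
    """Write the rows as a tab-separated file with an assumed header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["start", "end", "type"])  # header names are an assumption
        writer.writerows(rows)


# Example usage (writes one million rows, so it is left commented out):
# write_relationships("rels.csv", random_relationships(1_000_000, 1_000_000))
```

Like the Excel formula, this can produce duplicate relationships and nodes that are friends with themselves, which is fine for a random test graph.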

The Excel file that I made is over here. Doing it this way gives me a fairly random graph structure, if I could manage to import it into neo4j. To do so with the batch importer, I simply had to export the workbook to two .csv files: one for nodes, one for relationships. And then there was one more step: I had to replace the semicolons with tabs for the batch importer to accept the files (I probably could have skipped this step by editing the batch.properties file as in these instructions). Easy enough in any text editor - done in 2 seconds.
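If you would rather script the semicolon-to-tab step than do it in a text editor, here is a minimal Python sketch. The function name and the file names in the usage comment are hypothetical, and it assumes the export is a plain semicolon-delimited file that fits the default text encoding.

```python
import csv


def semicolons_to_tabs(src_path, dst_path):
    """Rewrite a semicolon-delimited CSV file as a tab-delimited one."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=";")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        # Rows are streamed one at a time, so a million lines is no problem
        writer.writerows(reader)


# Example usage (hypothetical file names):
# semicolons_to_tabs("nodes_export.csv", "nodes.csv")
# semicolons_to_tabs("rels_export.csv", "rels.csv")
```

Going through the csv module instead of a blind search-and-replace means a semicolon inside a quoted value is kept as data rather than turned into a column break.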

Drumroll: would it work?

So I downloaded the zip file, unzipped it, and ran:

./import.sh graph.db nodes.csv rels.csv

Then I waited 20 seconds (apparently this is going to get a lot faster very soon - but I was already impressed!) and: TADAAH!

Job done!! 

All I had to do then was to copy the graph.db directory (download the zipped version from over here) to my shiny new 2.0 GA instance directory, fire up the server, and all was fun and games. Look at the queries in the neo4j browser, and you see a beautiful random animal-person social network. So cool!


What did I learn?

Thanks to Michael's work, the import process for large-ish datasets is now really easy. If I can do it, you can. But. There was a but.

Turns out that the default neo4j install on my machine (with an outdated version of Java 7, I must admit) ran painfully slowly after a few queries. But as soon as I changed one little setting (the initial and maximum Java heap size, both set to 4096 MB on my 8GB RAM machine), it was absolutely smoking fast. Look for the neo4j-wrapper.conf file in the conf directory of your neo4j install.
I guess I just never played around with larger datasets in the past - this definitely made a HUGE difference on my machine.

UPDATE: I just updated my Java Virtual Machine to the latest version, and this problem has now gone away. You don't need the above step if you are on the latest version - just leave it with the default settings and it will work like a charm!

So: THANK YOU SAINT NICOLAS for bringing me these shiny new toys - I will try to continue to be a good boy!

Hope this was useful.

Rik
