Friday 25 September 2015

Part 2/3: loading the Global Terrorism Database into Neo4j

In the previous post in this series, I explained what we were trying to do by taking the Global Terrorism Database and using it to create a Neo4j graph database based POLE database.

As you may remember, the GTD is a really big Excel file. So the first thing that I did was to do a minor clean-up operation on the XL file (there were a few columns in the file that did not make sense to import, and that were really causing me a lot of pain during the import process), and then I basically converted it into a nice CSV file that I could use with Cypher's LOAD CSV commands.

Creating the import script

Now, I have done this sort of thing before, and I can tell you - importing a dataset like this is never trivial. One thing that I have learned over the years with lots of trial and error, is that it is usually a good idea to import data in lots of smaller incremental steps. Rather than trying to do everything at once, I essentially for this particular GTD import task created 50 queries: 50 iterations through a csv file, piecing together the different parts of my graph model with different columns of my CSV file.
The entire import statement is on github, you can run through it pretty easily.

Running the import script

Now, you could run this import script very nicely in the Neo4j browser. There's one problem though: the browser only accepts one statement at a time, and that would mean a lot of copying and pasting. That's why I very often times still revert to the good-old neo4j-shell. Copying and pasting the entire import statement works like a charm:
Although I really should mention one thing that I learned during this excercise. The thing is, that because I actually import the data piece by piece, run by run, I had originally also created the relevant Neo4j indexes right before every query. Waw. That turned out to be kind of a problem.

Schema await!!!

The thing is, that the Neo4j Cypher query planner relies on these indexes to create the most efficient query execution plan for an operation. And what I noticed in this excercise is that my import queries were sometimes literally taking F-O-R-E-V-E-R, in simple situations where really they should not. I really had to do a bit of digging there and ask a little help from my friends, but I ended up finding out that the problem was simply that the INDEXES, which I had created in the statement right before the import operation, where not online yet at the moment when I was doing the import. The Cypher query planner, in the absense of index information, the proceed to do the import in the most inefficient way possible, doing full graph scans time and time again for the simplest of operations.

Two really easy solutions to this:

  • create your indexes all the way at the top of your import script, not between import statements. Probably a best practice.
  • add a very simple command after you create the index, to force for it to be "ONLINE" before the next statement is executed: 
neo4j-sh (?)$ schema await;

That was a very useful trick to learn - I hope it will be useful for you too.

The result of the Import

After running this script, we end up loading data into the model that we sketched earlier.
After a few minutes of importing, the meta-graph is exactly the same as the model we set out to populate, and is now ready for querying.
That's what we will do in the next and final blogpost of this series: do some real POLE database queries on the Global Terrorism Database. Should be fun. Already looking forward!

Hope this was useful for you.



No comments:

Post a Comment