As you may remember, the GTD is a really big Excel file. So the first thing that I did was to do a minor clean-up operation on the XL file (there were a few columns in the file that did not make sense to import, and that were really causing me a lot of pain during the import process), and then I basically converted it into a nice CSV file that I could use with Cypher's LOAD CSV commands.
Creating the import scriptNow, I have done this sort of thing before, and I can tell you - importing a dataset like this is never trivial. One thing that I have learned over the years with lots of trial and error, is that it is usually a good idea to import data in lots of smaller incremental steps. Rather than trying to do everything at once, I essentially for this particular GTD import task created 50 queries: 50 iterations through a csv file, piecing together the different parts of my graph model with different columns of my CSV file.
import statement is on github, you can run through it pretty easily.
Running the import scriptNow, you could run this import script very nicely in the Neo4j browser. There's one problem though: the browser only accepts one statement at a time, and that would mean a lot of copying and pasting. That's why I very often times still revert to the good-old neo4j-shell. Copying and pasting the entire import statement works like a charm:
Schema await!!!The thing is, that the Neo4j Cypher query planner relies on these indexes to create the most efficient query execution plan for an operation. And what I noticed in this excercise is that my import queries were sometimes literally taking F-O-R-E-V-E-R, in simple situations where really they should not. I really had to do a bit of digging there and ask a little help from my friends, but I ended up finding out that the problem was simply that the INDEXES, which I had created in the statement right before the import operation, where not online yet at the moment when I was doing the import. The Cypher query planner, in the absense of index information, the proceed to do the import in the most inefficient way possible, doing full graph scans time and time again for the simplest of operations.
Two really easy solutions to this:
- create your indexes all the way at the top of your import script, not between import statements. Probably a best practice.
- add a very simple command after you create the index, to force for it to be "ONLINE" before the next statement is executed:
neo4j-sh (?)$ schema await;
That was a very useful trick to learn - I hope it will be useful for you too.
The result of the Import
After running this script, we end up loading data into the model that we sketched earlier.
After a few minutes of importing, the meta-graph is exactly the same as the model we set out to populate, and is now ready for querying.
Hope this was useful for you.