Enter neo4j-shell-tools
Michael created a collection of utilities that basically plug into the neo4j-shell and extend its functionality with things like... data import.
There are different options, and you should definitely read up on the
different capabilities, but for my specific Last.fm use case, what
mattered was that it can easily import the .csv files that I had
created earlier for the Talend-based import.
You can read up on the details of the
shell-tools in the readme (it contains very simple installation
instructions that you need to go through beforehand:
essentially installing the .jar file in neo4j's lib directory). Once
you have done that and restarted the neo4j server, you are
good to go.
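For reference, on my machine that boiled down to something like the following (the paths and the exact .jar name are just an example here; they depend on where your neo4j server lives and on the shell-tools version you downloaded):

# copy the shell-tools jar into neo4j's lib directory (example paths)
cp neo4j-shell-tools.jar /path/to/neo4j/lib/
# restart the server so that the new jar gets picked up
/path/to/neo4j/bin/neo4j restart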
Creating the database from scratch
As you will see below, the steps are
quite simple:
Step 1: start with an empty neo4j database
What's important here is that the
neo4j-shell-tools work on a **running** neo4j database. You do not
need to introduce downtime, and you do not use the so-called "batch importer" method; instead you are doing a full-blown,
transactional, live update on the graph, using this toolset.
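Concretely, that just means the server needs to be up, and you connect to it with the shell before running any of the import commands below (again, example paths):

/path/to/neo4j/bin/neo4j start      # the server must be running
/path/to/neo4j/bin/neo4j-shell      # connect to the running database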
Step 2: prepare the .csv files
I had already prepared these files for
the previous blogpost, so that was easy. The only things I had to change were that I
- had to make sure that the delimiter I was using was right. The neo4j-shell-tool allows you to specify the type of delimiter, and getting that wrong will obviously lead to faulty imports.
- had to add a "header" row at the top of the text files. The neo4j-shell-tool assumes that the first line of each .csv file defines the structure of the rest of the file. This also means that I needed multiple files, as the nodes and relationships that I wanted to add each have a different structure/type (there is a small sketch of such a file below).
So I ended up with 2 .csv files to add
nodes to the graph, and 7 .csv files to add the relationships between
the nodes. You can download everything here.
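Just to illustrate the header-row point: a node file like nodespart1.csv starts with a line that names the columns, and then uses the chosen delimiter on every row. The values below are made up, but the structure matches the first import command in the next step:

name;type
some-artist-mbid;artist
some-track-mbid;track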
Step 3: prepare the import commands
The node import commands look like this:

import-cypher -d ; -i nodespart1.csv -o 1out.csv create (n{name:{name}, type:{type}}) return n.name as name

import-cypher -d ; -i nodespart2.csv -o 2out.csv create (n{title:{title}, name:{name}, type:{type}}) return n.name as name
The structure of these commands is
fairly simple:
- import-cypher: calls the shell tool that we want to use
- -d defines the delimiter of the file that we are importing. In this case a ";".
- -i defines the input file. On OSX, not adding a path will just look for the file in the root of your neo4j installation directory. In many cases you will want to use an absolute path, or a relative path from there.
- -o defines an optional output file where the result of the import commands will be written. This is intended for logging purposes.
- And then finally, with the "create..." section, we basically define the Cypher query that will do the import transaction, using the parameters from the csv file (between { }) as input.
Note that the neo4j-shell-tools provide
some separate functionality for dealing with large input files and
for tuning the transaction throttling (how many updates in one
transaction), but for this purpose we really did not need that.
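For completeness: if I read the readme correctly, import-cypher also accepts a batch-size option (-b, if I remember correctly) for exactly that purpose, so a bigger import could look roughly like the line below. Do double-check the readme for the exact flag, as this is from memory, and the file name is obviously just a placeholder:

import-cypher -b 10000 -d ; -i somebigfile.csv -o bigout.csv create (n{name:{name}, type:{type}}) return n.name as name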
Then for the relationship import
commands, we have a very similar structure:
import-cypher -d ; -i APPEARS_ON.csv -o 3out.csv start n1=node:node_auto_index(name={mbid1}), n2=node:node_auto_index(name={mbid2}) create unique n1-[:APPEARS_ON]->n2 return n1.name, n2.name

import-cypher -d ; -i CREATES.csv -o 4out.csv start n1=node:node_auto_index(name={mbid1}), n2=node:node_auto_index(name={mbid2}) create unique n1-[:CREATES]->n2 return n1.name, n2.name

import-cypher -d ; -i FEATURES.csv -o 5out.csv start n1=node:node_auto_index(name={scrobble}), n2=node:node_auto_index(name={mbid}) create unique n1-[:FEATURES]->n2 return n1.name, n2.name

import-cypher -d ; -i LOGS.csv -o 6out.csv start n1=node:node_auto_index(name={user}), n2=node:node_auto_index(name={song}) create n1-[:LOGS]->n2 return n1.name, n2.name

import-cypher -d ; -i ON_DATE.csv -o 7out.csv start n1=node:node_auto_index(name={scrobble}), n2=node:node_auto_index(name={date}) create n1-[:ON_DATE]->n2 return n1.name, n2.name

import-cypher -d ; -i PERFORMS.csv -o 8out.csv start n1=node:node_auto_index(name={mbid1}), n2=node:node_auto_index(name={mbid2}) create unique n1-[:PERFORMS]->n2 return n1.name, n2.name

import-cypher -d ; -i PRECEDES.csv -o 9out.csv start n1=node:node_auto_index(name={date1}), n2=node:node_auto_index(name={date2}) create n1-[:PRECEDES]->n2 return n1.name, n2.name
Note that, because of the domain model that we have from the last.fm dataset, some relationships have to be unique and others don't, hence the difference between create unique and plain create in the Cypher queries.
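To make that concrete with a tiny, made-up example (the node names here are hypothetical, just to show the semantics):

start n1=node:node_auto_index(name="node-a"), n2=node:node_auto_index(name="node-b") create unique n1-[:APPEARS_ON]->n2 return n1.name, n2.name

Running that statement twice for the same pair of nodes still leaves exactly one APPEARS_ON relationship between them, whereas the plain create variants (as used for LOGS, ON_DATE and PRECEDES) would add a second, parallel relationship on the second run.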
Step 4: executing the commands
Then
all we need to do is to put the files in the right locations, make
sure that autoindexing is correctly defined, and then copy/paste the
commands into the neo4j-shell.
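For the record, "correctly defined" here means that node auto-indexing is switched on for the name property before the nodes are created, since all the relationship imports above look nodes up via node_auto_index(name=...). On my setup that meant something like the following two lines in conf/neo4j.properties, followed by a server restart; check the neo4j manual for your version if in doubt:

node_auto_indexing=true
node_keys_indexable=name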
On my MacBook Pro, the entire import took about 35 seconds, and I ended up with the database that I had previously created with the Talend toolset.
And
then the same graphic/query exploration can begin. You can take the graphical tools for a spin, or alternatively create your own cypher queries and get going.
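As a hypothetical starting point (the mbid value is just a placeholder for a real identifier from your dataset, and I am assuming here that PERFORMS points from an artist to the tracks it performs, which is how I read the model), a first query in the neo4j-shell could look something like this:

start artist=node:node_auto_index(name="<some-artist-mbid>")
match artist-[:PERFORMS]->track
return track.name
limit 10;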
Conclusion
Overall,
I found this new process to be extremely intuitive and straightforward,
even simpler than what I had experienced using the Talend toolset. I
have put the zip-file
and the corresponding input statements over here, so feel free to
download and experiment yourself. Just make sure that you put the
.csv files in the neo4j "home directory", or adjust the paths as
you want (both relative and absolute paths seemed to work on my
machine).
Hope
this was useful. Until the next time!
Rik