Tuesday, 15 October 2013

Importing my Last.fm dataset - the neo4j way

Some time ago, I blogged about how you could create an interesting graph dataset in neo4j using the data from Last.fm. At the time, I used Talend as an ETL tool to do the import into neo4j, as the dataset was quite large and the spreadsheet method would probably not cut it anymore. It worked great. The only downside (for this particular use case) was that ... I had to learn Talend. Not that that is terribly difficult, especially if you are an experienced ETL professional (which I am clearly NOT), but there was definitely a learning curve involved. So there remained a latent desire to do this import into neo4j natively, without separate tooling. And now, I think we have that, thanks to the ever-amazing Michael Hunger.

Enter neo4j-shell-tools

Michael created a collection of utilities that plug into the neo4j-shell and extend its functionality with things like... data import. There are different options, and you should definitely read up on the different capabilities, but for my specific Last.fm use case, what mattered was that it can easily import the .csv files that I had created at the time for the import using Talend.

You can read up on the details of the shell-tools in the readme (it contains very simple installation instructions that you need to go through beforehand: essentially installing the .jar file in neo4j's lib directory). Once you have done that and restarted the neo4j server, you are good to go.

Creating the database from scratch

As you will see below, the steps are quite simple:

Step 1: start with an empty neo4j database

What's important here is that the neo4j-shell-tools work on a **running** neo4j database. You do not need to introduce downtime, and you do not use the so-called “batchimporter” method. Instead you are doing a full-blown, transactional, live update on the graph using this toolset.

Step 2: prepare the .csv files

I had already prepared these files for the previous blogpost, so that was easy. The only changes that I had to make were that I
  • had to make sure that the delimiter that I was using was right. The neo4j-shell-tools allow you to specify the delimiter, and getting that wrong will obviously lead to faulty imports.
  • had to add a “header” row at the top of the text files. The neo4j-shell-tools assume that the first line of each .csv file defines the structure of the rest of the file. That also means that I needed multiple files, as the nodes and relationships that I wanted to add have different structures/types.
So I ended up with 2 .csv files to add nodes to the graph, and 7 .csv files to add the relationships between the nodes. You can download everything here.
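If your own export does not have a header row yet, a small script can add one. The following is only a minimal sketch, not part of neo4j-shell-tools: the function name add_header and the file name raw.csv are made up for illustration, while the column names name and type are the ones used by the node import commands in this post.

```python
import csv


def add_header(src_path, dst_path, header, delimiter=";"):
    """Copy a headerless delimited file to dst_path with a header row on top.

    Hypothetical helper for illustration; it also checks that every row has
    as many fields as the header, since a wrong delimiter shows up that way.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
        writer.writerow(header)
        for line_number, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != len(header):
                raise ValueError(
                    f"line {line_number}: expected {len(header)} fields, "
                    f"got {len(row)} (wrong delimiter?)"
                )
            writer.writerow(row)
```

Calling add_header("raw.csv", "nodespart1.csv", ["name", "type"]) would then produce a file whose first line is name;type, followed by the original rows.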

Step 3: prepare the import commands

The node import commands look like this

import-cypher -d ; -i nodespart1.csv -o 1out.csv create (n{name:{name}, type:{type}}) return n.name as name

import-cypher -d ; -i nodespart2.csv -o 2out.csv create (n{title:{title}, name:{name}, type:{type}}) return n.name as name

The structure of these commands is fairly simple:
  • import-cypher: calls the shell tool that we want to use
  • -d defines the delimiter of the file that we are importing. In this case a “;”.
  • -i defines the input file. On OSX, if you do not add a path, the tool will just look for the file in the root of your neo4j installation directory. In many cases you will want to use an absolute path, or a path relative to that directory.
  • -o defines an optional output file where the result of the import commands will be written. This is intended for logging purposes.
  • And then finally, with the “create...” section, we define the Cypher query that will do the import transaction, using the columns from the .csv file (between { }) as parameters.
Note that the neo4j-shell-tools provide some separate functionalities for dealing with large input files and for tuning the transaction throttling (how many updates in one transaction), but that for this purpose we really did not need to do that.
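To make the link between the header row and the { } parameters a bit more tangible, here is a small sketch of the idea in Python. It is not how neo4j-shell-tools is implemented; it only illustrates that every data row turns into one set of named parameters, keyed by the header. The function name rows_as_params and the sample values are made up for illustration.

```python
import csv
import io


def rows_as_params(csv_text, delimiter=";"):
    """Return one parameter map per data row, keyed by the header row.

    Illustration only: each map corresponds to the values that would be
    filled into {name}, {type}, ... for one execution of the Cypher query.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Made-up sample content in the shape of nodespart1.csv
sample = "name;type\nsome-artist;Artist\nsome-song;Song\n"
for params in rows_as_params(sample):
    print(params)
```

Each printed dictionary stands for one run of the create statement, with {name} and {type} replaced by the values of that row.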

Then for the relationship import commands, we have a very similar structure:

import-cypher -d ; -i APPEARS_ON.csv -o 3out.csv start n1=node:node_auto_index(name={mbid1}), n2=node:node_auto_index(name={mbid2}) create unique n1-[:APPEARS_ON]->n2 return n1.name, n2.name

import-cypher -d ; -i CREATES.csv -o 4out.csv start n1=node:node_auto_index(name={mbid1}), n2=node:node_auto_index(name={mbid2}) create unique n1-[:CREATES]->n2 return n1.name, n2.name

import-cypher -d ; -i FEATURES.csv -o 5out.csv start n1=node:node_auto_index(name={scrobble}), n2=node:node_auto_index(name={mbid}) create unique n1-[:FEATURES]->n2 return n1.name, n2.name

import-cypher -d ; -i LOGS.csv -o 6out.csv start n1=node:node_auto_index(name={user}), n2=node:node_auto_index(name={song}) create n1-[:LOGS]->n2 return n1.name, n2.name

import-cypher -d ; -i ON_DATE.csv -o 7out.csv start n1=node:node_auto_index(name={scrobble}), n2=node:node_auto_index(name={date}) create n1-[:ON_DATE]->n2 return n1.name, n2.name

import-cypher -d ; -i PERFORMS.csv -o 8out.csv start n1=node:node_auto_index(name={mbid1}), n2=node:node_auto_index(name={mbid2}) create unique n1-[:PERFORMS]->n2 return n1.name, n2.name

import-cypher -d ; -i PRECEDES.csv -o 9out.csv start n1=node:node_auto_index(name={date1}), n2=node:node_auto_index(name={date2}) create n1-[:PRECEDES]->n2 return n1.name, n2.name

Note that, because of the domain model that we have from the last.fm dataset, some relationships have to be unique and others don't – hence the difference in the Cypher queries.
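Since the seven relationship commands only differ in the file name, the column names and whether or not the relationship has to be unique, you could also generate them instead of typing them out. This is just a sketch under that assumption: the helper relationship_command and the RELATIONSHIPS table are my own illustration and not something that ships with neo4j-shell-tools. The values in the table are taken from the commands above.

```python
# (relationship type, start column, end column, unique?)
RELATIONSHIPS = [
    ("APPEARS_ON", "mbid1", "mbid2", True),
    ("CREATES", "mbid1", "mbid2", True),
    ("FEATURES", "scrobble", "mbid", True),
    ("LOGS", "user", "song", False),
    ("ON_DATE", "scrobble", "date", False),
    ("PERFORMS", "mbid1", "mbid2", True),
    ("PRECEDES", "date1", "date2", False),
]


def relationship_command(rel_type, start_col, end_col, unique, out_file,
                         delimiter=";"):
    """Build one import-cypher command line (hypothetical helper)."""
    create = "create unique" if unique else "create"
    return (
        f"import-cypher -d {delimiter} -i {rel_type}.csv -o {out_file} "
        f"start n1=node:node_auto_index(name={{{start_col}}}), "
        f"n2=node:node_auto_index(name={{{end_col}}}) "
        f"{create} n1-[:{rel_type}]->n2 return n1.name, n2.name"
    )


# Output files are numbered 3out.csv to 9out.csv, as in the commands above.
commands = [
    relationship_command(rel, start, end, unique, f"{n}out.csv")
    for n, (rel, start, end, unique) in enumerate(RELATIONSHIPS, start=3)
]

if __name__ == "__main__":
    print("\n\n".join(commands))
```

The printed lines can then be pasted into the neo4j-shell just like the hand-written ones.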

Step 4: executing the commands

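Then all we need to do is put the files in the right locations, make sure that auto-indexing is correctly defined, and then copy/paste the commands into the neo4j-shell. The relationship commands look nodes up by name in node_auto_index, so node auto-indexing on the name property has to be switched on before the nodes are created (in neo4j versions of this era, typically with node_auto_indexing=true and node_keys_indexable=name in conf/neo4j.properties).
On my MacBook Pro, the entire import took about 35 seconds, and I ended up with the same database that I had previously created with the Talend toolset.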

And then the same graphic/query exploration can begin. You can take the graphical tools for a spin, or alternatively create your own cypher queries and get going.

Conclusion

Overall, I found this new process to be extremely intuitive and straightforward, even simpler than what I had experienced using the Talend toolset. I have put the zip-file and the corresponding input statements over here, so feel free to download and experiment yourself. Just make sure that you put the .csv files in the neo4j “home directory”, or adjust the paths as you want (both relative and absolute paths seemed to work on my machine).

Hope this was useful. Until the next time!


Rik

1 comment:

  1. Hi Rik,
    I went thru the same path. I learned Talend and then I found Mike's Import-cypher. Truly, Amazing tool :)

    Stan
