Monday 13 April 2015

Importing the SNAP Beeradvocate dataset into Neo4j - part 1

As you may or may not know, I am a big fan of Beer. And I am a big fan of Neo4j. I did a talk about this at GraphConnect two years ago ...
- and I have been doing that demo a lot since then, with lots of users, customers and meetups. Up to the point where I was really building up a reputation as a "Beer Guy". Not sure if that is good or bad...

So some time ago I learned about SNAP, the Stanford Network Analysis Project, which mentions a really interesting example dataset: 1.5 million Beer-reviews from Beeradvocate, written between January 1998 and November 2011. But at the bottom of the page it says that the dataset is no longer available for download. Disappointment!!! But then one simple Google search led me to this blob, and a 445 Mbyte download later we were in business. Unzip the thing, and we have a 1.5 Gbyte text file to play with.

Importing the data? Not quite yet!

Once I had downloaded the data, I found that there were two big "bears" on the road. Two problems that I would have to solve.
  • Problem 1: the size of the data. I would not really call this file huge, but on my little laptop (I "only" have 8 Gbyte of RAM on my machine) it can get kind of tricky to work with biggish files like that... Based on experience - I just know I will get into trouble, and will need to work around that.
  • Problem 2: the structure of the data. Here's a look at the file:
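Every review is a block of "field name: field value" lines, one field per line, with a blank line between records. It looks roughly like this - the field names below follow the SNAP dataset description, and the values are just placeholders:

beer/name: <name of the beer>
beer/beerId: <numeric beer id>
beer/brewerId: <numeric brewer id>
beer/ABV: <alcohol by volume>
beer/style: <beer style>
review/appearance: <score>
review/aroma: <score>
review/palate: <score>
review/taste: <score>
review/overall: <score>
review/time: <unix timestamp>
review/profileName: <name of the reviewer>
review/text: <the full text of the review>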


This definitely does not look very Neo4j-import friendly. I would need to do some transformations to make it easy to import.

So how did I go about solving this? Read on.

Solution 1: splitting the text file

The first thing I set out to do was to split the text file into different parts, so that it would be easier to handle and transform along the way. I looked around for a while, and then found the split bash command - by far the simplest, most performant and most straightforward option. I played around with different options (different sizes for the splits - I first had 6 files, then 3, and ended up splitting into fifteen files of 100 Mbyte each), and eventually used this simple command:

split -b 100m beeradvocate.txt

This was the result: 15 perfectly manageable files! Once that was done, I needed to make sure that the split was not completely arbitrary, and that every file contained only "whole records". Easy peasy with some simple copying and pasting from the end of each file to the beginning of the next one - done in less than 5 minutes! Job done!
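By default, split names its output files xaa, xab, xac, and so on, so checking where one chunk ends and the next one begins is just a matter of peeking at the boundaries, for example:

tail -n 5 xaa   # last lines of the first chunk
head -n 5 xab   # first lines of the second chunk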

Solution 2: transforming the .txt files

Then I needed to get these manageable files into a .csv format that I could import into Neo4j. This was more complicated. I needed to go 
  • from a structure that had records in blocks of 14 lines, with a property field on every line
  • to a .csv file that had a row per record, and fields in comma-separated columns
That's quite a transformation, for which I would need some tooling. I decided to try OpenRefine. I had heard about this tool a couple of years ago already, but had never made any serious use of it. Now I thought it would come in handy, as I would have to go through every step of the transformation 15 times - once for every .txt file that we generated above.
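OpenRefine is what I actually used (see below), but just to make the transformation concrete: in shell terms it boils down to something like this rough awk sketch - assuming the "field name: field value" layout shown above, and naively ignoring the fact that the review text itself can contain commas and quotes:

# treat blank-line-separated blocks as records, one line per field
awk 'BEGIN { RS=""; FS="\n"; OFS="," }
{
  row = ""
  for (i = 1; i <= NF; i++) {
    field = $i
    sub(/^[^:]*: /, "", field)               # strip the "field name: " prefix
    row = (i == 1 ? field : row OFS field)   # append as the next comma-separated column
  }
  print row
}' xaa > xaa.csv

Refine does essentially the same thing, but in a repeatable, step-by-step way - which matters when you have to do it 15 times.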

So I fired it up, and created the first of 15 "projects" in Refine. This is what it looked like before the transformation:

After one parsing operation using the "line-based text file" transformation, I already got a preview that looked like this:
That already looks kind of right - at least I have a row per record now.

Now I would need to do some manipulations before the text file became usable. OpenRefine has this really wonderful transformation tool that allows you to build up manipulation steps and execute them one after the other. The main steps were:
  • stripping the "field name" from every cell, and just leaving the data in there. Every cell was currently structured as "<<field name>>: <<field value>>", and I wanted every cell to just contain the "<<field value>>".
  • renaming the columns so that my .csv file would have nice workable headers (something like the example header row below). This is not mandatory - but I found it easier.
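To give an idea of what I mean by workable headers - the exact names are entirely up to you, these are just examples matching the fields above - the header row of the .csv could look something like this:

name,beerId,brewerId,ABV,style,appearance,aroma,palate,taste,overall,time,profileName,text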
When you do this manually, it can be a bit cumbersome and repetitive - so you definitely don't want to do it 15 times. That's where Refine is so nice: you can extract a .json file that specifies all the operations, and then apply those same operations to the next "projects" time and time again.

So you do it once, and then you can extract that .json so that you can use it later on the other 14 files - which is exactly what I wanted.

That .json file is also on github.

The result was exactly what we wanted: a comma-separated file, ready for import into Neo4j.
Download the file, and we are done with the import preparations for this file. Of course you need to do this 15 times, but thanks to the .json file with the operation steps, that was really easy.

You can find the sample files (the .txt file, the Refine operations .json and the resulting .csv file) on github.

That was it for part 1. We now have everything ready to do an import. In part 2 we will do the actual import into Neo4j and start playing around with it.

Hope you enjoyed this.

Cheers

Rik

PS: Here are the links to
