Wednesday 15 April 2015

Importing the SNAP Beeradvocate dataset into Neo4j - part 2

After the previous post on the SNAP Beeradvocate dataset, we were ready to import the data into Neo4j. We had 15 .csv files - perfect for a couple of runs of Load CSV.

The first thing I needed to do was to create a graph model out of my CSV files. Here's what I picked:
So then I needed to create a series of Load CSV commands to import these. And this is where it got interesting. I created the Cypher queries myself and found that they worked fine - except for one part: the part where I had to add the reviews to the graph. This was my query:
 using periodic commit  
 load csv with headers  
 from "file:/Users/rvanbruggen/Dropbox/Neo Technology/Demo/BEER/BeerAdvocate/real/ba7.csv" as csv  
 fieldterminator ';'  
 with csv  
 where csv.review_profileName is not null  
 match (b:Beer {name: csv.beer_name}), (p:Profile {name: csv.review_profileName})  
 create (p)-[:CREATES_REVIEW]->(r:Review {taste: toFloat(csv.review_taste), appearance: toFloat(csv.review_appearance), text: csv.review_text, time: toInt(csv.review_time), aroma: toFloat(csv.review_aroma), palate: toFloat(csv.review_palate), overall: toFloat(csv.review_overall)})-[:REVIEW_COVERS]->(b);  

On some of the import files (remember, I had 15 of them) this query would fail: it would run out of heap space. That is a very tricky thing to troubleshoot in Neo4j, so I had to call for help. My colleagues all immediately volunteered, and of course within the hour Michael had reengineered everything.

The first thing Michael asked for was my query plan (using the EXPLAIN command): this was particularly interesting. Michael saw that the plan contained a step called "Eager". Mark has blogged about this elsewhere already, and it was clear that we had to get rid of it.
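
In case you want to check this on your own queries: just prefix the statement with EXPLAIN and Cypher will show you the plan without actually executing anything. Here's a minimal sketch against my original (problematic) statement, abbreviated to just one of the review properties:
 //sketch: inspect the plan and look for an "Eager" step  
 explain  
 load csv with headers  
 from "file:/Users/rvanbruggen/Dropbox/Neo Technology/Demo/BEER/BeerAdvocate/real/ba7.csv" as csv  
 fieldterminator ';'  
 with csv  
 where csv.review_profileName is not null  
 match (b:Beer {name: csv.beer_name}), (p:Profile {name: csv.review_profileName})  
 create (p)-[:CREATES_REVIEW]->(r:Review {overall: toFloat(csv.review_overall)})-[:REVIEW_COVERS]->(b);  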



Here's the query that Michael suggested:
 //the query below is NO LONGER PROBLEMATIC  
 using periodic commit  
 load csv with headers from "file:/Users/rvanbruggen/Dropbox/Neo Technology/Demo/BEER/BeerAdvocate/real/ba15.csv" as csv fieldterminator ';'  
 with csv where csv.review_profileName is not null  
 create (r:Review {taste: toFloat(csv.review_taste), appearance: toFloat(csv.review_appearance), text: csv.review_text, time: toInt(csv.review_time), aroma: toFloat(csv.review_aroma), palate: toFloat(csv.review_palate), overall: toFloat(csv.review_overall)})  
 with r,csv  
 match (b:Beer {name: csv.beer_name})  
 match (p:Profile {name: csv.review_profileName})  
 create (p)-[:CREATES_REVIEW]->(r)  
 create (r)-[:REVIEW_COVERS]->(b);  
 // takes 13s  
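
One thing to keep in mind: the two MATCH statements in there only stay fast because the name properties they look up are indexed. For completeness, this is roughly what those index statements look like - assuming the labels and properties used in the queries above:
 //sketch: the indexes that the MATCH lookups rely on  
 create index on :Beer(name);  
 create index on :Profile(name);  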

You can find the two import scripts on github:
  • this is my old version (which DID NOT WORK, at least not always)
    UPDATE: in the original version of this blogpost, I was working with version 2.2.0 of Neo4j. Recently, 2.2.1 was released - and guess what: the queries now run just fine. Apparently the team made some changes to how Neo4j handles composite merge updates, and it now just flies through all of the queries, even my old, sub-optimal versions. Kudos!
  • this is Michael's version (which, of course, WORKS)
    UPDATE: I would still recommend using this version of the queries :) 
Let's explore some of the differences.
  1. Michael's version included the same indexes as mine - but it also included a UNIQUENESS CONSTRAINT. This seems to be a good idea because it makes the MERGE-ing of the data unnecessary - you can just CREATE instead (there's a sketch of such a constraint right below this list).
  2. Michael's version does "one MERGE at a time". Rather than merging an entire pattern in one go, like
 merge (b)-[:HAS_STYLE]->(s:Style {name: csv.beer_style})  

    you instead first do

 merge (s:Style {name: csv.beer_style})  

    and then

 merge (b)-[:HAS_STYLE]->(s)  
  3. Michael's version reorders certain parts of the query to come earlier in the sequence. I noticed that he did the CREATE of the review first, then transferred that result into the next part of the query with WITH, and then did two MATCHes (for Beers and Profiles) to connect the Review to the appropriate Beer and Profile. To be honest, this seems to have been a bit of a trial-and-error search - but after talking to the awesome dev team, we found out that this should no longer be necessary as from version 2.2.1 of Neo4j.
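To make that first point a bit more concrete: this is roughly what such a uniqueness constraint looks like in Cypher. I'm using the Style label from the MERGE example above purely as an illustration - not necessarily the exact constraint Michael defined:
 //sketch: a uniqueness constraint - it is backed by an index, so you get the lookup for free  
 create constraint on (s:Style) assert s.name is unique;  
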
The result is pretty awesome. After having run the entire import (which means running the same import 15 times - see over here for the complete script) I got a pretty shiny new database to play around with:
In the last part of this blog-series, I will be doing some fancy queries. Really looking forward to it :))

Hope you found this useful.

Cheers

Rik

PS: Here are the links to the other parts of this blog series.
