Thursday 16 July 2015

Loading the Belgian Corporate Registry into Neo4j - part 2

In the previous blogpost of this series, I was trying to import the Belgian Corporate Registry dataset into Neo4j - all in preparation for some interesting querying. Unfortunately, as you could read in part 1, this was not that easy: the Load CSV queries were starting to take a very long time, and my initial reaction was that the problem must be down to the size of the CSV file. In this blogpost, I will take you through that experience - and show you what I did, and how I - again - only partially succeeded.

As a refresher, here's the dataset that we are looking at:

As you can see, the address.csv file is quite big - and that was already a problem in part 1. But I quickly realised that if I then wanted to connect the Enterprises and Establishments to their respective Activities by loading the 669 MB activity.csv file, the problems would only get bigger. I needed a solution.

Bash and Python to the rescue

So here was my first idea on how to solve this:

  • I would figure out a way to split the address.csv and/or activity.csv file into multiple smaller CSV files
  • I would then create a Python script that would iterate over all of the generated files, and execute the loading transactions over these much smaller CSV files

Sounded like a good idea to me, so I explored it and actually partially succeeded - and learned some more bash and Python along the way :) ... what's not to like?

Here's the bash scripting to split the CSV file, in three distinct steps:

1. First, create the split files with 25000 lines each

tail -n +2 ./sourcecsv/address.csv | split -l 25000 - ./splitcsv/splitaddress2/splitaddress_
This command takes the "tail" of the address.csv file starting from line 2 (skipping the header row), pipes that into the split command, and generates a separate file for every 25000 lines that it encounters. The output looks like this:

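Before moving on, a quick way to double-check the split - a minimal sketch in Python, assuming the paths from the command above - is to verify that the chunks together contain every line of the source file except the header:

import glob

# at this point the header is not yet in the chunks (tail -n +2 stripped it),
# so the chunk lines should add up to the source file minus one
source_lines = sum(1 for line in open("./sourcecsv/address.csv"))
chunk_lines = sum(sum(1 for line in open(f)) for f in glob.glob("./splitcsv/splitaddress2/splitaddress_*"))
print source_lines - 1 == chunk_lines  # should print True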
2. Then, copy the header row to each of the split files

Of course I also needed to copy the header row of the original address.csv file to each of the splitaddress_ files. That one took some time for me to figure out, but I managed it with this simple script:
for file in ./splitcsv/splitaddress_*
do
    head -n 1 ./sourcecsv/address.csv > tmp_file
    cat $file >> tmp_file
    mv -f tmp_file $file
done
What this does is simple: it loops through all the splitaddress_* files; for each one it takes the first line (the header) of the address.csv file, copies that into a tmp_file, concatenates the splitaddress_* file onto the tmp_file, and then renames the tmp_file back to the original splitaddress_* name... easy! So then you get a bunch of smaller, 25000-line CSV files looking like this one:

3. Finally, rename the split files to have numeric suffixes

The last step in my shell wizardry was to rename the files to have numeric suffixes instead of the alphabetic ones that split generates - just so that we can process them in a Python script and iterate over the list of files. This turned out to be a bit trickier, and I had to do some significant googling and then copying and pasting :) ... Here's what I ended up with:

ls -trU ./splitcsv/splitaddress_* | awk 'BEGIN{ a=0 }{ printf "mv %s ./splitcsv/splitaddress_%d\n", $0, a++ }' | bash
So basically three commands:

  • listing the files splitaddress_* in the order they were created
  • passing that list to awk, which prints an mv command for each file, with an incrementing numeric suffix as the target name
  • piping those mv commands to bash for execution
Not trivial - but it works:
So that gave me a bunch of smaller, 25k-line files that I could work with in Python - see below for how I did that.
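One more aside before moving on: the whole three-step dance - splitting, re-adding the header, renaming - could also have been done in a single pass with a few lines of Python. A minimal sketch, assuming the same 25000-line chunks and directory layout (not the route I took, but it avoids the renaming step entirely):

CHUNK_SIZE = 25000

def write_chunk(header, lines, filenr):
    # every chunk gets the original header row plus up to CHUNK_SIZE data lines,
    # and is numbered from the start - no renaming needed afterwards
    with open("./splitcsv/splitaddress_%d" % filenr, "w") as out:
        out.write(header)
        out.writelines(lines)

with open("./sourcecsv/address.csv") as source:
    header = source.readline()
    chunk, filenr = [], 0
    for line in source:
        chunk.append(line)
        if len(chunk) == CHUNK_SIZE:
            write_chunk(header, chunk, filenr)
            chunk, filenr = [], filenr + 1
    if chunk:  # don't forget the last, partial chunk
        write_chunk(header, chunk, filenr)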

Iterating over CSV files with Python

I created a simple Python script to iterate over the files - here it is:
import datetime
from py2neo import Graph
from py2neo.packages.httpstream import http

# raise py2neo's HTTP socket timeout so long-running transactions don't get cut off
http.socket_timeout = 9999

graph = Graph()

print "Starting to process links between Enterprises and Addresses..."

# one transaction per 25000-line chunk keeps every commit small and predictable
for filenr in range(0,113):
    tx1 = graph.cypher.begin()
    statement1 = """
        load csv with headers from
        "file:/<path>/splitcsv/splitaddress/splitaddress_"""+str(filenr)+"""" as csv
        with distinct toUpper(csv.Zipcode) as Zipcode, toUpper(csv.StreetNL) as StreetNL, toUpper(csv.HouseNumber) as HouseNumber, csv.EntityNumber as EntityNumber
        match (e:Enterprise {EnterpriseNumber: EntityNumber}),
        (street:Street {name: StreetNL, zip:Zipcode})<-[:PART_OF]-(h:HouseNumber {houseNumber: HouseNumber})
        create (e)-[:HAS_ADDRESS]->(h);
        """

    tx1.append(statement1)
    tx1.process()
    tx1.commit()
    print "Enterprise Filenr: "+str(filenr)+" processed, at "+str(datetime.datetime.now())

print "Starting to process links between Establishments and Addresses..."

# same loop, but matching on Establishment nodes instead of Enterprises
for filenr in range(0,113):
    tx2 = graph.cypher.begin()
    statement2 = """
        load csv with headers from
        "file:/<path>/splitcsv/splitaddress/splitaddress_"""+str(filenr)+"""" as csv
        with distinct toUpper(csv.Zipcode) as Zipcode, toUpper(csv.StreetNL) as StreetNL, toUpper(csv.HouseNumber) as HouseNumber, csv.EntityNumber as EntityNumber
        match (e:Establishment {EstablishmentNumber: EntityNumber}),
        (street:Street {name: StreetNL, zip:Zipcode})<-[:PART_OF]-(h:HouseNumber {houseNumber: HouseNumber})
        create (e)-[:HAS_ADDRESS]->(h);
        """

    tx2.append(statement2)
    tx2.process()
    tx2.commit()
    print "Establishment Filenr: "+str(filenr)+" processed, at "+str(datetime.datetime.now())

Yay! This actually worked! Here's how the script started running over the files:


And then as you can see below, 22 minutes later all the Enterprises and Establishments were correctly linked to their addresses! Now we are getting somewhere!
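If you want to double-check the result, a quick count of the freshly created relationships does the trick - a minimal sketch, using the same py2neo setup as the script above:

from py2neo import Graph

graph = Graph()
# count the HAS_ADDRESS relationships that the two loops created
result = graph.cypher.execute("match ()-[r:HAS_ADDRESS]->() return count(r) as c")
print "Number of HAS_ADDRESS relationships: "+str(result[0].c)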

But... there is a but. The last step of this exercise is to connect the Enterprises to their "Activities", which are part of the Code-tree in our model. And although I actually created a Python script to do that - and that script actually worked quite well - it was just too slow.
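I won't reproduce that script in full, but it followed exactly the same pattern as the one above. Here's a sketch of the idea - with the splitactivity_ file names, the number of chunks, the :Code label and its code property, the NaceCode/EntityNumber columns and the HAS_ACTIVITY relationship type all being assumptions for illustration, not necessarily what the real script used:

import datetime
from py2neo import Graph
from py2neo.packages.httpstream import http
http.socket_timeout = 9999

graph = Graph()

num_chunks = 100  # hypothetical - however many chunks the activity.csv split produced

for filenr in range(0, num_chunks):
    tx = graph.cypher.begin()
    # the :Code label and the NaceCode/EntityNumber columns are assumptions here
    statement = """
        load csv with headers from
        "file:/<path>/splitcsv/splitactivity/splitactivity_"""+str(filenr)+"""" as csv
        match (e:Enterprise {EnterpriseNumber: csv.EntityNumber}),
              (c:Code {code: csv.NaceCode})
        create (e)-[:HAS_ACTIVITY]->(c);
        """
    tx.append(statement)
    tx.process()
    tx.commit()
    print "Activity Filenr: "+str(filenr)+" processed, at "+str(datetime.datetime.now())

Structurally that worked fine - it just couldn't chew through the much bigger activity.csv in a reasonable amount of time.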

So that meant going back to the drawing board and figuring out another way to do this in a reasonable amount of time. In hindsight, very little of what I wrote about in this blogpost was used for the actual import - but I wanted to show you everything that I did, and all of the stuff that I learned about bash and Python and Neo4j...

All of the material mentioned in this blog series is on GitHub if you want to take a look at it.

Hope this was still useful.

Cheers

Rik
