As a refresher, here's the dataset that we are looking at:
As you can see, the address.csv file is quite big - and that was already becoming a problem in part 1. But I quickly realised that if I then wanted to connect the Enterprises and Establishments to their respective Activities by loading the 669 MB activity.csv file, the problems would only get bigger. I needed a solution.
Bash and Python to the rescue
So here was my first idea on how to solve this:
- I would figure out a way to split the address.csv and/or activity.csv file into multiple smaller csv files
- I would then create a python script that would iterate over all of the generated files, and execute the loading transactions over these much smaller CSV files.
Sounded like a good idea to me, so I explored it and actually partially succeeded - and learned some more bash and python along the way :) ... what's not to like?
Here's the Bash script to split the csv file, in several distinct steps:
1. First create the split files with 25000 lines each
tail -n +2 ./sourcecsv/address.csv | split -l 25000 - ./splitcsv/splitaddress2/splitaddress_

This command takes the "tail" of the address.csv file starting from line 2, and pipes that into the split command, which generates a separate file for every 25000 lines that it encounters. The output looks like this:
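Just to illustrate (this listing is my own sketch, not a screenshot of the actual run): split appends two-letter alphabetic suffixes to the prefix you give it, so an ls of the output directory shows files along these lines:

ls ./splitcsv/splitaddress2/
# splitaddress_aa  splitaddress_ab  splitaddress_ac  splitaddress_ad  ...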
Then of course I also needed to copy the header row of the original address.csv file to each of the splitaddress_ files. That one took some time for me to figure out, but I managed it with this simple script:
for file in ./splitcsv/splitaddress_*
do
  head -n 1 ./sourcecsv/address.csv > tmp_file
  cat $file >> tmp_file
  mv -f tmp_file $file
done

What this does is simple: it loops through all the splitaddress_* files, takes the first line of the address.csv file, copies that into a tmp_file, concatenates the splitaddress_* file onto the tmp_file, and then renames tmp_file back to the splitaddress_* file... easy! So then you get a bunch of smaller, 25000-line csv files looking like this one:
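As a quick sanity check - my addition here, not part of the original steps - you can verify that every split file now starts with the same header row:

# Print the first line of every split file and collapse duplicates;
# if the header copy worked, this prints exactly one line.
for file in ./splitcsv/splitaddress_*
do
  head -n 1 $file
done | sort -u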
The last step in my Shell wizardry was to then rename the files to have numeric increments instead of alphabetic ones - just so that we could process them in a python script and iterate over the list of files. This turned out to be a bit trickier and I had to do some significant googling and then copying and pasting :) ... Here's what I ended up with:
ls -trU ./splitcsv/splitaddress_* | awk 'BEGIN{ a=0 }{ printf "mv %s ./splitcsv/splitaddress_%d\n", $0, a++ }' | bash
So basically three commands:
- listing the files splitaddress_*
- passing the list to awk, which builds an mv command for each file, giving it an incrementing numeric suffix
- piping those mv commands to bash to execute them
Not trivial - but it works:
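As an aside, the same rename could also be done without awk, with a plain bash loop - a minimal sketch, assuming (like the awk version) that the file names contain no spaces:

# Loop over the split files in the same order and give each one
# an incrementing numeric suffix.
a=0
for file in $(ls -trU ./splitcsv/splitaddress_*)
do
  mv $file ./splitcsv/splitaddress_$a
  a=$((a+1))
done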
So that gave me a bunch of smaller, 25k line files that I could work with in python. So let's see how I did that.
Iterating over CSV files with Python
I created a simple python script to iterate over the files - here it is:

import datetime
from py2neo import Graph
from py2neo.packages.httpstream import http

http.socket_timeout = 9999

graph = Graph()

print "Starting to process links between Enterprises and Addresses..."

# Loop over the numerically renamed split files (0..112) and connect each
# Enterprise to its HouseNumber with a HAS_ADDRESS relationship.
for filenr in range(0,113):
    tx1 = graph.cypher.begin()
    statement1 = """
        load csv with headers from
        "file:/<path>/splitcsv/splitaddress/splitaddress_"""+str(filenr)+"""" as csv
        with distinct toUpper(csv.Zipcode) as Zipcode, toUpper(csv.StreetNL) as StreetNL, toUpper(csv.HouseNumber) as HouseNumber, csv.EntityNumber as EntityNumber
        match (e:Enterprise {EnterpriseNumber: EntityNumber}),
        (street:Street {name: StreetNL, zip:Zipcode})<-[:PART_OF]-(h:HouseNumber {houseNumber: HouseNumber})
        create (e)-[:HAS_ADDRESS]->(h);
        """
    tx1.append(statement1)
    tx1.process()
    tx1.commit()
    print "Enterprise Filenr: "+str(filenr)+" processed, at "+str(datetime.datetime.now())

print "Starting to process links between Establishments and Addresses..."

# Same loop again, this time connecting Establishments to their addresses.
for filenr in range(0,113):
    tx2 = graph.cypher.begin()
    statement2 = """
        load csv with headers from
        "file:/<path>/splitcsv/splitaddress/splitaddress_"""+str(filenr)+"""" as csv
        with distinct toUpper(csv.Zipcode) as Zipcode, toUpper(csv.StreetNL) as StreetNL, toUpper(csv.HouseNumber) as HouseNumber, csv.EntityNumber as EntityNumber
        match (e:Establishment {EstablishmentNumber: EntityNumber}),
        (street:Street {name: StreetNL, zip:Zipcode})<-[:PART_OF]-(h:HouseNumber {houseNumber: HouseNumber})
        create (e)-[:HAS_ADDRESS]->(h);
        """
    tx2.append(statement2)
    tx2.process()
    tx2.commit()
    print "Establishment Filenr: "+str(filenr)+" processed, at "+str(datetime.datetime.now())
Yey! This actually worked! Here's how the script started running over the files:
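One side note on the script: the two loops are identical except for the node label and the property they match on, so they could be folded into one helper function. Here's a minimal sketch, using the same py2neo calls as above - the link_addresses name and its parameters are just made up for illustration:

import datetime

# Sketch only: one helper for both loops, parameterised on label and id property.
def link_addresses(graph, label, id_property, number_of_files=113):
    for filenr in range(0, number_of_files):
        tx = graph.cypher.begin()
        statement = """
            load csv with headers from
            "file:/<path>/splitcsv/splitaddress/splitaddress_"""+str(filenr)+"""" as csv
            with distinct toUpper(csv.Zipcode) as Zipcode, toUpper(csv.StreetNL) as StreetNL, toUpper(csv.HouseNumber) as HouseNumber, csv.EntityNumber as EntityNumber
            match (e:"""+label+""" {"""+id_property+""": EntityNumber}),
            (street:Street {name: StreetNL, zip:Zipcode})<-[:PART_OF]-(h:HouseNumber {houseNumber: HouseNumber})
            create (e)-[:HAS_ADDRESS]->(h);
            """
        tx.append(statement)
        tx.process()
        tx.commit()
        print label+" Filenr: "+str(filenr)+" processed, at "+str(datetime.datetime.now())

# which would then be called as:
# link_addresses(graph, "Enterprise", "EnterpriseNumber")
# link_addresses(graph, "Establishment", "EstablishmentNumber")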
But... there is a but. The last step of this exercise is to connect the Enterprises to their "Activities", which are part of the Code-tree in our model. And although I actually created a Python script to do that, and that script worked quite well - it was just too slow.
So that meant going back to the drawing board and figuring out another way to do this in a reasonable amount of time. In hindsight, everything I wrote about in this blogpost was not really used for the actual import - but I wanted to show you everything that I did and all of the stuff that I learned about bash and python and Neo4j...
All of the material mentioned on this blog series is on github if you want to take a look at it.
Hope this was still useful.
Cheers
Rik