Friday, 21 February 2014

Some Neo4j import tweaks - what and where


As you probably know, importing data into Neo4j can be a bit tricky, in spite of some of the wonderful tools that we have these days. I blogged about this last year, and if you are looking for some guidance then please have a look at that post.

It turns out that, in order to get the most out of your import efforts, there are a few settings that you should be aware of and tweak, depending on your specific environment. Your machine's memory is of paramount importance, and your dataset also determines some of the optimization characteristics discussed below.

Essentially there are three parameters to tweak:
  • the Java heap size
  • the memory mapping of the neo4j store files
  • the neo4j cache configuration.
The following table will try to provide an overview of a number of settings that you can add to your neo4j installation/tooling to optimize your data import performance. Let me start by explaining some of these settings for the Batch Importer:


Settings

Batch Importer

Heap size
Add parameters to the batch importer's command-line start statement:

-Xms<size> : this sets the initial Java heap size
-Xmx<size> : this sets the maximum Java heap size

Memory mapping settings
Important Note: for the Batch Importer, memory mapping settings are PART OF the heap settings above: memory-mapped files consume part of the heap. That’s why you should give the batch importer as much heap as possible, while leaving 1-4GB to the operating system.
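
As a rough illustration of that rule of thumb, here is a minimal Python sketch that turns a machine's RAM into heap flags for the batch importer. The function name `heap_flags` and the 2GB default reserve are my own assumptions, and so is setting the initial heap equal to the maximum heap; the only things taken from the text above are the -Xms/-Xmx flags and the 1-4GB that should be left to the operating system.

```python
def heap_flags(total_ram_gb: int, os_reserve_gb: int = 2) -> str:
    """Build -Xms/-Xmx flags that hand the JVM all RAM except an OS reserve.

    heap_flags is a hypothetical helper, not part of the batch importer.
    """
    if not 1 <= os_reserve_gb <= 4:
        raise ValueError("leave between 1 and 4 GB to the operating system")
    heap_gb = total_ram_gb - os_reserve_gb
    if heap_gb < 1:
        raise ValueError("not enough RAM left for the Java heap")
    # Assumption: initial and maximum heap are set to the same value.
    return f"-Xms{heap_gb}G -Xmx{heap_gb}G"


print(heap_flags(16))  # -Xms14G -Xmx14G
```

On a 16GB machine this reserves 2GB for the operating system and gives the remaining 14GB to the importer's heap, which the memory-mapped files then draw from.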

Try to memory map all of the node store, and as much of the relationship store files as possible.

Edit: /path/to/importer/batch.properties

use_memory_mapped_buffers=true
# 14 bytes per node
neostore.nodestore.db.mapped_memory=200M
# 33 bytes per relationship
neostore.relationshipstore.db.mapped_memory=3G
# 38 bytes per property
neostore.propertystore.db.mapped_memory=500M
# 60 bytes per long-string block
neostore.propertystore.db.strings.mapped_memory=500M
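
The per-record sizes in the comments above let you estimate how large each store file will become, and therefore how much memory it takes to map it completely. The following Python sketch does that arithmetic. The function name `mapped_memory_settings` and the record counts are made up for illustration, and rounding up to whole megabytes is my own choice.

```python
import math

# Bytes per record, copied from the comments in the settings above.
RECORD_BYTES = {
    "neostore.nodestore.db": 14,
    "neostore.relationshipstore.db": 33,
    "neostore.propertystore.db": 38,
    "neostore.propertystore.db.strings": 60,
}


def mapped_memory_settings(record_counts):
    """Return one mapped_memory line per store, sized to hold the whole store.

    mapped_memory_settings is a hypothetical helper for illustration only.
    """
    lines = []
    for store, bytes_per_record in RECORD_BYTES.items():
        size_bytes = record_counts.get(store, 0) * bytes_per_record
        # Round up to whole megabytes, with a floor of 1M.
        size_mb = max(1, math.ceil(size_bytes / (1024 * 1024)))
        lines.append(f"{store}.mapped_memory={size_mb}M")
    return lines


# Illustrative dataset: 10M nodes, 100M relationships,
# 20M properties and 5M long-string blocks.
example_counts = {
    "neostore.nodestore.db": 10_000_000,
    "neostore.relationshipstore.db": 100_000_000,
    "neostore.propertystore.db": 20_000_000,
    "neostore.propertystore.db.strings": 5_000_000,
}
for line in mapped_memory_settings(example_counts):
    print(line)
```

For this example dataset the node store comes out at 134M and the relationship store at 3148M. These numbers are the sizes needed to map each store in full; if they do not fit in the memory you have, map the node store first and give the relationship store whatever is left.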

Cache settings
For bulk update/import operations, the cache should be disabled, as you only write and no node or relationship objects are loaded.

Edit: /path/to/importer/batch.properties

cache_type=none


Then, let's explore the same settings for importing into a running neo4j server, for example using neo4j-shell-tools:


Settings

Neo4j Server / Neo4j-shell-tools

Heap size
Edit: /path/to/neo4j/conf/neo4j-wrapper.conf


# Initial Java Heap Size (in MB)
wrapper.java.initmemory=4096

# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=4096

Memory mapping settings
Important Note: for the Neo4j server (and neo4j-shell-tools that run against a server), memory mapping settings are SEPARATE FROM the heap settings. Your heap memory allocation comes in addition to the memory mapping allocation. Usually you use between 4 and 8GB as heap. The remainder of your RAM is used for memory mapping.

Note that for users who run Neo4j on Windows, there is a significant difference: there, the memory mapping is part of the heap, and the principle explained in the batch-importer section should be followed.

Try to memory map all of the node store, and as much of the relationship store files as possible.

Edit: /path/to/neo4j/conf/neo4j.properties

The settings to be added to this file are identical to the ones mentioned for the Batch Importer:

use_memory_mapped_buffers=true
# 14 bytes per node
neostore.nodestore.db.mapped_memory=200M
# 33 bytes per relationship
neostore.relationshipstore.db.mapped_memory=3G
# 38 bytes per property
neostore.propertystore.db.mapped_memory=500M
# 60 bytes per long-string block
neostore.propertystore.db.strings.mapped_memory=500M
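
To make the server-side split concrete, here is a small Python sketch of the principle described in this section: the heap comes off the top, the node store is mapped in full if it fits, and the relationship store gets what remains. Everything named here is hypothetical: the function `split_mapping_budget`, the `other_mb` parameter (memory you want to keep back for the operating system and the property stores) and the store sizes in the example. It does not apply to Windows, where the mapping lives inside the heap.

```python
def split_mapping_budget(total_ram_mb, heap_mb, node_store_mb,
                         rel_store_mb, other_mb=0):
    """Split the RAM left after the heap between node and relationship stores.

    Hypothetical helper; other_mb is an assumed reserve for everything else
    (operating system, property stores).
    """
    budget_mb = total_ram_mb - heap_mb - other_mb
    if budget_mb <= 0:
        raise ValueError("no RAM left for memory mapping")
    # Map all of the node store first, if it fits in the budget.
    node_mb = min(node_store_mb, budget_mb)
    # Then map as much of the relationship store as the remainder allows.
    rel_mb = min(rel_store_mb, budget_mb - node_mb)
    return {
        "neostore.nodestore.db.mapped_memory": f"{node_mb}M",
        "neostore.relationshipstore.db.mapped_memory": f"{rel_mb}M",
    }


# Illustrative 8GB machine with a 4GB heap and 1GB kept back for other uses.
settings = split_mapping_budget(8192, 4096, node_store_mb=134,
                                rel_store_mb=3148, other_mb=1024)
for key, value in settings.items():
    print(f"{key}={value}")
```

In this example the mapping budget is 3072MB, so the 134MB node store is mapped completely and the relationship store is capped at 2938MB rather than its full 3148MB.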


Cache settings
As you create relationships by looking up and updating nodes, the cache should be kept active on a running neo4j server that you are loading data into. Here we have a difference between the Community and Enterprise editions of neo4j: the Enterprise edition has a better cache that is not present in Community, the “High Performance Cache”. Therefore, for bulk update/import operations, you should edit the neo4j.properties file in the conf directory of your neo4j installation and pick the setting that matches your edition:

Edit: /path/to/neo4j/conf/neo4j.properties

# Setting for Community Edition:
cache_type=weak

# Setting for Enterprise Edition:
cache_type=hpc


I am hoping that this was a good overview of the different settings that you should keep in mind and tweak, and where you should tweak them, in your specific environment.

Hope this was useful

Rik