Wednesday 10 August 2016

The Great Olympian Graph - part 2/3

In the previous blogpost of this Olympic series I explained how I got to the dataset in 4 distinct .csv files that get generated from a 4-worksheet Google Spreadsheet. Here are the links to the 4 sheets:
Now, in order to load that data into Neo4j, I had to come up with a meaningful graph model.

My Olympic Graph Model

After a couple of iterations, I settled on the following model:

In order to load the data like that, I would obviously have to prepare a few things, more specifically make sure that the schema (indexes, constraints) of our Neo4j graph would be nicely prepared. 

Set up the indexes

I ended up creating the following schema:
create index on :Country(name);create index on :City(name);create index on :Sport(name);create index on :Discipline(name);create index on :Athlete(name);create index on :Event(name);create index on :Year(name);create index on :Gender(name);create constraint on (c:Country) assert c.noc is unique;
Which would allow me to load the data into Neo4j efficiently. Then we can start loading the actual data from the different CSV files.

Load the data into the model

Loading the data into Neo4j was not very difficult, and as usual I used my best practice of doing multiple runs of the LOAD CSV data loading processes, each of which would be doing fairly small operations:
  • Loading the Genders of the Athletes
  • Loading the Countries
  • Loading the Cities in these Countries
  • Loading the sports taxonomy
  • Loading the different games
  • Loading the different events in different disciplines of the taxonomy at the different games
  • Loading the different athletes / medallists and their genders
All of that went pretty smoothly.
//add the genders 
create (g:Gender {name:"Men"});create (g:Gender {name:"Women"}); 
//load countries
load csv with headers from "" as csv 
create (c:Country {name: csv.Country, noc: csv.NOC, iso: csv.ISOcode}); 
//load cities 
load csv with headers from "" as csv 
match (c:Country {noc: csv.NOC})merge (ci:City {name: csv.City})merge (ci)-[:LOCATED_IN]->(c); 
//load sports taxonomy 
load csv with headers from "" as csv 
merge (s:Sport {name: csv.Sport})merge (s)<-[:PART_OF]-(d:Discipline {name: csv.Discipline}); 
//load the games 
load csv with headers from "" as csv 
match (c:City {name: csv.City})merge (c)-[:HOSTS_GAMES]-(y:Year {name: csv.Edition}); 
//load the events at the games 
load csv with headers from "" as csv 
match (y:Year {name: csv.Edition}), (d:Discipline {name: csv.Discipline})merge (d)<-[:PART_OF]-(e:Event {name: csv.Event})-[:AT_GAMES]-(y); 
//load the medallists 
load csv with headers from "" as csv 
match (g:Gender {name: csv.Gender})merge (a:Athlete {name: csv.Athlete})-[:HAS_GENDER]->(g);
But then of course, I hit a problem. Specifically, when I tried to connect the Athletes to the Countries they represent, I found that the loading process would take forever. 

The loading process did eventually finish, but only after like 800+ seconds on my tiny little laptop. What. Was. Going. On???

The Cost vs the Rule based planner

After a bit of troubleshooting, I found out that I had in fact stumbled on a little bug in Neo4j. In order to explain what happened, I'd like to step back a little bit, and explain a couple of fundamental things about Neo4j's Cypher query language and the fundamentals underneath it.

For those of you familiar with CYPHER and declarative database query languages in general, you know that there are several steps for a query to get executed in a database like Neo4j. One of the most crucial steps in that process is the query planning - where the database tries to make sense of the question that you are trying to ask, and comes up with a strategy to retrieve that data that you declared that you want, from disk/memory. It’s a crucial step, because choosing the right or wrong strategy can mean a whole lot to a database query: think of it like a route planning decision making process - if you at one point in your route planning take a wrong turn, it can make the difference between fast/slow routing, for sure.
Initially, Cypher used a “rule-based system” for coming up with that strategy. Using a set of predetermined rules, it would try to use general, non-domain-specific knowledge about query planning to come up with a reasonable strategy. That worked well for certain operations, but could lead to serious query performance loss on specific types of datamodels. Which is why we wanted to come up with a query planning system that would use specific knowledge about the data in the database to derive a more appropriate strategy. This is what we call “COST-based” query planning: using the heuristics and structure of the dataset at hand to come up with a better strategy. Here's how the Neo4j manual outlines the cost vs. rule based planners:
  • Rule: This planner has rules that are used to produce execution plans. The planner considers available indexes, but does not use statistical information to guide the query compilation. 
  • Cost: This planner uses the statistics service in Neo4j to assign cost to alternative plans and picks the cheapest one. While this should lead to superior execution plans in most cases, it is still under development.
The Neo4j Cypher team has been hard at work for the past couple of versions of the database to introduce this cost-based querying methodology - and has been very successful at it. But: it was a staged introduction (starting with READ operations, and only recently introducing COST-based write operations), and… it contained some issues here and there. And one of these issues was hit during my Olympic data loading. The problem I was seeing with my LOAD CSV operation on the country representations, was caused by the new Cost-based query planner for WRITE operations, which was producing a cr@ppy plan for the loading. Let's illustrate that by using Neo4j's nice and convenient way to switch query planners. All I need to do is prefix my cypher statement with a specifc instruction, and it will switch back from the default COST-based system to the (in this case, temporarily) more appropriate RULE-based system.

Different planners yield different plans

If I try to connect Athletes to their countries with the default cost based planner:
load csv with headers from
"" as csv
match (a:Athlete {name: csv.Athlete}), (c:Country {noc: csv.NOC})
merge (a)-[:REPRESENTS]->(c);
I will get a plan like this
And this takes forever to execute. If however I prefix this with the instruction to choose the RULE-based planner:
cypher planner=rule 
load csv with headers from "" as csv 
match (a:Athlete {name: csv.Athlete}), (c:Country {noc: csv.NOC})merge (a)-[:REPRESENTS]->(c);
The plan looks very different, much simpler, and therefore also much faster.
No doubt this problem will be fixed in the next few update releases to Neo4j, but for know, I will do the same thing with the last, more connecting query:
cypher planner=rule 
load csv with headers from"" as csv 
match (d:Discipline {name: csv.Discipline})-[:PART_OF]->(s:Sport {name: csv.Sport})match (a:Athlete {name: csv.Athlete}) 
match (e:Event {name: csv.Event})-->(y:Year {name: csv.Edition}) 
create (a)-[:WINS]->(m:Medal {type: csv.Medal})-[:REWARD_FOR]->(e);
I can quickly verify that by looking at the model and some stats using the Awesome Apocs:

Looking good! The entire dataset has been loaded and is now ready for querying. That’s of course what we will do in part 3 of this blogpost series.

No comments:

Post a Comment