- The Countries sheet, including the country codes
- The Cities sheet, mapped to the countries
- The Sports Taxonomy sheet, with the sports and disciplines
- and finally: The Medallists sheet.
My Olympic Graph Model
After a couple of iterations, I settled on the following model:
In order to load the data like that, I would obviously have to prepare a few things, more specifically make sure that the schema (indexes, constraints) of our Neo4j graph would be nicely prepared.
Set up the indexes
I ended up creating the following schema:
create index on :Country(name);create index on :City(name);create index on :Sport(name);create index on :Discipline(name);create index on :Athlete(name);create index on :Event(name);create index on :Year(name);create index on :Gender(name);create constraint on (c:Country) assert c.noc is unique;
Which would allow me to load the data into Neo4j efficiently. Then we can start loading the actual data from the different CSV files.
Load the data into the model
Loading the data into Neo4j was not very difficult, and as usual I used my best practice of doing multiple runs of the LOAD CSV data loading processes, each of which would be doing fairly small operations:
- Loading the Genders of the Athletes
- Loading the Countries
- Loading the Cities in these Countries
- Loading the sports taxonomy
- Loading the different games
- Loading the different events in different disciplines of the taxonomy at the different games
- Loading the different athletes / medallists and their genders
All of that went pretty smoothly.
//add the genders
create (g:Gender {name:"Men"});create (g:Gender {name:"Women"});
//load countries
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w/export?format=csv&id=1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w&gid=2" as csv
create (c:Country {name: csv.Country, noc: csv.NOC, iso: csv.ISOcode});
//load cities
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w/export?format=csv&id=1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w&gid=2027901504" as csv
match (c:Country {noc: csv.NOC})merge (ci:City {name: csv.City})merge (ci)-[:LOCATED_IN]->(c);
//load sports taxonomy
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w/export?format=csv&id=1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w&gid=1457987568" as csv
merge (s:Sport {name: csv.Sport})merge (s)<-[:PART_OF]-(d:Discipline {name: csv.Discipline});
//load the games
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w/export?format=csv&id=1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w&gid=0" as csv
match (c:City {name: csv.City})merge (c)-[:HOSTS_GAMES]-(y:Year {name: csv.Edition});
//load the events at the games
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w/export?format=csv&id=1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w&gid=0" as csv
match (y:Year {name: csv.Edition}), (d:Discipline {name: csv.Discipline})merge (d)<-[:PART_OF]-(e:Event {name: csv.Event})-[:AT_GAMES]-(y);
//load the medallists
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w/export?format=csv&id=1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w&gid=0" as csv
match (g:Gender {name: csv.Gender})merge (a:Athlete {name: csv.Athlete})-[:HAS_GENDER]->(g);
But then of course, I hit a problem. Specifically, when I tried to connect the Athletes to the Countries they represent, I found that the loading process would take forever.
The loading process did eventually finish, but only after like 800+ seconds on my tiny little laptop. What. Was. Going. On???
The Cost vs the Rule based planner
After a bit of troubleshooting, I found out that I had in fact stumbled on a little bug in Neo4j. In order to explain what happened, I'd like to step back a little bit, and explain a couple of fundamental things about Neo4j's Cypher query language and the fundamentals underneath it.
For those of you familiar with CYPHER and declarative database query languages in general, you know that there are several steps for a query to get executed in a database like Neo4j. One of the most crucial steps in that process is the query planning - where the database tries to make sense of the question that you are trying to ask, and comes up with a strategy to retrieve that data that you declared that you want, from disk/memory. It’s a crucial step, because choosing the right or wrong strategy can mean a whole lot to a database query: think of it like a route planning decision making process - if you at one point in your route planning take a wrong turn, it can make the difference between fast/slow routing, for sure.
Initially, Cypher used a “rule-based system” for coming up with that strategy. Using a set of predetermined rules, it would try to use general, non-domain-specific knowledge about query planning to come up with a reasonable strategy. That worked well for certain operations, but could lead to serious query performance loss on specific types of datamodels. Which is why we wanted to come up with a query planning system that would use specific knowledge about the data in the database to derive a more appropriate strategy. This is what we call “COST-based” query planning: using the heuristics and structure of the dataset at hand to come up with a better strategy. Here's how the Neo4j manual outlines the cost vs. rule based planners:
- Rule: This planner has rules that are used to produce execution plans. The planner considers available indexes, but does not use statistical information to guide the query compilation.
- Cost: This planner uses the statistics service in Neo4j to assign cost to alternative plans and picks the cheapest one. While this should lead to superior execution plans in most cases, it is still under development.
Different planners yield different plans
If I try to connect Athletes to their countries with the default cost based planner:
load csv with headers from
"https://docs.google.com/a/neotechnology.com/spreadsheets/d/1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w/export?format=csv&id=1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w&gid=0" as csv
match (a:Athlete {name: csv.Athlete}), (c:Country {noc: csv.NOC})
merge (a)-[:REPRESENTS]->(c);
I will get a plan like this
And this takes forever to execute. If however I prefix this with the instruction to choose the RULE-based planner:
cypher planner=rule
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w/export?format=csv&id=1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w&gid=0" as csv
match (a:Athlete {name: csv.Athlete}), (c:Country {noc: csv.NOC})merge (a)-[:REPRESENTS]->(c);
The plan looks very different, much simpler, and therefore also much faster.
No doubt this problem will be fixed in the next few update releases to Neo4j, but for know, I will do the same thing with the last, more connecting query:
cypher planner=rule
load csv with headers from"https://docs.google.com/a/neotechnology.com/spreadsheets/d/1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w/export?format=csv&id=1l-23m3ZPTkz5HNGp-1uHCB1qqhLxq5LSrqNDqBD_T3w&gid=0" as csv
match (d:Discipline {name: csv.Discipline})-[:PART_OF]->(s:Sport {name: csv.Sport})match (a:Athlete {name: csv.Athlete})
match (e:Event {name: csv.Event})-->(y:Year {name: csv.Edition})
create (a)-[:WINS]->(m:Medal {type: csv.Medal})-[:REWARD_FOR]->(e);
I can quickly verify that by looking at the model and some stats using the Awesome Apocs:
Looking good! The entire dataset has been loaded and is now ready for querying. That’s of course what we will do in part 3 of this blogpost series.
Cheers
Rik
PS: you can find the entire load script in this gist on github.
No comments:
Post a Comment