Monday 18 July 2016

Graphing the Tour de France - part 2/3

In a previous blog post, I created a couple of Google spreadsheets with some of the results data of the 2016 Tour de France. These spreadsheets can easily be downloaded as two comma-separated files that hold the data.
I will be updating the stages.csv files as the Tour progresses, so we can keep updating the graph as well.

Creating a model

To import these CSV files into Neo4j, I actually went through multiple iterations of the model. Here are two of them that I wanted to share with you - not because one of them is "right" and the other one is "wrong", but because they show that your use case - the questions that you want to ask of your data and what you want to do with that data - is going to determine the model. Underlined. In Bold. Because it's so important.

Here's the first version of the model:
When I started thinking about this, however, I found that this model - while quite interesting because of its simplicity - would not really give me a lot of expressive query capabilities... I decided to turn it around a bit, and go for a much more granular model.

Here's what I ended up going for:


I am sure there's stuff that I will want to change later on - but conceptually it's much better I think. It allows for many more queries about important Tour de France concepts - like the day's stage podium or the day's bearer of the important jerseys. So I think we can start with that, and use that as a basis for our import work. Let's go there now.

Importing the data

At the end of the day, the actual import process is NOT a very complicated one. I use the same technique time and time again - I take the CSV, look at which columns are logically bound together, and import those groups of columns one by one into the graph model.
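To make that technique a bit more concrete, here is a minimal sketch of what one such "column group" import could look like. It is not taken from the actual import scripts: the column names (stage_seq, stage_type) and the IS_OF_TYPE relationship type are assumptions made up for illustration, and only the Stage and StageType labels and their seq and name properties come from the model used in this post.

```cypher
// Sketch only: stage_seq and stage_type are hypothetical column names,
// so check them against the header row of your own stages.csv.

// Pass 1: one column -> one set of nodes (skip rows where the cell is empty,
// because merge fails on a null property value).
load csv with headers from "file:///stages.csv" as csv
with csv where csv.stage_type is not null
merge (st:StageType {name: csv.stage_type});

// Pass 2: another column -> another set of nodes.
// Note that load csv reads every value as a string.
load csv with headers from "file:///stages.csv" as csv
with csv where csv.stage_seq is not null
merge (s:Stage {seq: csv.stage_seq});

// Pass 3: connect the two, using the same columns as lookup keys.
// IS_OF_TYPE is a hypothetical relationship type.
load csv with headers from "file:///stages.csv" as csv
match (s:Stage {seq: csv.stage_seq}), (st:StageType {name: csv.stage_type})
merge (s)-[:IS_OF_TYPE]->(st);
```

Using merge rather than create means every pass can be re-run safely when the CSV gets updated: existing nodes and relationships are matched instead of duplicated.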

You can find the two import scripts in this gist.
You will see quite a few differences - and really it's up to you to see what you think is most useful.

Running the Import

If you want to understand every step of the import, then of course you can step through the import script and paste every part of the script into the Neo4j Browser. Or you could do as we have done in the past, and just copy/paste the entire import script into the neo4j-shell - works like a charm. However, here's another great way of doing this that I want to introduce to you.

As you may know, we now have some great new tools in Neo4j 3.x based on the "Awesome Procedures", aka APOCs. Procedures are a feature that was introduced in Neo4j 3.x, and they provide an unbelievably convenient and extensible way to integrate your Java code with Cypher - there really is no end to what you can do with it. All you need to do is grab the latest release (a .jar file - to be found over here), drop it in the plugins directory of your Neo4j server, restart the server so that it picks up the plugin, and you should be good to go.
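If you want to check that the server has actually picked up the .jar, one option is to ask Neo4j which procedures it knows about. This is a small sketch that assumes a Neo4j 3.x server, where the built-in dbms.procedures() procedure is available; newer Neo4j versions use a SHOW PROCEDURES command instead.

```cypher
// Count the registered procedures whose name starts with "apoc".
// A count of 0 suggests the plugin was not loaded.
call dbms.procedures() yield name
where name starts with "apoc"
return count(*) as apocProcedures;
```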

In this import exercise, I am going to use two Awesome Apocs (AAPOCs) that really make the world spin - at least for me.

Setting up the indexes

To prepare for the import to work well, it really helps if we set up our indexes and constraints appropriately. In the data model that I want to import, I want to create 4 indexes and one constraint. Normally I would have to run these 5 Cypher statements:
create index on :StageType(name); 
create index on :Rider(fullname); 
create index on :Team(name); 
create index on :Stage(seq); 
create constraint on (c:City) assert c.name is unique;
In the Neo4j browser, that would be 5 different statements, or in the Neo4j-shell, I would have to copy/paste these commands separately. But now, NOW WE HAVE APOCs, and as we can read in the APOC manual, there's a specific procedure that allows us to do these 5 operations in one go. All we need to do is
call apoc.schema.assert({StageType:["name"], Rider:["fullname"], Team:["name"], Stage:["seq"]},{City:["name"]}) yield label, key, unique, action;
and all of the schema changes will be implemented with one concise command:
Or if we do the same thing from the browser, we get


Now we are ready to do the actual import - so who knows, maybe now we can actually do something similar with the import itself.
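Before moving on, it may be worth confirming that the schema looks the way we expect. A minimal sketch, assuming a Neo4j 3.x server where the built-in db.indexes() and db.constraints() procedures exist (later versions replaced them with the SHOW INDEXES and SHOW CONSTRAINTS commands):

```cypher
// Should list the indexes on StageType(name), Rider(fullname),
// Team(name) and Stage(seq), plus the index that backs the
// uniqueness constraint on City(name).
call db.indexes();

// Should list the uniqueness constraint on City(name).
call db.constraints();
```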

Running the import statements

As I mentioned above, all the import statements (for the two different model implementations) are in the gist. Now, if we pick one of them (the second gist, at this location, raw file over here), then we can look for another APOC that allows us to execute all of these import statements in one go. Here it is - so all we need to do is use it from the Neo4j browser by calling
call apoc.cypher.runFile('https://gist.githubusercontent.com/rvanbruggen/c8d09f2c2fe174ebf818c344adad4fee/raw/840b9a3d92c32139b3908e78a99214894612399a/2%2520-%2520import_tdf2016_v2.cql') yield row, result;
The APOC will then grab the .cql file on GitHub, execute all of the ;-separated statements, and give us a nice overview of the result:

The APOC is basically reporting on every operation that it found in the .cql file at the URL above, and telling us what changes it made to the graph. If I now start exploring the graph, I will find all the data there:
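One quick way to sanity-check the result without clicking around is to count what ended up in the graph. This is just a generic sketch that makes no assumptions about the model; the labels and relationship types it returns depend on which of the two import scripts you ran.

```cypher
// How many nodes are there for each label (or combination of labels)?
match (n)
return labels(n) as labels, count(*) as nodes
order by nodes desc;

// How many relationships are there of each type?
match ()-[r]->()
return type(r) as type, count(*) as rels
order by rels desc;
```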


So that's where this blog post series is going to go next: exploring the graph that we have just imported, and seeing if we can do any nice queries on it.

So: stay tuned for the next post!

Cheers

Rik


