Tuesday 12 November 2019

Playing with the Colruyt Data Science assignment

If you spend any time in the Wonderful World of Graphs, I am sure you have noticed that the landscape has been changing in the past few years. I have definitely seen a change: the interest in using graphs has shifted from wanting to use graph databases for "data retrieval" purposes, to now also wanting to make use of it ton "make sense of" the data - basically doing data analytics. Of course data retrieval and data analysis are related, and in many cases we nowadays talk about all of this under the umbrella of data science. Sounds great, and at Neo4j we have made fantastic strides in making new functionality (think the Algo library that you can install on every Neo4j server, or think the Neuler graphapp that makes using the Algo library a walk in the park) available to enable these workloads - a work in progress that will only accelerate.

So recently I came across a post from Dries De Dauw on LinkedIn: he was mentioning how was working on a set of exercises that his new data science team members needed to work on to get a feel for their skills and competencies, and also familiarise them with the data domain of Colruyt Group.

He was referring to this particular page on github, containing a python notebook with a number of interesting exercises. One of the exercises uses a sample dataset that is contained in a .json file that you can download from a url.

Given the fact that it had been some time since I had played around with some example datasets, and that I really wanted to try out some new tools that I had recently learned about at the Neo4j Online Developer Summit (NODES) (like for example the new graphapp that allows you to explore geospatial data really easily, by Estelle Scifo),  I decided to take the dataset for a spin in Neo4j.

Importing the Colruyt dataset

The .json file is not 100% straightforward, as it contains some substructures embedded in every part: there's Branches, Branch addresses, and some type information available in the dataset. In the Colruyt assignment people were asked to "flatten" that - and effectively de-normalise the dataset, but ... why would we do that in a graph database. In Neo4j joins are cheap, normalisation is cheap, and flattening is just plain unnecessary. So I would not do that. Here's how I imported the file.

The model is really simple (created with the new Arrows tool):

First set up the indexes in Neo4j:

//create indexes and constraint
create index on :Address(streetName);
create index on :City(name);
create index on :Country(name);
create index on :Branch(commercialName);
create index on :Branch(geoCoordinates);
create constraint on (b:Branch) assert b.id is unique;

Then we can actually read the data from the .json file, using an APOC procedure called apoc.load.json:

//import addresses, placetypes, branches, and connect the branches to addresses and placetypes using apoc
WITH "https://ecgplacesmw.colruytgroup.com/ecgplacesmw/v3/nl/places/filter/clp-places" AS url
call apoc.load.json(url) yield value
merge ( a:Address { houseNumber: value.address.houseNumber, streetName: value.address.streetName, countryName: value.address.countryName, cityName: value.address.cityName, postalcode: value.address.postalcode, geoCoordinates: point({latitude: toFloat(value.geoCoordinates.latitude), longitude: toFloat(value.geoCoordinates.longitude) } ) } )
merge (pt:PlaceType {id: value.placeType.id, description: value.placeType.placeTypeDescription, name: value.placeType.longName})
merge (b:Branch
  {id: value.branchId})
  set b.placeId=value.placeId
  set b.isActive=value.isActive
  set b.sourceStatus=value.sourceStatus
  set b.routeUrl=value.routeUrl
  set b.moreInfoUrl=value.moreInfoUrl
  set b.commercialName=value.commercialName
  set b.handoverServices=value.handoverServices
  set b.sellingPartners=value.sellingPartners
merge (b)-[:IN_STREET]->(a)
merge (b)-[:HAS_PLACETYPE]->(pt)
return b,a,pt;

As you can see from the above, the only thing special here is that I used the new-ish spatial datatypes in Neo4j to create a geoCoordinates property that has actually spatial characteristics. I can do some basic geospatial querying on that, straight out of the box.

Next Make the data a bit graphier by extract recurring patterns and making them part of a more normalised structure.

//extract city from addresses
match (a:Address)
merge (c:City {name: a.cityName, postalcode: a.postalcode})
merge (a)-[:IN_CITY]->(c);

//extract countries from addresses
match (a:Address)--(c:City)
merge (co:Country {name: a.countryName})
merge (c)-[:IN_COUNTRY]->(co);

That's it really. Just looking at the data it really is a set of hierarchies that we can now start querying with Neo4j.
The model is what we said it would be:

Now for some real fun.

Querying this dataset in Neo4j

Obviously the dataset is not very elaborate, but there were some interesting queries that we can start doing, specifically with the geospatial data that is in the dataset. Let's start with some easu stuff: let's find a Colruyt branch close (< 2kms) to my home - the query really is totally simple:

//find branches close to homewith point({latitude: 51.2081948, longitude: 4.4049826}) as home
match path = (:Country)--(:City)--(a:Address)--(:Branch)--(:PlaceType)
where distance(a.geoCoordinates, home) < 2000
return path;

I immediately get the branch info:

Now let's make it a bit more complicated. Let's find all the branches that are close to one another:

//find all addresses/branches that are within 2000m from one anothermatch (a1:Address)--(b1:Branch), (a2:Address)--(b2:Branch)
where id(a1)<id(a2)
and distance(a1.geoCoordinates, a2.geoCoordinates) < 2000
return b1.commercialName, b2.commercialName, distance(a1.geoCoordinates, a2.geoCoordinates)/1000+" km"

This gives me an interesting list:

Now, what if we wanted to visualise that data on an actual map. Fortunately, plotting the geo data just got a whole lot easier with this simple graph app that you can install into the Neo4j Desktop: neo-map.

I did have to do one hack to use neo-map: I had to convert the point properties (=geospatial datatype in Neo4j, which allows me to do the distance calculations above really easily) into simpler "latitude" and "logitude" properties to hook that up to neo-map. It was a very simple query to make the point-properties into number-properties:

//duplicate geocoordinates: stored both as geospatial data types, and as lat/long numbers
match (a:Address) 
 set a.longitude = a.geoCoordinates.longitude
 set a.latitude = a.geoCoordinates.latitude;
and then neo-map allows me to really quickly generate a marker map:
and a heatmap:
or a combination of both layers:

You can easily see how we could overlay that with other data and get some interesting insights really quickly!

Of course there's lots of other stuff possible - but I just wanted to share this quick experiment.

All data and scripts are available on github, as usual.

Look forward to hearing your feedback, if any!



No comments:

Post a Comment