So recently I came across a post from Dries De Dauw on LinkedIn: he was mentioning how was working on a set of exercises that his new data science team members needed to work on to get a feel for their skills and competencies, and also familiarise them with the data domain of Colruyt Group.
He was referring to this particular page on github, containing a python notebook with a number of interesting exercises. One of the exercises uses a sample dataset that is contained in a .json file that you can download from a url.
Given the fact that it had been some time since I had played around with some example datasets, and that I really wanted to try out some new tools that I had recently learned about at the Neo4j Online Developer Summit (NODES) (like for example the new graphapp that allows you to explore geospatial data really easily, by Estelle Scifo), I decided to take the dataset for a spin in Neo4j.
Importing the Colruyt dataset
The .json file is not 100% straightforward, as it contains some substructures embedded in every part: there's Branches, Branch addresses, and some type information available in the dataset. In the Colruyt assignment people were asked to "flatten" that - and effectively de-normalise the dataset, but ... why would we do that in a graph database. In Neo4j joins are cheap, normalisation is cheap, and flattening is just plain unnecessary. So I would not do that. Here's how I imported the file.The model is really simple (created with the new Arrows tool):
First set up the indexes in Neo4j:
//create indexes and constraint
create index on :Address(streetName);
create index on :City(name);
create index on :Country(name);
create index on :Branch(commercialName);
create index on :Branch(geoCoordinates);
create constraint on (b:Branch) assert b.id is unique;
//import addresses, placetypes, branches, and connect the branches to addresses and placetypes using apoc
WITH "https://ecgplacesmw.colruytgroup.com/ecgplacesmw/v3/nl/places/filter/clp-places" AS url
call apoc.load.json(url) yield value
merge ( a:Address { houseNumber: value.address.houseNumber, streetName: value.address.streetName, countryName: value.address.countryName, cityName: value.address.cityName, postalcode: value.address.postalcode, geoCoordinates: point({latitude: toFloat(value.geoCoordinates.latitude), longitude: toFloat(value.geoCoordinates.longitude) } ) } )
merge (pt:PlaceType {id: value.placeType.id, description: value.placeType.placeTypeDescription, name: value.placeType.longName})
merge (b:Branch
{id: value.branchId})
set b.placeId=value.placeId
set b.isActive=value.isActive
set b.sourceStatus=value.sourceStatus
set b.routeUrl=value.routeUrl
set b.moreInfoUrl=value.moreInfoUrl
set b.commercialName=value.commercialName
set b.handoverServices=value.handoverServices
set b.sellingPartners=value.sellingPartners
merge (b)-[:IN_STREET]->(a)
merge (b)-[:HAS_PLACETYPE]->(pt)
return b,a,pt;
As you can see from the above, the only thing special here is that I used the new-ish spatial datatypes in Neo4j to create a geoCoordinates property that has actually spatial characteristics. I can do some basic geospatial querying on that, straight out of the box.
Next Make the data a bit graphier by extract recurring patterns and making them part of a more normalised structure.
//extract city from addresses
match (a:Address)
merge (c:City {name: a.cityName, postalcode: a.postalcode})
merge (a)-[:IN_CITY]->(c);
//extract countries from addresses
match (a:Address)--(c:City)
merge (co:Country {name: a.countryName})
merge (c)-[:IN_COUNTRY]->(co);
That's it really. Just looking at the data it really is a set of hierarchies that we can now start querying with Neo4j.
The model is what we said it would be:Now for some real fun.
Querying this dataset in Neo4j
Obviously the dataset is not very elaborate, but there were some interesting queries that we can start doing, specifically with the geospatial data that is in the dataset. Let's start with some easu stuff: let's find a Colruyt branch close (< 2kms) to my home - the query really is totally simple://find branches close to homewith point({latitude: 51.2081948, longitude: 4.4049826}) as home
match path = (:Country)--(:City)--(a:Address)--(:Branch)--(:PlaceType)
where distance(a.geoCoordinates, home) < 2000
return path;
I immediately get the branch info:
Now let's make it a bit more complicated. Let's find all the branches that are close to one another:
//find all addresses/branches that are within 2000m from one anothermatch (a1:Address)--(b1:Branch), (a2:Address)--(b2:Branch)
where id(a1)<id(a2)
and distance(a1.geoCoordinates, a2.geoCoordinates) < 2000
return b1.commercialName, b2.commercialName, distance(a1.geoCoordinates, a2.geoCoordinates)/1000+" km"
This gives me an interesting list:
Now, what if we wanted to visualise that data on an actual map. Fortunately, plotting the geo data just got a whole lot easier with this simple graph app that you can install into the Neo4j Desktop: neo-map.
I did have to do one hack to use neo-map: I had to convert the point properties (=geospatial datatype in Neo4j, which allows me to do the distance calculations above really easily) into simpler "latitude" and "logitude" properties to hook that up to neo-map. It was a very simple query to make the point-properties into number-properties:
//duplicate geocoordinates: stored both as geospatial data types, and as lat/long numbers
match (a:Address)
set a.longitude = a.geoCoordinates.longitude
set a.latitude = a.geoCoordinates.latitude;
and then neo-map allows me to really quickly generate a marker map:
and a heatmap:
You can easily see how we could overlay that with other data and get some interesting insights really quickly!
Of course there's lots of other stuff possible - but I just wanted to share this quick experiment.
All data and scripts are available on github, as usual.
Look forward to hearing your feedback, if any!
Cheers
Rik
No comments:
Post a Comment