Tuesday, 12 November 2019

Playing with the Colruyt Data Science assignment

If you spend any time in the Wonderful World of Graphs, I am sure you have noticed that the landscape has been changing in the past few years. I have definitely seen a change: the interest in using graphs has shifted from wanting to use graph databases for "data retrieval" purposes, to now also wanting to make use of it ton "make sense of" the data - basically doing data analytics. Of course data retrieval and data analysis are related, and in many cases we nowadays talk about all of this under the umbrella of data science. Sounds great, and at Neo4j we have made fantastic strides in making new functionality (think the Algo library that you can install on every Neo4j server, or think the Neuler graphapp that makes using the Algo library a walk in the park) available to enable these workloads - a work in progress that will only accelerate.

So recently I came across a post from Dries De Dauw on LinkedIn: he was mentioning how was working on a set of exercises that his new data science team members needed to work on to get a feel for their skills and competencies, and also familiarise them with the data domain of Colruyt Group.

He was referring to this particular page on github, containing a python notebook with a number of interesting exercises. One of the exercises uses a sample dataset that is contained in a .json file that you can download from a url.

Given the fact that it had been some time since I had played around with some example datasets, and that I really wanted to try out some new tools that I had recently learned about at the Neo4j Online Developer Summit (NODES) (like for example the new graphapp that allows you to explore geospatial data really easily, by Estelle Scifo),  I decided to take the dataset for a spin in Neo4j.

Importing the Colruyt dataset

The .json file is not 100% straightforward, as it contains some substructures embedded in every part: there's Branches, Branch addresses, and some type information available in the dataset. In the Colruyt assignment people were asked to "flatten" that - and effectively de-normalise the dataset, but ... why would we do that in a graph database. In Neo4j joins are cheap, normalisation is cheap, and flattening is just plain unnecessary. So I would not do that. Here's how I imported the file.

The model is really simple (created with the new Arrows tool):

First set up the indexes in Neo4j:

//create indexes and constraint
create index on :Address(streetName);
create index on :City(name);
create index on :Country(name);
create index on :Branch(commercialName);
create index on :Branch(geoCoordinates);
create constraint on (b:Branch) assert b.id is unique;

Then we can actually read the data from the .json file, using an APOC procedure called apoc.load.json:

//import addresses, placetypes, branches, and connect the branches to addresses and placetypes using apoc
WITH "https://ecgplacesmw.colruytgroup.com/ecgplacesmw/v3/nl/places/filter/clp-places" AS url
call apoc.load.json(url) yield value
merge ( a:Address { houseNumber: value.address.houseNumber, streetName: value.address.streetName, countryName: value.address.countryName, cityName: value.address.cityName, postalcode: value.address.postalcode, geoCoordinates: point({latitude: toFloat(value.geoCoordinates.latitude), longitude: toFloat(value.geoCoordinates.longitude) } ) } )
merge (pt:PlaceType {id: value.placeType.id, description: value.placeType.placeTypeDescription, name: value.placeType.longName})
merge (b:Branch
  {id: value.branchId})
  set b.placeId=value.placeId
  set b.isActive=value.isActive
  set b.sourceStatus=value.sourceStatus
  set b.routeUrl=value.routeUrl
  set b.moreInfoUrl=value.moreInfoUrl
  set b.commercialName=value.commercialName
  set b.handoverServices=value.handoverServices
  set b.sellingPartners=value.sellingPartners
merge (b)-[:IN_STREET]->(a)
merge (b)-[:HAS_PLACETYPE]->(pt)
return b,a,pt;

As you can see from the above, the only thing special here is that I used the new-ish spatial datatypes in Neo4j to create a geoCoordinates property that has actually spatial characteristics. I can do some basic geospatial querying on that, straight out of the box.

Next Make the data a bit graphier by extract recurring patterns and making them part of a more normalised structure.

//extract city from addresses
match (a:Address)
merge (c:City {name: a.cityName, postalcode: a.postalcode})
merge (a)-[:IN_CITY]->(c);

//extract countries from addresses
match (a:Address)--(c:City)
merge (co:Country {name: a.countryName})
merge (c)-[:IN_COUNTRY]->(co);

That's it really. Just looking at the data it really is a set of hierarchies that we can now start querying with Neo4j.
The model is what we said it would be:

Now for some real fun.

Querying this dataset in Neo4j

Obviously the dataset is not very elaborate, but there were some interesting queries that we can start doing, specifically with the geospatial data that is in the dataset. Let's start with some easu stuff: let's find a Colruyt branch close (< 2kms) to my home - the query really is totally simple:

//find branches close to homewith point({latitude: 51.2081948, longitude: 4.4049826}) as home
match path = (:Country)--(:City)--(a:Address)--(:Branch)--(:PlaceType)
where distance(a.geoCoordinates, home) < 2000
return path;

I immediately get the branch info:

Now let's make it a bit more complicated. Let's find all the branches that are close to one another:

//find all addresses/branches that are within 2000m from one anothermatch (a1:Address)--(b1:Branch), (a2:Address)--(b2:Branch)
where id(a1)<id(a2)
and distance(a1.geoCoordinates, a2.geoCoordinates) < 2000
return b1.commercialName, b2.commercialName, distance(a1.geoCoordinates, a2.geoCoordinates)/1000+" km"

This gives me an interesting list:

Now, what if we wanted to visualise that data on an actual map. Fortunately, plotting the geo data just got a whole lot easier with this simple graph app that you can install into the Neo4j Desktop: neo-map.

I did have to do one hack to use neo-map: I had to convert the point properties (=geospatial datatype in Neo4j, which allows me to do the distance calculations above really easily) into simpler "latitude" and "logitude" properties to hook that up to neo-map. It was a very simple query to make the point-properties into number-properties:

//duplicate geocoordinates: stored both as geospatial data types, and as lat/long numbers
match (a:Address) 
 set a.longitude = a.geoCoordinates.longitude
 set a.latitude = a.geoCoordinates.latitude;
and then neo-map allows me to really quickly generate a marker map:
and a heatmap:
or a combination of both layers:

You can easily see how we could overlay that with other data and get some interesting insights really quickly!

Of course there's lots of other stuff possible - but I just wanted to share this quick experiment.

All data and scripts are available on github, as usual.

Look forward to hearing your feedback, if any!



Thursday, 7 November 2019

Graphistania 2.0 - Episode 1 - This Month in Neo4j

Hello everyone!

it has been deadly quiet on this page, hasn't it. That's really oh so true, and I am / was not happy with that. This blog, the podcast, and everything around has always been my humble contribution to our awesome Neo4j community, and in the past 6+ month or so, I have not been doing my part. Sorry for that. Lots of excuses that I will not bore you with, but I am going to try to do better.

Part of the reason for the silence was of course that I thought that the podcast formula (in which I always asked for the three same basic things: who are you, why graphs, what's coming in the future) had kind of run its course. 100+ episodes had given me lots of fantastic conversations, but it was time to move on. I needed a new formula.

A couple of weeks ago, while doing absolutely NOTHING graph related - unless you want to imagine a graph of a bathroom, a shower, soap, and yours truly - I came up with an idea. What if we did episodes about all of the cool, innovative things that are popping up in our community on a daily basis? Sure. But where could I find those? Well, on the Neo4j developer relations "This week in Neo4j" (TWIN4J) newsletter probably, right! But who would I talk to that about? Well... this is where I found a great partner in crime. I thought about one of my most creative colleagues, someone who is paid to be creative and is really good at it - and came up with noone other than Stefan Wendin. Stefan leads our Innovation Labs in EMEA, and has presented on that topic extensively in the past.

So we have lots of innovation. We have someone who KNOWS a lot about innovation. So let's then have a chat about some of these innovative graph database applications, shall we? Here goes.

Wednesday, 3 July 2019

Finally: someone interviewed me on their podcast

This is worth a small celebration. Super conversation (in Dutch) with Jurjen Helmus, Walter van der Scheer and wingman Ron van Weverwijk on the Dataloog podcast about the use of graph databases and Neo4j. Listen to it over here over here - or find it on itunes/spotify.

Lots of fun to do - hope you enjoy it as much too!



Tuesday, 26 March 2019

Podcast Interview with János Szendi-Varga, GraphAware

Finally - managed to get out another episode of our podcast. Early February, our friends of GraphAware published a great blogpost about their view of the "Graph Technology Landscape". The main author János Szendi-Varga, was kind enough to spend some time with me on the phone talking about our industry and where it's going. Hope you enjoy it as much as I did!

Here's the transcript of our conversation:
RVB: 00:00:03.184 Hello everyone. My name is Rik, Rik Van Bruggen from Neo4j, and here we are again recording another episode of the Graphistania, Neo4j graph database podcast. And today I have a special guest on the podcast all the way from the UAE in the Middle East and that's a very, very interesting person that did some amazing work recently with the graph database landscape. It's János Szendi-Varga. Hi János.

Tuesday, 26 February 2019

Podcast Interview with Amy Hodler, Neo4j

Yes! Another great interview for our podcast, this time with a great colleague of mine who has a big mind and many ideas around one of the most fascinating and under-used topics evah: Graph Algorithms and Artificial Intelligence. Amy Hodler has a lot to say on this topic, and is about to publish a fantastic new book on the topic together with Mark Needham. I think you will enjoy this one - even though it's a bit longer than usual. Here goes:

Here's the transcript of our conversation:
RVB:  00:00:00.938 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo4j, and here we are again, recording another episode of our Graphistania Graph Database Podcast. And tonight I have a dear colleague of mine on the other side of this Google call, and that's Amy Hodler. Amy, how are you?

Thursday, 7 February 2019

The Graph Technology Landscape Graph

Our friends at Graphaware, and specifically Janos Szendi-Varga created a fantastic overview of the Graph Technology Landscape in 2019. It's actually pretty cool:
There's lots of interesting data in there, which Janos put into a CSV file over here. I thought that was really cool, and took the data for a spin in a Google Sheet, and then modified it a little bit from there.

Next thing for me was to create a

Which I have now made available as an import script and a graphgist over here. You can explore the data in the deployed graphgist yourself, and figure out some of the hidden and not-so-hidden connections in the Graph Technology Landscape. 

Hope you will have as much fun with this as I had.



Tuesday, 5 February 2019

Podcast Interview with Jess Mason and Jason Cox, Untiltled Folder

Yeah,  I think I am going to repeat myself. Another podcast that was long overdue waiting to be published. I interviewed my next guests middle of December 2018, and it's hard to believe that I only just MADE the time to publish the episode. My bad - again. It's another great chat though - so you should still listen and look at some of the very cool links included below - you will not regret it. Jess and Jason are doing some amazing stuff with Neo4j over in Philly, and check it out. Here it goes:

Here's the transcript of our conversation:
RVB: 00:00:26.434 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo4j, and here I am, again, recording another Neo4j Graph Database Graphistania Podcast. And tonight I am joined by two guys from Philadelphia in the USA that have been doing some amazing work with Neo4j. And on top of that, they have the funniest company name, I think at least. It's called Untitled Folder. That's Jess and Jason. Jess and Jason, welcome to the podcast.

JM: 00:01:04.071 Hi. Thanks for having us on the podcast.

JC: 00:01:04.525 Hello.