Friday 29 November 2019

Part 3/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3 and Part 4 are all published!)

Alright here goes part 3 of 4 of my work on the Carrefour shopping receipts dataset.

In part 1 I wrangled the data and imported it into Neo4j. In part 2 I ran some simple but interesting queries on the data, just to get our feet wet and get a feel for the dataset. Now in this article I want to do some more interesting work - specifically around product combinations. Which products are being bought together? Who is buying which combinations? You can just sense that this would be some interesting stuff.

And I must say that this was quite an interesting "assignment". Originally, I wanted to actually look at all the combinations of products that we found in our dataset, and I wrote a nice little query for it:

call apoc.periodic.iterate("
match (p1:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)-[:TICKET_HAS_PRODUCT]->(p2:Product)
where id(p1) > id(p2)
return p1, p2","
merge (pc:ProductCombo {combo: p1.description + ' with ' + p2.description, product1: p1.description, product2: p2.description})
on create set pc.frequency = 1
on match set pc.frequency = pc.frequency + 1",
{batchSize: 50000, iterateList: true, parallel: false})

In theory, this works just fine - and the db starts churning away and writing back ProductCombo nodes - but it never finishes. Or maybe I lost my patience :) ... but then I realised that the math is very much working against me: I have 53588 products in this dataset. If I remember my maths correctly, that means that
nCr = n(n - 1)(n - 2) ... (n - r + 1) / r! = n! / (r! (n - r)!)
I would have 53588! / (2! * 53586!) = 1,435,810,078 possible combinations of products. See the StatTrek website for a calculator :) ... On top of that, I realised that probably not ALL of these combinations are that interesting for us - maybe we should try to make this a bit more specific?
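As a quick sanity check on that number (this little script is my own addition, not part of the original assignment), the pair count follows directly from the formula above with r = 2:

```python
# Sanity check of the pair-count arithmetic above.
# With n = 53588 distinct products, the number of unordered pairs is
# C(n, 2) = n! / (2! * (n - 2)!) = n * (n - 1) / 2.
import math

n_products = 53588
pairs = math.comb(n_products, 2)  # same as n_products * (n_products - 1) // 2

print(pairs)  # 1435810078
```

So even before writing any ProductCombo nodes, we know the query could in the worst case have to MERGE well over a billion of them - no wonder it never seemed to finish.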

Thursday 28 November 2019

Part 2/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3 and Part 4 are all published!)

In the previous article in this series, we had started to play around with the Carrefour shopping receipts dataset that I found from a hackathon in 2016. It's a pretty cool dataset, and with some text wizardry and some Neo4j procedures, we quickly had a running database of Tickets, TicketItems, Clients, Malls and Products. The model looks like this:
In summary, we have
  • about 585k shopping tickets in the dataset, 
  • that hold about 6.8M ticketitems (so 11-12 ticketitems/ticket, on average)
  • from 2 different Carrefour malls, 
  • from 66k different Carrefour clients
  • with about 53k different products
This clearly is not "big data" yet, but it's big enough to be interesting and to have a bit of a meaningful play with. So let's run some queries!

Wednesday 27 November 2019

Part 1/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3 and Part 4 are all published!)

Alright, here we go again. In an effort to do some more writing, blogging and podcasting for our wonderful Neo4j community, I wanted to get back into a routine of playing with some more datasets in Neo4j. A couple of weeks ago I was able to play a bit with a small dataset from Colruyt Group, and I wrote about it over here. And I don't know exactly how it happened, but in some weird way I got my hands on another retailer's data assignment - this time from Carrefour.

You will notice that this will be another series of blogs: there's just too much stuff here to put into one simple post. So after having done all the technical prep for this article, it seems most logical to split it into 4 parts:

  1. part 1 (this article) will cover the data modeling, the import of the dataset, and some minor wrangling to get the dataset into a workable format.
  2. part 2 (to follow) will cover a couple of cool queries to acquaint ourselves with the dataset.
  3. part 3 (to follow) will cover a specific - and quite complicated - investigation into the product combinations that people have been buying at Carrefour - to see if we can find some patterns in there.
  4. part 4 (to follow - and this is the final part) will look at some simple graph algorithms for analytics that we ran.

That should be plenty of fun for all of us. So let's get right into it.

The Carrefour Basket dataset

As I finished up the Colruyt article referenced above, I was actually originally just looking for some spatial information on other supermarket chains' positioning of shops in Belgium. I wanted to see if I could create some simple overlay views of where which shops were - and started browsing the interweb for data on supermarket locations. That very quickly led to something completely different: I found the website for TADHack Global ("Telecom Application Developer Hackathon", which is apparently what it stands for), a 2016 event where people could investigate different datasets and use them to hack together some cool stuff. At that 2016 event, there was an assignment from Carrefour: the Carrefour Delighting Customers Challenge Basket Data set.

Tuesday 12 November 2019

Playing with the Colruyt Data Science assignment

If you spend any time in the Wonderful World of Graphs, I am sure you have noticed that the landscape has been changing in the past few years. I have definitely seen a change: the interest in using graphs has shifted from wanting to use graph databases for "data retrieval" purposes, to now also wanting to make use of them to "make sense of" the data - basically doing data analytics. Of course data retrieval and data analysis are related, and in many cases we nowadays talk about all of this under the umbrella of data science. Sounds great, and at Neo4j we have made fantastic strides in making new functionality (think the Algo library that you can install on every Neo4j server, or think the Neuler graphapp that makes using the Algo library a walk in the park) available to enable these workloads - a work in progress that will only accelerate.

Thursday 7 November 2019

Graphistania 2.0 - Episode 1 - This Month in Neo4j

Hello everyone!

It has been deadly quiet on this page, hasn't it? That's really oh so true, and I am / was not happy with that. This blog, the podcast, and everything around it have always been my humble contribution to our awesome Neo4j community, and in the past 6+ months or so, I have not been doing my part. Sorry for that. Lots of excuses that I will not bore you with, but I am going to try to do better.

Part of the reason for the silence was of course that I thought that the podcast formula (in which I always asked for the three same basic things: who are you, why graphs, what's coming in the future) had kind of run its course. 100+ episodes had given me lots of fantastic conversations, but it was time to move on. I needed a new formula.

A couple of weeks ago, while doing absolutely NOTHING graph related - unless you want to imagine a graph of a bathroom, a shower, soap, and yours truly - I came up with an idea. What if we did episodes about all of the cool, innovative things that are popping up in our community on a daily basis? Sure. But where could I find those? Well, on the Neo4j developer relations "This week in Neo4j" (TWIN4J) newsletter probably, right! But who would I talk to about that? Well... this is where I found a great partner in crime. I thought about one of my most creative colleagues, someone who is paid to be creative and is really good at it - and came up with none other than Stefan Wendin. Stefan leads our Innovation Labs in EMEA, and has presented on that topic extensively in the past.

So we have lots of innovation. We have someone who KNOWS a lot about innovation. So let's then have a chat about some of these innovative graph database applications, shall we? Here goes.