Bruggen Blog: retail

Showing posts with label retail. Show all posts

Monday, 2 December 2019

Part 4/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3, Part 4 are all published!)

Alright here goes part 4 of 4 of my work on the Carrefour shopping receipts dataset. I realize we have come quite a way - and for me too there has been a lot to talk about and explore in these blogposts. Even then I feel like there's a ton of other interesting questions that we could ask and answer - but that would lead us too far.
Just to recap:

In part 1 I wrangled the data, and imported it into Neo4j.
In part 2 I was doing some simple but interesting queries on the data, just to get our feet wet and get a feel for the dataset.
In part 3 we started doing some more interesting work - specifically around product combinations.

Now, in this final part of this series, I want to see if we can do some more analytical work with this dataset, for example by applying some algorithms to it. More specifically, I want to use some of our graph similarity algorithms to figure out which products are supposedly similar to one another - and do that along multiple axes.

People have written long and complicated doctorates about the best way to calculate and establish similarities in graphs - and most of it is very much beyond me and my reptile math brain. But one thing is clear: many of the algorithms have very different approaches to doing this, and there are good reasons for wanting to choose or abandon one or the other. However, in our daily Neo4j work, we have seen some particularly interesting results with the Jaccard similarity algorithm, which is part of the algos plugin to Neo4j.

Jaccard similarity

The simple explanation of what Jaccard similarity does, is that it calculates a coefficient that compares members of two sets to see which members are shared and which are very different. So it's a measure of similarity for two sets of data - with a range from 0% (not similar at all) to 100% (identical). Higher scores mean higher similarity between the two populations. Jaccard similarity is sometimes referred to as "Intersection over Union", as explained like this:

I borrowed most of this explanation from the inevitable Wikipedia of course. You can find the Neo4j algo library that contains this algorithm over here.

Part 3/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3, Part 4 are all published!)

Alright here goes part 3 of 4 of my work on the Carrefour shopping receipts dataset.

In part 1 I wrangled the data, and imported it into Neo4j. In part 2 I was doing some simple but interesting queries on the data, just to get our feet wet and get a feel for the dataset. Now in this article I want to do some more interesting work - specifically around product combinations. Which products are being bought together? Who is buying which combinations together? You can just sense that this would be some interesting stuff.

And I must say that this was quite an interesting "assignment". Originally, I wanted to actually look at all the combinations of products that we found in our dataset, and I wrote a nice little query for it:

//PLEASE DON'T RUN THIS QUERY!!!
call apoc.periodic.iterate("
match (p1:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)-[:TICKET_HAS_PRODUCT]->(p2:Product)
where id(p1)>id(p2)
return p1, p2","
merge (pc:ProductCombo {combo: p1.description+ ' with '+ p2.description, product1: p1.description, product2: p2.description})
on create set pc.frequency = 1
on match set pc.frequency = pc.frequency + 1
",
{batchsize:50000, iterateList: true, parallel: false})

In theory, this works just fine - and the db starts churning away and writing back ProductCombo nodes - but it never finishes. Or maybe I lost my patience :) ... but then I realised that the math is very much working against me: I have 53588 products in this dataset. If I remember my maths correctly, that means that

nCr = n(n - 1)(n - 2) ... (n - r + 1)/r! = n! / r!(n - r)!

I would have 53588! / (2! * 53586!) = 1435810078 combinations of products possible. See the StatTrek website for the calculator :) ... on top of that I realised that ALL of these combinations are probably not that interesting for us - maybe we should try to make this a bit more specific?

Part 2/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3, Part 4 are all published!)

In the previous article in this series, we had started to play around with the Carrefour shopping receipts dataset that I found from a hackathon in 2016. It's a pretty cool dataset, and with some text wizardry and some Neo4j procedures, we quickly had a running database of Tickets, TicketItems, Clients, Malls and Products. The model looks like this:

In summary, we have

about 585k shopping tickets in the dataset,
that hold about 6.8M ticketitems (so 11-12 ticketitems/ticket, on average)
from 2 different Carrefour malls,
from 66k different Carrefour clients
with about 53k different products

This clearly is not "big data" yet, but it's big enough to be interesting and to have a bit of a meaningful play with. So let's run some queries!

Part 1/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3, Part 4 are all published!)

Alright here we go again. In an effort to do some more writing, blogging, podcasting, for our wonderful Neo4j community, I wanted to get back into a routine of playing with some more datasets in Neo4j. A couple of weeks ago I was able to play a bit with a small dataset from Colruyt Group, and I wrote about it over here. And I don't know exactly how it happened, but in some weird way I got my hand on another retailer's data assignment - this time from Carrefour.

You will notice that this will be another series of blogs: there's just too much stuff here to put into one simple post. So after having done all the technical prep for this article, it seems most logical to split it into 4 parts:

part 1 (this article) will cover the the data modeling, the import of the dataset, and some minor wrangling to get the dataset into a workable format.
part 2 (to follow) will cover a couple of cool queries to acquaint ourselves with the dataset.
part 3 (to follow) will cover a specific - and quite complicated - investigation into the product combinations that people have been buying at Carrefour - to see if we can find some patterns in there.
part 4 (to follow - and this is the final part) will look at some simple graph algorithms for analytics that we ran.

That should be plenty of fun for all of us. So let's get right into it.

The Carrefour Basket dataset

As I finished up the Colruyt article referenced above, I was actually originally just looking for some spatial information on other supermarket chain's positioning of shops in Belgium. I wanted to see if I could create some simple overlay views of where which shops were - and started browsing the interweb for data on supermarket locations. That very quickly lead to something completely different: I found this website for TADHack Global ("Telecom Application Developer Hackathon", apparently is what it stands for), a 2016 event where people could investigate different datasets and use it to hack together some cool stuff. In that 2016 event, there was an assignment from Carrefour: the Carrefour Delighting Customers Challenge Basket Data set.

Playing with the Colruyt Data Science assignment

If you spend any time in the Wonderful World of Graphs, I am sure you have noticed that the landscape has been changing in the past few years. I have definitely seen a change: the interest in using graphs has shifted from wanting to use graph databases for "data retrieval" purposes, to now also wanting to make use of it ton "make sense of" the data - basically doing data analytics. Of course data retrieval and data analysis are related, and in many cases we nowadays talk about all of this under the umbrella of data science. Sounds great, and at Neo4j we have made fantastic strides in making new functionality (think the Algo library that you can install on every Neo4j server, or think the Neuler graphapp that makes using the Algo library a walk in the park) available to enable these workloads - a work in progress that will only accelerate.

Hierarchies and the Google Product Taxonomy in Neo4j

Quite some time ago, I wrote a blogpost about using Neo4j for managing and calculating hierarchies. That post was then also later used in my book as it proved very useful for explaining one of the key use-cases for Neo4j, Impact Analysis and Simulation. So it should be pretty clear by now that HIERARCHIES ARE GRAPHS right? I think so :) ...

Hierarchical Product Taxonomy

Recently, I was preparing for a very cool brown-bag session at a client's offices, when I wanted to include a demonstration around product taxonomies. These structures are typically presented as some kind of a hierarchy/tree on many eCommerce websites - and are very well known by online users. So I wanted to find a taxonomy, and here here, Google immediately came to the rescue. I found this page on the Google Merchant Center.

You can follow the link to the Excel file, and boom - there's your product Taxonomy for you.

Bruggen Blog

Pages

Monday, 2 December 2019

Part 4/4: Playing with the Carrefour shopping receipts

Jaccard similarity

Friday, 29 November 2019

Part 3/4: Playing with the Carrefour shopping receipts

Thursday, 28 November 2019

Part 2/4: Playing with the Carrefour shopping receipts

Wednesday, 27 November 2019

Part 1/4: Playing with the Carrefour shopping receipts

The Carrefour Basket dataset

Tuesday, 12 November 2019

Playing with the Colruyt Data Science assignment

Thursday, 30 July 2015

Hierarchies and the Google Product Taxonomy in Neo4j

Hierarchical Product Taxonomy

Labels

Blogarchive

Metricool