Bruggen Blog: dataset

Showing posts with label dataset.

Monday, 2 December 2019

Part 4/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3, Part 4 are all published!)

Alright here goes part 4 of 4 of my work on the Carrefour shopping receipts dataset. I realize we have come quite a way - and for me too there has been a lot to talk about and explore in these blogposts. Even then I feel like there's a ton of other interesting questions that we could ask and answer - but that would lead us too far.
Just to recap:

In part 1 I wrangled the data, and imported it into Neo4j.
In part 2 I was doing some simple but interesting queries on the data, just to get our feet wet and get a feel for the dataset.
In part 3 we started doing some more interesting work - specifically around product combinations.

Now, in this final part of this series, I want to see if we can do some more analytical work with this dataset, for example by applying some algorithms to it. More specifically, I want to use some of our graph similarity algorithms to figure out which products are supposedly similar to one another - and do that along multiple axes.

People have written long and complicated doctorates about the best way to calculate and establish similarities in graphs - and most of it is very much beyond me and my reptile math brain. But one thing is clear: many of the algorithms have very different approaches to doing this, and there are good reasons for wanting to choose or abandon one or the other. However, in our daily Neo4j work, we have seen some particularly interesting results with the Jaccard similarity algorithm, which is part of the algos plugin to Neo4j.

Jaccard similarity

The simple explanation of what Jaccard similarity does is that it calculates a coefficient that compares the members of two sets to see which members are shared and which are distinct. So it's a measure of similarity for two sets of data - with a range from 0% (nothing in common) to 100% (identical). Higher scores mean higher similarity between the two populations. Jaccard similarity is sometimes referred to as "Intersection over Union": the number of members that the two sets share, divided by the number of members in their union.

I borrowed most of this explanation from the inevitable Wikipedia of course. You can find the Neo4j algo library that contains this algorithm over here.
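To make the "Intersection over Union" idea a bit more tangible, here is a minimal sketch in plain Python (not Cypher, and not the Neo4j algo library itself). The function name, the ticket ids and the convention of returning 0.0 for two empty sets are all assumptions made for this illustration, not something taken from the dataset.

```python
def jaccard(a, b):
    """Return |a AND b| / |a OR b| for two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty sets: the ratio is undefined, 0.0 is an assumed convention.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical ticket ids on which two products appear.
tickets_with_bread = {1, 2, 3, 4}
tickets_with_butter = {3, 4, 5}

# Shared tickets: {3, 4} (2 members); all tickets: {1, 2, 3, 4, 5} (5 members).
similarity = jaccard(tickets_with_bread, tickets_with_butter)
print(similarity)  # 0.4
```

Two products that always show up on exactly the same tickets would score 1.0, and two products that never share a ticket would score 0.0.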

Part 3/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3, Part 4 are all published!)

Alright here goes part 3 of 4 of my work on the Carrefour shopping receipts dataset.

In part 1 I wrangled the data, and imported it into Neo4j. In part 2 I was doing some simple but interesting queries on the data, just to get our feet wet and get a feel for the dataset. Now in this article I want to do some more interesting work - specifically around product combinations. Which products are being bought together? Who is buying which combinations together? You can just sense that this would be some interesting stuff.

And I must say that this was quite an interesting "assignment". Originally, I wanted to actually look at all the combinations of products that we found in our dataset, and I wrote a nice little query for it:

//PLEASE DON'T RUN THIS QUERY!!!
call apoc.periodic.iterate("
match (p1:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)-[:TICKET_HAS_PRODUCT]->(p2:Product)
where id(p1)>id(p2)
return p1, p2","
merge (pc:ProductCombo {combo: p1.description+ ' with '+ p2.description, product1: p1.description, product2: p2.description})
on create set pc.frequency = 1
on match set pc.frequency = pc.frequency + 1
",
{batchSize:50000, iterateList: true, parallel: false})

In theory, this works just fine - and the db starts churning away and writing back ProductCombo nodes - but it never finishes. Or maybe I lost my patience :) ... but then I realised that the math is very much working against me: I have 53588 products in this dataset. If I remember my maths correctly, that means that

nCr = n(n - 1)(n - 2) ... (n - r + 1) / r! = n! / (r! * (n - r)!)

I would have 53588! / (2! * 53586!) = 1435810078 combinations of products possible. See the StatTrek website for the calculator :) ... on top of that I realised that ALL of these combinations are probably not that interesting for us - maybe we should try to make this a bit more specific?
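If you would rather check that number yourself than trust a web calculator, a small Python sketch does the trick. It only assumes the product count of 53588 quoted above; the variable names are made up for this example.

```python
import math

n_products = 53588  # product count quoted in the post

# Unordered pairs of products: n! / (2! * (n - 2)!), which simplifies to n * (n - 1) / 2.
pairs = math.comb(n_products, 2)

# Cross-check against the simplified form.
assert pairs == n_products * (n_products - 1) // 2

print(pairs)  # 1435810078
```

This counts every pair once, which matches the id(p1)>id(p2) filter in the query: that condition keeps only one of the two possible orderings of each pair.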

Part 2/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3, Part 4 are all published!)

In the previous article in this series, we had started to play around with the Carrefour shopping receipts dataset that I found from a hackathon in 2016. It's a pretty cool dataset, and with some text wizardry and some Neo4j procedures, we quickly had a running database of Tickets, TicketItems, Clients, Malls and Products. The model looks like this:

In summary, we have

about 585k shopping tickets in the dataset,
that hold about 6.8M ticketitems (so 11-12 ticketitems/ticket, on average)
from 2 different Carrefour malls,
from 66k different Carrefour clients
with about 53k different products

This clearly is not "big data" yet, but it's big enough to be interesting and to have a bit of a meaningful play with. So let's run some queries!

Part 1/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3, Part 4 are all published!)

Alright here we go again. In an effort to do some more writing, blogging and podcasting for our wonderful Neo4j community, I wanted to get back into a routine of playing with some more datasets in Neo4j. A couple of weeks ago I was able to play a bit with a small dataset from Colruyt Group, and I wrote about it over here. And I don't know exactly how it happened, but in some weird way I got my hands on another retailer's data assignment - this time from Carrefour.

You will notice that this will be another series of blogs: there's just too much stuff here to put into one simple post. So after having done all the technical prep for this article, it seems most logical to split it into 4 parts:

part 1 (this article) will cover the data modeling, the import of the dataset, and some minor wrangling to get the dataset into a workable format.
part 2 (to follow) will cover a couple of cool queries to acquaint ourselves with the dataset.
part 3 (to follow) will cover a specific - and quite complicated - investigation into the product combinations that people have been buying at Carrefour - to see if we can find some patterns in there.
part 4 (to follow - and this is the final part) will look at some simple graph algorithms for analytics that we ran.

That should be plenty of fun for all of us. So let's get right into it.

The Carrefour Basket dataset

As I finished up the Colruyt article referenced above, I was actually originally just looking for some spatial information on how other supermarket chains position their shops in Belgium. I wanted to see if I could create some simple overlay views of which shops were where - and started browsing the interweb for data on supermarket locations. That very quickly led to something completely different: I found this website for TADHack Global ("Telecom Application Developer Hackathon" is apparently what it stands for), a 2016 event where people could investigate different datasets and use them to hack together some cool stuff. In that 2016 event, there was an assignment from Carrefour: the Carrefour Delighting Customers Challenge Basket Data set.

The Great Olympian Graph - part 1/3

After my previous experiments with some sports data (most recently, the Tour de France 2016 results) in Neo4j, I recently saw the 2016 Olympic games coming up, and thought: well, there MUST be some interesting datasets to find around that - especially now that one of my favourite bike-riders in the world, Greg Van Avermaet, won the Gold Medal in the Cycling Road Race. Still so excited!!!

I did a bit of research and decided to settle on a combination of two datasets:

Just before the London Olympics in 2012, The Guardian published a list of all summer Olympic medallists, from 1896 to 2008
Just after the same 2012 games, The Guardian also published the list of the 2012 medallists

Monday, 2 December 2019

Part 4/4: Playing with the Carrefour shopping receipts

Jaccard similarity

Friday, 29 November 2019

Part 3/4: Playing with the Carrefour shopping receipts

Thursday, 28 November 2019

Part 2/4: Playing with the Carrefour shopping receipts

Wednesday, 27 November 2019

Part 1/4: Playing with the Carrefour shopping receipts

The Carrefour Basket dataset

Monday, 8 August 2016

The Great Olympian Graph - part 1/3
