Bruggen Blog: Part 2/4: Playing with the Carrefour shopping receipts

Thursday, 28 November 2019

Part 2/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3, Part 4 are all published!)

In the previous article in this series, we had started to play around with the Carrefour shopping receipts dataset that I found from a hackathon in 2016. It's a pretty cool dataset, and with some text wizardry and some Neo4j procedures, we quickly had a running database of Tickets, TicketItems, Clients, Malls and Products. The model looks like this:

In summary, we have

about 585k shopping tickets in the dataset,
that hold about 6.8M ticketitems (so 11-12 ticketitems/ticket, on average)
from 2 different Carrefour malls,
from 66k different Carrefour clients
with about 53k different products

This clearly is not "big data" yet, but it's big enough to be interesting and to have a bit of a meaningful play with. So let's run some queries!

Running some interesting Neo4j queries

As usually happens, a dataset like this one takes a bit of time to get into. So it really helps to get some easier queries under our belts to really get a feel for the type of structure that we are looking at here, warts and all. So here we go:

Let's take a look at the Carrefour clients that have the most tickets in our dataset. Mind you, there's lots of Tickets that do not have a Client associated with it, and where we have marked the Client as "Unknown" - we want to ignore these tickets. Our query would look like this:

match (c:Client)<-[r]-(t:Ticket)

where c.id <> "Unknown"

return c.id as Client, count(r) as NrOfTickets

order by NrOfTickets desc

limit 10

Interesting. There's a could of clients that have been doing quite a bit of shopping here! We could refine this a bit more and see how often these people actually come into the stores (we have time data on the Ticket!) - but that's something for you to explore yourself :) ...

Now let's take a look at the products that are most bought in terms of the quantities that they represent:

match (p:Product)<--(ti:TicketItem)

return p.description, sum(ti.numberofunits) as totalQuantity

order by totalQuantity desc

limit 10

The result is pretty obvious:

Clearly at the time of the publishing of this dataset, there was quite a big bias in the purchased products: seems like BAGS ("Bolsa"), BEER ("Cerveza") and MILK ("Leche") were very important to our shoppers!

Now we can of course also slice this a little differently, and look at the sales value that these product sales represent: let's look at the products most bought in terms of their sales (=netamount) results:

match (p:Product)<--(ti:TicketItem)

return p.description, sum(ti.netamount) as totalNetamount

order by totalNetamount desc

limit 10

Still have a very important bias towards Beer :) ... I like it!

Finally, let's do some more exploration around the things that clients are buying in different malls. I don't have any details on the locations or differences between these shopping locations, but I am going to assume that they are stores that are not located too far from one another. How do I know? Well, I can pretty easily see that there are a number of known Clients that go to both stores. I just need to run this query to see the clients that purchase in different malls:

match (m1:Mall)<--(t1:Ticket)-->(c:Client)<--(t2:Ticket)-->(m2:Mall)

where m1<>m2 and t1<>t2 and c.id <> "Unknown"

return m1,m2,t1,t2,c

limit 10

Pretty easy to understand!

And then it becomes even cooler when we look at which products are being purchased in more than one store: here are the clients that purchase same product in different malls

match path = (m1:Mall)<--(t1:Ticket)-->(c:Client)<--(t2:Ticket)-->(m2:Mall)

where m1<>m2 and t1<>t2 and c.id <> "Unknown"

with path, t1, t2

match (t1)--(ti1:TicketItem)--(p:Product)--(ti2:TicketItem)--(t2)

return path,p,ti1,ti2

limit 10

I would be interested to learn a bit more about some of these products / clients.

That's it for this part of the blogpost. I think we have now started to get a better feel for the dataset - so in the next part we will start doing a little bit more fancy digging around.

All the scripts for this post are on my Github repo, and the specifics of this blogpost are in the part 2 of the gist.

Hope this was interesting so far. More very soon.

Rik

Bruggen Blog

Pages

Thursday, 28 November 2019

Part 2/4: Playing with the Carrefour shopping receipts

Running some interesting Neo4j queries

No comments:

Post a Comment

Labels

Blogarchive

Metricool