Thursday 28 November 2019

Part 2/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1Part 2Part 3Part 4 are all published!)

In the previous article in this series, we had started to play around with the Carrefour shopping receipts dataset that I found from a hackathon in 2016. It's a pretty cool dataset, and with some text wizardry and some Neo4j procedures, we quickly had a running database of Tickets, TicketItems, Clients, Malls and Products. The model looks like this:
In summary, we have
  • about 585k shopping tickets in the dataset, 
  • that hold about 6.8M ticketitems (so 11-12 ticketitems/ticket, on average)
  • from 2 different Carrefour malls, 
  • from 66k different Carrefour clients
  • with about 53k different products
This clearly is not "big data" yet, but it's big enough to be interesting and to have a bit of a meaningful play with. So let's run some queries!

Running some interesting Neo4j queries

As usually happens, a dataset like this one takes a bit of time to get into. So it really helps to get some easier queries under our belts to really get a feel for the type of structure that we are looking at here, warts and all. So here we go:

Let's take a look at the Carrefour clients that have the most tickets in our dataset. Mind you, there's lots of Tickets that do not have a Client associated with it, and where we have marked the Client as "Unknown" - we want to ignore these tickets. Our query would look like this:

match (c:Client)<-[r]-(t:Ticket)
where c.id <> "Unknown"
return c.id as Client, count(r) as NrOfTickets
order by NrOfTickets desc
limit 10

Interesting. There's a could of clients that have been doing quite a bit of shopping here! We could refine this a bit more and see how often these people actually come into the stores (we have time data on the Ticket!) - but that's something for you to explore yourself :) ...

Now let's take a look at the products that are most bought in terms of the quantities that they represent:
match (p:Product)<--(ti:TicketItem)
return p.description, sum(ti.numberofunits) as totalQuantity
order by totalQuantity desc
limit 10

The result is pretty obvious:


Clearly at the time of the publishing of this dataset, there was quite a big bias in the purchased products: seems like BAGS ("Bolsa"), BEER ("Cerveza") and MILK ("Leche") were very important to our shoppers!

Now we can of course also slice this a little differently, and look at the sales value that these product sales represent: let's look at the products most bought in terms of their sales (=netamount) results:

match (p:Product)<--(ti:TicketItem)
return p.description, sum(ti.netamount) as totalNetamount
order by totalNetamount desc
limit 10


Still have a very important bias towards Beer :) ... I like it!

Finally, let's do some more exploration around the things that clients are buying in different malls. I don't have any details on the locations or differences between these shopping locations, but I am going to assume that they are stores that are not located too far from one another. How do I know? Well, I can pretty easily see that there are a number of known Clients that go to both stores. I just need to run this query to see the clients that purchase in different malls:

match (m1:Mall)<--(t1:Ticket)-->(c:Client)<--(t2:Ticket)-->(m2:Mall)
where m1<>m2 and t1<>t2 and c.id <> "Unknown"
return m1,m2,t1,t2,c
limit 10

Pretty easy to understand!

And then it becomes even cooler when we look at which products are being purchased in more than one store: here are the clients that purchase same product in different malls

match path = (m1:Mall)<--(t1:Ticket)-->(c:Client)<--(t2:Ticket)-->(m2:Mall)
where m1<>m2 and t1<>t2 and c.id <> "Unknown"
with path, t1, t2
match (t1)--(ti1:TicketItem)--(p:Product)--(ti2:TicketItem)--(t2)
return path,p,ti1,ti2
limit 10

I would be interested to learn a bit more about some of these products / clients.
That's it for this part of the blogpost. I think we have now started to get a better feel for the dataset - so in the next part we will start doing a little bit more fancy digging around.

All the scripts for this post are on my Github repo, and the specifics of this blogpost are in the part 2 of the gist.

Hope this was interesting so far. More very soon.

Rik

No comments:

Post a Comment