Friday, 29 November 2019

Part 3/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1Part 2Part 3Part 4 are all published!)

Alright here goes part 3 of 4 of my work on the Carrefour shopping receipts dataset.

In part 1 I wrangled the data, and imported it into Neo4j. In part 2 I was doing some simple but interesting queries on the data, just to get our feet wet and get a feel for the dataset. Now in this article I want to do some more interesting work - specifically around product combinations. Which products are being bought together? Who is buying which combinations together? You can just sense that this would be some interesting stuff.

And I must say that this was quite an interesting "assignment". Originally, I wanted to actually look at all the combinations of products that we found in our dataset, and I wrote a nice little query for it:

//PLEASE DON'T RUN THIS QUERY!!!
call apoc.periodic.iterate("
match (p1:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)-[:TICKET_HAS_PRODUCT]->(p2:Product)
where id(p1)>id(p2)
return p1, p2","
merge (pc:ProductCombo {combo: p1.description+ ' with '+ p2.description, product1: p1.description, product2: p2.description})
on create set pc.frequency = 1
on match set pc.frequency = pc.frequency + 1
",
{batchsize:50000, iterateList: true, parallel: false})

In theory, this works just fine - and the db starts churning away and writing back ProductCombo nodes - but it never finishes. Or maybe I lost my patience :) ... but then I realised that the math is very much working against me: I have 53588 products in this dataset. If I remember my maths correctly, that means that
nCr = n(n - 1)(n - 2) ... (n - r + 1)/r! = n! / r!(n - r)!
I would have 53588! / (2! * 53586!) = 1435810078 combinations of products possible. See the StatTrek website for the calculator :) ... on top of that I realised that ALL of these combinations are probably not that interesting for us - maybe we should try to make this a bit more specific?


So I started thinking of different ways of making it more manageable AND meaningful for us. I basically thought of two ways:

  1. looking at only the product combinations that our highest spending clients are buying togerther
  2. looking at the most expensive products that are being bought together.
Both of these approaches have the huge advantage that they make our research much more local - I don't have to look at all product combinations anymore, I can just focus on the combinations that might be relevant.

1. Product combinations being bought by the highest spending clients

This sounds like an interesting and relevant question, right? If the big spenders are buying stuff together and if that seems to be like a common pattern, then maybe Carrefour could be marketing towards this, right? So let's see. Here the query to find the product combinations of our 100 highest spenders:

//what productcombo's are the highest spending customers buying most?
match (c:Client)<-[:TICKET_PURCHASED_BY_CLIENT]-(t:Ticket)
where c.id <> "Unknown"
with c, sum(t.ticketamount) as totalspend
order by totalspend desc
limit 100
match
  (c)<-[:TICKET_PURCHASED_BY_CLIENT]-(t:Ticket)-[:TICKET_HAS_PRODUCT]->(p1:Product),
  (t)-[:TICKET_HAS_PRODUCT]-(p2:Product)
where id(p1)<id(p2)
return distinct p1.description+' with '+p2.description, count(*) as frequency
order by frequency desc limit 10;

and the results come back pretty quickly (note that I am not writing this back to the graph):


The result is equally shocking as it is predictable: our big spenders are buying LOTS OF BAGS (=bolsa, in Spanish)!!!

So why don't we just exclude the products that have the word "BOLSA" in their description, and rerun the query:

//what productcombo's are the highest spending customers buying most - EXCEPT BAGS
match (c:Client)<-[:TICKET_PURCHASED_BY_CLIENT]-(t:Ticket)
where c.id <> "Unknown"
with c, sum(t.ticketamount) as totalspend
order by totalspend desc
limit 100
match
  (c)<-[:TICKET_PURCHASED_BY_CLIENT]-(t:Ticket)-[:TICKET_HAS_PRODUCT]->(p1:Product),
  (t)-[:TICKET_HAS_PRODUCT]-(p2:Product)
where id(p1)<id(p2)
and not (p1.description contains "BOLSA")
and not (p2.description contains "BOLSA")
return distinct p1.description+' with '+p2.description, count(*) as frequency
order by frequency desc limit 10

This is already a lot more interesting:

One things immediately struck me in this (very high level) result set: we are looking at LOTS OF DIFFERENT FOODS being bought together! Maybe there's a kind of recipe in here somewhere, or maybe Carrefour could actually experiment with combining different food combinations and make it dead easy for the clients to buy food combinations together? Could be a really interesting subject of further investigations, I think.

2. Expensive products being bought together

This was another hypothesis that could be more interesting - and a lot more local so that we don't have to explore every meaningless combination. In order to get to this, we are however missing some data - we first need to know pricing of a product. Turns out that we don't have that in the dataset: every TicketItem has a ti.netamount and a ti.numberofitems but it does not actually mention an item price - let alone a product price.

Therefore I ran a simple query to find out the potential prices of a product:
MATCH (n:TicketItem)-[:TICKETITEM_HAS_PRODUCT]->(p:Product)
where n.numberofunits = 1
with p,n
limit 1000
with p.description as description,  collect(n.netamount) as prices
return description, size(prices) as nrofprices
order by nrofprices desc
limit 10
And I quickly found out that that 1 product can have more than 1 price in this dataset. The result of the query above - which only looks at ticketitems that have 1 unit of a specific ticketitem - looks like this:


So we already have lots of prices being used here. This gets even worse when you are looking at tickets that have ticketitems with more than 1 unit. So I tried to get a feel for this by formalising this and looking at the average price and standard deviation.

//let's look at average and standard deviation
MATCH (n:TicketItem)-[:TICKETITEM_HAS_PRODUCT]->(p:Product)
where n.numberofunits > 0
with p,n
limit 100
return p.description,  collect(n.netamount/n.numberofunits) as prices, avg(n.netamount/n.numberofunits) as avgprice, stdev(n.netamount/n.numberofunits) as stdev_price
order by stdev_price desc
limit 100

So we quickly see that there are products with one and only one price - but there's also quite a few exceptions:


So in order to answer our question above, to know which expensive products are being bought together, we are going to have to store the ticketitems' prices and the product prices in the graph first. Let's start with the ticketitem prices. We will have to batch this again, as there are so many ticketitems that we could not do this in one transaction on my laptop.

//first establish ticketitemprice
call apoc.periodic.iterate("
match (ti:TicketItem) where ti.numberofunits > 0 return ti",
"set ti.ticketitemprice = ti.netamount/ti.numberofunits",
{batchSize: 10000, parallel:true});

This is a fairly large update operation, so after 45secs or so the result returns:

Now it will be super easy to use the newly calculated ticketitem price to calculate the average price of a product:

//then use the ticketitemprice to calculate averageprice of a product
match (ti:TicketItem)-[:TICKETITEM_HAS_PRODUCT]->(p:Product)
where exists(ti.ticketitemprice)
with p, avg(ti.ticketitemprice) as avgprice
set p.averageprice = avgprice;

This goes quite quickly:
Set 53443 properties, completed after 16153 ms.

And so now we are ready to start querying this for our original question. The method we used below is to work with a "price cutoff point" below which we would not be interested. So we first calculate the cutoff (the 1000 most expensive products), and then look at the product combinations. From our formula above we know that there will only be  1000! / (2! * 998!) = 499500 combinations - and that's perfectly doable in my environment.

//find the product combinations and their prices that are often bought together
match(p:Product)
where exists(p.averageprice)
with p.averageprice as prices
order by p.averageprice desc
limit 1000
with min(prices) as cutoff
match (p1:Product)<--(t:Ticket), (p2:Product)<--(t)
where p1.averageprice > cutoff and p2.averageprice > cutoff and id(p1)>id(p2)
return distinct p1.description as product1, p1.averageprice as product1price, p2.description as product2, p2.averageprice as product2price, count(*) as frequency
order by frequency desc limit 10

This gives us a result very quickly:


We can also look at the same result in graph format:
If you look at the above a little more closely, you already see some interesting lessons again: it seems like we have a number of clients that are buying two versions of phone bought together (eg. in two different colours)! Why would that be? Are they legitimately purchasing two phones (one for themselves and one for their partner) together? Or are they exploiting the easy returns process of Carrefour and just hanging on to one of the two purchases after they get home? Would be interesting to investigate.

All the scripts for this post are on my Github repo, and the specifics of this blogpost are in the part 3 of the gist.

Hope this was interesting so far. More very soon.

Rik

No comments:

Post a Comment