In the previous article in this series, we had started to play around with the Carrefour shopping receipts dataset that I found from a hackathon in 2016. It's a pretty cool dataset, and with some text wizardry and some Neo4j procedures, we quickly had a running database of Tickets, TicketItems, Clients, Malls and Products. The model looks like this:
In summary, we have
- about 585k shopping tickets in the dataset,
- that hold about 6.8M ticketitems (so 11-12 ticketitems/ticket, on average)
- from 2 different Carrefour malls,
- from 66k different Carrefour clients
- with about 53k different products
This clearly is not "big data" yet, but it's big enough to be interesting and to have a bit of a meaningful play with. So let's run some queries!
Running some interesting Neo4j queries
As usually happens, a dataset like this one takes a bit of time to get into. So it really helps to get some easier queries under our belts to really get a feel for the type of structure that we are looking at here, warts and all. So here we go:
Let's take a look at the Carrefour clients that have the most tickets in our dataset. Mind you, there's lots of Tickets that do not have a Client associated with it, and where we have marked the Client as "Unknown" - we want to ignore these tickets. Our query would look like this:
Let's take a look at the Carrefour clients that have the most tickets in our dataset. Mind you, there's lots of Tickets that do not have a Client associated with it, and where we have marked the Client as "Unknown" - we want to ignore these tickets. Our query would look like this:
match (c:Client)<-[r]-(t:Ticket)
where c.id <> "Unknown"
return c.id as Client, count(r) as NrOfTickets
order by NrOfTickets desc
limit 10
Interesting. There's a could of clients that have been doing quite a bit of shopping here! We could refine this a bit more and see how often these people actually come into the stores (we have time data on the Ticket!) - but that's something for you to explore yourself :) ...
Now let's take a look at the products that are most bought in terms of the quantities that they represent:
match (p:Product)<--(ti:TicketItem)
return p.description, sum(ti.numberofunits) as totalQuantity
order by totalQuantity desc
limit 10
The result is pretty obvious:
The result is pretty obvious:
Clearly at the time of the publishing of this dataset, there was quite a big bias in the purchased products: seems like BAGS ("Bolsa"), BEER ("Cerveza") and MILK ("Leche") were very important to our shoppers!
Now we can of course also slice this a little differently, and look at the sales value that these product sales represent: let's look at the products most bought in terms of their sales (=netamount) results:
match (p:Product)<--(ti:TicketItem)
return p.description, sum(ti.netamount) as totalNetamount
order by totalNetamount desc
limit 10
Still have a very important bias towards Beer :) ... I like it!
Finally, let's do some more exploration around the things that clients are buying in different malls. I don't have any details on the locations or differences between these shopping locations, but I am going to assume that they are stores that are not located too far from one another. How do I know? Well, I can pretty easily see that there are a number of known Clients that go to both stores. I just need to run this query to see the clients that purchase in different malls:
match (m1:Mall)<--(t1:Ticket)-->(c:Client)<--(t2:Ticket)-->(m2:Mall)
where m1<>m2 and t1<>t2 and c.id <> "Unknown"
return m1,m2,t1,t2,c
limit 10
Pretty easy to understand!
And then it becomes even cooler when we look at which products are being purchased in more than one store: here are the clients that purchase same product in different malls
And then it becomes even cooler when we look at which products are being purchased in more than one store: here are the clients that purchase same product in different malls
match path = (m1:Mall)<--(t1:Ticket)-->(c:Client)<--(t2:Ticket)-->(m2:Mall)
where m1<>m2 and t1<>t2 and c.id <> "Unknown"
with path, t1, t2
match (t1)--(ti1:TicketItem)--(p:Product)--(ti2:TicketItem)--(t2)
return path,p,ti1,ti2
limit 10
I would be interested to learn a bit more about some of these products / clients.
That's it for this part of the blogpost. I think we have now started to get a better feel for the dataset - so in the next part we will start doing a little bit more fancy digging around.
All the scripts for this post are on my Github repo, and the specifics of this blogpost are in the part 2 of the gist.
Hope this was interesting so far. More very soon.
Rik
All the scripts for this post are on my Github repo, and the specifics of this blogpost are in the part 2 of the gist.
Hope this was interesting so far. More very soon.
Rik
No comments:
Post a Comment