(Note: this post is part of a series - Part 1, Part 2, Part 3, Part 4 are all published!)
Alright here goes part 3 of 4 of my work on the Carrefour shopping receipts dataset.
In part 1 I wrangled the data, and imported it into Neo4j. In part 2 I was doing some simple but interesting queries on the data, just to get our feet wet and get a feel for the dataset. Now in this article I want to do some more interesting work - specifically around product combinations. Which products are being bought together? Who is buying which combinations together? You can just sense that this would be some interesting stuff.
And I must say that this was quite an interesting "assignment". Originally, I wanted to actually look at all the combinations of products that we found in our dataset, and I wrote a nice little query for it:
//PLEASE DON'T RUN THIS QUERY!!!
call apoc.periodic.iterate("
match (p1:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)-[:TICKET_HAS_PRODUCT]->(p2:Product)
where id(p1)>id(p2)
return p1, p2","
merge (pc:ProductCombo {combo: p1.description+ ' with '+ p2.description, product1: p1.description, product2: p2.description})
on create set pc.frequency = 1
on match set pc.frequency = pc.frequency + 1
",
{batchsize:50000, iterateList: true, parallel: false})
In theory, this works just fine - and the db starts churning away and writing back ProductCombo nodes - but it never finishes. Or maybe I lost my patience :) ... but then I realised that the math is very much working against me: I have 53588 products in this dataset. If I remember my maths correctly, that means that
I would have 53588! / (2! * 53586!) = 1435810078 combinations of products possible. See the StatTrek website for the calculator :) ... on top of that I realised that ALL of these combinations are probably not that interesting for us - maybe we should try to make this a bit more specific?
Alright here goes part 3 of 4 of my work on the Carrefour shopping receipts dataset.
In part 1 I wrangled the data, and imported it into Neo4j. In part 2 I was doing some simple but interesting queries on the data, just to get our feet wet and get a feel for the dataset. Now in this article I want to do some more interesting work - specifically around product combinations. Which products are being bought together? Who is buying which combinations together? You can just sense that this would be some interesting stuff.
And I must say that this was quite an interesting "assignment". Originally, I wanted to actually look at all the combinations of products that we found in our dataset, and I wrote a nice little query for it:
//PLEASE DON'T RUN THIS QUERY!!!
call apoc.periodic.iterate("
match (p1:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)-[:TICKET_HAS_PRODUCT]->(p2:Product)
where id(p1)>id(p2)
return p1, p2","
merge (pc:ProductCombo {combo: p1.description+ ' with '+ p2.description, product1: p1.description, product2: p2.description})
on create set pc.frequency = 1
on match set pc.frequency = pc.frequency + 1
",
{batchsize:50000, iterateList: true, parallel: false})
In theory, this works just fine - and the db starts churning away and writing back ProductCombo nodes - but it never finishes. Or maybe I lost my patience :) ... but then I realised that the math is very much working against me: I have 53588 products in this dataset. If I remember my maths correctly, that means that
nCr = n(n - 1)(n - 2) ... (n - r + 1)/r! = n! / r!(n - r)!