Alright, here goes part 4 of 4 of my work on the Carrefour shopping receipts dataset. I realize we have come quite a way - and for me too there has been a lot to talk about and explore in these blogposts. Even then I feel like there are a ton of other interesting questions that we could ask and answer - but that would lead us too far.
Just to recap:
- In part 1 I wrangled the data, and imported it into Neo4j.
- In part 2 I ran some simple but interesting queries on the data, just to get our feet wet and get a feel for the dataset.
- In part 3 we started doing some more interesting work - specifically around product combinations.
Now, in this final part of this series, I want to see if we can do some more analytical work with this dataset, for example by applying some algorithms to it. More specifically, I want to use some of our graph similarity algorithms to figure out which products are supposedly similar to one another - and do that along multiple axes.
People have written long and complicated doctoral theses about the best way to calculate and establish similarities in graphs - and most of it is very much beyond me and my reptile math brain. But one thing is clear: many of the algorithms have very different approaches to doing this, and there are good reasons for wanting to choose or abandon one or the other. However, in our daily Neo4j work, we have seen some particularly interesting results with the Jaccard similarity algorithm, which is part of the algo plugin to Neo4j.
Jaccard similarity
The simple explanation of what Jaccard similarity does is that it calculates a coefficient that compares the members of two sets to see which members are shared and which are distinct. So it's a measure of similarity for two sets of data - with a range from 0% (not similar at all) to 100% (identical). Higher scores mean higher similarity between the two populations. Jaccard similarity is sometimes referred to as "Intersection over Union": the number of members the two sets share, divided by the number of members that appear in either set.
I borrowed most of this explanation from the inevitable Wikipedia of course. You can find the Neo4j algo library that contains this algorithm over here.
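For those who like to see it written down, here is the "Intersection over Union" idea as a formula, followed by a tiny worked example. The ticket numbers in the example are made up purely for illustration and do not come from the Carrefour dataset: imagine product A appears on tickets 1, 2, 3 and 4, and product B on tickets 3, 4 and 5.

```latex
% Jaccard similarity of two sets A and B
J(A, B) = \frac{|A \cap B|}{|A \cup B|}

% Worked example with made-up ticket numbers:
% A = \{1, 2, 3, 4\}, \quad B = \{3, 4, 5\}
J(A, B) = \frac{|\{3, 4\}|}{|\{1, 2, 3, 4, 5\}|} = \frac{2}{5} = 0.4
```

So these two imaginary products would be 40% similar: they share two tickets out of the five tickets that contain either of them.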
Similarity of products by ticket
So let's apply this. The easiest thing to do is to work with Neo4j Desktop, install the Algo Plugin, and then install the Neuler Graph App from https://install.graphapp.io/. It's really super easy - and it acts as a great frontend to the algo library to allow for easy access and experimentation. Thanks to my friend Michael, I was able to diverge a little bit from the standard code that Neuler generates - and do something a little bit more specific. Here's what we did:
// similarity of product by ticket
MATCH (p:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)
WITH {item:id(p), categories: collect(id(t))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard(data, {write:true, writeRelationshipType: "SIMILAR_BY_TICKET", topK:10, degreeCutoff:10, similarityCutoff:0.5})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p25, p50, p75, p90, p95, p99, p999, p100;
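As you can see, the query
- starts with a MATCH that collects, for every product (the "item"), the ids of all the tickets it appears on (its "categories")
- passes these to the algo in two WITH statements
- limits the work with three settings: degreeCutoff:10 skips products that appear on fewer than 10 tickets, similarityCutoff:0.5 drops pairs that score below 0.5, and topK:10 keeps at most the 10 most similar products for each product
- writes back the result of the calculation in a SIMILAR_BY_TICKET relationship, including the similarity score
- returns some stats as output for the algo, mainly to verify if everything went well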
So we launch the query, and about 6.5 minutes later we get a result.
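To make it a bit more tangible what the algo calculates for every pair of products, here is a rough sketch of the same calculation done by hand in plain Cypher, for just one pair. This is not what the algo library runs internally, just my illustration of the idea. The two parameters $productId1 and $productId2 are hypothetical: you would have to set them yourself (for example with :param in the Neo4j Browser) to the internal ids of two products you are curious about.

```cypher
// Hand-rolled Jaccard similarity for a single pair of products.
// $productId1 and $productId2 are hypothetical parameters holding
// the internal node ids of the two products to compare.
MATCH (p1:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)
WHERE id(p1) = $productId1
WITH p1, collect(DISTINCT id(t)) AS tickets1
MATCH (p2:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)
WHERE id(p2) = $productId2
WITH p1, p2, tickets1, collect(DISTINCT id(t)) AS tickets2
// intersection: tickets that contain both products
WITH p1, p2, tickets1, tickets2,
     size([x IN tickets1 WHERE x IN tickets2]) AS intersection
// union: all tickets that contain at least one of the two products
WITH p1, p2, intersection,
     size(tickets1) + size(tickets2) - intersection AS unionSize
RETURN p1.description AS product1,
       p2.description AS product2,
       intersection,
       unionSize,
       toFloat(intersection) / unionSize AS jaccard;
```

For one pair this is quick enough, but doing it for every combination of products is exactly the heavy lifting that you want the algo library to do for you.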
As mentioned, the algo writes back a SIMILAR_BY_TICKET relationship, and so we can easily explore that.
MATCH p=()-[r:SIMILAR_BY_TICKET]->() RETURN p LIMIT 100
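If you would rather look at a table than at a graph, something like the sketch below lists the strongest similarities first. I am assuming here that the similarity score ended up in a property called score on the relationship, which as far as I know is the library default when no writeProperty is specified, as in the query above. If your property has a different name, adjust accordingly.

```cypher
// Top 25 product pairs by similarity score.
// Assumes the score was written to the "score" property
// (no writeProperty was set in the call to the algo).
MATCH (p1:Product)-[r:SIMILAR_BY_TICKET]->(p2:Product)
RETURN p1.description AS product,
       p2.description AS similarProduct,
       r.score AS score
ORDER BY score DESC
LIMIT 25;
```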
It's pretty cool to see how the Algo - without any prior domain knowledge - found some interesting similarities, all by itself. Look at this:
//hp printer cartridges
match (p:Product)-[conn]-()
where p.description contains "HP 302"
return p,conn;
It's pretty obvious, but different printer cartridges for the "302" model are often bought together.
Or take a look at this:
//garden chairs and tables
match (p:Product)-[conn]-()
where p.description contains "RIVERSIDE"
return p,conn;
Basically what this shows is that Carrefour is also selling garden furniture, and that the "chairs" and "tables" of this furniture seem to be quite "similar" to one another - as they are likely to be bought together.
Undoubtedly there are many more experiments that we could do with this dataset - especially if we had a little bit of a bigger machine setup. But hey, I think I have kind of already made my point - there could be a lot of value in this and I hope this will get many of you thinking and tinkering with Neo4j - it's a blast.
All the scripts for this post are on my GitHub repo, and the specifics of this blogpost are in part 4 of the gist.
Hope this was interesting - look forward to hearing your feedback.
Rik