Wednesday 27 November 2019

Part 1/4: Playing with the Carrefour shopping receipts

(Note: this post is part of a series - Part 1, Part 2, Part 3 and Part 4 are all published!)

Alright, here we go again. In an effort to do some more writing, blogging and podcasting for our wonderful Neo4j community, I wanted to get back into a routine of playing with some more datasets in Neo4j. A couple of weeks ago I was able to play a bit with a small dataset from Colruyt Group, and I wrote about it over here. And I don't know exactly how it happened, but in some weird way I got my hands on another retailer's data assignment - this time from Carrefour.

You will notice that this will be another series of blogs: there's just too much stuff here to put into one simple post. So after having done all the technical prep for this article, it seems most logical to split it into 4 parts:

  1. part 1 (this article) will cover the data modeling, the import of the dataset, and some minor wrangling to get the dataset into a workable format.
  2. part 2 (to follow) will cover a couple of cool queries to acquaint ourselves with the dataset.
  3. part 3 (to follow) will cover a specific - and quite complicated - investigation into the product combinations that people have been buying at Carrefour - to see if we can find some patterns in there.
  4. part 4 (to follow - and this is the final part) will look at some simple graph algorithms that we ran on the dataset for analytics.

That should be plenty of fun for all of us. So let's get right into it.

The Carrefour Basket dataset

As I finished up the Colruyt article referenced above, I was actually originally just looking for some spatial information on other supermarket chains' positioning of shops in Belgium. I wanted to see if I could create some simple overlay views of where which shops were - and started browsing the interweb for data on supermarket locations. That very quickly led to something completely different: I found the website for TADHack Global ("Telecom Application Developer Hackathon", which is apparently what it stands for), a 2016 event where people could investigate different datasets and use them to hack together some cool stuff. In that 2016 event, there was an assignment from Carrefour: the Carrefour Delighting Customers Challenge Basket Data set.


Sounded interesting. So I downloaded the original zip-file to see if I could have a play.

Very quickly I found that there was something weird in the JSON file format:


    "_id" : NumberInt(1003), 
    "mall" : NumberInt(2), 
    "date" : ISODate("2016-01-14T20:07:00.000+0000"), 
    "items" : [
        {
            "net_am" : 2.83, 
            "n_unit" : NumberInt(1), 
            "desc" : "QUESO TIERNO MEZCL"
        }, 
...
...
    ], 
    "client" : NumberLong(77021708271)
}

That format does not work well with my JSON processing tools in APOC, so ... after calling a friend I quickly found out that the source format is not really plain JSON - it looks like MongoDB's Extended JSON export format, with those NumberInt(), NumberLong() and ISODate() wrappers. So I did have to clean up the data a bit:


    "_id" : 1003, 
    "mall" : 2, 
    "date" : "2016-01-14T20:07:00.000+0000", 
    "items" : [
        {
            "net_am" : 2.83, 
            "n_unit" : 1, 
            "desc" : "QUESO TIERNO MEZCL"
        },
...
...
    ], 
    "client" : 77021708271
}

So after some impressive text wrangling, I managed to clean up the dataset and create a proper JSON file that I could use: here's the actual file (>1GB) and here is the zip of the same file (<100MB). Now we can get to work.
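Before doing anything else, it's worth a quick sanity check that APOC can actually parse the cleaned-up file. Assuming the $url parameter points at the cleaned file (we set that parameter up properly in the import section below), something like this should stream back the first few receipts:

//sanity check: can APOC parse the cleaned-up file?
call apoc.load.json($url) yield value
return value._id as id, value.mall as mall, value.date as date, size(value.items) as nrofitems
limit 5;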

Importing the dataset into Neo4j

You will have noticed that this is not a tiny little dataset. It contains these data fields:

DATA FIELDS
-----------
The file contains the following fields:

## ID      - Number id for that individual ticket.
## MALL    - Store where the ticket was printed. It has two values, 1 and 2. 
## DATE    - Date and time the ticket was printed.
## CLIENT  - Some tickets will have a Customer ID. Many tickets will share a Customer ID. 
## ITEMS   - List of items contained in the printed ticket. The list contains a dictionary with a product description (desc), the amount charged (net_am), and the number of units bought (n_unit).

and the fields are structured in this JSON format:

DATA FILE STRUCTURE
-------------------

_id    : NumberInt
mall   : NumberInt 
date   : ISODate
client : NumberLong
items  : [{
            "net_am" : NumberInt
            "n_unit" : NumberInt 
            "desc"   : String
         }] 

So in order to do the import I would need to think a little bit about my data model first, which seems pretty easy. The dataset is essentially a long list of TICKETS (= shopping receipts of customers that purchase their groceries/whatever in a Carrefour shop) with TICKETITEMS on them. Given my experience with Neo4j, it feels natural to immediately pull out some more interesting data elements, like the MALL (the shop where the customer bought the TICKETITEMS on the TICKET) and the CLIENT (a loyalty identifier - not always available). Last but not least: the TICKETITEMS also have a product description on them - so why not pull that out as well and create a separate PRODUCT node label. The model that I would be using would then look like this:
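In text form - using the relationship types you will see in the import statement below - the model boils down to these patterns:

(:Ticket)-[:TICKET_PRINTED_IN_MALL]->(:Mall)
(:Ticket)-[:TICKET_PURCHASED_BY_CLIENT]->(:Client)
(:Ticket)-[:TICKET_HAS_TICKETITEM]->(:TicketItem)-[:TICKETITEM_HAS_PRODUCT]->(:Product)
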
Now, given the fact that this is a sizeable dataset, and that I am doing this experiment on a laptop running a zillion other memory-hogging applications, I needed to update my Neo4j config a little bit:

//Neo4j.conf file entries
dbms.memory.heap.initial_size=512m
dbms.memory.heap.max_size=2G
dbms.memory.pagecache.size=8G

I quickly noticed that if I tried this with less memory, my import / wrangling transactions would just bomb out and lead to massive STOP-THE-WORLD garbage collection delays. Not something you want.

Preparing for the import also means setting up some indexes that we will need later on:
create index on :Ticket(id);
create index on :Mall(id);
create index on :Client(id);
create index on :Product(description);
create index on :TicketItem(netamount);
create index on :TicketItem(numberofunits);
create index on :TicketItem(product);
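
Index population can take a moment, and there is no point starting the big import before the indexes are online. On the Neo4j 3.x versions of that era, a quick check would look something like this:

//verify that the indexes have come online before importing
call db.indexes();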

and then we are ready to do the actual import. The statement uses two of our beloved APOC procedures:

  1. apoc.periodic.iterate to do the batching of our import iterations. We provide two statements to that procedure: the first, outer statement provides a stream of values to be processed, and the second, inner statement processes those values one at a time, committed in batches.
  2. apoc.load.json to read our json file and extract nodes and relationships from the structure. 

Here's the full statement:

:param url => "file://path/to/file/DelightingCustomersBDclean.json";


call apoc.periodic.iterate("
  call apoc.load.json($url) yield value return value","
    create (t:Ticket {id: value._id, datetime: datetime(value.date)})
    merge (m:Mall {id: value.mall})
    create (t)-[:TICKET_PRINTED_IN_MALL]->(m)
    merge (c:Client {id: value.client})
    create (c)<-[:TICKET_PURCHASED_BY_CLIENT]-(t)
      with value, t
        unwind value.items as item
        create (t)-[:TICKET_HAS_TICKETITEM]->(ti:TicketItem {product: item.desc, netamount: item.net_am, numberofunits: item.n_unit})
        merge (p:Product {description: item.desc})
        create (ti)-[:TICKETITEM_HAS_PRODUCT]->(p)",
  {batchSize: 10000, iterateList: true, parallel: false, params: {url:$url}});

Of course this takes a little while to run - depending on your configuration. This is the model after we run the import:

And these are some numbers in our Neo4j database:
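If you want to reproduce those numbers on your own copy of the database, a couple of quick - if somewhat brute-force - counting queries will do the trick:

//node counts per label
match (n)
return labels(n) as label, count(*) as count
order by count desc;

//relationship counts per type
match ()-[r]->()
return type(r) as relationshiptype, count(*) as count
order by count desc;
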

Now, in view of some of the work that we want to do later on, it makes sense to do some aggregation work in the database so that we can work with the dataset more easily. So I am going to do a few updates first before we start querying:

//calculate total value of tickets based on sum of ticketitems in a ticket
match (t:Ticket)-->(ti:TicketItem)
with t, sum(ti.netamount) as ticketvalue
set t.ticketamount = ticketvalue;

This will make it a lot easier to prioritise tickets that we want to analyse, as we now have the total value of the ticket readily available as a property on the Ticket, and don't have to sweep all the TicketItems every time we need those values.
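
Just to illustrate what that buys us: finding the most valuable receipts is now a simple property lookup on the Ticket nodes - something like this:

//example: the ten highest-value tickets, straight off the new property
match (t:Ticket)
return t.id as ticket, t.ticketamount as totalvalue
order by t.ticketamount desc
limit 10;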

The last update operation before we start querying is basically to close a triangle. In the above model, we don't have a direct relationship between the Ticket and the Product - we always have to go through the TicketItem. Not a terrible thing to do, but it's going to be a lot easier to look at combinations of products in the Tickets if we close that loop - and create a direct relationship between the Ticket and the Product. This is pretty easy:

//interested in products that are often bought together
//in order to do that we link products with tickets
call apoc.periodic.iterate("
match (t:Ticket)-->(ti:TicketItem)-->(p:Product) return t,p,ti","
create (t)-[:TICKET_HAS_PRODUCT {numberofunits: ti.numberofunits, netamount: ti.netamount}]->(p)",{batchSize:10000, parallel:false});
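
To show why that shortcut is worth having, here's a minimal sketch of the kind of query it enables: counting pairs of products that show up on the same ticket. We will dig into this properly in part 3, and on the full dataset you would want to narrow the match down first, but it illustrates the idea:

//sketch: which product pairs appear together on the same ticket most often?
match (p1:Product)<-[:TICKET_HAS_PRODUCT]-(t:Ticket)-[:TICKET_HAS_PRODUCT]->(p2:Product)
where id(p1) < id(p2)
return p1.description as product1, p2.description as product2, count(t) as frequency
order by frequency desc
limit 10;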

So we have this adjusted model:

That's it for now. We are ready to start querying the dataset in a bit more detail.

All the scripts for this post are on my Github repo, and the specifics of this blogpost are in parts 0 and 1 of the gist.

Hope this was interesting so far. More very soon.

Rik


