Saturday, 21 March 2020

Supply Chain Management with graphs: part 1/3 - data wrangling and import

Alright, I have been putting the writing of this blogpost off for too long. Finally, on this sunny Saturday afternoon where we are locked inside our homes because of the Covid-19 pandemic, I think I'll try to make a dent in it - I have a lot of stuff to share already.

The basic idea for this (series of) blogpost(s) is pretty simple: graph problems are often characterised by lots of connections between entities, and by queries that touch many (or an unknown quantity) of these entities. One of the prime examples is pathfinding: trying to understand how different entities are connected to one another, understanding the cost or duration of these connections, etc. So pretty quickly, you understand that logistics and supply chain management are great problems to tackle with graphs, if you think about it. Supply Chains are graphs. So why not story and retrieve these chains with a graph database? Seems obvious.

We've also had lots of examples of people trying to solve supply chain management problems  in the past. Take a look at some of these examples:
And of course some of these presentations from different events that we organised:
So I had long thought that it would be great to have some kind of a demo dataset for this use case. Of course it's not that difficult to create something hypothetical yourself - but it's always more interesting to work with real data - so I started to look around.

Finding a Supply Chain Dataset to play with

After some googling, I actually found a really interesting article on StackExchange. It referenced a scientific article from a while back that was doing some analysis on real-world supply chains, and it had the dataset published with it. Take a look at the actual paper over here:


There's an Appendix that's also worth looking at over here.

In paper, it does not take very long to find lots of Supply Chain graphs in there. Here's just some of them:


So: if it looks like a graph, walks like a graph, quacks like a graph - it must be a graph! So I decided to take a look at the underlying data for the paper. Turns out that the author, Sean Willems, also published the datasets for this paper online. So before I know it I was downloading the XL file from this website. To make sure that it would not go unlinked I also put it on a google drive for direct download.

When you open this workbook, you will actually find a real trove of Supply Chain Management datas. In the sheet, you will find a set of worksheets for 38 (!!!) different patterns of supply chains that Mr. Willems' research has analysed. Look at it below: there are interesting examples in here for lots of different industries, and there are chains in there of smaller and (very) large sizes. It's pretty impressive:

For every one of these supply chains, you will find that the Excel file incorporates three different tabs:
  1. an F-tab with visual representation of the supply chain. For the first couple of chains this will be quite easy, but for the higher numbers they actually get really complicated. Take a look.
  2. a SD-tab with data for every stage in the supply chain. It's a pretty wide table with lots of statistics about every steps in the supply chain. For chain 38 this looks very different and a lot more complicated:
  3. an LL-tab with relationships between the different stages. This is where the "graph comes alive":
I thought it would be interesting to actually create a graph database in Neo4j with all of this data in it, nicely integrated. To do that, I need to create a bigger google sheet with all the data for all the different supply chains nicely integrated. That took a bit of copy/pasting, but was not difficult at all. The only thing I had to add was that for ever row in the "relationships" or the "detail" worksheet, I would just add a "ChainId" property to identify the specific supply chain that a stage was part of.



In the sheet, you will therefore find 4 different tabs (instead of 38x3 sheets in the Excel version):

  1. a Summary tab
  2. a Relationships tab
  3. a Details tab
  4. a "Stageclassifications" tab

Based on this sheet, I could actually do an import of the dataset into Neo4j quite easily, as every one of these google sheets can be immediately downloaded as a CSV file. So we should be good - let's give it a try!



Importing the dataset into Neo4j

The load script for this dataset can be found on Github of course. It's a very simple process all in all: it starts with creating some of the indexes in the Neo4j database:

create index on :SupplyChain(id);
create index on :Stage(name);
create index on :Stage(betweenness);
create index on :SIC(code);
create index on :Company(id);
create index on :StageClassification(name);


It's best to do that first, as it will make many of the load operations that follow a bit faster. So then we can proceed to load the companies from the google sheet's CSV export:

//load the companies
load csv with headers from "https://docs.google.com/spreadsheets/u/0/d/1vD4UCxsincEQYQpAeCbEqnd3QFdKcuL4ukesbyu5RIc/export?format=csv&id=1vD4UCxsincEQYQpAeCbEqnd3QFdKcuL4ukesbyu5RIc&gid=0" as csv
merge (sic:SIC {code: toInt(csv.SIC_Code), desc: csv.SIC_Description})
merge (c:Company {id: toInt(csv.companyId)})
merge (ch:SupplyChain {id: toInt(csv.chainId)})
merge (c)-[:HAS_SIC]->(sic)
create (c)-[:HAS_CHAIN]->(ch);

And once that is done we can load Supply Chain stages, and the the relationships between all the stages. For this we do the following:

//load the rels between the chain stages
load csv with headers from
"https://docs.google.com/spreadsheets/u/0/d/1vD4UCxsincEQYQpAeCbEqnd3QFdKcuL4ukesbyu5RIc/export?format=csv&id=1vD4UCxsincEQYQpAeCbEqnd3QFdKcuL4ukesbyu5RIc&gid=1371172744" as csv
merge (ss:Stage {name: csv.sourceStage, chainId: toInt(csv.chainId)})
merge (ds:Stage {name: csv.destinationStage, chainId: toInt(csv.chainId)})
create (ss)-[:PRECEDES]->(ds);

So, as you can see, this uses our MERGE operator to effectively loop through all of these, and create the stages when they don't exist yet, but leave them alone if they already exist. Very useful to make sure that we don't end up with duplicates. We connect it all with the PRECEDES relationship, to indicate that one stage comes before the next stage.

Next, we will actually connect these stages to the Supply Chains that we have added above. We do that by having the SupplyChain connect to the first Stage through a STARTS_WITH relationship, and have the last Stage connect back to the SupplyChain with an ENDS relationship. Here's how that works:

//create the STARTS_WITH and ENDS rels between the SupplyChain and the first and last chain stage

//create STARTS_WITH relationship
load csv with headers from
"https://docs.google.com/spreadsheets/u/0/d/1vD4UCxsincEQYQpAeCbEqnd3QFdKcuL4ukesbyu5RIc/export?format=csv&id=1vD4UCxsincEQYQpAeCbEqnd3QFdKcuL4ukesbyu5RIc&gid=1371172744" as csv
match (sc:SupplyChain {id: toInt(csv.chainId)}), (ss:Stage {name: csv.sourceStage, chainId: toInt(csv.chainId)})
where not (()-[:PRECEDES]->(ss))
AND sc.id = ss.chainId
merge (sc)-[:STARTS_WITH]->(ss);

//create ENDS relationships
match path = (sc:SupplyChain)-[:STARTS_WITH]->(startstage:Stage)-[:PRECEDES*]->(endstage:Stage)
where not ((endstage)-[:PRECEDES]->())
merge (sc)<-[:ENDS]-(endstage);

Now the structure of the graph is in place, but we would like to add some more detailed data for every Stage, just by looking up every Stage by it's identifier, and then adding the properties:

//load the detail of each Stage
load csv with headers from
"https://docs.google.com/spreadsheets/u/0/d/1vD4UCxsincEQYQpAeCbEqnd3QFdKcuL4ukesbyu5RIc/export?format=csv&id=1vD4UCxsincEQYQpAeCbEqnd3QFdKcuL4ukesbyu5RIc&gid=579862779" as csv
match (s:Stage {chainId: toInt(csv.chainId), name: csv.Stage_Name})
set s.stageCost = toFloat(csv.stageCost)
set s.relDepth = toInteger(csv.relDepth)
set s.stageClassification = csv.stageClassification
set s.avgDemand = csv.avgDemand
set s.stdDevDemand = csv.stdDevDemand
set s.maxServiceTime = csv.maxServiceTime
set s.serviceLevel = csv.serviceLevel
set s.stageTime = toFloat(csv.stageTime)
set s.stdDev_stageTime = csv.stdDev_stageTime;

Some housekeeping & refactoring

And then we only have a few housekeeping edits left to do. The stages are actually classified into StageClassifications, so we can split these out:

//split out the stageClassifications
match (s:Stage)
merge (sc:StageClassification {name: s.stageClassification})
create (s)-[:IS_A]->(sc);

And while doing so I noticed that the dataset contained a bit of a data quality problem, as it had a StageClassification named "Parts", and one named "Part". So let merge these:

//merge the two "Parts" stageClassifications
match (sc1:StageClassification {name:"Parts"}), (sc2:StageClassification {name:"Part"})
WITH head(collect([sc1,sc2])) as nodes
CALL apoc.refactor.mergeNodes(nodes,{properties:"combine", mergeRels:true}) yield node
set node.name = "Parts"
return count(node);

You will notice that we used an apoc procedure for this called apoc.refactor.mergeNodes, which allows us to combine nodes and their dependent structures into one and the same.

You can run all of the above queries in one go in the Neo4j Browser's multi-statement editor, and run them very quickly and efficiently that way:



When this is done, we end up with the following datamodel:


which obviously is also nicely represented as a graph.



That's it. We have a great Supply Chain Management graph in Neo4j now, ready for us to play with - which is what we will do in the next part of this blogpost series.

You can find the scripts for the entire blogpost series at at the github gists page, so if you want start playing around with it yourself, feel free!

Hope this was already useful - much more to come in part 2!

Cheers

Rik

No comments:

Post a Comment