As a quick reminder, here's what the data model looked like after the import:
Or visually:
Data validation and profiling
The first thing to do when you have a new shiny dataset like that, is of course to get a bit of a feel for the data. In this case, it really helps to understand the nature of the different SupplyChains - as we know from the original Excel file that they are quite different between the 38 of them. So let's do some profiling:match (n) return distinct labels(n), count(*)
is a good place to start:
and we can do the same for the relationships:
match (n)-[r]->() return distinct type(r), count(*)
gives us this result:
We can also look at if from a company perspective - as some companies had multiple SupplyChains in the dataset. Here's a view of that:
match p= (sc:SupplyChain)<-[:HAS_CHAIN]-(c:Company)-[HAS_SIC]->(s:SIC)
return p limit 25;
gives us this:
That all looks encouraging - so we can run a quick verification query to see if the different SupplyChains have been correctly imported as well. Let's verify the import for the simplest, smalles SupplyChain (number 1) first. In the Excel file this looked like this:
So if we now run a simple query:
match (sc:SupplyChain {id:1})-[*..3]-(conn) return sc, conn limit 10
gives us this:
We can do the same thing for SupplyChain nr 4:
And we see that it all work really well:
So now let's do some more interesting queries.
Supply Chain Queries
Once I was sure that the dataset had been accurately imported into Neo4j, I could do some more interesting queries. The first thing I thought about as a point of interest, was to try to understand the depth of these "multiechelon" supply chain networks. In order to do that, I wrote a quick query of course://depth of SC Network by traversing graph
match path = ((sc:SupplyChain)-[:STARTS_WITH|PRECEDES*]->(s:Stage))
where not ((s)-[:PRECEDES]->())
return distinct sc.id, (max(length(path))-1) as maxlength
order by maxlength desc;
This query basically runs through all of the different paths that originate at one of the 38 SupplyChain nodes, and counts the length of all of these paths - and takes the longest one as the maxlength of the paths for that SupplyChain. That query obviously needs to run through the entire graph, and will take some time. If you profile the query you will find that it has to do a whole lot of work:
Of course, I found out only after I wrote that query above that the depth of the network is actually already embedded in the dataset, as a property on the Stage nodes. So the query could look very differently:
//depth of SC network by reading stage properties
match (sc:SupplyChain)-->(s:Stage)
return distinct sc.id, max(s.relDepth) as maxlengthprop
order by maxlengthprop desc;
Obviously this would run a whole lot faster, and would get me the same result:
The results of the queries are identical, but the resources required to run them are very different.
This also goes to show that it always makes a lot of sense to do some detailed profiling of your dataset, as you may actually want to run the original query, and store the property as an intermediate aggregate property for easier querying afterwards. It makes sense!
Maybe one more interesting query to get a feel for the qualitative differences between the different SupplyChains in our dataset. Let's take a look at the differences in "breadth" of the chains. How many start and end nodes would every chain have? Let's take a look:
//how many start and end nodes
match (s1:Stage)<-[start:STARTS_WITH]-(sc:SupplyChain)
with sc, count(start) as starts
match (sc)<-[end:ENDS]-(s2)
return sc.id, starts, count(end);
This returns very quickly, thanks to the existence of the "ENDS" relationship:
Imagine having to do the same query but having to run through all the paths if there were no ENDS relationship: you would have to check for the absence of the "PRECEDES" relationship at the end of the chain and that would be a lot more complicated if you would want to cover all chains. You can of course do it more easily for one part of the graph, eg. for SupplyChain nr 20, and then you can do this:
//Zooming in on chain nr 20 - different path lengths
match path = (:SupplyChain {id:20})-[:STARTS_WITH]->(startstage:Stage)-[:PRECEDES*]->(endstage:Stage)
where not ((endstage)-[:PRECEDES]->())
return distinct count(path), length(path);
which does return instantaneously:
I thought it actually to be quite interesting to have a specific SupplyChain, nr 20, that has a fairly large variety in length of the number of Stages involved in the chain. Looking at this visually is also quite interesting:
//Visualizing the different path lengths for chain nr 20
match path = (:SupplyChain {id:20})-[:STARTS_WITH]->(startstage:Stage)-[:PRECEDES*]->(endstage:Stage)
where not ((endstage)-[:PRECEDES]->())
return path;
This looks like this:
And you can find similar variety in other chains. So I started to look at specific chain lengths a bit more closely, by way of experiment:
//Visualizing path length = 7 and 8 for chain nr 20
match path = (:SupplyChain {id:20})-[:STARTS_WITH]->(startstage:Stage)-[:PRECEDES*]->(endstage:Stage)
where not ((endstage)-[:PRECEDES]->()) and length(path) = 8
return path
limit 10;
The result of this query with Length = 7 looks like this:
and for Length = 8 you get something like this:
Of course you can rewrite these queries with an check for an "ENDS" relationship instead as well:
//Visualizing the paths with length = 8 with ENDS check
profile match path = (sc:SupplyChain {id:20})-[:STARTS_WITH]->(startstage:Stage)-[:PRECEDES*]->(endstage:Stage)-[:ENDS]-(sc)
where length(path) = 9
return path;
And as expected, we will see a difference in performance:
- Without ENDS check:
- With ENDS check:
That brings me to an end of this second part of this blogpost series on Supply Chain Management and Graphs. You can find the queries on github. In the next part of this series we will get down and dirty with some graph analytics. Looking forward already!
Hope this was useful - feedback welcome, as always!
Cheers
Rik
No comments:
Post a Comment