Wednesday 31 October 2018

Data Lineage in Neo4j - an elaborate experiment

For the past couple of years, I have had a LOT of conversations with users and customers of Neo4j that have been looking at graph databases for solving Data Lineage problems. Now, at first, that seemed like a really fancy new word used only by hipster technovangelists to try to appear interesting, but once I drilled into it, I found that it’s actually something really interesting and a really cool application of graph databases. Read more on the background of it on wikipedia (as always), or just live with this really simple definition:
“Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.”
That’s easy enough. Fact is that it’s a really big problem for large organisations - specifically financial institutions as they have to comply with regulations like the Basel Committee on Banking Supervision's standard number 239 - which is all about assuring data governance and risk reporting accuracy.

Here’s a couple of really nice articles and videos that should really give you quite a bit of background.

Obviously, for me that meant that I wanted to experiment with this myself. Given the fact that there’s not really sample datasets out there - I had to build my own, import that into Neo4j, and then do some querying. This blogpost is all about that - so here goes.

Generate some lineage data - in a spreadsheet

To do this little experiment, I based myself on some discussions that I had had with clients of course. Basically the model looks something like this:

Pretty straightforward. You can probably argue that the System-to-Process relationships and the Process-to-System relationships should be merged - but I could at the same time see how it would be beneficial to qualify these differently. So: I built myself a little spreadsheet with three tabs:
  1. The different data-elements and the number of nodes/rels that I wanted to generate for each.
  2. The worksheet with all the nodes
  3. The worksheet with all the relationships between these nodes
Here’s what that looks like:


Once I had this it’s pretty straightforward to do the import, and I have published the import scripts on github. Let’s go through this first.

Importing the data lineage sample dataset

You can find the Import script that I created over here. Nothing really special about it, except for the fact that I actually used apoc procedures for the import. The reason for this is simple: in the spreadsheet the node LABELs that I assign and the RELATIONSHIP types that I choose aree variable - and Cypher by default does not allow you to pass in variable label names or relationship types.

Here’s the script for importing the nodes, usiong the apoc.create.nodes procedure that you can find over here.
//load nodes
CALL apoc.create.nodes([csv.Label], [{id: csv.ID, name: csv.Name}]) YIELD node
RETURN count(node);
Then I actually do a little temporary hack to speed up the import of the relationships: I assign every node another label (the Node label) and then create an index on the id property. This is what I will use in the lookups of the nodes when attempting to create the relationships. I will remove the Node label and it’s index after the import...
SET n:Node;
So for the import of the relationships you will find the apoc.create.relationship procedure that will let us create relationships with variable relationship TYPEs:
//load relationships
load csv with headers from
"" as csv
MATCH (from:Node {id: csv.From}), (to:Node {id: csv.To})
CALL apoc.create.relationship(from, csv.Type,null, to) yield rel
RETURN count(rel);
So finally we then remove the temporary label and drop the corresponding index:
REMOVE n:Node;
DROP INDEX ON :Node(id);
And be done! We can check if the db schema checks out by doing a simple call db.schema query - and as you can see below it does seem to be as we expected and intended it to be:

Now we can start playing around with the dataset!

Querying the data lineage dataset

As with every dataset, there’s lots of things that you could do with it. The purpose here of course is not to do an exhaustive review - but rather show you the types of easy querying that we can do once the lineage data is imported into Neo4j. I have put these queries on github as well of course - so you can easily play around with them and customize them to your liking.

A first interesting think to look at would be links between users of different data - a typical pathfinding operation that graph databases are so good at. Here’s what that would look like for a random pair of users:
//link between users
match (u1:User), (u2:User)
where id(u1)<id(u2)
with u1, u2
limit 2
match path = shortestpath( (u1)-[*]-(u2) )
return path;
Instantaneously responds with the right answer:

Or we can take a look at which users are using which systems:
//user list of a systems
match (s:System)
with s
limit 10
match ((s:System)-[*2..2]-(u:User))
return, collect(
And get a tabular result back:

If we would prefer to look at this as a graph, then all we need to do is tell the system to return the path:
//user list of systems
match (s:System) with s
limit 1
match path = ((s:System)-[*2..2]-(u:User))
return path;
And it will kindly respond with a graph-ical representation:

Last but not least, I thought it was interesting to use this dataset to see if we could actually get a better view of the original question that we were trying to answer: where does the data (in this report) come from? So let’s take a look at that:
//lineage of a report
MATCH (rep:Report)
with rep LIMIT 1
match path = (source)-[r*]->(rep)
where not( ()-->(source) )
return source,rep, path
limit 10;
The key here is the direction of the relationships, of course, and the fact that in the where clause we check for the absence of the incoming relationship pattern. It is pretty straightforward this way, and you can look at it immediately in the Neo4j browser:

Last but not least, I also want to play around with this a little in the wonderful world of Neo4j Bloom - where I could actually ask all of the above questions in Near-Natural-Language.

Querying data lineage without Cypher: a Bloom perspective

Of course, it’s great to do the above queries in the standard Neo4j browser - works like a charm. But it does require you to know at least a little bit of Cypher, the declarative query language that we are advocating to become the new Property Graph Query Language standard. Not that that’s difficult - it really isn’t: if I can do it, you can! - but hey, it’s a bit of a learning curve and hurdle to get into.

In come Neo4j Bloom, the Graph Visualization and Discovery product of Neo4j. I have written about Bloom before, and after a good 4-5 months of intensive playing around with it on a variety of different datasets, I would say that the feedback to this new product is o v e r w h e l m i n g l y positive. It’s great for productive, easy, fast interaction with graphs - and so it should also help us with this data lineage experiment.

To do so, I have created a Bloom Perspective - a set of visualization rules that will help us with the exploration. You can find it over here on Github, but it looks like this:


As with every perspective, there’s two parts to it:
    • A category part, where you assign the different visualisation rules for the different node types / labels that are represented in your graph
    • A search part, where you abstract the complex(er) cypher queries into parametrized near-natural language search phrases. I love that part - it’s so powerful and helps non-technical users by sooooo much. I have a couple of these search phrases set up:

Finding the source of a report:

Which gives us this result as an example:

User graph of a system:

Which yields this result as an example:

And then finally: finding the links between two users

Which looks like this, as an example:

That’s about all I wanted to show you. I hope this post has given you some kind of a clue of how Neo4j can contribute to solving today’s massive data lineage - and therefore: regulatory compliance - questions.

All of the scripts are over here:

So let me know if you have any questions or comments.

All the best


No comments:

Post a Comment