Bruggen Blog: October 2014

Friday 31 October 2014

Simulating the IDMS EmpDemo using GraphGen and GrapheneDB

A couple of weeks ago, someone put me in touch with Luc Hermans of P&V Group. Luc had done this presentation 2 years ago about loading the default CA IDMS EmpDemo database into Neo4j. That triggered my interest bigtime, as my recollection of IT history strongly remembers mainframe database technologies as the hotbed for many of our present-day data technologies.

On the wikipedia page it has a bit of background:

IDMS (Integrated Database Management System) is primarily a network (CODASYL) database management system for mainframes. It was first developed at B.F. Goodrich and later marketed by Cullinane Database Systems (renamed Cullinet in 1983). Since 1989 the product has been owned by Computer Associates (now CA Technologies), who renamed it Advantage CA-IDMS and later simply to CA IDMS.

If you think about it for a minute, you immediately understand that these technologies are conceptually very similar, and that "the EMPDEMO" would probably also be a great example of the power of Neo4j.

Now I would love to load the exact same dataset into Neo4j, and have been trying to redo Luc's exercise of loading the data into Neo4j myself. But it is not that easy if you don't have a mainframe background. The model, however is easy enough to understand. Here's the "old" IDMS model:

And Luc created this Neo4j graph model on top of that:

You can immediately spot the similarities, can't you! It's just very, very similar. In fact, the Neo4j model seems a bit easier, since the "junction" nodes in the IDMS model (that represent many-to-many and reflective relationships in the IDMS world) can be eliminated through properties on the direct relationships (see the orange Emposition, Expertise and Structure elements that can be eliminated from the IDMS model). Nevertheless IDMS and Neo4j seem to be - at least philosophically - very related technologies. Even the implementation has some shared characteristics, although I bet that any *real* expert will tell you that IDMS is way more powerful in some ways, but Neo4j is probably more flexible.

Simulating the EMPDEMO with Neoxygen GraphGen

While I was trying to actually do the import of the EMPDEMO sequential files into Neo4j - which I am sure I will succeed at doing, some day - I thought that it would very likely be way easier to "simulate" the same concepts that the EMPDEMO domain is representing, from a vanilla Neo4j install. How? By generating the dataset of course. So then I started thinking about how to do that, and the remainder of this post is going to be about that.

Turns out that Christoph Willemsen has been developing his suite of PHP tools for Neo4j, among which an incredibly lovely data generator, GraphGen. This is a brand new toolset that Christoph has made, and I must say it is absolutely lovely.

Using GraphGen is simple:

you describe your domain (ie. the EMPDEMO model that we have above) in a cypher like syntax, specifying the number of nodes that you want to be generating and the cardinality of the relationships that you want generated. It's really like translating the picture above into Ascii Art :) ...

In my case, this looked like this:

 (empl:Employee {firstname:firstName, lastname:lastName} *100)-[:HAS_SKILL *n..n]->
(skill:Skill {name: progLanguage} *25)  
 (empl)-[:WORKS_AT_DEPT *n..1]->(dept:Department {id: {randomNumber: [2]}} *5)  
 (empl)-[:LOCATED_AT_OFFICE *n..1]->(office:Office {id: {randomNumber: [2]}} *10)  
 (empl)-[:HAS_JOB *1..1]->(job:Job {id: {randomNumber: [2]}} *2)  
 (empl)-[:HAS_COVERAGE *1..1]->(coverage:Coverage {id: {randomNumber: [2]}} *10)  
 (coverage)-[:HAS_CLAIM *1..n]->(HospClaim:HospitalClaim {id: {randomNumber: [2]}} *5)  
 (coverage)-[:HAS_CLAIM *1..n]->(NonHospClaim:NonHospitalClaim {id: 
{randomNumber: [2]}} *5)  
 (coverage)-[:HAS_CLAIM *1..n]->(DentClaim:DentalClaim {id: {randomNumber: [2]}} *5)

you paste it into the website and you get something like this

This is the direct link to it.
Once you've got that, you can download the cypher statements to generate the graph locally, or - and I can't tell you how useful that is, in my opinion - generate the database in one go.

Let's take a look at that now.

From GraphGen to GrapheneDB - one simple click

Generating the database locally is of course one option - but in this case I think it is useful to experiment a bit using GrapheneDB. Maybe you have never used GrapheneDB - but let me tell you that it's pretty sweet for standard prototyping and experimentation. They have a free tier for sandbox graph databases:

This is plenty for most prototypes. In our case here, I will create the IDMSEMPDEMO sandbox database, and then... use the connection details specified below to connect GraphGen to it over the REST API.

Using the GraphGen system, I can now just immediately let the page generate the graph on YOUR GrapheneDB Neo4j server, through the REST API. It starts with some basic information:

and then all you need to do is wait a few seconds to have it complete:

Job done. Now I can start interacting with the graph on the Neo4j Browser hosted on Graphene.

Querying the Simulated EMPDEMO

Here's what the new Empdemo looks like in the shiny Neo4j Browser:

Now we can start thinking of some additional queries. I am no IDMS expert - on the contrary - but what I understood from my conversations with Luc is that you really need to think about the IDMS schema BEFORE you do any queries. If your query patterns are misaligned with the schema - then there's a redesign task to be done for it to work. That's very different to the Neo4j model. Of course, there's a clear and definite interaction between model and query pattern, and things may run more slowly/faster in one model as opposed to another model - but it will still be possible. And adjusting the model in Neo4j is... dead easy.

So here are some queries:

 //find employees  
 match (e:Employee) return count(e);  
 match (e:Employee) return e.firstname, e.lastname;  
   
 //find employee skills  
 match (e:Employee)-[:HAS_SKILL]->(s:Skill)  
 return e.firstname, e.lastname, count(s)  
 order by count(s) desc;  
   
 //find avg nr of employee skills  
 match (e:Employee)-[:HAS_SKILL]->(s:Skill)  
 with e, count(s) as nrofskills  
 return count(e), avg(nrofskills);

Here's what the second query looks like in the Browser:

Many of those query patterns will be similar to what you could do in IDMS, but there are some query patterns that are perhaps a bit different. Like the following example: a pathfinding query:

 //find paths  
 match p=allshortestpaths((n:Employee {lastname:"Howe"})-[*]-(m:Employee 
{lastname:"Herman"})) return p limit 5;

Doing that is really easy in Neo4j, and the results are very interesting:

Without knowing anything about the schema or the potential paths between two elements in the network, I can gain some interesting insights about the connections/relationships between my different entities.

This of course is just a start. But all I really wanted to do with this post was to highlight that

there is a very natural match/fit between the IDMS mainframe model, and the modern-day Neo4j model
It is trivial to simulate that model using fantastic new tools like GraphGen
It is even easier to actually take that model for a spin on GrapheneDB.

Hope that was useful. As always, feedback is very welcome.

Cheers

Rik

Thursday 23 October 2014

Graph Karaoke!

Now that GraphConnect is almost behind us, I can finally talk about and publish the playlist that I have been compiling of songs that I have loaded in Neo4j.
Thanks to Nigel Small for the creative ideas along the way - onto many more musical discoveries!!!

Cheers

Rik

Friday 17 October 2014

Using graphs for recommendations

Last month the wonderful people of Data Science Brussels invited me to do a talk about HR Analytics and how graphs fit into that. The material is over here. It was a great night, so when Philippe told me that the next meeting was going to be about Data Science in Marketing - I just had to invite myself. So yesterday, I had a the pleasure - again :) - to do a talk at the about how you could use graphs in Marketing.

The thing that I focussed on was the topic of recommendations and recommender systems. In a world where content is king, and consumers are always looking for specific deals that would fit their profile best, these systems are going to become critically important.

In my mind, these systems always consist of two parts:

a Pattern discovery system, where you - somehow - figure out what the patterns are that you are looking for. Could be by asking a domain expert, but could just as well be through some advanced machine learning algorithm.
a Pattern application system, where you start operationalising the patterns that you have found and applying them in real world applications, either in batch, or in near-real-time.

I believe that graphs and graph databases offer tremendous potential for this domain, and have tried to illustrate that in this talk. The key points being that

Graphs excel at making these recommendations in real time. No more batch precalculations - just deliver the patterns you uncover as they develop.
Graphs benefit from some of the most fascinating data analytics techniques out there: triadic closures, centrality measures, betweenness measures, PageRank scoring - all of which help you determine what parts of the graph really matter, or not.
Graphs benefit from some operational advantages (among which: graph locality in your queries) to make them operationally very efficient.

I am sure I am missing stuff here, but these seem to be the most important pieces, to me.

The slides are on SlideShare, and included below:

The demonstration that goes with this was also recorded below. It's a little long, but I hope it's still interesting:

I hope this was a useful illustration for you of this fascinating use case. Any feedback - as always - greatly appreciated.

Cheers

Rik

Thursday 16 October 2014

How graphs revolutionize Identity and Access Management

I have been a big believer in Identity and Access Management technologies for the longest time. I spent many years of my professional life working for companies like Novell, Imprivata and Courion - trying to make organisations improve their policies and processes when it comes to authentication, roles and rights. The fundamental reason why spent so much time in that industry is that I truly am convinced of the fact that security threats usually (I say this full knowing that there are of course spectacular exceptions) are not external to organisations - it's usually the disgruntled employee that missed a promotion, or the inadvertent administrator making mistakes with catastrophic security effects. That's how security threats happen most of the time - not because of some external hacker. That's mostly the stuff that movies are made of - not reality.

But: I left that industry two and a half years ago, and started working for NeoTech, because I was sick and tired of the whole thing. Access Management (like Imprivata's toolset) is pretty ok - but "Identity Management" - djeez it's really a big f'ing mess. Maybe I am exaggerating a bit, but I remember thinking what a perverted, dishonest, and utterly money-squandering industry it was. Perverted and dishonest because most of the "products" out there require an army of consultants to make simple things work. And money-squandering because, while the low-end tools maybe affordable, the high-end tools that you *really* want are just completely and ridiculously expensive. It's perverse.

A meeting of minds

And then, about a year ago, I think, I came across this wonderful talk by Ian Glazer (then at Gartner, now at Salesforce). Here is his talk, or watch it below:

Ian talks about the 3 major problems that the identity management industry faces. Let me paraphrase these:

Identity Management is - still - preoccupied with a very static view of the world. It's crazy. People are still trying to automate simple things like create/update/deleting of user credentials, things that add zero to no real business value. IT systems, including Identity Management systems should be more dynamic - should contribute to some kind of competitive business value, shouldn't it? IT for IT's sake - who still does that???
Identity management, as a consequence of the point above probably, has very poor business application alignment. Organisations today are really not as much interested in automating INTERNAL processes, only. They know that part - been there done that. Now, the time has come to spend time on the EXTERNAL facing processes. The processes that link our value creation to other, external parts of the chain. Linking with suppliers, partners, customers, etc. That's where the real business value of these systems lies - not internally. And yet, Identity Management struggles to go there. The consequence of this, I believe, is a constant struggle to justify the investment: how do you explain to business people that really they should bring on board an army of consultants for a year to help solve a problem that is not aligned with the business priorities? You don't.
Finally: Identity Management is not leveraging real world relationships between people, assets, roles, organisations and security policies. Effectively, people - still, today - manage access as part of a hierarchical view of the world. This of course, we know because Emil keeps explaining to everyone, is false. The world is a graph. You should embrace and leverage it.

Ian's conclusion - if I may interprete it - was that really, the Identity Management industry needs to start over. It needs to be killed, in order to be reborn. And I think, that when it gets reborn, as part of that revolutionary overhaul, graph databases like Neo4j will be a big part of the new incarnation. Here's why.

How can graphs help?

I have been trying to summarize, to the best of my abilities and from a very high level, how Graph Databases will help reinvent Identity Management - when that happens. I believe that there are, effectively, two points that will be of massive help.

Hi-Fi representation of relationships in an Identity Graph: many people have referred to this in other places, but effectively this is all about reducing the "impedance mismatch" of traditional hierarchical access control and identity management systems. Just like many business applications based on Relational Database Management Systems suffer from this problem (and try to patch it up with object-relational mapping band-aids), identity and access management tools try to do this with directories and all kinds of fancy overlaying tools. I believe that to things would no longer be of any use to us, if we were to express the relationships in our Identity Graph appropriately:

We could eliminate the need for separate RBAC systems: Role Based Access Control (RBAC) systems are some of the most complex, tedious to use, difficult to understand, expensive to implement etc etc identity and access management systems out there. They are - no other word for it, in my opinion - absolutely terrible. The concept is great and graph based, but the implementations that I know of are just saddening. If we were to be able to express these Roles, these "cross-cutting concerns" that attribute rights to assets across different parts of the traditional hierarchy, in a graph traversal rather than a complex query and integration over relational and other systems, the world would be so much simpler.
We could probably eliminate the need for application specific directories. I know this is a bold statement, and one that was made before when LDAP was first introduced, but I really think this could be true. The reason why Identity Management often times continues to be so difficult is because of the integration problems that are associated with it. These integrations - today - are necessary because of the application specific information that currently is stored in each of the applications. That application specific information currently has to be stored in the application - and not in the central directory - because it would be too difficult to model, insert and query in the traditional hierarchy of those directories. So what if the directories would become graphs instead of hierarchies? Maybe then that need would go away? I know - that will take time. But it is, I think, a valid option and vision for the future?

Real-time querying becomes easy and fast

Needless to say, directory servers are good for some things. I remember working with Novell (now NetIQ) eDirectory and it was blazingly fast for some of those typical access control queries. So is Active Directory and OpenLDAP, I assume. But as we discussed above, the new kinds of queries we want are typically multi-dimensional, across different hierarchies, cross-cutting graph traversals - that's the way we want to answer authentication and access questions in the future. Directory servers are not geared to do that. Graph databases are. That's why I think that databases like Neo4j will be playing a big part in this.

I hope that's clear. Maybe it's a little vague now - but hey, that's how revolutions start :)

Useful pointers and next steps

All of the above is why you can find a lot of different examples and materials out there to help you get started with this. There's a couple of things that I would love to point you to.

there's a great public case study out there about how Telenor uses Neo4j to do this kind of stuff. Take a look at it over here.
Wes Freeman recently created a really nice airpair course that is centered around this use case. Very nice and simple.
Max De Marzi has written a couple of very hands-on blog posts around permission resolution with Neo4j. Look at part 1, part 2 and part 3 for some code.
the awesome Graph Databases Book has a chapter about this as well.
I have just done a webinar (recording link to follow) about this as well. Here are the slides:

The dataset I used to do this can be generated with this gist, and the queries are available over here as well. I have also recorded the demo that I do in the webinar separately - see below:

That's about it for now. I hope this was a useful post for you, and look forward to hearing from you.