Friday 31 October 2014

Simulating the IDMS EmpDemo using GraphGen and GrapheneDB

A couple of weeks ago, someone put me in touch with Luc Hermans of P&V Group. Luc had done this presentation 2 years ago about loading the default CA IDMS EmpDemo database into Neo4j. That triggered my interest bigtime, as my recollection of IT history strongly remembers mainframe database technologies as the hotbed for many of our present-day data technologies.

On the wikipedia page it has a bit of background:
IDMS (Integrated Database Management System) is primarily a network (CODASYL) database management system for mainframes. It was first developed at B.F. Goodrich and later marketed by Cullinane Database Systems (renamed Cullinet in 1983). Since 1989 the product has been owned by Computer Associates (now CA Technologies), who renamed it Advantage CA-IDMS and later simply to CA IDMS.
If you think about it for a minute, you immediately understand that these technologies are conceptually very similar, and that "the EMPDEMO" would probably also be a great example of the power of Neo4j.

Now I would love to load the exact same dataset into Neo4j, and have been trying to redo Luc's exercise of loading the data into Neo4j myself. But it is not that easy if you don't have a mainframe background. The model, however is easy enough to understand. Here's the "old" IDMS model:

And Luc created this Neo4j graph model on top of that:
You can immediately spot the similarities, can't you! It's just very, very similar. In fact, the Neo4j model seems a bit easier, since the "junction" nodes in the IDMS model (that represent many-to-many and reflective relationships in the IDMS world) can be eliminated through properties on the direct relationships (see the orange Emposition, Expertise and Structure elements that can be eliminated from the IDMS model). Nevertheless IDMS and Neo4j seem to be - at least philosophically - very related technologies. Even the implementation has some shared characteristics, although I bet that any *real* expert will tell you that IDMS is way more powerful in some ways, but Neo4j is probably more flexible.

Simulating the EMPDEMO with Neoxygen GraphGen

While I was trying to actually do the import of the EMPDEMO sequential files into Neo4j - which I am sure I will succeed at doing, some day - I thought that it would very likely be way easier to "simulate" the same concepts that the EMPDEMO domain is representing, from a vanilla Neo4j install. How? By generating the dataset of course. So then I started thinking about how to do that, and the remainder of this post is going to be about that.

Turns out that Christoph Willemsen has been developing his suite of PHP tools for Neo4j, among which an incredibly lovely data generator, GraphGen. This is a brand new toolset that Christoph has made, and I must say it is absolutely lovely.

Using GraphGen is simple:

  • you describe your domain (ie. the EMPDEMO model that we have above) in a cypher like syntax, specifying the number of nodes that you want to be generating and the cardinality of the relationships that you want generated. It's really like translating the picture above into Ascii Art :) ... 
     
  • In my case, this looked like this:
 (empl:Employee {firstname:firstName, lastname:lastName} *100)-[:HAS_SKILL *n..n]->
(skill:Skill {name: progLanguage} *25)  
 (empl)-[:WORKS_AT_DEPT *n..1]->(dept:Department {id: {randomNumber: [2]}} *5)  
 (empl)-[:LOCATED_AT_OFFICE *n..1]->(office:Office {id: {randomNumber: [2]}} *10)  
 (empl)-[:HAS_JOB *1..1]->(job:Job {id: {randomNumber: [2]}} *2)  
 (empl)-[:HAS_COVERAGE *1..1]->(coverage:Coverage {id: {randomNumber: [2]}} *10)  
 (coverage)-[:HAS_CLAIM *1..n]->(HospClaim:HospitalClaim {id: {randomNumber: [2]}} *5)  
 (coverage)-[:HAS_CLAIM *1..n]->(NonHospClaim:NonHospitalClaim {id: 
{randomNumber: [2]}} *5)  
 (coverage)-[:HAS_CLAIM *1..n]->(DentClaim:DentalClaim {id: {randomNumber: [2]}} *5)  

  • you paste it into the website and you get something like this
    This is the direct link to it.
  • Once you've got that, you can download the cypher statements to generate the graph locally, or - and I can't tell you how useful that is, in my opinion - generate the database in one go. 
Let's take a look at that now.

From GraphGen to GrapheneDB - one simple click

Generating the database locally is of course one option - but in this case I think it is useful to experiment a bit using GrapheneDB. Maybe you have never used GrapheneDB - but let me tell you that it's pretty sweet for standard prototyping and experimentation. They have a free tier for sandbox graph databases:
This is plenty for most prototypes. In our case here, I will create the IDMSEMPDEMO sandbox database, and then... use the connection details specified below to connect GraphGen to it over the REST API.

Using the GraphGen system, I can now just immediately let the page generate the graph on YOUR GrapheneDB Neo4j server, through the REST API. It starts with some basic information:

and then all you need to do is wait a few seconds to have it complete:

Job done. Now I can start interacting with the graph on the Neo4j Browser hosted on Graphene.

Querying the Simulated EMPDEMO

Here's what the new Empdemo looks like in the shiny Neo4j Browser:
Now we can start thinking of some additional queries. I am no IDMS expert - on the contrary - but what I understood from my conversations with Luc is that you really need to think about the IDMS schema BEFORE you do any queries. If your query patterns are misaligned with the schema - then there's a redesign task to be done for it to work. That's very different to the Neo4j model. Of course, there's a clear and definite interaction between model and query pattern, and things may run more slowly/faster in one model as opposed to another model - but it will still be possible. And adjusting the model in Neo4j is... dead easy.

So here are some queries:

 //find employees  
 match (e:Employee) return count(e);  
 match (e:Employee) return e.firstname, e.lastname;  
   
 //find employee skills  
 match (e:Employee)-[:HAS_SKILL]->(s:Skill)  
 return e.firstname, e.lastname, count(s)  
 order by count(s) desc;  
   
 //find avg nr of employee skills  
 match (e:Employee)-[:HAS_SKILL]->(s:Skill)  
 with e, count(s) as nrofskills  
 return count(e), avg(nrofskills);  
   

Here's what the second query looks like in the Browser:

Many of those query patterns will be similar to what you could do in IDMS, but there are some query patterns that are perhaps a bit different. Like the following example: a pathfinding query:

 //find paths  
 match p=allshortestpaths((n:Employee {lastname:"Howe"})-[*]-(m:Employee 
{lastname:"Herman"})) return p limit 5;  

Doing that is really easy in Neo4j, and the results are very interesting:
Without knowing anything about the schema or the potential paths between two elements in the network, I can gain some interesting insights about the connections/relationships between my different entities.

This of course is just a start. But all I really wanted to do with this post was to highlight that
  • there is a very natural match/fit between the IDMS mainframe model, and the modern-day Neo4j model
  • It is trivial to simulate that model using fantastic new tools like GraphGen
  • It is even easier to actually take that model for a spin on GrapheneDB
Hope that was useful. As always, feedback is very welcome.

Cheers

Rik


No comments:

Post a Comment