Tuesday, 21 January 2014

The Open Source Licensing Graph

Selling an open source product like Neo4j always gets you into the interesting question of licensing. How do you license your product? And then I get into a very interesting explanation of how the different versions of Neo4j compare in terms of features, license, support capability, and of course pricing.

Tuesday, 14 January 2014

Cool graph events in the next couple of weeks!

Wow. I just realised that there are a TON of very cool graph events coming up that I am going to be gladly participating in.

Hoping to see many of you there!

Monday, 13 January 2014

The Making of - my genealogy graph database

Quite a few people asked me "how did you create the genealogy graph?" - well, here's your answer in a quick 5-minute video.

Hope this is useful.



Thursday, 9 January 2014

Leftovers from the holidays: Genealogy Graphs!

Of all the graphy posts that I have written over the past 18 months, this is probably one of the ones that is closest to my heart. Beer-posts not included of course - those are in a category of their own :) ... This is a post about my family, and a tale of how a tiny little experiment led to a couple of very late graphy nights.

Mid-life crisis schmisis!

So maybe a bit of context. The fact is that I turned 40 a couple of months ago, and with that, in spite of what you may think, there has been a surprising lack of crisis. I feel good about my age, have no problem getting older, don't feel any need whatsoever to go off and race around on a loud Harley Davidson or something. But. Maybe this is completely unrelated, but I have found myself doing something else. I have gotten a lot more interested in the most basic of things: family history. I have been playing around with old photographs, scanning them and trying to make something out of them - and more stuff like that.

So a couple of weeks before Christmas, I discovered a website that I had never heard of before: geni.com. I spent some time playing around with it, and found the concept of "social genealogy" really interesting. Geni allows you to share your tree with the people in it, and then lets them work together with you on completing the information. Next thing I know I share it with my dad, he shares it with an uncle, and before you know it dusty old boxes of family history are being opened. One of these boxes contained a lot of research done by my great-grand-uncle (the brother of my grandfather's father), Alphonse Van Bruggen. He's the priest in the picture below:

The fact that he was a priest is relevant: he had access to all of the baptism registries in Belgium/The Netherlands, and was extremely well placed to do some family tree research. And he did. At a family party this last Christmas break, I got my hands on his hand-written notes full of details, names, birthdates, burial dates, places, etc etc. It was an amazing piece of information - I still can barely believe that I found this.

Next thing I know I end up inputting all of this data into Geni. It was a lot of work - two late nights went into this. It's now an amazing tree that many of my family members are working together on - and which dates back to Jan van Brugel - my oldest ancestor - at the end of the 17th century. 

I can make a graph out of that!

Then of course my mind starts thinking: wouldn't it be great if I could do more with this data? After all - this is not just a "family tree"! It's a family graph - all the different trees (my own family, my in-laws, my mother's family, my dad's family, etc.) interact here, and it is clearly something that could really benefit from something like a graph database. So I started to look around, noticed that I could export a GEDCOM file from Geni, and that there are tools like Gramps around that allow you to read these files offline. Gramps actually allowed me to export my family tree into a CSV file - and then... you know I love spreadsheets ... it was just one more step to create a Google spreadsheet that would allow me to prepare my import. I used the good old spreadsheet import method here - the dataset is just a couple of hundred nodes and relationships.

The model I chose to work with is as follows (courtesy of the wonderful graphjson.io tool for the colours): people have labels (male/female), are in relationships, and have kids as part of that relationship.
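To make the model a bit more concrete, here is a minimal sketch of what one tiny family could look like in Cypher. The labels and relationship types are the ones used in the queries further down; the people and their names are made up purely for illustration, and the identifiers (dad, mum, couple, kid) are just placeholders.

```cypher
// Hypothetical sample data: these names are invented, not taken from the real tree.
create (dad:male {FirstName:"Jan", LastName:"Example"}),
       (mum:female {FirstName:"Maria", LastName:"Sample"}),
       (couple:relationship),
       (kid:female {FirstName:"Anna", LastName:"Example"}),
       // both partners point at the relationship node
       (dad)-[:IS_MAN_OF]->(couple),
       (mum)-[:IS_WOMAN_OF]->(couple),
       // children point at the relationship node they were born into
       (kid)-[:CHILD_OF]->(couple);
```

The relationship node in the middle is what allows one person to be part of several couples, each with their own kids.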

I must admit that I had a couple of data quality issues at first, but I spent some time troubleshooting the dataset and was able to get a neo4j database with my entire family network in a matter of minutes rather than hours. Result!

Some interesting family queries!

Let's take a look at some interesting queries that I was able to write with this dataset.

Find my kids!

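// Kids hang off the relationship node that their father is part of
match (rik:male {FirstName:"Rik"}),
      (rik)-[:IS_MAN_OF]->(:relationship)<-[:CHILD_OF]-(kids)
return kids.FirstName, kids.LastName;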

And there they are:

Easy! So let's try something more difficult. 

Find my grandfathers!

This is the query I wrote to find my own grandfathers - a bit more difficult as it involves using a UNION statement to combine the two grandfathers from both my father's and my mother's side of the family into one resultset.

// Paternal grandfather: my father's father
match (rik:male {FirstName:"Rik"})-[:CHILD_OF]->()<-[:IS_MAN_OF]-(father:male)-[:CHILD_OF]->()<-[:IS_MAN_OF]-(grandfather1:male)
return grandfather1.FirstName as FirstName, grandfather1.LastName as LastName
union
// Maternal grandfather: my mother's father
match (rik:male {FirstName:"Rik"})-[:CHILD_OF]->()<-[:IS_WOMAN_OF]-(mother:female)-[:CHILD_OF]->()<-[:IS_MAN_OF]-(grandfather2:male)
return grandfather2.FirstName as FirstName, grandfather2.LastName as LastName;

and there they are:

But to be honest: I thought this was terribly complicated. So I decided to try and make things easier.

Inferring fatherhood/motherhood

Part of the complexity of the query above comes from having to traverse through a relationship node every time we want to find a child of a parent. That seems like an unnatural and complicated thing to do. Wouldn't it be nicer if we had a model that looked more like this:

At first I was uneasy about this, as I felt that I was duplicating information in my model, but after reading Mark's article on the topic I decided to just go with it - it would make life a lot easier. So to implement this, I would need to infer the FATHER_OF and MOTHER_OF relationships by executing a Cypher query that would update the graph. The query goes like this:

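// Each relationship node links a father and a mother to their children
match (child)-[:CHILD_OF]->(rel:relationship)<-[:IS_MAN_OF]-(father:male),
      (rel)<-[:IS_WOMAN_OF]-(mother:female)
create (mother)-[:MOTHER_OF]->(child)<-[:FATHER_OF]-(father);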
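One thing to watch out for: create adds new relationships every time it runs, so running the query twice would leave you with duplicates. A possible variant, assuming a Neo4j version that supports merge on relationships, is to split it into two statements. This is a sketch of an alternative, not the query used above:

```cypher
// merge only creates the FATHER_OF relationship if it is not there yet
match (child)-[:CHILD_OF]->(:relationship)<-[:IS_MAN_OF]-(father:male)
merge (father)-[:FATHER_OF]->(child);

// same idea for the mothers
match (child)-[:CHILD_OF]->(:relationship)<-[:IS_WOMAN_OF]-(mother:female)
merge (mother)-[:MOTHER_OF]->(child);
```

Splitting it up has the side effect that kids also get linked when only one of the two parents is known in the data.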

Simple enough! So let's see what this would do to the grandfather query:

match (rik:male {FirstName:"Rik"})<-[:FATHER_OF*2..2]-(grandfather1)
return grandfather1.FirstName as FirstName, grandfather1.LastName as LastName
union
match (rik:male {FirstName:"Rik"})<-[:MOTHER_OF]-()<-[:FATHER_OF]-(grandfather2)
return grandfather2.FirstName as FirstName, grandfather2.LastName as LastName;

That's a lot more readable in my book. And one of the nice things is that these new relationships would actually allow me to easily follow my lineage all the way back to the 17th century roots of my family tree:

match (rik:male {FirstName:"Rik"})<-[:FATHER_OF*]-(lineage:male)
return lineage;

gives me this:

All the way left are me and my sister. All the way right is Jan Van Brugel.
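If you also want to know how many generations back each ancestor sits, one option is to name the path and return its length. A small sketch, where p, ancestor and Generations are names picked just for this example:

```cypher
// length(p) is the number of FATHER_OF hops between me and the ancestor
match p = (rik:male {FirstName:"Rik"})<-[:FATHER_OF*]-(ancestor:male)
return ancestor.FirstName as FirstName, ancestor.LastName as LastName, length(p) as Generations
order by Generations;
```

The father would come out at 1, the grandfather at 2, and so on down the paternal line.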

Learning more about life in the past few centuries

Obviously, now with this dataset in neo4j, I could ask lots of other questions:

Age of my forefathers

match (n:male) 
where NOT n.Birthdatenumber = "Unknown" AND NOT n.Deathdatenumber = "Unknown" 
return n.FirstName as FirstName, n.LastName as LastName, (n.Deathdatenumber-n.Birthdatenumber)/365 as Age
order by Age DESC;

Not a "bad" result - but not a very exciting one either. Chances of getting over eighty seem slim, based on this historical performance :) ...

Childhood deaths

This one's actually more encouraging: 
match (n)
where n.Deathdatenumber <> 'Unknown' AND n.Birthdatenumber <> 'Unknown'
with n, (n.Deathdatenumber-n.Birthdatenumber) as Age
return n.FirstName as FirstName, n.LastName as LastName, n.Birthdate, Age
order by Age ASC
limit 5;

Seems like in three centuries' worth of (incomplete) data, we only had three babies die within the first 100 days. Tragic as that is, it still seemed like an OK number, especially if you consider the huge advances in medical care that we have made in Western Europe over the past century.

Multiple relationships? Check!

And then of course the question to end all questions: how many of my ancestors had more than one wife (or husband, as the query looks at both)? Here's the Cypher query:

match (r1:relationship)-[:IS_MAN_OF|IS_WOMAN_OF]-(n)-[:IS_MAN_OF|IS_WOMAN_OF]-(r2:relationship)
return distinct n.FirstName as FirstName, n.LastName as LastName, n.Birthdate as Birthdate;

And again, it yielded an encouraging result:

Only 4! I think my missus will feel reassured :) ...
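A variation on the same idea, in case you want to see how many relationships each of those people had: count the relationship nodes per person and keep those with more than one. Again just a sketch, and Relationships is a name made up for the example.

```cypher
// count the distinct relationship nodes each person is part of
match (n)-[:IS_MAN_OF|IS_WOMAN_OF]->(r:relationship)
with n, count(distinct r) as Relationships
where Relationships > 1
return n.FirstName as FirstName, n.LastName as LastName, Relationships
order by Relationships desc;
```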

So that's about it. I must say that this was one blogpost that I more than enjoyed writing - it was fantastic to explore my family history, and use the wonderful world of graphs and neo4j while doing so. I will not be publishing the database - in the interest of my family's privacy - but am happy to discuss if you are interested in knowing more.

Hope this was interesting and/or useful.

All the best.