Thursday 9 January 2014

Leftovers from the holidays: Genealogy Graphs!

Of all the graphy posts that I have written over the past 18 months, this is probably one of the ones that is closest to my heart. Beer-posts not included of course - those are in a category of their own :) ... This is a post about my family, and a tale of how a tiny little experiment led to a couple of very late graphy nights.

Mid-life crisis schmisis!

So maybe a bit of context. The fact is that I turned 40 a couple of months ago, and with that, in spite of what you may think, there has been a surprising lack of crisis. I feel good about my age, have no problem getting older, and don't feel any need whatsoever to go off and race around on a loud Harley Davidson or something. But, and maybe this is completely unrelated, I have found myself doing something else. I have gotten a lot more interested in the most basic of things: family history. I have been playing around with old photographs, scanning them and trying to make something out of them - and more stuff like that.


So a couple of weeks before Christmas, I discovered a website that I had never heard of before: geni.com. I spent some time playing around with it, and found the concept of "social genealogy" really interesting. Geni allows you to share your tree with people in the tree, and then lets them work together with you on completing the information. Next thing I know, I had shared it with my dad, he had shared it with an uncle, and before long dusty old boxes of family history were being opened. One of these boxes contained a lot of research done by my great-grand-uncle (the brother of my grandfather's father), Alphonse Van Bruggen. He's the priest in the picture below:


The fact that he was a priest is relevant: he had access to all of the baptism registries in Belgium/The Netherlands, and was extremely well placed to do some family tree research. And he did. At a family party this last Christmas break, I got my hands on his hand-written notes full of details, names, birthdates, burial dates, places, etc etc. It was an amazing piece of information - I still can barely believe that I found this.


Next thing I know I end up inputting all of this data into Geni. It was a lot of work - two late nights went into this. It's now an amazing tree that many of my family members are working together on - and which dates back to Jan van Brugel - my oldest ancestor - at the end of the 17th century. 

I can make a graph out of that!

Then of course my mind starts thinking: wouldn't it be great if I could do more with this data? After all - this is not just a "family tree"! It's a family graph - all the different trees (my own family, my in-laws, my mother's family, my dad's family, etc.) interact here, and it is clearly something that could really benefit from a graph database. So I started to look around, noticed that I could export a GEDCOM file from Geni, and that there are tools like Gramps around that allow you to read these files offline. Gramps actually allowed me to export my family tree into a CSV file - and then... you know I love spreadsheets ... it was just one more step to create a Google spreadsheet that would allow me to prepare my import. I used the good old spreadsheet import method here - the dataset is just a couple of hundred nodes and relationships.

The model I chose to work with is as follows (courtesy of the wonderful graphjson.io tool for the colours): people have labels (male/female), are in relationships, and have kids as part of that relationship.
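To make the model a bit more concrete, here is a minimal sketch of what one couple and one child could look like in Cypher. The labels, relationship types and the FirstName/LastName properties are the ones used in the queries further down; the three people themselves are made up for illustration and are not part of my actual tree.

```cypher
// Hypothetical example data: one couple and one child (not from the real dataset).
create (dad:male {FirstName: "Jan", LastName: "Example"}),
       (mum:female {FirstName: "Maria", LastName: "Sample"}),
       (kid:male {FirstName: "Piet", LastName: "Example"}),
       // the couple itself is a node, so that kids can hang off it
       (couple:relationship),
       (dad)-[:IS_MAN_OF]->(couple),
       (mum)-[:IS_WOMAN_OF]->(couple),
       (kid)-[:CHILD_OF]->(couple);
```

Every person points at the relationship node, which is why the queries below always have to hop through it to get from a child to a parent.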

I must admit that I had a couple of data quality issues at first, but I spent some time troubleshooting the dataset and was able to get a neo4j database with my entire family network in a matter of minutes rather than hours. Result!

Some interesting family queries!

Let's take a look at some interesting queries that I was able to write with this dataset.

Find my kids!

// kids point at the relationship node that Rik and his spouse both belong to
match (rik:male {FirstName:"Rik"}),
(kids)-[:CHILD_OF]->(relationship:relationship)<-[:IS_MAN_OF]-(rik),
(spouse:female)-[:IS_WOMAN_OF]->(relationship)
return kids.FirstName, kids.LastName;

And there they are:


Easy! So let's try something more difficult. 

Find my grandfathers!

This is the query I wrote to find my own grandfathers - a bit more difficult as it involves using a UNION statement to combine the two grandfathers from both my father's and my mother's side of the family into one result set.

match (rik:male {FirstName:"Rik"})-[:CHILD_OF]->()<-[:IS_MAN_OF]-(father:male)-[:CHILD_OF]->()<-[:IS_MAN_OF]-(grandfather1:male)
return grandfather1.FirstName as FirstName, grandfather1.LastName as LastName
UNION
match
(rik:male {FirstName:"Rik"})-[:CHILD_OF]->()<-[:IS_WOMAN_OF]-(mother:female)-[:CHILD_OF]->()<-[:IS_MAN_OF]-(grandfather2:male)
return grandfather2.FirstName as FirstName, grandfather2.LastName as LastName;

and there they are:


But to be honest: I thought this was terribly complicated. So I decided to try and make things easier.

Inferring fatherhood/motherhood

Part of the complexity of the query above is simply because we have to traverse through a relationship node every time we want to find a child of a parent. Seems like an unnatural and complicated thing to do. Wouldn't it be nicer if we had a model that looked more like this:

At first I was uneasy about this, as I felt that I was duplicating information in my model, but after reading Mark's article on the topic I decided to just go with it - it would make life a lot easier. So to implement this, I would need to infer the FATHER_OF and MOTHER_OF relationships, by executing a Cypher query that would update the graph. The query goes like this:

match (child)-[:CHILD_OF]->(rel:relationship)<-[:IS_MAN_OF]-(father:male),
(rel)<-[:IS_WOMAN_OF]-(mother:female)
create (mother)-[:MOTHER_OF]->(child)<-[:FATHER_OF]-(father);
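One caveat with a plain create: running the query twice would add every FATHER_OF and MOTHER_OF relationship twice, and because the match needs both a father and a mother, kids with only one known parent get nothing at all. A possible variation, as a sketch that I have not run against the full dataset, is to split it into two statements and use merge, which only creates a relationship if it is not there yet:

```cypher
// Infer fathers; safe to re-run because merge skips relationships that already exist.
match (child)-[:CHILD_OF]->(:relationship)<-[:IS_MAN_OF]-(father:male)
merge (father)-[:FATHER_OF]->(child);

// Infer mothers separately, so a missing father does not block the mother (and vice versa).
match (child)-[:CHILD_OF]->(:relationship)<-[:IS_WOMAN_OF]-(mother:female)
merge (mother)-[:MOTHER_OF]->(child);
```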

Simple enough! So let's see what this would do to the grandfather query:

match (rik:male {FirstName:"Rik"})<-[:FATHER_OF*2]-(grandfather1)
return grandfather1.FirstName as FirstName, grandfather1.LastName as LastName
UNION
match (rik:male {FirstName:"Rik"})<-[:MOTHER_OF]-()<-[:FATHER_OF]-(grandfather2)
return grandfather2.FirstName as FirstName, grandfather2.LastName as LastName;

That's a lot more readable in my book. And one of the nice things is that these new relationships would actually allow me to easily follow my lineage all the way back to the 17th century roots of my family tree:

match (rik:male {FirstName:"Rik"})<-[:FATHER_OF*]-(lineage:male)
return lineage;

gives me this:


All the way left are me and my sister. All the way right is Jan Van Brugel.
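If you would rather have the lineage as a list than as a picture, a small variation on the same query could return each forefather together with how many generations back he is. This is just a sketch: the path variable and the Generations column name are my own choices for illustration.

```cypher
// Walk up the paternal line and report how far back each ancestor sits.
match path = (rik:male {FirstName:"Rik"})<-[:FATHER_OF*]-(ancestor:male)
return ancestor.FirstName as FirstName,
       ancestor.LastName as LastName,
       length(path) as Generations   // number of FATHER_OF hops from Rik
order by Generations;
```

The father comes back with Generations = 1, the grandfather with 2, and so on up to the oldest ancestor in the dataset.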

Learning more about life in the past few centuries

Obviously, now with this dataset in neo4j, I could ask lots of other questions:

Age of my forefathers

match (n:male) 
where NOT n.Birthdatenumber = "Unknown" AND NOT n.Deathdatenumber = "Unknown" 
return n.FirstName as FirstName, n.LastName as LastName, (n.Deathdatenumber-n.Birthdatenumber)/365 as Age
order by Age DESC;


Not a "bad" result - but not a very exciting one either. Chances of getting over eighty seem slim, based on this historical performance :) ...
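To put a single number on that, the same filter can feed a couple of aggregates. This sketch assumes, as the query above does, that Birthdatenumber and Deathdatenumber are day counts stored as numbers (or the string "Unknown"); the column names are only illustrative.

```cypher
// Summarise the ages instead of listing them one by one.
match (n:male)
where n.Birthdatenumber <> "Unknown" and n.Deathdatenumber <> "Unknown"
with (n.Deathdatenumber - n.Birthdatenumber) / 365.0 as Age   // 365.0 forces a decimal result
return count(*) as People, avg(Age) as AverageAge, max(Age) as OldestAge;
```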

Childhood deaths

This one's actually more encouraging: 
match (n)
where n.Deathdatenumber <> 'Unknown' AND n.Birthdatenumber <> 'Unknown'
with n, (n.Deathdatenumber-n.Birthdatenumber) as Age
return n.FirstName as FirstName, n.LastName as LastName, n.Birthdate, Age
order by Age ASC
limit 5;


Seems like in three centuries' worth of (incomplete) data, we only had three babies die within the first 100 days (the Age column here is in days, not years). Tragic as that is - it still seemed like an ok number, especially if you know the huge advances in medical care that we have made in Western Europe over the past century.

Multiple relationships? Check!

And then of course the question to end all questions: how many of my ancestors were in more than one relationship, in other words had more than one wife or husband? Here's the Cypher query:

match (r1:relationship)-[:IS_MAN_OF|IS_WOMAN_OF]-(n)-[:IS_MAN_OF|IS_WOMAN_OF]-(r2:relationship)
return distinct n.FirstName as FirstName, n.LastName as LastName, n.Birthdate as Birthdate;

And again, it yielded an encouraging result:


Only 4! I think my missus will feel reassured :) ...
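If you also want to know how many relationships each of them had, one way to sketch it is to count the relationship nodes per person and filter on that count. The Relationships column name is just something I made up for this example.

```cypher
// Count relationship nodes per person and keep only those with more than one.
match (n)-[:IS_MAN_OF|IS_WOMAN_OF]->(r:relationship)
with n, count(distinct r) as Relationships
where Relationships > 1
return n.FirstName as FirstName, n.LastName as LastName, Relationships
order by Relationships desc;
```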

So that's about it. I must say that this was one blogpost that I more than enjoyed writing - it was fantastic to explore my family history, and use the wonderful world of graphs and neo4j while doing so. I will not be publishing the database - in the interest of my family's privacy - but am happy to discuss if you are interested in knowing more.

Hope this was interesting and/or useful.

All the best.

Rik

5 comments:

  1. Thanks a lot. I'm working in a Museum that has over 10,000 gedcom files and my job is to bring them online and I plan to use your design to store our trees in Neo4J.
    One modification to your design we're considering is using PARENT_OF instead of MOTHER_OF and FATHER_OF to simplify the path. I'd be happy to hear your take.

  2. Yeah that probably makes sense. I guess the question to ask yourself is what kinds of queries you want to ask, and then work back the model from there. I could see how "PARENT_OF" would make things easier for some queries, but more complicated for others :) ... In fact, there is no reason why you wouldn't be able to have BOTH the PARENT_OF and the MOTHER_OF/FATHER_OF relationships at the same time. That's the beauty of the graph model - you actually totally have that flexibility...

    Hth

    Rik

  3. I am glad to have come across your blog today. Six months ago a friend and I started working on the same thing: we thought it would be great to map our family trees and use Neo4j to do so. We explored it but could not take it any further at the time, as we moved to different cities. I must admit that before reading your blog I had reached the step of using Google Sheets to have all my data intact, and after reading it I am reassured of the path I chose.

    It would be great to see your project's gist or a snippet to take a sneak peek. My repo is https://github.com/sachinsancheti1/Lifeline . I'd love to someday share my tree with all my known relatives to fill in the necessary data, see how far it goes :) and visualize it.

  4. Does your model account for the condition where the parents have no relationship?

    Replies
    1. I think so. This is actually one of the powerful characteristics of a graph. I see two ways of solving it:
      - either you just omit the "Relationship" node for that situation - and only work with the "FATHER_OF" and "MOTHER_OF" relationships... that works just fine...
      - or you would have a Relationship node with a different set of attributes... e.g. you could have a 'type="Marriage"' property, or a 'type="Living together"' property, or even 'type="Sperm donor"' if that would be the case...

      Hope that makes sense?

      Rik
