Wednesday 31 October 2018

Data Lineage in Neo4j - an elaborate experiment

For the past couple of years, I have had a LOT of conversations with users and customers of Neo4j who have been looking at graph databases for solving Data Lineage problems. Now, at first, that seemed like a really fancy new word used only by hipster technovangelists trying to appear interesting, but once I drilled into it, I found that it’s actually a really cool application of graph databases. Read more on the background of it on Wikipedia (as always), or just live with this really simple definition:
“Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.”
That’s easy enough. The fact is that it’s a really big problem for large organisations - specifically financial institutions, as they have to comply with regulations like the Basel Committee on Banking Supervision's standard number 239, which is all about assuring data governance and risk reporting accuracy.
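To make the "tracing errors back to their sources" part of that definition a bit more concrete, here is a minimal sketch in plain Python. The dataset names and the `DERIVED_FROM` mapping are invented for illustration; they are assumptions, not taken from any real lineage model.

```python
from collections import deque

# Hypothetical lineage edges: each key is a dataset, each value lists the
# datasets it is directly derived from. All names are illustrative only.
DERIVED_FROM = {
    "risk_report": ["risk_aggregates"],
    "risk_aggregates": ["trades_clean", "counterparties"],
    "trades_clean": ["trades_raw"],
    "counterparties": [],
    "trades_raw": [],
}


def upstream_sources(dataset, derived_from):
    """Return every dataset that `dataset` depends on, directly or
    indirectly, in breadth-first order."""
    seen = []
    queue = deque(derived_from.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(derived_from.get(current, []))
    return seen


lineage = upstream_sources("risk_report", DERIVED_FROM)
print(lineage)
```

If a number in the hypothetical `risk_report` looks wrong, the returned list is the set of places to go and check. In a graph database the same question would typically be a variable-length path traversal rather than a hand-written loop.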

Here’s a couple of really nice articles and videos that should really give you quite a bit of background.

Monday 22 October 2018

Poring over Power Plants: Global Power Emissions Database in Neo4j

In the past couple of weeks, I have been looking into some interesting datasets for the Utility sector, where Networks or Graphs are of course in very, VERY abundant supply. We have Electricity Networks, Gas Networks, Water Networks, Sewage Networks, etc. - that all form these really interesting graphs. Lots of users have specialised, GIS-based tools to manage these networks - but when you think about it, there are so many interesting things that we could do if we only stored the network as a network - in Neo4j of course.
So I started looking for some datasets, and maybe I am not familiar with this domain, but I did not really find anything too graphy. But I did find a different dataset that contained a lot of information about Power Plants - and their emissions. Take a look at this website:
and then you can download the Excel workbook from over here. It's not that big - and of course the first thing I did was to convert it into a Google Sheet. You can access that sheet over here:

There are two really interesting tabs in this dataset:
  1. the sheet containing the fuel types: this gives you a set of classifications of the fuel that is used in the different power plants around the globe
  2. the list of 30.5k power plants from around the world that generate different levels of power from different fuel types. While doing so, they also generate different levels of emissions, of course, and that data is also clearly mentioned in this sheet. Note that the dataset does not include any information on Nuclear plants - as they don't really have any "emissions" other than water vapour and... the nuclear waste of course.
So let's get going - let's import this dataset into Neo4j.
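As a rough idea of what such an import could look like, here is a small Python sketch that turns CSV rows into parameters for a Cypher `MERGE` statement. Everything specific in it is an assumption: the column names, the sample rows, the `PowerPlant` and `FuelType` labels and the `USES_FUEL` relationship type are all made up for illustration and do not come from the actual workbook or from the import used for this post.

```python
import csv
import io

# Hypothetical excerpt of the power plant sheet exported as CSV.
# Column names and values are invented, not taken from the real dataset.
SAMPLE_CSV = """plant_name,country,fuel_type,capacity_mw
Plant A,Belgium,Gas,400
Plant B,Belgium,Coal,600
Plant C,Germany,Gas,250
"""

# Cypher that could be run once per row with the parameters built below.
# Labels, property names and relationship type are assumed, not confirmed.
MERGE_PLANT = (
    "MERGE (f:FuelType {name: $fuel_type}) "
    "MERGE (p:PowerPlant {name: $plant_name}) "
    "SET p.country = $country, p.capacity_mw = $capacity_mw "
    "MERGE (p)-[:USES_FUEL]->(f)"
)


def rows_to_params(csv_text):
    """Parse CSV text into one parameter dict per power plant row."""
    params = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = dict(row)
        # CSV values arrive as strings; store the capacity as a number.
        record["capacity_mw"] = float(record["capacity_mw"])
        params.append(record)
    return params


params = rows_to_params(SAMPLE_CSV)
fuel_types = sorted({p["fuel_type"] for p in params})
print(len(params), "plants,", len(fuel_types), "fuel types")
```

Using `MERGE` for the fuel type means each fuel type becomes a single shared node, so plants that burn the same fuel end up connected through it, which is the "graphy" part of an otherwise tabular dataset.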

Friday 12 October 2018

Podcast Interview with Michael Simons, Neo4j

For this week's episode of our Graphistania podcast, I had the great pleasure of spending some time on the phone with Michael Simons - one of the talented Neo4j engineers who build our products. Michael only recently joined our team, and we actually got talking on our internal channels about something we both love dearly... Bikes. I did a ride in Belgium recently that Michael found interesting and then he rode it himself as well - and hey, we got talking. One thing led to another, and before we knew it we were recording the conversation... Here it is:

Here's the transcript of our conversation:
RVB: 00:00:01.418 Hello, everyone. My name is Rik Van Bruggen from Neo4j, and here I am again recording another episode for our Graphistania podcast. And today, I have one of my dear colleagues on the other side of this Google Hangout again, and that's Michael Simons from Neo4j engineering. Hi, Michael.

MS: 00:00:19.623 Hi, Rik.

Wednesday 3 October 2018

Podcast Interview with Michael McKenzie

Why spend my evenings/weekends/empty hours creating a podcast? Well, that's very simple: I love talking to like-minded people in the graph community. There's something about this community that attracts people who are equally fond of "connections" and building relationships that is just too awesome to explain. I love it. So when Karin told me about this guy in Washington who was doing awesome things with Neo4j and was helping out with community activities (he wrote about it over here), I was all too keen to have a chat with him. Meet Michael McKenzie, from Washington DC - here's our chat:

Note: I recorded this with Michael before our fantastic GraphConnect conference in New York a few weeks ago - but did not have the time to publish it earlier... apologies...

Here's the transcript of our conversation:
RVB: 00:00:00.000 Hello, everyone. My name is Rik. Rik Van Bruggen from Neo4j and here I am again recording another Graphistania Neo4j podcast. And today, I have a wonderful community member on the other side of this Google hangout and that's Michael McKenzie from Washington, D.C. in the US. Hi, Michael.