Wednesday, 31 October 2018

Data Lineage in Neo4j - an elaborate experiment

For the past couple of years, I have had a LOT of conversations with users and customers of Neo4j who have been looking at graph databases for solving Data Lineage problems. Now, at first, that seemed like a really fancy term used only by hipster technovangelists to try to appear interesting, but once I drilled into it, I found that it’s actually something really interesting and a really cool application of graph databases. Read more on the background of it on wikipedia (as always), or just live with this really simple definition:
“Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.”
That’s easy enough. The fact is that it’s a really big problem for large organisations - specifically financial institutions, as they have to comply with regulations like the Basel Committee on Banking Supervision's standard number 239 - which is all about assuring data governance and risk reporting accuracy.
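To make it a bit more concrete why this is such a graphy problem: lineage is really just a chain of datasets and transformations. Here's what a tiny, completely hypothetical lineage model could look like in Cypher - all labels, relationship types and names below are made up for illustration, not taken from any real system:

```cypher
// A hypothetical lineage fragment: a source system feeds a report
// through an ETL transformation step
CREATE (src:Dataset {name: "CRM_Customers"})
CREATE (etl:Transformation {name: "nightly_customer_load"})
CREATE (tgt:Dataset {name: "DWH_Customer_Dim"})
CREATE (rpt:Report {name: "Risk_Exposure_Report"})
CREATE (src)-[:FEEDS]->(etl)-[:PRODUCES]->(tgt)-[:USED_BY]->(rpt);

// Trace a report's numbers all the way back to their origins
MATCH path = (origin:Dataset)-[:FEEDS|PRODUCES|USED_BY*]->(r:Report {name: "Risk_Exposure_Report"})
RETURN path;
```

Tracing a number in a report back to its source system then becomes a simple variable-length path query - which is exactly the kind of question that regulations like BCBS 239 force institutions to answer.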

Here are a couple of really nice articles and videos that should give you quite a bit of background.

Monday, 22 October 2018

Poring over Power Plants: Global Power Emissions Database in Neo4j

In the past couple of weeks, I have been looking at some interesting datasets for the Utility sector, where Networks or Graphs are of course in very, VERY abundant supply. We have Electricity Networks, Gas Networks, Water Networks, Sewage Networks, etc. etc. - that all form these really interesting graphs that our users can explore. Lots of users have specialised, GIS-based tools to manage these networks - but when you think about it, there are so many interesting things that we could do if ... we would only store the network as a network - in Neo4j of course.
So I started looking for some datasets, and - maybe because I am not that familiar with this domain - I did not really find anything too graphy. But I did find a different dataset that contained a lot of information about Power Plants - and their emissions. Take a look at this website:
and then you can download the Excel workbook from over here. It's not that big - and of course the first thing I did was to convert it into a Google Sheet. You can access that sheet over here:

There are two really interesting tabs in this dataset:
  1. the sheet containing the fuel types: this gives you a set of classifications of the fuel that is used in the different power plants around the globe
  2. the list of 30.5k power plants from around the world that generate different levels of power from different fuel types. While doing so, they also generate different levels of emissions, of course, and that data is also clearly mentioned in this sheet. Note that the dataset does not include any information on Nuclear plants - as they don't really have any "emissions" other than water vapour and... the nuclear waste, of course.
So let's get going - let's import this dataset into Neo4j.
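To sketch the idea of the import: the actual column headers in the sheet will differ from the ones below (these names are illustrative assumptions, not the real ones), but it boils down to a couple of LOAD CSV statements against CSV exports of the two tabs:

```cypher
// Illustrative sketch - file and header names are assumptions,
// not the actual columns of the Google Sheet

// First the fuel type classifications
LOAD CSV WITH HEADERS FROM "file:///fueltypes.csv" AS row
MERGE (:FuelType {name: row.fueltype});

// Then the power plants, connected to their fuel types
LOAD CSV WITH HEADERS FROM "file:///powerplants.csv" AS row
MERGE (p:PowerPlant {id: row.plantid})
  SET p.name = row.plantname,
      p.country = row.country,
      p.emissions = toFloat(row.emissions)
WITH p, row
MATCH (f:FuelType {name: row.fueltype})
MERGE (p)-[:USES_FUELTYPE]->(f);
```

With the plants connected to their fuel types like that, aggregating emissions per fuel type or per country becomes a one-line Cypher query.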

Friday, 12 October 2018

Podcast Interview with Michael Simons, Neo4j

For this week's episode of our Graphistania podcast, I had the great pleasure of spending some time on the phone with Michael Simons - one of the talented Neo4j engineers who build our products. Michael only recently joined our team, and we actually got talking on our internal channels about something we both love dearly... Bikes. I did a ride in Belgium recently that Michael found interesting, and then he rode it himself as well - and hey, we got talking. One thing led to another, and before you know it, we are recording the conversation... Here it is:

Here's the transcript of our conversation:
RVB: 00:00:01.418 Hello, everyone. My name is Rik Van Bruggen from Neo4j, and here I am again recording another episode for our Graphistania podcast. And today, I have one of my dear colleagues on the other side of this Google Hangout again, and that's Michael Simons from Neo4j engineering. Hi, Michael.

MS: 00:00:19.623 Hi, Rik.

Wednesday, 3 October 2018

Podcast Interview with Michael McKenzie

Why spend my evenings/weekends/empty hours creating a podcast? Well, that's very simple: I love talking to like-minded people in the graph community. There's something about this community that attracts people who are equally fond of "connections" and building relationships that is just too awesome to explain. I love it. So when Karin told me about this guy in Washington who was doing awesome things with Neo4j and was helping out with community activities (he wrote about it over here), I was all too keen to have a chat with him. Meet Michael McKenzie, from Washington DC - here's our chat:

Note: I recorded this with Michael before our fantastic GraphConnect conference in New York a few weeks ago - but did not have the time to publish it earlier... apologies...

Here's the transcript of our conversation:
RVB: 00:00:00.000 Hello, everyone. My name is Rik. Rik Van Bruggen from Neo4j and here I am again recording another Graphistania Neo4j podcast. And today, I have a wonderful community member on the other side of this Google hangout and that's Michael McKenzie from Washington, D.C. in the US. Hi, Michael.

Tuesday, 11 September 2018

Podcast Interview with Karin Wolok, Neo4j

Next week is GraphConnect New York City 2018, and that's of course a big highlight for all of us at Neo4j. You should really be there if you can :) ...

One of the reasons why GraphConnect is such a great event is because it allows us to connect all the nodes in the graph and have a great couple of days of real-world conversations about this fascinating topic called graphs. Again, we are going to have a great line-up, not least because of all the great community content that we will be presenting and working on during the event.

On top of that, we have had a LOT going on in the Neo4j Community recently - with the launch of a new community site and more. That's a good enough reason for me to invite Karin Wolok, our Community Manager at Neo4j for a good chat. Here it is:

Here's the transcript of our conversation:
RVB: 00:00:00.819 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo4j. And here I am again recording another episode of our Graphistania Neo4j podcast. And today's a little bit of a special episode I think because it relates to something very dear to my heart and many people at Neo4j's heart, which is our Neo4j community. And for that, I've invited Karin Wolok on the podcast. Karin is our community manager. Actually, you have a very different and more expensive-sounding title, right, Karin? But maybe you can introduce yourself to our listeners.

Monday, 3 September 2018

Podcast Interview with Johannes Unterstein, Neo4j

A couple of months ago, we had a great Online Meetup that was all about scaling out Neo4j using containerisation and container orchestration technologies. You can see the recording over here:

That was really cool, and a great excuse to invite my nowadays-*colleague* Johannes Unterstein to the podcast. Johannes has a really interesting history and a lot of expertise in these technologies, and could really talk about them for our audience. So here's our chat:

Here's the transcript of our conversation:
RVB: 00:00:00.399 Hello, everyone. My name is Rik Van Bruggen from Neo4j, and here I am again after the holiday period recording another Graphistania podcast. And today I have the pleasure of welcoming one of my dear engineering colleagues on this podcast episode, and that's Johannes Unterstein from Germany. Hi, Johannes.

Thursday, 23 August 2018

ESCO database in Neo4j: Skills, Competencies, Qualifications and Occupations form a beautiful graph!

Just a few weeks ago, I was discussing with Neo4j users who are active in the domain of "labour", or work. While talking to these users, they mentioned that there are standards out there that classify different types of work into different buckets (a taxonomy, if you will), and that there are two competing standards to do so. There are:
  • the ESCO standard: the European Skills, Competences, Qualifications and Occupations, and 
  • the ROME standard: the "Répertoire opérationnel des métiers et des emplois (ROME)"
ESCO seems to be promoted by the European Commission, and ROME seems to be a Belgian/French initiative of sorts. Surely they overlap, but I am not sure by how much. As luck would have it, I started looking at the ESCO material first, but I am sure we could have written this post about ROME as well. It's the principles that matter.

And in principle, I figured that using these standards would be a really cool thing to do in Neo4j. Skills/Competences and Occupations form really interesting graphy structures, and I could see how you could use a taxonomy like that to do some really interesting recommendations and other data workloads. So I wanted to poke around in it.

Loading ESCO into Neo4j

The entire ESCO dataset can be downloaded from the European Commission's portal site:  
It's really easy: you just select the data that you are interested in - the topic, format, and the languages - and put together a download package. 

In terms of format, you can choose between
  • an RDF format, which basically gives you a large (500MB) Turtle file. Turtle - the Terse RDF Triple Language - is probably more comprehensive, as it contains everything. But it's also quite a bit more difficult to manipulate and get your head around. I was able to import the Turtle file really easily using Jesus' "neosemantics" plugins, and had it up and running in minutes. But I found it more difficult to use - most likely because I am not an RDF aficionado. Sorry.
  • CSV format. That's easy enough - we know how to import those. So all I needed to do was write a few Cypher scripts and import the data in a few minutes. I will put the scripts below, but you can also see them on github.
In any case, I opted to continue with the CSV files, and spent a little time importing the different files and connecting them together - in different languages. There are basically five files:
  1. the Skills
  2. the Skillsgroups, which group the Skills together
  3. the Occupations
  4. the ISCOgroups: this is a standard of the International Labour Organisation (ILO) that provides an International Standard Classification of Occupations. 
  5. and then a few files with relationships between Skills and Occupations, different ISCO groups, and different Skills/Skillsgroups.
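The real scripts are on github, but to give you a flavour of the pattern: each file becomes a LOAD CSV statement, and the relationship files just MATCH the nodes on both ends and MERGE a relationship between them. The file and column names below are abbreviated assumptions - check the github scripts for the real ones:

```cypher
// Sketch of the import pattern - file/column names are assumptions,
// the actual scripts are on github

// Nodes: one LOAD CSV per file
LOAD CSV WITH HEADERS FROM "file:///skills_en.csv" AS row
CREATE (:Skill {uri: row.conceptUri, name: row.preferredLabel});

LOAD CSV WITH HEADERS FROM "file:///occupations_en.csv" AS row
CREATE (:Occupation {uri: row.conceptUri, name: row.preferredLabel});

// Relationships: match both ends, then merge the link
LOAD CSV WITH HEADERS FROM "file:///occupationSkillRelations_en.csv" AS row
MATCH (o:Occupation {uri: row.occupationUri})
MATCH (s:Skill {uri: row.skillUri})
MERGE (o)-[:REQUIRES {type: row.relationType}]->(s);
```

Indexes on the uri properties (CREATE INDEX ON :Skill(uri), and the same for the other labels) make the relationship imports run in seconds rather than minutes.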
I wrote the script pretty quickly - it's really not that hard - and then I ...
... ended up with a few Neo4j databases:
  1. one full of RDF triples - complicated!
  2. one with English Skills, Skillsgroups, Occupations and ISCOgroups. 
  3. one with Dutch Skills, Skillsgroups, Occupations and ISCOgroups.
In the Neo4j Desktop that looks a bit like this:
This is where the scripts are on Github.

Working with the ESCO database in Neo4j

Now that all that is imported, we can take a look at it. Let's start by looking at the model that we have imported. Pretty straightforward:
We can also just start looking at some data by just visually exploring it in the Neo4j Browser:
But it gets a lot more interesting when we put Cypher to it, and start querying the data. For example, let me grab these two nodes here:
And look at the paths between them:
As always, the things that are located on the path tend to be pretty interesting. Even more so when I think a bit more about the data, and start looking for the ESSENTIAL FOR relationship links. Let's see what comes back when I look for the links between a "software developer" and a "beer sommelier", when I ONLY traverse the relationships that define really important / ESSENTIAL relationships between Skills and Occupations:
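That last exploration looks something like the query below. The exact relationship type name is an assumption on my part (it mirrors the ESSENTIAL FOR links mentioned above - check the import scripts for the real type name):

```cypher
// How does a software developer connect to a beer sommelier,
// traversing ONLY the "essential" Skill/Occupation links?
// Relationship type name is assumed - see the github scripts.
MATCH (dev:Occupation {name: "software developer"}),
      (som:Occupation {name: "beer sommelier"})
MATCH path = allShortestPaths((dev)-[:ESSENTIAL_FOR*..10]-(som))
RETURN path;
```

By restricting the traversal to one relationship type, the paths that come back only pass through skills that are truly essential to both occupations - which is what makes the result interpretable.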
Interesting. I am sure that a domain expert could do lots of other things here, especially if we could give that expert some non-technical tool like Neo4j Bloom.
All in all, this was a really easy and interesting experiment. I am sure there's a lot more to do here - but this was yet another example of a cool application of Neo4j in a surprising domain.

Hope this was useful.