Thursday, 23 August 2018

ESCO database in Neo4j: Skills, Competencies, Qualifications and Occupations form a beautiful graph!

Just a few weeks ago, I was talking with some Neo4j users who are active in the domain of "labour", or work. During that conversation, they mentioned that there are standards out there that classify different types of work into different buckets (a taxonomy, if you will), and that there are two competing standards for doing so:
  • the ESCO standard: the European Skills, Competences, Qualifications and Occupations, and 
  • the ROME standard: the "Répertoire opérationnel des métiers et des emplois (ROME)"
ESCO seems to be promoted by the European Commission, while ROME seems to be a Belgian/French initiative of some sort. Surely they overlap, but I am not sure by how much. As luck would have it I started looking at the ESCO material first, but I am sure we could have written this post about ROME as well. It's the principles that matter.

And in principle, I figured that using these standards would be a really cool thing to do in Neo4j. Skills/Competences and Occupations form really interesting graphy structures, and I could see how you could use a taxonomy like that to do some really interesting recommendations and other data workloads. So I wanted to poke around in it a bit.

Loading ESCO into Neo4j

The entire ESCO dataset can be downloaded from the European Commission's portal site: https://ec.europa.eu/esco/portal.  
It's really easy: you just select the data that you are interested in - the topic, format, and the languages - and put together a download package. 

In terms of format, you can choose between
  • an RDF format, which basically gives you a large (500MB) Turtle file. Turtle - the Terse RDF Triple Language, see https://www.w3.org/TR/turtle/ - is probably more comprehensive, as it contains everything. But it's also quite a bit more difficult to manipulate and get your head around. I was able to import the Turtle file really easily using Jesús Barrasa's "neosemantics" plugin, and had it up and running in minutes. But I found the result more difficult to use - most likely because I am not an RDF aficionado. Sorry.
  • CSV format. That's easy enough - we know how to import those. So all I needed to do was write a few Cypher scripts and import the data in a few minutes. I will put the scripts below, but you can also see them on GitHub.
In any case, I opted to continue with the CSV files, and spent a little time importing the different files and connecting them together - in different languages. There are basically five sets of files:
  1. the Skills
  2. the Skillsgroups, grouping the above together in groups
  3. the Occupations
  4. the ISCOgroups: this is a standard of the International Labour Organisation (ILO) that provides an International Standard Classification of Occupations. 
  5. and then a few files with relationships between Skills and Occupations, different ISCO groups, and different Skills/Skillsgroups.
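To give an idea of what such an import looks like, here is a minimal sketch in Python that just builds the Cypher LOAD CSV statements as strings. It is not the script from this post: the file names, the column names (conceptUri, preferredLabel, occupationUri, skillUri, relationType) and the ESSENTIAL_FOR relationship type are assumptions for illustration, so check them against the headers of the CSV files that you actually downloaded.

```python
def load_nodes_cypher(csv_file, label, key="conceptUri"):
    """Build a LOAD CSV statement that MERGEs one node per CSV row.

    The column names (conceptUri, preferredLabel) are assumptions.
    """
    return (
        f'LOAD CSV WITH HEADERS FROM "file:///{csv_file}" AS row\n'
        f"MERGE (n:{label} {{{key}: row.{key}}})\n"
        "SET n.preferredLabel = row.preferredLabel"
    )


def load_rels_cypher(csv_file, from_label, from_col, to_label, to_col,
                     rel_type, where=None, key="conceptUri"):
    """Build a LOAD CSV statement that connects nodes which already exist."""
    lines = [f'LOAD CSV WITH HEADERS FROM "file:///{csv_file}" AS row']
    if where:
        # Optional row filter, e.g. keep only the "essential" relations.
        lines.append(f"WITH row WHERE {where}")
    lines.append(f"MATCH (a:{from_label} {{{key}: row.{from_col}}})")
    lines.append(f"MATCH (b:{to_label} {{{key}: row.{to_col}}})")
    lines.append(f"MERGE (a)-[:{rel_type}]->(b)")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical file names: use whatever is in your download package.
    print(load_nodes_cypher("skills_en.csv", "Skill"))
    print()
    print(load_rels_cypher(
        "occupationSkillRelations.csv",
        "Skill", "skillUri",
        "Occupation", "occupationUri",
        "ESSENTIAL_FOR",
        where='row.relationType = "essential"',
    ))
```

The generated statements can be pasted into the Neo4j Browser, as long as the CSV files sit in the import directory of the database. Using MERGE on the concept URI means the statements can be re-run without creating duplicate nodes.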
I wrote the script pretty quickly - it's really not that hard - and then I ended up with a few Neo4j databases:
  1. one full of RDF triples - complicated!
  2. one with English Skills, Skillsgroups, Occupations and ISCOgroups. 
  3. one with Dutch Skills, Skillsgroups, Occupations and ISCOgroups.
In the Neo4j Desktop that looks a bit like this:
This is where the scripts are on GitHub.

Working with the ESCO database in Neo4j

Now that all that is imported, we can take a look at it. Let's start by looking at the model that we have imported. Pretty straightforward:
We can also just start looking at some data by just visually exploring it in the Neo4j Browser:
But it gets a lot more interesting when we put Cypher to it, and start querying the data. For example, let me grab these two nodes here:
And look at the paths between them:
As always, the things that are located on the path tend to be pretty interesting. Even more so when I think a bit more about the data, and start looking at the ESSENTIAL FOR relationships only. Let's see what comes back when I look for the links between a "software developer" and a "beer sommelier", when I ONLY traverse the relationships that define the really important / ESSENTIAL links between Skills and Occupations:
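The idea of restricting a path search to one relationship type can be sketched without a database at all. The toy example below is plain Python with made-up placeholder data (not real ESCO content): it runs a breadth-first search over an undirected edge list, optionally keeping only the relationship types that you allow. The Cypher string at the top is my guess at what an equivalent query could look like, assuming Occupation nodes with a preferredLabel property and an ESSENTIAL_FOR relationship type.

```python
from collections import deque

# Assumed shape of an equivalent Cypher query (labels, property and
# relationship type are assumptions, not taken from the original scripts).
CYPHER_SKETCH = """
MATCH (a:Occupation {preferredLabel: "software developer"}),
      (b:Occupation {preferredLabel: "beer sommelier"}),
      p = allShortestPaths((a)-[:ESSENTIAL_FOR*]-(b))
RETURN p
"""

# Made-up toy data: (start node, relationship type, end node).
EDGES = [
    ("skill A", "ESSENTIAL_FOR", "occupation X"),
    ("skill A", "ESSENTIAL_FOR", "occupation Y"),
    ("skill B", "OPTIONAL_FOR", "occupation X"),
    ("skill B", "OPTIONAL_FOR", "occupation Z"),
    ("skill C", "ESSENTIAL_FOR", "occupation Y"),
    ("skill C", "ESSENTIAL_FOR", "occupation Z"),
]


def shortest_path(edges, start, end, allowed_types=None):
    """Return one shortest path as a list of nodes, or None if there is none.

    Edges are traversed in both directions; when allowed_types is given,
    edges of any other relationship type are ignored.
    """
    neighbours = {}
    for a, rel_type, b in edges:
        if allowed_types is not None and rel_type not in allowed_types:
            continue
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    print(shortest_path(EDGES, "occupation X", "occupation Z"))
    print(shortest_path(EDGES, "occupation X", "occupation Z", {"ESSENTIAL_FOR"}))
```

With all relationship types allowed, the two occupations are linked through a single shared optional skill. Restricting the search to ESSENTIAL_FOR forces a longer detour through an intermediate occupation, and that is exactly the kind of path that tends to be interesting to look at.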
Interesting. I am sure that a domain expert could do lots of other things here, especially if we could give that expert some non-technical tool like Neo4j Bloom.
All in all, this was a really easy and interesting experiment. I am sure there's a lot more to do here - but this was yet another example of a cool application of Neo4j in a surprising domain.

Hope this was useful.

Cheers

Rik