Tuesday 5 October 2021

ReBeerGraph: importing the Belgian BeerGraph straight from a Wikipedia HTML page

I have written about beer a few times, also on this blog. I have figured out a variety of ways to import The Wikipedia Page with all the belgian beers into Neo4j over the years. Starting with a spreadsheet based approach, then importing it from a Google Sheet (with it's great automated .csv export facilities), and then building on that functionality to automatically import the Wikipedia page into a spreadsheet using some funky IMPORTHTML() functions in Google sheets.

But: all of the above have started to crumble recently. The Wikipedia page, is actually kind of a difficult thing to parse automatically, as it splits up the dataset into many different HTML tables (which makes me need to import multiple datasets, really), and then it also seems like Wikipedia has added an additional column to it's data (the "Timeframe" or "Period" in which a beer had been brewn), which had lots of missing, and therefore empty cells. All of that messes up the IMPORTHTML() Google sheet that I had been using to automatically import the page into a gsheet.

So: I had been on the lookout for a different way of doing the automated import. And recently, while I was working on another side project (of course), I actually bumped into it. That's what I want to cover and demonstrate here: importing the data, directly from the page, without intermediate steps, automatically, using the apoc.load.html functionality.

APOC allows to load from HTML page - directly

If you are not familiar with Neo4j APOC yet, then rest assure, you really need to be. APOC allows for a very flexible, extensible and practical set of utility functions and procedures to be called from the Cypher language, and it has over the years proven to be an indispensable Swiss Army knife kind of toolbox for Neo4j. It's just fantastic. In the Awesome Procedures On Cypher, you can find a wide variety of tools, but the one that we are specifically going to be using here is the apoc.load.html procedure. This procedure allows you to call a particular HTML page somewhere on the internet, read its contents, and then process the content in subsequent cypher statements.

Here's some more details, before we dive in:

  • The syntax of the procedure is pretty simple:

    apoc.load.html(url :: STRING?, query = {} :: MAP?, config = {} :: MAP?) :: (value :: MAP?)

    So we really need to give it a URL, a set of queries, and some configuration parameters (if any).

  • You should note that this procedure returns a MAP of data as a result - and therefore working with maps in cypher is very important to process the results that are being returned from the procedure. Please refer to the cypher manual to understand how to work with a map in cypher. We will need that in the example below to process the different fields that we extract from the table structures in the HTML.

  • The nice thing about this approach is that it can unify all the different rows in all the different tables into one resultset. All the HTML elements that adhere to the same jquery syntax (as used by jsoup - which is the Java HTML parser that is being used under the hood) will be put into the result map, independent of the fact that in the HTML, they are actually appearing in different tables. The jQuery will just say that the table row has to be part of some table structure and that's it - therefore matching all the rows in all the tables, and efficiently and effectively solving the problem that we would have if we were to deal with multiple tables.

To create my own example below, I must say that I had to learn a thing or two about jQuery and jSoup, and that took me a little bit of time and experimentation to get used to. I was actually super grateful to have foond quite an advanced example from Vlad Batushkov to start understanding the functionality. Thanks a lot to Vlad - he has a lot of interesting articles out on Medium!

We will get into the example below, but before we do, I want to highlight a few interesting other techniques that I used here.

Iterating over the resulting apoc.load.html map using UNWIND

As you will see below, the result map below will yield thousands of entries into one map structure. Therefore, it will become important to be able to iterate over the different sets into result map, and process them line by line. To do that, we will use the UNWIND functionality of Cypher.

First, we need to be evaluating the size of the resulting apoc.load.html map structure (ie. the number of rows in the Wikipedia page) using size(). Once we have that number, we can comfortably use

UNWIND range(0,<size of the map>) as i

to walk through every row in the map and process this. See below for the actually application of this super useful technique.

Now I want to highlight one more super useful technique that really helped me out here.

Dealing with NULL Values

The thing is, in any import job, there is always a little bit of a balancing act between making sure that we don't create any duplicates in our database (and therefore use the MERGE operator in Cypher), and the fact that cannot "merge a null value". MERGE always throws an error when it hits a null value in the property that we are using to "merge" on, and therefore we need to figure out a way to converting the table above to a graph, and account for the "empty cells" that are clearly appearing in the tables. We want to create the following model:  

allowing for the fact that the source table has empty cells, and thus NULL values.

Essentially we have 2 ways to be dealing with NULL values in MERGE:

  1. we can filter out the NULLs before we MERGE, by having consecutive WITH / WHERE clauses
  2. we can use a special function: COALESCE() to return the first non-NULL value in a list, and therefore ensure that if the table holds an empty cell - the COALESCE function would still return a non-NULL value (eg. the value "Unknown" or something similar).

Therefore the MERGE queries would something like this:

MERGE (abc:LABEL {name: coalesce(<potentially null value>,"Unknown")})

This query will always work, even if the input values (the empty cells in our tables above) are possibly NULL.

Bringing it all together: Importing the data into Neo4j

Let's now bring it all together and do the actual import. Note that this example below assumes that you have installed and whitelisted the apoc.* procedures into your Neo4j server.

First, we need to create the database on our Neo4j 4.3.x server:

create database beergraph;

Then, we are going to be setting up the indexes for the database - using the new Cypher syntax:

CREATE INDEX beerbrand_name FOR (bb:BeerBrand) ON (bb.name);
CREATE INDEX brewery_name FOR (br:Brewery) ON (br.name);
CREATE INDEX beertype_name FOR (bt:BeerType) ON (bt.name);
CREATE INDEX alcoholpercentage_value FOR (ap:AlcoholPercentage) ON (ap.value);

 

Once that is done, we can then do the import straight from the Wikipedia page into our database using the following query.

WITH "https://nl.wikipedia.org/wiki/Lijst_van_Belgische_bieren" as url
    CALL apoc.load.html(url, {
        brand: "table.wikitable tbody tr td:eq(0)", 
        beertype: "table.wikitable tbody tr td:eq(1)",
        alcoholpercentage: "table.wikitable tbody tr td:eq(2)",
        brewery: "table.wikitable tbody tr td:eq(3)",
        timeframe: "table.wikitable tbody tr td:eq(4)"
        }) yield value
WITH value, size(value.brand) as rangeup
UNWIND range(0,rangeup) as i
    WITH value.brand[i].text as BeerBrand, value.brewery[i].text as Brewery, value.alcoholpercentage[i].text as AlcoholPercentage, value.beertype[i].text as BeerType, value.timeframe[i].text as Timeframe
        MERGE (bt:BeerType {name: coalesce(BeerType,"Unknown")})
        MERGE (bb:BeerBrand {name: coalesce(BeerBrand,"Unknown")})
        SET bb.Timeframe = coalesce(Timeframe,"Unknown")
        MERGE (br:Brewery {name: coalesce(Brewery,"Unknown")})
        MERGE (ap:AlcoholPercentage {value: coalesce(AlcoholPercentage,"Unknown")})
        MERGE (bb)-[:HAS_ALCOHOLPERCENTAGE]->(ap)
        MERGE (bb)-[:IS_A]->(bt)
        MERGE (bb)<-[:BREWS]-(br);

The result of which, assuming a good connectivity to Wikipedia, returns almost instantly:  

Just like that, we have imported 3k nodes and 6k relationships - the Belgian BeerGraph!

Revisiting the Belgian BeerGraph

As intended, the data model looks exactly the way that we intended and expected:

The stats of the dataset are broadly inline with what we would expect:  

I think you probably realise that for a small country like Belgium, this works out at a great number of beers per capita. There's a reason why I enjoy living here :) ...

Still seeing the graph epiphanies

Before we wrap up this post: it's cool to see that the graph epiphanies that I originally experienced when I first started playing with this dataset, are still well alive and kicking.

  • Epiphany 1: Orval and Westvleteren are connected!

    I remember seeing this for the first time. Orval is probably my favourite beer of all time, and it's a very special, Trappist beer, of 6,2% Alcohol. And just by clicking around on that favourite beer of mine, I found out that there is another Trappist beer that has 6.2% Alcohol. It's the Westvleteren 6 beer, which is extremely unknown and impossible to get to, for the simple reason that you actually have to be a Trappist monk to be allowed to drink it. I really should be tasting this but am not about to become a monk.

    The following query will show you this epiphany, immediately"

      MATCH (b:BeerBrand {name:"Orval"})-[*..2]-(conn) return b, conn;
    

  • Epiphany 2: Duvel and Orval are linked

    I love Orval, but I also love Duvel. Both beers are lovely, but very different. Orval is a darker, amber-coloured, sour-y beer, and Duvel is sweeter, blond, high-alcohol and fermented on the bottle style of beer. So they are veyr different. But when you run a pathfinding query to see how both beers are connected, you find out that there are some really interesting beers on that path. More specifically, Orval and Duvel are linked to one another through Heksemiebier and Wilderen Goud.

    Take a look at how this works in Cypher:

      MATCH (o:BeerBrand {name:"Orval"}), (d:BeerBrand {name:"Duvel"}),
      path = allshortestpaths ((o)-[*]-(d))
      RETURN path
    

So there we have it all together: the Belgian BeerGraph is whole again, and we learned a whole lot in the process!

Hope this was useful for you - looking forward to hearing from you and see what you think!

Cheers

Rik

PS - all the code is on github, as usual

No comments:

Post a Comment