I have written about beer a few times, also on this blog. I have figured out a variety of ways to import The Wikipedia Page with all the belgian beers into Neo4j over the years. Starting with a spreadsheet based approach, then importing it from a Google Sheet (with it's great automated .csv
export facilities), and then building on that functionality to automatically import the Wikipedia page into a spreadsheet using some funky IMPORTHTML()
functions in Google sheets.
But: all of the above have started to crumble recently. The Wikipedia page, is actually kind of a difficult thing to parse automatically, as it splits up the dataset into many different HTML tables (which makes me need to import multiple datasets, really), and then it also seems like Wikipedia has added an additional column to it's data (the "Timeframe" or "Period" in which a beer had been brewn), which had lots of missing, and therefore empty cells. All of that messes up the IMPORTHTML()
Google sheet that I had been using to automatically import the page into a gsheet.
So: I had been on the lookout for a different way of doing the automated import. And recently, while I was working on another side project (of course), I actually bumped into it. That's what I want to cover and demonstrate here: importing the data, directly from the page, without intermediate steps, automatically, using the apoc.load.html
functionality.