But in the 191 years that we have existed as a country, we have done some weird sh!t as well. As an example, we did some of the craziest stuff ever under our former King Leopold II. I got to know a bit more about that through Klara, a (Dutch-speaking, but nevertheless crazy wonderful) Flemish radio station, which hosted a podcast about Leopold II's crazy, and sometimes cruel, adventures in Congo. See over here if you are interested. It's a terrible, but fascinating story.
So more recently, when I started reading about the Congo Holdup in De Standaard, the newspaper that I subscribe to over here, I could not help but be interested - and wanted to know more. I have been reading about it on and off, and have made my way to the actual source of this investigation, which is run by the EIC - the European Investigative Collaborations - you can find the details over here. Essentially, this is about how to rob a country, using a bank. Specifically, this is the BGFIBank, which laundered lots of unsavoury transactions through its books - and allowed some of the key leaders of Congo to get away with crazy big amounts of $$$ - leaving the country's citizens in poverty in their wake.
So this post is NOT going to be about the details of that investigation. It's more about how you can use a graph, and some of the easiest tools ever, to better make sense of a pile of articles like that. That's what I have tried to do, and what I would like to highlight below. I will leave the actual interpretation of this data to others.
First: Import the articles from the EIC website
The first thing I wanted to do was to download a version of all the published articles from around the world that are on the EIC project page into Neo4j. As you may have seen in previous posts, I have been experimenting with webscraping procedures in APOC recently - but I have found those to be a bit cumbersome to use sometimes. I was looking for a bit of a wrapper that would make it easier to "scrape" a page. And guess what, I found one.
Using the Scraper plugin to get the links
Without too much trouble, I was able to find a great little tool that Janos Szendi-Varga created some time ago as a proof of concept to do the web scraping in a much simpler way. The tool is a bit old and could no doubt be improved - but still works like a charm. Take a look: Scraper uses the well known JSoup library to parse the HTML of the webpages that we are targeting, and do all kinds of funky things with it - with a simple Cypher procedure call.
All I needed to do was to find the latest release of Scraper on Github, drop it into the ./plugins directory of my Neo4j server, whitelist the plugin in the neo4j.conf configuration file, and then we can get cracking. Easy.
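For the record, the whitelisting in neo4j.conf looked something like this for me - the exact setting name varies a bit between Neo4j versions (older versions call it dbms.security.procedures.whitelist), so treat this as a sketch rather than gospel:
# allow the APOC and Scraper procedures to be called
dbms.security.procedures.allowlist=apoc.*,scraper.*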
Here's how we get the links from the EIC project page:
CALL scraper.getLinks("https://eic.network/projects/congo-holdup.html")
YIELD element
WITH element
WHERE left(element.attributes.href,8) = "https://"
MERGE (a:Article {title: element.text, link: element.attributes.href})
Now we have a set of Article nodes in the database, with their title and their link properties. Now we can expand on that.
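A quick sanity check tells us how many articles we grabbed:
MATCH (a:Article)
RETURN count(a) AS numberOfArticles;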
Extract the Publishers from the articles
In order to extract the Publisher from the article, I am going to make the assumption that a Publisher
- will have a link to its main page in the database. I know that from looking at the EIC project page.
- will need to have had more than 1 article published to really be a valid publisher.
So with the next query, I will compare all pairs of articles, and see if the full URL of the first article matches the start of the URL of the second article. If it does, then the first article is actually the Publisher node!
Here's how that goes:
MATCH (a1:Article), (a2:Article)
WHERE a1.link = left(a2.link,size(a1.link))
AND id(a1) <> id(a2)
MERGE (a1)-[:PUBLISHES]->(a2)
SET a1:Publisher
REMOVE a1:Article
SET a1.name = a1.title
REMOVE a1.title;
Magic happens: get the full text of the articles
Now here's where some real magic happens. I will use the Scraper tool to get the content of each of the Congo Holdup articles that is on the webpage of the EIC. When I first saw this, I could not believe how simple and easy this was to do. Here's the query:
MATCH (a:Article)
UNWIND a.link AS links
CALL scraper.getPlainText(links,"body") YIELD value
SET a.scrapedtext = value;
So, simply put: I just give it all the articles (there's not that many of them, so I can do it in 1 transaction - otherwise I would want to batch this into a series of separate transactions, as sketched below), give it the URL of the link of the article and the HTML tag that I want to be scraping (<body>), and off it goes. 42 seconds later, we are all set!
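Just for reference: if the article list were a lot bigger, a minimal batching sketch using apoc.periodic.iterate could look something like this (the batch size here is just an arbitrary example):
CALL apoc.periodic.iterate(
  "MATCH (a:Article) WHERE a.scrapedtext IS NULL RETURN a",
  "CALL scraper.getPlainText(a.link, 'body') YIELD value SET a.scrapedtext = value",
  {batchSize: 10, parallel: false});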
As a consequence, the very simple data model now looks like this:
and that means that we can add a couple of indexes to enable our future queries.
CREATE INDEX article_title FOR (a:Article) ON (a.title);
CREATE FULLTEXT INDEX article_scrapedtext FOR (a:Article) ON EACH [a.title, a.scrapedtext];
CREATE INDEX entity_text FOR (e:Entity) ON (e.text);
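If you want to double-check that these indexes are online, a quick listing does the trick (on recent 4.x versions of Neo4j):
SHOW INDEXES;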
Some cleanup needed!
In order to now take this one step further, I needed to do a little bit of cleanup on the data. The most important, and for me somewhat surprising and annoying, task was that I needed a way to tag all the articles with their language. This is really cumbersome - I would have loved to automate this a bit more, but I have not found a way to do that, YET! So I just decided to do it "manually" with a bunch of Cypher queries. See below:
// Remove the creativecommons.org publisher
MATCH (p:Publisher {link:"https://creativecommons.org/licenses/by-sa/4.0/"})
DETACH DELETE p;
// Set the languages
MATCH (p:Publisher)--(a:Article)
WHERE right(p.link,2)="fr"
SET a.language ="FR";
MATCH (p:Publisher)--(a:Article)
WHERE right(p.link,2)="nl"
SET a.language ="NL";
MATCH (p:Publisher)--(a:Article)
WHERE right(p.link,13)="standaard.be/"
SET a.language ="NL";
MATCH (p:Publisher)--(a:Article)
WHERE right(p.link,10)="lesoir.be/"
SET a.language ="FR";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "Bloomberg"
SET a.language ="ENG";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "Politiken"
SET a.language ="DK";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "France"
SET a.language ="FR";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "Spiegel"
SET a.language ="DE";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "Mediapart"
SET a.language ="FR";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "infoLibre"
SET a.language ="ES";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "PPLAAF"
SET a.language ="ENG";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "Orient"
SET a.language ="FR";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "Kringvarp"
SET a.language ="FO";
MATCH (p:Publisher)--(a:Article)
WHERE p.link CONTAINS ".pt"
SET a.language ="PT";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "Sentry"
SET a.language ="ENG";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "BBC"
SET a.language ="ENG";
MATCH (p:Publisher)--(a:Article)
WHERE p.name CONTAINS "Wire"
SET a.language ="ENG";
MATCH (a:Article)
WHERE a.link CONTAINS "/de/"
SET a.language = "DE";
MATCH (a:Article)
WHERE a.link CONTAINS "/fr/"
SET a.language = "FR";
MATCH (a:Article)
WHERE a.title CONTAINS "VG"
SET a.language ="NO";
MATCH (a:Article)
WHERE a.title CONTAINS "Nacional"
SET a.language ="HR";
MATCH (a:Article)
WHERE a.title CONTAINS "României" OR a.title CONTAINS "România"
SET a.language ="RO";
MATCH (a:Article)
WHERE a.title CONTAINS "Namibian"
SET a.language ="ENG";
MATCH (a:Article)
WHERE a.title CONTAINS "Wire"
SET a.language ="ENG";
MATCH (a:Article)
WHERE a.title CONTAINS "Continent"
SET a.language ="ENG";
MATCH (a:Article)
WHERE a.title CONTAINS "KWANZA"
SET a.language ="ENG";
// Remove the Twitter and LinkedIn articles
MATCH (a:Article) WHERE a.link STARTS WITH "https://twitter.com" DETACH DELETE a;
MATCH (a:Article) WHERE a.link CONTAINS "linkedin" DETACH DELETE a;
Then we can do a quick check and see how many articles we have in every language:
MATCH (a:Article)
WHERE a.language IS NOT null
RETURN a.language, count(a) ORDER BY count(a) DESC;
So now, we can proceed to the next step: using NLP to try and understand these articles a bit better, without having to read them all :) ...
Let's use Natural Language Processing for some insights!
Now I would like to do one more thing. Similar to what I did a few months ago with news articles, I would like to use Google's Cloud NLP engine to try and understand these articles without having to - or being able to, in some languages - read them all.
In order to do that, I will - again - need to install the NLP dependencies as an extension to our beloved APOC plugin, so that we can call the Google Cloud NLP service.
Note that there's a bit of setup that you need to do as well. The most important of which is that you will have to set up an API key for the queries to work:
:param apiKey =>("123423534zfhfgfsada8576589fsdfsdf785678")
Then we can proceed to run the two main NLP queries that GCP's service helpfully provides here:
- We can use the API to extract the entities from the Article's scrapedtext properties. Note that this only works for a number of specific languages - which is why we needed to have the language property on every article in the first place. Essentially, and as documented over here, we have English, French, German, Spanish and Portuguese as supported languages. So here's the query that helps us with this:
WITH ["FR","ENG","DE","ES","PT"] as languages
UNWIND languages as language
MATCH (a:Article)
WHERE a.language = language AND a.scrapedtext IS NOT null
CALL apoc.nlp.gcp.entities.graph(a, {
key: $apiKey,
nodeProperty: "scrapedtext",
scoreCutoff: 0.01,
writeRelationshipType: "HAS_ENTITY",
writeRelationshipProperty: "gcpEntityScore",
write: true
})
YIELD graph AS g
RETURN "Success!";
This takes a little bit of time (100 seconds) to complete for 67 articles:
- We can use the API to classify the ENGLISH content into a limited set of categories, so that it would become easier for us to make sense of the structure. Using the same API key, we will run the following query:
WITH ["ENG"] as languages
UNWIND languages as language
MATCH (a:Article)
WHERE a.language = language and a.scrapedtext is not null
CALL apoc.nlp.gcp.classify.graph(a, {
key: $apiKey,
nodeProperty: "scrapedtext",
scoreCutoff: 0.01,
writeRelationshipType: "CLASSIFIED_AS",
writeRelationshipProperty: "gcpClassificationProperty",
write: true
})
YIELD graph AS g
RETURN "Success!";
As this only works for English-language articles, it completes a bit faster.
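Once that finishes, we can take a quick peek at which categories come out on top - assuming, as I understand the classification procedure works, that it creates Category nodes with a text property:
MATCH (a:Article)-[:CLASSIFIED_AS]->(c:Category)
RETURN c.text AS category, count(a) AS nrOfArticles
ORDER BY nrOfArticles DESC;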
So now we actually have a really interesting graph that we can start exploring. The model is clearly a lot richer than what it was before, with all the different types of entities that were detected: Person, Location, Organization, ConsumerGood, Event and WorkOfArt nodes were all added to the model. Note that these all carry the Entity label as well - so they will have two labels.
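A quick way to see what the NLP service actually found is to count the entities per type - a small sketch that uses the fact that every entity node carries both the Entity label and its specific type label:
MATCH (e:Entity)
UNWIND [l IN labels(e) WHERE l <> "Entity"] AS type
RETURN type, count(*) AS nrOfEntities
ORDER BY nrOfEntities DESC;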
I have then started to take this graph for a little spin in Bloom, and got some reasonably interesting results:
If you zoom into the English articles, it actually gets a bit more interesting to look at:
For example, you can start exploring the neighborhood of a couple of Person nodes (in blue):
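If you prefer Cypher over Bloom, a similar exploration of one person's neighborhood could look something like this - $personName is just a placeholder parameter for whichever name you are curious about:
MATCH (p:Person)<-[:HAS_ENTITY]-(a:Article)-[:HAS_ENTITY]->(other:Entity)
WHERE p.text = $personName
RETURN a.title AS article, labels(other) AS entityType, other.text AS relatedEntity
LIMIT 50;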
This should prove quite interesting to explore a bit more. I have not made a lot of progress there, but I think I may spend some additional time on this.
In any case, I hope you enjoyed following this little exercise, and I look forward to hearing your feedback.
All the best
Rik