Tuesday 29 September 2020

Using Apache Zeppelin with Neo4j to analyse the FinCEN Files

Last week, another great and widely publicised case of graph databases' usefulness was thrown our way. The ICIJ published their FinCEN Files research, and on top of allowing you to explore the data on their website, they also published an anonymised subset of the data as a series of CSV/JSON files. My friends and colleagues Michael Hunger, Will Lyon and the rest of the team helped with the process of making this subset available as a Neo4j database (see this GitHub repo), and there's even a super easy FinCEN Files Neo4j Sandbox that you can spin up in no time for some investigation fun.

So of course I had to take this data for a spin myself - it seems really important to me that more eyeballs are looking at this, and that more people are exposing the sometimes very questionable behaviour of the world's largest financial institutions.

Introducing Zeppelin

A while ago, I had heard of some great technology that would let people use their data in a very different way: through interactive web pages that query a Neo4j database.

Kind of like a GraphGist, but running against your own dataset, and more interactive, with many more visualisation and reporting possibilities. It was called Apache Zeppelin, and it looked really interesting. I read a few articles about it (like this blog post), and soon was browsing the web for more info.

Not long after that I was giving it a spin: I quickly installed Zeppelin following these instructions, and really found it quite a breeze to get going.

The Neo4j Interpreter

Next up after the install was to connect the Zeppelin installation to my Neo4j database. Zeppelin uses a specific terminology for these types of connectors - calling them "Interpreters". I am (of course) running the latest 4.1.1 release inside the Neo4j Desktop, and the Neo4j Interpreter that comes with Zeppelin 0.8.0 (you can find the documentation for that interpreter on the Apache website) was not working - apparently it only supported Neo4j 3.5.x.

So that's when I gave my friends a call. It turned out that the Neo4j Interpreter had been created by our long-time Italian partner, Larus. One of their engineers, Andrea Santurbano, was the author of the Interpreter - and after a few Slack messages back and forth he was willing to upgrade the code to the latest Neo4j version. You can also find Andrea on LinkedIn - he's such a nice guy and really helped me out.

So once Andrea had done the technical work to upgrade the Interpreter, there were only a handful of steps left to take:

  • I needed to update the interpreter on my default install:
    • Just download the ZIP file, and
    • then replace the contents of the <ZEPPELIN_HOME>/neo4j directory with the contents of the ZIP file. The ZIP contains the jars with support for Neo4j 4.x (as supported by the Java Driver 4.0.1).
  • Then I could start Zeppelin with a simple command: bin/zeppelin-daemon.sh start. (Note that Zeppelin should not live in your G-drive folder :) ... - it does not like that.)
  • Once that was done, I could browse to http://localhost:8080/, and then
  • start the Tutorial on Zeppelin.
It was very easy and quick to get going with.
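
The jar swap in the first step can also be scripted. What follows is a minimal Python sketch, not part of the original setup: the function name replace_interpreter_jars and the ".bak" backup directory are my own inventions for illustration, and both paths are passed in as arguments because the name and layout of the ZIP are not spelled out here. It assumes the ZIP holds the jars at its top level; if yours wraps them in a folder, the extracted files would end up one level too deep.

```python
import shutil
import zipfile
from pathlib import Path


def replace_interpreter_jars(zip_path, interpreter_dir):
    """Replace everything in interpreter_dir with the contents of zip_path.

    Hypothetical helper: a backup copy is kept in a sibling "<name>.bak"
    directory so the swap can be undone by hand.
    """
    zip_path = Path(zip_path)
    interpreter_dir = Path(interpreter_dir)

    # Keep a backup of the current jars next to the original directory.
    backup_dir = interpreter_dir.with_name(interpreter_dir.name + ".bak")
    if backup_dir.exists():
        shutil.rmtree(backup_dir)
    shutil.copytree(interpreter_dir, backup_dir)

    # Remove the old contents, then unpack the new jars in their place.
    shutil.rmtree(interpreter_dir)
    interpreter_dir.mkdir(parents=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(interpreter_dir)

    # Return the new directory listing so the caller can check the result.
    return sorted(p.name for p in interpreter_dir.iterdir())
```

Call it with the path to the downloaded ZIP and the interpreter directory while Zeppelin is stopped, then start the daemon as described above.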

Writing my FinCEN Files Notebook

Creating the actual Zeppelin notebook was not very difficult either. Once logged into the main Zeppelin management page, I just had to configure the interpreter by providing it with the URI for the server, and the authentication credentials.

That way, whenever a paragraph in the notebook starts with %neo4j, Zeppelin knows to route the queries in that paragraph to the configured Neo4j server.

I was able to very quickly add a few useful paragraphs to the notebook. I have uploaded the full notebook (a .json file that contains all info about the notebook) to GitHub: please find it over there. The results looked really good very quickly.

Adding some context with Markdown

Final note: apart from starting paragraphs with %neo4j, I was also very quickly able to add context to the notebook by including some Markdown text. Just start a paragraph with %md, and before you know it the notebook includes some images, links etc. It's super easy.
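
Because every paragraph announces its interpreter on its first line, an exported notebook is easy to inspect outside Zeppelin too. Here is a minimal Python sketch, under a few assumptions: that the exported .json has a top-level "paragraphs" list whose entries carry their source in a "text" field (do check this against your own export), and that SAMPLE_NOTE below is a hypothetical, heavily trimmed stand-in rather than the notebook I uploaded. The function names are made up for illustration.

```python
import json
from collections import Counter

# Hypothetical stand-in for an exported note; a real export has more fields.
SAMPLE_NOTE = r"""
{
  "name": "FinCEN Files (sample)",
  "paragraphs": [
    {"text": "%md\n## Exploring the FinCEN Files"},
    {"text": "%neo4j\nMATCH (n) RETURN labels(n) AS labels, count(*) AS nodes"},
    {"text": "%neo4j\nMATCH ()-[r]->() RETURN type(r) AS type, count(*) AS rels"}
  ]
}
"""


def interpreter_of(text):
    """Return the leading %directive of a paragraph, or None if it has none."""
    stripped = text.lstrip()
    if not stripped.startswith("%"):
        return None
    # The directive is the first whitespace-delimited token, e.g. "%neo4j".
    return stripped.split(None, 1)[0]


def count_interpreters(note):
    """Count how many paragraphs of a parsed note use each interpreter."""
    paragraphs = note.get("paragraphs", [])
    return Counter(interpreter_of(p.get("text") or "") for p in paragraphs)


if __name__ == "__main__":
    note = json.loads(SAMPLE_NOTE)
    for directive, count in count_interpreters(note).most_common():
        print(directive, count)
```

For a real notebook, swap json.loads(SAMPLE_NOTE) for json.load on the downloaded file.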

I could really see how this system, the Zeppelin Notebooks, could be even more powerful if you started combining lots of different data sources (with other paragraph codes like %cassandra or %jdbc or ...) in the same notebook. Then you could really use this notebook paradigm to enrich and develop your insights into the data in an even more profound way.

I hope this was a useful example/exercise for you. As always, looking forward to some feedback - my contact details are below.


