Wednesday 22 November 2017

Podcast Interview with Niek Bartholomeus, Clarabridge

At the end of May 2017, I got a direct message on Twitter from @niekbartho. This guy, claiming that we had met at some meetup - I frankly did not remember :), because, old age and all that - told me that he was "working on a little project" and asked me if we could have a chat about it. So we met up, had a cup of coffee - and started a conversation. It was an amazing conversation in my book, and related to something that I wrote about in 2015: analysing corporate networks with Neo4j. At the time, I just loaded the publicly available Belgian Corporate Registry into Neo4j and had some fun - but this guy took it to the next level.

That guy was Niek Bartholomeus, a rockstar developer from Belgium. He had some spare time between jobs, and decided that he would build something useful: openthebox.be. A site that takes a bunch of publicly available datasets about corporations, their shareholders, their structure and also... politicians that may have interests (financial or other) in these structures. Quite a bit of data - all in the best database out there (of course), Neo4j.

Niek has spoken about it at our Belgian meetup and at the online meetup (see deck and recording below), but of course, that was not enough:



We needed to get Niek to have a chat on the podcast, and here it is:

Here's the transcript of our conversation:
RVB: 00:00:03.667 Hello, everyone. My name is Rik Van Bruggen from Neo4j, and here I am again recording a Graphistania podcast episode. And today, I'm very happy to have one of my fellow countrymen on the other side of this Skype call. And that's Niek Bartholomeus from Clarabridge. Hi, Niek. 
NB: 00:00:19.902 Hello, Rik. Thanks for having me on your podcast. 
RVB: 00:00:22.548 I'm so, so happy to have you here because, as you know, people are talking about you in our local industry here in Belgium. But we'll get into that a little bit later [laughter]. Niek, tell us a little bit-- who are you and what do you do and what's your relationship to the wonderful world of graphs? 
NB: 00:00:40.763 Well, I'm Niek. I'm a software developer, as a background, although I like to do lots of different stuff. And right now, I'm more focused on data science and NLP, my day job. And the subject that we're going to talk about here is the project for which I used Neo4j, that's the Open the Box project. 

RVB: 00:01:10.148 Absolutely. Open the Box is the thing that my users and customers are talking to me about. Last week, someone referred to it as the "Belgian Paradise Papers", which I thought was quite interesting. Tell us a little bit about that project. What is it? 
NB: 00:01:30.244 Yeah. It's a website, basically, that's open to the world. It's publicly available. It uses all kinds of openly-available data: a company database (note: from the Belgian Corporate Registry - http://kbopub.economie.fgov.be/kbopub), annual accounts of the companies (note: from the National Bank of Belgium - https://www.nbb.be/), well, the Belgian companies so far (note: Niek also integrated records from the Flemish government - http://binnenland.vlaanderen.be/). And it focuses on the relationships that are found in that open data. It's basically administrator relationships, companies are administered by people or sometimes by other companies. I also added, quite soon after the initial start, a political dimension where I used a PDF that's published each year (note: in the Belgian Journal of Public records - http://www.ejustice.just.fgov.be/mopdf/2017/08/11_1.pdf), where all politicians in Belgium have to declare the mandates that they have in the Belgian companies. So these relationships are also added and of course, what else to use than Neo4j to keep track of these relationships and to do all kinds of interesting queries, like shortest path, for example, that's something that I'm using to actually visualize, starting from one company or one person, and going out to five number of hops. It can get quite complex for big companies with many administrators, or many political mandates. And there's also a feature where you can try to connect two entities; a person or a company with another person or a company. As you can see, it has shown me the path between them or the X shortest paths between these two entities. 
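(note: for readers who want to see what "going out a number of hops" and "connecting two entities" mean in practice, here is a minimal sketch in plain Python. It is not Niek's code and does not talk to Neo4j; on the real site these questions are answered by Neo4j queries, where Cypher offers shortestPath() and allShortestPaths(). The entity names and links below are made up for illustration.)

```python
from collections import deque

# Hypothetical toy data: each pair is one "administers" or "mandate" link.
EDGES = [
    ("Person A", "Company X"),
    ("Company X", "Company Y"),
    ("Person B", "Company Y"),
    ("Person B", "Company Z"),
    ("Person C", "Company Z"),
]

def build_graph(edges):
    """Turn the link list into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def neighbourhood(graph, start, max_hops):
    """Breadth-first walk: every entity within max_hops of start, with its hop count."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def shortest_path(graph, source, target):
    """Breadth-first search: one shortest path as a list of entities, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

graph = build_graph(EDGES)
print(neighbourhood(graph, "Person A", 2))
print(shortest_path(graph, "Person A", "Person C"))
```

The first call corresponds to the "start from one company or person and go out a few hops" view, the second to the "connect two entities" feature.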
RVB: 00:03:15.752 So Niek, this is a project that you developed professionally, or just in your spare time, or--? How did this come about? How did you get this idea? 
NB: 00:03:26.323 Well, it started as an experiment. In my previous job, I was focused on company databases for different use cases, so I said, "Well, a lot of information is available, and the world gets opened up more and more." We have the cloud, that's already there for a long time, so it's quite easy to spin up new infrastructure. There is a lot of open source technology that I'm using going from Spark, I talked about Neo4j already, JavaScript libraries, Cytoscape, for example, that I'm using. So it's amazing how a technical person like me can quickly come up with quite a big thing. There is about 500,000 companies and about the same amount of persons in the database, so that's not huge, but that's quite big. And it turned out that it's quite easy to set up all of this information in this decade and this era. 
RVB: 00:04:31.172 And so you worked on it on your own, or did you have a team of people working on it, or--? And how long did it take you to make this thing? 
NB: 00:04:37.690 No. It was just me. And as I said, I like to do lots of different things, so I don't consider myself a specialist in one particular technology, but more a big picture person. I like to do data engineering, I like to do front end, I like to connect different technologies with each other, and yeah, basically I did this all on my own, except for the web design. That's where I'm really bad at, and that's where I ask help from a web designer who-- 
RVB: 00:05:14.319 Just a couple of weeks of work, really? 
NB: 00:05:17.957 Well, yeah. The core was a couple of weeks-- there was about three weeks of spare time that I had between two companies. I am working for two companies. And in these three weeks, I really built the concept where I was able to show interesting stuff. And since then, I work on it in my spare time, mostly in the weekend when I have some time, when I'm not working with my family. There's not really a big plan where I want to go, it's going with the flow. Get lots of feedback and suggestions from people who are using it, critical people that are interested in transparency, especially in the political sphere. Other people are, for example, interested in accountancy. They also gave me suggestions. So I have a whole list of features that I can work on, so. 
RVB: 00:06:15.474 So have there been any notable users of it? I know you were talking to a couple of journalists at one point and stuff like that. Has that moved any further? 
NB: 00:06:29.018 Yeah. I spoke to mainly some journalists who were involved in the Panama Papers and are now involved in the Paradise Papers, and they gave great feedback, and they told me that they are using it regularly for their own research. So yeah, that's quite fun to find out that this small experiment is being used by these people doing quite important work for the world, I would say. 
RVB: 00:07:00.935 Yeah. I couldn't agree more. So Niek, maybe we can talk a little bit about why-- well, why did you do this? We already talked about it a little bit, but why did you use a graph for this? What was the idea behind that? What was the main reason for you to adopt Neo4j for this? 
NB: 00:07:19.553 Well, to start with, graph databases are fun. It's different from what I'm used to work with, and you can really focus on a different dimension of your information, especially that was the focus of this project, to show the relationships between all the companies and the persons. I didn't have to think more than, probably, seconds to decide that I'm going to use Neo4j for that. And to be honest, it has been a very fun ride to try to implement everything on top of Neo4j. I can only say how incredibly performant this database is. There is so much information in it. I'm not saying this because I like you, Rik, but it's really true [laughter]. 
RVB: 00:08:09.432 I'll pay you later, Niek. No problem. Yeah, no problem [laughter]. 
NB: 00:08:12.716 But looking at the limited resources that I'm currently using - I have it running on the Amazon Cloud with very small instances - I haven't had any issues, I guess, in the last six months, something like that, since it's been running. Things go really fast. You can try it out with sometimes companies that have a huge number, hundreds, sometimes thousands of connections. There will probably be only a small glitch, and that's on the front-end. Neo4j has already done all of its work but it's just the front-end visualization script that sometimes needs one or two seconds to actually show it in the browser. But other than that, it's so fast. It's incredible, really. 
RVB: 00:09:02.305 Fantastic. I think we've got a fan here [laughter]. 
NB: 00:09:05.248 Yes, exactly. 
RVB: 00:09:07.337 Pretty cool. 
NB: 00:09:07.696 You start with technologies, like all the different technologies that I've been using. Many times you look around, you see if there is a big community, if it's a stable product. There is a couple of things to think about when deciding for technology. But in the end, you find out when the product is ready. And you can only hope that the technology you chose is going to keep up with the load that you're going to put on it, or the requirement that you have for it. So I can only say, I'm very happy, very lucky that the Neo4j is able to definitely keep up with my requirements. 
RVB: 00:09:49.856 Well, you know what? That's our mission: helping the world make sense of data [laughter]. It's so appropriate these days. It's really cool. So Niek, maybe we can talk a little bit about the future. Where is this going? Where is Open the Box going, and where do you see this project going? And also, maybe, where do you see Graph Databases going? Do you have any perspective on that? 
NB: 00:10:16.160 I would consider myself quite as an outsider. I don't have any deep thoughts about it at the moment, I'm just using it. And from my point of view, I'm just going with the flow. I will see how I can extend my project. And there have been talks about organizations being able to reuse my data set; how they could start making more global queries, queries that tackle all of the entities. Recently, I came to an article of your colleague, Michael Hunger, who showed off a lot of interesting algorithms. I think that's quite a new plugin. I guess, in just one hour, I tried out a couple of them, like PageRank, Betweenness, and so on. And I did it just on my laptop. And most of the time, it's finished after a minute, and you can see really deep properties, global properties, coming up from my graph database. So I haven't really used that piece yet. I can't see yet which value I could give to my users about it, but that's definitely something that I can look into in the coming months, or even years. 
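(note: to give a feel for the kind of "global" metric Niek mentions, here is a minimal PageRank sketch in plain Python. It is not the Neo4j plugin he tried and not his code; the tiny person-to-company graph, the damping factor and the iteration count are assumptions chosen for illustration.)

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of directed (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical toy data: a link points from an administrator to the company administered.
EDGES = [
    ("Person A", "Company X"),
    ("Person B", "Company X"),
    ("Person B", "Company Y"),
]

ranks = pagerank(EDGES)
for name, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

In this toy graph the company with the most incoming links ends up with the highest score; on half a million companies the same idea surfaces the most "central" entities.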
RVB: 00:11:44.964 Yeah. So much that you can do with that, right? It's really the start of a whole new exploration phase, I guess, when you start looking at those types of metrics. 
NB: 00:11:54.085 Yes. It's a very different mindset. I know it from the world of natural language processing as well. You can look at sentences and it makes a lot of sense. And then suddenly, you throw a million sentences to a machine-learning algorithm, and it behaves very differently. I'm quite sure in this context of persons and companies, it will be similar. It takes some deep experimentation to actually get out this more microscope structure to add value to. 
RVB: 00:12:29.153 Niek, I have one more question for you. The name, Open the Box, I feel like there's something about that. What's the movie reference, or something like that? I've always been wondering [laughter]. 

NB: 00:12:41.117 It was the main name that was still available [laughter]. 
RVB: 00:12:44.091 Okay. Yeah. 
NB: 00:12:47.388 I liked the concept of opening up something because that's really what it is. In legal terms, this information is open. Technically, it's open, but it's still hidden in websites that are quite difficult to use for end users. So really, opening up that information, that was what I was thinking about. And I didn't focus it too much on business or politics, it's just the concept, and we'll see what else we put inside of the box that can then be opened up, together with all the rest. 
RVB: 00:13:23.447 Fantastic. All right. Thank you, Niek, for coming online and doing this podcast episode with me. It was a joy talking to you. And I will put some links to the website and to-- I think you've got some presentation material out there, as well, right? We can put that in the transcription. 
NB: 00:13:38.617 Yeah. Okay. Perfect. 
RVB: 00:13:40.645 And I wish you lots of luck and fun with your project. And I look forward to seeing you at one of the next events. 
NB: 00:13:48.733 Thanks, Rik. Thanks for having me. Bye-bye. 
RVB: 00:13:50.263 Cheers. Bye. Bye.
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik
