Thursday, 5 July 2018

Podcast Interview with Matt Casters, Neo4j & Kettle

A couple of years ago, I got to know another Belgian data aficionado that was doing quite a bit of work in the open source community, called Bart Maertens. For a while, we actually met at Antwerp Airport when we were both "commuting" to London City Airport for business - and we got a conversation going. Bart was organising a Pentaho Community Meeting in Antwerp, less than 500m from my home, and invited me to come along and talk a bit about my favourite subjects: beer and graphs :) ... 
So one thing led to another, and Bart started to do some interesting work integrating his data integration tools with Neo4j. He wrote the code, and blogged about it in some detail.

Fast forward to early 2018. Neo4j is more and more in the Enterprise market, with very large organisations seeing the value of graph databases and the platform around them. But most of these environments are NOT greenfield environments - they almost always require some kind of data integration work to make the tools work effectively. So it became very natural for us to start looking for architects and experts that could help us... and that's effectively what brought my next Podcast guest to the Graph: Matt Casters has worked together with many other Neo4j people in a previous life, and is now the Chief Solutions Architect in our professional services organisation. 

Here's my chat with Matt:



Here's the transcript of our conversation:
RVB: 00:00:01.215 Hello, everyone. My name is Rik Van Bruggen from Neo4j. And I'm here recording another episode of our Graphistania podcast. And I have been looking forward to this one for a long time, actually for a month or so. Because on the other side of this hangout is a gentleman that you may know, I have gotten to know a little bit more. And that's Matt Casters. Hi, Matt. 
MC: 00:00:25.135 Hi, Rik. Nice to meet you. 
RVB: 00:00:26.749 Yeah. And thank you for joining me. That's great. Matt, you're not so far away from me because you're also in Belgium. 
MC: 00:00:35.250 That's right. 
RVB: 00:00:36.064 But I think some people might know you from your history in the open source community. But why don't you tell us a little bit about that, and maybe also tell us what the relationship with Neo4j is these days. 
MC: 00:00:49.534 Well, yeah. My background is actually in infrastructure. I was [inaudible] support professional from AIX, at one point. And I became an Oracle database admin. And so this is kind of how I rolled into the wonderful world of business intelligence and data warehousing. So at some point, 15, 20 years ago, I started building my own tool, my open source tool, later on. But my own tool, called Kettle, the Kettle Extraction, Transportation, Transformation, and Loading Environment, in a big world. And [laughter]-- 
RVB: 00:01:34.608 It's a very GNU-like method [crosstalk]-- 
MC: 00:01:36.460 Yeah. It was a recursive acronym. 
RVB: 00:01:40.774 Absolutely. 
MC: 00:01:41.082 That's why I liked it, of course. And in December 2005, I open sourced it. Initially, it had a lesser public license. But I think six, seven years ago, that was switched to an Apache license. But in 2005, it was version 2, something, that I released, and it became very popular, instantly, because there was nothing out there. Talend didn't exist yet. There were no real data integration tools that were deployable. And I had deployed Kettle on a few occasions, on a few projects. So it was kind of hardened a little bit. So most people could just unzip it and use it, just like it is today. You can unzip it and start using it. That's kind of the idea of the tool. 
RVB: 00:02:32.451 It's super well known, right? Lots of people do it. Lots of people use it. I've actually used it in the past, myself. It's really nice. 
MC: 00:02:40.610 So it really has millions and millions of downloads. And the company I worked for the past 12 years, Pentaho, had a lot of success with it. And Pentaho was then acquired by Hitachi Data Systems, now Hitachi Vantara. And at some point, I felt like it was time to move on to greener pastures, different challenges, right. 
RVB: 00:03:10.997 Absolutely. And that's how you came to Neo4j, I think a month ago, or something like that, yeah? 
MC: 00:03:16.222 Yeah. So I flew down to San Mateo to talk to the team. I wasn't really familiar with the success of Neo4j, really. Because it's been growing and growing, and becoming more and more successful. In my defence, I did have the book in my library from Emil, Graph Databases, and I did read a lot about it. But that was like five years ago, or something. It's an old book, right [laughter]? So I talked to the whole team, and I was so-- I felt like I could make a huge difference to this team. Because I saw that Neo4j was not just being used as a database, but as part of larger solutions, specifically, solutions for recommendation engines, that sort of thing, customer 360s, [inaudible] variety of bigger solutions. And I have a lot of experience doing those. 
RVB: 00:04:22.687 And I think, also, the whole ETL story fits really well with graph databases, right, when you--? 
MC: 00:04:30.469 Of course. 
RVB: 00:04:32.367 And it's also one of the things that Neo4j actually, very, very often, we find that we're not in a greenfield situation, right? We have to integrate Neo4j with other tools and other environments. And we have to extract data from that. We have to transform it. And we have to load it into the graph database, right? So there's a lot of links there. 
MC: 00:04:54.467 So, yeah. When you talk about integrating with modern technologies like Kafka, or Hadoop, or the modern NoSQL databases, or the case in point, Neo4j, those are the ideal cases for Kettle, right. 
RVB: 00:05:15.332 We've had a connector, too, from Kettle to Neo4j for a while, haven't we? 
MC: 00:05:20.346 Yes, absolutely. So Bart Maertens from know.bi wrote that. It works fine. The only thing that I would add to that, and I have been adding to that in the last couple weeks, is, I've been writing a bunch of new plugins for Neo4j, is that it's less suitable for streaming data sources like the ones from Kafka that we encountered a few weeks ago on the customer side. And so there, we need an easier way of uploading relational sets of data into a graph, right, either using Cypher or using something new [laughter]. 
RVB: 00:06:03.566 So what do you think is the big attraction point for you to Neo4j? What do you like about it, mostly? Or, in other words, why did you join [laughter]? 
MC: 00:06:12.933 Oh, as a BI data warehouse guy, I have been in situations where the number of relationships between, let's say, fact tables have caused problems. The performance aspect of a very fast graph store can solve these relationship problems very quickly, where you can still have huge amounts of, let's say, hops in transport situations, travelling salesman situations. Or you can still traverse these relationships quite quickly and find the shortest paths and these graph algorithms. That's very interesting, right. But the use cases that I mentioned earlier, they are real. They are key differentiators from what's out there in the market. You talk about Mongo or any of the other big table implementations, where they still have simple tuples, basically, a key and a value in a database, these are very fast for looking things up and updating them. And that's great, but that doesn't give you any relationships between these entities. Or it doesn't give you anything on top of that. So I felt like, yeah, this was something completely different. And like I said, I felt like I could make a difference in the team. I felt like I could improve our situation in Neo4j quite dramatically, by writing a few quick wins, setting up the recommended architectures, coming up with best practices architectures for using Kettle. 
RVB: 00:07:57.281 I'm reading LinkedIn here, and I'm finding that your new title at Neo4j is chief solutions architect. 
MC: 00:08:02.752 Exactly. 
RVB: 00:08:03.351 Yeah [laughter]. What does that mean? And what are you going to be doing at Neo4j? Can you [crosstalk]? 
MC: 00:08:07.767 First, so I'm looking at the solutions that we're building at Neo4j. And one of the things that I want to do is to standardize more, so that we can scale faster, so that we can learn easier from all the projects that we're doing. So what are all the best practices that go along with any kind of development project? How can we reuse code? How can we go from, let's say, an 18-month implementation trajectory, with lots of involvement, to something that we can do in a few days? That's basically what I want to do. And, initially, obviously, we're looking at the data loading aspect. But afterwards, obviously, we're going to look at the front end as well, because we have really cool tools like Bloom. Thanks for your video, by the way, Rik. 

RVB: 00:08:59.121 [crosstalk]. 
MC: 00:08:59.130 [crosstalk] on the core. But we need more than just Bloom. We need all sorts of dashboarding technologies for these larger solutions. 
RVB: 00:09:08.849 That's really cool. I think that especially our field engineering teams, right, the consultants that are implementing these solutions at the customer, or that are working with partners to implement these, they will benefit big time from this, right? 
MC: 00:09:21.421 Yes, absolutely so. And the nice part is, Kettle, Pentaho Data Integration, is Apache licensed; you just download that for free from SourceForge. There's no real, let's say, barrier to entry to start playing along. And ultimately, that's my short-term goal, is to make it incredibly easy to load any sort of data into Neo4j, with similar relationships. I think that is something that is needed. I think Neo4j has offered a tremendous amount of options, from a quick initial load from a relational database, with Neo4j ETL, that still exists out there. We have things like bulk loaders. We have Java APIs. We have all sorts of scripting capabilities. But, so far, the data integration, the visual programming tools, kind of were left behind. I think that's going to change dramatically, of course [laughter]. 
RVB: 00:10:32.389 I mean, this part of the project, I guess, as a customer, I can only confirm that it's a huge topic at the customer side or the user side, right. I mean, my worst nightmare is also always when customers are super enthusiastic about Neo4j, and then they try to load a zillion rows into [laughter] Neo4j in one transaction. And that doesn't necessarily end well. So you know how it is, right. 
MC: 00:11:06.135 So, yeah, if you ask me how I would love to see this? This is the world that I live in as an architect. I try to envision how I would love to see this. And I would love to see people do conceptual data modelling with graphs, not on a blackboard, but in our own software. I would love for them to be able to connect the dots in Kettle, using those shared graphs, distributable, annotated graphs. So that's my ideal world, where we might use Cypher, or we might use a Java API in the background. But the user wouldn't have to handcraft large Cypher statements, or anything like that, and just do this automatically, based on the nodes, and relationships, and properties, and whatnot. 
RVB: 00:11:58.971 So you've already given us a little bit of a taste of what the future looks like [laughter]. 
MC: 00:12:02.977 Yeah. I'm always very open about this [laughter]. 
RVB: 00:12:06.519 But what does it look like in two or three years from now? How do you look at that? What's in store for the industry? 
MC: 00:12:12.341 I really want to make a big difference. I really want to make it very easy for people to use Neo4j. So probably integration into something like the Neo4j desktop app or in the [inaudible] somewhere, something very easy, some wizard-like fashion, where people can try out Neo4j in five minutes. That would be my preferred way of really dumbing it down for, let's say, the data scientist guy, the ITer that wants to try it out quickly. And so ease the learning curve that is sometimes substantial [laughter]. 
RVB: 00:13:00.721 I agree. Yep. Pretty good. And then after that is just world domination, right? That's all-- 
MC: 00:13:06.584 Yeah. And let's take a clear, in-depth look later on at that front end, because a lot of vendors out there, the [inaudible], the Qlik types, they don't support graph databases. They don't support any kind of querying, right. So we can either dumb down the graphs to flat data sources and expose JDBC or ODBC drivers, but that's not what we really want, right? We want to own that as well, I think, at some point. I think Bloom is a great starting point, but I think we can do more. I think we can-- there is also already so much open source technology out there, that I'm sure we can do more and create a more integrated and easy to use platform for all our users and partners out there. 
RVB: 00:14:02.132 Absolutely. Well, Matt, I think I'm looking forward to the future already. And first of all, I want to thank you for sharing that with us and with our listeners. 
MC: 00:14:12.691 Absolutely. 
RVB: 00:14:12.907 It's great to have you on the team, and it's great to have you on the podcast. And I'm sure we'll hear a lot more about you and your work in the next couple of months. 
MC: 00:14:20.550 Yeah. Invite me in six months, and we'll talk about it [laughter]. 
RVB: 00:14:23.361 Absolutely. There's no doubt about that. Thank you, Matt. Talk to you soon. 
MC: 00:14:26.573 Thank you, Rik.
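
A small aside on the "zillion rows in one transaction" nightmare from the conversation: the usual way around it is to split the rows into batches and send each batch in its own transaction. Below is a minimal Python sketch of that idea. It only prepares the batches and the statement; it does not talk to a database. The `Person` label, the `id` and `name` properties, and the helper names `batches` and `plan_load` are all made up for illustration and are not something Matt or the Kettle plugins prescribe.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical Cypher statement: the Person label and the id/name
# properties are illustrative assumptions, not taken from the interview.
LOAD_BATCH_CYPHER = (
    "UNWIND $rows AS row "
    "MERGE (p:Person {id: row.id}) "
    "SET p.name = row.name"
)


def batches(rows: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def plan_load(rows: Iterable[Dict], size: int) -> List[Tuple[str, Dict]]:
    """Return one (statement, parameters) pair per intended transaction."""
    return [(LOAD_BATCH_CYPHER, {"rows": batch}) for batch in batches(rows, size)]


if __name__ == "__main__":
    people = [{"id": i, "name": f"person-{i}"} for i in range(10)]
    for statement, params in plan_load(people, size=4):
        # In a real load, each pair would be run in its own transaction.
        print(len(params["rows"]), "rows:", statement)
```

Each pair returned by `plan_load` is meant to be executed as a separate transaction by whatever client you use, so a failure or a memory spike only ever affects one batch. The right batch size depends on your data and your server, so treat the `4` above as a placeholder.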
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik
