Friday, 9 March 2018

Podcast Interview with Dilyan Damyanov, Snowplow Analytics

Here's another great podcast for you: I had a chat with Dilyan Damyanov of Snowplow Analytics about how you can use a graph database to enhance your event analytics, specifically for clickstream analysis. I wrote about this myself a while back, but of course there is so much more to it, and Snowplow has really done a great job of enabling it with their toolset.

Here's our chat:

Here's the transcript of our conversation:
RVB: 00:00:14.000 Hello everyone. My name is Rik Van Bruggen from Neo4j and here I am recording another Graphistania Neo4j podcast. And today, I've got someone from London on the phone. That's Dilyan Damyanov. Hi, Dilyan.
DD: 00:00:32.026 Hi, Rik, how are you?
RVB: 00:00:33.628 Very, very good. Thank you for joining me. Dilyan, you work for Snowplow Analytics, which is a really interesting company that does-- well, you can explain it yourself. Event data and analytics, I guess, right? But maybe you can introduce yourself a little bit and also tell a little bit, what's your relationship to the wonderful world of graphs, right?
DD: 00:00:56.997 Sure, sure. Yeah, so my name is Dilyan Damyanov and I've been working at Snowplow Analytics for 12 years now. Snowplow is an open-source event analytics platform that lets you collect data from any source, then store that data in your own data warehouse and just run any analysis on it. And I'm part of the analytics services team that delivers custom analytics projects. And we've been very interested in using graph technology for some of those projects. So far, it's only been on the R&D side, like we're exploring where we could take this, but we are very keen to make it part of our regular stock.
RVB: 00:01:55.808 So can you explain a little bit more on what that means, like event data and mapping a graph on top of that? I think there might be some different ways to do that, right?
DD: 00:02:08.543 Yeah, that's exactly right. So if we think of event data that's traditionally stored in a relational database, there's really only one logical way to store that and that's you have one line per event and then your table is as wide as the number of different facets you have for that event. So you may have a user ID, you may have a timestamp, you may have a URL that was visited, and so on. When you think of modeling the same data as a graph, there's many different ways you can do it. So one way would be your event could be a node and then you can have relationships going out of that node to all the different parameters like user ID, URL, timestamp, and so on. So that's one way to do it. Another way to do it would be to have all the different objects in your event, like the user ID and the timestamp and the URL, be the nodes themselves, and then have relationships between those nodes. And then you can take it from these two and then you can mix and match them so that some objects are nodes and you can visualize or model the relationships between them as edges in the graph. Or some of the relationships could actually be that, a node that has different properties. So you can end up with at least five or six different ways to model that. And it's not necessarily clear which one of these you should choose. There's many considerations that you need to take into account because some of those methods would mean you're duplicating data in your graph, so you're storing data in several places. But other ways mean that your queries will run faster because you won't have to do as many hops.
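[Editor's illustration, not part of the recording.] To make two of these modeling choices concrete, here is a minimal Python sketch that uses plain sets and tuples instead of a real graph database. The sample events, the field names, and the relationship names (PERFORMED_BY, ON_PAGE, AT_TIME, VIEWED) are all assumptions made up for this example; they are not Snowplow's schema or anything Dilyan described in detail. The first function treats each event as a node with relationships to its parameters. The second is one possible mixed variant, where users and pages are nodes and the event itself becomes a relationship carrying the timestamp.

```python
# Hypothetical sample events; field names are illustrative only.
events = [
    {"event_id": "e1", "user_id": "u1", "url": "/home",
     "timestamp": "2018-03-01T10:00:00"},
    {"event_id": "e2", "user_id": "u1", "url": "/pricing",
     "timestamp": "2018-03-01T10:01:00"},
]

# (node label, event field, relationship name); all names are assumptions.
PARAMETERS = (
    ("User", "user_id", "PERFORMED_BY"),
    ("Page", "url", "ON_PAGE"),
    ("Time", "timestamp", "AT_TIME"),
)


def event_centric(events):
    """Each event is a node with one relationship per parameter."""
    nodes, edges = set(), set()
    for e in events:
        event_node = ("Event", e["event_id"])
        nodes.add(event_node)
        for label, field, rel in PARAMETERS:
            param_node = (label, e[field])
            nodes.add(param_node)
            edges.add((event_node, rel, param_node))
    return nodes, edges


def object_centric(events):
    """Users and pages are nodes; each event is an edge with properties."""
    nodes, edges = set(), []
    for e in events:
        user, page = ("User", e["user_id"]), ("Page", e["url"])
        nodes.update([user, page])
        props = {"event_id": e["event_id"], "timestamp": e["timestamp"]}
        edges.append((user, "VIEWED", page, props))
    return nodes, edges


ec_nodes, ec_edges = event_centric(events)
oc_nodes, oc_edges = object_centric(events)
print(len(ec_nodes), len(ec_edges), len(oc_nodes), len(oc_edges))
```

The same two events produce more nodes and relationships in the event-centric shape than in the mixed one, which hints at the trade-off described above between how much is stored and how many hops a query needs.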
RVB: 00:04:19.151 Yeah. Doesn't it mean that it kind of depends on the question that you're trying to answer? The model is kind of driven by the query pattern? Or am I wrong there?
DD: 00:04:34.992 Yeah, so that's one consideration. You might have a very clear question in mind or a set of questions that you want to answer. And then, the query pattern will be something that dictates what shape the graph will have. Or you can have some other constraints. For instance, at Snowplow, we believe that your atomic data, that's the data at the most [inaudible] level, should be immutable because, that way, it's fully auditable. Even if there was some error in the pipeline, if you don't change your underlying data, you can always inspect it and you can always find a fix for it. But if you change it, then fixing things becomes much harder because then you cannot trust your data anymore. You know that, at some point, it might have been changed. So that's an interesting constraint. It means that, for instance, you can't have the user be a node that takes on different properties as time goes on because that would violate the principle of immutability.
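[Editor's illustration, not part of the recording.] The constraint that atomic data is never changed can be sketched in a few lines of Python. This is only an analogy under stated assumptions: the `Event` class, the `plan` attribute, and the `plan_as_of` helper are hypothetical names invented here, not Snowplow code. Instead of a user record whose properties get overwritten, every event keeps the value it was recorded with, and the user's state at any moment is derived by reading the events.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Event:
    """An atomic event; frozen so it cannot be modified after creation."""
    event_id: str
    user_id: str
    timestamp: str  # ISO 8601 strings sort chronologically
    plan: str       # hypothetical user attribute captured at event time


# Append-only log of hypothetical events.
log = (
    Event("e1", "u1", "2018-03-01T10:00:00", "free"),
    Event("e2", "u1", "2018-03-05T09:30:00", "paid"),
)


def plan_as_of(log, user_id, when):
    """Derive a user's plan at a point in time without overwriting anything."""
    seen = [e for e in log if e.user_id == user_id and e.timestamp <= when]
    if not seen:
        return None
    return max(seen, key=lambda e: e.timestamp).plan


# Attempting to change a stored event fails, so the log stays auditable.
try:
    log[0].plan = "paid"
    mutated = True
except FrozenInstanceError:
    mutated = False

print(plan_as_of(log, "u1", "2018-03-02T00:00:00"), mutated)
```

Because the earlier event still says what it said when it was recorded, a question about the past can always be answered from the original data.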
RVB: 00:05:47.190 Yeah, yeah, absolutely. Oh no, there seems to be an echo [laughter] on our call here. That's weird. I'll cut that out later. Dilyan, maybe we can talk a little about the reasons why you got into the graph and why you started using it. Was there a particular background there, particular story there that might be interesting for us?
DD: 00:06:17.630 Yeah, so the way I got into graphs is we have regular hackathons at Snowplow and at one of those events, I just came up with this idea of exploring how we would do path analysis in a graph database. So we already knew how to do that in a relational database and we knew that, depending on the length of your path, you could end up with very expensive queries. So I was interested to see if those queries would be much more efficient in a graph database. And that's exactly what turned out to be the case.
RVB: 00:07:03.697 What kinds of paths are we talking about here? It's paths between events then? Or between users and events? Or what types of paths were you thinking of?
DD: 00:07:13.529 So it could be different things. One, the simplest example would be somebody landing on a website and then the path would be tracking that person through the different pages that they visit. It could be a funnel analysis thing where you track a person through the different stages of a funnel. Or if you're talking about something like marketing attribution, you could track people through the different marketing touches they go through before they convert. So all of these are things that we see people explore using Snowplow. And all these analyses are much harder in a relational database than they are in a graph database.
RVB: 00:08:05.397 I can totally see that. That sounds like a great use case because you kind of know where you're starting, you know where you end with these events, but it's probably interesting to figure out what happened in the meanwhile, right?
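[Editor's illustration, not part of the recording.] The simplest kind of path analysis mentioned here, following each visitor through the pages they view, can be sketched in plain Python. The page views below are invented sample data and the function names are hypothetical. In a graph database the page-to-page steps would typically be stored as relationships and traversed directly; this sketch only shows the shape of the result.

```python
from collections import Counter, defaultdict

# Hypothetical page views: (user_id, ISO timestamp, url).
page_views = [
    ("u1", "2018-03-01T10:00:00", "/home"),
    ("u2", "2018-03-01T10:00:30", "/home"),
    ("u1", "2018-03-01T10:01:00", "/pricing"),
    ("u2", "2018-03-01T10:02:00", "/pricing"),
    ("u1", "2018-03-01T10:03:00", "/signup"),
]


def paths_by_user(page_views):
    """Order each user's page views by time to get their path."""
    paths = defaultdict(list)
    for user, _ts, url in sorted(page_views, key=lambda v: (v[0], v[1])):
        paths[user].append(url)
    return dict(paths)


def transition_counts(paths):
    """Count how often one page is followed directly by another."""
    counts = Counter()
    for path in paths.values():
        counts.update(zip(path, path[1:]))
    return counts


paths = paths_by_user(page_views)
transitions = transition_counts(paths)
print(paths)
print(transitions.most_common())
```

The same idea extends to funnels or marketing touches by replacing the URL with a funnel stage or a channel name.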
DD: 00:08:18.776 Yeah, yeah, correct.
RVB: 00:08:19.887 Yeah. Cool. All right. Well, maybe we can talk a little bit about the future? What do you think is happening with Snowplow and Neo4j and graph databases in general? And also, where do you see the industry going, maybe? What's your perspective on that?
DD: 00:08:39.898 I think, at Snowplow, we're very interested in exploring how we can do a lot more with graphs. And we are at the start of this journey, but hopefully, we'll see some concrete results very soon. I think, industry-wide, there's still a little bit of a stigma there that people tend to think that graphs are something very complicated, which they're not. But it's still something that we have to deal with. And I think Neo4j actually has a great-- is already playing a great role in that. Even if you think about basic things like-- I mean, obviously, you're putting a lot of effort into creating a community that people can fall back on for support. But also, if you look at something like Cypher, the query language in Neo4j, it's very intuitive. It's extremely easy for a data analyst who's already experienced in SQL to understand how Cypher works. There's an almost one-to-one mapping for most of the queries. Oh, taking into account the fact that they're very different things, relational databases and graph databases. But it makes intuitive sense how that is going to work. And I think that's a step in the right direction. Some of the other query languages make graphs look like this scary, arcane thing, which they're really not.
RVB: 00:10:28.910 Like a little green monster that runs through the graph, is that what you're talking about [laughter]?
DD: 00:10:34.324 [inaudible]--
RVB: 00:10:35.181 Like a little gremlin?
DD: 00:10:36.871 Yeah, personally, I find Gremlin to be a little bit intimidating.
RVB: 00:10:43.478 I agree. Yeah. Very much so, yeah. Very cool. Dilyan, I think what we'll do is we'll put a bunch of links to Snowplow and to some of the work that you guys have been doing with Neo4j on the website. And we'll link into the transcription of the podcast, obviously. But, for now, I think this was a very, very useful and interesting conversation for our listeners and I want to thank you for taking the time to come online. And I'm sure we'll hear more about you guys in the upcoming months.
DD: 00:11:17.467 And thank you for inviting me. It was great to be a part of this.
RVB: 00:11:20.938 Fantastic. Thank you.
DD: 00:11:23.427 Thanks, bye-bye.
Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

3 comments:

  1. Hi, great podcast. The background music is a bit distracting though!

    Replies
    1. I have actually updated the audio file with a version that is hopefully a bit less distracting / lower volume wrt the background music ...

  2. Oh... sorry about that... maybe it was a bit too loud in the background... I will take care to make it more silent in the future...
