Bruggen Blog: master data management


Wednesday, 31 October 2018

Data Lineage in Neo4j - an elaborate experiment

For the past couple of years, I have had a LOT of conversations with Neo4j users and customers who have been looking at graph databases for solving Data Lineage problems. At first, that seemed like a fancy new word used only by hipster technovangelists trying to appear interesting, but once I drilled into it, I found that it’s actually a really cool application of graph databases. Read more on the background on Wikipedia (as always), or just live with this really simple definition:

“Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.”

That’s easy enough. The fact is that it’s a really big problem for large organisations, specifically financial institutions, as they have to comply with regulations like the Basel Committee on Banking Supervision's standard number 239 (BCBS 239), which is all about assuring data governance and risk reporting accuracy.
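To make that definition a bit more tangible, here is a minimal sketch, in plain Python rather than in Neo4j, of what "tracing back to the sources" could look like once lineage is stored as a graph. Everything in it is an assumption made up for illustration: the dataset names, the `DERIVED_FROM` mapping and the two helper functions are hypothetical and do not come from any real system.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets it was derived from.
DERIVED_FROM = {
    "risk_report": ["exposure_mart"],
    "exposure_mart": ["trades_clean", "counterparties_clean"],
    "trades_clean": ["trades_raw"],
    "counterparties_clean": ["crm_export"],
}


def upstream(dataset, derived_from):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in derived_from.get(current, []):
            if parent not in seen:  # guards against revisiting shared or cyclic parents
                seen.add(parent)
                queue.append(parent)
    return seen


def origins(dataset, derived_from):
    """Return the upstream datasets with no recorded parents, i.e. the original sources."""
    return {d for d in upstream(dataset, derived_from) if not derived_from.get(d)}
```

Calling `origins("risk_report", DERIVED_FROM)` on this toy data returns the two raw feeds, `trades_raw` and `crm_export`. In a graph database the same question becomes a variable-length path traversal instead of a hand-written loop, which is a large part of the appeal.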

Here are a couple of really nice articles and videos that should give you quite a bit of background.

Thursday, 20 October 2016

Podcast Interview with Amanda Schaffer, Cisco

Been a while actually since the below conversation happened - but now I have finally found the time to put it up here. Apologies for the delay. Amanda Schaefer has been a really great community member for Neo4j and has been using and advocating the use of graphs in lots of different use cases. Listen to her story - and hopefully it will be an inspiration for other graphistas to come out of the woodwork and start tackling real world use cases with Neo4j.

Here's the transcript of our conversation:

RVB: 00:02.767 Hello everyone. My name is Rik Van Bruggen from Neo Technology and here we are again recording another Graphistania podcast episode and today I'm joined by Amanda Schaefer from Cisco. And you're based in Seattle, right Amanda?

AS: 00:17.171 That's correct.

RVB: 00:20.162 Well our listeners may not know you yet, even though you've participated in the GraphGist Challenge last winter. So why don't you introduce yourself, Amanda?

AS: 00:29.131 Sure, so I am the technical lead for an analytics team in Cisco and my group focuses on maintenance contract renewals and kind of optimizing the quoting workflow and optimizing customer success. So we look at a lot of metrics related to opportunity and bookings and quoting pathways and things like that.

RVB: 00:50.598 Wow, that sounds really interesting and you know that Cisco is already a Neo4j user, maybe someday you'll get to use it professionally there as well.

AS: 01:00.715 I hope so. I'm working on a couple of use cases for that.

RVB: 01:03.514 Really cool. But from what I've read of your work so far, you've been using Neo4j for some of your personal projects, right?

AS: 01:10.929 I have, yeah.

RVB: 01:12.699 Can you tell us a little bit more about that?

AS: 01:14.455 Sure, absolutely. So I started out going to some theater productions around Seattle and noticed that I recognized a lot of the actors from different plays and different theater companies and I got interested in graphing that, because that's such a perfect kind of classic graph problem, mapping people in social networks, and so I was interested in that in the theater space. So for the GraphGist Challenge last winter, I wanted to take a look at that and ended up focusing on the Seattle Shakespeare Company, mostly because their data was the best available of the local theater companies [laughter] so I had to deal with the least data engineering for that, and could focus on the analysis a little bit more. I took a look at their past productions, and matched that up with all of the available Shakespeare plays, and took a look at things like production year and comedies versus tragedies, and their normal seasons versus the things that they take out to the parks. Just had a lot of fun exploring the data with Neo4j.

RVB: 02:15.017 Did you learn anything interesting, things that you didn't know before?

AS: 02:19.544 I definitely learned that the most popular plays and the most successful ones tend to be the things that they take to the parks, which was interesting. I found that there were only eight plays of Shakespeare's that the Seattle Shakespeare Company hasn't produced, so they have done a pretty comprehensive job.

RVB: 02:37.899 When you say, take it to the park what does that mean? I'm not familiar with that.

AS: 02:41.361 So there is a summer kind of "Shakespeare in the park" program, where they go out to different parks around the city, and tour, even around Washington state, and do free productions in parks throughout the summer.

RVB: 02:55.427 Now you are just trying to get me to move to Seattle, right?

AS: 02:58.729 Seattle is a fantastic place to visit in the summer, I highly recommend it.

RVB: 03:03.180 Very good. You told me a little bit about some other projects that you have been working on as well, like the movie festival. Tell us about that maybe.

AS: 03:11.317 Yes. So Seattle has an international film festival that takes place in May and June. And so this year I had a festival pass which means I could see any movie essentially that was running during the festival. But there are about 500 movies to see in about three and a half weeks. So figuring out which movies to see is a big challenge. I watched all of the trailers and rated the movies according to my interests, and then I loaded the schedules, the theaters, the transit time between theaters, and my ratings and the movies all into Neo4j, and using a Python program, I created my optimal schedule for the international film festival. I ran 100 simulations and took a look at the top 10 to 20 schedules, and used that as my basis for deciding which movies to go see.

RVB: 04:03.776 Sweet. That sounds really great. It reminds me a little bit of the use case that we talked about on this podcast a couple episodes back, about the Date Night movies. I don't know if you heard that episode. It's datenightmovies.com. You'll like it if you're a movie buff [laughter].

AS: 04:22.165 Great, I'll have to take a look at that.

RVB: 04:24.285 Yeah. Very good. So why graphs, Amanda? Why did you get into graphs, and what's so cool about them for you?

AS: 04:32.701 I actually got interested in graphs looking at the master data management use case, because as part of our quoting workflows, we have a lot of places where a single kind of parent company will have a bunch of different subsidiaries or a bunch of different field locations, and we want to be able to understand which of these contracts really belong to the same company and things like that. So I took a hands-on workshop with Nicole White from Neo4j actually, at NoSQLNow! in 2015 and that was my official kind of hands-on introduction, when I was exploring that master data management use case. And I just kind of got hooked after that workshop. It was so much fun and so easy and intuitive to play around with the graph model, especially in Neo4j, so from there it just sort of took off. And they say once you get it, everything is a graph and I think that's really true. I am always kind of thinking about how can I make this into a fun Neo4j project?

RVB: 05:38.791 Absolutely yeah, it's unbelievable. I was actually jogging yesterday and all of a sudden my podcast is talking about graphs [chuckles]. It was really, really peculiar. All right, the model, that's what you find very interesting, and the ease of use. Is there anything in particular that you find most appealing in Neo4j?

AS: 06:02.033 I love the ease of use. For me, I'm just kind of always thinking about the intersection of business and technology, or the intersection of modeling real world things and technology. So modeling events, like I did for the film festival, is very interesting to me, like the use case I was thinking about at DataDay Seattle, a local conference that I attended a couple of days ago. And thinking about conference management software, and figuring out combining the session scheduling with a recommendation engine, which I think are both things that Neo4j does really well. And it seems like you could build a really powerful conference scheduler application based on that, so attendees could make it social, and recommend sessions for each other, and things like that. So just always thinking about the ways that things are connected, and how to just apply these classic graph problems to a lot of situations in the real world.

RVB: 07:02.117 Hmm totally. Well, at GraphConnect we always have a schedule graph. I don't know if you are familiar with that, but in the GraphConnect conference that we host every year, every six months, upcoming October in San Francisco as well, we'll have a schedule graph as well. So maybe that's a starting point for you.

AS: 07:18.126 That's great, I have actually recently purchased tickets to the event in San Francisco, so I am really looking forward to it.

RVB: 07:24.554 Fantastic, we will see each other there for sure then. Other than GraphConnect, what does the future look like, Amanda? Where do you think this is going for you personally, for the industry? What's in store, do you think?

AS: 07:39.721 For me, the really interesting next step, and the hurdle that I need to overcome to use it professionally, is just really making the self-serve analytics part, and getting graph understanding out to the typical data analysts on my team, and the people that would use this analysis day-to-day in their business, and helping them understand these cases and understand the graph analysis, and things like that, and I think making it really accessible to a lot more people around the organization is one of the biggest challenges that I'm looking at. In Neo4j 3.0 there's the ability to share the graph style sheets that you've set up in the cloud, so that everyone can see exactly what you're seeing on the screen, and it's much easier to share those around the organization. Things like that I'm really excited about, because at least in my organization, I know that this is a very cool thing, and there are a lot of use cases for it, but I need to take that out and empower other people to figure out how to take advantage of it. That's what I'm really looking forward to.

RVB: 08:52.664 Fantastic. I think that's something that we'll see many more people working on in the next couple of years. For us as well, as a company in this industry, it's really important that we make that work. Cool. Amanda, you know that we want to keep these podcasts fairly short, so we'll put some links maybe to your GraphGists and the rest of your work, on the transcription page, if that's okay. For now, I'm just going to thank you so much for coming online and having a chat with me, and I look forward to meeting you at GraphConnect.

AS: 09:25.550 Thanks, Rik. I had a lot of fun on the podcast this morning.

RVB: 09:28.510 Cheers. Bye-bye.

AS: 09:29.962 Bye.

Subscribing to the podcast is easy: just add the RSS feed or add us in iTunes! Hope you'll enjoy it!

All the best

Rik

Wednesday, 22 June 2016

Podcast Interview with Aaron Wallace, Pitney Bowes

One of the coolest use cases for graph databases, and one that has seen a big uptake in the past couple of months and years, is Master Data Management. Kind of a vague term, but Wikipedia defines it as:

“Master data management (MDM) comprises the processes, governance, policies, standards and tools that consistently define and manage the critical data of an organization to provide a single point of reference.”

Sounds about right to me. Lots of customers have been developing their own solutions to their specific MDM problems, but some people have been thinking about this in a more generic way. Take, for example, our customer Pitney Bowes. They have been early adopters of Neo4j, and have been articulating that vision for the longest time: see this video from 2014 (including other folks from UBS, TomTom, eBay/Shutl as well), and more recently a recording of a talk that Aaron Wallace, one of Pitney Bowes' product managers and my guest on today's podcast, did in 2015 at GraphConnect.
