Here's the transcript of our conversation:
RVB: 00:01 Hello, everyone. My name Rik. Rik Van Bruggen from Neo Technology, and here I am, again, recording another episode of the Graph Database podcast. Today, I've got a guest all the way from the US, Karl Urich. Hi, Karl.
KF: 00:15 Rik, very nice to speak with you.
RVB: 00:17 Thank you for joining us. It's always great when people make the time to come on these podcasts and share their experience and their knowledge with the community, I really appreciate it. Karl, why don't you introduce yourself? Many people won't know you yet. You might want to change that.
KF: 00:35 Yeah, absolutely. So, again, thanks for having me on this podcast. It's really great to be able to talk about the things I have experimented with and see if it resonates with people. I own a small consulting business called DataFoxtrot, started under a year ago. Primary focus of the business is on data monetisation. If a company has content or data, how can we help those companies make money or get new value from that content or data if they could be collecting data as a by-product of their business or they could be using data internally in their business and then they realise that someone outside the company can use that as well? So, that's the primary focus of my business, but like any good consulting company, I have a few other explorations and really this intersection of the world of graph and spatial analytics or location intelligence is what interests me. So, talking a little bit about those explorations is what will hopefully interest your listeners.
RVB: 01:38 Yeah, absolutely. Well, so, that's interesting, right? I mean, what's the background to your relationship to the wonderful world of graphs then, you know? How did you get into it?
KF: 01:45 Yeah, so going all the way back to college, I did take a good Introduction to Graph Theory as a mathematics elective, but then really got into the world of spatial and data analytics. For 20 years working with all things data: demographic data, spatial data, vertical industry data, along the way building some routing products, late 1990's or late 2000's products, that did point to point routing, drive time calculations, multi-point routing. Really kind of that original intersection of graph and spatial. But, data junky, very interested in data: graph, spatial, data modelling et cetera.
RVB: 02:28 Yeah. Cool. I understand that these spatial components is like your unique focus area, or one of your at least focus areas these days, right? Tell us more about that.
KF: 02:39 Yeah, absolutely. And it's certainly what resonates when I think of about the graph side, spatial data really should define-- spatial data could be any sort of business problems related to proximity location or driving things because you know where something is, your competitors, your customers, the people that you serve. And that's where it resonated to me when, as I start to look at graph and spatial, I was really excited back in April. I walked in, just very coincidentally, in a big data conference to a presentation being put on by Cambridge Intelligence--
RVB: 03:24 Oh, yeah.
KF: 03:26 And so they were introducing spatial elements to their graph visualization.
RVB: 03:31 That's really-- they just released a new product, I think. Right?
KF: 03:34 Just released the new product, at the time had gone beta. So, that really got me thinking about how could you combine graph and spatial together to solve a problem. Looking at Cambridge Intelligences, technology of looking at some spatial plugins for Neo, and again, my company is a consulting company and if there is a need for that expertise at the intersection of graph and spatial, we want to explore that.
RVB: 04:05 Very cool. Did you do some experiments around this as well, Karl? Did you, sort of, try to prove out the goals just a little bit?
KF: 04:11 Yeah. Absolutely. Let me talk a little bit about that. At this concept of combined spatial and graph problem that looked at the outliers, outliers just meaning things that are exceptional, extraordinary, and the thinking is, in my mind, was businesses and organisations can get value from identifying outliers and acting on those outliers. So, maybe an outlier can represent an opportunity for growth by capitalising on outliers, or bottom-line savings by eliminating outliers. Let me give an example of an outlier. If you look at a graph of all major North American airports, and their flight patterns, and put it on a map, you could visualise that Honolulu and Anchorage airports are outliers. There are just few other airports that, "look the same”, meaning same location, same incoming and outgoing flight patterns. And that's really relatively easy if you have a very small graph to visualise outliers, but if you want to look at a larger graph, hundreds of thousands, millions of nodes, what would you do? So, that really started the experiment. I was looking around for test data. Wikipedia is fantastic. You can download--
RVB: 05:28 [chuckles] It is.
KF: 05:29 Wikipedia data-- I love Wikipedia. Anyway, it seemed very natural. And the great thing is that there are probably around a million or so records that have some sort of geographic tagging.
RVB: 05:42 Oh, do they?
KF: 05:44 Yep, so a page-- London, England has a latitude longitude. Tower of London has a latitude and longitude. An airport has a latitude longitude.
RVB: 05:54 Of course.
KF: 05:54 So, you can tease out all of the records that have latitude longitude tagging, preserve the relationships and shove that all into a graph. So, you have a spatially enabled graph, every XY has a-- every page has a latitude longitude or XY. So, really the hard work started, which was taking a look at outliers. So, quick explanation of outliers, so, you think of a Wikipedia page for London, England, a Wikipedia page for Sidney, Australia, they cross reference each other. Pretty unusual to locations other side of the world, but would you call those outliers? Not really, because there's also a relationship between the London page and the Melbourne, Australia Wikipedia page. So, you really wouldn't call those anything exceptional. And so, what I built was a system, or just a very brief explanation is that I looked at relationships in the graph, looked only at the bi-directional or bilateral relationships where pages cross-referenced each other. None have really identified how close every relationship was to another relationship or looked for the most spatially similar relationship. You can score them then, and you can kind of rank outliers. So, let me just give one quick example. It's actually my favorite outlier that I've found--
RVB: 07:30 Which category?
KF: 07:31 Unusual thing to say. There's a small town in Australia called Arish. I think I'm pronouncing that right, that has a relationship with the town in the Sinai Peninsula called Arish, and El Arish in Australia is named after Arish, Egypt because Australian soldiers were based there in World War One--
RVB: 07:51 No way!
KF: 07:53 Yep! And most importantly, this relationship from a spatial perspective, looks like no other relationship. So, that's the kind of thing, when you are able to look at relationships, try to rate them in terms of spatial outliers--
RVB: 08:10 Yeah, sure.
KF: 08:12 You can find things that lead to additional discovery as well.
RVB: 08:18 Super cool.
KF: 08:19 As a Wikipedia junkie, that's pretty fascinating.
RVB: 08:21 [laughter] Very cool. Well, I read your blog post about-- outliers made me think of security aspects actually. I don't know if you know the book Liars and Outliers. It's a really great book by Bruce Schneier. I also have to think about-- we recently did a Wiki Wiki challenge, which is, you know, finding the connections between Wikipedia pages. You know, how are two randomly chosen Wikipedia pages linked together, which is always super fun to do.
KF: 09:00 It was even in my original posting and I didn't want to say that, "Hey, this could be used for security type applications." So, I think I talked in code and said, "You could use this to identify red flag events," but I like to think of it as both the positive opportunity and the negative opportunity when you're able to identify outliers and--
RVB: 09:26 Yeah, identifying outliers has lots of business applications, right? I mean, those outliers are typically very interesting, whether it's in terms of unexpected knowledge, or fraudulent transactions, suspect transactions. Outliers tend to be really interesting, right?
KF: 09:43 Absolutely, absolutely.
RVB: 09:45 Super cool. So, where is this going, Carl? What do you think-- what's the next step for you and DataFoxtrot, but also graph knowledge in general? Any perspectives on that?
KF: 09:56 Yeah. So, there's more of a tactical thing, which is as we record a week from now we have GraphConnect probably--
RVB: 10:04 I am so looking forward to it.
KF: 10:06 Which will be fantastic and being able to test this out with people. It's always great to bounce ideas off to people. In terms of our next experiments, the one that interests me is almost the opposite of outliers and let me explain. So, I have some background in demographics, analytics, and segmentation, so, what interests me a lot is looking at clustering of relationships of the graph. Think of clustering is grouping things that are similar in to bins or clusters, so that you can really make over arching statements or productions about each cluster. You can use techniques like K Means to do the clustering. So, what interests me about graph and spatial for clustering is you can use both elements. The relationships of the graph, spatial location of the nodes, together to drive the clustering. I've started some of the work on this and, again, using Wikipedia data and maybe the outcome, using Wikipedia, if you did your clustering based on spatial location of the nodes, plus strength of the connection, plus the importance of the nodes, plus maybe some other qualifiers, like if a node is a Wikipedia page for a city or a man-made feature, a natural feature, you might end up with clusters that have labels to them. One cluster might be all relationships connecting cities in South America and Western Europe, or relationships between sports teams around the world. So, it's kind of the opposite, if outliers is finding the outliers, the exceptional things, clustering is finding the patterns.
RVB: 11:42 Commonalities.
KF: 11:44 A real-world example might be an eCommerce company is looking at the distribution network, and they want to do clustering based on shipments, who shipped what to whom, where the shipper and recipient are, package type, value, other factors, and they could create a clustering system that categorises their distribution network and they can look at business performance by cluster, impact of marketing on clusters and sometimes just the basic visualisation of clustering just often yields those Eureka moments of insight. That's kind of the next entrusting project that's out there. I'd say, ask me in six to eight weeks [laughter].
RVB: 12:29 We'll definitely do that. Cool. Carl, I think we're going to wrap up here. It's been a great pleasure talking to you. Thank you for taking the time, and I really look forward to seeing you at GraphConnect. I wish you lots of fun and success with your project.
KF: 12:49 Excellent. Thank you very much Rik, really appreciate it.
RVB: 12:51 Thank you, bye bye.
Subscribing to the podcast is easy: just add the rss feed or add us in iTunes! Hope you'll enjoy it!
All the best