Thursday, 10 June 2021

Network Analysis of Shakespeare's plays

What do you do when a new colleague starts to talk to you about how they would love to experiment with getting a dataset about Romeo & Juliet into a graph? Yes, that's right, you get your graph boots on, and you start looking out for a great dataset that you could play around with. And as usual, one thing leads to another (it's all connected, remember!), and you end up with this incredible experiment that twists, turns and meanders into something fascinating. That's what happened here too.

William Shakespeare

Finding a Data source

That was so easy. I very quickly located a Dataset on Kaggle that I thought would be really interesting. It's a comma-separated file, about 110k lines long and 10MB in size, that holds all the lines that Shakespeare wrote for his plays. It's just an amazing dataset - not too complicated, but terribly interesting.

The file has the following column headers:

Dataline | Play | PlayerLinenumber | ActSceneLine | Player | PlayerLine
abc | def | ghi | jkl | mno | pqr

Of course you can find the dataset on Kaggle yourself, but I also quickly imported it into a Google Sheet that you can access as well. This sheet is shared publicly on the internet, and can be downloaded as a CSV at any time from this URL. This URL is what we will use for importing this data into Neo4j.
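
To get a feel for that structure before importing anything, a small Python sketch like the one below can help. It is only an illustration: it assumes the six column headers of the Kaggle file, the sample rows (and especially their numbers) are made up for the occasion, and it assumes that rows without a Player are headings or stage directions.

```python
import csv
import io
from collections import Counter

# Illustrative sample in the shape of the Kaggle file. The six headers are the
# ones from the dataset; the Dataline / line-number values are made up.
SAMPLE_CSV = '''"Dataline","Play","PlayerLinenumber","ActSceneLine","Player","PlayerLine"
"1","Romeo and Juliet","","","","ACT II"
"2","Romeo and Juliet","1","2.2.2","ROMEO","But, soft! what light through yonder window breaks?"
"3","Romeo and Juliet","1","2.2.3","ROMEO","It is the east, and Juliet is the sun."
"4","Romeo and Juliet","2","2.2.33","JULIET","O Romeo, Romeo! wherefore art thou Romeo?"
'''


def lines_per_player(csv_text):
    """Count spoken lines per (play, player), skipping rows without a player."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: an empty Player marks a heading or stage direction.
        if row["Player"]:
            counts[(row["Play"], row["Player"])] += 1
    return counts


counts = lines_per_player(SAMPLE_CSV)
for (play, player), n in sorted(counts.items()):
    print(f"{play}: {player} speaks {n} line(s)")
```

The (play, player) pairs are exactly the kind of thing that would later become nodes and relationships in the graph, which is why it is worth checking them on a handful of rows first.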

Tuesday, 11 May 2021

Graphistania 2.0 - The one with all the GraphStuff

Yes! Here's another great Neo4j podcast episode for you. I hope you will enjoy it just as much as I enjoyed recording it with Stefan.

Note that I have put all the interesting links together at the very bottom of the post. They all come from the Twin4j newsletter - to which you should all subscribe, obviously!


Here's the transcript of our conversation:

RVB: 00:00:44.353 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo4j, and yes, it's that time again. We are recording another Graphistania Neo4j podcast. And on the other side of this Zoom call is my dear partner in crime, Stefan, Stefan Wendin. How are you, man?
SW: 00:01:05.215 Always good. Always good meeting up, doing this with you, Rik. It's one of the favourites of the month. And I don't know, what can be better, talking about graphs with your best friend Rik in a sunny southern part of Sweden? Amazing. So good to go.
RVB: 00:01:22.239 Good to go. Fantastic. Great to have you here. And actually, we need to specify one thing, right, before we move on to the real topic of our podcast recording.

Saturday, 24 April 2021

Making sense of the news with Neo4j, APOC and Google Cloud NLP

Recently I was talking to one of our clients who was looking to put in place a knowledge graph for their organisation. They were specifically interested in better monitoring and making sense of the industry news for their organisation. There's a ton of solutions to this problem, and some of them seem like a really simple and out of the box toolset that you could just implement by giving them your credit card details - off the shelf stuff. No doubt, that could be an interesting approach, but I wanted to demonstrate to them that it could be much more interesting to build something on top of Neo4j. I figured that it really could not be too hard to create something meaningful and interesting - and whipped out my Cypher skills and started cracking to see what I could do. Let me take you through that.

The idea and setup

I wanted to find an easy way to aggregate data from a particular company or topic, and import that into Neo4j. Sounds easy enough, and there are actually a ton of commercial providers out there that can help with that. I ended up looking at Eventregistry.org, a very simple tool - that includes some out of the box graphyness, actually - that allows me to search for news articles and events on a particular topic.

So I went ahead and created a search phrase for specific article topics (in this case "Database", "NoSQL", and "Neo4j") on the Eventregistry site, and got a huge number of articles (46k!) back. 

Monday, 29 March 2021

Part 1/3: Wikipedia Clickstream analysis with Neo4j - the data import

Alright, here's a project that has been a long time in the making. As you may know from reading this blog, I have had an interest, a fascination even, with all the wonderful use cases that the graph ecosystem holds. To me, it continues to be such a fantastic thing to be able to work in - graphs are everywhere, and more and more people are waking up to the fact that they really should look at their data as a network, and leverage the important relationships that are often hidden from plain sight.

One of these use cases that has been intriguing me for years, literally, is clickstream analysis. In fact, I wrote about this already back in 2013 - amazing when you think about it. Other people, like our friends at Snowplow Analytics, have been writing about this as well, but somehow the use case has been snowed under a little maybe. With this blogpost, I want to illustrate why I think that this particular use case - which is really a typical pathfinding application when you think about it - is such a great fit for Neo4j.

A real dataset: Wikipedia clickstream data

This crazy journey obviously started with finding a good dataset. There's quite a few of them around, but I wanted to find something realistic, representative and useful. So after some digging around I found the fantastic site of Wikimedia, where they actually structurally make all aggregated clickstream data of Wikipedia's pages available. You can just download them from their website, and grab the latest zipped up files. In this blogpost, I worked with the February 2021 data, which you can find over here.

When you download that file, you will find a tab-separated text file that includes the following 4 fields:
  • prev: the previous page that the navigation came from
  • curr: the current page that the navigation came into
  • type: the description of the type of navigation that was occurring. There are different possible values here:
    • link: a regular link between pages
    • external: a link from an external page to the current page
    • other: a different type - which can occur if people try to hide their navigation patterns
  • n: the number of occurrences of the (prev, curr) pair - so the number of times this navigation took place.
So this is the dataset that we want to import into Neo4j. But - we need to do one tiny little fix: we need to escape the " characters that are in the dataset. To do that, I just opened the file in a text editor (e.g. TextEdit on OSX) and did a simple Find/Replace of " with "". This takes care of it.
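
If you would rather script that Find/Replace than do it by hand in an editor, something like the following Python sketch should do the same job. It is just an illustration: the function names and the sample row are made up, and it assumes the clickstream file is UTF-8 encoded.

```python
def escape_quotes(line):
    """Double every double-quote character, like the manual Find/Replace."""
    return line.replace('"', '""')


def escape_file(src_path, dst_path):
    """Stream a (possibly large) clickstream file and write an escaped copy."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # line by line, so the file never sits in memory
            dst.write(escape_quotes(line))


# Made-up sample row in the prev / curr / type / n layout described above.
raw = 'Some_page\t"Weird_Al"_Yankovic\tlink\t42\n'
escaped = escape_quotes(raw)
print(escaped, end="")
```

Calling escape_file with the path of the downloaded file and the path of a new output file (both of your own choosing) gives you an escaped copy and leaves the original untouched, which is handy if you want to redo the import later.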

Part 2/3: Wikipedia Clickstream analysis with Neo4j - queries and exploration

In the previous blogpost, I showed you how easy it was to import data into Neo4j from the official Wikipedia clickstream data. I am sure you would agree that it was surprisingly easy to import a reasonably sized dataset like that, within a very reasonable timeframe. So now we can have some fun with that data, and start applying some graph queries to it. All of these queries are also on GitHub, of course, and you can play around with them there as well.

So let's take a look at some of these queries. 

Some data profiling and exploration

Here's a very simple query to give you a feel for the dataset:

// count the LINKS_TO relationships per navigation type
match (n)-[r:LINKS_TO]->(m)
return distinct r.type, count(r);
// count all the nodes in the graph
match (n) return count(n);

The results are telling:

And so now we can start taking a look at some specific links between pages. One place to investigate would be the Neo4j Wikipedia page. Here's a query that looks at the source pages that are generating traffic into the Neo4j Wikipedia page:

Part 3/3 - Wikipedia Clickstream analysis with Neo4j - some Graph Data Science & Graph Exploration

In the previous blogposts, I have tried to show
  • How easy it is to import the Wikipedia Clickstream data into Neo4j. You can find that post over here.
  • How you can start doing some interesting querying on that data, with some very simple but powerful Cypher querying. You can find that post over here.
In this final blogpost I want to try to add two more things to the mix. First, I want to see if I can do some useful "Graph Data Science" on this dataset. I will be using the Neo4j Graph Data Science Library for this, as well as the NEuler Graph App that plugs into the Neo4j Desktop. Next, I will be exposing some of the results of these Graph Data Science calculations in Neo4j's interactive graph exploration tool, Neo4j Bloom. So let's do that.

Installing/Running the Graph Data Science Library

Thanks to Neo4j's plugin architecture, and the Neo4j Desktop tool around that, it is now super easy to install and run the Graph Data Science Library - it installs in a few clicks and then you are off to the races:

Monday, 15 March 2021

Graphistania 2.0: The Lockdown Anniversary session

This weekend marked the 1 year anniversary of the "Covid era" - the time when many of us have been hunkered down at home, or close to home at least, to deal with the raging pandemic. It has been a strange couple of months, and while I personally have been able to deal with it quite comfortably, I must say that my heart has been going out to all the friends and families that have had a far worse time with this. I personally got to experience it first hand in the first couple of months of 2021, losing a very dear friend to the virus and its awful disease, and will not easily forget this period of our lives.

But we do want to keep the fantastic drive and atmosphere of our Neo4j community up - it would be a terrible shame if we lost that, too, to the virus. So that's why me and Stefan are going to continue making these podcast episodes, at least for the foreseeable future. Not in the least because they are a ton of fun to make :) ... 

So here's our latest chat. We have found a true treasure trove of great use cases and such in the Twin4j newsletter, which we will try to highlight:

Hope you will find our chat interesting! Here it goes:


RVB: 00:00:00.768 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo4j, and yep, it's that time again. Yippee! Yeah. We have another podcast recording day, and for that, I have my dear friend Stefan on the other side of this Zoom call. Hey, Stefan.

SW: 00:00:18.130 Hello, Rik. Nice to be back here with you.

RVB: 00:00:20.830 Hey, there.

SW: 00:00:21.733 Coming back from a week of vacation, so very nice to be back, and what better way to get your graph mind up and running than to hang out with you here in this lovely podcast?

RVB: 00:00:33.018 I hope so too. Thanks for being here, and I hope you had a good holiday. And it's been two months, Stefan, so we really need to get our act together. It's been a very busy--

SW: 00:00:43.185 Holy crap.

RVB: 00:00:44.057 --couple of months, but. We try to do these things once a month, but that didn't work in February, and so we've got a lot to talk about, actually. And maybe I'll just kind of frame it for you. I went through all of the Twin4j this week, the Neo4j newsletters over the past couple of weeks, and I found some really interesting themes. And maybe we can talk about those for a bit. There's three of them. Is that okay?

SW: 00:01:16.179 Yeah, yeah. Let's go. I think that should be super fun.

RVB: 00:01:20.885 Yeah. So the first one I wanted to talk about is, basically, use cases. I mean, we see this all the time in our community, really, right? That there's these unbelievable interesting use cases popping up. There's a couple of them that popped up in the newsletters. The first one is actually dear to my heart. It's something that I started working on back in 2012 when I first joined Neo4j. It's protein interaction networks. Have you taken a look at that one?

SW: 00:01:50.023 Yeah. That was an amazing one. As always with those kind of things, when there is-- this was written by Tomaz, right? So I started--

RVB: 00:02:01.558 Tomaz, [crosstalk].

SW: 00:02:02.277 --reading the article, but then halfway through the article, I was like, "Oh, but I better just try it out." That's how I learn. So I ended up running this, and it is so neat that this is, in some sort of way, so accessible. I think for me, that is super cool in that. So I find it extremely interesting and also because of the simplicity of it and how complex it is, even if it's just such a simple thing as a protein that interacts with another protein, basically, at the foundation, and it's still amazing. I think it's kind of mind boggling in that sense, so.

RVB: 00:02:44.678 A little story from my side: when I first started working with Neo4j, we did some work with the University of Ghent here in Belgium, and they were working on a topic called metaproteomics, which is exactly this, interactions of proteins. And they struck a nerve with me because one of their most important research customers was, of course - drumroll - a brewery.

SW: 00:03:13.135 Of course. I was wondering, "Where is this going to go?"

RVB: 00:03:17.022 Yes.

SW: 00:03:17.797 There it was. Brewery time again.

RVB: 00:03:18.825 There it is. Yeah.

SW: 00:03:20.664 There it is.

RVB: 00:03:21.182 It was a beer brewery, and they were basically saying yeasts that are being added to brewing systems, they create these protein interactions, and if you better manage those protein interactions, you can actually influence the brewing and the taste of the brew by doing so. So that was an interesting one. I had a good time exploring that. There was another one, Stefan, around asset management. This is a very well-known one as well, right? Things like configuration management databases, building information management. It's all about managing assets, isn't it? It's a very networked problem.

SW: 00:04:06.005 Yeah, yeah. And also, this is one of the use cases that seems to pop up more and more often in customer or prospect interactions, I think, so I think it's going to be very helpful for people. So go check it out if you are interested in that. I think that would be very neat to do that.

RVB: 00:04:29.195 But I know you want to skip to the third one that we had [crosstalk].

SW: 00:04:32.095 Yeah, the big one. This is what I'm kind of waiting for, like, "How can we get to this point faster?" Ha ha!

RVB: 00:04:37.534 Yeah. "How can we get to that third point faster?" which is, of course, the use case of getting to Mars and NASA. It was all over the news in the past couple of weeks, obviously, but yeah, that story of how David Meza from NASA-- he's the - how do you call it? - chief knowledge management architect or something like that with NASA, and he worked on that Lessons Learned database in Neo4j. Super cool, right? It's so good.

SW: 00:05:12.489 Yeah. It's super good. I think the use case is good. The interview is also great. David is also very relaxed, I think, also. Ashley - or what is the name of the girl interviewing? - is also doing a great job. There's very good chemistry in there, really enjoyed it. And again, of course, anyone that was dreaming about going to space as a kid, imagine working with such a thing. I mean, it can't get better. This is the moment when you go to work, and you go, "Holy crap. I'm so proud now." But I think it's an interesting thing on how much this actually speed up time for them, right? How much is saved not only time, in that sense, but also taxpayers' money, right?

RVB: 00:05:59.053 Of course.

SW: 00:05:59.211 I think that's what I keep coming back to in talking about use cases. So you can do a lot of things with a lot of technologies, right? So very often, people ask me, "How can I use Neo?" Right? But then I say, "You can do it for this, but you can, of course, do this with your old technology, in theory. However, if that theory takes you two years, maybe you can't really do it in practice." So I keep coming back to think about that, and I think this is such a good showcaser on that. It's a great YouTube clip here, so for those that have a hard time reading or just want to listen while-- I was saying commuting to work. Apparently, commuting to your working room should be better now, in these times. But I really enjoyed it. Happy to see it, so yeah, hope you like it as well.

RVB: 00:06:49.604 Cool. Yeah. It was super nice. And there's another, I mean, theme to the newsletters in the past couple of months. I've seen so many-- it's kind of like a use case, but it's also a technology foundation of people that are using graphs in combination with natural language processing, right? I saw a number of posts from Jesús, our colleague, who was talking about RDF-related work, WordNet, those types of things, but there's also people that are doing really interesting work on extracting new knowledge from existing documents, right? Were you able to make any sense of that?

SW: 00:07:39.354 Yeah. And I think it's like this is also one of those kind of super untapped-- and I think it's also a perfect bridge from NASA, right? Because literally, that was what they were doing. They had the answers; they just couldn't see it, right? So it's a classical, "You can't see the forest because of all the trees," right? So I think that is really interesting. And I think also, there is a great post about-- I think it was called From Text to Knowledge: The Information Extraction Pipeline or something, basically where Tomaz then explained why he see a combination of NLP and graphs as one path to explainable AI, right? And I think this is also one of those topics that are super important from a lot of things to kind of understand but also compliance and a lot of things, right? So I think this is also one of those kind of areas, use cases, or whatever you want to call it that literally are exploding on all different kind of verticals, you may almost use as a word there. But yeah, I think that, again, our--

RVB: 00:08:44.103 Yeah. Well, it's been very popular in domains where there's a lot of documentation, right? So academics, pharmaceuticals, patents, legal texts, all those things have been really a great showcase for this type of work, I think.

SW: 00:09:04.354 Yeah. No, but as you're saying, I can see anything from academia, patents-- I mean, I don't know how many of these kind of works that we have done with prospects and really kind of tapping into this kind of super kind of deep knowledge, but you can't see the new perspective because it is just a lot of deep silos, right? So in that sense, it's super graphy, and I think when people start to see it, this is also where they get so excited so were almost screaming, "Take my money," and I was like, "Calm down. Behave good. What is it that you're thinking of answering, or what is the thing that would help you," right? It doesn't have to be the money query, but just don't throw technology at the problem. So be mindful of what you want. I mean, we can see it a lot. I think one of the interesting parts is also looking upon the entire web, scraping information and making sense of it and treating that as a kind of a knowledge graph itself. It's also one of a neat couple of projects that I'm working on personally. Yeah. So there's a lot of funny things to do with the NLP and knowledge graphing combination, I think.

RVB: 00:10:19.330 Yep, totally. Well, what strikes me there-- and this is a perfect segue to our third theme. What strikes me is that it's becoming so much easier to do this, right? So NLP a couple of years ago, that was just so exotic and difficult to use. You basically had to have some kind of a computer science degree or a PhD to be able to use it, but these days, the tools that we have to implement some of these techniques are super accessible. I mean, even a lost sales guy like me can use it. Do you know what I mean? It's pretty usable, even in its basic form. So I wanted to talk about some of the really interesting tools that we see emerging in our community. The one that I was so happy about, to finally see it fully released, is the Arrows app. I think you've used it for a long time already when it was still in an alpha or beta stage, and now it's--

SW: 00:11:26.893 Exactly.

RVB: 00:11:27.278 --actually been released. Alistair Jones' pet project is now finally out there in the wild. Great, though, right? I mean, really great.

SW: 00:11:37.420 Yeah. No, but I think it's so useful in so many ways, of course, for graph modelling, where it was intended, right? And I have a couple of memories working with a lot of C-level people trying to help them understand the power of connected data and so on. So they're not going to write any code, but what we tend to do is do some modelling practices and basic Cypher, and for that, I use Arrows. And one of the constant feedback that I get because of the simplicity of the tool and how it naturally kind of lends your thinking to this kind of graph thinking idea is that, every single one of these sessions, these CEOs, COOs are coming back to me like, "Oh, this is a really good way of thinking of the business, the different domain, and how it's connected. It has given me tons of new ideas." So in that sense, that kind of doubled down as a ideation kind of tool almost, if it makes sense, but I think that's such a cool kind of thing, right? If you're going to build an app or if you have a business logic or anything, that really helps you to kind of map it out in that sense. So it's also kind of neat to see that and to see those people, also, stepping into the graph arena. But yeah.

RVB: 00:12:53.804 Yep. That frame also. Yeah. So Arrows is one of those really amazing tools, but there's other things that are coming up, right? We've all known about the GRANDstack, the development framework that's been around for quite some time. A new release for that and new features, capabilities there, and some examples, also, for people to use and to abuse, I would say.

SW: 00:13:18.057 Use and abuse. That's a perfect way to do it.

RVB: 00:13:19.407 Use and abuse, yeah.

SW: 00:13:21.935 Just dive in there and try.

RVB: 00:13:22.283 And then another one that I wanted to mention was I've really enjoyed using this tool that Niels created called NeoDash. It's a graph app that plugs into the Neo4j browser and allows you to put together dashboards on top of Neo4j. Really no-code development, that type of thing, super easy to use. I was very impressed by that, how accessible it's made everything.

SW: 00:13:52.996 Yeah. And I think that's such a great part because I think just the ability to do something without that no code. Of course, that is in the topic with the cloud itself, one of those kind of things that you can really see how accessible these are. But seeing people with no previous knowledge just trying and fiddling around there, they kind of stumble upon the solution almost. That simple. I think the NeoDash is such a great kind of application for just exploring or visualising the data that you have in a graph that you normally would not use. So we tend to try it out with a lot of the nontechnical people when we work, and it works like a charm every single time. There is literally almost no studying kind of to get started, so I again encourage you, as with all articles, use and abuse, right? Dive in, try around because actually, you're going to get kind of far by just doing that. And that is, I think, a common theme for all of these. You can really see how this whole paradigm of connected data is changing, which is, of course, super cool.

RVB: 00:15:17.502 Yeah, absolutely. Well, I mean, so many other things to talk about. What we'll do is when we get to the blog post created together with this recording, we'll also put all the links to the amazing Twin4j newsletter items and the blog posts and everything all together, and then people can have a look, have a play, use and abuse, and that should set them on their path for even more graph adoption, right? So that's the [crosstalk] idea.

SW: 00:15:50.313 Even more graph adoption. Yeah. And I'm also going to squeeze in a last one because we also had the GDS 1.5 release, right? And there's a great piece on it from Amy and Alicia, two of my most inspiring colleagues. They have taught me a lot of things and are super nice as well. So it's about the new supervised machine learning workflows in Neo4j, so imagine that being even accessible to just try. I can't even think of this. If I would have guessed this five years ago, I'd be like, "Nah, that's not going to happen."

RVB: 00:16:29.543 It was impossible. Yeah, exactly.

SW: 00:16:31.365 "That's impossible. You can't do that on your computer at home in your sofa." But I think that's so cool. So that's a great article by Amy and Alicia; go check it out as well. I'm going to push that in there, but we're going to, of course, post the links, as I said.

RVB: 00:16:48.591 We will, for sure. Well, Stefan, thank you so much for taking the time running through this with me and making a little bit more sense out of it. It was great talking to you. You know that we want to keep these things shortish, at least, so we're going to wrap it up for now, and we're going to try to have this one published soon and then do another one in April, right? We should [crosstalk].

SW: 00:17:14.412 Yes, of course, 1st of April. I will record it from my new podcast studio in the barn in [inaudible] in southern Sweden.

RVB: 00:17:22.097 Ooh.

SW: 00:17:23.410 Ooh, yes. And then--

RVB: 00:17:25.195 I look forward to that one.

SW: 00:17:26.657 --as soon as we get to travel, this is a standing invitation for you to come join me and also for any of our listeners. Bear in mind--

RVB: 00:17:37.641 Absolutely [crosstalk].

SW: 00:17:37.964 --COVID restrictions has to be better before that, so I am not encouraging any anti-vaccine behaviour here, but as soon as we are allowed to travel, come join us. It's going to be a great talk about graphs.

RVB: 00:17:51.294 Fantastic. Thank you, Stefan. It was great talking to you, and I'll talk to you soon.

SW: 00:17:55.905 Likewise. Bye.

RVB: 00:17:57.591 Bye.

Hope you enjoyed that as much as we did. If you have any comments or questions, just reach out!

All the best

Rik & Stefan