Friday, 16 July 2021

Graphistania 2.0, Episode 15: The Summer Session with Emil

Well, this makes me very happy: just before many of us head off on summer vacation, and ON THE DAY OF MY 2ND VACCINATION SHOT, I am able to publish another Graphistania podcast episode - interviewing my friend and boss (how awesome is it to be able to say that!) Emil Eifrem. We talk about the world, the graph database market, Neo4j the company, and of course, the products. It was a ton of fun, and I even got Emil to agree to publishing the video recording too :) ... Hope you enjoy it as much as I did. Here goes:

And of course the video:


Here's the transcript of our conversation:

RVB 00:00:01.709 So have I got your consent to record, please?

EE 00:00:07.850 Fine. [laughter] I want to put it on the record that you have my consent to record this and release the audio.

Monday, 21 June 2021

Revisiting Covid-19 contact tracing with Neo4j 4.3's relationship indexes

Last week was a great week at the "office". One that I don't think I will easily forget. Not only did we host our Nodes 2021 conference, but we also launched our new website, published a MASSIVE trillion-relationship graph, and announced a crazy $325M series F investment round that will fuel our growth in years to come. 

In all that good news, the new release of Neo4j 4.3 kinda disappeared into the background - which is why I thought it would be fun to write a short blogpost about one of the key features that are part of this new release: relationship property indexes.

This is a really interesting feature for a number of reasons, but let me draw your attention to two main points:

  1. Relationship indexes will lead to performance improvements: all of a sudden the Neo4j Cypher query planner is going to be able to use a lot more information, provided by these relationship indexes. The planner is becoming smarter - and therefore queries will become faster. We will explore this below.
  2. Relationship indexes will actually have interesting modelling implications: the introduction of these indexes could have far-reaching implications with regard to how we model certain things. Here's what we mean by that:

You can see that both alternative models could have good use, but that the second model is simpler and potentially more elegant. It will depend on the use case to decide between the two - but in the past we would most often use the first model for performance reasons - and we will see below that that will no longer be a main reason with the addition of relationship indexes. Let's investigate.

Thursday, 10 June 2021

Network Analysis of Shakespeare's plays

What do you do when a new colleague starts to talk to you about how they would love to experiment with getting a dataset about Romeo & Juliet into a graph? Yes, that's right, you get your graph boots on, and you start looking out for a great dataset that you could play around with. And as usual, one thing leads to another (it's all connected, remember!), and you end up with this incredible experiment that twists, turns and meanders into something fascinating. That's what happened here too.

William Shakespeare

Finding a Data source

That was so easy. I very quickly located a Dataset on Kaggle that I thought would be really interesting. It's a comma-separated file, about 110k lines long and 10MB in size, that holds all the lines that Shakespeare wrote for his plays. It's just an amazing dataset - not too complicated, but terribly interesting.

The structure of the file has the following File headers:

Dataline | Play | PlayerLinenumber | ActSceneLine | Player | PlayerLine
a | b | c | d | e | f
g | h | i | j | k | l
m | n | o | p | q | r

Of course you can find the dataset on Kaggle yourself, but I also quickly imported it into a Google Sheet that you can access as well. This sheet is shared and made public on the internet, and can be downloaded as a CSV at any time from this URL. This URL is what we will use for importing this data into Neo4j.
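Before importing, it can be handy to sanity-check the CSV with a few lines of Python. This is only a minimal sketch, assuming the six column names listed above; the sample rows are invented placeholders (not taken from the real file), and `lines_per_player` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Column names as listed in the post; the rows are invented placeholders.
SAMPLE_CSV = '''"Dataline","Play","PlayerLinenumber","ActSceneLine","Player","PlayerLine"
"1","Some Play","1","1.1.1","FIRST PLAYER","A placeholder line."
"2","Some Play","1","1.1.2","FIRST PLAYER","Another placeholder line."
"3","Some Play","2","1.1.3","SECOND PLAYER","A reply, also a placeholder."
'''

def lines_per_player(csv_file):
    """Count how many rows each (Play, Player) pair has."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Skip rows where the Player column is empty, if there are any.
        if row["Player"]:
            counts[(row["Play"], row["Player"])] += 1
    return counts

counts = lines_per_player(io.StringIO(SAMPLE_CSV))
for (play, player), n in counts.most_common():
    print(f"{play} / {player}: {n}")
```

To run this against the real download, you could pass an opened file (with `newline=""`) to `lines_per_player` instead of the in-memory sample.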

Tuesday, 11 May 2021

Graphistania 2.0 - The one with all the GraphStuff

Yes! Here's another great Neo4j podcast episode for you. I hope you will enjoy it -  just as much as I enjoyed recording it with Stefan.

Note that I have put all the interesting links together at the very bottom of the post. They all come from the Twin4j newsletter - to which you should all subscribe, obviously!


Here's the transcript of our conversation:

RVB: 00:00:44.353 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo4j, and yes, it's that time again. We are recording another Graphistania Neo4j podcast. And on the other side of this Zoom call is my dear partner in crime, Stefan, Stefan Wendin. How are you, man?
SW: 00:01:05.215 Always good. Always good meeting up, doing this with you, Rik. It's one of the favourites of the month. And I don't know, what can be better, talking about graphs with your best friend Rik in a sunny southern part of Sweden? Amazing. So good to go.
RVB: 00:01:22.239 Good to go. Fantastic. Great to have you here. And actually, we need to specify one thing, right, before we move on to the real topic of our podcast recording.

Saturday, 24 April 2021

Making sense of the news with Neo4j, APOC and Google Cloud NLP

Recently I was talking to one of our clients who was looking to put in place a knowledge graph for their organisation. They were specifically interested in better monitoring and making sense of the industry news for their organisation. There are a ton of solutions to this problem, and some of them look like really simple, out-of-the-box toolsets that you could just implement by giving them your credit card details - off the shelf stuff. No doubt, that could be an interesting approach, but I wanted to demonstrate to them that it could be much more interesting to build something on top of Neo4j. I figured that it really could not be too hard to create something meaningful and interesting - so I whipped out my Cypher skills and got cracking to see what I could do. Let me take you through that.

The idea and setup

I wanted to find an easy way to aggregate data from a particular company or topic, and import that into Neo4j. Sounds easy enough, and there are actually a ton of commercial providers out there that can help with that. I ended up looking at Eventregistry.org, a very simple tool - that includes some out of the box graphyness, actually - that allows me to search for news articles and events on a particular topic.

So I went ahead and created a search phrase for specific article topics (in this case "Database", "NoSQL", and "Neo4j") on the Eventregistry site, and got a huge number of articles (46k!) back. 

Monday, 29 March 2021

Part 1/3: Wikipedia Clickstream analysis with Neo4j - the data import

Alright, here's a project that has been a long time in the making. As you may know from reading this blog, I have had an interest, a fascination even, with all the wonderful use cases that the graph ecosystem holds. To me, it continues to be such a fantastic thing to be able to work in - graphs are everywhere, and more and more people are waking up to the fact that they really should look at their data as a network, and leverage the important relationships that are often hidden from plain sight.

One of these use cases that has been intriguing me for years, literally, is clickstream analysis. In fact, I wrote about this already back in 2013 - amazing when you think about it. Other people, like our friends at Snowplow Analytics, have been writing about this as well, but somehow the use case has been snowed under a little maybe. With this blogpost, I want to illustrate why I think that this particular use case - which is really a typical pathfinding application when you think about it - is such a great fit for Neo4j.

A real dataset: Wikipedia clickstream data

This crazy journey obviously started with finding a good dataset. There are quite a few of them around, but I wanted to find something realistic, representative and useful. So after some digging around I found the fantastic site of Wikimedia, where they actually structurally make all aggregated clickstream data of Wikipedia's pages available. You can just download them from their website, and grab the latest zipped up files. In this blogpost, I worked with the February 2021 data, which you can find over here.

When you download that file, you will find a tab-separated text file that includes the following 4 fields:
  • prev: the previous page that the navigation came from
  • curr: the current page that the navigation came into
  • type: the description of the type of navigation that was occurring. There are several possible values here:
    • link: a regular link between pages
    • external: a link from an external page to the current page
    • other: a different type - which can occur if people try to hide their navigation patterns
  • n: the number of occurrences of the (prev, curr) pair - so the number of times this navigation took place.
So this is the dataset that we want to import into Neo4j. But - we need to do one tiny little fix: we need to escape the " characters that are in the dataset. To do that, I just opened the file in a text editor (e.g. TextEdit on OSX) and did a simple Find/Replace of " with "". This takes care of it.
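If you would rather not open a large file in a text editor, the same Find/Replace can be scripted. The snippet below is a minimal sketch of that idea and not what was actually used for this post: `escape_quotes` and `escape_file` are hypothetical helper names, the file paths are up to you, and the sample row is invented to show the four fields.

```python
def escape_quotes(line: str) -> str:
    """Double every double-quote character, mirroring the Find/Replace step."""
    return line.replace('"', '""')

def escape_file(src_path: str, dst_path: str) -> None:
    """Write a copy of the tab-separated file with all double quotes doubled."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(escape_quotes(line))

# Invented sample row with the four fields: prev, curr, type, n.
sample = 'Some_"Quoted"_Page\tOther_Page\tlink\t42'
escaped = escape_quotes(sample)

# The tab-separated structure is unchanged; only the quotes are doubled.
prev, curr, nav_type, n = escaped.split("\t")
print(prev, curr, nav_type, int(n))
```

Processing the file line by line keeps memory use low, which helps with a clickstream dump of this size.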

Part 2/3: Wikipedia Clickstream analysis with Neo4j - queries and exploration

In the previous blogpost, I showed you how easy it was to import data into Neo4j from the official Wikipedia clickstream data. I am sure you would agree that it was surprisingly easy to import a reasonably sized dataset like that, within a very reasonable timeframe. So now we can have some fun with that data, and start applying some graph queries to it. All of these queries are also on GitHub, of course, and you can play around with them there as well.

So let's take a look at some of these queries. 

Some data profiling and exploration

Here's a very simple query to give you a feel for the dataset:

match (n)-[r:LINKS_TO]->(m)
return distinct r.type, count(r);
match (n) return count(n);

The results are telling:

And so now we can start taking a look at some specific links between pages. One place to investigate would be the Neo4j Wikipedia page. Here's a query that looks at the source pages that are generating traffic into the Neo4j Wikipedia page: