Bruggen Blog: 2020

Wednesday, 25 November 2020

Exporting Spotify Playlists into Neo4j - and creating a little dashboard

About two months ago, my colleague Niels published an amazing blogpost. He showed us how to solve a problem that I really recognized: to make sense of your age-old Spotify playlists that are getting seriously out of hand. I have this problem in the real world: I keep adding songs to my "favorites" playlist, or to some collaborative playlists that I have with my kids/friends - but I end up with these huge gathering pots of songs that... really don't make a lot of sense anymore, and really have not much use anymore.

So Niels' blogpost was really useful: he used Python, the spotipy wrapper of the Spotify Web API, and of course our favourite database, Neo4j, and some of its graphy tools (Graph Data Science to the rescue) to make a really fancy new set of Spotify playlists that were much more usable. Take a look at Niels' script over here. So I wanted to have a play with Niels' work in my own environment - and do some more exploration in Neo4j. Here's what happened.

Graphistania 2.0 - Episode 11 - The Emil Update

Yey! I got to do it again. For the 4th time in the history of this weird thing called the Graphistania podcast, I have had the chance to spend some quality time talking to Emil Eifrem, our fearless leader and CEO of Neo4j. As last time, we actually recorded the video, so you will find the Zoom call, and the MP3 version of it, below in the blogpost - along with the habitual transcription.

Hope you will enjoy the chat as much as I did.

Graphistania · Graphistania 2.0 - Episode 11

Here's the link to the youtube video of the call:

Graphistania 2.0 - Episode 10 - This Month in Neo4j

Hi everyone

Hope you are all well, keeping safe, and finding some time to relax and enjoy life in this wonderful rollercoaster that is 2020. Think of it this way - we will never forget this ride, EVAH!

As you can imagine, things have been evolving at warp speed in the wonderful world of graphs as well. So me and my partner in crime Stefan had another chat about all the things we have seen pop up, mostly through the awesome This Week in Neo4j (Twin4J) newsletter. Here's the chat we recorded:

Graphistania · Graphistania 2.0 - Episode 10

Here's the transcript of our conversation:

RVB:00:00:01.448 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo4j, and here I am again recording another episode of our Graphistania Neo4j podcast. Wonderful time of the day to start with this type of conversation because I have my dear friend, Stefan, on the other side of this call. Hi, Stefan. How are you?

Making sense of 2020's mad calendar with Neo4j

As we enter November 2020, I - like many people I assume - can't help but feel quite "betwattled" by all of the events taking place this year. I took some time last weekend to look at all the crazy events that happened ... starting with pretty normal January and February, moving slowly to ominous March, and then living the weird, (semi-) locked down lives that we have been living until this very day I write this, which is the day after the bizarre US elections.

In any case, I decided to have some fun while reflecting about all this. And in my world, that means playing with data, using my favourite tools... CSV files, Google Sheets, and of course, Neo4j. Let me take you for a ride.

Starting out with my calendar

The starting point of all this is of course my Google Calendar - which is buried in online calls and meetings these days.

Graphistania 2.0 - Episode 9 - The one about the (Graph Databases for) Dummies (book)

Here's a nice new episode of the Graphistania podcast for you: for the first time in 5 years, I was able to get the fantastically awesome Chief Scientist of Neo4j, Dr. Jim Webber, back to the podcast. Jim is a great colleague and friend, and one of the best tech public speakers in the business - especially when you want to talk Graphs and distributed systems. Over the past few months, I had the pleasure of working together with Jim on a more regular basis - as we actually wrote a book together: the Graph Databases for Dummies book. It was announced on the Neo4j blog, and seems to have been doing really well in the past few weeks. Some of you may remember that Jim co-wrote The O'Reilly book on Graph Databases, and I wrote Learning Neo4j by Packt (2nd edition together with Jérôme Baton) - and we have had a bit of friendly banter going back and forth about the quality of both artifacts :) ... it has been a ton of fun.

So here's the chat that we recorded about the new book - hope you enjoy it as much as we did.

Graphistania · Graphistania 2.0 - Episode 9

Here's the transcript of our conversation:

RVB - 00:00:00.151 Hello, everyone. My name is Rik, Rik Van Bruggen, from Neo4j, and here I am again recording another episode of our Graphistania podcast. And this is a special one. This is a special episode, one that we've been talking about for some time, because I have a very special guest on this show, and that is my dear friend and colleague Jim Webber. Hey, Jim.

Using Apache Zeppelin with Neo4j to analyse the FinCEN Files

Last week, we got another great and widely publicised case of Graph Databases' usefulness thrown our way. The ICIJ published their FinCEN Files research, and on top of allowing you to explore the data on their website they also published an anonymised subset of the data as a series of CSV/JSON files. My friends and colleagues Michael Hunger, Will Lyon and the rest of the team helped with the process of making this subset available as a Neo4j database (see this github repo), and there's even a super easy FinCEN Files Neo4j Sandbox that you can spin up in no time for some investigation fun.

So of course I had to take this data for a spin myself - it seems really important to me that more eyeballs are looking at this, and more people exposing the sometimes very questionable behaviour of the world's largest financial institutions.

Introducing Zeppelin

I had heard of some great technology a while ago that would allow people to use their data in a very different way, by looking at these interactive webpages that would interact with a Neo4j database.

Exponential growth in Neo4j

With the current surges of the Covid-19 Pandemic globally, there is a huge amount of debate raging in our societies - everywhere. It’s almost as if the duality between left and right that has been dividing many political spectra in the past few years, is now also translating itself into a duality that is all about more freedom for the individual (and potentially - a higher spread of the SARS-CoV-2 virus), versus more restrictions for the individual. It’s such a difficult debate - with no clear definitive outcome that I know of. There’s just too many uncertainties and variations in the pandemic - I personally don’t see how you can make generic statements about it very easily.

One thing I do know though, is that very smart and loveable people, in my own social and professional circle and beyond, seem to be confused by some of the data. Very often, they make seemingly rational arguments about the numbers that they are seeing - but ignore the fact that we are looking at an Exponential Growth problem. In this post, I want to talk about that a little bit, and illustrate it with an example from the Neo4j world.

What is Exponential Growth exactly?

Let’s take a look at the definition from good old Wikipedia:

Exponential growth is a specific way that a quantity may increase over time. It occurs when the instantaneous rate of change (that is, the derivative) of a quantity with respect to time is proportional to the quantity itself. Described as a function, a quantity undergoing exponential growth is an exponential function of time, that is, the variable representing time is the exponent (in contrast to other types of growth, such as quadratic growth).

The basic functions that are being entertained here are very simple in terms of the maths:
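The formulas from the original post are not reproduced in this excerpt. As a stand-in, here is the standard textbook form that matches the Wikipedia definition quoted above. The symbols x_0 (starting quantity), k (continuous growth rate), r (growth rate per period) and T_2 (doubling time) are notation chosen for this sketch, not necessarily the notation used in the post.

```latex
% Rate of change proportional to the quantity itself
\frac{dx}{dt} = k\,x
\quad\Longrightarrow\quad
x(t) = x_0\, e^{k t}

% Equivalent discrete form: growth by a fixed fraction r per period
x_t = x_0\,(1 + r)^t, \qquad k = \ln(1 + r)

% Doubling time: the time after which the quantity has doubled
x(T_2) = 2\,x_0
\quad\Longrightarrow\quad
T_2 = \frac{\ln 2}{k}
```

The doubling time does not depend on x_0: whatever the current level, it doubles again after the same interval. That is exactly the property that is easy to overlook when reading absolute numbers.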

OpenTrials in Neo4j - with a simple ETL job

I have been meaning to write about this for such a long time. Ever since the lockdown happened, I have been wanting to take a look at a particular biomedical dataset that looks extremely interesting to me: the OpenTrials dataset. If you are not familiar with this yet, this is what they say:

OpenTrials is a collaboration between Open Knowledge International and Dr Ben Goldacre from the University of Oxford DataLab. It aims to locate, match, and share all publicly accessible data and documents, on all trials conducted, on all medicines and other treatments, globally.

It's a super interesting initiative, and it really flows from the idea that in much of the very intensive, expensive biomedical research, we should be looking at how to better use and re-use the knowledge that we are building up. Kind of like what people in the CovidGraph.org initiative, het.io (remember the interview I did with Daniel - so great!) and others are doing.

Downloading and restoring the dataset

It's a bit hidden, but you can actually download a (slightly older, but still) dump of the OpenTrials dataset from their website. The dataset is actually a Postgres dump file: I got the latest one from http://datastore.opentrials.net/public/opentrials-api-2018-04-01.dump.

Graphistania 2.0 - Episode 8 - The one after the Covid-summer

Not sure if we should be happy or sad - but hey - the Covid-19 summer of 2020 is almost behind us. Like most people, I found it quite a strange and unusual summer, with very few foreign adventures (although I did manage to squeeze in a cycling/camping trip to the French Alps in July), lots of cycling, some great family time... and of course lots of time with graphs :) ...

So that means that we are also kicking the Graphistania podcast back into gear - here's the next episode for you:

Graphistania · Graphistania 2.0 - Episode 8

Here's the transcript of our conversation:

RVB: 00:00:15.863 [music] Hey, Stefan, I do need to ask you for consent, I think, right?
SW: 00:00:19.847 Hi. Yeah, I consent. [laughter] This is always the weird moment.
RVB: 00:00:24.727 Exactly. I thought, "Start with that one again."
SW: 00:00:28.436 Exactly, just to create a little bit of tension in the air.

Graphistania 2.0 - Episode 7 - The one after the Covid-19 lockdown

Yes! We were able to record and publish another episode of our Graphistania podcast. It's been an amazing and turbulent couple of months - but before the summer holiday season really takes off we wanted to get this to you.

Wishing you a fantastic and relaxing time - and in the meantime enjoy this episode!

Graphistania · Graphistania 2.0 - episode 7

Here's the transcript of our conversation:

Executives of Belgian Public Companies - revisited!

A long time ago, when dinosaurs roamed the earth and Neo4j was just a tiny cute little junior graph database ;-), I wrote a two-part blogpost about a newspaper article that I had come across in De Tijd about the network of executives of Belgian public companies. You can find the articles over here: Part 1 and Part 2. Turns out - and I really was not aware of this until recently - that the newspaper has been running this type of publication on a yearly basis. Here's another article from 2018 on De Tijd’s website.

So imagine my surprise a few weeks ago, when I was contacted by one of the authors of that article, Thomas Roelens, to verify some info for the 2020 edition of this analysis. We had a great chat, and Thomas basically asked me to double check some of the analysis that he had done himself already. So, contrary to what happened in 2017 (where I had to dig into the HTML source to download the info from the website), Thomas just sent it to me, and basically allowed me to take it for a spin :) ...

Meanwhile, Thomas' article has been published in the newspaper: you can find it over here or over here. But here's my update below too.

What VAT Fraud Detection and Contact Tracing have in common

In the previous blogpost we already illustrated in some detail that the contact tracing graph that we built has a lot of similarities with a product recommendation system graph. We focused on the Person-Visit-Place triangle that we had built in our Contact Tracing Graph data model, and converted the red and yellow bits into a Person-Purchase-Product triangle.

There is of course another part to the contact tracing graph that is also very interesting: the Person-Meets-Person subgraph. We derived that graph from the original contact tracing graph, by assuming that if a Person had visited a Place at the same time as another person, they would have been likely to have had a meeting there. This Person-Meets-Person subgraph was the basis for most of our graph analytics.

What Recommender Systems and Contact Tracing have in common

With the Covid-19 pandemic raging in the past few months, I have had a lot of interesting conversations about the use of graph technology and how it could help the world be a better, safer, healthier place. At Neo4j, we even put in place a specific Graphs4Good program, helping out where we can. There's splendid research going on at Covidgraph.org, companies like Elsevier chipping in (and using Neo4j) as well, and I have tried to write up my humble thoughts on how Contact Tracing could really benefit from using graphs as well. See some of my recent posts published on this blog.

Looking at that work, however, I always had the feeling that I was looking at an excellent example of something else: an excellent example of a great "graph problem". The contact tracing example is a great fit for a tool like Neo4j, and the reason why that is the case is basically because the problem that we are trying to solve with contact tracing (understanding the pandemic spread in our societies, predicting potential evolutions of the pandemic based on contacts between healthy and sick individuals, protecting the healthcare systems by managing the rate of spreading this way) is very much suited for analysis with graph technology. It is a domain where the links between people, the links between people and places, their visits, their meetings are the main important data entities that we need to look at. It's the connections that matter. It's the connections that are becoming the "equal citizens" in the dataset - and therefore we need to spend time and resources analysing them.

But of course I know one thing for sure: there are plenty of other cases that are like that, that are true "graph problems" and that could really benefit from a graph approach to solving them. We know that from all the Neo4j projects that we have been running for years. So how do I demonstrate that? How do I show that Contact Tracing is essentially the same thing as a recommendation engine? Or another graph application that we have come to know and love? Let's explore that.

Creating a Contact Tracing Testbed with Neo4j and Faker

Over the past few weeks and months, I have been living through the Covid-19 pandemic like many others. It's not been easy - but at the same time I feel very fortunate to have been able to stay healthy, active, working, and connected. There's a lot of people out there that are a lot less fortunate, and my heart goes out to them.

On this blog, I have been writing about using graphs for Contact Tracing quite a bit. See

Fortunately, these articles were very well received by the community - we have had a ton of discussions with a variety of different individuals, companies and governments about how to use this technology so that the next lockdown would not again require immobilising so many healthy people. If the pandemic's second wave hits, we all want people at risk / sick people to be separated from the healthy population, and to manage the spread of the disease in this way. But all of that requires contact tracing to be effective and operational - which is not a trivial thing to do.

This is why I have been looking at creating a very easy to use testbed for Contact Tracing in Neo4j. I wanted to make it super easy for people to create synthetic contact tracing datasets, and then work with them to gain experience - valuable experience for the "real deal" when we have to manage that. That's what this post is about.
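To give a feel for what such a synthetic dataset could look like, here is a minimal Python sketch. It is not the method used in the post: it uses only the standard library's `random` module rather than Faker. The function name `make_dataset`, the sizes and the date are hypothetical, while the property names (`healthstatus`, `starttime`, `endtime`) follow the queries shown elsewhere on this page.

```python
import random
from datetime import datetime, timedelta


def make_dataset(n_persons=20, n_places=5, n_visits=60, seed=42):
    """Generate persons, places and visits as plain dictionaries."""
    rng = random.Random(seed)  # seeded, so the dataset is reproducible
    persons = [
        {"id": i, "name": f"Person {i}",
         "healthstatus": rng.choice(["Healthy", "Sick"])}
        for i in range(n_persons)
    ]
    places = [{"id": i, "name": f"Place {i}"} for i in range(n_places)]

    day_start = datetime(2020, 4, 1, 8, 0)  # arbitrary reference moment
    visits = []
    for i in range(n_visits):
        # Each visit starts within 10 hours of day_start and lasts 5-119 minutes.
        start = day_start + timedelta(minutes=rng.randrange(0, 600))
        end = start + timedelta(minutes=rng.randrange(5, 120))
        visits.append({
            "id": i,
            "person": rng.choice(persons)["id"],
            "place": rng.choice(places)["id"],
            "starttime": start,
            "endtime": end,
        })
    return persons, places, visits


persons, places, visits = make_dataset()
print(len(persons), len(places), len(visits))
```

Rows like these could be written out as CSV files and loaded into the database from there. Seeding the generator means everybody who runs the sketch works with the same data.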

Contact tracing guide for the Neo4j Browser

Based on the past two blogposts on (Covid-19) contact tracing (see here for the posts, here for the movies), I thought it would be a good idea to pick up an old skill - to create a Browser Guide for Neo4j for people to look at this dataset example more easily. I did this a long time ago for my beergraph as well, so why not do it for the contacttracinggraph :) ...

About the Neo4j Browser and Browser guides

Here's what this is: with Neo4j, the native graph database, we always ship a default user interface called the "Neo4j Browser". It's an interactive application that communicates with the database, and that essentially allows you to fire off Cypher queries and look at / manipulate the contents of your database. Read up about it over here. Once you have done that you will realise that the Browser is actually more than that: it's also a great way for people to learn more about Neo4j, and has a built in mechanism to share "guides" to various topics. If you experiment a bit with the following commands:

| Title | Description | Command |
|---|---|---|
| Intro | A guided tour of Neo4j Browser | :play intro |
| Concepts | Graph database basics | :play concepts |
| Cypher | Neo4j’s graph query language introduction | :play cypher |
| The Movie Graph | A mini graph model of connections between actors and movies | :play movie graph |
| The Northwind Database | A classic use case of RDBMS to graph with import instructions and queries | :play northwind graph |

you will get to see a number of topics that allow you to familiarise yourself with it really easily. Most of these guides are either built in or available for serving from a webserver. But: you can also develop these guides yourself. There's a really nice worked example over here, but the process really is dead simple:

(Covid-19) Contact tracing follow-up - demo movies

In my previous post I outlined the 4 different blogposts that I wrote about using the Neo4j Graph Database for Contact Tracing. Each of these posts is actually interesting in and of itself, and actually makes for a really nice demo of the capabilities in Neo4j. So I created those today - and put them on a Youtube playlist for you:

(Covid-19) Contact tracing - an amazing graph problem & rabbit hole

In the past couple of days, I have been working with several of my colleagues on a number of projects, all around the world, that are preparing our societies for a post-lockdown strategy that will allow us to keep the Covid-19 pandemic under control, and still regain some of our freedoms. This will be tricky, for sure, but as in so many problems, technology can probably assist.

That's why I started experimenting with how a graph database like Neo4j could help with this. Some of the tracing problems that we will face, are uniquely well suited for a graph database approach: it allows for us to see and understand the indirect contacts that healthy and sick people may have had with one another, and the effects that this could cause in our environments. It also allows for some unique predictive analytics: the structure of our contacts, the network/graph that it constructs, actually says a lot about the importance that parts of the network may play in the evolution of the pandemic. Graph Data Science can give us pointers as to where this should direct our policies.

This has ended up being quite an extensive piece of work. In order to keep it readable, I have cut it up into 4 blogposts, which I will put up all at the same time:

Part 1: how I go about creating a synthetic dataset, and import that into Neo4j
Part 2: how I can start running some interesting queries on the dataset, making me understand some of the interesting data points in there and questions that one might ask
Part 3: how I can use graph data science on this dataset, and understand some of the predictive metrics like pagerank, betweenness and use community detection to direct policies
Part 4: a number of loose ends that I touched on during my exploration - but surely did not exhaust.

There's so much potential in this dataset, and in this problem domain in general. I feel like I have gone into the rabbit hole and have just resurfaced for some air. But who knows, maybe I will dive back in and do some more digging - after all, this is interesting stuff, and I love working on interesting topics.

Hope this is as interesting for you as it was for me.

All the best

Rik

Note that these demos will require the following environment:

Neo4j Desktop 1.2.7, Neo4j Enterprise 3.5.17, apoc 3.5.0.9, gds 1.1.0, or
Neo4j Desktop 1.2.7, Neo4j Enterprise 4.0.3, apoc 4.0.0.6 (NOT later! a bug in apoc.coll.max/apoc.coll.min needs to be resolved)

(Covid-19) Contact Tracing Blogpost - part 4/4

Part 4/4: Some loose ends for the Contact Tracing graph

In this last part of this blogpost series, I wanted to quickly articulate some interesting points that I found useful during these experiments.

Using the geospatial data for some additional insights

You may remember that back in part 1, I imported some geospatial properties into our graph - assigning coordinates to all of the Places nodes that we have in the graph. Clearly this also opens up further possibilities for additional analysis, which I have not explored yet in the previous posts. Suffice to say that this data is super easy to work with in Neo4j. Just run a query like this:

match (pl:Place) return pl.id, pl.name, pl.type, pl.location limit 10;

And you can see that the pl.location property has a real geospatial data type that I can use:
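One typical use of such a property is computing the distance between two places. The sketch below shows the general idea in plain Python with the haversine formula. It assumes the coordinates are latitude/longitude pairs in degrees, and makes no claim to reproduce the database's own distance function exactly. The two coordinate pairs are hypothetical examples, not values from the dataset.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Hypothetical coordinates for two places (not taken from the dataset)
place_a = (51.2194, 4.4025)
place_b = (50.8503, 4.3517)
print(round(haversine_m(*place_a, *place_b) / 1000, 1), "km")
```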

Part 3/4: Graph Analytics on the contact tracing graph

Note that these queries require the following environment: Neo4j Desktop 1.2.7, Neo4j Enterprise 3.5.17, apoc 3.5.0.9 and GDS 1.1. At the time of writing, Neo4j 4.0.3 is not yet supported by GDS 1.1.

One of the fantastic qualities of the graph data model, I have always found, is that it can give you interesting insights - without even looking at the data. The structure of the network can give you some really interesting new revelations, that you would not even have considered before. That is why Neo4j has invested a ton of effort in providing our industry with a completely new set of capabilities that allow us to discover these structural insights more easily - in the form of a new Graph Data Science Library. We have recently released the product, and you should read up on it in detail, and I think it would be a great and interesting idea to explore it on this Contact Tracing dataset that we have built in part 1 and queried in part 2.

Some data prep for analytics: inferring a new relationship

In order to do that, there's actually something that's missing: a new relationship between two Persons, which captures the inferred fact that two people have MET. We can do that based on the overlap time of their visits to the same place - therefore leveraging a query from part 2. This is what we are going to do: create a MEETS relationship between 2 Person nodes, based on the overlap - and we do that like this:

match (p1:Person)-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(p2:Person)
where id(p1)<id(p2)
with p1, p2, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
with p1, p2, sum(minEnd-maxStart) as meetTime
create (p1)-[:MEETS {meettime: duration({seconds: meetTime/1000})}]->(p2);

As you can see, we are storing the length of the inferred meeting as a duration property on the relationship. The result appears very quickly:
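The overlap rule used in the query above (latest start, earliest end, keep the pair only if the start is not after the end) can be checked in isolation. Here is a minimal Python sketch of that same arithmetic, with times in epoch milliseconds as in the query. The function names and the sample visits are made up for illustration.

```python
def overlap_millis(start1, end1, start2, end2):
    """Overlap of two visits in milliseconds, or None if they do not overlap."""
    max_start = max(start1, start2)  # the later of the two arrivals
    min_end = min(end1, end2)        # the earlier of the two departures
    # Same condition as `where maxStart <= minEnd` in the query.
    return min_end - max_start if max_start <= min_end else None


def meet_time_seconds(visit_pairs):
    """Sum the overlaps of all co-visits of one pair of people, in seconds.

    Unlike the query, which creates no relationship when nothing overlaps,
    this helper simply returns 0.0 in that case.
    """
    total = 0
    for (s1, e1), (s2, e2) in visit_pairs:
        millis = overlap_millis(s1, e1, s2, e2)
        if millis is not None:
            total += millis
    return total / 1000


# Two co-visits by the same two people (times in milliseconds):
pairs = [
    ((0, 1_800_000), (1_200_000, 3_600_000)),  # 0-30 min vs 20-60 min
    ((0, 600_000), (900_000, 1_200_000)),      # 0-10 min vs 15-20 min
]
print(meet_time_seconds(pairs))  # 600.0
```

The first pair overlaps for 10 minutes and the second not at all, so the total meeting time is 600 seconds. Two visits that merely touch (one ends exactly when the other starts) count as a zero-length meeting, because the comparison is `<=`.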

Part 2/4: Querying the contact tracing graph

Note that these queries require the following environment: Neo4j Desktop 1.2.7, Neo4j Enterprise 3.5.17, apoc 3.5.0.9 or Neo4j Enterprise 4.0.3, apoc 4.0.0.6 (NOT later! a bug in apoc.coll.max/apoc.coll.min needs to be resolved)

In Part 1 we created and imported a contact tracing graph. Now, we are ready to experiment with some interesting graphy queries.

The most interesting part about many of these queries, I find, is that they all rely on the fundamental principle of "hypothesis-free querying". What I mean by this is that graph queries, in my experience and opinion, have this wonderful quality about them that you can actually interact with the data in a way that does not require you to hypothesize too much about the structure of the dataset. This is important, because very often I just won't know what I don't know, and making meaningful hypotheses is actually really hard and complicated. The fact that we don't have to do that is a great win.

As always, you will find all the queries on GitHub, so that you can have a play with them yourself as well. So let's dive right into it.

Who has a sick person potentially infected

To answer that, I will "grab" a sick person from the dataset, and then just walk the dataset from the person to the other persons that are currently healthy. The query goes like this:

match (p:Person {healthstatus:"Sick"})
with p
limit 1
match (p)--(v1:Visit)--(pl:Place)--(v2:Visit)--(p2:Person {healthstatus:"Healthy"})
return p.name as Spreader, v1.starttime as SpreaderStarttime, v1.endtime as SpreaderEndtime, pl.name as PlaceVisited, p2.name as Target, v2.starttime as TargetStarttime, v2.endtime as TargetEndtime;
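To make the shape of that walk concrete, here is a small Python sketch of the same pattern: Person to Visit to Place, and back through another Visit to a healthy Person. It runs over a toy in-memory dataset. The names, places and times are invented for illustration. Like the query, it matches on the shared place only and does not yet check whether the two visits overlapped in time.

```python
# Toy data, invented for illustration (not taken from the real dataset).
healthstatus = {"Ann": "Sick", "Bob": "Healthy", "Cem": "Healthy", "Dee": "Sick"}

# Each visit: (person, place, starttime, endtime), with times as hours of the day.
visits = [
    ("Ann", "Cafe", 9, 10),
    ("Bob", "Cafe", 9, 11),
    ("Dee", "Cafe", 14, 15),
    ("Ann", "Gym", 12, 14),
    ("Cem", "Gym", 12, 13),
]


def potentially_infected(spreader):
    """List healthy people who visited a place that the spreader also visited."""
    rows = []
    for p1, place1, start1, end1 in visits:
        if p1 != spreader:
            continue
        for p2, place2, start2, end2 in visits:
            # Same place, and the other visitor is currently healthy.
            if place2 == place1 and healthstatus[p2] == "Healthy":
                rows.append({
                    "Spreader": p1,
                    "SpreaderStarttime": start1,
                    "SpreaderEndtime": end1,
                    "PlaceVisited": place1,
                    "Target": p2,
                    "TargetStarttime": start2,
                    "TargetEndtime": end2,
                })
    return rows


for row in potentially_infected("Ann"):
    print(row)
```

With this toy data, Ann's visits lead to Bob (via the Cafe) and Cem (via the Gym). Dee is skipped because she is already sick.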

Part 1/4: creating and importing a synthetic contact tracing graph

As we are living in these very interesting times, and many countries are still going through a massive operation to slow down the devastating effects of the SARS-CoV-2 virus and its CoViD-19 effects, there is of course also a lot of discussion already going on what we will do after the initial surge of the virus has passed, and when the various countries and regions will start opening up their economies.

A tactic many countries seem to be taking is the implementation of some kind of Contact Tracing. Using the technology on our phones and our pervasive internet connectivity, we could imagine a way to implement "distancing" and isolation of people that are either already victim of, or vulnerable to, CoViD-19. This seems like a logical, and useful tactic, that could help us to open up our economies for business, while still maintaining the basic attitude of wanting to "flatten the curve". Of course there are still many, many issues with this approach, not in the least with regards to patient privacy and political freedoms, but it seems like an interesting track to explore, at least. Many government organisations have therefore started to explore this, and are working with some of the industry giants like Google and Apple to make this a reality.

This evolution started a whole range of discussions inside Neo4j, especially with regards to the usefulness of a graph database to make sense of some of these contact traceability databases. I remember reading Christakis and Fowler's Connected book, and understanding that virus outbreaks are one of those cases where our direct contacts don't necessarily matter - or at least not matter alone. Indirect contacts, between our friends' friends' friends, can be just as important. So lots of interesting, graph-oriented questions arise: How could we maximise the effect of our distancing measures, and of any contact tracing applications that we put in place? How could we use the excellent and predictive power of the graph to find out which of a person's connections could be most risky? How can we use graph analytics to better understand the structural power and weakness of our social networks? And many more.

So, being locked down myself (although Belgium clearly has a much softer stance than for example France or Italy), I thought I would spend some time exploring this. That's what this blogpost series is going to be about - so let's get right to it.

Graphistania 2.0 - Episode 6 - The One with the CovidGraph

So, when I started working with Graphs in 2012, one of the first community use cases that I encountered was all about biotech. I met a few people from the University of Ghent, who were working on some amazing protein interaction networks - and it was fascinating. Over the years, we have done quite a few activities on this, and we have kind of built a nice life sciences and healthcare community around Neo4j. Some amazing work is being done there.

One of the most amazing cases out there, has been the use case of the German Center for Diabetes Research, who have been scouring the scientific universe for ways of finding cures against diabetes. Look at this brief video or read this article to know more about it:

Why am I telling you this? Well, with the global Covid-19 pandemic sweeping around the globe, and many of us being affected in small or big ways, our Neo4j Graph Community has been doing the most interesting things to try and apply the "power of the graph" to this complex and intricate problem. Take a look at covidgraph.org for their work. When I learned about it, I immediately thought about talking to some of the "chief instigators" and inviting them for a podcast interview - which we made happen at record speed :) ...

So here it is: a chat about Covid-19, and about how graphs will help us make sense of the data. Let's hope it proves to be useful.

Supply Chain Management with graphs: part 3/3 - some SCM analytics

I've been looking forward to writing this: this is the last of 3 blogposts that I have been planning to write for weeks about my experiments with a realistic Supply Chain Management Dataset. There's two posts before this one:

In the first post I found and wrangled a dataset into my favourite graph database, Neo4j
In the second post I got acquainted with the dataset in a bit more detail, and I was able to do some initial querying on it to figure out what patterns I might be able to expose.

In this third and last post I would like to get a bit more analytical with the dataset, and do some more detailed investigation in order to better understand some typical SCM questions. Note that I am far from a Supply Chain specialist - I barely understand the domain, and therefore I will probably be asking some silly questions initially. But bear with me - and let's explore and learn, right?

Supply Chain Management with graphs: part 2/3 - some querying

So in the previous post, we got introduced to a dataset that I have been wanting to get into Neo4j for a long time: a Supply Chain Management dataset. Read up about it over here, but the long and short of it is that we got ourselves into the situation where we have an up and running Neo4j database with 38 different multi-echelon supply chains. Result!

As a quick reminder, here's what the data model looked like after the import:

Or visually:

Data validation and profiling

The first thing to do when you have a new shiny dataset like that, is of course to get a bit of a feel for the data. In this case, it really helps to understand the nature of the different SupplyChains - as we know from the original Excel file that they are quite different between the 38 of them. So let's do some profiling:

match (n) return distinct labels(n), count(*)

Supply Chain Management with graphs: part 1/3 - data wrangling and import

Alright, I have been putting the writing of this blogpost off for too long. Finally, on this sunny Saturday afternoon where we are locked inside our homes because of the Covid-19 pandemic, I think I'll try to make a dent in it - I have a lot of stuff to share already.

The basic idea for this (series of) blogpost(s) is pretty simple: graph problems are often characterised by lots of connections between entities, and by queries that touch many (or an unknown quantity) of these entities. One of the prime examples is pathfinding: trying to understand how different entities are connected to one another, understanding the cost or duration of these connections, etc. So pretty quickly, you understand that logistics and supply chain management are great problems to tackle with graphs, if you think about it. Supply Chains are graphs. So why not store and retrieve these chains with a graph database? Seems obvious.

We've also had lots of examples of people trying to solve supply chain management problems in the past. Take a look at some of these examples:

And of course some of these presentations from different events that we organised:

Our friends at Caterpillar used Neo4j for this:
TransparencyOne actually built a business on it:

So I had long thought that it would be great to have some kind of a demo dataset for this use case. Of course it's not that difficult to create something hypothetical yourself - but it's always more interesting to work with real data - so I started to look around.

Graphistania 2.0 - Episode 5 - This Month in Neo4j

Friends.

These are interesting times. These are difficult times, but we can deal with it together, as a community, as a graph. So that's why we were super happy that, just as Belgium was going into lockdown last week, we were able to record another Graphistania podcast episode for you, talking about the world in general, but also covering some of the amazing graph use cases that drifted over our screens in the past month, in the This Week in Neo4j (TWIN4J) newsletter.

There were actually many things to talk about, in terms of fascinating graph use cases, and I will highlight only the most striking ones here.

Our friends at Kineviz did some really interesting and timely work on COVID-19 temporal and spatial data visualization. This stuff is really important to understand, as pandemic spreads clearly follow graph patterns. Read Connected if you are not convinced.

Worth highlighting: Bloodhound: Windows network penetration testing with Neo4j, had a new release that you might want to take a look at. If you are not familiar with Bloodhound yet, you may also want to check out my interview with the Bloodhound crew on this podcast a while back.

We published this fun little thing called a Neo4j Treasure Map - check it out!

Finally - we also have a Winegraph! It's a great example of importing data from the web using Norconex.

Some interesting stuff on using Neo4j for Gene ID mapping: take a look!

Another example of enriching graphs with Wikidata, from the one and only Mark Needham: look at Mark's blog over here!

Don't forget: we Introduced the Neo4j Graph Data Science plugin with examples from the "Graph Algorithms" book.

A really interesting tweet about a visualisation of the US Supreme court as a graph db... Would love to see more like that.

And for some fun: Pokégraph: Gotta Graph 'Em All!

Some important stuff: we did a great 4.0 webinar that is giving you a lot of info on what to expect in the new version of Neo4j.

There was a great update to NeoMap: Visualizing shortest paths with neomap ≥ 0.4.0 and the Neo4j Graph Data Science plugin.

Those were the most important ones. So let's talk about these now - I am sure there's a lot of cool stuff here for everyone!

Graphistania 2.0 - Episode 4 - This Month in Neo4j

Yey! My friend StefanW and I got round to recording another Graphistania episode, episode 4 already - time flies when you are having fun! This month, again, we have so much great content popping up in the This Week in Neo4j (Twin4j) newsletter, that we could probably fill a few hours talking about it. So in the podcast, we will only talk about a handful - covering things like

the great momentum that we have in the Neo4j community! Take a look at initiatives like Global Graph Celebration Day, and listen to Neo4j's mad scientist Michael Hunger on graphs, databases and relationships on a different podcast.
the start of the Neo4j Graphtour, for the 3rd year already, in Amsterdam. This year this actually coincided with the launch of Neo4j 4.0 - one of the best releases of Neo4j that I have personally ever witnessed. I have actually been writing a bit about this myself - read about how I child-proofed my beergraph with fine-grained security and how I did something similar by adding security to a fraud investigation graph. There are other articles as well, like When and how to implement Sharding in Neo4j 4.0.
Various personal graph projects, like Mark's Australian Open graph and the QuickGraph #3 on Itsu Allergens. I love how Mark summarized it: most of the analysis could be done in relational, but "Things got more interesting in the last section where we did set analysis. I found having the data in a graph structure made was helpful for answering these questions, especially when we were looking for the non existence of a relationship."
a number of Health related articles, like that visualisation of the data from the Personal Genome Project, or the Google project with the “Largest Ever” Map of Brain Connectivity
And then of course there were various "Other" posts that we really liked, like how to be Working With Spatial Data In Neo4j GraphQL In The Cloud, and the post about Aaia - AWS Identity And Access Management Visualizer And Anomaly Finder. Last but not least, you should also take a look at the Graphaware Hume platform - with an excellent powerful demo recording over here.

Experimenting with Conflicting access privileges in Neo4j 4.0

In the past couple of weeks, I have been playing around with the shiny new security features of Neo4j 4.0. They are truly interesting - both for childproofing beergraphs and for ensuring that your sensitive fraud databases are properly secured. Take a look at the previous post, and I think you will understand why.

In this post, I wanted to talk about something that I have seen so many times in my previous lives in the security industry, and that also became evident in my 4.0 research. It's got to do with conflicting security privileges. In a nutshell, this is to do with the case where

a specific user / role would receive a particular set of privileges from one policy
the same user / role would receive a different, contradictory privilege from another policy.

In that case, we need clear rules to understand what would happen. In the case of Neo4j 4.0, this is reasonably well explained as part of the documentation - see the documentation site on this topic - but in this post I will try to give you a realistic, but simple example.
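
To illustrate the kind of rule that is needed here, below is a small Python sketch of a "deny overrides grant" resolution. As far as I understand the documentation, this is the principle Neo4j follows (a DENY takes precedence over a GRANT). The sketch is only a toy model and not how Neo4j implements it: the `Privilege` class, the `is_allowed` function and the `Person` and `Account` resources are all hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Privilege:
    kind: str      # "GRANT" or "DENY"
    action: str    # e.g. "read"
    resource: str  # e.g. a (hypothetical) node label


def is_allowed(privileges, action, resource):
    """Deny-overrides resolution: allowed only if at least one GRANT
    matches and no DENY matches the same action and resource."""
    matching = [
        p for p in privileges
        if p.action == action and p.resource == resource
    ]
    granted = any(p.kind == "GRANT" for p in matching)
    denied = any(p.kind == "DENY" for p in matching)
    return granted and not denied


# Two hypothetical policies that end up on the same user / role.
policy_a = [
    Privilege("GRANT", "read", "Person"),
    Privilege("GRANT", "read", "Account"),
]
policy_b = [Privilege("DENY", "read", "Person")]

privileges = policy_a + policy_b

print(is_allowed(privileges, "read", "Person"))    # conflicting: DENY wins
print(is_allowed(privileges, "read", "Account"))   # only granted
print(is_allowed(privileges, "write", "Account"))  # never granted
```

The design choice worth noting is that nothing is allowed by default: without an explicit grant the answer is "no", and a single matching deny is enough to overrule any number of grants.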

Creating Conflict

We'll start working on this with the same database as the previous post, the fraud dataset. If you don't have it yet, just download it from this link. Once we have the database up and running as a separate user database, we can switch to the system database and create a separate user for these tests.

//create a separate user for engineering the conflicting privileges
CREATE USER conflicted_user SET PASSWORD "changeme" CHANGE NOT REQUIRED;
CREATE ROLE conflicted_role AS COPY OF reader;

Securing a sample fraud graph with Neo4j 4.0

This week, we at Neo4j formally released our brightest and shiniest new version of the Neo4j Graph Database to the world. It's been an amazing journey to this point, and others have reported on this magnificent piece of engineering in more depth. Take a look at Jim's blogpost, or if you are in a hurry, check out the graphcast below:

Last week, I started playing around with it myself - by digging up my good old faithful beergraph, and illustrating some of the new features in a childproofing exercise for beers. Take a look at that post as well for some giggles. Now in this post, I wanted to essentially do the same thing as I did on the beergraph, but using a Fraud dataset.

Let's see how that would work.

Securing my Beergraph with Neo4j 4.0

Not sure if you have realised, but Neo4j has actually recently made the 4.0 version of the most fantastically awesome graph database on the planet available. You can get it ahead of the big launch event (on February 4th, 2020 - in case you were wondering!) from the Download Center and take it for a spin.

In this unbelievable release, there are so many new features, it's kind of hard to keep track of everything. But the ones that I can most easily get my head around are clearly

multi-database support - finally, Neo4j actually has this concept of running multiple databases on one database server. A multi-tenancy solution, that has been requested and anticipated by many of our users and customers.
a VERY advanced schema-based security module, that allows people to extend the existing role-based security model of Neo4j even further - and make it crazy powerful. We'll spend a lot of time on that in this blogpost.

Readers of this blog probably know that I am a big fan of getting my feet down and dirty with our products, so this evening - with a couple of hours to spare, so to speak - I decided to try out the shiny new release. I spun up my Neo4j Desktop, and started reading some manual pages where stuff was explained. Specifically, I loved

the manual pages on managing multiple databases
the pages on authentication and authorisation
a great video by Louise Söderström from Neo4j engineering:

Soon after flipping through this, I was on my way.

Graphistania 2.0 - Episode 3 - This Month in Neo4j

Happy new year everyone - although it actually seems like the holidays are already very far behind us! But great times were had, at least in my family, and so I feel super energised to make 2020 another great start to a decade of graphs :) ... Here's to that!

It also means that we are continuing to see all these awesome community stories pop up left right and center in the Neo4j "This week in Neo4j" developer newsletter. And so on our Graphistania podcast, we are going to continue talking about these on a monthly basis. So that's what we're doing - and I have again invited my friend and colleague Stefan Wendin to join me.

From the newsletter, we always select a few stories that we think will be more interesting and/or meaningful to discuss. This month, we found a number of them, and the interesting thing was that the graph-stories seemed to play at very different scales... The Personal, Corporate, and Society levels. Here are some of the ones we liked:

At the Personal scale

Alex Woolford - Network analysis with Neo4j / Kafka / Zeek. Alex also has this really cool video on "Event driven parenting": if your kid is getting bad grades, that event will lead to no more PS4 network access :) ... Bit harsh - but still!

At the Corporate scale

IT centric example: Managing VMware infrastructure with Neo4j - great example of how to understand infrastructure dependencies in an IT environment with graphs.
Business centric examples: Analysing online customer journeys in 3D and Using Augmented Reality to create an indoor navigation system with VIROREACT. This also reminded me of these examples of using Neo4j to build digital twins for wind farms and digital twins for subsea gas.
Maybe one more - very specific to the graph world and totally optional: Keeping track of graph changes using temporal versioning. This references: Neo4j Versioner - they recently released version 2. I really like that.

At the Society scale, we saw some amazing posts:

So I think you agree that we had plenty of stuff to talk about. Let's get into that!
