Wednesday 25 November 2020
Exporting Spotify Playlists into Neo4j - and creating a little dashboard
Tuesday 24 November 2020
Graphistania 2.0 - Episode 11 - The Emil Update
Yey! I got to do it again. For the 4th time in the history of this weird thing called the Graphistania podcast, I have had the chance to spend some quality time talking to Emil Eifrem, our fearless leader and CEO of Neo4j. As last time, we actually recorded the video, so you will find the Zoom call, and the MP3 version of it, below in the blogpost - along with the habitual transcription.
Hope you will enjoy the chat as much as I did.
Thursday 12 November 2020
Graphistania 2.0 - Episode 10 - This Month in Neo4j
Hi everyone
Hope you are all well, keeping safe, and finding some time to relax and enjoy life in this wonderful rollercoaster that is 2020. Think of it this way - we will never forget this ride, EVAH!
As you can imagine, things have been evolving at warp speed in the wonderful world of graphs as well. So me and my partner in crime Stefan had another chat about all the things we have seen pop up, mostly through the awesome This Week in Neo4j (Twin4J) newsletter. Here's the chat we recorded:
Here's the transcript of our conversation:
RVB:00:00:01.448 Hello, everyone. My name is Rik, Rik Van Bruggen from Neo4j, and here I am again recording another episode of our Graphistania Neo4j podcast. Wonderful time of the day to start with this type of conversation because I have my dear friend, Stefan, on the other side of this call. Hi, Stefan. How are you?
Wednesday 4 November 2020
Making sense of 2020's mad calendar with Neo4j
As we enter November 2020, I - like many people I assume - can't help but feel quite "betwattled" by all of the events taking place this year. I took some time last weekend to look at all the crazy events that happened ... starting with pretty normal January and February, moving slowly to ominous March, and then living the weird, (semi-) locked down lives that we have been living until this very day I write this, which is the day after the bizarre US elections.
In any case, I decided to have some fun while reflecting about all this. And in my world, that means playing with data, using my favourite tools... CSV files, Google Sheets, and of course, Neo4j. Let me take you for a ride.
Starting out with my calendar
The starting point of all this is of course my Google Calendar - which is buried in online calls and meetings these days.
Tuesday 6 October 2020
Graphistania 2.0 - Episode 9 - The one about the (Graph Databases for) Dummies (book)
So here's the chat that we recorded about the new book - hope you enjoy it as much as we did.
RVB - 00:00:00.151 Hello, everyone. My name is Rik, Rik Van Bruggen, from Neo4j, and here I am again recording another episode of our Graphistania podcast. And this is a special one. This is a special episode, one that we've been talking about for some time, because I have a very special guest on this show, and that is my dear friend and colleague Jim Webber. Hey, Jim.
Tuesday 29 September 2020
Using Apache Zeppelin with Neo4j to analyse the FinCEN Files
So of course I had to take this data for a spin myself - it seems really important to me that more eyeballs are looking at this, and more people exposing the sometimes very questionable behaviour of the world's largest financial institutions.
Wednesday 23 September 2020
Exponential growth in Neo4j
One thing I do know though, is that very smart and loveable people, in my own social and professional circle and beyond, seem to be confused by some of the data. Very often, they make seemingly rational arguments about the numbers that they are seeing - but ignore the fact that we are looking at an Exponential Growth problem. In this post, I want to talk about that a little bit, and illustrate it with an example from the Neo4j world.
What is Exponential Growth exactly?
Let’s take a look at the definition from good old Wikipedia:
Exponential growth is a specific way that a quantity may increase over time. It occurs when the instantaneous rate of change (that is, the derivative) of a quantity with respect to time is proportional to the quantity itself. Described as a function, a quantity undergoing exponential growth is an exponential function of time, that is, the variable representing time is the exponent (in contrast to other types of growth, such as quadratic growth).
The basic functions that are being entertained here are very simple in terms of the maths:
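To make the definition a bit more tangible, here is a minimal Python sketch that puts linear and exponential growth side by side. The function names, the starting value of 100 and the 10% rate are my own illustrative assumptions, not figures from any real dataset.

```python
import math


def exponential_growth(x0, rate, periods):
    """Quantity after `periods` steps when it grows by `rate` (0.10 = 10%) per step."""
    return x0 * (1 + rate) ** periods


def linear_growth(x0, step, periods):
    """Quantity after `periods` steps when a fixed `step` is added per step."""
    return x0 + step * periods


def doubling_time(rate):
    """Number of periods needed to double at a constant growth rate."""
    return math.log(2) / math.log(1 + rate)


# Illustrative comparison: both series start at 100 (hypothetical values).
for t in (0, 10, 20, 30):
    print(t, linear_growth(100, 10, t), round(exponential_growth(100, 0.10, t)))
```

The point of the comparison is that both series look similar for the first few periods, and then the exponential one runs away. That is exactly the part that intuition tends to miss.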
Friday 18 September 2020
OpenTrials in Neo4j - with a simple ETL job
I have been meaning to write about this for such a long time. Ever since the lockdown happened, I have been wanting to take a look at a particular biomedical dataset that looks extremely interesting to me: the OpenTrials dataset. If you are not familiar with this yet, this is what they say:
OpenTrials is a collaboration between Open Knowledge International and Dr Ben Goldacre from the University of Oxford DataLab. It aims to locate, match, and share all publicly accessible data and documents, on all trials conducted, on all medicines and other treatments, globally.
It's a super interesting initiative, and it really flows from the idea that in much of the very intensive, expensive biomedical research, we should be looking at how to better use and re-use the knowledge that we are building up. Kind of like what people in the CovidGraph.org initiative, het.io (remember the interview I did with Daniel - so great!) and others are doing.
Downloading and restoring the dataset
It's a bit hidden, but you can actually download a (slightly older, but still) dump of the OpenTrials dataset from their website. The dataset is actually a Postgres dump file: I got the latest one from http://datastore.opentrials.net/public/opentrials-api-2018-04-01.dump.
Monday 7 September 2020
Graphistania 2.0 - Episode 8 - The one after the Covid-summer
Not sure if we should be happy or sad - but hey - the Covid-19 summer of 2020 is almost behind us. Like most people, I found it quite a strange and unusual summer, with very few foreign adventures (although I did manage to squeeze in a cycling/camping trip to the French Alps in July), lots of cycling, some great family time... and of course lots of time with graphs :) ...
So that means that we are also kicking the Graphistania podcast back into gear - here's the next episode for you:
Here's the transcript of our conversation:
RVB: 00:00:15.863 [music] Hey, Stefan, I do need to ask you for consent, I think, right?
SW: 00:00:19.847 Hi. Yeah, I consent. [laughter] This is always the weird moment.
RVB: 00:00:24.727 Exactly. I thought, "Start with that one again."
SW: 00:00:28.436 Exactly, just to create a little bit of tension in the air.
Wednesday 8 July 2020
Graphistania 2.0 - Episode 7 - The one after the Covid-19 lockdown
Monday 29 June 2020
Executives of Belgian Public Companies - revisited!
Long time ago, when dinosaurs roamed the earth and Neo4j was just a tiny cute little junior graph database ;-), I wrote a 2 part blogpost about a newspaper article that I had come across in De Tijd about the network of executives of Belgian public companies. You can find the articles over here: Part 1 and Part 2. Turns out - and I really was not aware of this until recently - that the newspaper has been running this type of publication on a yearly basis. Here's another article from 2018 on De Tijd’s website.
So imagine my surprise a few weeks ago, when I was contacted by one of the authors of that article, Thomas Roelens, to verify some info for the 2020 edition of this analysis. We had a great chat, and Thomas basically asked me to double check some of the analysis that he had done himself already. So, contrary to what happened in 2017 (where I had to dig into the HTML source to download the info from the website), Thomas just sent it to me, and basically allowed me to take it for a spin :) ...
Meanwhile, Thomas' article has been published in the newspaper: you can find it over here or over here. But here's my update below too.
Tuesday 16 June 2020
What VAT Fraud Detection and Contact Tracing have in common
Friday 12 June 2020
What Recommender Systems and Contact Tracing have in common
Tuesday 9 June 2020
Creating a Contact Tracing Testbed with Neo4j and Faker
- the original blogpost series over here,
- the demo movies that I made
- the Neo4j Browser guide that you can use
Tuesday 28 April 2020
Contact tracing guide for the Neo4j Browser
About the Neo4j Browser and Browser guides
Here's what this is: with Neo4j, the native graph database, we always ship a default user interface called the "Neo4j Browser". It's an interactive application that communicates with the database, and that essentially allows you to fire off Cypher queries and look at / manipulate the contents of your database. Read up about it over here. Once you have done that you will realise that the Browser is actually more than that: it's also a great way for people to learn more about Neo4j, and has a built-in mechanism to share "guides" to various topics. If you experiment a bit with the following commands:
| Title | Description | Command |
| Intro | A guided tour of Neo4j Browser | :play intro |
| Concepts | Graph database basics | :play concepts |
| Cypher | Neo4j’s graph query language introduction | :play cypher |
| The Movie Graph | A mini graph model of connections between actors and movies | :play movie graph |
| The Northwind Database | A classic use case of RDBMS to graph with import instructions and queries | :play northwind graph |
Friday 24 April 2020
(Covid-19) Contact tracing follow-up - demo movies
Tuesday 21 April 2020
(Covid-19) Contact tracing - an amazing graph problem & rabbit hole
That's why I started experimenting with how a graph database like Neo4j could help with this. Some of the tracing problems that we will face, are uniquely well suited for a graph database approach: it allows for us to see and understand the indirect contacts that healthy and sick people may have had with one another, and the effects that this could cause in our environments. It also allows for some unique predictive analytics: the structure of our contacts, the network/graph that it constructs, actually says a lot about the importance that parts of the network may play in the evolution of the pandemic. Graph Data Science can give us pointers as to where this should direct our policies.
This has ended up being quite an extensive piece of work. In order to keep it readable, I have cut it up into 4 blogposts, which I will put up all at the same time:
- Part 1: how I go about creating a synthetic dataset, and import that into Neo4j
- Part 2: how I can start running some interesting queries on the dataset, making me understand some of the interesting data points in there and questions that one might ask
- Part 3: how I can use graph data science on this dataset, and understand some of the predictive metrics like pagerank, betweenness and use community detection to direct policies
- Part 4: a number of loose ends that I touched on during my exploration - but surely did not exhaust.
Note that these demos will require the following environment:
- Neo4j Desktop 1.2.7, Neo4j Enterprise 3.5.17, apoc 3.5.0.9, gds 1.1.0, or
- Neo4j Desktop 1.2.7, Neo4j Enterprise 4.0.3, apoc 4.0.0.6 (NOT later! a bug in apoc.coll.max/apoc.coll.min needs to be resolved)
(Covid-19) Contact Tracing Blogpost - part 4/4
Part 4/4: Some loose ends for the Contact Tracing graph
In this last part of this blogpost series, I wanted to quickly articulate some interesting points that I found useful during these experiments.
Using the geospatial data for some additional insights
You may remember that back in part 1, I imported some geospatial properties into our graph - assigning coordinates to all of the Place nodes that we have in the graph. Clearly this also opens up further possibilities for additional analysis, which I have not explored yet in the previous posts. Suffice to say that this data is super easy to work with in Neo4j. Just run a query like this:
(Covid-19) Contact Tracing Blogpost - part 3/4
Part 3/4: Graph Analytics on the contact tracing graph
Note that these queries require the following environment: Neo4j Desktop 1.2.7, Neo4j Enterprise 3.5.17, apoc 3.5.0.9 and GDS 1.1. At the time of writing, Neo4j 4.0.3 is not yet supported by GDS 1.1.
One of the fantastic qualities of the graph data model, I have always found, is that it can give you interesting insights - without even looking at the data. The structure of the network can give you some really interesting new revelations, that you would not even have considered before. That is why Neo4j has invested a ton of effort in providing our industry with a completely new set of capabilities that allow us to discover these structural insights more easily - in the form of a new Graph Data Science Library. We have recently released the product, and you should read up on it in detail. I think it would be a great and interesting idea to explore it on this Contact Tracing dataset that we have built in part 1 and queried in part 2.
Some data prep for analytics: inferring a new relationship
In order to do that, there's actually something that's missing: a new relationship between two Persons, which captures the inferred fact that two people have met. We can do that based on the overlap time of their visits to the same place - therefore leveraging a query from part 2. This is what we are going to do: create a MEETS relationship between 2 Person nodes, based on the overlap - and we do that like this:
match (p1:Person)-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(p2:Person)
where id(p1)<id(p2)
with p1, p2, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
with p1, p2, sum(minEnd-maxStart) as meetTime
create (p1)-[:MEETS {meettime: duration({seconds: meetTime/1000})}]->(p2);
As you can see, we are storing the length of the inferred meeting as a duration property on the relationship. The result appears very quickly:
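The overlap logic in the query above is easy to lose in the Cypher syntax, so here is a minimal Python sketch of the same idea: take the latest start and the earliest end of two visits, and keep the difference if the visits actually overlap. The function names and the tuple layout of a visit are hypothetical, chosen only for this illustration.

```python
from datetime import datetime, timedelta


def overlap(start1, end1, start2, end2):
    """Return the shared time of two visits, or None if they do not overlap."""
    latest_start = max(start1, start2)
    earliest_end = min(end1, end2)
    # Same condition as `where maxStart <= minEnd` in the query.
    if latest_start <= earliest_end:
        return earliest_end - latest_start
    return None


def total_meet_time(visits_a, visits_b):
    """Sum the overlaps of two people's visits to the same places.

    Each visit is a (place, start, end) tuple; returns None if they never met.
    """
    total = timedelta(0)
    met = False
    for place_a, start_a, end_a in visits_a:
        for place_b, start_b, end_b in visits_b:
            if place_a != place_b:
                continue
            shared = overlap(start_a, end_a, start_b, end_b)
            if shared is not None:
                total += shared
                met = True
    return total if met else None
```

Note that, like the query, this treats two visits that merely touch (one ends exactly when the other starts) as a meeting of zero duration.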
(Covid-19) Contact Tracing Blogpost - part 2/4
Part 2/4: Querying the contact tracing graph
Note that these queries require the following environment: Neo4j Desktop 1.2.7, Neo4j Enterprise 3.5.17, apoc 3.5.0.9 or Neo4j Enterprise 4.0.3, apoc 4.0.0.6 (NOT later! a bug in apoc.coll.max/apoc.coll.min needs to be resolved)
In Part 1 we created and imported a contact tracing graph. Now, we are ready to experiment with some interesting graphy queries.
The most interesting part about many of these queries, I find, is that they all rely on the fundamental principle of "hypothesis-free querying". What I mean by this is that graph queries, in my experience and opinion, have this wonderful quality about them that you can actually interact with the data in a way that does not require you to hypothesize too much about the structure of the dataset. This is important, because very often I just won't know what I don't know, and making meaningful hypotheses is actually really hard and complicated. The fact that we don't have to do that is a great win.
As always, you will find all the queries on GitHub, so that you can have a play with them yourself as well. So let's dive right into it.
Who has a sick person potentially infected
To answer that, I will "grab" a sick person from the dataset, and then just walk the dataset from the person to the other persons that are currently healthy. The query goes like this:
match (p:Person {healthstatus:"Sick"})
with p
limit 1
match (p)--(v1:Visit)--(pl:Place)--(v2:Visit)--(p2:Person {healthstatus:"Healthy"})
return p.name as Spreader, v1.starttime as SpreaderStarttime, v1.endtime as SpreaderEndtime, pl.name as PlaceVisited, p2.name as Target, v2.starttime as TargetStarttime, v2.endtime as TargetEndtime;
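For readers who do not have the graph at hand, the walk that this query performs can be sketched in a few lines of Python. The names and visits below are made-up sample data, and the function name is hypothetical. Like the query above, the sketch only checks that two people visited the same place, not that their visits overlapped in time.

```python
from collections import defaultdict

# Hypothetical sample data: health status per person, and (person, place) visits.
people = {"Alice": "Sick", "Bob": "Healthy", "Carol": "Healthy", "Dave": "Sick"}
visits = [
    ("Alice", "Cafe"),
    ("Bob", "Cafe"),
    ("Carol", "Gym"),
    ("Dave", "Cafe"),
    ("Alice", "Gym"),
]


def potentially_infected(spreader, people, visits):
    """Walk spreader -> visited places -> other visitors, keeping the healthy ones."""
    visitors_by_place = defaultdict(set)
    for person, place in visits:
        visitors_by_place[place].add(person)

    targets = set()
    for person, place in visits:
        if person != spreader:
            continue
        for other in visitors_by_place[place]:
            if other != spreader and people[other] == "Healthy":
                targets.add((place, other))
    return sorted(targets)


print(potentially_infected("Alice", people, visits))
```

In the graph database the same walk is a single pattern match, which is a large part of why this kind of question feels so natural in Cypher.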
(Covid-19) Contact Tracing Blogpost - part 1/4
Part 1/4: creating and importing a synthetic contact tracing graph
As we are living in these very interesting times, and many countries are still going through a massive operation to slow down the devastating effects of the SARS-CoV-2 virus and its CoViD-19 effects, there is of course also a lot of discussion already going on about what we will do after the initial surge of the virus has passed, and when the various countries and regions will start opening up their economies.
A tactic many countries seem to be taking is the implementation of some kind of Contact Tracing. Using the technology on our phones and our pervasive internet connectivity, we could imagine a way to implement "distancing" and isolation of people that are either already victim of, or vulnerable to, CoViD-19. This seems like a logical and useful tactic that could help us to open up our economies for business, while still maintaining the basic attitude of wanting to "flatten the curve". Of course there are still many, many issues with this approach, not least with regard to patient privacy and political freedoms, but it seems like an interesting track to explore, at least. Many government organisations have therefore started to explore this, and are working with some of the industry giants like Google and Apple to make this a reality.
This evolution started a whole range of discussions inside Neo4j, especially with regards to the usefulness of a graph database to make sense of some of these contact traceability databases. I remember reading Christakis and Fowler's Connected book, and understanding that virus outbreaks are one of those cases where our direct contacts don't necessarily matter - or at least not matter alone. Indirect contacts, between our friends' friends' friends, can be just as important. So lots of interesting, graph-oriented questions arise: How could we maximise the effect of our distancing measures, and of any contact tracing applications that we put in place? How could we use the excellent and predictive power of the graph to find out which of a person's connections could be most risky? How can we use graph analytics to better understand the structural power and weakness of our social networks? And many more.
So, being locked down myself (although Belgium clearly has a much softer stance than for example France or Italy), I thought I would spend some time exploring this. That's what this blogpost series is going to be about - so let's get right to it.
Monday 6 April 2020
Graphistania 2.0 - Episode 6 - The One with the CovidGraph
One of the most amazing cases out there, has been the use case of the German Center for Diabetes Research, who have been scouring the scientific universe for ways of finding cures against diabetes. Look at this brief video or read this article to know more about it:
So here it is: a chat about Covid-19, and about how graphs will help us make sense of the data. Let's hope it proves to be useful.
Friday 27 March 2020
Supply Chain Management with graphs: part 3/3 - some SCM analytics
- In the first post I found and wrangled a dataset into my favourite graph database, Neo4j
- In the second post I got acquainted with the dataset in a bit more detail, and I was able to do some initial querying on it to figure out what patterns I might be able to expose.
Wednesday 25 March 2020
Supply Chain Management with graphs: part 2/3 - some querying
As a quick reminder, here's what the data model looked like after the import:
Or visually:
Data validation and profiling
The first thing to do when you have a new shiny dataset like that, is of course to get a bit of a feel for the data. In this case, it really helps to understand the nature of the different SupplyChains - as we know from the original Excel file that they are quite different between the 38 of them. So let's do some profiling:
match (n) return distinct labels(n), count(*)
Saturday 21 March 2020
Supply Chain Management with graphs: part 1/3 - data wrangling and import
The basic idea for this (series of) blogpost(s) is pretty simple: graph problems are often characterised by lots of connections between entities, and by queries that touch many (or an unknown quantity) of these entities. One of the prime examples is pathfinding: trying to understand how different entities are connected to one another, understanding the cost or duration of these connections, etc. So pretty quickly, you understand that logistics and supply chain management are great problems to tackle with graphs, if you think about it. Supply Chains are graphs. So why not store and retrieve these chains with a graph database? Seems obvious.
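To illustrate the pathfinding idea, here is a minimal Python sketch of a cheapest-path search (Dijkstra's algorithm) over a tiny, entirely made-up supply chain. The node names, the costs and the function name are hypothetical, and this is not how Neo4j implements pathfinding internally; it only shows the kind of question being asked.

```python
import heapq


def cheapest_path(graph, source, target):
    """Dijkstra's algorithm over a dict of {node: {neighbour: cost}}.

    Returns (total_cost, path) or None if the target cannot be reached.
    """
    queue = [(0, source, [source])]
    settled = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue  # already reached this node at least as cheaply
        settled[node] = cost
        for neighbour, weight in graph.get(node, {}).items():
            heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None


# Hypothetical supply chain with a transport cost on every connection.
supply_chain = {
    "Supplier": {"FactoryA": 4, "FactoryB": 2},
    "FactoryA": {"Warehouse": 1},
    "FactoryB": {"Warehouse": 5},
    "Warehouse": {"Retailer": 3},
}

print(cheapest_path(supply_chain, "Supplier", "Retailer"))
```

The weights could just as well be durations instead of costs; the structure of the question stays the same.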
We've also had lots of examples of people trying to solve supply chain management problems in the past. Take a look at some of these examples:
- https://neo4j.com/graphgist/supply-chain-management
- https://neo4j.com/blog/graph-technology-supply-chain-transparency-corporate-social-responsibility/
- https://neo4j.com/news/graphing-the-supply-chain/ and https://www.enterprisetimes.co.uk/2019/09/10/graphing-the-supply-chain/
- https://neo4j.com/blog/nlp-at-scale-maintenance-supply-chain-management/
- Our friends at Caterpillar used Neo4j for this:
- TransparencyOne actually built a business on it:
Monday 16 March 2020
Graphistania 2.0 - Episode 5 - This Month in Neo4j
These are interesting times. These are difficult times, but we can deal with it together, as a community, as a graph. So that's why we were super happy that, just as Belgium was going into lockdown last week, we were able to record another Graphistania podcast episode for you, talking about the world in general, but also covering some of the amazing graph use cases that drifted over our screens in the past month, in the This Week in Neo4j (TWIN4J) newsletter.
There were actually many things to talk about, in terms of fascinating graph use cases, and I will highlight only the most striking ones here.
Our friends at Kineviz did some really interesting and timely work on COVID-19 temporal and spatial data visualization. This stuff is really important to understand, as pandemic spreads clearly follow graph patterns. Read Connected if you are not convinced.
Worth highlighting: Bloodhound: Windows network penetration testing with Neo4j, had a new release that you might want to take a look at. If you are not familiar with Bloodhound yet, you may also want to check out my interview with the Bloodhound crew on this podcast a while back.
We published this fun little thing called a Neo4j Treasure Map - check it out!
Finally - we also have a Winegraph! It's a great example of importing data from the web using Norconex.
Some interesting stuff on using Neo4j for Gene ID mapping: take a look!
Another example of enriching graphs with Wikidata, from the one and only Mark Needham: look at Mark's blog over here!
Don't forget: we introduced the Neo4j Graph Data Science plugin with examples from the "Graph Algorithms" book.
A really interesting tweet about a visualisation of the US Supreme court as a graph db... Would love to see more like that.
And for some fun: Pokégraph: Gotta Graph 'Em All!
Some important stuff: we did a great 4.0 webinar that is giving you a lot of info on what to expect in the new version of Neo4j.
There was a great update to NeoMap: Visualizing shortest paths with neomap ≥ 0.4.0 and the Neo4j Graph Data Science plugin.
Those were the most important ones. So let's talk about these now - I am sure there's a lot of cool stuff here for everyone!
Tuesday 18 February 2020
Graphistania 2.0 - Episode 4 - This Month in Neo4j
- the great momentum that we have in the Neo4j community! Take a look at initiatives like Global Graph Celebration Day, and listen to Neo4j's mad scientist Michael Hunger on graphs, databases and relationships on a different podcast.
- the start of the Neo4j Graphtour, for the 3rd year already, in Amsterdam. This year this actually coincided with the launch of Neo4j 4.0 - one of the best releases of Neo4j that I have personally ever witnessed. I have actually been writing a bit about this myself - read about how I child-proofed my beergraph with fine-grained security and how I did something similar by adding security to a fraud investigation graph. There are other articles as well, like When and how to implement Sharding in Neo4j 4.0.
- Various personal graph projects, like Mark's Australian Open graph and the QuickGraph #3 on Itsu Allergens. I love how Mark summarized it: most of the analysis could be done in relational, but "Things got more interesting in the last section where we did set analysis. I found having the data in a graph structure made was helpful for answering these questions, especially when we were looking for the non existence of a relationship."
- a number of Health related articles, like that visualisation of the data from the Personal Genome Project, or the Google project with the “Largest Ever” Map of Brain Connectivity
- And then of course there were various "Other" posts that we really liked, like how to be Working With Spatial Data In Neo4j GraphQL In The Cloud, and the post about Aaia - AWS Identity And Access Management Visualizer And Anomaly Finder. Last but not least, you should also take a look at the Graphaware Hume platform - with an excellent powerful demo recording over here.
Wednesday 12 February 2020
Experimenting with Conflicting access privileges in Neo4j 4.0
In this post, I wanted to talk about something that I have seen so many times in my previous lives in the security industry, and that also became evident in my 4.0 research. It's got to do with conflicting security privileges. In a nutshell, this is to do with the case where
- a specific user / role would receive a particular set of privileges from one policy
- the same user / role would receive a different, and contradictory privilege from another policy.
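To make the idea of a conflict concrete, here is a minimal Python sketch of one common way such conflicts get resolved: a "deny overrides grant" rule. Please treat that rule as an assumption for the sake of illustration; the function name, the tuple layout and the sample policies are hypothetical and are not Neo4j's API.

```python
def effective_access(privileges, action, resource):
    """Resolve conflicting privileges with a deny-overrides rule (an assumption).

    `privileges` is a list of (effect, action, resource) tuples, where
    effect is either "GRANT" or "DENY".
    """
    matching = [
        effect
        for effect, priv_action, priv_resource in privileges
        if priv_action == action and priv_resource == resource
    ]
    if "DENY" in matching:
        return False  # any matching deny wins over any number of grants
    return "GRANT" in matching  # no matching privilege at all means no access


# Two hypothetical policies that contradict each other for the same role.
policy_one = [("GRANT", "read", "Person.name")]
policy_two = [("DENY", "read", "Person.name")]

print(effective_access(policy_one + policy_two, "read", "Person.name"))
```

The useful property of such a rule is that adding a policy can never silently widen access that another policy explicitly took away.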
Creating Conflict
We'll start working on this with the same database as the previous post, the fraud dataset. If you don't have it yet, just download it from this link. Once we have the database up and running as a separate user database, we can switch to the system database and create a separate user for these tests.
//create a separate user for engineering the conflicting privileges
CREATE USER conflicted_user SET PASSWORD "changeme" CHANGE NOT REQUIRED;
CREATE ROLE conflicted_role AS COPY OF reader;
Friday 7 February 2020
Securing a sample fraud graph with Neo4j 4.0
Wednesday 29 January 2020
Securing my Beergraph with Neo4j 4.0
In this unbelievable release, there are so many new features, it's kind of hard to keep track of everything. But the ones that I can most easily get my head around are clearly
- multi-database support - finally, Neo4j actually has this concept of running multiple databases on one database server. A multi-tenancy solution, that has been requested and anticipated by many of our users and customers.
- a VERY advanced schema-based security module, that allows people to extend the existing role-based security model of Neo4j even further - and make it crazy powerful. We'll spend a lot of time on that in this blogpost.
- the manual pages on managing multiple databases
- the pages on authentication and authorisation
- a great video by Louise Söderström from Neo4j engineering:
Tuesday 14 January 2020
Graphistania 2.0 - Episode 3 - This Month in Neo4j
It also means that we are continuing to see all these awesome community stories pop up left right and center in the Neo4j "This week in Neo4j" developer newsletter. And so on our Graphistania podcast, we are going to continue talking about these on a monthly basis. So that's what we're doing - and I have again invited my friend and colleague Stefan Wendin to join me.
From the newsletter, we always select a few stories that we think will be more interesting and/or meaningful to discuss. This month, we found a number of them, and the interesting thing was that the graph-stories seemed to play at very different scales... The Personal, Corporate, and Society levels. Here are some of the ones we liked:
At the Personal scale
- Alex Woolford - Network analysis with Neo4j / Kafka / Zeek. Alex also has this really cool video on "Event driven parenting": if your kid is getting bad grades, that event will lead to no more PS4 network access :) ... Bit harsh - but still!
- IT centric example: Managing VMware infrastructure with Neo4j - great example of how to understand infrastructure dependencies in an IT environment with graphs.
- Business centric examples: Analysing online customer journeys in 3D and Using Augmented Reality to create an indoor navigation system with VIROREACT. This also reminded me of these examples of using Neo4j to build digital twins for wind farms and digital twins for subsea gas.
- Maybe one more - very specific to the graph world and totally optional: Keeping track of graph changes using temporal versioning. This references: Neo4j Versioner - they recently released version 2. I really like that.