Tuesday, 15 November 2022

A 2nd, better way to WorldCupGraph

Hours after publishing my previous blogpost about the WorldCup Graph, I actually found a better and more up-to-date dataset that contained all the data of the actual squads that will play in the World Cup in Qatar. I found it on this Wikipedia page, which lists all the tables with the squads, some player details, coaches etc. as they were announced on the 10th/11th of November.

So: I figured it would be nice to revisit the WorldcupGraph, and show a simpler and faster way to achieve the results of the previous exercise. I have put this data in this spreadsheet, and then downloaded a .csv version:

These two files are super nice and simple, and therefore we can use the Neo4j Data Importer toolset to import them really easily.
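For those who would rather script the import than use the Data Importer UI, a rough LOAD CSV sketch could look like the one below. Note that the filename and the column names are hypothetical - they would need to be adjusted to match the actual .csv headers:

```cypher
// Sketch only: "worldcup_squads.csv" and its column names (Country, Player,
// Position) are assumptions - rename them to match the real .csv headers.
LOAD CSV WITH HEADERS FROM "file:///worldcup_squads.csv" AS row
MERGE (c:Country {name: row.Country})
MERGE (p:Player {name: row.Player})
SET p.position = row.Position
MERGE (p)-[:PLAYS_FOR]->(c);
```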

Monday, 14 November 2022

No WorldCup without a WorldCupGraph!

Last week I was having a conversation with one of my dear Neo4j colleagues, and we were talking about the fact that Graphs are simply so much fun to play around with, and that there's nothing like a great interesting dataset to have people really experiment and acquaint themselves with the technology. I know that to be extremely true, and I think I have demonstrated this elaborately over the years on this runaway blog of mine.

Then the conversation turned to a topic that I know very little about: the FIFA World Cup in Qatar that is starting next week. Now, reading this blog you may know that I am a little addicted to my two-wheeled #mentalhealthmachine, and that chasing a ball across a field seems like a little bit of a game to me - but hey, that's ok! And with this conversation it actually dawned on me that at Neo4j, we had done "Worldcup Graphs" both in 2014 and in 2018: our friend and former colleague Mark Needham was the driving force behind both of those efforts.

You can still see some of the work that Mark did at the time on Github and Medium. It was truly another example of how a cool and timely dataset would get people to explore the wonderful world of graphs and get to know the technology in a fun and interesting way.

So: I decided that it would be nice to do that again. With all the new tech that is coming out of Neo4j with the release of Neo4j 5, it could not be very difficult, right? Let's take a look.

Thursday, 6 October 2022

A Graph Database and a Dadjoke walk into a bar...


I just published a blogpost series with 6 different articles about me having fun with Dadjokes, in an unusual sort of way. Here are the links to the articles:

All of the queries etc. are put together in this markdown document. I plan to make a Neo4j Guide out of this as well in the next few days, so that it becomes easier to use.

Hope you will have as much fun with it as I did. 

Rik

DadjokeGraph Part 6/6: Closing: some cool Dadjoke Queries

A Graph Database and a Dadjoke walk into a bar...

Now that we have a disambiguated graph of dadjokes, let's have some fun and explore it.

How many times does a joke get tweeted?

MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke)
WITH dj.Text AS Joke, count(r) AS NrOfTimesTweeted
RETURN Joke, NrOfTimesTweeted
ORDER BY NrOfTimesTweeted DESC
LIMIT 10;

How many times does a joke get favorited?

MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke)
RETURN dj.Text AS Joke, dj.SumOfFavorites AS NrOfTimesFavorited, dj.SumOfRetweets AS NrOfTimesRetweeted
ORDER BY NrOfTimesFavorited DESC
LIMIT 10;

DadjokeGraph Part 2/6: Importing the Dadjokes into the Dadjoke Graph

A Graph Database and a Dadjoke walk into a bar...

This means that we want to convert the spreadsheet that we created before, or the .csv version of it, into a Neo4j Database.

Here's how we go about this. First things first: let's set up the indexes that we will need later on in this process:

CREATE INDEX tweet_index FOR (t:Tweet) ON (t.Text);
CREATE INDEX dadjoke_index FOR (d:Dadjoke) ON (d.Text);

Assuming the .csv file mentioned above is in the import directory of the Neo4j server, we can use load csv to create the initial dataset:

LOAD CSV WITH HEADERS FROM "file:///vicinitas_alldadjokes_user_tweets.csv" AS csv
CREATE (t:Tweet)
SET t = csv;

Import the Vicinitas .csv file

Or: if you want to create the graph straight from the Google Sheet:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1MwHX5hM-Vda5o4ZQVnCv4upKepL5rHxcUrqfT69u5Ro/export?format=csv&gid=1582640786" AS csv
CREATE (t:Tweet)
SET t = csv;

DadjokeGraph Part 3/6: Taking on the real disambiguation of the jokes

A Graph Database and a Dadjoke walk into a bar...

We noticed that many of the Tweet nodes referred to the same jokes - and resolved that already above. But this query makes us understand that we actually still have some work to do:

MATCH path = (dj:Dadjoke)-[*..2]-(conn)
WHERE dj.Text CONTAINS "pyjamazon"
RETURN path;

The Amazon Dadjokes

We will come back to that example below.

We now notice that there are quite a few Dadjoke nodes that are a bit different, but very similar. We would like to disambiguate these too. We will use a couple of different strategies for this, but start with a strategy that is based on String Metrics.
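To give an idea of what such a string-metric strategy could look like: APOC ships with several text-similarity functions, for example apoc.text.levenshteinSimilarity. A minimal sketch - the 0.7 threshold is an arbitrary choice of mine, and the pairwise comparison is only workable on a smallish set of jokes:

```cypher
// Compare every pair of Dadjoke nodes once, and keep the very similar pairs.
// Sketch only - the 0.7 cutoff is arbitrary, and this comparison is quadratic
// in the number of jokes, so it only works for modest dataset sizes.
MATCH (dj1:Dadjoke), (dj2:Dadjoke)
WHERE id(dj1) < id(dj2)
WITH dj1, dj2, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS sim
WHERE sim > 0.7
RETURN dj1.Text AS joke1, dj2.Text AS joke2, sim
ORDER BY sim DESC;
```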

DadjokeGraph Part 4/6: Adding NLP and Entity Extraction to prepare for further disambiguation

A Graph Database and a Dadjoke walk into a bar...

As we can see in the pyjamazon example from before, the disambiguation of our Dadjokes has come a long way - but it is not yet complete. Hence we call the graph to the rescue here, and take it a final step further - one that will provide a wonderfully powerful example of how and why graphs are so good at analysing the structural characteristics of data, and at making interesting and amazing recommendations on the back of that.

Here's what we are going to do:

  1. we are going to use Natural Language Processing to extract the entities that are mentioned in our Dadjokes. To do that, we are going to use the amazing Google Cloud NLP Service, and call it from APOC. This will yield a connected structure that will tell us exactly which entities are mentioned in every joke.
  2. then we are going to use that graph of dadjokes connected to entities to figure out if the structure of the links can help us with further disambiguation of the jokes.
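The first step could be sketched with APOC's NLP integration roughly as follows - assuming the APOC NLP dependencies are installed and a Google Cloud API key is available in the $apiKey parameter; the HAS_ENTITY relationship type name is my own choice, not necessarily what the final graph uses:

```cypher
// Sketch, assuming the APOC NLP procedures and a GCP API key ($apiKey).
// The HAS_ENTITY relationship type name is an assumption.
MATCH (dj:Dadjoke)
WITH collect(dj) AS jokes
CALL apoc.nlp.gcp.entities.graph(jokes, {
    key: $apiKey,
    nodeProperty: "Text",
    writeRelationshipType: "HAS_ENTITY",
    write: true
})
YIELD graph
RETURN "entities extracted";
```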

So let's start with the start.

DadjokeGraph Part 5/6: Disambiguation using Graph Data Science on the NLP-based Entities

A Graph Database and a Dadjoke walk into a bar...

The next, and final, step in our dadjoke journey is going to take the disambiguation to the next level by applying Graph Data Science metrics to the new, NLP-enriched structure that we have in our graph database. The basic idea here is that, while the TEXT similarity of these jokes may be quite far apart, their structural similarity may still be quite high, based on the connectivity between the joke and its (NLP-based) entities.

Calculating the Jaccard similarity metric using GDS

To explore this, we will be using the Jaccard similarity coefficient, which is part of the Neo4j Graph Data Science library that we have installed on our server. More about this coefficient can be found on Wikipedia. The index is defined as the size of the intersection divided by the size of the union of the sample sets, which is very well illustrated on that Wikipedia page. I have used Neuler (the no-code graph data science playground that you can easily add to your Neo4j Desktop installation) to generate the code below - but you can easily run this in the Neo4j Browser as well.
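A sketch of what that could look like with GDS - the projected graph name 'dadjokes' and the HAS_ENTITY relationship type are my assumptions, to be adapted to the actual model:

```cypher
// Sketch, assuming a Dadjoke-[:HAS_ENTITY]->Entity structure from the NLP step.
// First, project the bipartite graph into the GDS graph catalog...
CALL gds.graph.project('dadjokes', ['Dadjoke', 'Entity'], 'HAS_ENTITY');

// ...then stream the node similarity (Jaccard is the default metric).
CALL gds.nodeSimilarity.stream('dadjokes')
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).Text AS joke1,
       gds.util.asNode(node2).Text AS joke2,
       similarity
ORDER BY similarity DESC
LIMIT 10;
```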

DadjokeGraph Part 1/6: Building a Dadjoke database - from nothing

A Graph Database and a Dadjoke walk into a bar...

I am a dad. Happily married, with 3 wonderful kids of 13, 17 and 19 years old. So all teenagers - with all the positive and not so positive experiences that can come with that. And in our family, I have been accused of using dadjokes - preferably at very awkward or inconvenient times. I actually like dadjokes. A lot.

So I follow a few accounts on social media that post these jokes. Like for example baddadjokes, dadjokeman, dadsaysjokes, dadsjokes, groanbot, punsandoneliner, randomjokesio, and thepunnyworld - and there are many others. These are all in this list, should you be interested. It's a very funny list.

Dadjokers List on Twitter

Tuesday, 5 July 2022

Graphs are everywhere - also in Religious Texts - part 6 and close - Analysing the Hadith Narrator Graph

So what we will do here, is we will start looking at some of the structural graph metrics that are going to give us a little bit more insight into the importance of different parts of the Hadith Narrator Graph. We will use the Neuler Graph Data Science playground to do that. Neuler is a so-called GraphApp that you can easily install into your Neo4j Desktop environment. You can download and install it, and learn more about it here.

Once installed, we will run a few easy algo's.

Pagerank centrality of the Scholars

One of the advantages of the AGGREGATED_HADITH_CHAIN relationship is that we now have a mono-partite, weighted subgraph that is very suitable for understanding which Scholars are actually more interesting than others - this is a great use case for the Pagerank algorithm. Here's how we configure it:
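In Cypher, the Neuler-generated configuration would look roughly like the sketch below - assuming the aggregated relationship carries a weight property (the 'weight' property name is my assumption):

```cypher
// Sketch: project the mono-partite Scholar graph with its aggregated,
// weighted relationships ('weight' as the property name is an assumption)...
CALL gds.graph.project(
    'scholars',
    'Scholar',
    {AGGREGATED_HADITH_CHAIN: {properties: 'weight'}}
);

// ...and run weighted PageRank over it.
CALL gds.pageRank.stream('scholars', {relationshipWeightProperty: 'weight'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS scholar, score
ORDER BY score DESC
LIMIT 10;
```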

Graphs are everywhere - also in Religious Texts - part 5 - Exploring the Hadith Narrator Graph

Before we can start that exploration, we do need to put in place a few additional indexes that have not yet been created in the import process: on the English translations of the hadiths, and on the scholar names. That's a simple operation:

CREATE TEXT INDEX hadith_text_en FOR (h:Hadith) ON (h.text_en);
CREATE TEXT INDEX scholar_name FOR (s:Scholar) ON (s.name);

Note that these indexes are not full text indexes - but they are more optimised for text fields.

The model now looks like this: 

So now we can start some querying. This query gives us a flavour of what we could find:

MATCH (:Scholar)--(n:Hadith)--(:Source)
RETURN n LIMIT 25;

Graphs are everywhere - also in Religious Texts - part 4 - Connect the Hadiths with the scholars

Now, we will connect the Hadiths with the scholars:

:auto MATCH (h:Hadith)
WITH h, split(replace(h.chain_indx," ",""),",") as scholarlist
CALL
    {
    WITH h, scholarlist
    UNWIND scholarlist as scholar
        MATCH (s:Scholar {scholar_indx: scholar})
        MERGE (s)-[:NARRATED]->(h)
    } IN TRANSACTIONS OF 100 ROWS;

Graphs are everywhere - also in Religious Texts - part 2 - import the Hadith narrators into Neo4j

The source data that we found in part 1 is in a .csv format - so that means that it basically looks tabular:

Luckily, we nowadays have some fantastic tools to import these files, without writing any code at all using the all new Neo4j Data Importer. After drawing a few nodes and relationships, I was able to do the basic import: 

It was super quick - returning a result after just a few seconds: 

I am of course sharing the Data Importer config (model and data) as a zip file as well.

As usual, there is a bit of messiness in the data still, so I had to do some wrangling to get a better/richer model.

First, we would want to split the two parents of a Scholar into different fields:

:auto MATCH (s:Scholar) 
CALL {
    WITH s
    SET s.parent1 = trim(split(s.parents,"/")[0]) 
    SET s.parent2 =  trim(split(s.parents,"/")[1])
} IN TRANSACTIONS of 1000 ROWS;

<!-- remove the brackets, introduce comma -->
:auto MATCH (s:Scholar) 
CALL {
    WITH s
    SET s.parent1 = replace(s.parent1," [",",")
    SET s.parent1 = replace(s.parent1,"]","")
    SET s.parent2 = replace(s.parent2," [",",")
    SET s.parent2 = replace(s.parent2,"]","")
} IN TRANSACTIONS of 1000 ROWS;

<!-- extract the IDs -->
:auto MATCH (s:Scholar) 
CALL {
    WITH s
    SET s.parent1_id = trim(split(s.parent1,",")[1])
    SET s.parent1 = trim(split(s.parent1,",")[0])
    SET s.parent2_id = trim(split(s.parent2,",")[1])    
    SET s.parent2 = trim(split(s.parent2,",")[0])
} IN TRANSACTIONS of 1000 ROWS;


This then allows us to create relationships between Scholars that have other Scholars as parents:

MATCH (s:Scholar)
WHERE s.parent1_id IS NOT NULL
WITH s
MATCH (parent:Scholar)
WHERE parent.scholar_indx = s.parent1_id
MERGE (s)-[:CHILD_OF]->(parent);

MATCH (s:Scholar)
WHERE s.parent2_id IS NOT NULL
WITH s
MATCH (parent:Scholar)
WHERE parent.scholar_indx = s.parent2_id
MERGE (s)-[:CHILD_OF]->(parent);


Next step is to create the marriage relationships between Scholars. To do that, we first have to split the s.spouse property and store that as a s.listofspouses:

:auto MATCH (s:Scholar)
CALL {
    WITH s
    SET s.listofspouses = split(replace(s.spouse," ",""),",")
} IN TRANSACTIONS OF 1000 ROWS;

Next, we UNWIND the s.listofspouses and get a list of scholar_indx properties that we can match and use to create the [:MARRIED_TO] relationships.

MATCH (s:Scholar)
UNWIND s.listofspouses as scholarspouse
WITH s, replace(split(scholarspouse,"[")[1],"]","") as scholarspouse_id
WHERE scholarspouse_id IS NOT NULL
MATCH (scholarspousenode:Scholar {scholar_indx: scholarspouse_id})
MERGE (s)-[:MARRIED_TO]->(scholarspousenode);

And then finally, we can create the teacher/student relationships between Scholars:

MATCH (s:Scholar)
WITH s, s.students_inds as students_of_scholar
UNWIND students_of_scholar as student
    MATCH (st:Scholar {scholar_indx: student})
    MERGE (st)-[:STUDENT_OF]->(s)
WITH s, s.teachers_inds as teachers_of_scholar
UNWIND teachers_of_scholar as teacher
    MATCH (tea:Scholar {scholar_indx: teacher})
    MERGE (tea)-[:TEACHER_OF]->(s);


After having done all of these manipulations, we actually can look at some really interesting subgraphs: 


Note: there are some additional data in the dataset (and included in the (:Scholar) nodes) like areas of interest and tags. For the purpose of this exercise - the Narrator networks and the chains of narration for each Hadith - this is not as interesting and therefore we are not splitting that information off into separate nodes and relationships. It would be trivial to do so - but unnecessary at this point.

In the next blogpost, we will go and import the actual Hadiths that are being narrated into our graph.

Looking forward already!

Rik

PS: as always all the code/queries are available on github!

PPS: you can find all the parts in this blogpost on the following links

Graphs are everywhere - also in Religious Texts - part 3 - Importing the Hadiths into Neo4j

Again, just like in part 2, we will use the Neo4j Data Importer for this. You can find the .zip file with the model and the dataset over here. In this operation we actually first create a separate subgraph for the Hadiths and the sources:

This operation returns very quickly, and offers a good view of the result: 

Graphs are everywhere - also in Religious Texts - part 1 - Introduction

This is going to be an interesting and in some ways even fascinating set of blogposts. I have thoroughly enjoyed researching it and playing around with the latest and greatest Neo4j tools while doing so, but I must say that it's also one of the first blogposts that I can remember where I am a bit uneasy about the content. Why? Because it's about, or at least in some ways touches, religion.

First let's start with some background here. Some things that you should know about me:

  • I was born and raised in Belgium, which is - or at least was - a predominantly Catholic Christian country. There's churches and chapels on every corner of the street here.
  • My parents were/are far from religious, never took me to church, but did give me many of the Catholic Christian values - and these were engrained in me even more clearly because of my 13 years in Jesuit schools in Antwerp and Turnhout.
  • as an adult, I became increasingly distanced from all religious beliefs. In my twenties and thirties I was still a "cultural Christian", I guess, as exemplified by the fact that we got married in church, and baptised all of our 3 children. In my late thirties and forties, i.e. now, I became convinced that not much good can come of religion - in general. I read Richard Dawkins, Christopher Hitchens, Sam Harris, and similar authors that have a very sceptical, atheist view on religion. And I like it that way, for me, personally.
  • that personal choice does not mean that I have something against people that still have a faith. I am totally fine with anyone believing what they want to believe - as long as they don't hurt others or impose on others in the process.

But, and here's the sensitive bit: this blogpost will be about Islam - and Muslim holy books and texts very specifically. There's no reason for me choosing to write about this specific religion - other than the fact that it came across my path and I thought some of the material was absolutely fascinating.

Tuesday, 10 May 2022

Conway's Game of Life in Neo4j

A couple of weeks ago, my Neo4j Breakfast Club friends and I were just freewheeling our way into the day, and one of my colleagues started talking to me about Conway's Game of Life. I had never heard of this thing, but was immediately fascinated. It basically allows you to simulate evolution in a rudimentary and simplified kind of way, and it's really fascinating how it works, based on a very simple set of rules (see below). There's an entire Wiki dedicated just to this "game" - it's one of the most wonderful rabbitholes on the web that I have ever seen. Just take a look at this example and you will see the idea in action:

The Game of Life, also known simply as Life, is a cellular automaton devised by the British mathematician John Horton Conway in 1970. It is a zero-player game, meaning that its evolution is determined by its initial state, requiring no further input. One interacts with the Game of Life by creating an initial configuration and observing how it evolves. It is Turing complete and can simulate a universal constructor or any other Turing machine.

So when I heard about it, I immediately thought that it would be a ton of fun to run this experiment in Neo4j. Why? Because the rules are all about connections between members of a population. Things will evolve - or not - based on their connectivity.

The rules of the Game

The whole idea of the Game is that you will create some kind of a "population" of cells in a matrix of cells. Every cell will have a maximum of 8 neighbours, and will be evolving its state (either Dead or Live) with every iteration.



So at every "turn", the game will evaluate what will happen to every cell based on a very simple set of rules:

  • Any live cell with fewer than two live neighbours dies, as if by underpopulation.
  • Any live cell with two or three live neighbours lives on to the next generation.
  • Any live cell with more than three live neighbours dies, as if by overpopulation.
  • Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction.

So then you can actually see what would happen to a population and simulate its evolution. I figured how hard could it be to emulate this in a graph, as clearly the connectivity between cells and their neighbours would lend itself to some serious graphiness.

So I took this idea for a spin.


Simulating the Game of Life in Neo4j

First we start by setting up that database.

Setup the database and the indexes

:use system;
create or replace database neo4j;
:use neo4j;
create index for (c:Cell) on (c.x);
create index for (c:Cell) on (c.y);

Then we can create the "field" that we will be playing the game in.

Create the 25x25 matrix and connect the cells

We can do this in one query, which includes two steps:

  1. we first create the cells
  2. we connect the cells using the NEIGHBOUR_OF relationship
UNWIND range(1,25) as x
    UNWIND range (1,25) as y
    CREATE (c:Cell:Dead {x: x, y: y})
WITH c 
MATCH (c2:Cell)
    WHERE c2.x-1<=c.x<=c2.x+1
    AND c2.y-1<=c.y<=c2.y+1
    AND id(c)<id(c2)
    MERGE (c)-[:NEIGHBOUR_OF]->(c2);


The graph then looks like this:



Now we are ready to start playing the game.

Seeding the graph with live cells

In the Game of Life, there's always a need to introduce the starting state of the field. We call this the seeding of the game, and there are obviously lots of ways that we could do that. Here, I will start by setting approximately 20% of the cells/nodes to Live, randomly throughout the field.

    :auto UNWIND range(1,200) as range
    CALL {
        WITH range
        MATCH (c:Cell)
        WHERE
            c.x = round(rand()*25) AND
            c.y = round(rand()*25)
        SET c:Live
        REMOVE c:Dead
    } in transactions of 10 rows;


You can now see the difference before and after by running a simple query:

match p = ()-[r:NEIGHBOUR_OF]->() return p;


As you can see from the screenshot, the "Live" nodes are a bit spread out in the field right now. This is actually a bit of an issue for the game - as it will immediately eliminate a huge part of the population upon the first iteration of executing the rules. Which in and of itself is interesting, but leads to very unpredictable behaviour in the rounds of the game. That's why my friend https://twitter.com/mesirii handed me the suggestion of starting out with a different type of seeding for the field.


Alternative seeding strategy

Michael suggested that we use an [R-Pentomino](https://conwaylife.com/wiki/R-pentomino), the smallest 5-element starting point.

First we need to reset the field to all dead - all Cells need to have the Dead Label.

match (n:Cell)
remove n:Live
set n:Dead;

Then we create the R-Pentomino using this query:

UNWIND [[1,0],[2,0],[0,1],[1,1],[1,2]] as pento
MATCH (c:Cell {x: 5+pento[0], y:5+pento[1]})
SET c:Live
REMOVE c:Dead;


This then makes the field look very different:


With this setup, we can now start playing the game for real.


Actually playing the Game of Life

Now that we have the field in its starting state, we are going to start iterating from there on using the rules. This turned out to be a little more complicated than I thought, probably a bit above my paygrade, and hence I really have to thank https://twitter.com/mesirii again for his generous help with the iteration query below.

To start, let's look at the current graph composition, by understanding how many Dead or Live cells there are currently in the system. We do that with a very simple query:

match (n)
return labels(n), count(n);


So let's start iterating by applying the rules with a Cypher query that we will run time and again.

Iterate using the rules

The query below allows you to run iterations of the rules. Here's how it works:

  • we first match for all the Cell nodes
  • then we use a CASE expression to evaluate if a particular Cell node should stay "alive" or not. There are basically three cases:
    • when a cell has 2 connections to a :Live node, the state after the iteration will depend on whether or not the cell is :Live itself. If it is :Live, then the cell will stay alive. If it is not, then it will turn dead.
    • when a cell has 3 connections, then the state after the iteration will be :Live.
    • when a cell has any other number of connections (less than 2, more than 3), then the cell will turn dead.
  • once we have that outcome, we will have two subsequent subqueries in a call statement, that will add and/or remove the :Live or :Dead labels.

It looks like this in Cypher:

match (c:Cell)
with c, 
    case size((c)-[:NEIGHBOUR_OF]-(:Live))
        when 2 then c:Live
        when 3 then true
        else false
    end as alive
call { with c, alive
    WITH * 
        WHERE alive 
            SET c:Live 
            REMOVE c:Dead
        }
call { with c, alive
    WITH *
        WHERE not alive 
            SET c:Dead 
            REMOVE c:Live
    }
return labels(c), count(c);

Now, in order to make it easy to see how the game actually works, it's easier to create a Bloom perspective to look at the result - as Bloom will allow you to see the evolution of the graph without having to completely reload the entire visualisation. In the perspective, I have set up 2 search phrases:

  • one to "Show the graph"
  • one to run the above Cypher query that would iterate over the field and allow us to see the result

So now, when you run the iterate search phrase time and time again, it will allow you to see the animation like this:


That brings me to the end of my fun experiment with our friend John Horton Conway. It has been a fascinating journey for sure - and I am sure that there are many other things that we could do to make it even more interesting to run this simulation in Neo4j.

Hope this was as fun for you as it was for me. As always

All the best

Rik

Friday, 25 February 2022

Importing (BEER) data into Neo4j - WITHOUT CODING!

Importing data into a graph structure stored in a graph database can be a real pain. Always has been, probably always will be to some degree. But we can really make the pain a lot more tolerable - and today's blogpost is going to be about just that. The reason for this is pretty great: Neo4j has just launched a new online tool that made the whole process a really easy and straightforward experience for me - take a look at it at http://data-importer.graphapp.io.

So let me try to explain how it works in the next few paragraphs.

First: find a dataset

Obviously the internet is flooded with data these days - but for this exercise I used https://datasetsearch.research.google.com/ for the first time. Amazing tool, as usual from Google. And I quickly found an interesting one that I could download from Kaggle.

This dataset contains information about different types of beers and various aspects of them, such as beer style, absolute beer volume, beer name, brewer name, beer appearance, beer taste, aroma, overall ratings, reviews, etc. - and it does so in a single .csv file with about 500k rows. Cool. 

So I was ready to take that to the importer.