Thursday, 6 October 2022

DadjokeGraph Part 6/6: Closing: some cool Dadjoke Queries

A Graph Database and a Dadjoke walk into a bar...

Now that we have a disambiguated graph of dadjokes, let's have some fun and explore it.

How many times does a joke get tweeted?

MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke)
WITH dj.Text AS Joke, count(r) AS NrOfTimesTweeted
RETURN Joke, NrOfTimesTweeted
ORDER BY NrOfTimesTweeted DESC
LIMIT 10;

How many times does a joke get favorited?

MATCH (dj:Dadjoke)
RETURN dj.Text AS Joke, dj.SumOfFavorites AS NrOfTimesFavorited, dj.SumOfRetweets AS NrOfTimesRetweeted
ORDER BY NrOfTimesFavorited DESC
LIMIT 10;

Different ways of finding jokes about cars

Let's explore 3 alternative ways to find jokes about cars.

1. Matching the text of the Dadjoke for the word "car"

MATCH (dj:Dadjoke) WHERE dj.Text CONTAINS "car" RETURN dj.Text LIMIT 10;

2. Checking if the Entity contains the word "car"

MATCH (e:Entity)--(dj:Dadjoke) WHERE e.text CONTAINS "car" RETURN dj.Text LIMIT 10;

3. Checking if the Entity equals the word "car"

MATCH (e:Entity)--(dj:Dadjoke) WHERE e.text = "car" RETURN dj.Text LIMIT 10;
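One caveat with the text-based variants: CONTAINS is a case-sensitive substring match, so "car" also hits words like "card" or "scary" and misses "Car". A minimal sketch of a stricter variant, assuming the joke text lives in dj.Text as in the queries above, is a case-insensitive regular expression with word boundaries:

```cypher
// Sketch: match "car" as a whole word, ignoring case.
// (?i) makes the match case-insensitive; (?s) lets . also match
// line breaks inside the joke text. =~ has to match the whole string,
// hence the leading and trailing .*
MATCH (dj:Dadjoke)
WHERE dj.Text =~ '(?is).*\\bcar\\b.*'
RETURN dj.Text
LIMIT 10;
```

This only matches the exact word "car"; changing the pattern to `\\bcars?\\b` would pick up the plural as well.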

Finding jokes about bikes and cars made from spaghetti

This was another great example:

MATCH p=(h:Handle)--(t:Tweet)--(dj:Dadjoke)-[r:JACCARD_SIMILAR]->()
WHERE dj.Text CONTAINS "spaghetti"
    AND (dj.Text CONTAINS "bike" OR dj.Text CONTAINS "car")
RETURN p;

Jokes about Bikes & Cars made from Spaghetti

It's amazing to see how the same conceptual joke is being reused in different ways!

Some interesting structural characteristics about the #dadjoke twitterspace

Now we can of course also start to look at some of the structural characteristics of this part of the Twitterspace. Just from looking at some of the subgraph results of our queries, it becomes obvious that

  • lots of jokes are being repeated, time and time again
  • different Twitter handles actually borrow each other's jokes - all the time

So let's explore that a little more.

How many jokes are tweeted identically by different tweeters

MATCH path = (h1:Handle)-[*2..2]->(dj:Dadjoke)<-[*2..2]-(h2:Handle)
WHERE id(h1)<id(h2)
RETURN path;

This takes a while to load, but you can clearly see a few cliques in this picture.

Paths between twitter Handles

Let's see how many such paths are actually there:

MATCH path = (h1:Handle)-[*2..2]-(dj:Dadjoke)-[*2..2]-(h2:Handle)
WHERE id(h1)<id(h2)
WITH h1.name AS FirstHandle, h2.name AS SecondHandle, count(path) AS NrOfSharedJokes
RETURN FirstHandle, SecondHandle, NrOfSharedJokes
ORDER BY NrOfSharedJokes DESC;

The result is quite enlightening: GroanBot and RandomJokesIO are clearly reinforcing one another. My personal guess is that they are truly just bots.

Count of the paths between twitter Handles
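Note that count(path) counts paths, not jokes: if one handle tweeted the same joke three times and the other tweeted it four times, that single joke contributes twelve paths. A minimal sketch that reports both numbers side by side, assuming the same labels and the name property used in the query above:

```cypher
// Sketch: count each shared joke once per pair of handles,
// and keep the raw path count next to it for comparison.
MATCH (h1:Handle)-[*2..2]-(dj:Dadjoke)-[*2..2]-(h2:Handle)
WHERE id(h1) < id(h2)
RETURN h1.name AS FirstHandle, h2.name AS SecondHandle,
       count(DISTINCT dj) AS NrOfDistinctSharedJokes,
       count(*) AS NrOfPaths
ORDER BY NrOfDistinctSharedJokes DESC
LIMIT 10;
```

A large gap between the two columns for a pair of handles would suggest a few jokes being repeated many times rather than a broad overlap in repertoire.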

What are the most frequent entities

We already have the Favorite/Retweet scores of all the dadjokes summed up, so we can also look at which Entity nodes have the highest scores that way:

MATCH (e:Entity)--(dj:Dadjoke)
WITH e, sum(toInteger(dj.SumOfFavorites)) AS sumofsumoffavorites, sum(toInteger(dj.SumOfRetweets)) AS sumofsumofretweets
SET e.SumOfSumOfFavorites = sumofsumoffavorites
SET e.SumOfSumOfRetweets = sumofsumofretweets;

This operation finishes very quickly, so we can then easily explore which entities our dadjokers are mostly joking about:

MATCH (e:Entity)
RETURN e.text, e.SumOfSumOfFavorites AS EntityFavoriteScore, e.SumOfSumOfRetweets AS EntityRetweetScore
ORDER BY EntityFavoriteScore DESC
LIMIT 10;

What entities are dads joking about?

Surprise: it's about wives and bosses. Right!
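The scores above weight each entity by favorites. To rank entities by plain frequency instead, a minimal read-only sketch (assuming the same Entity and Dadjoke labels and the e.text property used above) counts the distinct jokes connected to each entity:

```cypher
// Sketch: rank entities by the number of distinct jokes that mention them.
MATCH (e:Entity)--(dj:Dadjoke)
RETURN e.text AS Entity, count(DISTINCT dj) AS NrOfJokes
ORDER BY NrOfJokes DESC
LIMIT 10;
```

Comparing this list with the favorite-based ranking shows whether an entity scores high because it appears in many jokes or because a few of its jokes are very popular.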

Wrapping up

What a crazy ride this has been. I could actually think of many different things that I would want to do with this dataset - but I will leave it at this for now. I do think that this has been one of the best (and most FUN) examples that I have come across recently that combines data import, data wrangling, NLP, text analysis, graph data science and disambiguation in one exercise. I really loved it - and hope it will inspire others to explore this or other datasets in the same graphy way.

Cheers

Rik

Here are the different parts to this blogpost series:
Hope they are as fun for you as they were for me.
