Thursday, 6 October 2022

DadjokeGraph Part 2/6: Importing the Dadjokes into the Dadjoke Graph

A Graph Database and a Dadjoke walk into a bar...

This means that we want to convert the spreadsheet that we created before, or the .csv version of it, into a Neo4j Database.

Here's how we go about this. First things first: let's set up the indexes that we will need later on in this process:

CREATE INDEX tweet_index FOR (t:Tweet) ON t.Text;
CREATE INDEX dadjoke_index for (d:Dadjoke) ON d.Text;

Assuming the .csv file mentioned above is in the import directory of the Neo4j server, we can use load csv to create the initial dataset:

LOAD CSV WITH HEADERS FROM "file:/vicinitas_alldadjokes_user_tweets.csv" AS csv
CREATE (t:Tweet)
SET t = csv;

Import the Vicinitas .csv file

Or: if you want to create the graph straight from the Google Sheet:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1MwHX5hM-Vda5o4ZQVnCv4upKepL5rHxcUrqfT69u5Ro/export?format=csv&gid=1582640786" AS csv
CREATE (t:Tweet)
SET t = csv;

Then, we will proceed to

  1. discard some of the obviously useless tweets that will not give us any good dadjokes, ie. the ones that
    • just contain a link,
    • are retweets of previous tweets (starting with the RT marker)
    • are just tagging someone else (starting with the @ marker)
  2. extract the Twitter handles (aka screen names) from the Tweets, and put them into seperate Handle nodes.

This is how we do both these things:

1. Remove some of the redundant Tweets

Here's how we use DETACH DELETE to weed out some of the tweets that will not yield interesting dadjokes:

MATCH (t:Tweet)
WHERE t.Text STARTS WITH "http"
    OR t.Text STARTS WITH "RT"
    OR t.Text STARTS WITH "@"
    DETACH DELETE t;

Remove redundant Tweets

2. Connect the Tweets to the Handles

The Tweet nodes will be connected to their Handle nodes, using the MERGE command:

MATCH (t:Tweet)
MERGE (h:Handle {name: t.`Screen Name`})
CREATE (t)<-[:TWEETS {datetime: datetime(t.UTC)}]-(h);

Now comes the interesting part - creating Dadjokes out of the Tweets.

Extract the Handles from the Tweets

Now we have the database full of Tweets and Handles, let's proceed to make that into a true Dadjoke database.


From tweet to dadjoke!

In the next few sections, we are going to explore how we can convert the Tweet nodes that we currently have in the database into a set of Dadjokes that we can work with later on. The complexity here is, obviously, that there will be quite a few tweets that will reference the same dadjokes, and we would like to actually make sure that we disambiguate and deduplicate these jokes. That will make for an interesting journey, and this is where Neo4j will clearly show its value.

Create Jokes out of Tweets

The MERGE statement will make sure that identical jokes don't get created twice, thereby already deduplicating some jokes.

MATCH (t:Tweet)
MERGE (dj:Dadjoke {Text: t.Text})
CREATE (t)-[:REFERENCES_DADJOKE]->(dj);

Extracting the dadjokes from the tweets

This is already interesting: we went from 18185 Tweets in the database, to 14313 Dadjokes - meaning that we have 3872 Tweets that had exactly the same text, and therefore the same Dadjoke. That alone is quite crazy.

Counts of Tweets, Handles, Dadjokes

We can see that by running this query:

MATCH (t1:Tweet), (t2:Tweet)
WHERE t1 <> t2
AND t1.Text = t2.Text
RETURN t1.`Screen Name`, t2.`Screen Name`, t1.Text;

2 Handles, 1 dadjoke

So there is clearly a disambiguation task at hand here! Let's pursue it! But before we do that, we should also think about some housekeeping, and aggregate the Favorite/Retweet scores of the tweets into the dadjokes.

Aggregate the Favorite/Retweet scores of the tweets into the jokes

As we will have created some jokes now that have more than 1 tweet referencing it, it is important to aggregate the Favorites and Retweets into the Dadjoke nodes, so that we can still have an understanding of the popularity of the joke (based on the popularity of the referring tweets).

MATCH (t:Tweet)--(dj:Dadjoke)
WITH dj, sum(toInteger(t.Favorites)) as sumoffavorites, sum(toInteger(t.Retweets)) as sumofretweets
SET dj.SumOfFavorites = sumoffavorites
SET dj.SumOfRetweets = sumofretweets;

Aggregating Favorites/Retweets for Dadjokes

We will come back to this later on as well as we further disambiguate the jokes (and will need to sum it all up again).

Cheers

Rik

Here are the different parts to this blogpost series:
Hope they are as fun for you as they were for me.

No comments:

Post a Comment