A Graph Database and a Dadjoke walk into a bar...
This means that we want to convert the spreadsheet that we created before, or the .csv
version of it, into a Neo4j Database.
Here's how we go about this. First things first: let's set up the indexes that we will need later on in this process:
CREATE INDEX tweet_index FOR (t:Tweet) ON t.Text;
CREATE INDEX dadjoke_index for (d:Dadjoke) ON d.Text;
Assuming the .csv file mentioned above is in the import
directory of the Neo4j server, we can use load csv
to create the initial dataset:
LOAD CSV WITH HEADERS FROM "file:/vicinitas_alldadjokes_user_tweets.csv" AS csv
CREATE (t:Tweet)
SET t = csv;
Or: if you want to create the graph straight from the Google Sheet:
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1MwHX5hM-Vda5o4ZQVnCv4upKepL5rHxcUrqfT69u5Ro/export?format=csv&gid=1582640786" AS csv
CREATE (t:Tweet)
SET t = csv;
Then, we will proceed to
- discard some of the obviously useless tweets that will not give us any good dadjokes, ie. the ones that
- just contain a link,
- are retweets of previous tweets (starting with the
RT
marker) - are just tagging someone else (starting with the
@
marker)
- extract the Twitter handles (aka screen names) from the
Tweet
s, and put them into seperateHandle
nodes.
This is how we do both these things:
1. Remove some of the redundant Tweets
Here's how we use DETACH DELETE
to weed out some of the tweets that will not yield interesting dadjokes:
MATCH (t:Tweet)
WHERE t.Text STARTS WITH "http"
OR t.Text STARTS WITH "RT"
OR t.Text STARTS WITH "@"
DETACH DELETE t;
2. Connect the Tweets to the Handles
The Tweet nodes will be connected to their Handle nodes, using the MERGE
command:
MATCH (t:Tweet)
MERGE (h:Handle {name: t.`Screen Name`})
CREATE (t)<-[:TWEETS {datetime: datetime(t.UTC)}]-(h);
Now comes the interesting part - creating Dadjoke
s out of the Tweets.
Now we have the database full of Tweet
s and Handle
s, let's proceed to make that into a true Dadjoke
database.
From tweet to dadjoke!
In the next few sections, we are going to explore how we can convert the Tweet
nodes that we currently have in the database into a set of Dadjoke
s that we can work with later on. The complexity here is, obviously, that there will be quite a few tweets that will reference the same dadjokes, and we would like to actually make sure that we disambiguate and deduplicate these jokes. That will make for an interesting journey, and this is where Neo4j will clearly show its value.
Create Jokes out of Tweets
The MERGE
statement will make sure that identical jokes don't get created twice, thereby already deduplicating some jokes.
MATCH (t:Tweet)
MERGE (dj:Dadjoke {Text: t.Text})
CREATE (t)-[:REFERENCES_DADJOKE]->(dj);
This is already interesting: we went from 18185 Tweet
s in the database, to 14313 Dadjoke
s - meaning that we have 3872 Tweets that had exactly the same text, and therefore the same Dadjoke. That alone is quite crazy.
We can see that by running this query:
MATCH (t1:Tweet), (t2:Tweet)
WHERE t1 <> t2
AND t1.Text = t2.Text
RETURN t1.`Screen Name`, t2.`Screen Name`, t1.Text;
So there is clearly a disambiguation task at hand here! Let's pursue it! But before we do that, we should also think about some housekeeping, and aggregate the Favorite/Retweet scores of the tweets into the dadjokes.
Aggregate the Favorite/Retweet scores of the tweets into the jokes
As we will have created some jokes now that have more than 1 tweet referencing it, it is important to aggregate the Favorites
and Retweets
into the Dadjoke
nodes, so that we can still have an understanding of the popularity of the joke (based on the popularity of the referring tweets).
MATCH (t:Tweet)--(dj:Dadjoke)
WITH dj, sum(toInteger(t.Favorites)) as sumoffavorites, sum(toInteger(t.Retweets)) as sumofretweets
SET dj.SumOfFavorites = sumoffavorites
SET dj.SumOfRetweets = sumofretweets;
We will come back to this later on as well as we further disambiguate the jokes (and will need to sum it all up again).
Cheers
Rik
No comments:
Post a Comment