A Graph Database and a Dadjoke walk into a bar...
Now that we have a disambiguated graph of dadjokes, let's have some fun and explore it.
How many times does a joke get tweeted?
MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke) WITH dj.Text AS Joke, count(r) AS NrOfTimesTweeted RETURN Joke, NrOfTimesTweeted ORDER BY NrOfTimesTweeted DESC LIMIT 10;
How many times does a joke get favorited?
MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke) RETURN dj.Text AS Joke, dj.SumOfFavorites AS NrOfTimesFavorited, dj.SumOfRetweets AS NrOfTimesRetweeted ORDER BY NrOfTimesFavorited DESC LIMIT 10;
Let's explore 3 alternative ways to find jokes about cars.
1. Matching the text of the Dadjoke for the word "car"
MATCH (dj:Dadjoke) WHERE dj.Text CONTAINS "car" RETURN dj.Text LIMIT 10;
2. Checking if the Entity contains the word "car"
MATCH (e:Entity)--(dj:Dadjoke) WHERE e.text CONTAINS "car" RETURN dj.Text LIMIT 10;
3. Checking if the Entity equals the word "car"
MATCH (e:Entity)--(dj:Dadjoke) WHERE e.text = "car" RETURN dj.Text LIMIT 10;
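The three queries differ mainly in how strictly they match. A substring test such as CONTAINS "car" also hits words like "scar" or "carpet", while an equality test on the entity text only hits the exact token. The Python sketch below mimics that difference outside the database. The sample jokes and the helper names are made up for illustration and are not taken from the graph.

```python
import re

# Hypothetical sample texts, not taken from the actual dataset.
jokes = [
    "My car broke down, so I gave it a brake.",
    "I got a scar from a paper cut. Tearable.",
    "The carpet salesman had me floored.",
]

def contains_match(text, word):
    """Case-sensitive substring test, analogous to Cypher's CONTAINS."""
    return word in text

def whole_word_match(text, word):
    """Case-insensitive whole-word test, closer to an entity that equals the word."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

substring_hits = [j for j in jokes if contains_match(j, "car")]
whole_word_hits = [j for j in jokes if whole_word_match(j, "car")]

print(len(substring_hits))   # every sample text contains the substring "car"
print(len(whole_word_hits))  # only the first text has "car" as a separate word
```

This is why the first query tends to return the most noise and the third the least, at the cost of missing jokes where the entity was extracted as something like "cars".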
Finding jokes about cars and wives
This was another great example:
MATCH p=(h:Handle)--(t:Tweet)--(dj:Dadjoke)-[r:JACCARD_SIMILAR]->() WHERE dj.Text CONTAINS "spaghetti" AND (dj.Text CONTAINS "bike" OR dj.Text CONTAINS "car") RETURN p;
It's amazing to see how the same conceptual joke is being reused in different ways!
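The JACCARD_SIMILAR relationship in the query above links jokes that resemble each other. As a rough illustration of the idea: Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below computes it on lower-cased word sets. The tokenisation and the sample jokes are assumptions for illustration only; the similarity in the graph may well have been computed on different features, such as shared entities.

```python
import re

def tokens(text):
    """Split a text into a set of lower-cased word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity: intersection size over union size (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical variations on one joke, not taken from the dataset.
joke_a = "I hit spaghetti with my car"
joke_b = "I hit spaghetti with my bike"

score = jaccard(tokens(joke_a), tokens(joke_b))
print(round(score, 3))
```

Two jokes that only swap one word share most of their tokens and score close to 1, which is how reworded versions of the same joke end up connected.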
Some interesting structural characteristics about the #dadjoke twitterspace
Now we can of course also start to look at some of the structural characteristics of this part of the Twitterspace. Just from looking at some of the subgraph results of our queries, it becomes obvious that:
- lots of jokes are being repeated, time and time again
- different Twitter handles actually borrow each other's jokes - all the time
So let's explore that a little more.
How many jokes are tweeted identically by different tweeters
MATCH path = (h1:Handle)-[*2..2]->(dj:Dadjoke)<-[*2..2]-(h2:Handle) WHERE id(h1)<id(h2) RETURN path;
This takes a while to load, but you can clearly see a few cliques in this picture.
Let's see how many such paths there actually are:
MATCH path = (h1:Handle)-[*2..2]-(dj:Dadjoke)-[*2..2]-(h2:Handle) WHERE id(h1)<id(h2) WITH h1.name AS FirstHandle, h2.name AS SecondHandle, count(path) AS NrOfSharedJokes RETURN FirstHandle, SecondHandle, NrOfSharedJokes ORDER BY NrOfSharedJokes DESC;
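To make the counting logic concrete outside the database: for every unordered pair of handles, count the jokes both have tweeted. The id(h1)<id(h2) filter in the query plays the same role as itertools.combinations in the Python sketch below, so each pair is reported only once. The handle names and joke IDs are made up. One difference worth noting: the query counts paths, so, assuming each two-hop path runs through a Tweet node, a joke that one handle tweeted several times is counted several times, whereas this sketch counts distinct jokes.

```python
from itertools import combinations

# Hypothetical data: which joke IDs each handle has tweeted.
jokes_by_handle = {
    "@alice": {"j1", "j2", "j3"},
    "@bob": {"j2", "j3"},
    "@carol": {"j3", "j4"},
}

def shared_joke_counts(jokes_by_handle):
    """Return (first_handle, second_handle, shared_count) tuples, most shared first."""
    rows = []
    # combinations() yields each unordered pair exactly once.
    for first, second in combinations(sorted(jokes_by_handle), 2):
        shared = jokes_by_handle[first] & jokes_by_handle[second]
        if shared:
            rows.append((first, second, len(shared)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows

for row in shared_joke_counts(jokes_by_handle):
    print(row)
```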
What are the most frequent entities
We already have the Favorite/Retweet scores of all the dadjokes summed up, so we can also look at which Entity nodes have the highest scores that way:
MATCH (e:Entity)--(dj:Dadjoke) WITH e, sum(toInteger(dj.SumOfFavorites)) AS sumofsumoffavorites, sum(toInteger(dj.SumOfRetweets)) AS sumofsumofretweets SET e.SumOfSumOfFavorites = sumofsumoffavorites SET e.SumOfSumOfRetweets = sumofsumofretweets;
This operation finishes very quickly, so we can easily explore which entities our dadjokers joke about the most:
MATCH (e:Entity) RETURN e.text, e.SumOfSumOfFavorites AS EntityFavoriteScore, e.SumOfSumOfRetweets AS EntityRetweetScore ORDER BY EntityFavoriteScore DESC LIMIT 10;
Surprise: it's about wives and bosses. Right!
What a crazy ride this has been. I could actually think of many different things that I would want to do with this dataset - but I will leave it at this for now. I do think that this has been one of the best (and most FUN) examples that I have come across recently that combines data import, data wrangling, NLP, text analysis, graph data science and disambiguation in one exercise. I really loved it - and hope it will inspire others to explore this or other datasets in the same graphy way.