Thursday, 6 October 2022

DadjokeGraph Part 1/6: Building a Dadjoke database - from nothing

A Graph Database and a Dadjoke walk into a bar...

I am a dad. Happily married, with three wonderful kids of 13, 17 and 19 years old. So all teenagers - with all the positive and not so positive experiences that can come with that. And in our family, I have been accused of using dadjokes - preferably at very awkward or inconvenient times. I actually like dadjokes. A lot.

So I follow a few accounts on social media that post these jokes - like for example baddadjokes, dadjokeman, dadsaysjokes, dadsjokes, groanbot, punsandoneliner, randomjokesio and thepunnyworld - and there are many others. These are all in this list, should you be interested. It's a very funny list.

Dadjokers List on Twitter

Working at Neo4j, I have always been on the lookout for datasets that would demonstrate the wonderful power of graphs in a fun and engaging way. Over the years, this has pulled me into a variety of ratholes - and I have always loved that, and learned so much along the way. So... here we go again: I have been looking at different ways of building a dadjokes database. Wouldn't that be a ton of fun? Not just because of the jokes, but also as a means to learn more about the twitterspace, the topics of the jokes, and the frequency and similarity of these jokes. So, like always, I figured why not just give it a try. Here goes.

Here are the techniques that I tried in order to actually build the database:

  • scraping the data off of different websites - very cumbersome, and frankly I gave up without trying very hard.
  • exporting stuff from Instagram with instaloader and OCR-ing it with Tesseract. I was actually pleasantly surprised with how easy this was - but it still felt quite clunky. While the OCR engine would get a LOT right without any tuning, it still got plenty of stuff wrong. So I figured there had to be a better way.
  • using the Twitter API: if you have a Twitter Developer Account, you can get access to the API and pull tweets from there - for example like this, after setting the account list and the bearer token as parameters in Neo4j Browser:

Twitter API integration

:param listofaccounts => [["baddadjokes",268948349],["dadsaysjokes",905028905026846720],["dadsjokes",106358414],["ThePunnyWorld",2472108787],["DadJokeMan",766659901242413057],["GroanBot",1065122605483474944],["RandomJokesIO",954397214004031489]];

:param twitterbearer => "<somerandomstringsthatserveasbearer>";

UNWIND $listofaccounts AS account
CALL apoc.load.jsonParams(
  // call the Twitter v2 tweets endpoint for this account id,
  // excluding retweets and replies via a query parameter
  "https://api.twitter.com/2/users/" + account[1] + "/tweets?exclude=retweets,replies",
  // the bearer token goes into the Authorization header
  {Authorization: "Bearer " + $twitterbearer},
  null
)
YIELD value
// every status in the response becomes a Tweet node
UNWIND value.data AS status
CREATE (t:Tweet {id: status.id, handle: account[0], text: status.text});
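To sanity-check what came back, a quick count of tweets per handle works well - just a suggested verification query, using only the properties created above:

// count the imported tweets per account handle
MATCH (t:Tweet)
RETURN t.handle AS handle, count(t) AS tweets
ORDER BY tweets DESC;
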
  • a set of tweets that I have started to gather using IFTTT, which adds new tweets to a Google Sheet as they are posted. This will take a long time to build up, but should make for a fun follow-up.

IFTTT sheet

IFTTT Tweets by Dadjokers Google Sheet

  • and FINALLY, LAST BUT NOT LEAST: using vicinitas.io. This allows you to download a really good chunk of tweets from specific Twitter handles - the free version is rate-limited to 3200 tweets per account, but for what I want to do here, that is plenty. I downloaded the tweets for the accounts above, and put everything together into one gsheet.

Vicinitas Google Sheet

Now, you can either

  • create a .csv version of it (example over here) and put that in the ./import directory of your Neo4j server,
  • or download the .csv version of the Google Sheet straight from the export URL.
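For that second option, the load boils down to something like this - a minimal sketch, where <sheetid> is a placeholder for the id of your copy of the sheet, and "Text" is an assumed column name that you may need to adjust to the actual sheet headers:

// read the Google Sheet straight from its csv export URL
// <sheetid> is a placeholder; "Text" is an assumed column name
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/<sheetid>/export?format=csv" AS row
CREATE (t:Tweet {text: row.Text});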

Now that we have a nice little dataset to play with, let's proceed to importing that into our database.

Cheers

Rik

Here are the different parts to this blogpost series:
Hope they are as fun for you as they were for me.
