Thursday 6 October 2022

DadjokeGraph Part 3/6: Taking on the real disambiguation of the jokes

A Graph Database and a Dadjoke walk into a bar...

We noticed that many of the Tweet nodes referred to the same jokes - and resolved that already above. But this query makes us understand that we actually still have some work to do:

MATCH path = (dj:Dadjoke)-[*..2]-(conn)
WHERE dj.Text CONTAINS "pyjamazon"
    RETURN path;

The Amazon Dadjokes

We will come back to that example below.

We now notice that there are quite a few Dadjoke nodes that are a bit different, but very similar. We would like to disambiguate these too. We will use a couple of different strategies for this, but start with a strategy that is based on String Metrics.

Disambiguation based on String Metrics

The first strategy is all about comparing the Text property on the Dadjoke nodes. We know they are not identical (see the MERGE that we did previously - that would have captured the identical ones!), but we actually know that some of these jokes are not the same but similar. So how to proceed with that? Well, turns out there are a lot of really interesting algorithms that calculate "string metrics" that compare two different strings with one another, and express how similar they are in a numerical score. I found this guide really useful to understand it a bit better. The article explains that, like with so many things, there are different ways of assessing that similarity, and none of them are per definition perfect. They all make trade-offs in terms of efficacy and accuracy, and largely complement each other instead of competing with one another. Let's explore a few of these metrics, as they are readily available in our apoc library.

Comparing jokes using Levenshtein and Sørensen-Dice similarities

The first interesting metric that we will use is based on the Levenshtein distance and use that to calculate how similar the strings are. The Levenshtein distance has some simple lower/upper bounds:

  • It is at least the difference of the sizes of the two strings.
  • It is at most the length of the longer string.
  • It is zero if and only if the strings are equal. We will use that distance to create a similarity indicator (between 0 and 1) that is calculated as

    (length of longest string - Levenshtein distance between the strings) / length of the longest string. which will yield a value between 0 and 1.

The second interesting metric that we will use is based on the Sørensen-dice coefficient, which also yields a score between 0 and 1.

So let's calculate this using (apoc.text.levenshteinsimilarity)[] and (apoc.text.sorencenDiceSimilarity)[]. Both of these will yield a value between 0 and 1 - where 0 is very low similarity, and 1 is where the strings are identical.

First let's see what we find, and figure out what would be a useful cutoff point for our dataset:

MATCH (dj1:Dadjoke), (dj2:Dadjoke)
  WHERE id(dj1)<id(dj2)
  AND dj1.Text <> dj2.Text
  AND left(dj1.Text,30) = left(dj2.Text,30)
WITH dj1.Text AS dj1text, dj2.Text AS dj2text, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity,
apoc.text.sorensenDiceSimilarity(dj1.Text, dj2.Text) AS SorensenDiceSimilarity
  WHERE LevenshteinSimilarity < 0.65
RETURN left(dj1text,60) as `First 60 chars of dadjoke1`,left(dj2text,60) as `First 60 chars of dadjoke2`,LevenshteinSimilarity,SorensenDiceSimilarity
ORDER BY LevenshteinSimilarity DESC;

String Metrix Levenshtein- and SorensenDice-similarities

Note that I made a conscious decision to compare Dadjokes that

  • had identical first 30 characters
  • had different "complete" text strings We then calculated the string metrics on the entirity of the Dadjoke text, and got the above results.

Then we can decide, by looking at the data, where we should be cutting off the similarity scores. When I looked at this, I judged that if the similarity was greater than 0.7, the two texts of the Dadjoke nodes were likely to be the same. Lower than that and the texts started to diverge quite significantly. So in the next phase I will be merging the jokes that have similarity scores above my cutoff point.

Merge the nodes that have a Levenshtein similarity > 0.7

In case the Levenshtein similarity is larger than 0.7, we would want the Dadjoke nodes to be merged (and the links to the Tweets to be moved as well). That would be an interesting use case for one of the graph refactoring procedures in apoc: apoc.refactor.mergeNodes will do that trick.

:auto MATCH (dj1:Dadjoke), (dj2:Dadjoke)
WHERE id(dj1)<id(dj2)
AND dj1.Text <> dj2.Text
AND left(dj1.Text,20) = left(dj2.Text,20)
WITH [dj1,dj2] AS nodes, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity
WHERE LevenshteinSimilarity > 0.7
    WITH nodes
    CALL apoc.refactor.mergeNodes(nodes, {properties: {Text: "discard", SumOfFavorites: "combine", SumOfRetweets: "combine"}, mergeRels:True}) YIELD node
    RETURN count(node) AS mergednodes
RETURN sum(mergednodes) as countofmergedbodes;

Refactor the nodes based on LevenShtein similarity

Note that we have to make these changes one by one, based on the fact that it could otherwise lead to a deadlock situation.

Before ending this section, we do need to re-aggregate the Tweet stats onto the Dadjoke nodes. Same procedure as before:

Sum up the Retweet/Favorite stats

MATCH (t:Tweet)--(dj:Dadjoke)
WITH dj, sum(toInteger(t.Favorites)) AS sumoffavorites, sum(toInteger(t.Retweets)) AS sumofretweets
SET dj.SumOfFavorites = sumoffavorites
SET dj.SumOfRetweets = sumofretweets;

That wraps up the disambiguation based on string metrics. Now, let's get to the true power of the graph, but using NLP and Entity Extraction first, and graph analytics second.



Here are the different parts to this blogpost series:
Hope they are as fun for you as they were for me.

No comments:

Post a Comment