# A Graph Database and a Dadjoke walk into a bar...

We noticed that many of the `Tweet` nodes referred to the same jokes - and we already resolved that above. But this query shows that we actually still have some work to do:

```cypher
MATCH path = (dj:Dadjoke)-[*..2]-(conn)
WHERE dj.Text CONTAINS "pyjamazon"
RETURN path;
```

We will come back to that example below.

We now notice that there are quite a few `Dadjoke` nodes that are not identical, but very similar. We would like to disambiguate these too. We will use a couple of different strategies for this, starting with one based on string metrics.

## Disambiguation based on String Metrics

The first strategy is all about comparing the `Text` property on the `Dadjoke` nodes. We know they are not identical (see the `MERGE` that we did previously - that would have captured the identical ones!), but we do know that some of these jokes are not the same, just similar. So how to proceed? Well, it turns out there are a lot of really interesting algorithms that calculate "string metrics": they compare two strings with one another and express how similar they are as a numerical score. I found this guide really useful to understand it a bit better. The article explains that, as with so many things, there are different ways of assessing that similarity, and none of them is by definition perfect. They all make trade-offs in terms of efficiency and accuracy, and largely complement each other instead of competing with one another. Let's explore a few of these metrics, as they are readily available in our `apoc` library.

### Comparing jokes using Levenshtein and Sørensen-Dice similarities

The first interesting metric that we will use is based on the Levenshtein distance. The Levenshtein distance between two strings has some simple lower and upper bounds:

* It is at least the difference of the sizes of the two strings.
* It is at most the length of the longer string.
* It is zero if and only if the strings are equal.

We will use that distance to create a similarity indicator (between 0 and 1) that is calculated as:

(length of the longest string - Levenshtein distance between the strings) / length of the longest string
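To make the formula concrete, here is a minimal Python sketch - purely illustrative, not the APOC implementation - that computes the Levenshtein distance with the classic dynamic-programming recurrence and turns it into the similarity score described above:

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    # (length of longest string - distance) / length of longest string
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return (longest - levenshtein_distance(a, b)) / longest
```

For example, "kitten" and "sitting" are 3 edits apart, so their similarity is (7 - 3) / 7 ≈ 0.57; identical strings score exactly 1.0.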

The second interesting metric that we will use is based on the Sørensen-Dice coefficient, which also yields a score between 0 and 1.
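The Sørensen-Dice coefficient compares the multisets of character bigrams of the two strings. Here is a minimal Python sketch of the idea - again just illustrative; APOC's version may normalize case and whitespace differently:

```python
from collections import Counter

def bigrams(s: str) -> list:
    # All overlapping two-character substrings of s.
    return [s[i:i + 2] for i in range(len(s) - 1)]

def sorensen_dice_similarity(a: str, b: str) -> float:
    # 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|)
    ba, bb = Counter(bigrams(a)), Counter(bigrams(b))
    overlap = sum((ba & bb).values())  # multiset intersection
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2 * overlap / total
```

The textbook example: "night" and "nacht" share only the bigram "ht" out of 4 + 4 bigrams, so their coefficient is 2 * 1 / 8 = 0.25.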

So let's calculate this using [`apoc.text.levenshteinSimilarity`](https://neo4j.com/labs/apoc/4.3/overview/apoc.text/apoc.text.levenshteinSimilarity/) and [`apoc.text.sorensenDiceSimilarity`](https://neo4j.com/labs/apoc/4.3/overview/apoc.text/apoc.text.sorensenDiceSimilarity/). Both of these yield a value between 0 and 1, where 0 means very low similarity and 1 means the strings are identical.

First let's see what we find, and figure out what would be a useful cutoff point for our dataset:

```cypher
MATCH (dj1:Dadjoke), (dj2:Dadjoke)
WHERE id(dj1) < id(dj2)
AND dj1.Text <> dj2.Text
AND left(dj1.Text, 30) = left(dj2.Text, 30)
WITH dj1.Text AS dj1text, dj2.Text AS dj2text,
    apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity,
    apoc.text.sorensenDiceSimilarity(dj1.Text, dj2.Text) AS SorensenDiceSimilarity
WHERE LevenshteinSimilarity < 0.65
RETURN left(dj1text, 60) AS `First 60 chars of dadjoke1`,
    left(dj2text, 60) AS `First 60 chars of dadjoke2`,
    LevenshteinSimilarity, SorensenDiceSimilarity
ORDER BY LevenshteinSimilarity DESC;
```

Note that in this query we only compared pairs of jokes that:

* had identical first 30 characters
* had different "complete" text strings

We then calculated the string metrics on the entirety of the `Dadjoke` text, and got the above results.

Then we can decide, by looking at the data, where we should be cutting off the similarity scores. When I looked at this, I judged that if the similarity was greater than `0.7`, the two texts of the `Dadjoke` nodes were likely to be the same. Lower than that and the texts started to diverge quite significantly. So in the next phase I will be merging the jokes that have similarity scores above my cutoff point.

## Merge the nodes that have a Levenshtein similarity > 0.7

In case the Levenshtein similarity is larger than 0.7, we would want the `Dadjoke` nodes to be merged (and the links to the Tweets to be moved as well). That would be an interesting use case for one of the graph refactoring procedures in apoc: `apoc.refactor.mergeNodes` will do that trick.

```cypher
:auto MATCH (dj1:Dadjoke), (dj2:Dadjoke)
WHERE id(dj1) < id(dj2)
AND dj1.Text <> dj2.Text
AND left(dj1.Text, 20) = left(dj2.Text, 20)
WITH [dj1, dj2] AS nodes, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity
WHERE LevenshteinSimilarity > 0.7
CALL {
    WITH nodes
    CALL apoc.refactor.mergeNodes(nodes, {properties: {Text: "discard", SumOfFavorites: "combine", SumOfRetweets: "combine"}, mergeRels: true}) YIELD node
    RETURN count(node) AS mergednodes
} IN TRANSACTIONS OF 1 ROWS
RETURN sum(mergednodes) AS countofmergednodes;
```

Note that we have to make these changes one by one (hence the batch size of 1 row per transaction): merging multiple overlapping node pairs in the same transaction could otherwise lead to a deadlock.

Before ending this section, we do need to re-aggregate the `Tweet` stats onto the `Dadjoke` nodes. Same procedure as before:

### Sum up the Retweet/Favorite stats

```cypher
MATCH (t:Tweet)--(dj:Dadjoke)
WITH dj, sum(toInteger(t.Favorites)) AS sumoffavorites, sum(toInteger(t.Retweets)) AS sumofretweets
SET dj.SumOfFavorites = sumoffavorites
SET dj.SumOfRetweets = sumofretweets;
```

That wraps up the disambiguation based on string metrics. Now, let's get to the true power of the graph, by using NLP and Entity Extraction first, and graph analytics second.

Cheers

Rik

Here are the different parts of this blogpost series:
Hope they are as fun for you as they were for me.