A Graph Database and a Dadjoke walk into a bar...
We noticed that many of the Tweet
nodes referred to the same jokes - and resolved that already above. But this query makes us understand that we actually still have some work to do:
MATCH path = (dj:Dadjoke)-[*..2]-(conn)
WHERE dj.Text CONTAINS "pyjamazon"
RETURN path;
We will come back to that example below.
We now notice that there are quite a few Dadjoke
nodes that are a bit different, but very similar. We would like to disambiguate these too. We will use a couple of different strategies for this, but start with a strategy that is based on String Metrics.
Disambiguation based on String Metrics
The first strategy is all about comparing the Text
property on the Dadjoke
nodes. We know they are not identical (see the MERGE
that we did previously - that would have captured the identical ones!), but we actually know that some of these jokes are not the same but similar. So how to proceed with that? Well, turns out there are a lot of really interesting algorithms that calculate "string metrics" that compare two different strings with one another, and express how similar they are in a numerical score. I found this guide really useful to understand it a bit better. The article explains that, like with so many things, there are different ways of assessing that similarity, and none of them are per definition perfect. They all make trade-offs in terms of efficacy and accuracy, and largely complement each other instead of competing with one another. Let's explore a few of these metrics, as they are readily available in our apoc
library.
Comparing jokes using Levenshtein and Sørensen-Dice similarities
The first interesting metric that we will use is based on the Levenshtein distance and use that to calculate how similar the strings are. The Levenshtein distance has some simple lower/upper bounds:
- It is at least the difference of the sizes of the two strings.
- It is at most the length of the longer string.
- It is zero if and only if the strings are equal.
We will use that distance to create a similarity indicator (between 0 and 1) that is calculated as
(length of longest string - Levenshtein distance between the strings) / length of the longest string. which will yield a value between 0 and 1.
The second interesting metric that we will use is based on the Sørensen-dice coefficient, which also yields a score between 0 and 1.
So let's calculate this using (apoc.text.levenshteinsimilarity
)[https://neo4j.com/labs/apoc/4.3/overview/apoc.text/apoc.text.levenshteinSimilarity/] and (apoc.text.sorencenDiceSimilarity
)[https://neo4j.com/labs/apoc/4.3/overview/apoc.text/apoc.text.sorensenDiceSimilarity/]. Both of these will yield a value between 0 and 1 - where 0 is very low similarity, and 1 is where the strings are identical.
First let's see what we find, and figure out what would be a useful cutoff point for our dataset:
MATCH (dj1:Dadjoke), (dj2:Dadjoke)
WHERE id(dj1)<id(dj2)
AND dj1.Text <> dj2.Text
AND left(dj1.Text,30) = left(dj2.Text,30)
WITH dj1.Text AS dj1text, dj2.Text AS dj2text, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity,
apoc.text.sorensenDiceSimilarity(dj1.Text, dj2.Text) AS SorensenDiceSimilarity
WHERE LevenshteinSimilarity < 0.65
RETURN left(dj1text,60) as `First 60 chars of dadjoke1`,left(dj2text,60) as `First 60 chars of dadjoke2`,LevenshteinSimilarity,SorensenDiceSimilarity
ORDER BY LevenshteinSimilarity DESC;
Note that I made a conscious decision to compare Dadjokes that
- had identical first 30 characters
- had different "complete" text strings
We then calculated the string metrics on the entirity of the
Dadjoke
text, and got the above results.
Then we can decide, by looking at the data, where we should be cutting off the similarity scores. When I looked at this, I judged that if the similarity was greater than 0.7
, the two texts of the Dadjoke
nodes were likely to be the same. Lower than that and the texts started to diverge quite significantly. So in the next phase I will be merging the jokes that have similarity scores above my cutoff point.
Merge the nodes that have a Levenshtein similarity > 0.7
In case the Levenshtein similarity is larger than 0.7, we would want the Dadjoke
nodes to be merged (and the links to the Tweets to be moved as well). That would be an interesting use case for one of the graph refactoring procedures in apoc: apoc.refactor.mergeNodes
will do that trick.
:auto MATCH (dj1:Dadjoke), (dj2:Dadjoke)
WHERE id(dj1)<id(dj2)
AND dj1.Text <> dj2.Text
AND left(dj1.Text,20) = left(dj2.Text,20)
WITH [dj1,dj2] AS nodes, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity
WHERE LevenshteinSimilarity > 0.7
CALL {
WITH nodes
CALL apoc.refactor.mergeNodes(nodes, {properties: {Text: "discard", SumOfFavorites: "combine", SumOfRetweets: "combine"}, mergeRels:True}) YIELD node
RETURN count(node) AS mergednodes
} IN TRANSACTIONS OF 1 ROWS
RETURN sum(mergednodes) as countofmergedbodes;
Note that we have to make these changes one by one, based on the fact that it could otherwise lead to a deadlock situation.
Before ending this section, we do need to re-aggregate the Tweet
stats onto the Dadjoke
nodes. Same procedure as before:
Sum up the Retweet/Favorite stats
MATCH (t:Tweet)--(dj:Dadjoke)
WITH dj, sum(toInteger(t.Favorites)) AS sumoffavorites, sum(toInteger(t.Retweets)) AS sumofretweets
SET dj.SumOfFavorites = sumoffavorites
SET dj.SumOfRetweets = sumofretweets;
That wraps up the disambiguation based on string metrics. Now, let's get to the true power of the graph, but using NLP and Entity Extraction first, and graph analytics second.
Cheers
Rik
No comments:
Post a Comment