A Graph Database and a Dadjoke walk into a bar...
The next, and final, step in our dadjoke journey takes the disambiguation to the next level by applying Graph Data Science metrics to the new, NLP-enriched structure that we now have in our graph database. The basic idea is that, while the text similarity of two jokes may be quite low, their structural similarity may still be quite high, based on the connectivity between each joke and its (NLP-based) entities.
Calculating the Jaccard similarity metric using GDS
To explore this, we will be using the Jaccard similarity coefficient, which is part of the Neo4j Graph Data Science (GDS) library that we have installed on our server. More about this coefficient can be found on Wikipedia. The index is defined as the size of the intersection divided by the size of the union of the sample sets, which is very well illustrated on that Wikipedia page. In our case, the sample sets are the sets of entities that each joke is connected to. I have used NEuler (the no-code graph data science playground that you can easily add to your Neo4j Desktop installation) to generate the Cypher code below, but you can easily run this in the Neo4j Browser as well.
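To make the definition concrete before handing things over to GDS, here is a minimal Python sketch of the coefficient on two sets. The entity sets are made up for illustration and do not come from the real dadjoke dataset, and returning 0.0 for two empty sets is simply a convention chosen for this sketch.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical entity sets for two jokes (not taken from the real dataset)
joke_1 = {"doctor", "patient", "curtains"}
joke_2 = {"doctor", "patient", "hospital"}

# 2 shared entities out of 4 distinct entities
print(jaccard(joke_1, joke_2))  # 0.5
```

Two jokes that share two of their four distinct entities score 0.5, even if their wording is completely different. The Cypher that follows asks GDS to do this kind of calculation across the whole graph.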
:param limit => (500);
:param graphConfig => ({
  nodeProjection: '*',
  relationshipProjection: {
    relType: {
      type: 'HAS_ENTITY',
      orientation: 'NATURAL',
      properties: {}
    }
  }
});
:param config => ({
  similarityMetric: 'Jaccard',
  similarityCutoff: 0,
  degreeCutoff: 2,
  writeProperty: 'score',
  writeRelationshipType: 'JACCARD_SIMILAR'
});
:param communityNodeLimit => (10);
:param generatedName => ('in-memory-graph-1663777188212');
CALL gds.graph.project($generatedName, $graphConfig.nodeProjection, $graphConfig.relationshipProjection, {});
CALL gds.nodeSimilarity.write($generatedName, $config);
MATCH (from)-[rel:`JACCARD_SIMILAR`]-(to)
WHERE rel.`score` IS NOT NULL
RETURN from, to, rel.`score` AS similarity
ORDER BY similarity DESC
LIMIT toInteger($limit);
CALL gds.graph.drop($generatedName);
This will generate a set of JACCARD_SIMILAR relationships between Dadjoke nodes. Each relationship has a score property on it that indicates how similar the two jokes are. We can easily add an index on that relationship property:
CREATE INDEX jaccard_similarity_index FOR ()-[s:JACCARD_SIMILAR]-() ON (s.score);
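To get a feel for what the degreeCutoff and similarityCutoff settings in the configuration above do, here is a rough Python sketch of a pairwise comparison. It is not how GDS implements node similarity internally: it ignores options such as topK, the function and parameter names are my own, the joke data is made up, and skipping pairs without any shared entity is an assumption of the sketch.

```python
from itertools import combinations


def jaccard(a, b):
    """Intersection size divided by union size (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(entities_by_joke, degree_cutoff=2, similarity_cutoff=0.0):
    """Score every pair of eligible jokes, highest similarity first."""
    # Jokes with fewer than `degree_cutoff` entities are left out entirely.
    eligible = {j: e for j, e in entities_by_joke.items() if len(e) >= degree_cutoff}
    pairs = []
    for j1, j2 in combinations(sorted(eligible), 2):
        score = jaccard(eligible[j1], eligible[j2])
        # Assumption of this sketch: pairs with no shared entity are skipped.
        if score > 0 and score >= similarity_cutoff:
            pairs.append((j1, j2, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical jokes and entities, for illustration only
jokes = {
    "joke_a": {"doctor", "patient", "curtains"},
    "joke_b": {"doctor", "patient", "hospital"},
    "joke_c": {"doctor"},              # only one entity: below the degree cutoff
    "joke_d": {"pyjamas", "amazon"},   # shares nothing with the other jokes
}

print(similar_pairs(jokes))  # [('joke_a', 'joke_b', 0.5)]
```

With a degree cutoff of 2, a joke that has only a single entity never takes part in the comparison, which avoids a lot of trivially "similar" pairs.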
We can explore this similarity with a few queries.
MATCH p=(h:Handle)--(t:Tweet)--(d:Dadjoke)-[r:JACCARD_SIMILAR]->()
WHERE d.Text CONTAINS "pyjamazon"
RETURN p;
Using Jaccard similarity for disambiguation
Returning to our objective of disambiguating the jokes: there seem to be hundreds of additional near-duplicate jokes that we could eliminate using the Jaccard metric.
MATCH (d1:Dadjoke)-[r:JACCARD_SIMILAR]->(d2:Dadjoke)
WHERE r.score > 0.9
AND id(d1) > id(d2)
AND d1.Text CONTAINS "Doctor"
RETURN d1.Text, d2.Text, r.score;
It turns out that there are about 844 of these eliminations that we could do.
MATCH p=()-[r:JACCARD_SIMILAR]->()
WHERE r.score > 0.9
WITH count(p) AS count
RETURN count;
We could then of course also actually perform the disambiguation now and remove the duplicate dadjokes based on the JACCARD_SIMILAR score. I have not done that in this case, as I think it is interesting to see how this structural analysis yields its insights. But clearly that's what you would consider doing as your last disambiguation step, using Neo4j.
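If you did want to take that last step, you would first need a rule for which joke of each similar pair to keep. The Python sketch below shows one possible policy, purely as an illustration: it is not something I ran against the database, the function name and the node ids and scores are made up, and "keep the lower id" is an arbitrary choice that mirrors the id(d1) > id(d2) comparison used earlier.

```python
def pick_duplicates(scored_pairs, threshold=0.9):
    """Return the ids to drop, keeping the lower id of each similar pair."""
    dropped = set()
    # Visit pairs in order of their lower id, so earlier decisions are respected.
    ordered = sorted(scored_pairs, key=lambda p: (min(p[0], p[1]), max(p[0], p[1])))
    for a, b, score in ordered:
        keep, drop = min(a, b), max(a, b)
        # Only drop a joke if the joke it duplicates is itself still kept.
        if score > threshold and keep not in dropped:
            dropped.add(drop)
    return dropped


# Hypothetical (id, id, score) triples, as a similarity query might return them
pairs = [(2, 1, 0.95), (3, 2, 0.92), (5, 4, 0.5)]

print(pick_duplicates(pairs))  # {2}
```

Here joke 2 is dropped in favour of joke 1. Joke 3 is kept, because the only joke it resembles has already been removed and nothing says that it resembles joke 1. The resulting ids could then be handed to a Cypher DETACH DELETE, ideally after a manual check of a sample.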
Cheers
Rik