Thursday, 6 October 2022

DadjokeGraph Part 5/6: Disambiguation using Graph Data Science on the NLP-based Entities

A Graph Database and a Dadjoke walk into a bar...

The next, and final, step of our dadjoke journey is to take the disambiguation to the next level by applying Graph Data Science metrics to the new, NLP-enriched structure that we have in our graph database. The basic idea is that, while the text of two jokes may be quite far apart, their structural similarity may still be quite high, based on the connectivity between each joke and its (NLP-based) entities.

Calculating the Jaccard similarity metric using GDS

To explore this, we will be using the Jaccard similarity coefficient, which is part of the Neo4j Graph Data Science library that we have installed on our server. More about this coefficient can be found on Wikipedia. For two sets, the index is defined as the size of their intersection divided by the size of their union, which is very well illustrated on that Wikipedia page. In our case, the sets are the entities connected to each joke. I have used NEuler (the no-code graph data science playground that you can easily add to your Neo4j Desktop installation) to generate the code below - but you can easily run this in the Neo4j Browser as well.
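As a small worked example of that definition, here is a sketch that computes the coefficient by hand in plain Cypher. The two entity lists are made up for illustration and are not taken from the dadjoke dataset.

```cypher
// Two hypothetical entity sets, made up for illustration only
WITH ['doctor', 'patient', 'hospital'] AS a,
     ['doctor', 'patient', 'nurse'] AS b
// Intersection: the elements of a that also occur in b
WITH a, b, [x IN a WHERE x IN b] AS shared
// Union size = |a| + |b| - |intersection| (valid because neither list has duplicates)
WITH size(shared) AS intersectionSize,
     size(a) + size(b) - size(shared) AS unionSize
RETURN intersectionSize,
       unionSize,
       toFloat(intersectionSize) / unionSize AS jaccard;
```

The two lists share 2 of their 4 distinct entities, so the query returns a Jaccard score of 0.5. Two jokes with exactly the same entities would score 1.0.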

:param limit => ( 500);
:param graphConfig => ({
  nodeProjection: '*',
  relationshipProjection: {
    relType: {
      type: 'HAS_ENTITY',
      orientation: 'NATURAL',
      properties: {}
    }
  }
});
:param config => ({
  similarityMetric: 'Jaccard',
  similarityCutoff: 0,
  degreeCutoff: 2,
  writeProperty: 'score',
  writeRelationshipType: 'JACCARD_SIMILAR'
});
:param communityNodeLimit => ( 10);
:param generatedName => ('in-memory-graph-1663777188212');

CALL gds.graph.project($generatedName, $graphConfig.nodeProjection, $graphConfig.relationshipProjection, {});

CALL gds.nodeSimilarity.write($generatedName, $config);

MATCH (from)-[rel:`JACCARD_SIMILAR`]-(to)
WHERE rel.`score` IS NOT NULL
RETURN from, to, rel.`score` AS similarity
ORDER BY similarity DESC
LIMIT toInteger($limit);

CALL gds.graph.drop($generatedName);

Run Jaccard Similarity with GDS

This will generate a set of JACCARD_SIMILAR relationships between Dadjoke nodes, each with a score property that indicates how similar the two jokes are. We can easily add an index on that property:

CREATE INDEX jaccard_similarity_index FOR ()-[s:JACCARD_SIMILAR]-() ON (s.score);

We can explore this similarity with a few queries.

MATCH p=(h:Handle)--(t:Tweet)--(d:Dadjoke)-[r:JACCARD_SIMILAR]->()
WHERE d.Text CONTAINS "pyjamazon"
RETURN p;

Amazon jokes with Jaccard similarities

Using Jaccard similarity for disambiguation

Returning to our objective of disambiguating the jokes: there seem to be hundreds of additional duplicates that we could eliminate using the Jaccard metric.

MATCH (d1:Dadjoke)-[r:JACCARD_SIMILAR]->(d2:Dadjoke)
WHERE r.score > 0.9
AND id(d1) > id(d2)
AND d1.Text CONTAINS "Doctor"
RETURN d1.Text, d2.Text, r.score;

Dadjokes with different text but a very high Jaccard Similarity

Turns out there are about 844 of these types of eliminations that we could do.

MATCH p=()-[r:JACCARD_SIMILAR]->() 
WHERE r.score >0.9
WITH count(p) AS count
RETURN count;

How many jokes with a JACCARD-similarity score over 0.9?

We could then of course also actually perform the disambiguation now and remove the duplicate dadjokes based on the JACCARD_SIMILAR score. I have not done that in this case, as I think it is interesting to see how this structural analysis yields its insights. But clearly that's what you would consider doing as your last disambiguation step, using Neo4j.
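For those who do want to take that last step, a cautious first move could look something like the sketch below. It is not something I ran as part of this series: it only flags likely duplicates for review instead of deleting them, and the DuplicateCandidate label is a hypothetical name that does not exist in the model so far.

```cypher
// Sketch only: flag likely duplicates for review instead of deleting them.
// :DuplicateCandidate is a hypothetical label, not part of the original model.
MATCH (d1:Dadjoke)-[r:JACCARD_SIMILAR]->(d2:Dadjoke)
WHERE r.score > 0.9
  AND id(d1) > id(d2)   // flag only the joke with the higher id in each pair
WITH DISTINCT d1
SET d1:DuplicateCandidate
RETURN count(d1) AS flaggedJokes;
```

Once the flagged jokes have been reviewed, they could be removed, but any tweets pointing at them would probably need to be re-linked to the surviving joke first.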

Cheers

Rik

Here are the different parts to this blogpost series:
Hope they are as fun for you as they were for me.
