Thursday 6 October 2022

DadjokeGraph Part 4/6: Adding NLP and Entity Extraction to prepare for further disambiguation

A Graph Database and a Dadjoke walk into a bar...

As we can see in the pyjamazon example from before, the disambiguation of our Dadjokes has come a long way - but it is not yet complete. Hence we call the graph to the rescue here, and take it one final step further. This will provide a wonderfully powerful example of how and why graphs are so good at analysing the structural characteristics of data, and at making interesting and amazing recommendations on the back of that.

Here's what we are going to do:

  1. we are going to use Natural Language Processing to extract the entities that are mentioned in our Dadjokes. To do that, we are going to use the amazing Google Cloud NLP Service, and call it from APOC. This will yield a connected structure that will tell us exactly which entities are mentioned in every joke.
  2. then we are going to use that graph of dadjokes connected to entities to figure out if the structure of the links can help us with further disambiguation of the jokes.

So let's start with the start.

1. NLP on the Dadjokes

I have actually experimented with this technique before - and found it to be extremely powerful: see over here and over here for other examples. Using the Natural Language Processing (NLP) procedures in the apoc libraries, we can now call the NLP services of GCP, AWS and Azure really easily, and use them to automatically extract the entities from the text of the Dadjokes.

So, after having installed the required NLP .jar file in the Plugin directory of the Neo4j server, we can start analysing the descriptions using the Google Cloud NLP service. Here's how that works:

:param apiKey =>("XYZ-XYZABCXYZABC");

:auto 
MATCH (dj:Dadjoke)
WHERE NOT((dj)-[:HAS_ENTITY]->())
WITH dj
CALL {
    WITH dj
    CALL apoc.nlp.gcp.entities.graph(dj, {
        key: $apiKey,
        nodeProperty: "Text",
        scoreCutoff: 0.01,
        writeRelationshipType: "HAS_ENTITY",
        writeRelationshipProperty: "gcpEntityScore",
        write: true
        })
    YIELD graph AS g
    RETURN g
    } IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";

Note that the apiKey is of course to be replaced with your own personal key.

Run NLP using GCP

Note that we run these NLP operations IN TRANSACTIONS to throttle the calls. Keep in mind that the apoc NLP procedure sends the nodes of each transaction to the API in chunks of 25, so a transaction of 10 rows still costs a full API request - to minimise the number of (expensive) calls to the GCP service, it's a good idea to use multiples of 25 for the batch size.
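To make that cost argument concrete, here is a small back-of-the-envelope sketch in Python. The function name is mine, and the 25-node chunk size reflects my understanding of apoc's batching, so treat this as an illustration rather than a spec:

```python
import math

def gcp_api_calls(node_count: int, tx_rows: int, api_chunk: int = 25) -> int:
    """Estimate the number of GCP NLP API requests: apoc.nlp.gcp sends the
    nodes of each transaction to the API in chunks of `api_chunk` (25),
    so every transaction costs ceil(tx_rows / api_chunk) requests."""
    full_txs, remainder = divmod(node_count, tx_rows)
    calls = full_txs * math.ceil(tx_rows / api_chunk)
    if remainder:
        calls += math.ceil(remainder / api_chunk)
    return calls

# 1000 jokes: transactions of 10 rows cost 100 requests, 25 rows only 40.
print(gcp_api_calls(1000, 10))  # 100
print(gcp_api_calls(1000, 25))  # 40
```

In other words: any transaction size below 25 pays one full API request per transaction anyway, so 25 (or a multiple of it) gets the most out of every request.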

Note that I actually added the

WHERE NOT((dj)-[:HAS_ENTITY]->())
WITH dj

part because I have had some issues with timeouts recently, where apoc.nlp.gcp.* would just take too long to return and the connection would get reset. I would then have to relaunch the operation, but only for the Dadjoke nodes that had not been processed yet, ie. the ones that did not have Entity nodes connected to them yet. The result then looks something like this for every Dadjoke:

Tweet, Dadjoke and Entities

Two more comments on the NLP topic:

  1. There seem to be some types of text that the Google NLP engine struggles with, and cannot extract entities from. Look at this query:
MATCH (dj:Dadjoke)
WHERE NOT((dj)-[:HAS_ENTITY]->())
RETURN dj.Text

and notice that the results have a lot of plays on words related to numbers:

Dadjokes that GCP NLP struggles with

There could be ways to try to fix that, for example by replacing numbers with words - but I did not go into that.
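That "replacing numbers with words" idea could look something like the sketch below - a toy preprocessing step that spells out small integers before sending the text to the NLP service. A real implementation would use a library like num2words; this minimal map only covers 0-10:

```python
import re

# Toy number-to-word map - only covers 0 through 10.
WORDS = {0: "zero", 1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
         6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten"}

def spell_out_numbers(text: str) -> str:
    """Replace standalone integers with their word form, where known."""
    return re.sub(r"\b\d+\b",
                  lambda m: WORDS.get(int(m.group()), m.group()),
                  text)

print(spell_out_numbers("Why was 6 afraid of 7? Because 7 8 9!"))
# Why was six afraid of seven? Because seven eight nine!
```

The NLP engine would then at least get word tokens to chew on, rather than bare digits.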

  2. If you don't have the time, or the $$$, to run the NLP on all these nodes, you could also just run the NLP on a limited subset of nodes:
:auto
MATCH (dj:Dadjoke)
WHERE dj.Text CONTAINS "amazon"
WITH dj
CALL {
    WITH dj
    CALL apoc.nlp.gcp.entities.graph(dj, {
        key: $apiKey,
        nodeProperty: "Text",
        scoreCutoff: 0.01,
        writeRelationshipType: "HAS_ENTITY",
        writeRelationshipProperty: "gcpEntityScore",
        write: true
        })
    YIELD graph AS g
    RETURN g
    } IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";

That really only selects a handful of nodes to process, so it returns quite quickly, and the financial cost would be negligible.

Indexing the Entities that we extracted

To speed up the queries that follow, we add an index on the Entity text property, and one on the HAS_ENTITY relationship score:

CREATE INDEX entity_index FOR (e:Entity) ON (e.text);
CREATE INDEX entity_rel_index FOR ()-[he:HAS_ENTITY]-() ON (he.gcpEntityScore);

Refactoring after NLP-based entity extraction: consolidating the Entity nodes with different capitalisation

After running the NLP procedures, we quickly notice that there is some duplication in the Entity nodes that have been detected. Specifically, the capitalisation of the Entity nodes can be quite confusing and we should take some care to resolve this - which is very easily done.

:auto
MATCH (e1:Entity), (e2:Entity)
WHERE id(e1) < id(e2)
    AND toLower(e1.text) = toLower(e2.text)
CALL {
    WITH e1, e2
    MATCH (e2)<-[r:HAS_ENTITY]-(dj:Dadjoke)
    CREATE (dj)-[newrel:HAS_ENTITY]->(e1)
    SET newrel.gcpEntityScore = r.gcpEntityScore
    WITH DISTINCT e1, e2
    CALL apoc.create.setLabels(e1, labels(e2))
        YIELD node
    SET e1.text = toLower(e1.text)
    DETACH DELETE e2
} IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";

Refactor the entities to lowercase
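A minimal plain-Python sketch of what this consolidation boils down to - group entity names case-insensitively and keep one lowercase canonical entity per joke (the data here is illustrative, not the real graph):

```python
def consolidate(joke_entities: dict[str, list[str]]) -> dict[str, list[str]]:
    """For each joke, deduplicate its entity names case-insensitively,
    keeping a single lowercase canonical form per group."""
    result = {}
    for joke, entities in joke_entities.items():
        seen = set()
        canonical = []
        for e in entities:
            low = e.lower()
            if low not in seen:
                seen.add(low)
                canonical.append(low)
        result[joke] = canonical
    return result

jokes = {"joke-1": ["Pyjamas", "pyjamas", "Jeff Bezos"]}
print(consolidate(jokes))  # {'joke-1': ['pyjamas', 'jeff bezos']}
```

The Cypher version above does the same thing in place on the graph, re-pointing the HAS_ENTITY relationships at the surviving node before deleting the duplicate.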

We can now quickly explore a specific part of the graph that we have explored before:

MATCH path = (e:Entity)--(dj:Dadjoke)-[:REFERENCES_DADJOKE]-(t:Tweet)--(h:Handle)
WHERE dj.Text CONTAINS "amazon"
RETURN path;

Dadjokes about Amazon

Note that we now see some notable differences:

  • 2 out of 3 dadjokes mention Jeff's pyjamas, and 1 of the dadjokes mentions pajamas
  • 2 out of 3 dadjokes mention Jeff Bezos, and 1 of the dadjokes mentions eff Bezos
  • there is no overlap between those two sets of 2 dadjokes
  • therefore, all that the jokes have in common is the Bed entity

This will be of importance when we calculate our next set of similarity scores.
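One natural way to turn that shared-entity structure into a number is a Jaccard score over the entity sets of two jokes. This is just a preview sketch with made-up entity sets loosely based on the Amazon example above, not the actual scores we will compute on the graph:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two jokes' entity sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative entity sets, loosely based on the Amazon example above.
joke_a = {"jeff bezos", "pyjamas", "bed", "amazon"}
joke_b = {"jeff bezos", "pyjamas", "bed"}
joke_c = {"eff bezos", "pajamas", "bed"}

print(round(jaccard(joke_a, joke_b), 2))  # 0.75
print(round(jaccard(joke_a, joke_c), 2))  # 0.17
```

Jokes that share most of their entities score high; the misspelled variants only share the Bed entity and score low - exactly the kind of structural signal the next part will exploit.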

Cheers

Rik

Here are the different parts to this blogpost series:
Hope they are as fun for you as they were for me.
