A Graph Database and a Dadjoke walk into a bar...
As we can see in the pyjamazon example from before, the disambiguation of our Dadjokes has come a long way - but it is not yet complete. Hence we call the graph to the rescue here, and take it a final step further. This will provide a wonderfully powerful example of how and why graphs are so good at analysing the structural characteristics of data, and at making interesting and amazing recommendations on the back of that.
Here's what we are going to do:
- we are going to use Natural Language Processing (NLP) to extract the entities that are mentioned in our Dadjokes. To do that, we are going to use the amazing Google Cloud NLP service, and call it from APOC. This will yield a connected structure that tells us exactly which entities are mentioned in every joke.
- then we are going to use that graph of dadjokes connected to entities to figure out if the structure of the links can help us with further disambiguation of the jokes.
So let's start with the start.
1. NLP on the Dadjokes
I have actually experimented with this technique before - and found it to be extremely powerful. See some posts that I wrote before, over here and over here, if you want to see other examples. Using the Natural Language Processing (NLP) procedures in the APOC library, we can now call the NLP services of GCP, AWS and Azure really easily, and use them to automatically extract the entities from the text of the Dadjokes.
So, after having installed the required NLP .jar file in the Plugin directory of the Neo4j server, we can start analysing the descriptions using the Google Cloud NLP service. Here's how that works:
:param apiKey =>("XYZ-XYZABCXYZABC");
:auto
MATCH (dj:Dadjoke)
WHERE NOT((dj)-[:HAS_ENTITY]->())
WITH dj
CALL {
WITH dj
CALL apoc.nlp.gcp.entities.graph(dj, {
key: $apiKey,
nodeProperty: "Text",
scoreCutoff: 0.01,
writeRelationshipType: "HAS_ENTITY",
writeRelationshipProperty: "gcpEntityScore",
write: true
})
YIELD graph AS g
RETURN g
} IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";
Note that the apiKey is of course to be replaced with your own personal key.
Note that we do these NLP operations in transactions of 10 rows: the apoc NLP procedure chunks things into groups of 25, so to reduce the number of (expensive) calls to the GCP services, it's a good idea to use multiples of 25 for the throttling of the calls.
Note that I actually added the where not((dj)-[:HAS_ENTITY]->()) part because I have had some issues with timeouts recently, where apoc.nlp.gcp.* would just take too long to return and therefore the connection would get reset. So I would have to relaunch the operation, but I would only want to do it for the Dadjoke nodes that had not been processed yet, i.e. the ones that do not have Entity nodes connected to them yet. The result then looks something like this for every Dadjoke:
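A quick way to eyeball that result is a query along these lines (just a sketch, using the property and relationship names from the call above):

MATCH (dj:Dadjoke)-[r:HAS_ENTITY]->(e:Entity)
RETURN dj.Text, e.text, r.gcpEntityScore
ORDER BY r.gcpEntityScore DESC
LIMIT 25;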
Two more comments on the NLP topic:
- There seem to be some types of text that the Google NLP engine struggles with, and cannot extract entities from. Look at this query:
MATCH (dj:Dadjoke)
WHERE NOT((dj)-[:HAS_ENTITY]->())
RETURN dj.Text
and notice that the results have a lot of plays on words related to numbers:
There could be ways to try to fix that, for example by replacing numbers with words - but I did not go into that.
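Just to sketch the idea anyway: a naive version could spell out the individual digits with apoc.text.replace, writing the result to a hypothetical TextSpelled property. Note that this treats "10" as "onezero", so it really is just a first approximation:

WITH {`0`:"zero", `1`:"one", `2`:"two", `3`:"three", `4`:"four", `5`:"five", `6`:"six", `7`:"seven", `8`:"eight", `9`:"nine"} AS words
MATCH (dj:Dadjoke)
WHERE dj.Text =~ '.*[0-9].*'
SET dj.TextSpelled = reduce(t = dj.Text, d IN keys(words) | apoc.text.replace(t, d, words[d]));

You could then point the nodeProperty of the NLP call at TextSpelled instead of Text.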
- If you don't have the time, or the $$$, to run the NLP on all these nodes, you could also just do the NLP on a limited subset of nodes:
:auto
MATCH (dj:Dadjoke)
WHERE dj.Text CONTAINS "amazon"
WITH dj
CALL {
WITH dj
CALL apoc.nlp.gcp.entities.graph(dj, {
key: $apiKey,
nodeProperty: "Text",
scoreCutoff: 0.01,
writeRelationshipType: "HAS_ENTITY",
writeRelationshipProperty: "gcpEntityScore",
write: true
})
YIELD graph AS g
RETURN g
} IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";
That really only has a handful of nodes to process, so it would return quite quickly, and the financial cost would be negligible.
Indexing the Entities that we extracted
CREATE INDEX entity_index FOR (e:Entity) ON (e.text);
CREATE INDEX entity_rel_index FOR ()-[he:HAS_ENTITY]-() ON (he.gcpEntityScore);
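With the node index in place, direct lookups of an entity by its text become instantaneous - for example (using "pyjamas" purely as an illustrative value, and bearing in mind that the capitalisation has not been cleaned up yet at this point):

MATCH (e:Entity {text: "pyjamas"})<-[:HAS_ENTITY]-(dj:Dadjoke)
RETURN dj.Text;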
Refactoring after NLP-based entity extraction: consolidating the Entity nodes with different capitalisation
After running the NLP procedures, we quickly notice that there is some duplication in the Entity nodes that have been detected. Specifically, the capitalisation of the Entity nodes can be quite confusing, and we should take some care to resolve this - which is very easily done.
:auto match (e1:Entity), (e2:Entity)
where id(e1)<id(e2)
and toLower(e1.text) = toLower(e2.text)
call {
with e1, e2
match (e2)<-[:HAS_ENTITY]-(dj:Dadjoke)
// merge rather than create, so that we don't write duplicate relationships
merge (dj)-[:HAS_ENTITY]->(e1)
with e1, e2
// addLabels keeps e1's existing labels while copying over e2's
call apoc.create.addLabels(e1,labels(e2))
yield node
set e1.text = toLower(e1.text)
detach delete e2
} in transactions of 10 rows
return "Success!";
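As a quick sanity check that the consolidation worked, we can simply re-run the duplicate detection - this should now return no rows:

MATCH (e1:Entity), (e2:Entity)
WHERE id(e1) < id(e2)
AND toLower(e1.text) = toLower(e2.text)
RETURN e1.text, e2.text;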
We can now quickly revisit a specific part of the graph that we explored before:
MATCH path = (e:Entity)--(dj:Dadjoke)-[:REFERENCES_DADJOKE]-(t:Tweet)--(h:Handle)
WHERE dj.Text CONTAINS "amazon"
RETURN path;
Note that we now see some notable differences:
- 2 dadjokes out of 3 mention Jeff's pyjamas, and 1 of the dadjokes mentions pajamas
- 2 dadjokes out of 3 mention Jeff Bezos, and 1 of the dadjokes mentions eff Bezos
- there is no overlap between those sets of 2 dadjokes
- all they have in common is the Bed entity

This will be of importance when we calculate our next set of similarity scores.
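To sketch where this is going: the shared Entity nodes already give us a crude overlap count between pairs of jokes - for example, for the amazon jokes from above:

MATCH (dj1:Dadjoke)-[:HAS_ENTITY]->(e:Entity)<-[:HAS_ENTITY]-(dj2:Dadjoke)
WHERE id(dj1) < id(dj2)
AND dj1.Text CONTAINS "amazon"
AND dj2.Text CONTAINS "amazon"
RETURN dj1.Text, dj2.Text, count(e) AS shared, collect(e.text) AS sharedEntities
ORDER BY shared DESC;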
Cheers
Rik