Finding a Data source
That was so easy. I very quickly located a Dataset on Kaggle that I thought would be really interesting. It's a comma-separated file, about 110k lines long and 10MB in size, that holds all the lines that Shakespeare wrote for his plays. It's just an amazing dataset - not too complicated, but terribly interesting.
The file has the following column headers:
Dataline | Play | PlayerLinenumber | ActSceneLine | Player | PlayerLine |
---|---|---|---|---|---|
abc | def | ghi | jkl | mno | pqr |
Of course you can find the dataset on Kaggle yourself, but I also imported it into a Google Sheet that you can access as well. This sheet is shared publicly, and can be downloaded as a CSV file at any time from this URL. This URL is what we will use for importing the data into Neo4j.
So let's see how we can do that.
Prepare the database
Assuming that you are using one of the latest versions of Neo4j, which supports multiple databases, you should start by creating the database for this exercise:
:use system;
create or replace database shakespeare;
:use shakespeare;
Once that is done, you should also create some indexes on the database, as that will help with the data import and querying later on:
create index on :Play(name);
create index on :Player(name);
create index on :Scene(name);
create index on :Act(name);
create index on :Line(PlayerLine);
create index on :Line(Dataline);
create index on :Line(Play);
create index on :Line(Act);
create index on :Line(Scene);
Importing the data into Neo4j
Because the data is already in .csv format, and available on the web via the Google Sheet link above, importing the data as is into Neo4j is a no-brainer. All you need to do is use the LOAD CSV command. Here's what that looks like:
Loading the lines of all of Shakespeare's plays - with some dataformats as Integers
Note that I needed to do one specific trick, and that is to convert the Dataline and PlayerLinenumber fields to integers, so that we can sort/sequence them later on. Otherwise the create (l:Line) statement could simply have been followed by set l = line, but we can't do that now.
Here's the import statement:
load csv with headers from "https://docs.google.com/spreadsheets/d/15c6eUbRMNDrPa0RTuzdrY46OAr2FzKH8tD0KZNoaG8c/export?format=csv&gid=1470339152" as line
create (l:Line)
set l.Dataline = toInteger(line.Dataline)
set l.Play = line.Play
set l.PlayerLinenumber = toInteger(line.PlayerLinenumber)
set l.ActSceneLine = line.ActSceneLine
set l.Player = line.Player
set l.PlayerLine = line.PlayerLine;
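To see why the integer conversion matters, here is a minimal sketch in plain Python, outside of Neo4j. The three sample rows, the `to_int` helper and the `load_lines` function are made up for illustration; only the column headers come from the dataset. It assumes that empty fields should become missing values, which is roughly what toInteger does with a null.

```python
import csv
import io

# Hypothetical sample rows; only the header names come from the dataset.
SAMPLE_CSV = """Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
10,Example Play,2,1.1.2,PLAYER B,"Second line, with a comma"
4,Example Play,1,1.1.1,PLAYER A,First line
1,Example Play,,,,ACT I
"""

def to_int(value):
    """Return an int, or None when the field is empty."""
    value = (value or "").strip()
    return int(value) if value else None

def load_lines(text):
    """Parse the CSV text and convert the two numeric columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["Dataline"] = to_int(row["Dataline"])
        row["PlayerLinenumber"] = to_int(row["PlayerLinenumber"])
        rows.append(row)
    return rows

lines = load_lines(SAMPLE_CSV)
# Numeric sort: 1, 4, 10. As strings the order would be "1", "10", "4".
ordered = sorted(lines, key=lambda r: r["Dataline"])
```

Sorting the raw strings would put "10" before "4", which is exactly the sequencing problem the conversion avoids.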
Now that the data is in Neo4j, we can start wrangling it into a much more graphy data structure. Here's how we do that.
Refactoring the data
We already have Player in the Line nodes. So let's extract that first and make them into separate nodes.
Creating the Players of the Lines
We will use a MERGE operation for this to create the node and make sure that it does not get created twice. Next we add the relationship between the player and the line.
match (l:Line)
where l.Player is not null
with l
merge (pl:Player {name: l.Player})
create (pl)-[:ARTICULATES]->(l);
Next, we are going to look at
* where the Line fits into the Scene,
* where the Scene fits into the Act,
* and where the Act fits into the Play.
Here's how we do that.
Extracting the Act and Scene info of the Lines
Every Line node has an ActSceneLine property that holds the act, scene and line number, separated by a period (.). Let's first split this composite property into separate properties. Note that we have to account for some Line nodes that don't have an ActSceneLine property, as the original dataset did not have it.
match (l:Line)
where l.ActSceneLine is not null
with l, split(l.ActSceneLine,".") as Array
set l.Act = Array[0]
set l.Scene = Array[1]
set l.Line = Array[2];
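As a rough illustration of that split, here is the same idea in plain Python. The function name `split_act_scene_line` and the example value "1.1.7" are made up for this sketch, and it assumes every non-empty value has exactly three parts.

```python
def split_act_scene_line(value):
    """Split 'act.scene.line' into three parts; return None if missing."""
    if value is None or value == "":
        return None  # rows without an ActSceneLine are skipped
    act, scene, line = value.split(".")
    # The parts stay strings, just like the result of split() in Cypher.
    return {"Act": act, "Scene": scene, "Line": line}

parts = split_act_scene_line("1.1.7")
missing = split_act_scene_line(None)
```

Note that the parts remain strings, so a numeric sort on the act or scene would need a further conversion.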
So now we can proceed with creating a hierarchy (Play>>Act>>Scene>>Line) for every Play.
Creating the Scene Nodes, linking the lines to the Scenes
Here's how we create the Scene nodes, and link them to the Lines.
match (l:Line)
where l.ActSceneLine is not null
merge (sc:Scene {name: l.Play+" - Act "+l.Act+" - Scene "+l.Scene})
create (l)-[:PART_OF]->(sc);
Creating the Act nodes, linking the Scenes to the Acts
Next we can link the Scenes to the Acts:
match (l:Line)-->(sc:Scene)
where l.ActSceneLine is not null
merge (a:Act {name: l.Play+" - Act "+l.Act})
merge (sc)-[:PART_OF]->(a);
Linking the Act to the Plays
And finally we can link the Acts to the Plays:
match (l:Line)-->(sc:Scene)-->(a:Act)
where l.ActSceneLine is not null
merge (p:Play {name: l.Play})
merge (a)-[:PART_OF]->(p);
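The composite names are what keep the hierarchy unambiguous: "Scene 1" exists in many acts and plays, so the name has to carry its parents along. The following plain-Python sketch mirrors that naming scheme. The `hierarchy_names` helper and the sample rows are hypothetical, and a dictionary stands in for the deduplicating effect of MERGE.

```python
def hierarchy_names(play, act, scene):
    """Build the Play, Act and Scene names used as merge keys."""
    act_name = f"{play} - Act {act}"
    scene_name = f"{act_name} - Scene {scene}"
    return play, act_name, scene_name

# Hypothetical (Play, Act, Scene) values as found on Line nodes.
rows = [
    ("Hamlet", "1", "1"),
    ("Hamlet", "1", "2"),
    ("Hamlet", "2", "1"),
    ("Hamlet", "1", "1"),  # a second line from the same scene
]

part_of = {}  # child name -> parent name; repeated keys are stored once
for play, act, scene in rows:
    play_name, act_name, scene_name = hierarchy_names(play, act, scene)
    part_of[scene_name] = act_name
    part_of[act_name] = play_name
```

Four rows produce three scenes and two acts, because the repeated scene maps onto the entry that already exists.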
One last thing to clean up is the fact that there are some Line nodes that don't have an ActSceneLine property, and therefore don't have a Scene or an Act, but that do need to be linked to the Play:
Some lines don't have Acts, but are part of the Play!
match (l:Line)
where l.ActSceneLine is null
merge (p:Play {name: l.Play})
merge (l)-[:PART_OF]->(p);
Next, we will start making the model a bit more understandable.
Making the model more understandable
Currently, the model basically has every Line connected to a Scene that is in an Act of a Play. That works fine, but it does not give us a lot of clues as to how the play would work. That's why I wanted to create a sequential loop of Lines for every Scene: every Line in the Scene would connect to the next one, and then the next one, and so on. Here's how we do that.
Linking the lines in a loop for every scene
We start by linking the lines in a chain.
Link the lines in a chain
We will use the Dataline property of every Line for this.
match (l1:Line), (l2:Line)
where id(l1)>id(l2)
and l1.Play = l2.Play
and l1.Dataline = l2.Dataline + 1
create (l2)-[:FOLLOWED_BY]->(l1);
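The chaining rule is simply "same Play, and Dataline exactly one higher". Here is a small plain-Python sketch of that rule using a dictionary lookup. The function `link_followed_by` and the sample lines are made up for illustration, and the sketch leaves out the id() comparison from the query.

```python
def link_followed_by(lines):
    """Return (earlier, later) Dataline pairs for consecutive lines of one play."""
    by_dataline = {line["Dataline"]: line for line in lines}
    pairs = []
    for current in lines:
        following = by_dataline.get(current["Dataline"] + 1)
        # Only link when the next Dataline exists and belongs to the same Play.
        if following is not None and following["Play"] == current["Play"]:
            pairs.append((current["Dataline"], following["Dataline"]))
    return sorted(pairs)

# Hypothetical sample: three lines of one play, two of another.
sample = [
    {"Dataline": 1, "Play": "Hamlet"},
    {"Dataline": 2, "Play": "Hamlet"},
    {"Dataline": 3, "Play": "Hamlet"},
    {"Dataline": 4, "Play": "Macbeth"},
    {"Dataline": 5, "Play": "Macbeth"},
]
chain = link_followed_by(sample)
```

Lines 3 and 4 are consecutive in the file but belong to different plays, so no link is created between them.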
Then we proceed by connecting the first and last line to the scene with a specific STARTS_WITH and ENDS relationship.
Connect the chain to the Scene with Start and Ending
We find the first Dataline element and start with that:
match (l:Line)-->(s:Scene)
with s, min(l.Dataline) as startline
match (l:Line)
where l.Dataline = startline
create (s)-[:STARTS_WITH]->(l);
And then we find the last Dataline element and end with that:
match (l:Line)-->(s:Scene)
with s, max(l.Dataline) as endline
match (l:Line)
where l.Dataline = endline
create (s)<-[:ENDS]-(l);
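Both queries boil down to a min/max aggregation per Scene. A minimal plain-Python sketch of that aggregation follows. The function `scene_bounds` and the sample data are hypothetical.

```python
from collections import defaultdict

def scene_bounds(line_to_scene):
    """Map each scene to its (first Dataline, last Dataline)."""
    datalines = defaultdict(list)
    for dataline, scene in line_to_scene:
        datalines[scene].append(dataline)
    return {scene: (min(ds), max(ds)) for scene, ds in datalines.items()}

# Hypothetical (Dataline, Scene name) pairs.
bounds = scene_bounds([(4, "S1"), (5, "S1"), (6, "S1"), (7, "S2"), (9, "S2")])
```

The first element of each pair is where STARTS_WITH would point, and the second is the line that ENDS the scene.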
Now we can also remove the link between Line and Scene:
match (l:Line)-[pao:PART_OF]-(sc:Scene)
delete pao;
So how do we use this? We'll take a look at that later when we start querying the data.
Let's now explore some more advanced, data science style use cases for this dataset.
Understanding the importance of different players
One thing that I was trying to figure out is whether the graph could help me understand which characters/players are more important than others. There are different ways of doing that for sure, and I will just explore two in this article.
Which Players have most Lines in a Play?
Sounds like a simple enough proxy for importance, right? If a Player has more lines, there's a likelihood that they will have a more important role in the story. So let's go there.
Linking players to plays
First we need to connect the Players to the Plays for this. That's easy enough, as the indirect connection is of course already there. Here's an easy way to achieve what we need:
match (pl:Player)-->(l:Line)
with pl, l
match (p:Play {name:l.Play})
merge (pl)-[pi:PLAYS_IN]->(p)
on create set pi.nroflines=1
on match set pi.nroflines= pi.nroflines+1;
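The on create / on match pair is effectively a counter keyed on the (Player, Play) combination. Here is a rough plain-Python equivalent. The function `count_lines` and the sample pairs are invented for this sketch.

```python
from collections import Counter

def count_lines(spoken):
    """Count lines per (player, play); spoken has one pair per Line node."""
    counts = Counter()
    for player, play in spoken:
        # A missing key starts at 0, so the first hit yields 1 (the "on create"
        # case) and every later hit adds 1 (the "on match" case).
        counts[(player, play)] += 1
    return counts

# Hypothetical sample: one (player, play) pair per spoken line.
nroflines = count_lines([
    ("ROMEO", "Romeo and Juliet"),
    ("JULIET", "Romeo and Juliet"),
    ("ROMEO", "Romeo and Juliet"),
])
```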
Note that the PLAYS_IN relationship now also stores nroflines as a property: the number of lines that a Player has spoken in a particular Play.
Top 3 players (by number of lines) in every play
Next, I wanted to write a query that would find the top 3 Players in every Play. We use a query with a subquery for that: the first part finds all the Plays, and then for every Play I look for the Players and the number of lines that I have stored on the relationship.
match (p:Play)
call {
  with p
  match (pl:Player)-[pi:PLAYS_IN]->(p)
  return p.name as Play, pl.name as Players, pi.nroflines as NrOfLines
  order by NrOfLines desc
  limit 3
}
return Play, collect(Players) as TopPlayers, collect(NrOfLines) as TopPlayersLines
order by Play;
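The subquery pattern is a "top N per group" computation: sort and limit inside each Play, then collect. The plain-Python sketch below shows the same shape. The function `top_players` and the line counts are made up and are not results from the dataset.

```python
from collections import defaultdict

def top_players(nroflines, n=3):
    """Return {play: (top player names, their line counts)} for the top n."""
    per_play = defaultdict(list)
    for (player, play), count in nroflines.items():
        per_play[play].append((count, player))
    result = {}
    for play in sorted(per_play):  # like "order by Play"
        ranked = sorted(per_play[play], key=lambda cp: cp[0], reverse=True)[:n]
        result[play] = ([p for _, p in ranked], [c for c, _ in ranked])
    return result

# Hypothetical counts keyed by (player, play).
counts = {
    ("A", "Play X"): 50, ("B", "Play X"): 80,
    ("C", "Play X"): 10, ("D", "Play X"): 30,
    ("E", "Play Y"): 5,
}
top = top_players(counts)
```

The limit applies per Play rather than to the whole result, which is what the subquery achieves in the Cypher version.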
This already gives us a nice little indication of the importance of the Players, but I would like to suggest a more advanced approach.
Understanding player importance because of the player-to-player relationships
Here's what I want to do: I would like to infer a new kind of relationship in our graph, called RELATED_TO. This relationship would be introduced between two Player nodes, if the Players had been appearing together in one of 3 levels:
- appearing together in the Play, ie level 1
- appearing together in an Act of a Play, ie level 2
- appearing together in a Scene of an Act of a Play, ie level 3
This new relationship will create a mono-partite subgraph of (Players)-[:RELATED_TO]->(OtherPlayers), which will be very useful for graph data science work later on. So let's create this.
Level 1: Players in Plays that have played together
Here's the query for that:
match (pl1:Player)-->(p:Play)<--(pl2:Player)
where id(pl1)>id(pl2)
merge (pl1)-[r:RELATED_TO]->(pl2)
set r.level=1;
Level 2: Players in Acts that have played together
This will require a two step process:
Step 1: Link Players to Acts
It's very similar to how we linked Players to Plays:
match (pl:Player)-->(l:Line)
with pl, l
match (a:Act {name:l.Play+" - Act "+l.Act})
merge (pl)-[pi:PLAYS_IN]->(a)
on create set pi.nroflines=1
on match set pi.nroflines= pi.nroflines+1;
Step 2: Relate the Players if they were in the same Act
Here's how we can create the relationships between players based on being in the same act:
match (pl1:Player)-->(a:Act)<--(pl2:Player)
where id(pl1)>id(pl2)
merge (pl1)-[r:RELATED_TO]->(pl2)
set r.level=2;
Level 3: Players in Scenes that have played together
Again, we need two steps:
Step 1: Link Players to Scenes
We go about this in a very similar way:
match (pl:Player)-->(l:Line)
with pl, l
match (s:Scene {name:l.Play+" - Act "+l.Act+" - Scene "+l.Scene})
merge (pl)-[pi:PLAYS_IN]->(s)
on create set pi.nroflines=1
on match set pi.nroflines= pi.nroflines+1;
Step 2: Relate the Players if they were in the same Scene
Again, very similar to the above:
match (pl1:Player)-->(s:Scene)<--(pl2:Player)
where id(pl1)>id(pl2)
merge (pl1)-[r:RELATED_TO]->(pl2)
set r.level=3;
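Taken together, the three levels amount to a co-occurrence computation in which the most specific shared container wins, because each later query overwrites the level set by the previous one. Here is a plain-Python sketch of that logic. The function `related_to` and the sample groups are hypothetical.

```python
from itertools import combinations

def related_to(appearances):
    """appearances: list of (level, group name, set of players in that group)."""
    levels = {}
    # Process level 1 first and level 3 last, like running the queries in order.
    for level, _group, players in sorted(appearances, key=lambda a: a[0]):
        # combinations() yields each unordered pair once, which plays the role
        # of the id(pl1) > id(pl2) filter.
        for a, b in combinations(sorted(players), 2):
            levels[(a, b)] = level  # a later, higher level overwrites
    return levels

# Hypothetical sample: one play, one act and one scene.
levels = related_to([
    (3, "Play X - Act 1 - Scene 1", {"A", "B"}),
    (1, "Play X", {"A", "B", "C", "D"}),
    (2, "Play X - Act 1", {"A", "B", "C"}),
])
```

A and B share a scene and end up at level 3, while D only shares the play with the others and stays at level 1.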
That sets us up nicely for a couple of interesting explorations. Let's get into that.
Some queries and visualisations
Of course there are some great ways to now start working with the data. First we will do some simple queries in the Neo4j Browser.
Look at an entire scene
Let's look at this in two ways:
In the Neo4j Browser
Here's a fairly simple Cypher query, that would look at one entire scene. We are taking a scene from Romeo and Juliet in this case.
match entirescene = (p:Play)--(a:Act)--(s:Scene)-[:STARTS_WITH]->(firstline:Line)-[:FOLLOWED_BY*]-(lastline:Line)-[:ENDS]-(s)
where p.name contains "Romeo"
with entirescene, nodes(entirescene) as nodes
limit 1
unwind nodes as node
match (node)-[r]-(conn)
return entirescene, node, r, conn;
Obviously that's not the greatest visualisation. So let's improve that.
In Neo4j Bloom
In Neo4j Bloom, we can actually customize this query above, by making it into a search phrase. Essentially we parametrise the Play name in the search phrase (look for the $param in the screenshot below):
The result then looks like this: This is clearly a lot easier to look at.
Let's look at another query pattern.
Show network of players and their relatedness
Based on the RELATED_TO relationship that we created, we can now look at the players and their "network" of interactions during the play.
In the Neo4j Browser
Here's a simple view of the network of Player relations based on the relations above, for the Romeo and Juliet Play. If we run this query:
match (pl1:Player)-->(l:Line)-->(:Scene)-->(a:Act)-->(p:Play {name: "Romeo and Juliet"})
with pl1
match playerrelations = (pl1)-[:RELATED_TO]-(pl2:Player)
return playerrelations;
The result very quickly becomes a bit of a hairball:
But luckily, we can also parametrise this as a search phrase in Bloom.
In Neo4j Bloom
Here's what the phrase looks like: And applying that becomes a much more interesting picture: Which allows me to very quickly zoom into the more important "Player nodes":
The point here is of course that, without reading a single line of the text, the graph is telling me which Players are likely to be more important than others. I just love that. I think this is why we can also apply this to so many other domains. The graph structure is immediately giving us insights.
Now let's see how we can enhance this even further, by applying graph algorithms from the Graph Data Science Library to this structure. Should be fun!
Running Graph Data Science on the Shakespeare network
Now that we have that RELATED_TO relationship, we can actually do some very interesting graph data science work, as this is now a mono-partite subgraph, containing only Player nodes and RELATED_TO relationships.
I am a big fan of using Neuler for doing some of this simple graph data science work. It's just a few clicks away, and it generates the code for the most interesting algorithms. I have picked two in this case: Pagerank and Betweenness, both of them different variations of Centrality calculation algorithms.
Calculating Pagerank centrality
Here's how we do that. With a few clicks we can actually configure the algorithm on Neuler.
The code that is actually being run for this looks like this:
:param limit => ( 42);
:param config => ({
  nodeProjection: 'Player',
  relationshipProjection: {
    relType: {
      type: 'RELATED_TO',
      orientation: 'UNDIRECTED',
      properties: {
        level: {
          property: 'level',
          defaultValue: 1
        }
      }
    }
  },
  relationshipWeightProperty: 'level',
  dampingFactor: 0.85,
  maxIterations: 20,
  writeProperty: 'pagerank'
});
:param communityNodeLimit => ( 10);
CALL gds.pageRank.write($config);
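To get a feel for what the configuration parameters mean, here is a textbook weighted PageRank in plain Python, using the same damping factor and iteration count. This is only a sketch and not the Graph Data Science Library's implementation, which may differ in details such as normalisation, so the scores will not match numerically. The function `pagerank` and the small sample graph are made up.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """edges: {(a, b): weight} for an undirected graph; returns node scores."""
    neighbours = {}
    for (a, b), weight in edges.items():
        neighbours.setdefault(a, {})[b] = weight
        neighbours.setdefault(b, {})[a] = weight  # undirected: store both ways
    n = len(neighbours)
    rank = {node: 1.0 / n for node in neighbours}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in neighbours}
        for node, nbrs in neighbours.items():
            total = sum(nbrs.values())
            for nbr, weight in nbrs.items():
                # Each node shares its rank in proportion to the edge weight.
                new_rank[nbr] += damping * rank[node] * weight / total
        rank = new_rank
    return rank

# Hypothetical graph: H is related to everyone; weights stand in for 'level'.
scores = pagerank({("H", "A"): 3, ("H", "B"): 2, ("H", "C"): 1, ("A", "B"): 1})
```

In this toy graph the player connected to everyone else gets the highest score, which is the intuition behind using PageRank as an importance measure here.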
Once that's done, we can run a very simple Cypher query to show the Pagerank property of all the Players:
match (pl:Player)-->(l:Line)-->(:Scene)-->(a:Act)-->(p:Play {name: "Romeo and Juliet"})
return distinct pl.name, pl.pagerank, pl.betweenness
order by pl.pagerank desc
limit 10;
Then we can also run another interesting centrality metric. Here's how we do that:
Calculating Betweenness centrality
With a few clicks we can actually configure the algorithm on Neuler.
:param limit => ( 42);
:param config => ({
  nodeProjection: 'Player',
  relationshipProjection: {
    relType: {
      type: 'RELATED_TO',
      orientation: 'UNDIRECTED',
      properties: {}
    }
  },
  writeProperty: 'betweenness'
});
:param communityNodeLimit => ( 10);
CALL gds.betweenness.write($config);
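Betweenness counts how often a node sits on the shortest paths between other nodes. Here is a compact plain-Python sketch of Brandes' algorithm for an unweighted, undirected graph. It illustrates the metric only; the library's own implementation and scaling may differ, and the function `betweenness` and the sample graph are made up.

```python
from collections import deque

def betweenness(adjacency):
    """Unweighted betweenness centrality for an undirected graph."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths (sigma) from source.
        stack = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adjacency}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}

# Hypothetical star graph: H connects three players who never meet each other.
central = betweenness({"H": ["A", "B", "C"], "A": ["H"], "B": ["H"], "C": ["H"]})
```

H lies on the only path between each of the three pairs of other players, so it scores 3 while the others score 0.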
Once that's done, we can run a very simple Cypher query to show the Betweenness of the Players:
match (pl:Player)-->(l:Line)-->(:Scene)-->(a:Act)-->(p:Play {name: "Romeo and Juliet"})
return distinct pl.name, pl.pagerank, pl.betweenness
order by pl.betweenness desc
limit 10;
No doubt there are tons of additional things we could do with this dataset, but here's where my exercise will end. I am hoping that this was a useful story for you - it definitely was for me. All the code for this exercise is also available as a .mdx markdown file on Github. Download that file and immediately you will have a Neo4j Browser guide that walks you through this entire post right inside Neo4j - isn't that handy?
All the best
Rik Van Bruggen