Thursday, 10 June 2021

Network Analysis of Shakespeare's plays

What do you do when a new colleague starts to talk to you about how they would love to experiment with getting a dataset about Romeo & Juliet into a graph? Yes, that's right, you get your graph boots on, and you start looking out for a great dataset that you could play around with. And as usual, one thing leads to another (it's all connected, remember!), and you end up with this incredible experiment that twists, turns and meanders into something fascinating. That's what happened here too.

William Shakespeare

Finding a Data source

That was so easy. I very quickly located a Dataset on Kaggle that I thought would be really interesting. It's a comma-separated file, about 110k lines long and 10MB in size, that holds all the lines that Shakespeare wrote for his plays. It's just an amazing dataset - not too complicated, but terribly interesting.

The file has the following column headers: Dataline, Play, PlayerLinenumber, ActSceneLine, Player and PlayerLine.

Of course you can find the dataset on Kaggle yourself, but I also imported it into a Google Sheet that you can access as well. This sheet is shared publicly on the internet, and can be downloaded as a CSV at any time from this URL. This URL is what we will use for importing this data into Neo4j.

So let's see how we can do that.

Prepare the database

Assuming that you are using one of the latest versions of Neo4j, which supports multiple databases, you should start by creating the database for this exercise:

:use system; 
create or replace database shakespeare; 
:use shakespeare; 

Once that is done, you should also create some indexes on the database, as that will help with the data import and querying later on:

create index on :Play(name); 
create index on :Player(name);
create index on :Scene(name);
create index on :Act(name); 
create index on :Line(PlayerLine);
create index on :Line(Dataline);
create index on :Line(Play);
create index on :Line(Act);
create index on :Line(Scene);

Importing the data into Neo4j

Because the data is already in .csv format, and available on the web via the Google Sheet link above, importing the data as is into Neo4j is a no-brainer. All you need to do is use the LOAD CSV command. Here's what that looks like:

Loading the lines of all of Shakespeare's plays - with some dataformats as Integers

Note that I needed to do one specific trick, and that is to convert the Dataline and PlayerLinenumber fields to integers, so that we can sort/sequence them later on. Otherwise the create (l:Line) statement could have just been followed by set l = line - but we can't do that now.

Here's the import statement: 

load csv with headers from "" as line
create (l:Line)
 set l.Dataline = toInteger(line.Dataline)
 set l.Play = line.Play
 set l.PlayerLinenumber = toInteger(line.PlayerLinenumber)
 set l.ActSceneLine = line.ActSceneLine
 set l.Player = line.Player
 set l.PlayerLine = line.PlayerLine;
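
To see why the integer conversion matters, here is a minimal Python sketch. It is not part of the Neo4j import, and the sample values are made up. Values read from a CSV arrive as strings, and strings sort character by character rather than numerically:

```python
# Hypothetical Dataline values as they would arrive from a CSV file: all strings.
raw_datalines = ["9", "10", "100", "2"]

# Sorted as strings, "10" and "100" come before "2" and "9".
sorted_as_text = sorted(raw_datalines)

# Converted to integers first (the role toInteger() plays in the import),
# the order is numeric.
sorted_as_numbers = sorted(int(value) for value in raw_datalines)

print(sorted_as_text)     # ['10', '100', '2', '9']
print(sorted_as_numbers)  # [2, 9, 10, 100]
```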

Now that the data is in Neo4j, we can start wrangling it into a much more graphy data structure. Here's how we do that.

Refactoring the data

We already have Player in the Line nodes. So let's extract that first and make them into separate nodes.

Creating the Players of the Lines

We will use a MERGE operation for this to create the node and make sure that it does not get created twice. Next we add the relationship between the player and the line. 

match (l:Line)
where l.Player is not null
with l
merge (pl:Player {name: l.Player})
create (pl)-[:ARTICULATES]->(l);

Next, we are going to look at

  * where the Line fits into the Scene,
  * where the Scene fits into the Act,
  * and where the Act fits into the Play.

Here's how we do that.

Extracting the Act and Scene info of the Lines

We have a property on the Line that has the ActSceneLine for every line, with the three parts separated by a period (.). Let's first split this composite property into separate properties. Note that we have to account for some Line nodes that don't have an ActSceneLine property, as the original dataset did not have it for those lines.

match (l:Line)
where l.ActSceneLine is not null
with l, split(l.ActSceneLine,".") as Array
set l.Act = Array[0]
set l.Scene = Array[1]
set l.Line = Array[2];
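
The same split can be sketched in a few lines of Python, purely as an illustration of what the query above does to a single value. The function name and the sample value "1.2.15" are hypothetical, and the sketch assumes every non-empty value has exactly three parts:

```python
from typing import Optional


def split_act_scene_line(act_scene_line: Optional[str]) -> Optional[dict]:
    """Split a value such as "1.2.15" into its Act, Scene and Line parts."""
    if not act_scene_line:
        # Mirrors the "is not null" filter: lines without the property are skipped.
        return None
    parts = act_scene_line.split(".")
    # The parts stay strings, just like the result of split() in Cypher.
    return {"Act": parts[0], "Scene": parts[1], "Line": parts[2]}


example = split_act_scene_line("1.2.15")
missing = split_act_scene_line(None)
print(example)  # {'Act': '1', 'Scene': '2', 'Line': '15'}
print(missing)  # None
```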

So now we can proceed with creating a hierarchy (Play>>Act>>Scene>>Line) for every Play.

Creating the Scene Nodes, linking the lines to the Scenes

Here's how we create the Scene nodes, and link them to the Lines.

match (l:Line)
where l.ActSceneLine is not null
merge (sc:Scene {name: l.Play+" - Act "+l.Act+" - Scene "+l.Scene})
create (l)-[:PART_OF]->(sc);

Creating the Act nodes, linking the Scenes to the Acts

Next we can link the Scenes to the Acts:

match (l:Line)-->(sc:Scene)
where l.ActSceneLine is not null
merge (a:Act {name: l.Play+" - Act "+l.Act})
merge (sc)-[:PART_OF]->(a);

Linking the Act to the Plays

And finally we can link the Acts to the Plays: 

match (l:Line)-->(sc:Scene)-->(a:Act)
where l.ActSceneLine is not null
merge (p:Play {name: l.Play})
merge (a)-[:PART_OF]->(p); 

One last thing to clean up is the fact that there are some Line nodes that don't have an ActSceneLine property, and therefore don't have a Scene or an Act, but that do need to be linked to the Play:

Some lines don't have Acts, but are part of the Play!

match (l:Line)
where l.ActSceneLine is null
merge (p:Play {name: l.Play})
merge (l)-[:PART_OF]->(p);

Next, we will start making the model a bit more understandable.

Making the model more understandable

Currently, the model basically has every Line connected to a Scene that is in an Act of a Play. That works fine, but it does not give us a lot of clues as to how the play would work. That's why I wanted to create a sequential chain of Lines for every Scene: every Line in the Scene would connect to the next one, and then the next one, and so on. Here's how we do that.

Linking the lines in a loop for every scene

We start by linking the lines in a chain.

Link the lines in a chain

We will use the Dataline property of every Line for this. 

match (l1:Line), (l2:Line)
where id(l1)>id(l2)
and l1.Play = l2.Play
and l1.Dataline = l2.Dataline + 1
create (l2)-[:FOLLOWED_BY]->(l1);
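
As a rough illustration of the rule behind this query, here is a Python sketch on a handful of made-up rows. It only models the two conditions on Play and Dataline and leaves out the id() comparison:

```python
# Hypothetical sample rows: (Dataline, Play). Dataline 4 is absent on purpose.
lines = [(1, "Hamlet"), (2, "Hamlet"), (3, "Hamlet"), (5, "Hamlet"), (6, "Macbeth")]

# Index the rows by Dataline so each successor lookup is a dictionary access.
play_by_dataline = {dataline: play for dataline, play in lines}

followed_by = [
    (dataline, dataline + 1)
    for dataline, play in lines
    # Same rule as the query: same Play, and a Dataline that is exactly one higher.
    if play_by_dataline.get(dataline + 1) == play
]

print(followed_by)  # [(1, 2), (2, 3)]
```

The gap at Dataline 4 and the change of Play between 5 and 6 both break the chain, which is the behaviour we want at the boundaries.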

Then we proceed by connecting the first and last line to the scene with a specific STARTS_WITH and ENDS relationship.

Connect the chain to the Scene with Start and Ending

We find the first Dataline element and start with that: 

match (l:Line)-->(s:Scene)
with s, min(l.Dataline) as startline
match (l:Line)
where l.Dataline = startline
create (s)-[:STARTS_WITH]->(l);

And then we find the last Dataline element and end with that: 

match (l:Line)-->(s:Scene)
with s, max(l.Dataline) as endline
match (l:Line)
where l.Dataline = endline
create (s)<-[:ENDS]-(l); 
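
The min/max aggregation in these two queries can be pictured with a small Python sketch. The scene names follow the naming pattern used above, but the Dataline values are invented for the example:

```python
from collections import defaultdict

# Hypothetical (Scene name, Dataline) pairs, one per Line.
scene_lines = [
    ("Hamlet - Act 1 - Scene 1", 4),
    ("Hamlet - Act 1 - Scene 1", 5),
    ("Hamlet - Act 1 - Scene 1", 6),
    ("Hamlet - Act 1 - Scene 2", 7),
    ("Hamlet - Act 1 - Scene 2", 8),
]

# Group the Dataline values per Scene, like "with s, min(...)" groups by s.
datalines_by_scene = defaultdict(list)
for scene, dataline in scene_lines:
    datalines_by_scene[scene].append(dataline)

# The smallest Dataline is where the Scene starts, the largest is where it ends.
boundaries = {
    scene: (min(datalines), max(datalines))
    for scene, datalines in datalines_by_scene.items()
}

print(boundaries["Hamlet - Act 1 - Scene 1"])  # (4, 6)
```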

Now we can also remove the link between Line and Scene:

match (l:Line)-[pao:PART_OF]-(sc:Scene)
delete pao; 

So how do we use this? We'll take a look at that later when we start querying the data.

Let's now explore some more advanced, data science style use cases for this dataset.

Understanding the importance of different players

One thing that I was trying to figure out is whether the graph could help me understand which characters/players are more important than others. There are different ways of doing that for sure, and I will just explore two in this article.

Which Players have most Lines in a Play?

Sounds like a simple enough proxy for importance, right? If a Player has more lines, there's a likelihood that they will have a more important role in the story. So let's go there.

Linking players to plays

First we need to connect the Players to the Plays for this. That's easy enough - as the indirect connection is of course already there. Here's an easy way to achieve what we need: 

match (pl:Player)-->(l:Line)
with pl, l
match (p:Play {name:l.Play})
merge (pl)-[pi:PLAYS_IN]->(p)
    on create set pi.nroflines=1
    on match set pi.nroflines= pi.nroflines+1;

Note that the [PLAYS_IN] relationship now also aggregates the nroflines that a Player has had in a property on the relationship, aka the number of lines that a Player has spoken in a particular play.
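
The on create / on match pair is essentially a counter. A minimal Python sketch of the same idea, with made-up sample rows, could look like this:

```python
from collections import Counter

# Hypothetical (Player, Play) pairs, one per spoken Line.
spoken_lines = [
    ("ROMEO", "Romeo and Juliet"),
    ("JULIET", "Romeo and Juliet"),
    ("ROMEO", "Romeo and Juliet"),
    ("HAMLET", "Hamlet"),
]

# Each distinct (Player, Play) pair stands for one PLAYS_IN relationship;
# its count stands for the nroflines property on that relationship.
nroflines = Counter(spoken_lines)

print(nroflines[("ROMEO", "Romeo and Juliet")])  # 2
```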

Top 3 players (by number of lines) in every play

Next, I wanted to write a query that would find the top 3 Players in every Play. We use a query with a subquery for that: the first part finds all the Plays, and then for every Play I look for the Players and the number of lines that I have stored on the relationship.

match (p:Play)
call {
    with p
    match (pl:Player)-[pi:PLAYS_IN]->(p)
    return p.name as Play, pl.name as Players, pi.nroflines as NrOfLines
    order by NrOfLines desc
    limit 3 }
return Play, collect(Players) as TopPlayers, collect(NrOfLines) as TopPlayersLines
order by Play; 
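
For readers who want to check the "top 3 per group" logic outside the database, here is a hedged Python sketch. The function name, the player and play names and all the counts are invented for the illustration:

```python
from collections import defaultdict

# Invented nroflines values keyed by (Player, Play); not real counts from the dataset.
nroflines = {
    ("PLAYER_A", "Play One"): 40,
    ("PLAYER_B", "Play One"): 75,
    ("PLAYER_C", "Play One"): 12,
    ("PLAYER_D", "Play One"): 51,
    ("PLAYER_E", "Play Two"): 8,
}


def top_players(counts, n=3):
    """Return the n Players with the most lines for every Play, sorted by Play."""
    per_play = defaultdict(list)
    for (player, play), count in counts.items():
        per_play[play].append((player, count))
    return {
        # Sort each Play's entries by line count, highest first, and keep n of them.
        play: sorted(entries, key=lambda entry: entry[1], reverse=True)[:n]
        for play, entries in sorted(per_play.items())
    }


top = top_players(nroflines)
print(top["Play One"])  # [('PLAYER_B', 75), ('PLAYER_D', 51), ('PLAYER_A', 40)]
```

The sort-then-slice inside the loop plays the same role as the order by and limit inside the subquery: the limit is applied per Play, not to the overall result.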

This already gives us a nice little indication of the importance of the Players, but I would like to suggest a more advanced approach.

Understanding player importance because of the player-to-player relationships

Here's what I want to do: I would like to infer a new kind of relationship in our graph, called RELATED_TO. This relationship would be introduced between two Player nodes if the Players appear together at one of 3 levels:

  1. appearing together in the Play, ie level 1
  2. appearing together in an Act of a Play, ie level 2
  3. appearing together in a Scene of an Act of a Play, ie level 3

This new relationship will create a mono-partite subgraph of (Players)-[:RELATED_TO]->(OtherPlayers), which will be very useful for graph data science work later on. So let's create this.

Level 1: Players in Plays that have played together

Here's the query for that: 

match (pl1:Player)-->(p:Play)<--(pl2:Player)
where id(pl1)>id(pl2)
merge (pl1)-[r:RELATED_TO]->(pl2)
set r.level=1;

Level 2: Players in Acts that have played together

This will require a two step process:

Step 1: Link Players to Acts

It's very similar to how we linked Players to Plays: 

match (pl:Player)-->(l:Line)
with pl, l
match (a:Act {name:l.Play+" - Act "+l.Act})
merge (pl)-[pi:PLAYS_IN]->(a)
     on create set pi.nroflines=1
     on match set pi.nroflines= pi.nroflines+1;

Step 2: Relate the Players if they were in the same Act

Here's how we can create the relationships between players based on being in the same act: 

match (pl1:Player)-->(a:Act)<--(pl2:Player)
where id(pl1)>id(pl2)
merge (pl1)-[r:RELATED_TO]->(pl2)
set r.level=2;

Level 3: Players in Scenes that have played together

Again, we need two steps:

Step 1: Link Players to Scenes

We go about this in a very similar way: 

match (pl:Player)-->(l:Line)
with pl, l
match (s:Scene {name:l.Play+" - Act "+l.Act+" - Scene "+l.Scene})
merge (pl)-[pi:PLAYS_IN]->(s)
     on create set pi.nroflines=1
     on match set pi.nroflines= pi.nroflines+1;

Step 2: Relate the Players if they were in the same Scene

Again, very similar to the above: 

match (pl1:Player)-->(s:Scene)<--(pl2:Player)
where id(pl1)>id(pl2)
merge (pl1)-[r:RELATED_TO]->(pl2)
set r.level=3;
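
Because the three queries run in the order 1, 2, 3 and each one sets r.level on the merged relationship, a pair of Players ends up with the deepest level at which they appear together. The following Python sketch illustrates that on invented data; the player names and groupings are hypothetical:

```python
from itertools import combinations

# Hypothetical appearances: which Players show up together at each level.
groups_by_level = {
    1: [{"A", "B", "C", "D"}],       # one Play
    2: [{"A", "B", "C"}, {"D"}],     # the Acts of that Play
    3: [{"A", "B"}, {"C"}, {"D"}],   # the Scenes of those Acts
}

related_to = {}
for level in sorted(groups_by_level):  # 1, then 2, then 3, like the queries
    for players in groups_by_level[level]:
        # Sorting gives every pair one fixed orientation, much like the id() check.
        for pair in combinations(sorted(players), 2):
            related_to[pair] = level   # a later, deeper level overwrites the earlier one

print(related_to[("A", "B")])  # 3
print(related_to[("A", "D")])  # 1
```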

That sets us up nicely for a couple of interesting explorations. Let's get into that.

Some queries and visualisations

Of course there are some great ways to now start working with the data. First we will do some simple queries in the Neo4j Browser.

Look at an entire scene

Let's look at this in two ways:

In the Neo4j Browser

Here's a fairly simple Cypher query, that would look at one entire scene. We are taking a scene from Romeo and Juliet in this case. 

match entirescene = (p:Play)--(a:Act)--(s:Scene)-[:STARTS_WITH]->(firstline:Line)-[:FOLLOWED_BY*]-(lastline:Line)-[:ENDS]-(s)
where p.name contains "Romeo"
with entirescene, nodes(entirescene) as nodes
limit 1
unwind nodes as node
match (node)-[r]-(conn)
return entirescene, node, r, conn;

Obviously that's not the greatest visualisation. So let's improve that.

In Neo4j Bloom

In Neo4j Bloom, we can actually customize this query above, by making it into a search phrase. Essentially we parametrise the Play name in the search phrase (look for the $param in the screenshot below):

The result then looks like this:

This is clearly a lot easier to look at.

Let's look at another query pattern.

Show network of players and their relatedness

Based on the [RELATED_TO] relationship that we created, we can now look at the players and their "network" of interactions during the play.

In the Neo4j Browser

Here's a simple view of the network of Player relations based on the relations above, for the Romeo and Juliet Play. If we run this query:

match (pl1:Player)-[:PLAYS_IN]->(p:Play {name: "Romeo and Juliet"})
with pl1
match playerrelations = (pl1)-[:RELATED_TO]-(pl2:Player)
return playerrelations;

The result very quickly becomes a bit of a hairball:

But luckily, we can also parametrise this as a search phrase in Bloom.

In Neo4j Bloom

Here's what the phrase looks like:

And applying that becomes a much more interesting picture:

Which allows me to very quickly zoom into the more important "Player nodes":

The point here is of course that, without reading a single line of the text, the graph is telling me which Players are likely to be more important than others. I just love that. I think this is why we can also apply this to so many other domains. The graph structure is immediately giving us insights.

Now let's see how we can enhance this even further, by applying graph algorithms from the Graph Data Science Library to this structure. Should be fun!

Running Graph Data Science on the Shakespeare network

Now that we have that RELATED_TO relationship, we can actually do some very interesting graph data science work, as this is now a mono-partite subgraph, containing only Player nodes and RELATED_TO relationships.

I am a big fan of using Neuler for doing some of this simple graph data science work. It's just a few clicks away, and it generates the code for the most interesting algorithms. I have picked two in this case: Pagerank and Betweenness, both of them different variations of Centrality calculation algorithms.

Calculating Pagerank centrality

Here's how we do that. With a few clicks we can actually configure the algorithm on Neuler.

The code that is actually being run for this looks like this: 

:param limit => ( 42);
:param config => ({ nodeProjection: 'Player', relationshipProjection: { relType: { type: 'RELATED_TO', orientation: 'UNDIRECTED', properties: { level: { property: 'level', defaultValue: 1 } } } }, relationshipWeightProperty: 'level', dampingFactor: 0.85, maxIterations: 20, writeProperty: 'pagerank' });
:param communityNodeLimit => ( 10);
CALL gds.pageRank.write($config); 

Once that's done, we can run a very simple Cypher query to show the Pagerank property of all the Players:

match (pl:Player)-[:PLAYS_IN]->(p:Play {name: "Romeo and Juliet"})
return distinct pl.name, pl.pagerank, pl.betweenness
order by pl.pagerank desc
limit 10;
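
For some intuition about what PageRank computes, here is a textbook power iteration in plain Python. It is a sketch under stated assumptions, not the Graph Data Science implementation: the function and the tiny graph are made up, only the damping factor (0.85), the iteration count (20) and the idea of using level as a weight on undirected relationships are taken from the configuration above, and the scores are scaled to sum to 1, which may differ from what GDS writes.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Weighted PageRank on an undirected graph given as {(a, b): weight}."""
    neighbours = {}
    for (a, b), weight in edges.items():
        # Undirected: store the weight in both directions.
        neighbours.setdefault(a, {})[b] = weight
        neighbours.setdefault(b, {})[a] = weight
    nodes = sorted(neighbours)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small base share on each round.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            total_weight = sum(neighbours[node].values())
            for other, weight in neighbours[node].items():
                # A node passes on its rank in proportion to the relationship weights.
                new_rank[other] += damping * rank[node] * weight / total_weight
        rank = new_rank
    return rank


# Hypothetical RELATED_TO relationships with their level as weight.
ranks = pagerank({("A", "B"): 3, ("A", "C"): 2, ("A", "D"): 1})
best = max(ranks, key=ranks.get)
print(best)  # A
```

In this toy graph the Player that is related to everyone else comes out on top, and among the others the one with the highest level towards that Player ranks highest.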

Then we can also run another interesting centrality metric. Here's how we do that:

Calculating Betweenness centrality

With a few clicks we can actually configure the algorithm on Neuler.

:param limit => ( 42);
:param config => ({ nodeProjection: 'Player', relationshipProjection: { relType: { type: 'RELATED_TO', orientation: 'UNDIRECTED', properties: {} } }, writeProperty: 'betweenness' });
:param communityNodeLimit => ( 10);
CALL gds.betweenness.write($config);

Once that's done, we can run a very simple Cypher query to show the Betweenness of the Players:

match (pl:Player)-[:PLAYS_IN]->(p:Play {name: "Romeo and Juliet"})
return distinct pl.name, pl.pagerank, pl.betweenness
order by pl.betweenness desc
limit 10;
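
Betweenness counts how often a node sits on the shortest paths between other nodes. As an illustration only, here is a compact Python version of Brandes' algorithm for unweighted, undirected graphs. The function and the three-node graph are hypothetical, and the absolute numbers may be scaled differently from what GDS writes to the betweenness property.

```python
from collections import deque


def betweenness(adjacency):
    """Brandes' algorithm on an unweighted, undirected graph: {node: set(neighbours)}."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts the shortest paths from the source.
        order = []
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for other in adjacency[node]:
                if distance[other] < 0:
                    distance[other] = distance[node] + 1
                    queue.append(other)
                if distance[other] == distance[node] + 1:
                    paths[other] += paths[node]
                    predecessors[other].append(node)
        # Walk back from the farthest nodes, handing each node its share of the paths.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1.0 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Every undirected pair was counted from both ends, so halve the totals.
    return {node: value / 2.0 for node, value in score.items()}


# Hypothetical chain of three Players: A - B - C.
scores = betweenness({"A": {"B"}, "B": {"A", "C"}, "C": {"B"}})
print(scores)  # {'A': 0.0, 'B': 1.0, 'C': 0.0}
```

Only B lies between two other Players here, so it is the only node with a non-zero score.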

No doubt there are tons of additional things we could do with this dataset, but here's where my exercise will end. I am hoping that this was a useful story for you - it definitely was for me. All the code for this exercise is also available as a .mdx markdown file on GitHub. Download that file and you will immediately have a Neo4j Browser guide that walks you through this entire post right inside Neo4j - isn't that handy?

All the best

Rik Van Bruggen 
