Last week, my friend and colleague Michael wrote a really interesting blogpost on natural language analytics using Neo4j. He used the One Ring poem as an example of how you could use Cypher to analyse a text file and put it into a Neo4j database for some advanced analytics. That immediately made me think about my Graph Karaoke Playlist, and how I could use this technique for some more Graph Karaoke generation. Wouldn't that be nice? More graph karaoke == good!
So in this post I will show you how easy it is to get this done. A couple of quick steps is all that is needed. Let's run through them.
Loading a song
The first thing to do, as always, was picking a song. This time, my kids picked it: Billie Jean, by the King of Pop himself. Not wanting to sound pretentious, but I think it's great for my kids to be big fans of that kind of music - seems like all of our educational efforts are yielding some results :) ...
Then I picked up the lyrics of the song over here, and put them into a Google doc. The reason is that I wanted to do one small manipulation to the file in order to be able to use it for Karaoke: I added the Songpart and the Songpartsentence in two additional columns. Plus: the Google sheet has a very easy conversion into a csv file that we can then point the Load CSV process to.
Customizing the query
With that CSV file available, I then proceeded to customize Michael's query. Here it is:
//create the karaoke graph
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE/export?format=csv&id=1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE&gid=0" as csv
with csv.Songpart as songpart, csv.Songpartsentence as songpartsentence, csv.Songsentence as row
unwind row as text
with songpart, songpartsentence, reduce(t=tolower(text), delim in [",",".","!","?",'"',":",";","'","-"] | replace(t,delim,"")) as normalized
with songpart, songpartsentence, [w in split(normalized," ") | trim(w)] as words
unwind range(0,size(words)-2) as idx
MERGE (w1:Word {name:words[idx]})
MERGE (w2:Word {name:words[idx+1]})
MERGE (w1)-[r:NEXT {songpart:toInt(songpart), songpartsentence:toInt(songpartsentence)}]->(w2)
ON CREATE SET r.count = 1 ON MATCH SET r.count = r.count +1
Let's run through this query to make it easier for you to digest. We start with the "load csv" statement. We point to the csv download link mentioned above, use the first row as headers, and bind every row to an identifier called "csv".
load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE/export?format=csv&id=1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE&gid=0" as csv
Then we pull the three columns of every csv row into separate identifiers that we can address individually:
with csv.Songpart as songpart, csv.Songpartsentence as songpartsentence, csv.Songsentence as row
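To make the shape of the data concrete, here is a small Python sketch of what "with headers" gives you: every row becomes a lookup by column name. The three column names come from the query above; the column order and the sample sentences are made up for illustration and are not the real sheet contents.

```python
import csv
import io

# Hypothetical sample in the assumed layout of the sheet (not the real lyrics).
SAMPLE_CSV = """Songpart,Songpartsentence,Songsentence
1,1,"First sentence, of part one"
1,2,Second sentence of part one
2,1,First sentence of part two
"""

# DictReader uses the first row as headers, much like "load csv with headers".
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

for row in rows:
    # Every value arrives as a string, so the numbers need an explicit conversion.
    songpart = int(row["Songpart"])
    songpartsentence = int(row["Songpartsentence"])
    print(songpart, songpartsentence, row["Songsentence"])
```

Note that the values come in as strings. That is also the case for Load CSV, which is why the query converts songpart and songpartsentence with toInt further down.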
Then we use the Cypher "unwind" operator to create separate rows out of the "row" collection, and call these rows containing lyrics "text".
Afterwards, we are going to use "reduce" to remove punctuation marks, and then split the text into individual words:
with songpart, songpartsentence, reduce(t=tolower(text), delim in [",",".","!","?",'"',":",";","'","-"] | replace(t,delim,"")) as normalized
with songpart, songpartsentence, [w in split(normalized," ") | trim(w)] as words
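If you want to see what these two lines do to a sentence without firing up a database, the same transformation can be sketched in Python. This is only an analogue of the Cypher, not part of the query: the function names are my own, and the input sentence is a made-up example.

```python
from functools import reduce

# Same punctuation list as in the Cypher query.
DELIMITERS = [",", ".", "!", "?", '"', ":", ";", "'", "-"]

def normalize(text):
    """Lower-case the text, then remove each delimiter in turn (the "reduce" step)."""
    return reduce(lambda t, delim: t.replace(delim, ""), DELIMITERS, text.lower())

def to_words(text):
    """Split on single spaces and trim every piece (the "split" and "trim" step)."""
    return [w.strip() for w in normalize(text).split(" ")]

# Hypothetical input line, for illustration only.
words = to_words("Hello, it's a graph-karaoke song!")
print(words)  # ['hello', 'its', 'a', 'graphkaraoke', 'song']
```

One thing this makes visible: the punctuation is removed, not replaced by a space, so "it's" becomes "its" and hyphenated words are glued together.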
Lastly, we want to write these words into the graph. In order to do that, we use "unwind" over a range to generate an in-memory index, and then step through every sentence to generate the word sequences. We do that with "Merge", first for the words, and then for the relationships. On every relationship, we "karaoke-ize" the graph by assigning the "songpart" and "songpartsentence" identifiers as properties.
UNWIND range(0,size(words)-2) as idx
MERGE (w1:Word {name:words[idx]})
MERGE (w2:Word {name:words[idx+1]})
MERGE (w1)-[r:NEXT {songpart:toInt(songpart), songpartsentence:toInt(songpartsentence)}]->(w2)
ON CREATE SET r.count = 1 ON MATCH SET r.count = r.count +1
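The pairing and counting logic can also be sketched in Python, as I read the query. The names next_pairs, add_sentence and relationships are hypothetical helpers, and the word lists are made up. The key assumption is that Merge only matches an existing NEXT relationship when both words and both karaoke properties are the same, so the counter is keyed on all four.

```python
from collections import Counter

def next_pairs(words):
    """Adjacent word pairs, like unwinding range(0, size(words)-2) and reading idx, idx+1."""
    return [(words[idx], words[idx + 1]) for idx in range(len(words) - 1)]

def add_sentence(relationships, words, songpart, songpartsentence):
    """Count every NEXT pair, keyed by the words plus the two karaoke properties."""
    for w1, w2 in next_pairs(words):
        # A missing key starts at 0, so the first hit gives 1 (ON CREATE),
        # and every later hit adds 1 (ON MATCH).
        relationships[(w1, w2, songpart, songpartsentence)] += 1

relationships = Counter()
# Hypothetical word lists, for illustration only.
add_sentence(relationships, ["la", "la", "la", "hey"], 1, 1)
add_sentence(relationships, ["la", "la"], 1, 2)

for key, count in relationships.items():
    print(key, count)
```

In this sketch the pair ("la", "la") gets a count of 2 within sentence 1, but a separate entry with a count of 1 for sentence 2, because the songpartsentence differs. A sentence with a single word produces no pairs at all.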
That was easy!
So where is the KARAOKE???
Hah! That's what you came here for huh? Well, here's the result.
I have put the queries on a gist so that you can take a look at them yourself. If you have any comments, then please let me know!
Cheers
Rik