Monday, 12 January 2015

Graph Karaoke using "Natural Language Analytics": Billie Jean

Last week, my friend and colleague Michael wrote a really interesting blogpost on natural language analytics using Neo4j. He used the One Ring poem as an example of how you could use Cypher to analyse a text file and put it into a Neo4j database for some advanced analytics. That immediately made me think about my Graph Karaoke Playlist, and how I could use this technique for some more Graph Karaoke generation. Wouldn't that be nice? More graph karaoke == good!

So in this post I will show you how easy it is to get this done. A couple of quick steps are all that is needed, so let's run through them.

Loading a song

The first thing to do, as always, was picking a song. So this time, my kids picked it:
Billie Jean, by the King of Pop himself. Not wanting to sound pretentious, but I think it's great for my kids to be big fans of that kind of music - seems like all of our educational efforts are yielding some results :) ...

Then I picked up the lyrics of the song over here, and put them into a Google sheet. The reason is that I wanted to do one small manipulation to the file in order to be able to use it for Karaoke: I added the Songpart and the Songpartsentence in two additional columns. Plus: the Google sheet has a very easy conversion into a csv file that we can then point the Load CSV process to.

Customizing the query

With that CSV file available, I then proceeded to customize Michael's query. Here it is:

 //create the karaoke graph  
 load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE/export?format=csv&id=1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE&gid=0" as csv  
 with csv.Songpart as songpart, csv.Songpartsentence as songpartsentence, csv.Songsentence as row  
 unwind row as text  
 with songpart, songpartsentence, reduce(t=tolower(text), delim in [",",".","!","?",'"',":",";","'","-"] | replace(t,delim,"")) as normalized  
 with songpart, songpartsentence, [w in split(normalized," ") | trim(w)] as words  
 unwind range(0,size(words)-2) as idx  
 MERGE (w1:Word {name:words[idx]})  
 MERGE (w2:Word {name:words[idx+1]})  
 MERGE (w1)-[r:NEXT {songpart:toInt(songpart), songpartsentence:toInt(songpartsentence)}]->(w2)  
  ON CREATE SET r.count = 1 ON MATCH SET r.count = r.count +1  

Let's run through this query to make it easier for you to digest. We start with the "load csv" statement. We point to the csv download link mentioned above, use the first row as headers, and bind each row to an identifier called "csv".
 load csv with headers from "https://docs.google.com/a/neotechnology.com/spreadsheets/d/1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE/export?format=csv&id=1DLu2bl1ZO7Zm8zU1UXNCDZGxsnBkicAJD4J-FSbVXLE&gid=0" as csv  

Then we pull three columns out of the csv so that we can address them separately with separate identifiers:
 with csv.Songpart as songpart, csv.Songpartsentence as songpartsentence, csv.Songsentence as row   
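
For readers who think more easily in a general-purpose language, here is a rough Python sketch of what these first two steps amount to. It is only an illustration, not part of the actual load: the sample data is made up, and `load_rows` is a hypothetical helper name. Only the three column names come from the sheet.

```python
import csv
import io

# Hypothetical sample in the same three-column layout as the sheet.
SAMPLE = """Songpart,Songpartsentence,Songsentence
1,1,first line of part one
1,2,second line of part one
2,1,first line of part two
"""

def load_rows(handle):
    """Return (songpart, songpartsentence, text) tuples, one per CSV row."""
    reader = csv.DictReader(handle)  # the first row supplies the column names
    return [
        (r["Songpart"], r["Songpartsentence"], r["Songsentence"])
        for r in reader
    ]

rows = load_rows(io.StringIO(SAMPLE))
print(rows[0])
```

Note that every value arrives as a string here, just as it does with Load CSV, which is why the query converts the two identifiers with toInt before storing them.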

Then we use the Cypher "unwind" operator to create separate rows out of the "row" collection, and call these rows containing lyrics "text". 
 unwind row as text  

Afterwards, we are going to use "reduce" to remove punctuation marks and then split the text into individual lyrical words:
 with songpart, songpartsentence, reduce(t=tolower(text), delim in [",",".","!","?",'"',":",";","'","-"] | replace(t,delim,"")) as normalized  
 with songpart, songpartsentence, [w in split(normalized," ") | trim(w)] as words  
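
If the "reduce" expression looks a bit dense, the same idea can be sketched in Python, where `functools.reduce` plays the same role: start from the lower-cased text and apply one `replace` per delimiter. The function names and the sample sentence are mine, for illustration only. The delimiter list is the one from the query.

```python
from functools import reduce

# Same delimiter list as in the Cypher query.
DELIMS = [",", ".", "!", "?", '"', ":", ";", "'", "-"]

def normalize(text):
    """Lower-case the text, then delete each delimiter in turn."""
    return reduce(lambda t, delim: t.replace(delim, ""), DELIMS, text.lower())

def to_words(text):
    """Split on single spaces and trim each piece, like split() plus trim()."""
    return [w.strip() for w in normalize(text).split(" ")]

# Hypothetical input line, not taken from the spreadsheet.
print(to_words("Hello, World! It's a test-line."))
# prints ['hello', 'world', 'its', 'a', 'testline']
```

One thing this makes visible: because the delimiters are deleted rather than replaced by a space, an apostrophe or hyphen inside a word simply glues the two halves together ("it's" becomes "its", "test-line" becomes "testline").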

Lastly, we want to write these words into the graph. In order to do that, we are going to use "unwind" to generate an index over the word positions, and then step through every sentence to generate the sequences. We do that with "Merge", first for the words, and then for the relationships. We "karaoke-ize" the graph by assigning "songpart" and "songpartsentence" identifiers to every relationship.
 UNWIND range(0,size(words)-2) as idx  
 MERGE (w1:Word {name:words[idx]})  
 MERGE (w2:Word {name:words[idx+1]})  
 MERGE (w1)-[r:NEXT {songpart:toInt(songpart), songpartsentence:toInt(songpartsentence)}]->(w2)  
  ON CREATE SET r.count = 1 ON MATCH SET r.count = r.count +1  
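
To see what this last step produces, here is a small in-memory Python sketch of the same logic. It is an analogy under my own assumptions, not how Neo4j works internally: a set stands in for the merged Word nodes, and a counter keyed on (word, next word, songpart, songpartsentence) stands in for the merged NEXT relationships and their count. `next_pairs` and `build_graph` are hypothetical names.

```python
from collections import Counter

def next_pairs(words):
    """Yield (word, following word) for each adjacent pair in a line."""
    # Cypher's range(0, size(words)-2) is inclusive at both ends,
    # so it covers the same indices as range(len(words) - 1) here.
    for idx in range(len(words) - 1):
        yield words[idx], words[idx + 1]

def build_graph(rows):
    """rows: iterable of (songpart, songpartsentence, list of words)."""
    nodes = set()     # one entry per distinct word, like MERGE on :Word
    rels = Counter()  # key mirrors the MERGE pattern, value mirrors r.count
    for songpart, sentence, words in rows:
        for w1, w2 in next_pairs(words):
            nodes.update((w1, w2))
            rels[(w1, w2, int(songpart), int(sentence))] += 1
    return nodes, rels

# Hypothetical rows, already split into words.
nodes, rels = build_graph([
    ("1", "1", ["a", "b", "a", "b"]),
    ("1", "2", ["a", "b"]),
])
print(sorted(nodes))
print(rels[("a", "b", 1, 1)])
```

Because the songpart and songpartsentence values are part of the key, the same word pair in a different sentence gets its own relationship, and the count only goes up when a pair repeats within the same sentence. A line with a single word yields no pairs at all, so it adds nothing to the graph.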

That was easy!

So where is the KARAOKE???

Hah! That's what you came here for huh? Well, here's the result. 

I have put the queries on a gist so that you can take a look at them yourself. If you have any comments, then please let me know!

Cheers

Rik
