So: now that we have had some fun setting up our local MusicBrainz database (part 1) and importing the data into our Neo4j database (part 2), we can start exploring. That means checking whether the actual 6 degrees of Kanye West, and the actual "Kanye Number", can be found and reproduced in our Neo4j database in an efficient way. Let's take a look at that.
Note: part of this effort was motivated by the fact that the Python code that powers the above website caches its results (see the GitHub repo for more info) rather than calculating the Kanye Number in real time like we will do here. I guess that speaks to the power of graph databases, right?
But let's take a look at some queries.
Find other artists that worked together with Kanye
Let's start with something simple:
match (kanye:Artist {name: "Kanye West"})--(r:Recording)--(a2:Artist)
return kanye,r,a2
limit 100
That gives you a bit of a peek already:
Now let's continue on our Kanye Number mission.
Find Kanye Number for Helmut Lotti
Here's the query for that:
match (kanye:Artist {name: "Kanye West"})
with kanye
match (a:Artist)
where a.name contains "Helmut Lotti"
with kanye, a
match path = shortestpath((kanye)-[*]-(a))
return a.name, length(path)/2 as kanyenumber;
Of course, you can also look at that graphically, and return ALL the shortest paths as paths:
match (kanye:Artist {name: "Kanye West"})
with kanye
match (a:Artist)
where a.name contains "Helmut Lotti"
with kanye, a
match path = allshortestpaths((kanye)-[*]-(a))
return path;
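The division by two in the Kanye Number queries above deserves a quick illustration. Here is a minimal Python sketch, assuming the graph only links artists to recordings (as in the first query), so every step from one artist to the next costs two relationships. The toy recordings and the names `build_graph` and `shortest_hops` are made up for this example and are not part of the MusicBrainz import.

```python
from collections import deque

def build_graph(recordings):
    """Build an undirected artist <-> recording adjacency map from toy data."""
    graph = {}
    for recording, artists in recordings.items():
        for artist in artists:
            graph.setdefault(recording, set()).add(artist)
            graph.setdefault(artist, set()).add(recording)
    return graph

def shortest_hops(graph, start, goal):
    """Breadth-first search: number of relationships on a shortest path, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None  # no connection at all

# Hypothetical toy data: recording -> artists credited on it
recordings = {
    "rec1": ["Kanye West", "Artist A"],
    "rec2": ["Artist A", "Artist B"],
}
graph = build_graph(recordings)
hops = shortest_hops(graph, "Kanye West", "Artist B")
kanye_number = hops // 2  # artist -> recording -> artist counts as one step
print(hops, kanye_number)
```

In this toy graph the path runs Kanye West, rec1, Artist A, rec2, Artist B, which is four relationships and therefore a Kanye Number of two.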
Find Kanye Number for Bruce Springsteen using fulltext
Now, you will have noticed that I had to spell Helmut Lotti with a capital H and a capital L. That's not very handy, as we never really know how an artist's name may be spelled. So I implemented a so-called fulltext index on the .name property of the Artist nodes. This allows us to ignore the case of the artist's name, and even get a bit of a benefit from the smart Lucene indexing that Neo4j provides for fulltext search.
Let's use a search for Bruce Springsteen, uncapitalised, as an example. The query for the Kanye Number then looks a little different, as we have to look up The Boss a little differently:
match (kanye:Artist {name: "Kanye West"})
with kanye
CALL db.index.fulltext.queryNodes("fulltext_artist_name", "bruce springsteen") YIELD node
with kanye, node
limit 5
match path = shortestpath((kanye)-[*]-(node))
return node.name, length(path)/2 as kanyenumber;
So that worked really well too :) ... and it works in near real time for most, if not all, artists in the MusicBrainz database. Pretty neat!
Now, let's see how we could mimic the process that 6degreesofkanyewest.com uses, i.e. precalculating the Kanye Number for every artist. In my mind, this is not a great approach (as we would need to rerun this with every new release, right?), but it's very much possible to do this in Neo4j. Let's see how that would work.
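To give a rough feel for why the fulltext lookup ignores case and can return more than one candidate (hence the limit in the query), here is a small Python sketch. It is a deliberate simplification and not how Lucene is implemented: it just lowercases, splits on whitespace, and ranks names by the number of shared words. The name list and the helpers `tokenize` and `search_names` are invented for illustration.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a crude stand-in for a text analyzer)."""
    return set(text.lower().split())

def search_names(names, query):
    """Rank names by how many query words they share; drop names with no overlap."""
    wanted = tokenize(query)
    scored = [(name, len(wanted & tokenize(name))) for name in names]
    hits = [(name, score) for name, score in scored if score > 0]
    # sorted() is stable, so equally scored names keep their input order
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Toy list of names, not taken from the database
names = ["Helmut Lotti", "Bruce Hornsby", "Bruce Springsteen"]
results = search_names(names, "bruce springsteen")
print(results)
```

The best match comes first, but partial matches also come back, which is one reason to cap the number of candidate nodes before running the shortest path search.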
Calculate the Kanye Number in Batch for ALL artists
Essentially, we would need / want to create a process that runs over all the artists, runs the shortestpath algo, and then writes back the found number to the database. Here's a query that does that, for 500 artists at a time:
// skip Kanye himself: shortestpath expects different start and end nodes by default
call apoc.periodic.iterate(
"match (a:Artist), (k:Artist {name:'Kanye West'})
where a <> k
return a,k",
"match path = shortestpath((k)-[*]-(a))
set a.kanyenr = length(path)/2",
{batchSize:500, parallel:false});
Now, there's no magic here: running that query takes hours. I had to let it churn away at the dataset for an entire night (38406500 ms / 1000 / 60 / 60 ≈ 10.67 hours), but then it was done:
Of course, the big plus then is that you can now just read the Kanye Number from the kanyenr property:
match (a:Artist)
where a.name contains "Bruce Springsteen"
return a.name, a.kanyenr
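For comparison with the overnight batch run, here is a minimal Python sketch of a different way to think about precomputing the numbers: a single breadth-first traversal that starts at Kanye and labels every reachable artist once, instead of one shortest path search per artist. This is not what the query above does and I have not timed it against the real dataset; the toy data and the function name `all_kanye_numbers` are assumptions made for the example.

```python
from collections import deque

def all_kanye_numbers(recordings, source):
    """Label every artist reachable from `source` with its Kanye Number."""
    # Index: artist -> recordings that artist appears on
    by_artist = {}
    for recording, artists in recordings.items():
        for artist in artists:
            by_artist.setdefault(artist, []).append(recording)

    numbers = {source: 0}
    seen_recordings = set()
    queue = deque([source])
    while queue:
        artist = queue.popleft()
        for recording in by_artist.get(artist, ()):
            if recording in seen_recordings:
                continue  # already expanded from an artist at least as close
            seen_recordings.add(recording)
            for other in recordings[recording]:
                if other not in numbers:
                    numbers[other] = numbers[artist] + 1
                    queue.append(other)
    return numbers

# Hypothetical toy data: recording -> artists credited on it
recordings = {
    "rec1": ["Kanye West", "Artist A"],
    "rec2": ["Artist A", "Artist B"],
    "rec3": ["Artist C", "Artist D"],  # not connected to Kanye
}
numbers = all_kanye_numbers(recordings, "Kanye West")
print(numbers)
```

Artists that are not connected to Kanye at all simply get no number, which matches the batch query: when no path is found, no kanyenr property is written.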
There are a few more interesting experiments that I wanted to take a look at.
What are the most important artists with lowest Kanyenr
This was something I wanted to know. Artists with a low Kanyenr (i.e. they have worked directly with the Big K himself) are not all equal: some artists have worked with Kanye a LOT, and others just once or twice. So the number of shared recordings matters here. Let's run that query:
match (kanye:Artist {name: "Kanye West"})--(r:Recording)--(a2:Artist)
return distinct a2.name, count(r) as nrofrecordings, a2.kanyenr
order by nrofrecordings desc
limit 100;
So we see that Jay-Z and Kanye are actually great friends, it seems. Or, as I just found out (there goes my pop music street credibility!!!), they were friends, then they weren't, and now they are again? OMG!
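The counting logic of that query can be sketched in a few lines of Python: for every recording Kanye appears on, count each other credited artist once. The toy data and the function name `top_collaborators` are hypothetical and only meant to mirror the shape of the Cypher query.

```python
from collections import Counter

def top_collaborators(recordings, artist, limit=100):
    """Count shared recordings per collaborator, most frequent first."""
    counts = Counter()
    for artists in recordings.values():
        if artist in artists:
            # count every other artist once per shared recording
            counts.update(a for a in set(artists) if a != artist)
    return counts.most_common(limit)

# Hypothetical toy data: recording -> artists credited on it
recordings = {
    "rec1": ["Kanye West", "Artist A"],
    "rec2": ["Kanye West", "Artist A", "Artist B"],
    "rec3": ["Artist B", "Artist C"],
}
ranking = top_collaborators(recordings, "Kanye West")
print(ranking)
```

Every artist in this ranking has a Kanye Number of one by construction, so the count of shared recordings is what separates them.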
Recordings with the most artists
Another interesting thought that came to mind: which recording would actually have the most participating artists? I am thinking Band Aid or USA for Africa style.
Well, let's take a look at the latter first:
That's interesting - there are about 54 artists in that subgraph - but is that it? Are there recordings with more artists?
Let's explore:
match (r:Recording)
with r, apoc.node.degree(r) as degree
order by degree desc
limit 10
match (a:Artist)-[rec:RECORDED]->(r)
return r.name as Recording, degree as NrOfArtists, apoc.coll.sort(collect(a.name));
That's super interesting. Lots of Asian songs at the top, but also this We Have Seen God's Glory song, which does indeed host a very large number of participating artists!
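As a small sketch of the ranking idea behind that query, the Python below orders toy recordings by how many distinct artists are credited on them. One caveat, stated as an assumption: the Cypher version ranks by node degree, which equals the number of artists only if a Recording node has no relationships other than those to its artists. The data and the name `recordings_by_artist_count` are invented for illustration.

```python
def recordings_by_artist_count(recordings, limit=10):
    """Return (recording, number of artists, sorted artist names), largest first."""
    ranked = sorted(
        recordings.items(),
        key=lambda item: len(set(item[1])),
        reverse=True,
    )
    return [
        (recording, len(set(artists)), sorted(set(artists)))
        for recording, artists in ranked[:limit]
    ]

# Hypothetical toy data: recording -> artists credited on it
recordings = {
    "duet": ["Artist B", "Artist A"],
    "charity single": ["Artist C", "Artist A", "Artist B", "Artist D"],
    "solo": ["Artist A"],
}
top = recordings_by_artist_count(recordings, limit=2)
print(top)
```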
Last but not least: exploring MusicBrainz in Bloom
Getting to the end of our exploration here, I decided to take the dataset for a spin inside Neo4j Bloom as well. So I created a perspective and a few search phrases as well:
Helmut Lotti's Kanyenr in Bloom using a search phrase
This search phrase allows me to look at Helmut Lotti's Kanye Number:
match (kanye:Artist {name: "Kanye West"})
with kanye
match (a:Artist)
where a.name contains $artistname
with kanye, a
match path = allshortestpaths((kanye)-[*]-(a))
return path
Pretty neat!
Co-recordings of artist in Bloom
And this one allows me to look at the co-recordings of an artist:
match path = (a:Artist)-[:RECORDED]->(r:Recording)<-[:RECORDED]-(a2:Artist)
where a.name contains $artist
return path
There are obviously endless additional possibilities, and we can share more of them in the GitHub repo.
In any case I hope you have enjoyed reading my experiments here - and I look forward to your feedback.
Cheers
Rik Van Bruggen
This post is part of a 3-part series.