Monday, 2 August 2021

Summer fun with MusicBrainz: the "real" Six Degrees of Kanye West (part 3/3)

So: now that we have set up our local MusicBrainz database (part 1) and imported the data into our Neo4j database (part 2), we can start having some real fun. That means checking whether the actual Six Degrees of Kanye West, and the actual "Kanye Number", can be found and reproduced in our Neo4j database in an efficient way. Let's take a look at that.

Note: part of this effort was motivated by the fact that the Python code that powers the Six Degrees of Kanye West website caches its results (see the GitHub repo for more info) rather than calculating the Kanye Number in real time, as we will do here. I guess that speaks to the power of graph databases, right?

But let's take a look at some queries.

Find other artists that worked together with Kanye

Let's start with something simple:

match (kanye:Artist {name: "Kanye West"})--(r:Recording)--(a2:Artist)
return kanye,r,a2
limit 100

That gives you a bit of a peek already: Kanye's co-recorders

Now let's continue on our Kanye Number mission.

Find Kanye Number for Helmut Lotti

Here's the query for that:

match (kanye:Artist {name: "Kanye West"})
with kanye
match (a:Artist)
where a.name contains "Helmut Lotti"
with kanye, a
match path = shortestpath((kanye)-[*]-(a))
return a.name, length(path)/2 as kanyenumber;

Helmut Lotti KanyeNR
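
Why divide the path length by two? Assuming, as the queries in this post suggest, that artists are only linked to each other through Recording nodes, every collaboration hop costs two relationships: Artist to Recording, then Recording to Artist. Here is a minimal Python sketch of that idea on a toy graph. The artist and recording names are made up for illustration, and the breadth-first search merely stands in for Cypher's shortestpath; it is not a claim about how Neo4j implements it.

```python
from collections import deque

# Toy bipartite graph (hypothetical data): artists link only to recordings.
RECORDED = {
    "Artist A": ["Recording 1"],
    "Artist B": ["Recording 1", "Recording 2"],
    "Artist C": ["Recording 2"],
    "Artist D": [],  # never recorded with anyone in this toy graph
}

def build_adjacency(recorded):
    """Build an undirected adjacency map over artists and recordings."""
    adjacency = {}
    for artist, recordings in recorded.items():
        adjacency.setdefault(artist, set())
        for recording in recordings:
            adjacency[artist].add(recording)
            adjacency.setdefault(recording, set()).add(artist)
    return adjacency

def shortest_path_length(adjacency, start, goal):
    """Return the number of relationships on a shortest path, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def collaboration_number(adjacency, start, goal):
    """Halve the path length, mirroring length(path)/2 in the query."""
    length = shortest_path_length(adjacency, start, goal)
    return None if length is None else length // 2

graph = build_adjacency(RECORDED)
print(collaboration_number(graph, "Artist A", "Artist B"))  # direct co-recording
print(collaboration_number(graph, "Artist A", "Artist C"))  # one artist in between
```

In such a graph the path between two artists always has an even number of relationships, so the integer division loses nothing.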

Of course, you can also look at that graphically, and return ALL the shortest paths as paths:

match (kanye:Artist {name: "Kanye West"})
with kanye
match (a:Artist)
where a.name contains "Helmut Lotti"
with kanye, a
match path = allshortestpaths((kanye)-[*]-(a))
return path;

Helmut Lotti KanyeNr as a graph

Find Kanye Number for Bruce Springsteen using fulltext

Now, you will have noticed that I had to spell Helmut Lotti with a capital H and a capital L. That's not very handy, as we never really know how an artist's name may be spelled. So I implemented a so-called fulltext index on the .name property of the Artist nodes. This allows us to ignore the case of the artist names, and even to benefit from the Lucene-based text analysis that Neo4j provides for fulltext indexing.

Let's use a search for Bruce Springsteen, uncapitalised, as an example. The query for the Kanye Number then looks a little different, as we have to look up The Boss a little differently:

match (kanye:Artist {name: "Kanye West"})
with kanye
CALL db.index.fulltext.queryNodes("fulltext_artist_name", "bruce springsteen") YIELD node
with kanye, node
limit 5
match path = shortestpath((kanye)-[*]-(node))
return node.name, length(path)/2 as kanyenumber;

Bruce Springsteen Kanyenr fulltext

So that worked really well too :) ... and it works in near real time for most, if not all, artists in the MusicBrainz database. Pretty neat!

Now, let's see how we could mimic the process that 6degreesofkanyewest.com uses, i.e. precalculating the Kanye Number for every artist. In my mind, this is not a great approach (as we would need to rerun this with every new release, right?), but it's very much possible to do this in Neo4j. Let's see how that would work.

Calculate the Kanye Number in Batch for ALL artists

Essentially, we need a process that runs over all the artists, runs the shortestpath algorithm for each of them, and then writes the resulting number back to the database. Here's a query that does that, 500 artists at a time:

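// Skip Kanye himself: by default, shortestpath raises an error
// when the start node and the end node are the same.
call apoc.periodic.iterate(
  "match (a:Artist), (k:Artist {name:'Kanye West'})
  where a <> k
  return a,k",
  "match path = shortestpath((k)-[*]-(a))
    set a.kanyenr = length(path)/2",
  {batchSize:500, parallel:false});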

Calculate Kanyenr in batch
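
To make the shape of that batch job a bit more concrete, here is a rough Python sketch of the same loop on a toy graph: split the artists into batches, compute one shortest path per artist, and write half its length back as a property. All names and data below are hypothetical, the dictionary only stands in for the kanyenr property, and this is in no way how apoc.periodic.iterate works internally.

```python
from collections import deque
from itertools import islice

def batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def path_length(adjacency, start, goal):
    """Breadth-first search; returns the hop count or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def write_numbers(adjacency, artists, source, batch_size=500):
    """Compute and store one number per artist, one batch at a time."""
    properties = {}  # stands in for the kanyenr property on each node
    committed = 0    # number of batches processed
    for batch in batches(artists, batch_size):
        for artist in batch:
            if artist == source:
                continue  # no path from the source artist to itself
            length = path_length(adjacency, source, artist)
            if length is not None:  # unreachable artists get no property
                properties[artist] = length // 2
        committed += 1
    return properties, committed

# Hypothetical toy data: "K" plays the role of Kanye West.
adjacency = {
    "K": {"r1"}, "A": {"r1", "r2"}, "B": {"r2"}, "C": set(),
    "r1": {"K", "A"}, "r2": {"A", "B"},
}
numbers, committed = write_numbers(adjacency, ["K", "A", "B", "C"], "K", batch_size=2)
print(numbers, committed)
```

The sketch also shows why the job is slow: it starts a fresh search for every single artist, and there are a lot of artists.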

Now, there's no magic here: running that query takes hours - I had to let it churn away at the dataset for an entire night (38406500 ms / 1000 / 60 / 60 = about 10.67 hours), but then it was done:

Result of batch kanyenr calculation
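
As a quick sanity check on that runtime, the conversion from milliseconds to hours can be redone in a couple of lines of Python. The variable names are mine; the only input is the millisecond figure quoted above.

```python
from datetime import timedelta

elapsed_ms = 38_406_500  # runtime of the batch query, in milliseconds

hours = elapsed_ms / 1000 / 60 / 60
print(f"{hours:.2f} hours")  # 10.67 hours

print(timedelta(milliseconds=elapsed_ms))  # 10:40:06.500000
```

So that is roughly ten hours and forty minutes.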

Of course, the big plus is that you can now simply read the Kanye Number from the kanyenr property:

match (a:Artist)
where a.name contains "Bruce Springsteen"
return a.name, a.kanyenr

Reading kanyenr from property

There are a few more interesting experiments that I wanted to take a look at.

What are the most important artists with lowest Kanyenr

This was something I wanted to know. Artists with a low Kanyenr (i.e. they have worked directly with the Big K himself) are not all equal: some have worked with Kanye a LOT, and others just once or twice. So the number of relationships matters here. Let's run that query:

match (kanye:Artist {name: "Kanye West"})--(r:Recording)--(a2:Artist)
return distinct a2.name, count(r) as nrofrecordings, a2.kanyenr
order by nrofrecordings desc
limit 100;

artists with low kanyenr and lots of recordings
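
The counting that this query does can be sketched in a few lines of Python: for every recording that credits the artist we care about, count each of the other credited artists once, then sort by that count. The recordings and artist names below are invented for illustration.

```python
from collections import Counter

# Hypothetical recordings, each with the artists credited on it.
recordings = {
    "Recording 1": ["Artist K", "Artist A"],
    "Recording 2": ["Artist K", "Artist A", "Artist B"],
    "Recording 3": ["Artist K", "Artist B"],
    "Recording 4": ["Artist K", "Artist A"],
    "Recording 5": ["Artist C", "Artist B"],
}

def co_recording_counts(recordings, artist):
    """Count, per collaborator, the recordings shared with `artist`."""
    counts = Counter()
    for credited in recordings.values():
        if artist in credited:
            counts.update(a for a in set(credited) if a != artist)
    return counts

# Most frequent collaborators first, like "order by nrofrecordings desc".
ranking = co_recording_counts(recordings, "Artist K").most_common()
print(ranking)
```

Both collaborators in this toy data would have the same number (1), yet one of them shares three recordings with the central artist and the other only two.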

So we see that Jay-Z and Kanye are actually great friends, it seems. Or, as I just found out (there goes my pop music street credibility!!!), they were friends, then they weren't, and now they are again? OMG!

Recordings with the most artists

Another interesting thought came to mind: which recording would actually have the most participating artists? I am thinking Band Aid or USA for Africa style.

Well, let's take a look at the latter first: USA for Africa 

That's interesting - there are about 54 artists in that subgraph - but is that it? Are there recordings with more artists?

Let's explore:

match (r:Recording)
with r, apoc.node.degree(r) as degree
order by degree desc
limit 10
match (a:Artist)-[rec:RECORDED]->(r)
return r.name as Recording, degree as NrOfArtists, apoc.coll.sort(collect(a.name));

Recordings with most artists in recording 

That's super interesting. Lots of Asian songs at the top, but also this We Have Seen God's Glory song, which does indeed host a very large number of participating artists!

Last but not least: exploring MusicBrainz in Bloom

Getting to the end of our exploration here, I decided to take the dataset for a spin inside Neo4j Bloom too. So I created a perspective and a few search phrases:

Helmut Lotti's Kanyenr in Bloom using a search phrase

This search phrase allows me to look at Helmut Lotti's Kanye Number:

match (kanye:Artist {name: "Kanye West"})
with kanye
match (a:Artist)
where a.name contains $artistname
with kanye, a
match path = allshortestpaths((kanye)-[*]-(a))
return path

Helmut Lotti in Bloom

Pretty neat!

Co-recordings of artist in Bloom

And this one allows me to look at the co-recordings of an artist:

match path = (a:Artist)-[:RECORDED]->(r:Recording)<-[:RECORDED]-(a2:Artist)
where a.name contains $artist
return path

Co-recordings of artist in Bloom

There are obviously endless additional possibilities, and we can share more of them in the GitHub repo.

In any case I hope you have enjoyed reading my experiments here - and I look forward to your feedback.

Cheers

Rik Van Bruggen

This post is part of a 3-part series. You can find
