Bruggen Blog: No WorldCup without a WorldCupGraph!

Monday, 14 November 2022

No WorldCup without a WorldCupGraph!

Last week I was having a conversation with one of my dear Neo4j colleagues, and we were talking about the fact that Graphs are simply so much fun to play around with, and that there's nothing like a great interesting dataset to have people really experiment and acquaint themselves with the technology. I know that to be extremely true, and I think I have demonstrated this elaborately over the years on this runaway blog of mine.

Then the conversation turned to a topic that I know very little about: the FIFA World Cup in Qatar that is starting next week. Now, reading this blog you may know that I am a little addicated to my 2 wheeled #mentalhealthmachine, and that chasing a ball across a field seems like a little bit of a game to me - but hey, that's ok! And with this conversation it actually dawned on me that at Neo4j, we had done "Worldcup Graphs" both in 2014 and in 2018: our friend and former colleague Mark Needham was the driving force behind both of those efforts.

You can still see some of the work that Mark did at the time on Github and Medium. It was truly another example of how a cool and timely dataset would get people to explore the wonderful world of graphs and get to know the technology in a fun and interesting way.

So: I decide that it would be nice to do that again. With all the new tech that is coming out of Neo4j with the release of Neo4j 5, that could not be very difficult, right? Let's take a look.

Looking for the world cup data

As with much of the sports data, it's not always that easy to find up to data interesting data. I think that's because much of that data is probably safely hidden behind copyright protection and paywalls - but there's always some stuff around. So after a bit of searching, I found a couple of interesting data sources:

Andre Becklas' matches, players and cups
David Muñoz Garcia's crazy interesting prediction project
Josh Fjelstul's World Cup database So that's already quite a bit!

To make life a bit easier for myself, I selected a couple of the most interesting .csv files abd put them together in a Google Spreadsheet.

There's a couple of worksheets in this document, but I will only use some of them:

squad_stats with the definition of the different participating teams
players_22 with the different players' information
Qatar_group_stage with the definitions of the group stage matches
historical_win-loose-draw_ratios_qatar2022_teams with the historical data with regards to the different team's matches

So, now we can proceed to actually importing the data into Neo4j, and have some fun!

Constructing a model

Whenever we do any kind of data wrangling, it's always important to ask yourself the question what kind of a data model you want to import the data into. The nice thing about graphs is of course that this super flexible and adaptable, but still you want to take some care to do the import appropriately. After going throught the data, I decided on the following structure:

Some interesting components to this:

on the left hand side of the model you find the player data, including the Clubs, Leagues, Positions and Traits of the Players.
in the center of the model is the NationalTeam that the Players play for, based on their nationality. Now, the player dataset actually includes a lot of players that will never get selected to their national team - so there's two potential relationships between the Player and the NationalTeam.
- HAS_NATIONALITY means that the Player is a potential player for the NationalTeam, but has not necessarily debuted yet.
- PLAYS_FOR means that the Player has already played for the NationalTeam
on the right hand side of the model we find the Matches and the Phases of the Worldcup2022 Tournament.
on the far right you will find the stats of the historic Results of the Matches that have been scheduled.

So let's get cracking with the import now.

Importing the WorldcupGraph, 2022 edition

Setting up the indexes

Quite a trivial thing to do, but it's always a good idea to set up some indexes before making some imports, as it will make the import transactions more efficient, especially if you are planning to do a lot of MERGEs.

CREATE INDEX team_name FOR (t:NationalTeam) ON t.name;
CREATE INDEX player_short_name FOR (p:Player) ON p.short_name;

Running the import

Once we have that, the import transactions can be done in a large multi-statement import:

Let's take you through these statements quickly.

Importing the national teams in Qatar2022

Using the .csv export of the spreadsheet, we can just point LOAD CSV to that URL and create the NationalTeam nodes:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=1122940013" AS csv
CREATE (t:NationalTeam)
SET t = csv;

Importing all the FIFA players whose national team is in Qatar2022

Same process, except that this time we will filter on a specific part of the loaded .csv file, specifically the players of countries that are actually present at the 2022 WorldCup tournament:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=1368794561" AS csv
WITH csv
WHERE csv.in_qatar = "TRUE"
CREATE (p:Player)
SET p = csv;

As a formality, we also changed the date properties on these Player nodes, so that we could use them later on using the right datatypes.

MATCH (p:Player)
SET p.dob = datetime(p.dob)
SET p.club_joined = datetime(p.club_joined);

Connecting players to positions that they play for their teams

Next, we are creating the Position nodes and associating the players with these nodes. That involves splitting and unwiding these positions, as they are currently all put together ion the player_positions field of the Player nodes. Here's how that works:

MATCH (p:Player)
UNWIND split(p.player_positions,", ") as positions
MERGE (po:Position {name: positions})
CREATE (p)-[:PLAYS_TEAM_POSITION]->(po);

Connecting players to Leagues and Clubs

We also want to make sure that we split off the League and Club nodes of the Players - as that will be interesting to use later on when we do some queries around the capabilities of players in the tournament.

Note that we use the COALESCE function quite a bit at this stage of our import: this allows us to efficiently deal with potential NULL values in our dataset. If we know the league or club of a player, we use that value - but if we don't we use the "UNKNOWN" value for that player.

MATCH (p:Player)
MERGE (l:League {name: coalesce(p.league_name,"UNKNOWN")})
SET l.level = coalesce(p.league_level,"UNKNOWN")
MERGE (cl:Club {id: coalesce(p.club_team_id,"UNKNOWN")})
SET cl.name = coalesce(p.club_name,"UNKNOWN")
CREATE (p)-[:PLAYS_FOR {position: p.club_position, jersey_number: p.club_jersey_number, join_date: p.club_joined, contract_valid_until: p.club_contract_valid_until}]->(cl)
MERGE (cl)-[:PART_OF_LEAGUE]->(l);

Connecting Players to Qatar2022 teams

There's two scenarios in this case:

A first scenario, where the player has a Position recorded for the NationalTeam. Then we create the PLAYS_FOR relationship between the Player and the NationalTeam.

MATCH (p:Player), (t:NationalTeam)
WHERE  p.nationality_name = t.name
AND p.nation_position IS NOT NULL
MERGE (p)-[:PLAYS_FOR]->(t);

A second scenarion where we don't have a Position recorded for the Player in the NationalTeam. In that case, we just record that the Player has the nationality of the team - with the HAS_NATIONALITY relationship.

MATCH (p:Player), (t:NationalTeam)
WHERE  p.nationality_name = t.name
AND p.nation_position IS NULL
MERGE (p)-[:HAS_NATIONALITY]->(t);

Connecting Players to traits

Because we have the info, we may as well connect the Player to the Traits in our dataset:

MATCH (p:Player)
UNWIND split(p.player_traits,", ") as traits
MERGE (tr:Trait {name: traits})
CREATE (p)-[:HAS_TRAIT]->(tr);

Adding Ranking information to Teams of Qatar2022

We can also add the ranking information to the NationalTeam nodes like this:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=320760693" AS csv
MERGE (t:NationalTeam {name: csv.team})
SET t.date_of_score = csv.date
SET t.rank = csv.rank
SET t.goalkeeper_score = csv.goalkeeper_score
SET t.defense_score = csv.defense_score
SET t.offense_score = csv.offense_score
SET    t.midfield_score = csv.midfield_score;

Adding Qatar2022 matches

The last exercise is to add the actual Match information of the tournament to the database. The only thing special here is that every match will also have a IS_HOME_TEAM and an IS_AWAY_TEAM relationship to the teams.

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=256106829" AS csv
MERGE (t:Tournament {name:"Qatar2022"})
MERGE (ph:Phase {name: csv.phase})
MERGE (ph)-[:PHASE_OF]->(t)
CREATE (m:Match)-[:IN_PHASE]->(ph)
SET m.date = datetime(csv.date)
SET m.id = csv.match
SET m.home_team = csv.country1
SET m.away_team = csv.country2
WITH m
MATCH (t:NationalTeam {name: m.home_team})
CREATE (m)-[:IS_HOME_TEAM]->(t)
with m
MATCH (t:NationalTeam {name: m.away_team})
CREATE (m)-[:IS_AWAY_TEAM]->(t);

As a very last step in the import process, I have also added historical results to every one of the matches - so that we can potentially better understand likely outcomes of these matches:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=272034195" AS csv
MATCH (t:Tournament {name:"Qatar2022"})--(ph:Phase)--(m:Match),
(t1:NationalTeam {name: csv.country1})--(m)--(t2:NationalTeam {name: csv.country2})
CREATE (m)-[:HAS_HISTORIC_RESULTS]->(r:RESULTS)
SET r = csv;

Note that this actually adds two sets of RESULTS to every match, one always being the inverse of the other.

Now that we have the data loaded into our graph - we can start doing some queries.

Some queries to get the juices going

Let's build this up slowly and nicely.

Simple query: Let's find "Belgium"

All we do is lookup the NationalTeamnode:

MATCH (t:NationalTeam {name: "Belgium"}) 
RETURN t;

Next: Things connected to Belgium

Then we look at a path that explores direct and indirect connections to the "Belgium" node:

Direct connections we find like this:

  MATCH path = (t:NationalTeam {name: "Belgium"})--(conn) 
  RETURN path;

indirect connections we find like this:

  MATCH path = (t:NationalTeam {name: "Belgium"})-[*..2]-(conn) 
  RETURN path;

The result looks like this subgraph:

Next, let's look at some Player-centric queries.

Paths between players

First, let's find the shortest paths between Lionel Messi and Kevin De Bruyne:

MATCH path = ALLSHORTESTPATHS ((p1:Player {short_name: "L. Messi"})-[*..5]-(p2:Player {short_name: "K. De Bruyne"}))
RETURN path
LIMIT 100;

The results are unsurprising, through some of the Trait nodes.

If we look at the paths that are not necessarily the shortest ones, we get very different subgraphs, of course.

MATCH path =  ((p1:Player {short_name: "L. Messi"})-[:PLAYS_FOR]-()-[*..4]-(p2:Player {short_name: "K. De Bruyne"}))
RETURN path
LIMIT 100;

Finally, I also wanted to take a look at how the matches are going to be played, and how the team composition might have an impact on this.

Find matches with players from same club on two sides

The basic idea behind this query is simple: would there be any WorldCup matches, where there would be players playing for National teams, that are actually on the same team during the rest of the season. That could be interesting in many different ways, right?

So here's a simple query specifically for Belgium, essentially looking for a cyclic pattern (starting and ending with the Club node):

MATCH path = (c)--(p1:Player)--(t1:NationalTeam {name: "Belgium"})--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path;

We can also make this more generic, but then we have to limit the results if we want to keep the results within the realms of what the Neo4j Browser can display:

MATCH path = (c)--(p1:Player)--(t1:NationalTeam)--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path
LIMIT 100;

That's all I have for now. I hope this was an interesting set of ideas and queries for you, and that you can easily experiment with this yourself as well.

Have fun!

Cheers

Rik

My Blog: https://blog.bruggen.com/
My Tweets: https://twitter.com/rvanbruggen
My LinkedIn: https://www.linkedin.com/in/rikvanbruggen/
My Gists: https://gist.github.com/rvanbruggen. All the code in this article is published there as well, specifically over here.

Bruggen Blog

Pages