Last week I was having a conversation with one of my dear Neo4j colleagues, and we were talking about the fact that Graphs are simply so much fun to play around with, and that there's nothing like a great interesting dataset to have people really experiment and acquaint themselves with the technology. I know that to be extremely true, and I think I have demonstrated this elaborately over the years on this runaway blog of mine.
Then the conversation turned to a topic that I know very little about: the FIFA World Cup in Qatar that is starting next week. Now, reading this blog you may know that I am a little addicated to my 2 wheeled #mentalhealthmachine, and that chasing a ball across a field seems like a little bit of a game to me - but hey, that's ok! And with this conversation it actually dawned on me that at Neo4j, we had done "Worldcup Graphs" both in 2014 and in 2018: our friend and former colleague Mark Needham was the driving force behind both of those efforts.
You can still see some of the work that Mark did at the time on Github and Medium. It was truly another example of how a cool and timely dataset would get people to explore the wonderful world of graphs and get to know the technology in a fun and interesting way.
So: I decide that it would be nice to do that again. With all the new tech that is coming out of Neo4j with the release of Neo4j 5, that could not be very difficult, right? Let's take a look.
Looking for the world cup data
As with much of the sports data, it's not always that easy to find up to data interesting data. I think that's because much of that data is probably safely hidden behind copyright protection and paywalls - but there's always some stuff around. So after a bit of searching, I found a couple of interesting data sources:
- Andre Becklas' matches, players and cups
- David Muñoz Garcia's crazy interesting prediction project
- Josh Fjelstul's World Cup database So that's already quite a bit!
To make life a bit easier for myself, I selected a couple of the most interesting .csv
files abd put them together in a Google Spreadsheet.
There's a couple of worksheets in this document, but I will only use some of them:
squad_stats
with the definition of the different participating teamsplayers_22
with the different players' informationQatar_group_stage
with the definitions of the group stage matcheshistorical_win-loose-draw_ratios_qatar2022_teams
with the historical data with regards to the different team's matches
So, now we can proceed to actually importing the data into Neo4j, and have some fun!
Constructing a model
Whenever we do any kind of data wrangling, it's always important to ask yourself the question what kind of a data model you want to import the data into. The nice thing about graphs is of course that this super flexible and adaptable, but still you want to take some care to do the import appropriately. After going throught the data, I decided on the following structure:
Some interesting components to this:
- on the left hand side of the model you find the player data, including the
Clubs
,League
s,Position
s andTrait
s of thePlayer
s. - in the center of the model is the
NationalTeam
that thePlayer
s play for, based on their nationality. Now, the player dataset actually includes a lot of players that will never get selected to their national team - so there's two potential relationships between thePlayer
and theNationalTeam
.HAS_NATIONALITY
means that thePlayer
is a potential player for theNationalTeam
, but has not necessarily debuted yet.PLAYS_FOR
means that thePlayer
has already played for theNationalTeam
- on the right hand side of the model we find the
Matche
s and thePhase
s of the Worldcup2022Tournament
. - on the far right you will find the stats of the historic
Results
of theMatch
es that have been scheduled.
So let's get cracking with the import now.
Importing the WorldcupGraph, 2022 edition
Setting up the indexes
Quite a trivial thing to do, but it's always a good idea to set up some indexes before making some imports, as it will make the import transactions more efficient, especially if you are planning to do a lot of MERGE
s.
CREATE INDEX team_name FOR (t:NationalTeam) ON t.name;
CREATE INDEX player_short_name FOR (p:Player) ON p.short_name;
Running the import
Once we have that, the import transactions can be done in a large multi-statement import:
Let's take you through these statements quickly.
Importing the national teams in Qatar2022
Using the .csv
export of the spreadsheet, we can just point LOAD CSV
to that URL and create the NationalTeam
nodes:
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=1122940013" AS csv
CREATE (t:NationalTeam)
SET t = csv;
Importing all the FIFA players whose national team is in Qatar2022
Same process, except that this time we will filter on a specific part of the loaded .csv
file, specifically the players of countries that are actually present at the 2022 WorldCup tournament:
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=1368794561" AS csv
WITH csv
WHERE csv.in_qatar = "TRUE"
CREATE (p:Player)
SET p = csv;
As a formality, we also changed the date properties on these Player
nodes, so that we could use them later on using the right datatypes.
MATCH (p:Player)
SET p.dob = datetime(p.dob)
SET p.club_joined = datetime(p.club_joined);
Connecting players to positions that they play for their teams
Next, we are creating the Position
nodes and associating the players with these nodes. That involves splitting and unwiding these positions, as they are currently all put together ion the player_positions
field of the Player
nodes. Here's how that works:
MATCH (p:Player)
UNWIND split(p.player_positions,", ") as positions
MERGE (po:Position {name: positions})
CREATE (p)-[:PLAYS_TEAM_POSITION]->(po);
Connecting players to Leagues and Clubs
We also want to make sure that we split off the League
and Club
nodes of the Player
s - as that will be interesting to use later on when we do some queries around the capabilities of players in the tournament.
Note that we use the COALESCE function quite a bit at this stage of our import: this allows us to efficiently deal with potential NULL
values in our dataset. If we know the league or club of a player, we use that value - but if we don't we use the "UNKNOWN" value for that player.
MATCH (p:Player)
MERGE (l:League {name: coalesce(p.league_name,"UNKNOWN")})
SET l.level = coalesce(p.league_level,"UNKNOWN")
MERGE (cl:Club {id: coalesce(p.club_team_id,"UNKNOWN")})
SET cl.name = coalesce(p.club_name,"UNKNOWN")
CREATE (p)-[:PLAYS_FOR {position: p.club_position, jersey_number: p.club_jersey_number, join_date: p.club_joined, contract_valid_until: p.club_contract_valid_until}]->(cl)
MERGE (cl)-[:PART_OF_LEAGUE]->(l);
Connecting Players to Qatar2022 teams
There's two scenarios in this case:
- A first scenario, where the player has a
Position
recorded for theNationalTeam
. Then we create thePLAYS_FOR
relationship between thePlayer
and theNationalTeam
.
MATCH (p:Player), (t:NationalTeam)
WHERE p.nationality_name = t.name
AND p.nation_position IS NOT NULL
MERGE (p)-[:PLAYS_FOR]->(t);
- A second scenarion where we don't have a
Position
recorded for thePlayer
in theNationalTeam
. In that case, we just record that thePlayer
has the nationality of the team - with theHAS_NATIONALITY
relationship.
MATCH (p:Player), (t:NationalTeam)
WHERE p.nationality_name = t.name
AND p.nation_position IS NULL
MERGE (p)-[:HAS_NATIONALITY]->(t);
Connecting Players to traits
Because we have the info, we may as well connect the Player
to the Trait
s in our dataset:
MATCH (p:Player)
UNWIND split(p.player_traits,", ") as traits
MERGE (tr:Trait {name: traits})
CREATE (p)-[:HAS_TRAIT]->(tr);
Adding Ranking information to Teams of Qatar2022
We can also add the ranking information to the NationalTeam
nodes like this:
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=320760693" AS csv
MERGE (t:NationalTeam {name: csv.team})
SET t.date_of_score = csv.date
SET t.rank = csv.rank
SET t.goalkeeper_score = csv.goalkeeper_score
SET t.defense_score = csv.defense_score
SET t.offense_score = csv.offense_score
SET t.midfield_score = csv.midfield_score;
Adding Qatar2022 matches
The last exercise is to add the actual Match
information of the tournament to the database. The only thing special here is that every match will also have a IS_HOME_TEAM
and an IS_AWAY_TEAM
relationship to the teams.
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=256106829" AS csv
MERGE (t:Tournament {name:"Qatar2022"})
MERGE (ph:Phase {name: csv.phase})
MERGE (ph)-[:PHASE_OF]->(t)
CREATE (m:Match)-[:IN_PHASE]->(ph)
SET m.date = datetime(csv.date)
SET m.id = csv.match
SET m.home_team = csv.country1
SET m.away_team = csv.country2
WITH m
MATCH (t:NationalTeam {name: m.home_team})
CREATE (m)-[:IS_HOME_TEAM]->(t)
with m
MATCH (t:NationalTeam {name: m.away_team})
CREATE (m)-[:IS_AWAY_TEAM]->(t);
As a very last step in the import process, I have also added historical results to every one of the matches - so that we can potentially better understand likely outcomes of these matches:
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=272034195" AS csv
MATCH (t:Tournament {name:"Qatar2022"})--(ph:Phase)--(m:Match),
(t1:NationalTeam {name: csv.country1})--(m)--(t2:NationalTeam {name: csv.country2})
CREATE (m)-[:HAS_HISTORIC_RESULTS]->(r:RESULTS)
SET r = csv;
Note that this actually adds two sets of RESULTS
to every match, one always being the inverse of the other.
Now that we have the data loaded into our graph - we can start doing some queries.
Some queries to get the juices going
Let's build this up slowly and nicely.
Simple query: Let's find "Belgium"
All we do is lookup the NationalTeam
node:
MATCH (t:NationalTeam {name: "Belgium"})
RETURN t;
Next: Things connected to Belgium
Then we look at a path that explores direct and indirect connections to the "Belgium" node:
Direct connections we find like this:
MATCH path = (t:NationalTeam {name: "Belgium"})--(conn) RETURN path;
indirect connections we find like this:
MATCH path = (t:NationalTeam {name: "Belgium"})-[*..2]-(conn) RETURN path;
The result looks like this subgraph:
Next, let's look at some Player
-centric queries.
Paths between players
First, let's find the shortest paths between Lionel Messi and Kevin De Bruyne:
MATCH path = ALLSHORTESTPATHS ((p1:Player {short_name: "L. Messi"})-[*..5]-(p2:Player {short_name: "K. De Bruyne"}))
RETURN path
LIMIT 100;
The results are unsurprising, through some of the Trait
nodes.
If we look at the paths that are not necessarily the shortest ones, we get very different subgraphs, of course.
MATCH path = ((p1:Player {short_name: "L. Messi"})-[:PLAYS_FOR]-()-[*..4]-(p2:Player {short_name: "K. De Bruyne"}))
RETURN path
LIMIT 100;
Finally, I also wanted to take a look at how the matches are going to be played, and how the team composition might have an impact on this.
Find matches with players from same club on two sides
The basic idea behind this query is simple: would there be any WorldCup matches, where there would be players playing for National teams, that are actually on the same team during the rest of the season. That could be interesting in many different ways, right?
So here's a simple query specifically for Belgium, essentially looking for a cyclic pattern (starting and ending with the Club
node):
MATCH path = (c)--(p1:Player)--(t1:NationalTeam {name: "Belgium"})--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path;
We can also make this more generic, but then we have to limit the results if we want to keep the results within the realms of what the Neo4j Browser can display:
MATCH path = (c)--(p1:Player)--(t1:NationalTeam)--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path
LIMIT 100;
That's all I have for now. I hope this was an interesting set of ideas and queries for you, and that you can easily experiment with this yourself as well.
Have fun!
Cheers
Rik
- My Blog: https://blog.bruggen.com/
- My Tweets: https://twitter.com/rvanbruggen
- My LinkedIn: https://www.linkedin.com/in/rikvanbruggen/
- My Gists: https://gist.github.com/rvanbruggen. All the code in this article is published there as well, specifically over here.
No comments:
Post a Comment