Last week I was having a conversation with one of my dear Neo4j colleagues, and we were talking about the fact that Graphs are simply so much fun to play around with, and that there's nothing like a great interesting dataset to have people really experiment and acquaint themselves with the technology. I know that to be extremely true, and I think I have demonstrated this elaborately over the years on this runaway blog of mine.
Then the conversation turned to a topic that I know very little about: the FIFA World Cup in Qatar that is starting next week. Now, reading this blog you may know that I am a little addicated to my 2 wheeled #mentalhealthmachine, and that chasing a ball across a field seems like a little bit of a game to me - but hey, that's ok! And with this conversation it actually dawned on me that at Neo4j, we had done "Worldcup Graphs" both in 2014 and in 2018: our friend and former colleague Mark Needham was the driving force behind both of those efforts.
![]()
You can still see some of the work that Mark did at the time on Github and Medium. It was truly another example of how a cool and timely dataset would get people to explore the wonderful world of graphs and get to know the technology in a fun and interesting way.
So: I decide that it would be nice to do that again. With all the new tech that is coming out of Neo4j with the release of Neo4j 5, that could not be very difficult, right? Let's take a look.
Looking for the world cup data
As with much of the sports data, it's not always that easy to find up to data interesting data. I think that's because much of that data is probably safely hidden behind copyright protection and paywalls - but there's always some stuff around. So after a bit of searching, I found a couple of interesting data sources:
- Andre Becklas' matches, players and cups
- David Muñoz Garcia's crazy interesting prediction project
- Josh Fjelstul's World Cup database So that's already quite a bit!
To make life a bit easier for myself, I selected a couple of the most interesting .csv files abd put them together in a Google Spreadsheet.
There's a couple of worksheets in this document, but I will only use some of them:
squad_statswith the definition of the different participating teamsplayers_22with the different players' informationQatar_group_stagewith the definitions of the group stage matcheshistorical_win-loose-draw_ratios_qatar2022_teamswith the historical data with regards to the different team's matches
So, now we can proceed to actually importing the data into Neo4j, and have some fun!
Constructing a model
Whenever we do any kind of data wrangling, it's always important to ask yourself the question what kind of a data model you want to import the data into. The nice thing about graphs is of course that this super flexible and adaptable, but still you want to take some care to do the import appropriately. After going throught the data, I decided on the following structure:
Some interesting components to this:
- on the left hand side of the model you find the player data, including the
Clubs,Leagues,Positions andTraits of thePlayers. - in the center of the model is the
NationalTeamthat thePlayers play for, based on their nationality. Now, the player dataset actually includes a lot of players that will never get selected to their national team - so there's two potential relationships between thePlayerand theNationalTeam.HAS_NATIONALITYmeans that thePlayeris a potential player for theNationalTeam, but has not necessarily debuted yet.PLAYS_FORmeans that thePlayerhas already played for theNationalTeam
- on the right hand side of the model we find the
Matches and thePhases of the Worldcup2022Tournament. - on the far right you will find the stats of the historic
Resultsof theMatches that have been scheduled.
So let's get cracking with the import now.
Importing the WorldcupGraph, 2022 edition
Setting up the indexes
Quite a trivial thing to do, but it's always a good idea to set up some indexes before making some imports, as it will make the import transactions more efficient, especially if you are planning to do a lot of MERGEs.
CREATE INDEX team_name FOR (t:NationalTeam) ON t.name;
CREATE INDEX player_short_name FOR (p:Player) ON p.short_name;
Running the import
Once we have that, the import transactions can be done in a large multi-statement import:
Let's take you through these statements quickly.
Importing the national teams in Qatar2022
Using the .csv export of the spreadsheet, we can just point LOAD CSV to that URL and create the NationalTeam nodes:
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=1122940013" AS csv
CREATE (t:NationalTeam)
SET t = csv;
Importing all the FIFA players whose national team is in Qatar2022
Same process, except that this time we will filter on a specific part of the loaded .csv file, specifically the players of countries that are actually present at the 2022 WorldCup tournament:
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=1368794561" AS csv
WITH csv
WHERE csv.in_qatar = "TRUE"
CREATE (p:Player)
SET p = csv;
As a formality, we also changed the date properties on these Player nodes, so that we could use them later on using the right datatypes.
MATCH (p:Player)
SET p.dob = datetime(p.dob)
SET p.club_joined = datetime(p.club_joined);
Connecting players to positions that they play for their teams
Next, we are creating the Position nodes and associating the players with these nodes. That involves splitting and unwiding these positions, as they are currently all put together ion the player_positions field of the Player nodes. Here's how that works:
MATCH (p:Player)
UNWIND split(p.player_positions,", ") as positions
MERGE (po:Position {name: positions})
CREATE (p)-[:PLAYS_TEAM_POSITION]->(po);
Connecting players to Leagues and Clubs
We also want to make sure that we split off the League and Club nodes of the Players - as that will be interesting to use later on when we do some queries around the capabilities of players in the tournament.
Note that we use the COALESCE function quite a bit at this stage of our import: this allows us to efficiently deal with potential NULL values in our dataset. If we know the league or club of a player, we use that value - but if we don't we use the "UNKNOWN" value for that player.
MATCH (p:Player)
MERGE (l:League {name: coalesce(p.league_name,"UNKNOWN")})
SET l.level = coalesce(p.league_level,"UNKNOWN")
MERGE (cl:Club {id: coalesce(p.club_team_id,"UNKNOWN")})
SET cl.name = coalesce(p.club_name,"UNKNOWN")
CREATE (p)-[:PLAYS_FOR {position: p.club_position, jersey_number: p.club_jersey_number, join_date: p.club_joined, contract_valid_until: p.club_contract_valid_until}]->(cl)
MERGE (cl)-[:PART_OF_LEAGUE]->(l);
Connecting Players to Qatar2022 teams
There's two scenarios in this case:
- A first scenario, where the player has a
Positionrecorded for theNationalTeam. Then we create thePLAYS_FORrelationship between thePlayerand theNationalTeam.
MATCH (p:Player), (t:NationalTeam)
WHERE p.nationality_name = t.name
AND p.nation_position IS NOT NULL
MERGE (p)-[:PLAYS_FOR]->(t);
- A second scenarion where we don't have a
Positionrecorded for thePlayerin theNationalTeam. In that case, we just record that thePlayerhas the nationality of the team - with theHAS_NATIONALITYrelationship.
MATCH (p:Player), (t:NationalTeam)
WHERE p.nationality_name = t.name
AND p.nation_position IS NULL
MERGE (p)-[:HAS_NATIONALITY]->(t);
Connecting Players to traits
Because we have the info, we may as well connect the Player to the Traits in our dataset:
MATCH (p:Player)
UNWIND split(p.player_traits,", ") as traits
MERGE (tr:Trait {name: traits})
CREATE (p)-[:HAS_TRAIT]->(tr);
Adding Ranking information to Teams of Qatar2022
We can also add the ranking information to the NationalTeam nodes like this:
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=320760693" AS csv
MERGE (t:NationalTeam {name: csv.team})
SET t.date_of_score = csv.date
SET t.rank = csv.rank
SET t.goalkeeper_score = csv.goalkeeper_score
SET t.defense_score = csv.defense_score
SET t.offense_score = csv.offense_score
SET t.midfield_score = csv.midfield_score;
Adding Qatar2022 matches
The last exercise is to add the actual Match information of the tournament to the database. The only thing special here is that every match will also have a IS_HOME_TEAM and an IS_AWAY_TEAM relationship to the teams.
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=256106829" AS csv
MERGE (t:Tournament {name:"Qatar2022"})
MERGE (ph:Phase {name: csv.phase})
MERGE (ph)-[:PHASE_OF]->(t)
CREATE (m:Match)-[:IN_PHASE]->(ph)
SET m.date = datetime(csv.date)
SET m.id = csv.match
SET m.home_team = csv.country1
SET m.away_team = csv.country2
WITH m
MATCH (t:NationalTeam {name: m.home_team})
CREATE (m)-[:IS_HOME_TEAM]->(t)
with m
MATCH (t:NationalTeam {name: m.away_team})
CREATE (m)-[:IS_AWAY_TEAM]->(t);
As a very last step in the import process, I have also added historical results to every one of the matches - so that we can potentially better understand likely outcomes of these matches:
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=272034195" AS csv
MATCH (t:Tournament {name:"Qatar2022"})--(ph:Phase)--(m:Match),
(t1:NationalTeam {name: csv.country1})--(m)--(t2:NationalTeam {name: csv.country2})
CREATE (m)-[:HAS_HISTORIC_RESULTS]->(r:RESULTS)
SET r = csv;
Note that this actually adds two sets of RESULTS to every match, one always being the inverse of the other.
Now that we have the data loaded into our graph - we can start doing some queries.
Some queries to get the juices going
Let's build this up slowly and nicely.
Simple query: Let's find "Belgium"
All we do is lookup the NationalTeamnode:
MATCH (t:NationalTeam {name: "Belgium"})
RETURN t;
Next: Things connected to Belgium
Then we look at a path that explores direct and indirect connections to the "Belgium" node:
Direct connections we find like this:
MATCH path = (t:NationalTeam {name: "Belgium"})--(conn) RETURN path;indirect connections we find like this:
MATCH path = (t:NationalTeam {name: "Belgium"})-[*..2]-(conn) RETURN path;
The result looks like this subgraph:
Next, let's look at some Player-centric queries.
Paths between players
First, let's find the shortest paths between Lionel Messi and Kevin De Bruyne:
MATCH path = ALLSHORTESTPATHS ((p1:Player {short_name: "L. Messi"})-[*..5]-(p2:Player {short_name: "K. De Bruyne"}))
RETURN path
LIMIT 100;
The results are unsurprising, through some of the Trait nodes.
If we look at the paths that are not necessarily the shortest ones, we get very different subgraphs, of course.
MATCH path = ((p1:Player {short_name: "L. Messi"})-[:PLAYS_FOR]-()-[*..4]-(p2:Player {short_name: "K. De Bruyne"}))
RETURN path
LIMIT 100;
Finally, I also wanted to take a look at how the matches are going to be played, and how the team composition might have an impact on this.
Find matches with players from same club on two sides
The basic idea behind this query is simple: would there be any WorldCup matches, where there would be players playing for National teams, that are actually on the same team during the rest of the season. That could be interesting in many different ways, right?
So here's a simple query specifically for Belgium, essentially looking for a cyclic pattern (starting and ending with the Club node):
MATCH path = (c)--(p1:Player)--(t1:NationalTeam {name: "Belgium"})--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path;
We can also make this more generic, but then we have to limit the results if we want to keep the results within the realms of what the Neo4j Browser can display:
MATCH path = (c)--(p1:Player)--(t1:NationalTeam)--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path
LIMIT 100;
That's all I have for now. I hope this was an interesting set of ideas and queries for you, and that you can easily experiment with this yourself as well.
Have fun!
Cheers
Rik
- My Blog: https://blog.bruggen.com/
- My Tweets: https://twitter.com/rvanbruggen
- My LinkedIn: https://www.linkedin.com/in/rikvanbruggen/
- My Gists: https://gist.github.com/rvanbruggen. All the code in this article is published there as well, specifically over here.
No comments:
Post a Comment