By Achim Zeileis
In times past, when we wanted to know which team would win the World Cup, we had to turn to seers with crystal balls, use divination via tea leaves, or hope for Paul the Octopus to tell us what would happen.
But modern data science can provide a better alternative. As part of a team of statisticians, I helped train a machine learning algorithm to predict the most likely course of the tournament.
Probabilistic forecasts and loaded dice
The algorithm we built proceeds in two steps.
In the first, sophisticated statistical models and expert insight from bookmakers and transfer markets are combined to determine the strengths of all teams and their players. In the second step, a machine learning algorithm decides how to best combine the strength estimates with other information about the teams.
This produces a probabilistic forecast for each possible match in the tournament. It can be thought of as a pair of loaded dice: Instead of having the numbers 1 to 6 with equal probabilities, these loaded dice have different probabilities for the number of goals for either team.
For example, according to our forecast, Mexico has a die rolling 1.9 goals on average in the opening match, whereas opponent South Africa has an average of only 0.7. But this does not mean that Mexico will surely win. Rather, a win for Mexico is the most likely outcome with 65% probability. A draw is less likely (21%), and a win for South Africa is the least likely outcome (14%).
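To make the loaded-dice picture concrete, here is a minimal sketch of how two goal averages can be turned into win, draw and loss probabilities. It assumes that each team's goals follow an independent Poisson distribution, which is a common modeling choice but is not stated in this article. The function names and the cap of 15 goals per team are illustrative choices as well.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """Probability of exactly k goals, assuming a Poisson distribution."""
    return exp(-mean) * mean**k / factorial(k)

def outcome_probabilities(mean_a, mean_b, max_goals=15):
    """Return (win_a, draw, win_b), assuming the two goal counts are independent."""
    win_a = draw = win_b = 0.0
    # Add up the probability of every scoreline a:b up to the cap.
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, mean_a) * poisson_pmf(b, mean_b)
            if a > b:
                win_a += p
            elif a == b:
                draw += p
            else:
                win_b += p
    return win_a, draw, win_b

# Goal averages quoted in the article for the opening match.
mexico, draw, south_africa = outcome_probabilities(1.9, 0.7)
print(f"Mexico {mexico:.0%}, draw {draw:.0%}, South Africa {south_africa:.0%}")
```

With the rounded averages of 1.9 and 0.7, this sketch lands within roughly a percentage point of the 65%, 21% and 14% quoted above. Exact agreement should not be expected, since the published averages are rounded and the actual forecast may not rest on this simple assumption.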
‘Vuelve a casa, el fútbol vuelve a casa!’
Using different pairs of loaded dice, the result of each match in the World Cup can be simulated. We took into account the official tournament draw and all FIFA rules, including the possibility of overtime and penalty shootouts. We ran the simulation 100,000 times to determine the tournament’s most likely course.
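The idea of rolling the dice many times can be sketched for a single knockout match. This is not the full tournament simulation with the official draw and FIFA rules. It rests on three assumptions that do not come from the article: goals are Poisson distributed, extra time uses one third of the 90-minute averages, and a penalty shootout is a fair coin flip. All function names are hypothetical.

```python
import random
from math import exp

def sample_goals(mean, rng):
    """Draw a Poisson-distributed goal count (Knuth's multiplication method)."""
    threshold = exp(-mean)
    goals, product = 0, rng.random()
    while product > threshold:
        goals += 1
        product *= rng.random()
    return goals

def knockout_winner(mean_a, mean_b, rng):
    """Simulate one knockout match and return 'A' or 'B'."""
    a, b = sample_goals(mean_a, rng), sample_goals(mean_b, rng)
    if a == b:
        # Extra time lasts 30 minutes, so scale the averages by one third (assumption).
        a += sample_goals(mean_a / 3, rng)
        b += sample_goals(mean_b / 3, rng)
    if a == b:
        # Penalty shootout treated as a fair coin flip (assumption).
        return rng.choice("AB")
    return "A" if a > b else "B"

def advance_probability(mean_a, mean_b, runs=100_000, seed=2026):
    """Share of simulated matches in which team A advances."""
    rng = random.Random(seed)
    wins = sum(knockout_winner(mean_a, mean_b, rng) == "A" for _ in range(runs))
    return wins / runs

# Illustrative averages only; the opening match itself is a group game.
print(f"Team A advances in {advance_probability(1.9, 0.7):.1%} of simulated matches")
```

A full tournament simulation would chain such matches through the group stage and the bracket, then count how often each team lifts the trophy across all runs.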
The results show that Spain is the favorite for the title with a winning probability of 14.5%, closely followed by England and France, each at 12.4%, and Germany at 11.2%.
Due to the expanded tournament – this World Cup has 48 teams and five rounds in the knockout stage – this group of favorites is tightly packed. Portugal and Argentina also have good chances to win the title, at 8.9% and 8.2%, respectively.
For its part, the United States has a good chance of reaching the Round of 32: 78%. This is the highest in their group, which has three other teams. In the knockout stage, however, when every match is do or die, the probabilities of the U.S. team “surviving” go down relatively quickly. The probability for a home victory in the final at MetLife Stadium in New Jersey on July 19 is 1%.
A deeper peek into the engine room
Our machine learning algorithm and subsequent simulations are fueled by data, expert knowledge and statistical models.
First, all national matches over the past eight years are the basis for a “retrospective” estimate of the teams’ strengths. Second, a “prospective” strength estimate is obtained from quoted odds of various international bookmakers, reflecting their expert opinions about the upcoming tournament.
Third, ratings of the individual players are produced based on their contributions to goals at the club and national levels. And finally, the current quality and future potential of the players are reflected in their expected market values. These are available from the Transfermarkt website, which uses a wisdom-of-the-crowd approach to estimate the unknown real market values.
These four variables are combined with a broad range of further relevant inputs reflecting the current states of the different teams and the countries they come from. This includes team-specific details, such as their FIFA rank and the number of players in the semifinals of this year’s Champions League. We also factored in country-specific socioeconomic factors, such as GDP per capita.
To determine if and how these features are relevant for the actual results in a World Cup, a machine learning algorithm was used.
Here, a so-called random forest is trained, consisting of many decision trees, each fitted to a slightly different subset of the data. The algorithm has been trained on all matches played at the major soccer tournaments since World Cup 2006. It thus links a team’s strength, market value and other factors to the number of goals scored in matches at World Cups. This is the information that loads the dice for our simulations.
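The core idea of many trees, each seeing slightly different data, can be sketched in a few lines. The code below is a drastically simplified stand-in for a random forest: it averages one-split regression trees ("stumps"), each fitted to a bootstrap sample and a randomly chosen feature. Real random forests grow deeper trees and pick random features at every split. The two features and all numbers are invented purely for illustration and are not taken from the actual model or data.

```python
import random
from statistics import mean

def fit_stump(rows, targets, feature):
    """Find the single split on `feature` that minimises squared error."""
    best = None
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2
        left = [t for row, t in zip(rows, targets) if row[feature] <= cut]
        right = [t for row, t in zip(rows, targets) if row[feature] > cut]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, cut, left_mean, right_mean)
    if best is None:  # sample has only one distinct value: predict the overall mean
        overall = mean(targets)
        return feature, float("inf"), overall, overall
    return feature, best[1], best[2], best[3]

def fit_forest(rows, targets, n_trees=200, seed=0):
    """Fit stumps on bootstrap samples, each using one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw rows with replacement so every tree sees different data.
        idx = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Average all tree predictions to get an expected number of goals."""
    return mean(left if row[f] <= cut else right for f, cut, left, right in forest)

# Hypothetical toy data: (strength difference, market value ratio) -> goals scored.
rows = [(-1.0, 0.3), (-0.5, 0.6), (0.0, 1.0), (0.4, 1.5), (0.9, 2.5), (1.3, 4.0)]
goals = [0, 1, 1, 1, 2, 3]
forest = fit_forest(rows, goals)
print(f"Expected goals for a strong team: {predict(forest, (1.3, 4.0)):.2f}")
```

The averaged prediction plays the role of the goal average that loads the dice. In practice one would use an established library rather than hand-written trees.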
Find out more
This is not the first time that our team has collaborated to forecast a World Cup. It comprises Andreas Groll, Rouven Michels and colleagues at TU Dortmund University in Germany, Lars Magnus Hvattum at Norway’s Molde University College, Gunther Schauberger at TU Munich and me.
In the 2019 Women’s World Cup we correctly predicted the U.S. as the winner. In the 2023 Women’s World Cup and the 2022 men’s World Cup, the winners – Spain and Argentina, respectively – were not our favorites, although we did predict them to be serious contenders.
The bottom line is that forecasts are about probabilities. Our program will not predict the winner with 100% certainty – but it might do better than an eight-limbed mollusk.
Achim Zeileis is Professor of Statistics at the University of Innsbruck.






















Chols says
How can I access the full report including results for all groups and teams?
JimboXYZ says
Italy failed to qualify for a 3rd straight World Cup.
R.S. says
The US team really impressed with its performance against Paraguay. The teamwork was excellent; they appeared to have the ball most of the time. Paraguay never seemed to get a strategically coherent game underway. I wouldn’t rule out the team going beyond the Round of 32 and getting closer to the quarterfinals. Even their test game against Germany was impressive, although they eventually lost 1:2.
Laurel says
For me, it’s about stats. Years ago, NASCAR had a game (no money involved) where you could pick, I think it was six drivers, maybe five, I don’t remember now, which added up to x amount of play dollars. Each driver had a worth according to their winning status, so someone like Jimmie Johnson cost a lot, whereas some rookie cost little. You had to make it work.
We were on teams. Our team consisted of about 12 players. At first, I played with my heart, picking Greg Biffle followed by Matt Kenseth, and so on, ignoring the top drivers. That didn’t work out too well. Then, I started watching the stats. That changed everything! For about three years or so in a row, I was in the top 10 to top 100 players across the country! I was the top player on my team, and yes, I bought Jimmie Johnson! NASCAR changed the system and it became a pay-to-play game, so I got out, as did the rest of our team.
Maybe, I shoulda…