Predicting the 2022 World Cup Using Machine Learning
Project introduction
- This project is adapted and optimized from an open-source Kaggle baseline, and the baseline is explained line by line.
- The datasets used in this project are FIFA World Ranking 1992-2022 and International football results from 1872 to 2022.
- In this project, we turn the problem into a classification problem, i.e. the goal of the final model is to predict the probability that the home team wins versus the probability of a draw or away win.
- To remove home advantage (World Cup matches are played on neutral ground), each match is predicted twice with the home and away teams swapped, and the average of the two predictions is used as the final probability.
- If you need the datasets, click the hyperlinks to download them.
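The home/away averaging idea described above can be sketched as a small helper. Here `toy_model` is a hypothetical stand-in for a trained classifier (not the article's model), used only to illustrate the arithmetic:

```python
import numpy as np

def symmetric_win_prob(predict_proba, feats_ab, feats_ba):
    # predict_proba returns [[P(first-listed team wins), P(it does not win)]]
    p_ab = predict_proba([feats_ab])[0]  # team A listed first
    p_ba = predict_proba([feats_ba])[0]  # team B listed first
    prob_a = (p_ab[0] + p_ba[1]) / 2  # average A's win prob over both orderings
    prob_b = (p_ba[0] + p_ab[1]) / 2
    return prob_a, prob_b

def toy_model(X):
    # hypothetical stand-in: favors the side with the lower rank difference
    p = 0.7 if X[0][0] < 0 else 0.3
    return np.array([[p, 1 - p]])

prob_a, prob_b = symmetric_win_prob(toy_model, [-10], [10])
```

Because both orderings are averaged, the result no longer depends on which team happens to be listed first.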
Dataset description
International Football Results 1872-2022
- The dataset includes 44,152 international soccer match results from the first official match in 1872 to 2022.
- Competitions range from the FIFA World Cup to the FIFI Wild Cup to regular friendlies.
- These games are strictly men's official international matches; the figures do not include the Olympic Games or matches where at least one team was a national B team, U-23 side, or a league selection team.
results.csv includes the following columns:
- date - date of the match
- home_team - home team name
- away_team - away team name
- home_score - full-time home team score, including extra time, excluding penalty shootouts
- away_score - full time away score, including extra time, excluding penalty shootouts
- tournament - tournament name
- city - the name of the city/town/administrative unit where the match was played
- country - the name of the country where the match is played
- neutral - TRUE/FALSE column indicating whether the match is played on neutral ground
shootouts.csv includes the following columns:
- date - date of the match
- home_team - home team name
- away_team - away team name
- winner - penalty shootout winner
FIFA World Ranking 1992-2022
- country_full — full country name
- country_abrv — country abbreviation
- rank — current country rank
- total_points — current total points
- previous_points — the total points of the previous rating
- rank_change — rank change
- confederation — FIFA Confederation
- rank_date — the date the rank was calculated
Data Analysis and Preprocessing
data preparation
# Unzip the datasets
!unzip -d datasets/international-football-results-from-1872-to-2017 1872年至2022年国际足球成绩.zip
!unzip -d datasets/fifaworldranking 国际足联世界排名1992-2022.zip
Analysis and preprocessing of international football scores from 1872 to 2022
Import results.csv from the International Football Results 1872-2022 dataset
import pandas as pd
import numpy as np
import re
df = pd.read_csv("datasets/international-football-results-from-1872-to-2017/results.csv")
df.head()
Let's preview the data first to get a sense of its overall structure, then check its basic information
df.info()
- The date column holds dates, but it is read as strings rather than a datetime type, so we need to convert it during preprocessing
- Only two columns in this table (the scores) are continuous features; the rest are discrete
Check for missing values
# Check for missing values
df.isna().sum()
- Two columns each contain 40 missing values.
- These two columns are exactly our most important features: the scores.
- Without the scores we cannot model the matches, so we drop the samples with missing scores.
Eliminate samples with missing scores and modify the date format
# Drop rows containing missing values
df.dropna(inplace=True)
# Convert the date column to datetime
df["date"] = pd.to_datetime(df["date"])
Running this directly raises an error: the date column contains the string '2022-19-22', which is not a valid date. We therefore need to drop the row containing this string before converting the column.
df = df.drop(df[df['date']=='2022-19-22'].index,axis=0)
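A more defensive variant of this cleanup (a sketch on toy data, not the article's exact code) lets `pd.to_datetime` coerce any unparseable date to `NaT` and then drops those rows, so no bad string has to be hunted down by hand:

```python
import pandas as pd

raw = pd.DataFrame({
    "date": ["2018-08-01", "2022-19-22", "2022-11-20"],  # middle value is invalid
    "home_score": [1, 2, 0],
})
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")  # invalid dates become NaT
clean = raw.dropna(subset=["date"]).reset_index(drop=True)
```

This scales to any number of malformed entries without knowing them in advance.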
The dataset we will use covers matches played after the 2018 World Cup and before the 2022 World Cup. The idea is to analyze each team's form during the World Cup qualification and preparation phase.
Therefore, we filter the dataset
- Let's take a look at the last few games in 2022
df.sort_values("date").tail()
- Filter out games after August 1, 2018, and reset the index
df = df[(df["date"] >= "2018-8-1")].reset_index(drop=True)
df.sort_values('date').tail()
Analysis and preprocessing of the FIFA World Rankings 1992-2022 dataset
Just like before, we need to convert the date format first and extract the data after August 1, 2018
rank = pd.read_csv("datasets/fifaworldranking/fifa_ranking-2022-10-06.csv")
rank["rank_date"] = pd.to_datetime(rank["rank_date"])  # convert to datetime
rank = rank[(rank["rank_date"] >= "2018-8-1")].reset_index(drop=True)  # filter the dataset
Some teams appearing in the World Cup have different names in the ranking dataset, so we need to adjust them.
rank["country_full"] = rank["country_full"].str.replace("IR Iran", "Iran").str.replace("Korea Republic", "South Korea").str.replace("USA", "United States")
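An equivalent cleanup (an illustrative sketch, not the article's code) uses a single mapping with `Series.replace`, which swaps whole values and avoids the accidental substring matches that chained `str.replace` calls can cause:

```python
import pandas as pd

# mapping from the ranking dataset's names to the World Cup names
name_fixes = {
    "IR Iran": "Iran",
    "Korea Republic": "South Korea",
    "USA": "United States",
}
names = pd.Series(["IR Iran", "Korea Republic", "USA", "Brazil"])
names = names.replace(name_fixes)  # exact-value replacement; unmatched values pass through
```

Keeping the fixes in one dict also makes it easy to extend when more mismatched names turn up.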
Merge the two tables
Next, we merge the two datasets to obtain a match dataset that includes each team's ranking.
- Set the ranking date as the index, group by country, resample to daily frequency taking the first record of each day, and finally reset the index
- Empty daily values are forward-filled from the most recent ranking
rank = rank.set_index(['rank_date']).groupby(['country_full'], group_keys=False).resample('D').first().ffill().reset_index()  # .ffill() replaces the deprecated fillna(method='ffill')
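A minimal illustration of what the resample-and-forward-fill step does, shown for a single team on toy data (the article's code applies it per country via `groupby`): sparse ranking snapshots are expanded into a daily series so they can be joined to matches played on any date.

```python
import pandas as pd

rank_toy = pd.DataFrame({
    "rank_date": pd.to_datetime(["2022-01-01", "2022-01-04"]),
    "rank": [1, 2],
}).set_index("rank_date")
# resample to daily frequency, then carry the last known rank forward
daily = rank_toy.resample("D").first().ffill().reset_index()
```

The two snapshots become four daily rows, with January 2nd and 3rd inheriting the January 1st rank.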
- We select the columns "country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date" from the rank table to merge into the df table
- The join matches date and home_team in the left table against rank_date and country_full in the right table
- Since the merged result then contains duplicate columns, we keep only one copy and drop rank_date and country_full
df_wc_ranked = df.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]],
left_on=["date", "home_team"], right_on=["rank_date", "country_full"]).drop(["rank_date", "country_full"], axis=1)
- Recall that results.csv has both home_team and away_team columns
- The merge above only attached ranking data for the home team, so we need a second merge for the away team
- We take the same rank columns as above, but now join on away_team instead of home_team
- Because merging twice creates many duplicate column names, and to distinguish home from away features, we suffix the home team's rank columns with _home and the away team's with _away
- Finally, as before, we drop the duplicated join columns (rank_date and country_full)
df_wc_ranked = df_wc_ranked.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]],
left_on=["date", "away_team"], right_on=["rank_date", "country_full"],
suffixes=("_home", "_away")).drop(["rank_date", "country_full"], axis=1)
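How the `suffixes` argument behaves in this second merge can be seen on toy frames (column names here are illustrative, not the article's full schema): any column present on both sides gets `_home`/`_away` appended, keeping the two teams' rank features apart.

```python
import pandas as pd

left = pd.DataFrame({"date": [1], "home_team": ["A"], "away_team": ["B"], "rank": [3]})
right = pd.DataFrame({"team": ["B"], "rank": [7]})
# "rank" exists in both frames, so it is split into rank_home / rank_away
merged = left.merge(right, left_on="away_team", right_on="team",
                    suffixes=("_home", "_away")).drop("team", axis=1)
```

The join key from the right side ("team") duplicates away_team, so it is dropped after the merge, just as the article drops rank_date and country_full.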
After merging, let's look at some rows where Brazil is either the home or the away team.
df_wc_ranked[(df_wc_ranked.home_team == "Brazil") | (df_wc_ranked.away_team == "Brazil")].tail()
Now that we have the data ready, we can perform feature engineering on the dataset
feature engineering
- The idea here is to create more features that influence the outcome of a soccer game
- We expect the influential characteristics to be the following:
1. The team's historical game points
2. The team's historical goals scored and conceded
3. The team's ranking
4. The team's recent movement in the ranking
5. The team's goals scored and conceded relative to the rank of the opponents faced
6. The importance of the game (friendly or not)
So we first create a function that determines which team won and how many game points each side earned.
Encapsulate a function to judge winning or losing
df = df_wc_ranked
def result_finder(home, away):
    # returns [result, home_team_points, away_team_points]
    # result: 0 = home win, 1 = away win, 2 = draw
    if home > away:
        return pd.Series([0, 3, 0])
    if home < away:
        return pd.Series([1, 0, 3])
    else:
        return pd.Series([2, 1, 1])
results = df.apply(lambda x: result_finder(x["home_score"], x["away_score"]), axis=1)
df[["result", "home_team_points", "away_team_points"]] = results
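A vectorized equivalent of `result_finder` (a sketch on toy data) uses `np.select` instead of the per-row `apply`, which matters on tens of thousands of matches:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"home_score": [2, 0, 1], "away_score": [1, 3, 1]})
conds = [toy["home_score"] > toy["away_score"],   # home win
         toy["home_score"] < toy["away_score"]]   # away win
toy["result"] = np.select(conds, [0, 1], default=2)           # default 2 = draw
toy["home_team_points"] = np.select(conds, [3, 0], default=1)  # 3 win / 0 loss / 1 draw
toy["away_team_points"] = np.select(conds, [0, 3], default=1)
```

The encoding matches the function above: 0/1/2 for home win/away win/draw, with 3-1-0 game points.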
hypothesis testing
- Game points are 3 for a win, 1 for a draw, and 0 for a loss, which differs from the ranking points already present in the dataset.
- In addition, we suspect that a team's ranking points and its rank are (negatively) correlated, so only one of them should be used when creating new features.
- The heatmap below tests this hypothesis.
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 10))
sns.heatmap(df[["total_points_home", "rank_home", "total_points_away", "rank_away"]].corr())
plt.show()
Feature Derivation
- Now, we need to create features that are good for modeling
- For example:
1. Difference in ranking
2. Game points won in a match relative to the rank of the team faced
3. Goal difference in a match
All features that are not part of differences should be created for both teams (away and home).
df["rank_dif"] = df["rank_home"] - df["rank_away"]  # rank difference
df["sg"] = df["home_score"] - df["away_score"]  # goal difference
df["points_home_by_rank"] = df["home_team_points"]/df["rank_away"]  # home team's game points relative to the away team's rank
df["points_away_by_rank"] = df["away_team_points"]/df["rank_home"]  # away team's game points relative to the home team's rank
For easier feature derivation, we split the dataset into a home-team table and an away-team table, stack them together to compute each team's past-game statistics, and then split and recombine them to rebuild the original layout.
This streamlines the feature derivation.
- First divide the dataset into home and away datasets
home_team = df[["date", "home_team", "home_score", "away_score", "rank_home", "rank_away","rank_change_home", "total_points_home", "result", "rank_dif", "points_home_by_rank", "home_team_points"]]
away_team = df[["date", "away_team", "away_score", "home_score", "rank_away", "rank_home","rank_change_away", "total_points_away", "result", "rank_dif", "points_away_by_rank", "away_team_points"]]
- Because the column names were suffixed during the merges, we now rename them back to a common scheme for later processing
home_team.columns = [h.replace("home_", "").replace("_home", "").replace("away_", "suf_").replace("_away", "_suf") for h in home_team.columns]
away_team.columns = [a.replace("away_", "").replace("_away", "").replace("home_", "suf_").replace("_home", "_suf") for a in away_team.columns]
- Append them together for feature calculation
team_stats = pd.concat([home_team, away_team])  # DataFrame.append was removed in pandas 2.0
- Keep a raw copy of these columns for feature calculation (it is reused later in the World Cup simulation)
team_stats_raw = team_stats.copy()
Now we have a dataset ready for further feature derivation. The columns to be derived are:
- Mean goals scored by the team over the World Cup cycle
- Mean goals scored in the last 5 games
- Mean goals conceded over the World Cup cycle
- Mean goals conceded in the last 5 games
- Mean FIFA rank of the opponents faced over the cycle
- Mean FIFA rank of the opponents faced in the last 5 games
- FIFA points won over the cycle
- FIFA points won in the last 5 games
- Mean game points over the cycle
- Mean game points in the last 5 games
- Mean game points relative to the rank faced, over the cycle
- Mean game points relative to the rank faced, in the last 5 games
stats_val = []
for index, row in team_stats.iterrows():
    team = row["team"]
    date = row["date"]
    # all of the team's past games, most recent first
    past_games = team_stats.loc[(team_stats["team"] == team) & (team_stats["date"] < date)].sort_values(by=['date'], ascending=False)
    last5 = past_games.head(5)  # the last five games
    goals = past_games["score"].mean()
    goals_l5 = last5["score"].mean()
    goals_suf = past_games["suf_score"].mean()
    goals_suf_l5 = last5["suf_score"].mean()
    rank = past_games["rank_suf"].mean()
    rank_l5 = last5["rank_suf"].mean()
    if len(last5) > 0:
        points = past_games["total_points"].values[0] - past_games["total_points"].values[-1]  # FIFA points gained over the period
        points_l5 = last5["total_points"].values[0] - last5["total_points"].values[-1]
    else:
        points = 0
        points_l5 = 0
    gp = past_games["team_points"].mean()
    gp_l5 = last5["team_points"].mean()
    gp_rank = past_games["points_by_rank"].mean()
    gp_rank_l5 = last5["points_by_rank"].mean()
    stats_val.append([goals, goals_l5, goals_suf, goals_suf_l5, rank, rank_l5, points, points_l5, gp, gp_l5, gp_rank, gp_rank_l5])
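The `iterrows` loop above is quadratic in the number of rows. The same "mean over all strictly earlier games" statistic can be computed vectorized with `shift` plus `expanding`; a sketch on toy data (assumes one game per team per date; extending it to all of `team_stats`'s columns follows the same pattern):

```python
import pandas as pd

toy = pd.DataFrame({
    "team": ["A", "B", "A", "A"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-15", "2021-02-01", "2021-03-01"]),
    "score": [1, 0, 3, 2],
}).sort_values("date")
# per team: shift drops the current game, expanding().mean() averages all earlier ones
toy["goals_mean"] = (toy.groupby("team")["score"]
                        .transform(lambda s: s.shift().expanding().mean()))
```

Each team's first game gets NaN (no history), mirroring the null rows the article drops later.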
- Merge the newly derived features back into the stacked table
- and store the result in full_df
stats_cols = ["goals_mean", "goals_mean_l5", "goals_suf_mean", "goals_suf_mean_l5", "rank_mean", "rank_mean_l5", "points_mean", "points_mean_l5", "game_points_mean", "game_points_mean_l5", "game_points_rank_mean", "game_points_rank_mean_l5"]
stats_df = pd.DataFrame(stats_val, columns=stats_cols)
full_df = pd.concat([team_stats.reset_index(drop=True), stats_df], axis=1, ignore_index=False)
- Divide the merged data set into home and away games again
home_team_stats = full_df.iloc[:int(full_df.shape[0]/2),:]
away_team_stats = full_df.iloc[int(full_df.shape[0]/2):,:]
- Take out the column derived from the feature just now
home_team_stats = home_team_stats[home_team_stats.columns[-12:]]
away_team_stats = away_team_stats[away_team_stats.columns[-12:]]
- Rename the columns: the home_ prefix marks home-team features, and away_ marks away-team features
To keep the dataset consistent, each column gets the corresponding prefix; after that, the two halves can be recombined.
home_team_stats.columns = ['home_'+str(col) for col in home_team_stats.columns]
away_team_stats.columns = ['away_'+str(col) for col in away_team_stats.columns]
- data merge
match_stats = pd.concat([home_team_stats, away_team_stats.reset_index(drop=True)], axis=1, ignore_index=False)
full_df = pd.concat([df, match_stats.reset_index(drop=True)], axis=1, ignore_index=False)
full_df.columns
Take a look at the existing feature columns
- In order to determine whether the game is friendly, we encapsulate a function to judge it
def find_friendly(x):
    # 1 if the tournament is a friendly, 0 otherwise
    if x == "Friendly":
        return 1
    else:
        return 0
full_df["is_friendly"] = full_df["tournament"].apply(lambda x: find_friendly(x))
- and one-hot encode it
full_df = pd.get_dummies(full_df, columns=["is_friendly"])
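What the one-hot step produces can be checked on a toy frame (names mirror the article's): the single 0/1 flag becomes two indicator columns, `is_friendly_0` and `is_friendly_1`.

```python
import pandas as pd

toy = pd.DataFrame({"tournament": ["Friendly", "FIFA World Cup", "Friendly"]})
toy["is_friendly"] = (toy["tournament"] == "Friendly").astype(int)
toy = pd.get_dummies(toy, columns=["is_friendly"])  # expands into one column per value
```

For a binary flag the two dummies are redundant with each other, but keeping both matches the feature list used by the model later on.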
Perform data analysis on the dataset after feature engineering
- Here, we only select columns that contribute to our feature analysis for analysis
base_df = full_df[["date", "home_team", "away_team", "rank_home", "rank_away","home_score", "away_score","result", "rank_dif", "rank_change_home", "rank_change_away", 'home_goals_mean',
'home_goals_mean_l5', 'home_goals_suf_mean', 'home_goals_suf_mean_l5',
'home_rank_mean', 'home_rank_mean_l5', 'home_points_mean',
'home_points_mean_l5', 'away_goals_mean', 'away_goals_mean_l5',
'away_goals_suf_mean', 'away_goals_suf_mean_l5', 'away_rank_mean',
'away_rank_mean_l5', 'away_points_mean', 'away_points_mean_l5','home_game_points_mean', 'home_game_points_mean_l5',
'home_game_points_rank_mean', 'home_game_points_rank_mean_l5','away_game_points_mean',
'away_game_points_mean_l5', 'away_game_points_rank_mean',
'away_game_points_rank_mean_l5',
'is_friendly_0', 'is_friendly_1']]
base_df.head()
- Check for missing values
base_df.isna().sum()
- Averages cannot be computed for teams with no prior matches, which is what produces the null values, so we remove those samples
base_df_no_fg = base_df.dropna()
Now we need to analyze all the created features and check whether they have predictive power; if they don't, we need to create features that do, such as differences between the home and away teams. To analyze predictive power, I will treat a draw as a loss for the home team, recasting the problem as binary classification.
df = base_df_no_fg
def no_draw(x):
    # fold draws (2) into the "home team does not win" class (1)
    if x == 2:
        return 1
    else:
        return x
df["target"] = df["result"].apply(lambda x: no_draw(x))
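The same binarization can be done without `apply` (a sketch): a single `replace` call folds the draw class into class 1.

```python
import pandas as pd

result = pd.Series([0, 1, 2, 2, 0])  # 0 = home win, 1 = away win, 2 = draw
target = result.replace(2, 1)        # draws count as "home team did not win"
```

Class 0 then means "home team wins" and class 1 means "home team does not win".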
Filter features using violin plots and boxplots
- Next, we use the violin plot and the box plot to analyze whether the features have different distributions according to the target.
- Use scatterplots to analyze correlations
- To keep the plots readable, we draw some of the features on one canvas and the rest on another
data1 = df[list(df.columns[8:20].values) + ["target"]]
data2 = df[df.columns[20:]]
- Normalize features
# standardize the feature columns only (exclude the target label)
feats1 = data1.drop("target", axis=1)
scaled = (feats1 - feats1.mean()) / feats1.std()
scaled["target"] = data1["target"]
violin1 = pd.melt(scaled, id_vars="target", var_name="features", value_name="value")
feats2 = data2.drop("target", axis=1)
scaled = (feats2 - feats2.mean()) / feats2.std()
scaled["target"] = data2["target"]
violin2 = pd.melt(scaled, id_vars="target", var_name="features", value_name="value")
- Draw the feature violin plot in data1
plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin1,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
- Draw the characteristic violin diagram of data2
plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin2,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
Looking at these plots, we find that the rank difference is the only good separator of the data. However, we can create some features to get the difference between home and away teams and analyze whether they separate the data well.
- To better explore home/away differences, we compute the difference of each home/away feature-mean pair, standardize them, and draw their violin plots
dif = df.copy()
dif.loc[:, "goals_dif"] = dif["home_goals_mean"] - dif["away_goals_mean"]
dif.loc[:, "goals_dif_l5"] = dif["home_goals_mean_l5"] - dif["away_goals_mean_l5"]
dif.loc[:, "goals_suf_dif"] = dif["home_goals_suf_mean"] - dif["away_goals_suf_mean"]
dif.loc[:, "goals_suf_dif_l5"] = dif["home_goals_suf_mean_l5"] - dif["away_goals_suf_mean_l5"]
dif.loc[:, "goals_made_suf_dif"] = dif["home_goals_mean"] - dif["away_goals_suf_mean"]
dif.loc[:, "goals_made_suf_dif_l5"] = dif["home_goals_mean_l5"] - dif["away_goals_suf_mean_l5"]
dif.loc[:, "goals_suf_made_dif"] = dif["home_goals_suf_mean"] - dif["away_goals_mean"]
dif.loc[:, "goals_suf_made_dif_l5"] = dif["home_goals_suf_mean_l5"] - dif["away_goals_mean_l5"]
data_difs = dif.iloc[:, -8:]
scaled = (data_difs - data_difs.mean()) / data_difs.std()
scaled["target"] = data2["target"]
violin = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")
plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="target", data=violin,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
As can be seen from this plot, the difference in goals scored is a good separator, as is the difference in goals conceded. But the cross difference between one team's goals scored and the other team's goals conceded is not a good separator.
- Then we now filter out the following 5 features
- rank_dif
- goals_dif
- goals_dif_l5
- goals_suf_dif
- goals_suf_dif_l5
- Next, we can also create other features, such as differences in the game points obtained and differences in the ranks faced
dif.loc[:, "dif_points"] = dif["home_game_points_mean"] - dif["away_game_points_mean"]
dif.loc[:, "dif_points_l5"] = dif["home_game_points_mean_l5"] - dif["away_game_points_mean_l5"]
dif.loc[:, "dif_points_rank"] = dif["home_game_points_rank_mean"] - dif["away_game_points_rank_mean"]
dif.loc[:, "dif_points_rank_l5"] = dif["home_game_points_rank_mean_l5"] - dif["away_game_points_rank_mean_l5"]
dif.loc[:, "dif_rank_agst"] = dif["home_rank_mean"] - dif["away_rank_mean"]
dif.loc[:, "dif_rank_agst_l5"] = dif["home_rank_mean_l5"] - dif["away_rank_mean_l5"]
- We can also measure goals scored and conceded relative to the rank of the opponents faced, and examine those differences
dif.loc[:, "goals_per_ranking_dif"] = (dif["home_goals_mean"] / dif["home_rank_mean"]) - (dif["away_goals_mean"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_suf_dif"] = (dif["home_goals_suf_mean"] / dif["home_rank_mean"]) - (dif["away_goals_suf_mean"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_dif_l5"] = (dif["home_goals_mean_l5"] / dif["home_rank_mean"]) - (dif["away_goals_mean_l5"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_suf_dif_l5"] = (dif["home_goals_suf_mean_l5"] / dif["home_rank_mean"]) - (dif["away_goals_suf_mean_l5"] / dif["away_rank_mean"])
- As usual, normalize the newly constructed features and visualize them with a violin plot
data_difs = dif.iloc[:, -10:]
scaled = (data_difs - data_difs.mean()) / data_difs.std()
scaled["target"] = data2["target"]
violin = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")
plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
Because these features take small values, the violin plot is hard to read, so for them we use boxplots instead
plt.figure(figsize=(15,10))
sns.boxplot(x="features", y="value", hue="target", data=violin)
plt.xticks(rotation=90)
plt.show()
- From this we can see that the difference in game points (all matches and last 5), the difference in game points by rank faced (all matches and last 5), and the difference in rank faced (all matches and last 5) are good features.
- Also, some derived features have very similar distributions, for which we will use scatterplots for analysis.
sns.jointplot(data = data_difs, x = 'goals_per_ranking_dif', y = 'goals_per_ranking_dif_l5', kind="reg")
plt.show()
- The two versions of goals_per_ranking_dif above are highly correlated, so we keep only the full version (goals_per_ranking_dif). Next we plot the rank-faced difference features:
sns.jointplot(data = data_difs, x = 'dif_rank_agst', y = 'dif_rank_agst_l5', kind="reg")
plt.show()
- Do the same for the game points difference features
sns.jointplot(data = data_difs, x = 'dif_points', y = 'dif_points_l5', kind="reg")
plt.show()
- And for the game points by rank faced features
sns.jointplot(data = data_difs, x = 'dif_points_rank', y = 'dif_points_rank_l5', kind="reg")
plt.show()
Since the two versions (all stats, last 5 games) are not that similar, we decided to use both. Therefore, the final result of our feature screening is:
- rank_dif
- goals_dif
- goals_dif_l5
- goals_suf_dif
- goals_suf_dif_l5
- dif_rank_agst
- dif_rank_agst_l5
- goals_per_ranking_dif
- dif_points_rank
- dif_points_rank_l5
- is_friendly
def create_db(df):
    # assemble the final modeling table from the selected base columns
    columns = ["home_team", "away_team", "target", "rank_dif", "home_goals_mean", "home_rank_mean", "away_goals_mean", "away_rank_mean", "home_rank_mean_l5", "away_rank_mean_l5", "home_goals_suf_mean", "away_goals_suf_mean", "home_goals_mean_l5", "away_goals_mean_l5", "home_goals_suf_mean_l5", "away_goals_suf_mean_l5", "home_game_points_rank_mean", "home_game_points_rank_mean_l5", "away_game_points_rank_mean", "away_game_points_rank_mean_l5", "is_friendly_0", "is_friendly_1"]
    base = df.loc[:, columns]
    base.loc[:, "goals_dif"] = base["home_goals_mean"] - base["away_goals_mean"]
    base.loc[:, "goals_dif_l5"] = base["home_goals_mean_l5"] - base["away_goals_mean_l5"]
    base.loc[:, "goals_suf_dif"] = base["home_goals_suf_mean"] - base["away_goals_suf_mean"]
    base.loc[:, "goals_suf_dif_l5"] = base["home_goals_suf_mean_l5"] - base["away_goals_suf_mean_l5"]
    base.loc[:, "goals_per_ranking_dif"] = (base["home_goals_mean"] / base["home_rank_mean"]) - (base["away_goals_mean"] / base["away_rank_mean"])
    base.loc[:, "dif_rank_agst"] = base["home_rank_mean"] - base["away_rank_mean"]
    base.loc[:, "dif_rank_agst_l5"] = base["home_rank_mean_l5"] - base["away_rank_mean_l5"]
    base.loc[:, "dif_points_rank"] = base["home_game_points_rank_mean"] - base["away_game_points_rank_mean"]
    base.loc[:, "dif_points_rank_l5"] = base["home_game_points_rank_mean_l5"] - base["away_game_points_rank_mean_l5"]
    model_df = base[["home_team", "away_team", "target", "rank_dif", "goals_dif", "goals_dif_l5", "goals_suf_dif", "goals_suf_dif_l5", "goals_per_ranking_dif", "dif_rank_agst", "dif_rank_agst_l5", "dif_points_rank", "dif_points_rank_l5", "is_friendly_0", "is_friendly_1"]]
    return model_df
model_db = create_db(df)
model_db
Build a predictive model
- Through the steps above, we have obtained a dataset with predictive power, so we can begin modeling
- In this task we build two models (a random forest classifier, RFC, and gradient boosting, GBDT), and the model with the better recall is selected as our prediction model
- First we filter out our feature and label columns
X = model_db.iloc[:, 3:]
y = model_db[["target"]]
- Import the RFC and GBDT classes from sklearn's ensemble module
- Import the train/test-split and grid-search utilities from sklearn's model_selection module
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
- We split the dataset 8:2 and set the random seed to 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=1)
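One optional refinement (an assumption on my part, not in the original code): passing `stratify=y` keeps the win/not-win class ratio identical in train and test, which is safer when the classes are imbalanced. A toy illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 7 + [1] * 3)  # imbalanced toy labels
# stratify=y_demo preserves the 7:3 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)
```

Without stratification, a small test set can by chance end up with very few samples of the minority class, skewing the evaluation.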
Build GBDT model
- First build the GBDT model, and use grid search to fine-tune the model
gb = GradientBoostingClassifier(random_state=5)
params = {
"learning_rate": [0.01, 0.1, 0.5],
"min_samples_split": [5, 10],
"min_samples_leaf": [3, 5],
"max_depth":[3,5,10],
"max_features":["sqrt"],
"n_estimators":[100, 200]
}
gb_cv = GridSearchCV(gb, params, cv = 3, n_jobs = -1, verbose = False)
gb_cv.fit(X_train.values, np.ravel(y_train))
- Inspect the best parameter configuration found for the GBDT
gb = gb_cv.best_estimator_
gb
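For reference, a compact end-to-end version of this grid-search pattern on synthetic data (a deliberately tiny grid so it runs in seconds): `best_estimator_` returns the refitted model and `best_params_` the winning configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# toy data standing in for the match features
X_toy, y_toy = make_classification(n_samples=60, n_features=4, random_state=0)
grid = GridSearchCV(GradientBoostingClassifier(random_state=5),
                    {"n_estimators": [10, 20]}, cv=2)
grid.fit(X_toy, y_toy)
best = grid.best_estimator_  # a fitted GradientBoostingClassifier
```

With the default `refit=True`, the best parameter combination is retrained on the full training data, so `best` is ready to predict immediately.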
Create an RFC model
- Next, build the RFC model and fine-tune the model with grid search
params_rf = {
"max_depth": [20],
"min_samples_split": [10],
"max_leaf_nodes": [175],
"min_samples_leaf": [5],
"n_estimators": [250],
"max_features": ["sqrt"],
}
rf = RandomForestClassifier(random_state=1)
rf_cv = GridSearchCV(rf, params_rf, cv = 3, n_jobs = -1, verbose = False)
rf_cv.fit(X_train.values, np.ravel(y_train))
rf = rf_cv.best_estimator_
Model comparison
Here, we use confusion matrix and ROC curve for model comparison
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

def analyze(model):
    # ROC curve on the test set
    fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test.values)[:, 1])
    plt.figure(figsize=(15, 10))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr, label="test")
    # ROC curve on the training set
    fpr_train, tpr_train, _ = roc_curve(y_train, model.predict_proba(X_train.values)[:, 1])
    plt.plot(fpr_train, tpr_train, label="train")
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test.values)[:, 1])
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train.values)[:, 1])
    plt.legend()
    plt.title('AUC score is %.2f on test and %.2f on training' % (auc_test, auc_train))
    plt.show()
    # confusion matrix on the test set
    plt.figure(figsize=(15, 10))
    cm = confusion_matrix(y_test, model.predict(X_test.values))
    sns.heatmap(cm, annot=True, fmt="d")
- GBDT
analyze(gb)
- RFC
analyze(rf)
The analysis shows the random forest may be slightly better on test metrics, but its generalization (the train/test gap) looks worse. We will therefore use the GBDT model.
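Since the text selects models partly by recall, here is how recall relates to the confusion matrix, computed on a toy prediction vector (illustrative values, not the model's actual output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])
cm = confusion_matrix(y_true, y_pred)  # rows = true class, cols = predicted class
rec = recall_score(y_true, y_pred)     # TP / (TP + FN) for the positive class
```

Here there are 3 true positives and 1 false negative, so recall is 3/4; it can be read directly off the bottom row of the confusion matrix.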
world cup simulation
Data preparation and preprocessing
- The first step is to recreate the FIFA World Cup fixtures
- To do this, we scrape the teams and group-stage matches from Wikipedia
- pd.read_html lets us grab the tables quickly
from operator import itemgetter
dfs = pd.read_html(r"https://en.wikipedia.org/wiki/2022_FIFA_World_Cup#Teams")
- Preprocess the scraped tables
from collections.abc import Iterable
# locate the group tables between the tie-breaking criteria and "Match 46"
for i in range(len(dfs)):
    df = dfs[i]
    cols = list(df.columns.values)
    if isinstance(cols[0], Iterable):
        if any("Tie-breaking criteria" in c for c in cols):
            start_pos = i + 1
        if any("Match 46" in c for c in cols):
            end_pos = i + 1
matches = []
groups = ["A", "B", "C", "D", "E", "F", "G", "H"]
group_count = 0
table = {}
# each group maps to a list of [team name, points, list of per-game win probabilities]
table[groups[group_count]] = [[a.split(" ")[0], 0, []] for a in list(dfs[start_pos].iloc[:, 1].values)]
for i in range(start_pos + 1, end_pos, 1):
    if len(dfs[i].columns) == 3:
        team_1 = dfs[i].columns.values[0]
        team_2 = dfs[i].columns.values[-1]
        matches.append((groups[group_count], team_1, team_2))
    else:
        group_count += 1
        table[groups[group_count]] = [[a, 0, []] for a in list(dfs[i].iloc[:, 1].values)]
table
Above, we store each team's group-stage points together with its win probability for each game. When two teams finish level on points, the mean of their per-game win probabilities is used as the tiebreaker.
Next, we use each team's past games as its input data. For example, for Brazil vs. Serbia, Brazil's features come from its previous matches, and likewise for Serbia.
def find_stats(team_1):
    # build a team's feature vector from its past games
    past_games = team_stats_raw[(team_stats_raw["team"] == team_1)].sort_values("date")
    last5 = past_games.tail(5)
    team_1_rank = past_games["rank"].values[-1]  # latest FIFA rank
    team_1_goals = past_games.score.mean()
    team_1_goals_l5 = last5.score.mean()
    team_1_goals_suf = past_games.suf_score.mean()
    team_1_goals_suf_l5 = last5.suf_score.mean()
    team_1_rank_suf = past_games.rank_suf.mean()
    team_1_rank_suf_l5 = last5.rank_suf.mean()
    team_1_gp_rank = past_games.points_by_rank.mean()
    team_1_gp_rank_l5 = last5.points_by_rank.mean()
    return [team_1_rank, team_1_goals, team_1_goals_l5, team_1_goals_suf, team_1_goals_suf_l5, team_1_rank_suf, team_1_rank_suf_l5, team_1_gp_rank, team_1_gp_rank_l5]
def find_features(team_1, team_2):
    rank_dif = team_1[0] - team_2[0]
    goals_dif = team_1[1] - team_2[1]
    goals_dif_l5 = team_1[2] - team_2[2]
    goals_suf_dif = team_1[3] - team_2[3]
    goals_suf_dif_l5 = team_1[4] - team_2[4]
    goals_per_ranking_dif = (team_1[1] / team_1[5]) - (team_2[1] / team_2[5])
    dif_rank_agst = team_1[5] - team_2[5]
    dif_rank_agst_l5 = team_1[6] - team_2[6]
    dif_gp_rank = team_1[7] - team_2[7]
    dif_gp_rank_l5 = team_1[8] - team_2[8]
    # trailing 1, 0 are the is_friendly_0 / is_friendly_1 flags: World Cup games are not friendlies
    return [rank_dif, goals_dif, goals_dif_l5, goals_suf_dif, goals_suf_dif_l5, goals_per_ranking_dif, dif_rank_agst, dif_rank_agst_l5, dif_gp_rank, dif_gp_rank_l5, 1, 0]
officially start the simulation
Now we can start simulating the World Cup.
Since the model is a binary classifier, it only predicts whether team 1 wins. And since there is no home advantage at the World Cup, we predict each match twice, swapping team 1 and team 2, and take the team with the higher average probability as the winner. In the group stage, if a team is predicted to win as team 1 but lose as team 2 (or vice versa), the match is scored as a draw.
advanced_group = []
last_group = ""

# reset points (index 1) and the per-game probability list (index 2) for every team
for k in table.keys():
    for t in table[k]:
        t[1] = 0
        t[2] = []

for teams in matches:
    draw = False
    team_1 = find_stats(teams[1])
    team_2 = find_stats(teams[2])

    # predict the match twice, with team 1 and team 2 swapped
    features_g1 = find_features(team_1, team_2)
    features_g2 = find_features(team_2, team_1)

    probs_g1 = gb.predict_proba([features_g1])
    probs_g2 = gb.predict_proba([features_g2])

    team_1_prob_g1 = probs_g1[0][0]
    team_1_prob_g2 = probs_g2[0][1]
    team_2_prob_g1 = probs_g1[0][1]
    team_2_prob_g2 = probs_g2[0][0]

    team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
    team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2

    # contradictory predictions across the two orderings count as a draw
    if ((team_1_prob_g1 > team_2_prob_g1) & (team_2_prob_g2 > team_1_prob_g2)) | ((team_1_prob_g1 < team_2_prob_g1) & (team_2_prob_g2 < team_1_prob_g2)):
        draw = True
        for i in table[teams[0]]:
            if i[0] == teams[1] or i[0] == teams[2]:
                i[1] += 1
    elif team_1_prob > team_2_prob:
        winner = teams[1]
        winner_proba = team_1_prob
        for i in table[teams[0]]:
            if i[0] == teams[1]:
                i[1] += 3
    elif team_2_prob > team_1_prob:
        winner = teams[2]
        winner_proba = team_2_prob
        for i in table[teams[0]]:
            if i[0] == teams[2]:
                i[1] += 3

    for i in table[teams[0]]:  # tie-break criterion: per-game win probability
        if i[0] == teams[1]:
            i[2].append(team_1_prob)
        if i[0] == teams[2]:
            i[2].append(team_2_prob)

    if last_group != teams[0]:
        if last_group != "":
            # the previous group is finished, so print its final table
            print("\n")
            print("Group %s: " % (last_group))
            for i in table[last_group]:  # apply the tie-break criterion
                i[2] = np.mean(i[2])
            final_points = table[last_group]
            final_table = sorted(final_points, key=itemgetter(1, 2), reverse=True)
            advanced_group.append([final_table[0][0], final_table[1][0]])
            for i in final_table:
                print("%s -------- %d" % (i[0], i[1]))
        print("\n")
        print("-"*10 + " Group %s analysis begins " % (teams[0]) + "-"*10)

    if draw == False:
        print("Group %s - %s vs. %s: %s wins with probability %.2f" % (teams[0], teams[1], teams[2], winner, winner_proba))
    else:
        print("Group %s - %s vs. %s: draw" % (teams[0], teams[1], teams[2]))
    last_group = teams[0]

# print the final table of the last group
print("\n")
print("Group %s: " % (last_group))
for i in table[last_group]:  # apply the tie-break criterion
    i[2] = np.mean(i[2])
final_points = table[last_group]
final_table = sorted(final_points, key=itemgetter(1, 2), reverse=True)
advanced_group.append([final_table[0][0], final_table[1][0]])
for i in final_table:
    print("%s -------- %d" % (i[0], i[1]))
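The table sorting uses `itemgetter(1, 2)`, so teams are ranked by points first and, on equal points, by their mean per-game win probability. A standalone sketch with made-up standings (using the stdlib's `statistics.mean` in place of `np.mean`):

```python
from operator import itemgetter
from statistics import mean

# made-up group standings: [team, points, per-game win probabilities]
group = [
    ["France",    7, [0.65, 0.52, 0.63]],
    ["Denmark",   7, [0.63, 0.50, 0.65]],
    ["Tunisia",   1, [0.37, 0.50, 0.36]],
    ["Australia", 1, [0.35, 0.48, 0.37]],
]

# replace each probability list with its mean (the tie-break criterion)
for row in group:
    row[2] = mean(row[2])

# sort by points, then by mean probability, both descending
final_table = sorted(group, key=itemgetter(1, 2), reverse=True)
print([row[0] for row in final_table])
# -> ['France', 'Denmark', 'Tunisia', 'Australia']
```

France and Denmark are level on 7 points, so the higher mean probability (0.60 vs. 0.59) decides the group winner; the top two rows of `final_table` are the teams that advance.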
Group stage predictions
---------- Group A analysis begins ----------
Group A - Qatar vs. Ecuador: Ecuador wins with probability 0.60
Group A - Senegal vs. Netherlands: Netherlands wins with probability 0.59
Group A - Qatar vs. Senegal: Senegal wins with probability 0.58
Group A - Netherlands vs. Ecuador: Netherlands wins with probability 0.66
Group A - Ecuador vs. Senegal: Ecuador wins with probability 0.53
Group A - Netherlands vs. Qatar: Netherlands wins with probability 0.69
Group A:
Netherlands -------- 9
Ecuador -------- 6
Senegal -------- 3
Qatar -------- 0
---------- Group B analysis begins ----------
Group B - England vs. Iran: England wins with probability 0.60
Group B - USA vs. Wales: draw
Group B - Wales vs. Iran: Wales wins with probability 0.54
Group B - England vs. USA: England wins with probability 0.58
Group B - Wales vs. England: England wins with probability 0.60
Group B - Iran vs. USA: USA wins with probability 0.57
Group B:
England -------- 9
USA -------- 4
Wales -------- 4
Iran -------- 0
---------- Group C analysis begins ----------
Group C - Argentina vs. Saudi Arabia: Argentina wins with probability 0.70
Group C - Mexico vs. Poland: draw
Group C - Poland vs. Saudi Arabia: Poland wins with probability 0.64
Group C - Argentina vs. Mexico: Argentina wins with probability 0.62
Group C - Poland vs. Argentina: Argentina wins with probability 0.64
Group C - Saudi Arabia vs. Mexico: Mexico wins with probability 0.64
Group C:
Argentina -------- 9
Poland -------- 4
Mexico -------- 4
Saudi Arabia -------- 0
---------- Group D analysis begins ----------
Group D - Denmark vs. Tunisia: Denmark wins with probability 0.63
Group D - France vs. Australia: France wins with probability 0.65
Group D - Tunisia vs. Australia: draw
Group D - France vs. Denmark: draw
Group D - Australia vs. Denmark: Denmark wins with probability 0.65
Group D - Tunisia vs. France: France wins with probability 0.63
Group D:
France -------- 7
Denmark -------- 7
Tunisia -------- 1
Australia -------- 1
---------- Group E analysis begins ----------
Group E - Germany vs. Japan: Germany wins with probability 0.59
Group E - Spain vs. Costa Rica: Spain wins with probability 0.68
Group E - Japan vs. Costa Rica: draw
Group E - Spain vs. Germany: draw
Group E - Japan vs. Spain: Spain wins with probability 0.62
Group E - Costa Rica vs. Germany: Germany wins with probability 0.60
Group E:
Spain -------- 7
Germany -------- 7
Japan -------- 1
Costa Rica -------- 1
---------- Group F analysis begins ----------
Group F - Morocco vs. Croatia: Croatia wins with probability 0.58
Group F - Belgium vs. Canada: Belgium wins with probability 0.67
Group F - Belgium vs. Morocco: Belgium wins with probability 0.63
Group F - Croatia vs. Canada: Croatia wins with probability 0.62
Group F - Croatia vs. Belgium: Belgium wins with probability 0.60
Group F - Canada vs. Morocco: draw
Group F:
Belgium -------- 9
Croatia -------- 6
Morocco -------- 1
Canada -------- 1
---------- Group G analysis begins ----------
Group G - Switzerland vs. Cameroon: Switzerland wins with probability 0.62
Group G - Brazil vs. Serbia: Brazil wins with probability 0.63
Group G - Cameroon vs. Serbia: Serbia wins with probability 0.61
Group G - Brazil vs. Switzerland: draw
Group G - Serbia vs. Switzerland: Switzerland wins with probability 0.56
Group G - Cameroon vs. Brazil: Brazil wins with probability 0.71
Group G:
Brazil -------- 7
Switzerland -------- 7
Serbia -------- 3
Cameroon -------- 0
---------- Group H analysis begins ----------
Group H - Uruguay vs. South Korea: Uruguay wins with probability 0.60
Group H - Portugal vs. Ghana: Portugal wins with probability 0.71
Group H - South Korea vs. Ghana: South Korea wins with probability 0.69
Group H - Portugal vs. Uruguay: draw
Group H - Ghana vs. Uruguay: Uruguay wins with probability 0.69
Group H - South Korea vs. Portugal: Portugal wins with probability 0.63
Group H:
Portugal -------- 7
Uruguay -------- 7
South Korea -------- 3
Ghana -------- 0
playoff predictions
There are no real surprises in the group-stage predictions; the notable results are the predicted draws between Brazil and Switzerland and between France and Denmark. For the playoff stage, the results are shown as a tree diagram.
advanced = advanced_group
playoffs = {"Round of 16": [], "Quarter-final": [], "Semi-final": [], "Final": []}

for p in playoffs.keys():
    playoffs[p] = []

actual_round = ""
next_rounds = []

for p in playoffs.keys():
    if p == "Round of 16":
        # build the bracket: group winners face runners-up of neighbouring groups
        control = []
        for a in range(0, len(advanced*2), 1):
            if a < len(advanced):
                if a % 2 == 0:
                    control.append((advanced*2)[a][0])
                else:
                    control.append((advanced*2)[a][1])
            else:
                if a % 2 == 0:
                    control.append((advanced*2)[a][1])
                else:
                    control.append((advanced*2)[a][0])
        playoffs[p] = [[control[c], control[c+1]] for c in range(0, len(control)-1, 1) if c % 2 == 0]

        for i in range(0, len(playoffs[p]), 1):
            game = playoffs[p][i]
            home = game[0]
            away = game[1]
            team_1 = find_stats(home)
            team_2 = find_stats(away)
            features_g1 = find_features(team_1, team_2)
            features_g2 = find_features(team_2, team_1)
            probs_g1 = gb.predict_proba([features_g1])
            probs_g2 = gb.predict_proba([features_g2])
            team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
            team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2
            if actual_round != p:
                print("-"*10)
                print("Simulating the %s" % (p))
                print("-"*10)
                print("\n")
            # no draws in the knockout stage: the higher average probability advances
            if team_1_prob < team_2_prob:
                print("%s vs. %s: %s advances with probability %.2f" % (home, away, away, team_2_prob))
                next_rounds.append(away)
            else:
                print("%s vs. %s: %s advances with probability %.2f" % (home, away, home, team_1_prob))
                next_rounds.append(home)
            game.append([team_1_prob, team_2_prob])
            playoffs[p][i] = game
            actual_round = p
    else:
        # later rounds are built from the winners of the previous round
        playoffs[p] = [[next_rounds[c], next_rounds[c+1]] for c in range(0, len(next_rounds)-1, 1) if c % 2 == 0]
        next_rounds = []
        for i in range(0, len(playoffs[p])):
            game = playoffs[p][i]
            home = game[0]
            away = game[1]
            team_1 = find_stats(home)
            team_2 = find_stats(away)
            features_g1 = find_features(team_1, team_2)
            features_g2 = find_features(team_2, team_1)
            probs_g1 = gb.predict_proba([features_g1])
            probs_g2 = gb.predict_proba([features_g2])
            team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
            team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2
            if actual_round != p:
                print("-"*10)
                print("Simulating the %s" % (p))
                print("-"*10)
                print("\n")
            if team_1_prob < team_2_prob:
                print("%s vs. %s: %s advances with probability %.2f" % (home, away, away, team_2_prob))
                next_rounds.append(away)
            else:
                print("%s vs. %s: %s advances with probability %.2f" % (home, away, home, team_1_prob))
                next_rounds.append(home)
            game.append([team_1_prob, team_2_prob])
            playoffs[p][i] = game
            actual_round = p
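The `control` loop above implements the standard World Cup cross-pairing: each group winner meets the runner-up of the neighbouring group (1A vs. 2B, 1C vs. 2D, and so on, then 2A vs. 1B down the other half of the bracket). Substituting placeholder labels for the real qualifiers makes the pairing easy to verify:

```python
# placeholder qualifiers: winner ("1X") and runner-up ("2X") of each group A..H
advanced = [[f"1{g}", f"2{g}"] for g in "ABCDEFGH"]

control = []
for a in range(0, len(advanced * 2), 1):
    if a < len(advanced):
        # first half of the bracket: winners from even slots, runners-up from odd
        control.append((advanced * 2)[a][0] if a % 2 == 0 else (advanced * 2)[a][1])
    else:
        # second half: the roles are swapped
        control.append((advanced * 2)[a][1] if a % 2 == 0 else (advanced * 2)[a][0])

round_of_16 = [[control[c], control[c + 1]] for c in range(0, len(control) - 1, 1) if c % 2 == 0]
print(round_of_16)
# [['1A', '2B'], ['1C', '2D'], ['1E', '2F'], ['1G', '2H'],
#  ['2A', '1B'], ['2C', '1D'], ['2E', '1F'], ['2G', '1H']]
```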
---------- Simulating the Round of 16 ----------
Netherlands vs. USA: Netherlands advances with probability 0.55
Argentina vs. Denmark: Argentina advances with probability 0.59
Spain vs. Croatia: Spain advances with probability 0.57
Brazil vs. Uruguay: Brazil advances with probability 0.60
Ecuador vs. England: England advances with probability 0.65
Poland vs. France: France advances with probability 0.60
Germany vs. Belgium: Belgium advances with probability 0.50
Switzerland vs. Portugal: Portugal advances with probability 0.52
---------- Simulating the Quarter-final ----------
Netherlands vs. Argentina: Netherlands advances with probability 0.52
Spain vs. Brazil: Brazil advances with probability 0.51
England vs. France: France advances with probability 0.51
Belgium vs. Portugal: Portugal advances with probability 0.52
---------- Simulating the Semi-final ----------
Netherlands vs. Brazil: Brazil advances with probability 0.52
France vs. Portugal: Portugal advances with probability 0.52
---------- Simulating the Final ----------
Brazil vs. Portugal: Brazil advances with probability 0.52
Drawing the tree diagram
!pip install pydot pydot-ng graphviz
import networkx as nx
from networkx.drawing.nx_pydot import graphviz_layout

plt.figure(figsize=(15, 10))
G = nx.balanced_tree(2, 3)  # 15 nodes: 8 + 4 + 2 + 1 playoff games

# one label per game: both teams with their advancement probabilities
labels = []
for p in playoffs.keys():
    for game in playoffs[p]:
        label = f"{game[0]}({round(game[2][0], 2)}) \n {game[1]}({round(game[2][1], 2)})"
        labels.append(label)

labels_dict = {}
labels_rev = list(reversed(labels))
for l in range(len(list(G.nodes))):
    labels_dict[l] = labels_rev[l]

pos = graphviz_layout(G, prog='twopi')
labels_pos = {n: (k[0], k[1] - 0.08*k[1]) for n, k in pos.items()}
center = pd.DataFrame(pos).mean(axis=1).mean()

nx.draw(G, pos=pos, with_labels=False, node_color=range(15), edge_color="#bbf5bb",
        width=10, font_weight='bold', cmap=plt.cm.Greens, node_size=5000)
nx.draw_networkx_labels(G, pos=labels_pos,
                        bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="black", lw=.5, alpha=1),
                        labels=labels_dict)

texts = ["Round \nof 16", "Quarter \n Final", "Semi \n Final", "Final\n"]
pos_y = pos[0][1] + 55
for text in reversed(texts):
    pos_x = center
    pos_y -= 75
    plt.text(pos_y, pos_x, text, fontsize=18)
plt.axis('equal')
plt.show()
Summary
- Predictions will change as the database is updated while the tournament progresses. To get the latest results, simply feed in the latest data.
- The predictions are not necessarily accurate; the main purpose of this project is learning. In particular, it demonstrates data preprocessing and feature engineering techniques.
- The modeling here is relatively rough: there is only a single gradient boosting tree tuned by grid-search cross-validation, so the results are imprecise. Of course, I also hope Brazil wins in the end.
- As of noon on 2022-11-25, 16 World Cup matches had been played, 11 of which were predicted correctly. The upsets involving Argentina and Japan were not predicted, nor were the draws between Uruguay and South Korea, Morocco and Croatia, and Denmark and Tunisia.
- Please do not use these predictions for betting of any kind; they are for learning and reference only.
Optimization directions
- Try adding more features with predictive value, such as the impact of the COVID-19 pandemic or players' recent form.
- More time could be spent on feature engineering.
- Finally, the model used here is relatively rough; you could try a stronger machine learning model.
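For reference, the grid-search tuning mentioned in the summary can be done with scikit-learn's `GridSearchCV`. The sketch below uses a synthetic dataset and an illustrative parameter grid, not the project's actual features or grid:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the real feature matrix and match labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# illustrative grid; the project's actual search space may differ
params = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), params, cv=3)
search.fit(X, y)

gb = search.best_estimator_     # used the same way as the article's `gb`
print(search.best_params_)
print(gb.predict_proba(X[:1]))  # [P(class 0), P(class 1)] for one sample
```

Swapping in a different estimator (e.g. `HistGradientBoostingClassifier` or an external library) only requires changing the model and grid passed to `GridSearchCV`.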
code download
Baseline:
https://www.kaggle.com/code/sslp23/predicting-fifa-2022-world-cup-with-ml/notebook
A version that will be optimized and updated at any time:
https://aistudio.baidu.com/aistudio/projectdetail/5116425?contributionType=1&sUid=2553954&shared=1&ts=1669358827040