Predicting the 2022 World Cup Using Machine Learning

Project introduction

Dataset description

International Football Results 1872-2022

  • The dataset includes 44,152 international soccer match results, from the first official match in 1872 up to 2022.
  • Competitions range from the FIFA World Cup to the FIFI Wild Cup to regular friendlies.
  • The matches are strictly men's official international matches; the data does not include the Olympic Games or matches where at least one of the teams was a national B-team, U-23 side, or a league select team.

results.csv includes the following columns:

  • date - date of the match
  • home_team - home team name
  • away_team - away team name
  • home_score - full-time home team score, including extra time, excluding penalty shootouts
  • away_score - full-time away team score, including extra time, excluding penalty shootouts
  • tournament - tournament name
  • city - the name of the city/town/administrative unit where the match was played
  • country - the name of the country where the match is played
  • neutral - TRUE/FALSE column indicating whether the match is played on neutral ground

shootouts.csv includes the following columns:

  • date - date of the match
  • home_team - home team name
  • away_team - away team name
  • winner - penalty shootout winner

FIFA World Ranking 1992-2022

  • country_full — full country name
  • country_abrv — country abbreviation
  • rank — current country rank
  • total_points — current total points
  • previous_points — the total points of the previous rating
  • rank_change — rank change
  • confederation — FIFA Confederation
  • rank_date — the date the rank was calculated

Data Analysis and Preprocessing

data preparation

# Unzip the datasets
!unzip -d datasets/international-football-results-from-1872-to-2017 1872年至2022年国际足球成绩.zip
!unzip -d datasets/fifaworldranking 国际足联世界排名1992-2022.zip

Analysis and preprocessing of international football scores from 1872 to 2022

Import results.csv from the 1872-2022 international football results dataset

import pandas as pd
import numpy as np 
import re
df = pd.read_csv("datasets/international-football-results-from-1872-to-2017/results.csv")
df.head()

Let's preview the data first to get a sense of its general structure,
insert image description here
and then check the basic information of the data:

df.info()

insert image description here

  • The date column holds dates, but it is stored as strings rather than a datetime type, so we need to convert it during preprocessing
  • Only two columns in this table are continuous features; the rest are discrete

Check for missing values

# Check for missing values
df.isna().sum()

insert image description here

  • Two columns each contain 40 missing values.
  • These two columns are exactly our most important features: the scores.
  • Without the score features we cannot build a model, so we remove the samples with missing scores.

Eliminate samples with missing scores and modify the date format

# Drop the rows containing missing values
df.dropna(inplace=True)
# Convert the date column to a datetime format
df["date"] = pd.to_datetime(df["date"])

Running this directly raises an error: the date column contains the string '2022-19-22', which is not a valid date (there is no month 19). We therefore remove the row containing this string before converting the column.

df = df.drop(df[df['date']=='2022-19-22'].index,axis=0)
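As an alternative to hard-coding the bad string, pd.to_datetime's errors="coerce" option flags any unparsable date automatically — a minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical frame with one malformed date (month 19 does not exist)
toy = pd.DataFrame({"date": ["2022-11-20", "2022-19-22", "2022-12-18"],
                    "home_team": ["Qatar", "X", "Argentina"]})

# errors="coerce" turns every unparsable date into NaT instead of raising
parsed = pd.to_datetime(toy["date"], errors="coerce")

# keep only the rows whose dates parsed, and store the datetime values
clean = toy[parsed.notna()].assign(date=parsed[parsed.notna()])
```

This catches not only '2022-19-22' but any other malformed date that might appear.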

The dataset we will use covers the matches played after the 2018 World Cup and before the 2022 World Cup. The idea is to analyze how the teams performed during the qualification and preparation phase for the World Cup.
Therefore, we filter the dataset accordingly.

  • Let's take a look at the last few games in 2022
df.sort_values("date").tail()

insert image description here

  • Filter out games after August 1, 2018, and reset the index
df = df[(df["date"] >= "2018-8-1")].reset_index(drop=True)
df.sort_values('date').tail()

Analysis and preprocessing of the FIFA World Rankings 1992-2022 dataset

Just like before, we need to convert the date format first and extract the data after August 1, 2018

rank = pd.read_csv("datasets/fifaworldranking/fifa_ranking-2022-10-06.csv")
rank["rank_date"] = pd.to_datetime(rank["rank_date"])  # convert to datetime format
rank = rank[(rank["rank_date"] >= "2018-8-1")].reset_index(drop=True)  # filter the dataset

Some teams that play in the World Cup appear under different names in the ranking dataset, so those names need to be adjusted.

rank["country_full"] = rank["country_full"].str.replace("IR Iran", "Iran").str.replace("Korea Republic", "South Korea").str.replace("USA", "United States")

Merge the two tables

Next, we merge the two datasets so that each World Cup match row also carries the teams' rankings.

  • Set the date as our index, then group by country, resample the first data of each day as our data, and finally reset the index
  • If it is empty, we use the forward filling method.
rank = rank.set_index(['rank_date']).groupby(['country_full'], group_keys=False).resample('D').first().fillna(method='ffill').reset_index()
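The resample-and-forward-fill step can be illustrated on a toy ranking table — here simplified to a single team and only the rank column (hypothetical values):

```python
import pandas as pd

# Hypothetical rankings published on two dates for one team
rk = pd.DataFrame({
    "rank_date": pd.to_datetime(["2022-01-01", "2022-01-04"]),
    "country_full": ["Brazil", "Brazil"],
    "rank": [1, 2],
})

# Upsample each team's rankings to daily frequency; days without a published
# ranking are forward-filled, so any match date can find a ranking row
daily = (rk.set_index("rank_date")
           .groupby("country_full")["rank"]
           .resample("D").first()
           .ffill()
           .reset_index())
```

Jan 2 and Jan 3 carry the Jan 1 rank forward; note that with several teams the forward fill should be applied per group, as the article's groupby version does.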
  • We select the columns "country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date" from the rank table to merge into the df table
  • The join aligns date and home_team in the left table with rank_date and country_full in the right table
  • Since the merged result would contain duplicate key columns, we keep only one copy and drop rank_date and country_full
df_wc_ranked = df.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]],
 left_on=["date", "home_team"], right_on=["rank_date", "country_full"]).drop(["rank_date", "country_full"], axis=1)
  • results.csv contains both home_team (home team name) and away_team (away team name)
  • The merge above only attached the home team's ranking data; the away team's data still needs to be merged
  • We take the same columns from rank as before, but this time the join uses away_team instead of home_team
  • Merging the same columns a second time would create duplicate column names, so, to distinguish home and away statistics, the home team's ranking columns get the suffix _home and the away team's get _away
  • Finally, as before, we drop the duplicated key columns (rank_date and country_full)
df_wc_ranked = df_wc_ranked.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], 
left_on=["date", "away_team"], right_on=["rank_date", "country_full"], 
suffixes=("_home", "_away")).drop(["rank_date", "country_full"], axis=1)
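The double merge with suffixes can be checked on a minimal made-up example (hypothetical games and stats frames):

```python
import pandas as pd

# Hypothetical match table and a per-team stats table
games = pd.DataFrame({"home_team": ["Brazil"], "away_team": ["Serbia"]})
stats = pd.DataFrame({"team": ["Brazil", "Serbia"], "rank": [1, 21]})

# First merge attaches the home team's stats, the second the away team's;
# suffixes=("_home", "_away") disambiguates the duplicated 'rank' column
merged = (games
          .merge(stats, left_on="home_team", right_on="team").drop("team", axis=1)
          .merge(stats, left_on="away_team", right_on="team",
                 suffixes=("_home", "_away")).drop("team", axis=1))
```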

After merging, let's take a look at some of the rows where Brazil is either the home or the away team.

df_wc_ranked[(df_wc_ranked.home_team == "Brazil") | (df_wc_ranked.away_team == "Brazil")].tail()

insert image description here
Now that we have the data ready, we can perform feature engineering on the dataset

feature engineering

  • The idea here is to create more features that influence the outcome of a soccer match
  • We expect the influential characteristics to include:
    1. The team's historical game points
    2. The team's historical goals scored and conceded
    3. The team's ranking
    4. The team's recent rise in the ranking
    5. Goals scored and conceded relative to the rank of the opponents faced
    6. The importance of the game (friendly or not)
  • So we first create a function that determines which team won and how many game points each side earned

Encapsulate a function to judge winning or losing

df = df_wc_ranked
def result_finder(home, away):
    # returns [result, home_team_points, away_team_points]
    # result: 0 = home win, 1 = away win, 2 = draw; game points: 3 win / 1 draw / 0 loss
    if home > away:
        return pd.Series([0, 3, 0])
    if home < away:
        return pd.Series([1, 0, 3])
    return pd.Series([2, 1, 1])

results = df.apply(lambda x: result_finder(x["home_score"], x["away_score"]), axis=1)
df[["result", "home_team_points", "away_team_points"]] = results
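A quick sanity check of the encoding (the function is repeated here so the snippet is self-contained): result 0 = home win, 1 = away win, 2 = draw, with 3/1/0 game points.

```python
import pandas as pd

def result_finder(home, away):
    # returns [result, home_team_points, away_team_points]
    if home > away:
        return pd.Series([0, 3, 0])
    if home < away:
        return pd.Series([1, 0, 3])
    return pd.Series([2, 1, 1])

home_win = result_finder(2, 0)  # [0, 3, 0]
draw = result_finder(1, 1)      # [2, 1, 1]
```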

hypothesis testing

  • Game points are 3 for a win, 1 for a draw, and 0 for a loss, which is different from the ranking points already in the dataset.
  • In addition, we suspect that a team's ranking points and its rank are strongly (negatively) correlated, so we should use only one of them when creating new features.
  • The following tests this hypothesis.
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 10))
sns.heatmap(df[["total_points_home", "rank_home", "total_points_away", "rank_away"]].corr())
plt.show()

insert image description here
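The relationship being tested can also be checked numerically; on hypothetical values where a team with more ranking points holds a better (numerically lower) rank, the correlation comes out strongly negative:

```python
import pandas as pd

# Hypothetical teams: more total points should mean a lower (better) rank
toy = pd.DataFrame({"total_points": [1800, 1500, 1200, 900],
                    "rank": [1, 10, 30, 60]})

corr = toy["total_points"].corr(toy["rank"])  # Pearson correlation, negative here
```

A strongly negative coefficient supports keeping only one of the two columns.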

Feature Derivation

  • Now we create features that are useful for modeling
  • For example:
    1. The difference in rank
    2. The game points won in a match relative to the rank of the team faced
    3. The goal difference in a match
    All features that are not differences should be created for both teams (home and away).
df["rank_dif"] = df["rank_home"] - df["rank_away"]  # difference in rank
df["sg"] = df["home_score"] - df["away_score"]  # goal difference
df["points_home_by_rank"] = df["home_team_points"]/df["rank_away"]  # home team's game points relative to the away team's rank
df["points_away_by_rank"] = df["away_team_points"]/df["rank_home"]  # away team's game points relative to the home team's rank

To simplify feature derivation, we split the dataset into a home-team dataset and an away-team dataset, stack them to compute each team's past-game statistics, and then split and recombine them to rebuild the original match-level layout.

  • First divide the dataset into home and away datasets
home_team = df[["date", "home_team", "home_score", "away_score", "rank_home", "rank_away","rank_change_home", "total_points_home", "result", "rank_dif", "points_home_by_rank", "home_team_points"]]

away_team = df[["date", "away_team", "away_score", "home_score", "rank_away", "rank_home","rank_change_away", "total_points_away", "result", "rank_dif", "points_away_by_rank", "away_team_points"]]
  • Since the column names were modified when the datasets were merged, we now restore the original names for further processing
home_team.columns = [h.replace("home_", "").replace("_home", "").replace("away_", "suf_").replace("_away", "_suf") for h in home_team.columns]

away_team.columns = [a.replace("away_", "").replace("_away", "").replace("home_", "suf_").replace("_home", "_suf") for a in away_team.columns]
  • Stack them together for feature calculation (DataFrame.append is deprecated in recent pandas, so pd.concat is used here)
team_stats = pd.concat([home_team, away_team])
  • Keep a raw copy; it will be reused later when simulating the World Cup
team_stats_raw = team_stats.copy()

Now, we have a dataset ready for further feature derivation. The columns that will be derived are:

  • Mean goals of the team in the World Cup cycle
  • Mean goals of the team in the last 5 games
  • Mean goals conceded by the team in the World Cup cycle
  • Mean goals conceded by the team in the last 5 games
  • Mean FIFA rank of the opponents faced in the World Cup cycle
  • Mean FIFA rank of the opponents faced in the last 5 games
  • FIFA points won during the cycle
  • FIFA points won in the last 5 games
  • Mean game points during the cycle
  • Mean game points in the last 5 games
  • Mean game points relative to the rank faced during the cycle
  • Mean game points relative to the rank faced in the last 5 games
stats_val = []

for index, row in team_stats.iterrows():
    team = row["team"]
    date = row["date"]
    past_games = team_stats.loc[(team_stats["team"] == team) & (team_stats["date"] < date)].sort_values(by=['date'], ascending=False)
    last5 = past_games.head(5)  # the team's last five games
    
    goals = past_games["score"].mean()
    goals_l5 = last5["score"].mean()
    
    goals_suf = past_games["suf_score"].mean()
    goals_suf_l5 = last5["suf_score"].mean()
    
    rank = past_games["rank_suf"].mean()
    rank_l5 = last5["rank_suf"].mean()
    
    if len(last5) > 0:
        points = past_games["total_points"].values[0] - past_games["total_points"].values[-1]  # FIFA points gained over the cycle (latest minus oldest)
        points_l5 = last5["total_points"].values[0] - last5["total_points"].values[-1]  # FIFA points gained over the last 5 games
    else:
        points = 0
        points_l5 = 0
        
    gp = past_games["team_points"].mean()
    gp_l5 = last5["team_points"].mean()
    
    gp_rank = past_games["points_by_rank"].mean()
    gp_rank_l5 = last5["points_by_rank"].mean()
    
    stats_val.append([goals, goals_l5, goals_suf, goals_suf_l5, rank, rank_l5, points, points_l5, gp, gp_l5, gp_rank, gp_rank_l5])
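The row-by-row loop recomputes a past-games slice for every row, which is quadratic in the number of matches. As a sketch (not the article's code), one of the columns — the mean of past goals — can be computed vectorized with a shifted expanding mean, assuming the frame is sorted by date within each team:

```python
import pandas as pd

# Hypothetical per-team match log
ts = pd.DataFrame({
    "team": ["A", "A", "A", "B"],
    "date": pd.to_datetime(["2022-01-01", "2022-02-01", "2022-03-01", "2022-01-15"]),
    "score": [1, 3, 2, 0],
})

# shift(1) excludes the current match; expanding().mean() averages all earlier
# ones -- the same quantity the loop computes as past_games["score"].mean()
ts["goals_mean"] = (ts.sort_values("date")
                      .groupby("team")["score"]
                      .transform(lambda s: s.shift(1).expanding().mean()))
```

A team's first match has no history, so its goals_mean is NaN — exactly the nulls that get dropped later in the article.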
  • Merge the newly derived features back into the original table
  • and store the result in full_df
stats_cols = ["goals_mean", "goals_mean_l5", "goals_suf_mean", "goals_suf_mean_l5", "rank_mean", "rank_mean_l5", "points_mean", "points_mean_l5", "game_points_mean", "game_points_mean_l5", "game_points_rank_mean", "game_points_rank_mean_l5"]

stats_df = pd.DataFrame(stats_val, columns=stats_cols)

full_df = pd.concat([team_stats.reset_index(drop=True), stats_df], axis=1, ignore_index=False)
  • Divide the merged data set into home and away games again
home_team_stats = full_df.iloc[:int(full_df.shape[0]/2),:]
away_team_stats = full_df.iloc[int(full_df.shape[0]/2):,:]
  • Take out the column derived from the feature just now
home_team_stats = home_team_stats[home_team_stats.columns[-12:]]
away_team_stats = away_team_stats[away_team_stats.columns[-12:]]
  • Rename the columns (home_ marks the home team, away_ the away team)
    To combine the two halves into one dataset, each column needs a home/away prefix; after that the data can be merged back together
home_team_stats.columns = ['home_'+str(col) for col in home_team_stats.columns]
away_team_stats.columns = ['away_'+str(col) for col in away_team_stats.columns]
  • data merge
match_stats = pd.concat([home_team_stats, away_team_stats.reset_index(drop=True)], axis=1, ignore_index=False)
full_df = pd.concat([df, match_stats.reset_index(drop=True)], axis=1, ignore_index=False)
full_df.columns

Take a look at the existing feature columns
insert image description here

  • To flag whether a match is a friendly, we write a small helper function
def find_friendly(x):
    return 1 if x == "Friendly" else 0

full_df["is_friendly"] = full_df["tournament"].apply(lambda x: find_friendly(x)) 
  • and one-hot encode it
full_df = pd.get_dummies(full_df, columns=["is_friendly"])
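What get_dummies does to the flag, shown on a tiny made-up frame:

```python
import pandas as pd

# Hypothetical tournament column
toy = pd.DataFrame({"tournament": ["Friendly", "FIFA World Cup"]})
toy["is_friendly"] = (toy["tournament"] == "Friendly").astype(int)

# expands the flag into two indicator columns: is_friendly_0 and is_friendly_1
toy = pd.get_dummies(toy, columns=["is_friendly"])
```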

Perform data analysis on the dataset after feature engineering

  • Here we select only the columns that are useful for the feature analysis
base_df = full_df[["date", "home_team", "away_team", "rank_home", "rank_away","home_score", "away_score","result", "rank_dif", "rank_change_home", "rank_change_away", 'home_goals_mean',
       'home_goals_mean_l5', 'home_goals_suf_mean', 'home_goals_suf_mean_l5',
       'home_rank_mean', 'home_rank_mean_l5', 'home_points_mean',
       'home_points_mean_l5', 'away_goals_mean', 'away_goals_mean_l5',
       'away_goals_suf_mean', 'away_goals_suf_mean_l5', 'away_rank_mean',
       'away_rank_mean_l5', 'away_points_mean', 'away_points_mean_l5','home_game_points_mean', 'home_game_points_mean_l5',
       'home_game_points_rank_mean', 'home_game_points_rank_mean_l5','away_game_points_mean',
       'away_game_points_mean_l5', 'away_game_points_rank_mean',
       'away_game_points_rank_mean_l5',
       'is_friendly_0', 'is_friendly_1']]

base_df.head()

insert image description here

  • Check for missing values
base_df.isna().sum()

insert image description here

  • The rolling means are undefined for a team's first recorded games, which leaves null values, so we remove those samples
base_df_no_fg = base_df.dropna()

Now we need to analyze all the created features and check whether they have predictive power. If some don't, we will create additional features, such as differences between home-team and away-team statistics. To analyze predictive power, a draw is counted as a non-win for the home team, turning the problem into a binary classification.

df = base_df_no_fg
def no_draw(x):
    if x == 2:
        return 1
    else:
        return x
    
df["target"] = df["result"].apply(lambda x: no_draw(x))

Filter features using violin plots and boxplots

  • Next, we use the violin plot and the box plot to analyze whether the features have different distributions according to the target.
  • Use scatterplots to analyze correlations
  • In order to make the image more intuitive, we will extract some features and draw them on the same canvas, and draw another part of the features on the next canvas
data1 = df[list(df.columns[8:20].values) + ["target"]]
data2 = df[df.columns[20:]]
  • Normalize the features (z-score); only the feature columns are scaled, not the target
feats = data1.drop("target", axis=1)
scaled = (feats - feats.mean()) / feats.std()
scaled["target"] = data1["target"]
violin1 = pd.melt(scaled, id_vars="target", var_name="features", value_name="value")

feats = data2.drop("target", axis=1)
scaled = (feats - feats.mean()) / feats.std()
scaled["target"] = data2["target"]
violin2 = pd.melt(scaled, id_vars="target", var_name="features", value_name="value")
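The melt step reshapes the wide scaled frame into the long format seaborn's violinplot expects (one row per match-feature pair); a minimal sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical scaled feature frame with a target column
wide = pd.DataFrame({"rank_dif": [0.5, -1.2],
                     "goals_dif": [1.1, 0.3],
                     "target": [0, 1]})

# one row per (match, feature) pair, usable as x="features", y="value", hue="target"
long = pd.melt(wide, id_vars="target", var_name="features", value_name="value")
```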
  • Draw the feature violin plot in data1
plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin1,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

insert image description here

  • Draw the violin plot for the features in data2
plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin2,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

insert image description here
Looking at these plots, we find that the rank difference is the only good separator of the data. However, we can create some features to get the difference between home and away teams and analyze whether they separate the data well.

  • To better explore home/away differences, we take the difference of each home and away feature mean, standardize the results, and then draw their violin plots
dif = df.copy()
dif.loc[:, "goals_dif"] = dif["home_goals_mean"] - dif["away_goals_mean"]
dif.loc[:, "goals_dif_l5"] = dif["home_goals_mean_l5"] - dif["away_goals_mean_l5"]
dif.loc[:, "goals_suf_dif"] = dif["home_goals_suf_mean"] - dif["away_goals_suf_mean"]
dif.loc[:, "goals_suf_dif_l5"] = dif["home_goals_suf_mean_l5"] - dif["away_goals_suf_mean_l5"]
dif.loc[:, "goals_made_suf_dif"] = dif["home_goals_mean"] - dif["away_goals_suf_mean"]
dif.loc[:, "goals_made_suf_dif_l5"] = dif["home_goals_mean_l5"] - dif["away_goals_suf_mean_l5"]
dif.loc[:, "goals_suf_made_dif"] = dif["home_goals_suf_mean"] - dif["away_goals_mean"]
dif.loc[:, "goals_suf_made_dif_l5"] = dif["home_goals_suf_mean_l5"] - dif["away_goals_mean_l5"]

data_difs = dif.iloc[:, -8:]
scaled = (data_difs - data_difs.mean()) / data_difs.std()
scaled["target"] = data2["target"]
violin = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="target", data=violin,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

insert image description here
As the plot shows, the difference in goals scored is a good separator, as is the difference in goals conceded. But the cross differences (one team's goals scored versus the other's goals conceded) do not separate the classes well.

  • So we keep the following 5 features:
    1. rank_dif
    2. goals_dif
    3. goals_dif_l5
    4. goals_suf_dif
    5. goals_suf_dif_l5
  • Next, we create further features, such as differences in game points earned and differences in the rank of opponents faced
dif.loc[:, "dif_points"] = dif["home_game_points_mean"] - dif["away_game_points_mean"]
dif.loc[:, "dif_points_l5"] = dif["home_game_points_mean_l5"] - dif["away_game_points_mean_l5"]
dif.loc[:, "dif_points_rank"] = dif["home_game_points_rank_mean"] - dif["away_game_points_rank_mean"]
dif.loc[:, "dif_points_rank_l5"] = dif["home_game_points_rank_mean_l5"] - dif["away_game_points_rank_mean_l5"]

dif.loc[:, "dif_rank_agst"] = dif["home_rank_mean"] - dif["away_rank_mean"]
dif.loc[:, "dif_rank_agst_l5"] = dif["home_rank_mean_l5"] - dif["away_rank_mean_l5"]
  • We can also relate goals scored and conceded to the rank of the opponents faced, and examine those differences
dif.loc[:, "goals_per_ranking_dif"] = (dif["home_goals_mean"] / dif["home_rank_mean"]) - (dif["away_goals_mean"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_suf_dif"] = (dif["home_goals_suf_mean"] / dif["home_rank_mean"]) - (dif["away_goals_suf_mean"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_dif_l5"] = (dif["home_goals_mean_l5"] / dif["home_rank_mean"]) - (dif["away_goals_mean_l5"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_suf_dif_l5"] = (dif["home_goals_suf_mean_l5"] / dif["home_rank_mean"]) - (dif["away_goals_suf_mean_l5"] / dif["away_rank_mean"])
  • As usual, normalize the newly constructed features and visualize them with a violin plot
data_difs = dif.iloc[:, -10:]
scaled = (data_difs - data_difs.mean()) / data_difs.std()
scaled["target"] = data2["target"]
violin = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

insert image description here
Because the values are small, the violin plots do not give us good feedback, so for these features we use boxplots instead.

plt.figure(figsize=(15,10))
sns.boxplot(x="features", y="value", hue="target", data=violin)
plt.xticks(rotation=90)
plt.show()

insert image description here

  • From this we can see that the difference in game points (all matches and last 5), the difference in game points by rank faced (all matches and last 5), and the difference in rank faced (all matches and last 5) are good features.
  • Also, some derived features have very similar distributions; for those we use scatterplots.
sns.jointplot(data = data_difs, x = 'goals_per_ranking_dif', y = 'goals_per_ranking_dif_l5', kind="reg")
plt.show()

insert image description here

  • Since goals_per_ranking_dif and goals_per_ranking_dif_l5 are strongly correlated, we keep only the full-cycle version (goals_per_ranking_dif). Next, we compare the rank-faced differences
sns.jointplot(data = data_difs, x = 'dif_rank_agst', y = 'dif_rank_agst_l5', kind="reg")
plt.show()

insert image description here

  • For the game-points features
sns.jointplot(data = data_difs, x = 'dif_points', y = 'dif_points_l5', kind="reg")
plt.show()

insert image description here

  • And for game points relative to the rank faced
sns.jointplot(data = data_difs, x = 'dif_points_rank', y = 'dif_points_rank_l5', kind="reg")
plt.show()

insert image description here
Since the two versions (full cycle vs. last 5 games) are not that similar, we keep both. The final result of our feature screening is:

  • rank_dif
  • goals_dif
  • goals_dif_l5
  • goals_suf_dif
  • goals_suf_dif_l5
  • dif_rank_agst
  • dif_rank_agst_l5
  • goals_per_ranking_dif
  • dif_points_rank
  • dif_points_rank_l5
  • is_friendly
def create_db(df):
    columns = ["home_team", "away_team", "target", "rank_dif", "home_goals_mean", "home_rank_mean", "away_goals_mean", "away_rank_mean", "home_rank_mean_l5", "away_rank_mean_l5", "home_goals_suf_mean", "away_goals_suf_mean", "home_goals_mean_l5", "away_goals_mean_l5", "home_goals_suf_mean_l5", "away_goals_suf_mean_l5", "home_game_points_rank_mean", "home_game_points_rank_mean_l5", "away_game_points_rank_mean", "away_game_points_rank_mean_l5","is_friendly_0", "is_friendly_1"]
    
    base = df.loc[:, columns]
    base.loc[:, "goals_dif"] = base["home_goals_mean"] - base["away_goals_mean"]
    base.loc[:, "goals_dif_l5"] = base["home_goals_mean_l5"] - base["away_goals_mean_l5"]
    base.loc[:, "goals_suf_dif"] = base["home_goals_suf_mean"] - base["away_goals_suf_mean"]
    base.loc[:, "goals_suf_dif_l5"] = base["home_goals_suf_mean_l5"] - base["away_goals_suf_mean_l5"]
    base.loc[:, "goals_per_ranking_dif"] = (base["home_goals_mean"] / base["home_rank_mean"]) - (base["away_goals_mean"] / base["away_rank_mean"])
    base.loc[:, "dif_rank_agst"] = base["home_rank_mean"] - base["away_rank_mean"]
    base.loc[:, "dif_rank_agst_l5"] = base["home_rank_mean_l5"] - base["away_rank_mean_l5"]
    base.loc[:, "dif_points_rank"] = base["home_game_points_rank_mean"] - base["away_game_points_rank_mean"]
    base.loc[:, "dif_points_rank_l5"] = base["home_game_points_rank_mean_l5"] - base["away_game_points_rank_mean_l5"]
    
    model_df = base[["home_team", "away_team", "target", "rank_dif", "goals_dif", "goals_dif_l5", "goals_suf_dif", "goals_suf_dif_l5", "goals_per_ranking_dif", "dif_rank_agst", "dif_rank_agst_l5", "dif_points_rank", "dif_points_rank_l5", "is_friendly_0", "is_friendly_1"]]
    return model_df

model_db = create_db(df)
model_db

insert image description here

Build a predictive model

  • With the steps above we have a dataset with predictive features, so we can start modeling
  • We will build two models (Random Forest and GBDT) and select the one with the better recall as our prediction model
  • First we separate the feature and label columns
X = model_db.iloc[:, 3:]
y = model_db[["target"]]
  • Import RandomForestClassifier and GradientBoostingClassifier from sklearn's ensemble module
  • Import train_test_split and GridSearchCV from sklearn.model_selection
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
  • We split the data into train and test sets at an 8:2 ratio, with random seed 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=1)
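One optional refinement, not used in the original: when the classes are imbalanced, train_test_split's stratify argument keeps the class ratio identical in both splits — a sketch on hypothetical labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and imbalanced labels (8 negatives, 2 positives)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)

# stratify=y_demo guarantees one positive example lands in each half
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.5,
                                      random_state=1, stratify=y_demo)
```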

Build GBDT model

  • First build the GBDT model, and use grid search to fine-tune the model
gb = GradientBoostingClassifier(random_state=5)

params = {
    "learning_rate": [0.01, 0.1, 0.5],
    "min_samples_split": [5, 10],
    "min_samples_leaf": [3, 5],
    "max_depth": [3, 5, 10],
    "max_features": ["sqrt"],
    "n_estimators": [100, 200],
}

gb_cv = GridSearchCV(gb, params, cv = 3, n_jobs = -1, verbose = False)

gb_cv.fit(X_train.values, np.ravel(y_train))

insert image description here

  • Look at the parameter configuration of GBDT
gb = gb_cv.best_estimator_
gb

insert image description here

Create an RFC model

  • Next, build the RFC model and fine-tune the model with grid search
params_rf = {
    "max_depth": [20],
    "min_samples_split": [10],
    "max_leaf_nodes": [175],
    "min_samples_leaf": [5],
    "n_estimators": [250],
    "max_features": ["sqrt"],
}

rf = RandomForestClassifier(random_state=1)

rf_cv = GridSearchCV(rf, params_rf, cv = 3, n_jobs = -1, verbose = False)

rf_cv.fit(X_train.values, np.ravel(y_train))

insert image description here

rf = rf_cv.best_estimator_

Model comparison

Here, we compare the models using the confusion matrix and the ROC curve. The metric functions come from sklearn.metrics.

from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

def analyze(model):
    fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test.values)[:,1]) #test AUC
    plt.figure(figsize=(15,10))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr, label="test")

    fpr_train, tpr_train, _ = roc_curve(y_train, model.predict_proba(X_train.values)[:,1]) #train AUC
    plt.plot(fpr_train, tpr_train, label="train")
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test.values)[:,1])
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train.values)[:,1])
    plt.legend()
    plt.title('AUC score is %.2f on test and %.2f on training'%(auc_test, auc_train))
    plt.show()
    
    plt.figure(figsize=(15, 10))
    cm = confusion_matrix(y_test, model.predict(X_test.values))
    sns.heatmap(cm, annot=True, fmt="d")
  • GBDT
analyze(gb)

insert image description here
insert image description here

analyze(rf)

insert image description here
insert image description here
The analysis suggests the random forest model is slightly better on the test set, but it generalizes less well (a larger gap between train and test). We will therefore use the GBDT model.
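For reference, the recall metric used to compare the models can be computed directly; a sketch on hypothetical labels, not the article's actual results:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 3 actual positives, of which the model recovers 2
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

rec = recall_score(y_true, y_pred)  # fraction of actual positives found
```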

world cup simulation

Data preparation and preprocessing

  • The first step is to build the FIFA World Cup fixtures
  • To do this we need the teams and the group-stage matches from Wikipedia
  • pd.read_html makes it easy to pull every table from the page
from operator import itemgetter
dfs = pd.read_html(r"https://en.wikipedia.org/wiki/2022_FIFA_World_Cup#Teams")
  • Preprocess the scraped tables
from collections.abc import Iterable

for i in range(len(dfs)):
    df = dfs[i]
    cols = list(df.columns.values)
    
    if isinstance(cols[0], Iterable):
        if any("Tie-breaking criteria" in c for c in cols):
            start_pos = i+1

        if any("Match 46" in c for c in cols):
            end_pos = i+1

matches = []
groups = ["A", "B", "C", "D", "E", "F", "G", "H"]
group_count = 0 

table = {}
table[groups[group_count]] = [[a.split(" ")[0], 0, []] for a in list(dfs[start_pos].iloc[:, 1].values)]

for i in range(start_pos+1, end_pos, 1):
    if len(dfs[i].columns) == 3:
        team_1 = dfs[i].columns.values[0]
        team_2 = dfs[i].columns.values[-1]
        
        matches.append((groups[group_count], team_1, team_2))
    else:
        group_count+=1
        table[groups[group_count]] = [[a, 0, []] for a in list(dfs[i].iloc[:, 1].values)]

table

insert image description here
Above, we store each team's group-stage points together with its win probability for each game. When two teams finish level on points, the mean of their win probabilities is used as the tie-breaker.
Next, each team's input features come from its previous games. For example, for Brazil vs. Serbia, Brazil's features are computed from its past matches, and the same goes for Serbia.

def find_stats(team_1):
    # e.g. team_1 = "Qatar"
    past_games = team_stats_raw[(team_stats_raw["team"] == team_1)].sort_values("date")
    last5 = team_stats_raw[(team_stats_raw["team"] == team_1)].sort_values("date").tail(5)

    team_1_rank = past_games["rank"].values[-1]
    team_1_goals = past_games.score.mean()
    team_1_goals_l5 = last5.score.mean()
    team_1_goals_suf = past_games.suf_score.mean()
    team_1_goals_suf_l5 = last5.suf_score.mean()
    team_1_rank_suf = past_games.rank_suf.mean()
    team_1_rank_suf_l5 = last5.rank_suf.mean()
    team_1_gp_rank = past_games.points_by_rank.mean()
    team_1_gp_rank_l5 = last5.points_by_rank.mean()

    return [team_1_rank, team_1_goals, team_1_goals_l5, team_1_goals_suf, team_1_goals_suf_l5, team_1_rank_suf, team_1_rank_suf_l5, team_1_gp_rank, team_1_gp_rank_l5]

def find_features(team_1, team_2):
    rank_dif = team_1[0] - team_2[0]
    goals_dif = team_1[1] - team_2[1]
    goals_dif_l5 = team_1[2] - team_2[2]
    goals_suf_dif = team_1[3] - team_2[3]
    goals_suf_dif_l5 = team_1[4] - team_2[4]
    goals_per_ranking_dif = (team_1[1]/team_1[5]) - (team_2[1]/team_2[5])
    dif_rank_agst = team_1[5] - team_2[5]
    dif_rank_agst_l5 = team_1[6] - team_2[6]
    dif_gp_rank = team_1[7] - team_2[7]
    dif_gp_rank_l5 = team_1[8] - team_2[8]
    
    return [rank_dif, goals_dif, goals_dif_l5, goals_suf_dif, goals_suf_dif_l5, goals_per_ranking_dif, dif_rank_agst, dif_rank_agst_l5, dif_gp_rank, dif_gp_rank_l5, 1, 0]

Starting the simulation

Now we can start simulating the World Cup.
Since the model is a binary classifier, it only predicts whether team 1 wins, so we need a criterion for declaring a draw. Also, since there is no home advantage at the World Cup, each match is predicted twice with team 1 and team 2 swapped, and the team with the higher average probability is the winner. In the group stage, if a team wins when listed as team 1 but loses when listed as team 2 (or vice versa), the match is counted as a draw.
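This draw criterion can be isolated as a small helper function (a sketch with illustrative probabilities; the loop below inlines the same logic against the `gb.predict_proba` outputs):

```python
def decide(probs_g1, probs_g2):
    """Combine two predictions of one match made with the team order swapped.

    probs_g1: [P(team 1 wins), P(team 1 loses)] with team 1 listed first.
    probs_g2: the same match predicted with team 2 listed first.
    """
    team_1_g1, team_2_g1 = probs_g1  # game 1: team 1 listed first
    team_2_g2, team_1_g2 = probs_g2  # game 2: roles reversed

    # If the two simulated games disagree on the winner, call it a draw
    if (team_1_g1 > team_2_g1) == (team_2_g2 > team_1_g2):
        return "draw", None

    team_1_prob = (team_1_g1 + team_1_g2) / 2
    team_2_prob = (team_2_g1 + team_2_g2) / 2
    if team_1_prob > team_2_prob:
        return "team 1", team_1_prob
    return "team 2", team_2_prob

print(decide([0.6, 0.4], [0.3, 0.7]))  # both games favour team 1
print(decide([0.6, 0.4], [0.7, 0.3]))  # the games disagree, so a draw
```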

advanced_group = []
last_group = ""

for k in table.keys():
    for t in table[k]:
        t[1] = 0
        t[2] = []
        
for teams in matches:
    draw = False
    team_1 = find_stats(teams[1])
    team_2 = find_stats(teams[2])

    

    features_g1 = find_features(team_1, team_2)
    features_g2 = find_features(team_2, team_1)

    probs_g1 = gb.predict_proba([features_g1])
    probs_g2 = gb.predict_proba([features_g2])
    
    team_1_prob_g1 = probs_g1[0][0]  # P(team 1 wins) with team 1 listed first
    team_1_prob_g2 = probs_g2[0][1]  # P(team 1 wins) with team 2 listed first
    team_2_prob_g1 = probs_g1[0][1]
    team_2_prob_g2 = probs_g2[0][0]

    team_1_prob = (team_1_prob_g1 + team_1_prob_g2) / 2
    team_2_prob = (team_2_prob_g1 + team_2_prob_g2) / 2
    
    if ((team_1_prob_g1 > team_2_prob_g1) & (team_2_prob_g2 > team_1_prob_g2)) | ((team_1_prob_g1 < team_2_prob_g1) & (team_2_prob_g2 < team_1_prob_g2)):
        draw=True
        for i in table[teams[0]]:
            if i[0] == teams[1] or i[0] == teams[2]:
                i[1] += 1
                
    elif team_1_prob > team_2_prob:
        winner = teams[1]
        winner_proba = team_1_prob
        for i in table[teams[0]]:
            if i[0] == teams[1]:
                i[1] += 3
                
    elif team_2_prob > team_1_prob:  
        winner = teams[2]
        winner_proba = team_2_prob
        for i in table[teams[0]]:
            if i[0] == teams[2]:
                i[1] += 3
    
    for i in table[teams[0]]:  # tiebreaker criterion: store each game's win probability
        if i[0] == teams[1]:
            i[2].append(team_1_prob)
        if i[0] == teams[2]:
            i[2].append(team_2_prob)

    if last_group != teams[0]:
        if last_group != "":
            print("\n")
            print("Group %s:"%(last_group))
            
            for i in table[last_group]: # tiebreaker: replace the list of probs with their mean
                i[2] = np.mean(i[2])
            
            final_points = table[last_group]
            final_table = sorted(final_points, key=itemgetter(1, 2), reverse = True)
            advanced_group.append([final_table[0][0], final_table[1][0]])
            for i in final_table:
                print("%s -------- %d"%(i[0], i[1]))
        print("\n")
        print("-"*10+"  Analyzing Group %s  "%(teams[0])+"-"*10)
        
        
    if draw == False:
        print(" Group %s - %s vs. %s: %s wins with probability %.2f"%(teams[0], teams[1], teams[2], winner, winner_proba))
    else:
        print(" Group %s - %s vs. %s: draw"%(teams[0], teams[1], teams[2]))
    last_group = teams[0]

print("\n")
print("Group %s:"%(last_group))

for i in table[last_group]: # tiebreaker: replace the list of probs with their mean
    i[2] = np.mean(i[2])
            
final_points = table[last_group]
final_table = sorted(final_points, key=itemgetter(1, 2), reverse = True)
advanced_group.append([final_table[0][0], final_table[1][0]])
for i in final_table:
    print("%s -------- %d"%(i[0], i[1]))
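The tiebreak in the `sorted` call above is easier to see in isolation: teams are ordered by points first, and ties are broken by the mean per-game win probability. A sketch with invented standings:

```python
from operator import itemgetter

# Hypothetical final group standings: [team, points, mean win probability]
final_points = [
    ["Denmark", 7, 0.55],
    ["Australia", 1, 0.40],
    ["France", 7, 0.61],
    ["Tunisia", 1, 0.42],
]

# Sort by points, breaking ties with the mean per-game win probability
final_table = sorted(final_points, key=itemgetter(1, 2), reverse=True)
print([t[0] for t in final_table])  # ['France', 'Denmark', 'Tunisia', 'Australia']
```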

Group stage predictions

----------  Analyzing Group A  ----------

 Group A - Qatar vs. Ecuador: Ecuador wins with probability 0.60

 Group A - Senegal vs. Netherlands: Netherlands wins with probability 0.59

 Group A - Qatar vs. Senegal: Senegal wins with probability 0.58

 Group A - Netherlands vs. Ecuador: Netherlands wins with probability 0.66

 Group A - Ecuador vs. Senegal: Ecuador wins with probability 0.53

 Group A - Netherlands vs. Qatar: Netherlands wins with probability 0.69

Group A:

Netherlands -------- 9

Ecuador -------- 6

Senegal -------- 3

Qatar -------- 0

----------  Analyzing Group B  ----------

 Group B - England vs. Iran: England wins with probability 0.60

 Group B - USA vs. Wales: draw

 Group B - Wales vs. Iran: Wales wins with probability 0.54

 Group B - England vs. USA: England wins with probability 0.58

 Group B - Wales vs. England: England wins with probability 0.60

 Group B - Iran vs. USA: USA wins with probability 0.57

Group B:

England -------- 9

United States -------- 4

Wales -------- 4

Iran -------- 0

----------  Analyzing Group C  ----------

 Group C - Argentina vs. Saudi Arabia: Argentina wins with probability 0.70

 Group C - Mexico vs. Poland: draw

 Group C - Poland vs. Saudi Arabia: Poland wins with probability 0.64

 Group C - Argentina vs. Mexico: Argentina wins with probability 0.62

 Group C - Poland vs. Argentina: Argentina wins with probability 0.64

 Group C - Saudi Arabia vs. Mexico: Mexico wins with probability 0.64

Group C:

Argentina -------- 9

Poland -------- 4

Mexico -------- 4

Saudi Arabia -------- 0

----------  Analyzing Group D  ----------

 Group D - Denmark vs. Tunisia: Denmark wins with probability 0.63

 Group D - France vs. Australia: France wins with probability 0.65

 Group D - Tunisia vs. Australia: draw

 Group D - France vs. Denmark: draw

 Group D - Australia vs. Denmark: Denmark wins with probability 0.65

 Group D - Tunisia vs. France: France wins with probability 0.63

Group D:

France -------- 7

Denmark -------- 7

Tunisia -------- 1

Australia -------- 1

----------  Analyzing Group E  ----------

 Group E - Germany vs. Japan: Germany wins with probability 0.59

 Group E - Spain vs. Costa Rica: Spain wins with probability 0.68

 Group E - Japan vs. Costa Rica: draw

 Group E - Spain vs. Germany: draw

 Group E - Japan vs. Spain: Spain wins with probability 0.62

 Group E - Costa Rica vs. Germany: Germany wins with probability 0.60

Group E:

Spain -------- 7

Germany -------- 7

Japan -------- 1

Costa Rica -------- 1

----------  Analyzing Group F  ----------

 Group F - Morocco vs. Croatia: Croatia wins with probability 0.58

 Group F - Belgium vs. Canada: Belgium wins with probability 0.67

 Group F - Belgium vs. Morocco: Belgium wins with probability 0.63

 Group F - Croatia vs. Canada: Croatia wins with probability 0.62

 Group F - Croatia vs. Belgium: Belgium wins with probability 0.60

 Group F - Canada vs. Morocco: draw

Group F:

Belgium -------- 9

Croatia -------- 6

Morocco -------- 1

Canada -------- 1

----------  Analyzing Group G  ----------

 Group G - Switzerland vs. Cameroon: Switzerland wins with probability 0.62

 Group G - Brazil vs. Serbia: Brazil wins with probability 0.63

 Group G - Cameroon vs. Serbia: Serbia wins with probability 0.61

 Group G - Brazil vs. Switzerland: draw

 Group G - Serbia vs. Switzerland: Switzerland wins with probability 0.56

 Group G - Cameroon vs. Brazil: Brazil wins with probability 0.71

Group G:

Brazil -------- 7

Switzerland -------- 7

Serbia -------- 3

Cameroon -------- 0

----------  Analyzing Group H  ----------

 Group H - Uruguay vs. South Korea: Uruguay wins with probability 0.60

 Group H - Portugal vs. Ghana: Portugal wins with probability 0.71

 Group H - South Korea vs. Ghana: South Korea wins with probability 0.69

 Group H - Portugal vs. Uruguay: draw

 Group H - Ghana vs. Uruguay: Uruguay wins with probability 0.69

 Group H - South Korea vs. Portugal: Portugal wins with probability 0.63

Group H:

Portugal -------- 7

Uruguay -------- 7

South Korea -------- 3

Ghana -------- 0

Knockout stage predictions

The group stage predictions hold few surprises; the closest calls are the predicted draws between Brazil and Switzerland and between France and Denmark. For the knockout stage, we will display the results as a tree diagram.

advanced = advanced_group

playoffs = {"Round of 16": [], "Quarter-finals": [], "Semi-finals": [], "Final": []}

actual_round = ""
next_rounds = []

for p in playoffs.keys():
    if p == "Round of 16":
        control = []
        for a in range(0, len(advanced*2), 1):
            if a < len(advanced):
                if a % 2 == 0:
                    control.append((advanced*2)[a][0])
                else:
                    control.append((advanced*2)[a][1])
            else:
                if a % 2 == 0:
                    control.append((advanced*2)[a][1])
                else:
                    control.append((advanced*2)[a][0])

        playoffs[p] = [[control[c], control[c+1]] for c in range(0, len(control)-1, 1) if c%2 == 0]
        
        for i in range(0, len(playoffs[p]), 1):
            game = playoffs[p][i]
            
            home = game[0]
            away = game[1]
            team_1 = find_stats(home)
            team_2 = find_stats(away)

            features_g1 = find_features(team_1, team_2)
            features_g2 = find_features(team_2, team_1)
            
            probs_g1 = gb.predict_proba([features_g1])
            probs_g2 = gb.predict_proba([features_g2])
            
            team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
            team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2
            
            if actual_round != p:
                print("-"*10)
                print("Simulating %s"%(p))
                print("-"*10)
                print("\n")
            
            if team_1_prob < team_2_prob:
                print("%s vs. %s: %s advances with probability %.2f"%(home, away, away, team_2_prob))
                next_rounds.append(away)
            else:
                print("%s vs. %s: %s advances with probability %.2f"%(home, away, home, team_1_prob))
                next_rounds.append(home)
            
            game.append([team_1_prob, team_2_prob])
            playoffs[p][i] = game
            actual_round = p
        
    else:
        playoffs[p] = [[next_rounds[c], next_rounds[c+1]] for c in range(0, len(next_rounds)-1, 1) if c%2 == 0]
        next_rounds = []
        for i in range(0, len(playoffs[p])):
            game = playoffs[p][i]
            home = game[0]
            away = game[1]
            team_1 = find_stats(home)
            team_2 = find_stats(away)
            
            features_g1 = find_features(team_1, team_2)
            features_g2 = find_features(team_2, team_1)
            
            probs_g1 = gb.predict_proba([features_g1])
            probs_g2 = gb.predict_proba([features_g2])
            
            team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
            team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2
            
            if actual_round != p:
                print("-"*10)
                print("Simulating %s"%(p))
                print("-"*10)
                print("\n")
            
            if team_1_prob < team_2_prob:
                print("%s vs. %s: %s advances with probability %.2f"%(home, away, away, team_2_prob))
                next_rounds.append(away)
            else:
                print("%s vs. %s: %s advances with probability %.2f"%(home, away, home, team_1_prob))
                next_rounds.append(home)
            game.append([team_1_prob, team_2_prob])
            playoffs[p][i] = game
            actual_round = p
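The `control` construction above decides the round-of-16 bracket. With placeholder teams it is easier to see which slots meet (a sketch reproducing the same pairing logic, with the pair step written as `range(..., 2)`):

```python
# Placeholder group results: "A1" = winner of Group A, "A2" = runner-up, etc.
advanced = [[g + "1", g + "2"] for g in "ABCDEFGH"]

doubled = advanced * 2
control = []
for a in range(len(doubled)):
    if a < len(advanced):
        # First half: group winners at even slots, runners-up at odd slots
        control.append(doubled[a][0] if a % 2 == 0 else doubled[a][1])
    else:
        # Second half: the same groups with winner/runner-up roles swapped
        control.append(doubled[a][1] if a % 2 == 0 else doubled[a][0])

# Adjacent slots play each other
pairs = [[control[c], control[c + 1]] for c in range(0, len(control) - 1, 2)]
print(pairs[0], pairs[4])  # ['A1', 'B2'] ['A2', 'B1']
```

Each group winner meets the runner-up of the neighbouring group, so group-mates can only meet again in the final.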

----------
Simulating Round of 16
----------

Netherlands vs. USA: Netherlands advances with probability 0.55

Argentina vs. Denmark: Argentina advances with probability 0.59

Spain vs. Croatia: Spain advances with probability 0.57

Brazil vs. Uruguay: Brazil advances with probability 0.60

Ecuador vs. England: England advances with probability 0.65

Poland vs. France: France advances with probability 0.60

Germany vs. Belgium: Belgium advances with probability 0.50

Switzerland vs. Portugal: Portugal advances with probability 0.52

----------
Simulating Quarter-finals
----------

Netherlands vs. Argentina: Netherlands advances with probability 0.52

Spain vs. Brazil: Brazil advances with probability 0.51

England vs. France: France advances with probability 0.51

Belgium vs. Portugal: Portugal advances with probability 0.52

----------
Simulating Semi-finals
----------

Netherlands vs. Brazil: Brazil advances with probability 0.52

France vs. Portugal: Portugal advances with probability 0.52

----------
Simulating Final
----------

Brazil vs. Portugal: Brazil advances with probability 0.52

  • Draw the bracket as a tree diagram:
!pip install pydot pydot-ng graphviz
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
from networkx.drawing.nx_pydot import graphviz_layout

plt.figure(figsize=(15, 10))
G = nx.balanced_tree(2, 3)


labels = []

for p in playoffs.keys():
    for game in playoffs[p]:
        # game[2] holds [team_1_prob, team_2_prob] appended during the simulation
        label = f"{game[0]}({round(game[2][0], 2)}) \n {game[1]}({round(game[2][1], 2)})"
        labels.append(label)

labels_dict = {}
labels_rev = list(reversed(labels))

for l in range(len(list(G.nodes))):
    labels_dict[l] = labels_rev[l]

pos = graphviz_layout(G, prog='twopi')
labels_pos = {n: (k[0], k[1] - 0.08*k[1]) for n, k in pos.items()}
center = pd.DataFrame(pos).mean(axis=1).mean()

nx.draw(G, pos = pos, with_labels=False, node_color=range(15), edge_color="#bbf5bb", width=10, font_weight='bold',cmap=plt.cm.Greens, node_size=5000)
nx.draw_networkx_labels(G, pos = labels_pos, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="black", lw=.5, alpha=1),
                        labels=labels_dict)
texts = ["Round \nof 16", "Quarter \n Final", "Semi \n Final", "Final\n"]
pos_y = pos[0][1] + 55
for text in reversed(texts):
    pos_x = center
    pos_y -= 75 
    plt.text(pos_y, pos_x, text, fontsize = 18)

plt.axis('equal')
plt.show()

(Figure: tree diagram of the simulated knockout bracket)

Summary

  • The predictions can change at any time, since the database is updated as the tournament progresses. To get the latest results, simply feed in the latest data.
  • The predictions are not guaranteed to be accurate; the main purpose of this project is learning, in particular data preprocessing and feature engineering techniques.
  • The modeling in this project is fairly rough: there is only a single gradient boosting tree tuned by grid-search cross-validation, so the results are not precise. Of course, I also hope Brazil wins in the end.
  • As of noon on 2022-11-25, 16 World Cup matches had been played, and 11 of them were predicted correctly. The upsets involving Argentina and Japan were missed, as were the draws between Uruguay and South Korea, Morocco and Croatia, and Denmark and Tunisia.
  • Please do not use these predictions for any kind of betting; they are for learning and reference only.

Optimization directions

  • Try adding more features with predictive value, such as the impact of COVID-19 or players' recent form.
  • More time could be spent on feature engineering.
  • Finally, the model itself is fairly rough; a stronger machine learning model could be tried.

Code download

Baseline:
https://www.kaggle.com/code/sslp23/predicting-fifa-2022-world-cup-with-ml/notebook

A version that is being optimized and updated continuously:
https://aistudio.baidu.com/aistudio/projectdetail/5116425?contributionType=1&sUid=2553954&shared=1&ts=1669358827040

Origin: blog.csdn.net/weixin_62338855/article/details/128023854