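Way Off! Predicting the 2022 World Cup with Machine Learning: the Group Stage Was Fairly Accurate, but the Champion, Runner-up, and Third Place Were All Wrong ⛵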

Published by ShowMeAI on 2022-12-20

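Author: 韓信子@ShowMeAI
Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
This article: https://www.showmeai.tech/article-detail/400
Notice: All rights reserved. To repost, contact the platform and the author and credit the source.
Bookmark ShowMeAI for more content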

Post-tournament note from the author


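The FIFA World Cup 2022 is over, and the debate about which team would take the title has a definite answer. Congratulations to Messi! Congratulations to Argentina! Before the tournament, ShowMeAI used data science and machine learning to build a model on historical data and predict the results of the 2022 World Cup. Now that the dust has settled, let's see just how far the model's predictions were from what actually happened.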

ShowMeAI visualized the model's predictions and compared them with the results summary published on the official website.


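The comparison shows that the model's accuracy was fairly high from the group stage through the quarter-finals. From the semi-finals onward, however, it collapsed: accuracy dropped to zero both for the teams that qualified and for the match outcomes, and the model got none of the champion, runner-up, or third place right.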


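But that is exactly the charm of football. The uncertainty inherent in competitive sport is what lets us feel more deeply what struggle, courage, heroes, and dreams mean. (What follows is the complete pre-tournament modeling process.)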

Data sources


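First we prepare the data for modeling: we need data that reflects each team's performance. We use two FIFA-related datasets: international match results from 1872 to 2022, and FIFA ranking data. Both are available directly on Kaggle, and also from ShowMeAI's Baidu Netdisk.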

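Dataset download (Baidu Netdisk): reply "實戰" to the official account "ShowMeAI研究中心" to get the dataset for this article, "[35] Machine-learning-based 2022 World Cup prediction: FIFA 2022 dataset".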

ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub

Building the dataset

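Which features affect the outcome of a football match? This open question spans many dimensions, from the players selected to the temperature on the pitch that day. We keep it simple and build the dataset only from the past statistics of each team in a match, prioritizing quantifiable statistics that are easy to collect, such as goals scored, average ranking, and points won. All of these can be derived from the two datasets mentioned above.

In addition, we only analyze data from 2018 onward (the code keeps matches played on or after 2018-08-01), so that we focus on how teams changed during the years leading up to this World Cup. The code that builds the data is as follows: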

import pandas as pd

df = pd.read_csv("results.csv")  # matches between national teams
df["date"] = pd.to_datetime(df["date"])
df = df[(df["date"] >= "2018-8-1")].reset_index(drop=True)  # matches in the 2022 World Cup cycle
df_wc = df  # pre-World Cup results

rank = pd.read_csv("fifa_ranking-2022-10-06.csv")  # FIFA rankings
rank["rank_date"] = pd.to_datetime(rank["rank_date"])
rank = rank[(rank["rank_date"] >= "2018-8-1")].reset_index(drop=True)  # rankings from the 2022 World Cup cycle
rank["country_full"] = rank["country_full"].str.replace("IR Iran", "Iran").str.replace("Korea Republic", "South Korea").str.replace("USA", "United States")  # align some team names with results.csv
# resample each team's rankings to daily frequency and forward-fill, so every match date has a ranking row
rank = rank.set_index(['rank_date']).groupby(['country_full'], group_keys=False).resample('D').first().ffill().reset_index()
rank_wc = rank  # dataframe with rankings

# merge the rankings onto each match: first for the home team, then for the away team
df_wc_ranked = df_wc.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "home_team"], right_on=["rank_date", "country_full"]).drop(["rank_date", "country_full"], axis=1)
df_wc_ranked = df_wc_ranked.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "away_team"], right_on=["rank_date", "country_full"], suffixes=("_home", "_away")).drop(["rank_date", "country_full"], axis=1)
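
The daily resample plus forward-fill above exists so that every match date can be joined to the most recent ranking published on or before it. The same idea can be sketched without pandas as an "as-of" lookup. This is only an illustration: `rank_on` and the `snapshots` values are hypothetical, and unlike the resampled table it also answers for dates after the last snapshot.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical ranking snapshots for one team: (publication date, rank).
snapshots = [
    (date(2022, 6, 23), 3),
    (date(2022, 8, 25), 3),
    (date(2022, 10, 6), 4),
]

def rank_on(match_date, snapshots):
    """Return the most recent rank published on or before match_date."""
    dates = [d for d, _ in snapshots]
    i = bisect_right(dates, match_date)
    if i == 0:
        # No ranking published yet. In the article's code such matches
        # find no partner row and are dropped by the default inner merge.
        return None
    return snapshots[i - 1][1]

# A match on 2022-09-27 uses the 2022-08-25 snapshot.
print(rank_on(date(2022, 9, 27), snapshots))
```
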

The result is one row per match with both teams' ranking columns attached (suffixed _home and _away).


Feature engineering

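If you are interested in the details of feature engineering, ShowMeAI's in-depth article covers both the theory and hands-on methods:

Machine Learning in Practice | A Complete Guide to Feature Engineering for Machine Learning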

With the data ready, we can do feature engineering: building features with predictive power from the raw data. We use the following features:

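  • Mean goals scored over the World Cup cycle and over the last 5 matches.
  • Mean goals conceded over the World Cup cycle and over the last 5 matches.
  • Difference in FIFA rank between the two teams.
  • Mean FIFA rank of the opponents each team faced over the World Cup cycle and over the last 5 matches.
  • Change in FIFA ranking points from the first match of the cycle to now.
  • Change in FIFA ranking points between 5 matches ago and now.
  • Mean match points (3 for a win, 1 for a draw, 0 for a loss) over the World Cup cycle and over the last 5 matches.
  • Mean match points weighted by the opponent's ranking position, over the World Cup cycle and over the last 5 matches.
  • A categorical variable indicating whether the match is a friendly.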

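We chose these features for the following reasons:

  • The first two features quantify a team's attacking and defensive strength;
  • The difference in FIFA ranking position quantifies the gap in strength between the two teams as calculated by FIFA;
  • The mean opponent rank captures the strength of the opposition a team has faced;
  • The change in FIFA ranking points captures how a team's ability has changed over the World Cup cycle and over the last 5 matches;
  • Mean match points per game quantify a team's results, and the rank-weighted version weights them by the ranking position of the opponents faced, which gives a more precise picture of performance.
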
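
To make the rank weighting concrete, here is a minimal sketch of how match points divided by the opponent's rank reward results against stronger teams. The helper names `match_points` and `points_by_rank` and the example ranks are assumptions for illustration; the article's own code computes the same quantity as `home_team_points / rank_away`.

```python
def match_points(goals_for, goals_against):
    """3 points for a win, 1 for a draw, 0 for a loss."""
    if goals_for > goals_against:
        return 3
    if goals_for == goals_against:
        return 1
    return 0

def points_by_rank(goals_for, goals_against, opponent_rank):
    """Match points divided by the opponent's FIFA rank (1 = best)."""
    return match_points(goals_for, goals_against) / opponent_rank

# The same 2-1 win is worth far more against a highly ranked opponent.
strong = points_by_rank(2, 1, opponent_rank=2)   # win over the 2nd-ranked team
weak = points_by_rank(2, 1, opponent_rank=60)    # win over the 60th-ranked team
print(strong, weak)
```

Dividing by rank is a simple design choice: it needs no extra data, but it compresses differences among low-ranked opponents much more than among top teams.
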
df = df_wc_ranked

def result_finder(home, away):
    if home > away:
        return pd.Series([0, 3, 0])
    if home < away:
        return pd.Series([1, 0, 3])
    else:
        return pd.Series([2, 1, 1])

results = df.apply(lambda x: result_finder(x["home_score"], x["away_score"]), axis=1)

df[["result", "home_team_points", "away_team_points"]] = results

df["rank_dif"] = df["rank_home"] - df["rank_away"]
df["sg"] = df["home_score"] - df["away_score"]
df["points_home_by_rank"] = df["home_team_points"]/df["rank_away"]
df["points_away_by_rank"] = df["away_team_points"]/df["rank_home"]

home_team = df[["date", "home_team", "home_score", "away_score", "rank_home", "rank_away","rank_change_home", "total_points_home", "result", "rank_dif", "points_home_by_rank", "home_team_points"]]

away_team = df[["date", "away_team", "away_score", "home_score", "rank_away", "rank_home","rank_change_away", "total_points_away", "result", "rank_dif", "points_away_by_rank", "away_team_points"]]

home_team.columns = [h.replace("home_", "").replace("_home", "").replace("away_", "suf_").replace("_away", "_suf") for h in home_team.columns]

away_team.columns = [a.replace("away_", "").replace("_away", "").replace("home_", "suf_").replace("_home", "_suf") for a in away_team.columns]

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same stacked frame
team_stats = pd.concat([home_team, away_team])

team_stats_raw = team_stats.copy()
stats_val = []

for index, row in team_stats.iterrows():
    team = row["team"]
    date = row["date"]
    past_games = team_stats.loc[(team_stats["team"] == team) & (team_stats["date"] < date)].sort_values(by=['date'], ascending=False)
    last5 = past_games.head(5)
    
    goals = past_games["score"].mean()
    goals_l5 = last5["score"].mean()
    
    goals_suf = past_games["suf_score"].mean()
    goals_suf_l5 = last5["suf_score"].mean()
    
    rank = past_games["rank_suf"].mean()
    rank_l5 = last5["rank_suf"].mean()
    
    if len(last5) > 0:
        points = past_games["total_points"].values[0] - past_games["total_points"].values[-1]  # ranking points gained over the cycle
        points_l5 = last5["total_points"].values[0] - last5["total_points"].values[-1] 
    else:
        points = 0
        points_l5 = 0
        
    gp = past_games["team_points"].mean()
    gp_l5 = last5["team_points"].mean()
    
    gp_rank = past_games["points_by_rank"].mean()
    gp_rank_l5 = last5["points_by_rank"].mean()
    
    stats_val.append([goals, goals_l5, goals_suf, goals_suf_l5, rank, rank_l5, points, points_l5, gp, gp_l5, gp_rank, gp_rank_l5])

stats_cols = ["goals_mean", "goals_mean_l5", "goals_suf_mean", "goals_suf_mean_l5", "rank_mean", "rank_mean_l5", "points_mean", "points_mean_l5", "game_points_mean", "game_points_mean_l5", "game_points_rank_mean", "game_points_rank_mean_l5"]

stats_df = pd.DataFrame(stats_val, columns=stats_cols)

full_df = pd.concat([team_stats.reset_index(drop=True), stats_df], axis=1, ignore_index=False)

home_team_stats = full_df.iloc[:int(full_df.shape[0]/2),:]
away_team_stats = full_df.iloc[int(full_df.shape[0]/2):,:]

home_team_stats = home_team_stats[home_team_stats.columns[-12:]]
away_team_stats = away_team_stats[away_team_stats.columns[-12:]]

home_team_stats.columns = ['home_'+str(col) for col in home_team_stats.columns]
away_team_stats.columns = ['away_'+str(col) for col in away_team_stats.columns]

match_stats = pd.concat([home_team_stats, away_team_stats.reset_index(drop=True)], axis=1, ignore_index=False)

full_df = pd.concat([df, match_stats.reset_index(drop=True)], axis=1, ignore_index=False)

def find_friendly(x):
    if x == "Friendly":
        return 1
    else: return 0

full_df["is_friendly"] = full_df["tournament"].apply(lambda x: find_friendly(x)) 

full_df = pd.get_dummies(full_df, columns=["is_friendly"])

base_df = full_df[["date", "home_team", "away_team", "rank_home", "rank_away","home_score", "away_score","result", "rank_dif", "rank_change_home", "rank_change_away", 'home_goals_mean',
       'home_goals_mean_l5', 'home_goals_suf_mean', 'home_goals_suf_mean_l5',
       'home_rank_mean', 'home_rank_mean_l5', 'home_points_mean',
       'home_points_mean_l5', 'away_goals_mean', 'away_goals_mean_l5',
       'away_goals_suf_mean', 'away_goals_suf_mean_l5', 'away_rank_mean',
       'away_rank_mean_l5', 'away_points_mean', 'away_points_mean_l5','home_game_points_mean', 'home_game_points_mean_l5',
       'home_game_points_rank_mean', 'home_game_points_rank_mean_l5','away_game_points_mean',
       'away_game_points_mean_l5', 'away_game_points_rank_mean',
       'away_game_points_rank_mean_l5',
       'is_friendly_0', 'is_friendly_1']]

base_df.tail()
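
The loop above computes, for every match, statistics over the team's strictly earlier matches (`team_stats["date"] < date`), once over the whole cycle and once over the last 5. A dependency-free sketch of that windowing, using hypothetical goal counts and a hypothetical helper `prior_means`, might look like this:

```python
from statistics import mean

# Hypothetical goals scored by one team, oldest match first.
goals = [1, 0, 3, 2, 2, 4, 1]

def prior_means(values, window=5):
    """For each match, average the values of strictly earlier matches.

    Returns (cycle_mean, last_n_mean) pairs. Both are None for the first
    match because no history exists yet.
    """
    out = []
    for i in range(len(values)):
        past = values[:i]  # excludes match i itself, so the result cannot leak
        if not past:
            out.append((None, None))
            continue
        out.append((mean(past), mean(past[-window:])))
    return out

features = prior_means(goals)
print(features[-1])  # history of the final match: all six earlier games, and the last five
```

Excluding the current match matters: a feature that included the match's own score would leak the outcome into the inputs. The rows without history correspond to the NaN rows that the article removes later with `dropna()`.
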

Data analysis

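Before modeling, we analyze the data a little. A match has three possible results: win, draw, or loss. Modeling this as a 3-class problem runs into class imbalance, and evaluation becomes more awkward, so we merge the classes into two: "home team wins" and "home team draws or loses".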
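
As a small illustration of that merge, the sketch below maps scorelines straight to the binary target and counts the classes. It follows the encoding used by the code in this article (0 = home win, 1 = draw or home loss); the scorelines and the helper name `to_binary_target` are hypothetical.

```python
from collections import Counter

def to_binary_target(home_score, away_score):
    """0 if the home team wins, 1 if it draws or loses."""
    return 0 if home_score > away_score else 1

# Hypothetical scorelines (home goals, away goals).
scores = [(2, 0), (1, 1), (0, 3), (3, 1), (0, 0)]
targets = [to_binary_target(h, a) for h, a in scores]

# Class counts show how balanced the two merged classes are.
print(Counter(targets))
```
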


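For detailed tutorials on data analysis and visualization, see ShowMeAI's data analysis tutorial series and articles.

We examine the distribution of each feature by outcome (win vs. draw/loss), using violin plots.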

base_df_no_fg = base_df.dropna()

df = base_df_no_fg

def no_draw(x):
    if x == 2:
        return 1
    else:
        return x
    
df["target"] = df["result"].apply(lambda x: no_draw(x))

import matplotlib.pyplot as plt
import seaborn as sns  # required by the sns.* plotting calls below

data1 = df[list(df.columns[8:20].values) + ["target"]]

scaled = (data1[:-1] - data1[:-1].mean()) / data1[:-1].std()
scaled["target"] = data1["target"]
violin1 = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin1,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
data2 = df[df.columns[20:]]

scaled = (data2[:-1] - data2[:-1].mean()) / data2[:-1].std()
scaled["target"] = data2["target"]
violin2 = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin2,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
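
Before plotting, the code standardizes each feature with `(x - mean) / std` so that features on very different scales can share one axis. A minimal stdlib sketch of that z-score step follows; the sample values and the helper name `standardize` are assumptions, and `statistics.stdev` is the sample standard deviation, which matches the pandas default (`ddof=1`).

```python
from statistics import mean, stdev

def standardize(values):
    """Z-score a list: subtract the mean, divide by the sample std."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]

rank_dif = [-40, -10, 0, 15, 35]        # hypothetical, spans tens of ranking places
goals_mean = [0.8, 1.2, 1.5, 1.9, 2.6]  # hypothetical, spans about two goals

z_rank = standardize(rank_dif)
z_goals = standardize(goals_mean)

# After scaling, both features have mean 0 and standard deviation 1.
print([round(z, 2) for z in z_rank])
print([round(z, 2) for z in z_goals])
```
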

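In the first group of features, only rank_dif (the difference between the two teams' ranks) separates the target classes. This suggests that difference features carry strong signal, so we build more of them:

  • Difference in goals scored.
  • Difference in goals conceded.
  • Difference between one team's goals scored and the opponent's goals conceded.
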
dif = df.copy()
dif.loc[:, "goals_dif"] = dif["home_goals_mean"] - dif["away_goals_mean"]
dif.loc[:, "goals_dif_l5"] = dif["home_goals_mean_l5"] - dif["away_goals_mean_l5"]
dif.loc[:, "goals_suf_dif"] = dif["home_goals_suf_mean"] - dif["away_goals_suf_mean"]
dif.loc[:, "goals_suf_dif_l5"] = dif["home_goals_suf_mean_l5"] - dif["away_goals_suf_mean_l5"]
dif.loc[:, "goals_made_suf_dif"] = dif["home_goals_mean"] - dif["away_goals_suf_mean"]
dif.loc[:, "goals_made_suf_dif_l5"] = dif["home_goals_mean_l5"] - dif["away_goals_suf_mean_l5"]
dif.loc[:, "goals_suf_made_dif"] = dif["home_goals_suf_mean"] - dif["away_goals_mean"]
dif.loc[:, "goals_suf_made_dif_l5"] = dif["home_goals_suf_mean_l5"] - dif["away_goals_mean_l5"]

We analyze them with violin plots again.

data_difs = dif.iloc[:, -8:]
scaled = (data_difs - data_difs.mean()) / data_difs.std()
scaled["target"] = data2["target"]
violin = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="target", data=violin,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

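The goals-scored and goals-conceded difference features separate the target well. The features that compare one team's goals scored with the opponent's goals conceded, however, show no effect. That leaves us with:

  • Rank difference.
  • Goals-scored difference over the World Cup cycle and over the last 5 matches.
  • Goals-conceded difference over the World Cup cycle and over the last 5 matches.

We can also compute the difference in match points, the difference in the rank of opponents faced, and the difference in rank-weighted match points. And to account for the level of the opposition, we can consider the difference in goals scored and conceded scaled by the rank of opponents faced.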

dif.loc[:, "dif_points"] = dif["home_game_points_mean"] - dif["away_game_points_mean"]
dif.loc[:, "dif_points_l5"] = dif["home_game_points_mean_l5"] - dif["away_game_points_mean_l5"]
dif.loc[:, "dif_points_rank"] = dif["home_game_points_rank_mean"] - dif["away_game_points_rank_mean"]
dif.loc[:, "dif_points_rank_l5"] = dif["home_game_points_rank_mean_l5"] - dif["away_game_points_rank_mean_l5"]

dif.loc[:, "dif_rank_agst"] = dif["home_rank_mean"] - dif["away_rank_mean"]
dif.loc[:, "dif_rank_agst_l5"] = dif["home_rank_mean_l5"] - dif["away_rank_mean_l5"]

dif.loc[:, "goals_per_ranking_dif"] = (dif["home_goals_mean"] / dif["home_rank_mean"]) - (dif["away_goals_mean"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_suf_dif"] = (dif["home_goals_suf_mean"] / dif["home_rank_mean"]) - (dif["away_goals_suf_mean"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_dif_l5"] = (dif["home_goals_mean_l5"] / dif["home_rank_mean"]) - (dif["away_goals_mean_l5"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_suf_dif_l5"] = (dif["home_goals_suf_mean_l5"] / dif["home_rank_mean"]) - (dif["away_goals_suf_mean_l5"] / dif["away_rank_mean"])

We analyze the data with violin plots and box plots:

data_difs = dif.iloc[:, -10:]
scaled = (data_difs - data_difs.mean()) / data_difs.std()
scaled["target"] = data2["target"]
violin = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(x="features", y="value", hue="target", data=violin)
plt.xticks(rotation=90)
plt.show()

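The difference in match points, the difference in goals scaled by rank, and the difference in rank-weighted points are good features. However, some features are very highly correlated with each other, so we examine their joint distributions with jointplot: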

sns.jointplot(data = data_difs, x = 'dif_rank_agst', y = 'dif_rank_agst_l5', kind="reg")
plt.show()
sns.jointplot(data = data_difs, x = 'goals_per_ranking_dif', y = 'goals_per_ranking_dif_l5', kind="reg")
plt.show()
sns.jointplot(data = data_difs, x = 'dif_points_rank', y = 'dif_points_rank_l5', kind="reg")
plt.show()
sns.jointplot(data = data_difs, x = 'dif_points', y = 'dif_points_l5', kind="reg")
plt.show()
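
The joint plots above are a visual check of how strongly a full-cycle feature and its last-5 counterpart move together. The same check can be put in numbers with the Pearson correlation coefficient. The sketch below is a hand-rolled version on hypothetical values; `pearson` is an illustrative helper, not part of the article's pipeline.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values of a full-cycle feature and its last-5 counterpart.
dif_points = [-1.2, -0.5, 0.0, 0.4, 1.1, 1.6]
dif_points_l5 = [-1.0, -0.7, 0.2, 0.3, 1.3, 1.4]

r = pearson(dif_points, dif_points_l5)
print(round(r, 3))  # a value close to 1 means the two features are nearly redundant
```
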

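Where the full-cycle and last-5 versions of a feature are highly correlated, one of the two is enough, and we keep the full-cycle version. The features finally retained are:

  • Difference in team rank (rank_dif)
  • Difference in mean goals scored over the World Cup cycle and over the last 5 matches (goals_dif / goals_dif_l5)
  • Difference in mean goals conceded over the World Cup cycle and over the last 5 matches (goals_suf_dif / goals_suf_dif_l5)
  • Difference in mean rank of opponents faced over the World Cup cycle and over the last 5 matches (dif_rank_agst / dif_rank_agst_l5)
  • Difference in rank-weighted mean goals scored over the World Cup cycle (goals_per_ranking_dif)
  • Difference in rank-weighted mean match points over the World Cup cycle and over the last 5 matches (dif_points_rank / dif_points_rank_l5)
  • Categorical variable indicating whether the match is a friendly (is_friendly)

The function below assembles the final dataset, which contains all the features the machine learning models need.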

def create_db(df):
    columns = ["home_team", "away_team", "target", "rank_dif", "home_goals_mean", "home_rank_mean", "away_goals_mean", "away_rank_mean", "home_rank_mean_l5", "away_rank_mean_l5", "home_goals_suf_mean", "away_goals_suf_mean", "home_goals_mean_l5", "away_goals_mean_l5", "home_goals_suf_mean_l5", "away_goals_suf_mean_l5", "home_game_points_rank_mean", "home_game_points_rank_mean_l5", "away_game_points_rank_mean", "away_game_points_rank_mean_l5","is_friendly_0", "is_friendly_1"]
    
    base = df.loc[:, columns]
    base.loc[:, "goals_dif"] = base["home_goals_mean"] - base["away_goals_mean"]
    base.loc[:, "goals_dif_l5"] = base["home_goals_mean_l5"] - base["away_goals_mean_l5"]
    base.loc[:, "goals_suf_dif"] = base["home_goals_suf_mean"] - base["away_goals_suf_mean"]
    base.loc[:, "goals_suf_dif_l5"] = base["home_goals_suf_mean_l5"] - base["away_goals_suf_mean_l5"]
    base.loc[:, "goals_per_ranking_dif"] = (base["home_goals_mean"] / base["home_rank_mean"]) - (base["away_goals_mean"] / base["away_rank_mean"])
    base.loc[:, "dif_rank_agst"] = base["home_rank_mean"] - base["away_rank_mean"]
    base.loc[:, "dif_rank_agst_l5"] = base["home_rank_mean_l5"] - base["away_rank_mean_l5"]
    base.loc[:, "dif_points_rank"] = base["home_game_points_rank_mean"] - base["away_game_points_rank_mean"]
    base.loc[:, "dif_points_rank_l5"] = base["home_game_points_rank_mean_l5"] - base["away_game_points_rank_mean_l5"]
    
    model_df = base[["home_team", "away_team", "target", "rank_dif", "goals_dif", "goals_dif_l5", "goals_suf_dif", "goals_suf_dif_l5", "goals_per_ranking_dif", "dif_rank_agst", "dif_rank_agst_l5", "dif_points_rank", "dif_points_rank_l5", "is_friendly_0", "is_friendly_1"]]
    return model_df
 
model_db = create_db(df)
model_db

Modeling and tuning


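For background and hands-on methods for building and tuning machine learning models, see ShowMeAI's tutorial series and articles:

Machine Learning in Practice: a hands-on machine learning series

AI cheat sheets for vertical-domain tool libraries | Scikit-Learn cheat sheet

Now we can start modeling. We train two models, Random Forest and Gradient Boosting, and compare them. For hyperparameter tuning we use scikit-learn's GridSearchCV to search a parameter grid and pick the best model.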

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

#separating the target from the features
X = model_db.iloc[:, 3:]
y = model_db[["target"]]

#dividing the database
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=1)

gb = GradientBoostingClassifier(random_state=5)
params = {"learning_rate": [0.01, 0.1, 0.5],
            "min_samples_split": [5, 10],
            "min_samples_leaf": [3, 5],
            "max_depth":[3,5,10],
            "max_features":["sqrt"],
            "n_estimators":[100, 200]
         } 
gb_cv = GridSearchCV(gb, params, cv = 3, n_jobs = -1, verbose = False)
gb_cv.fit(X_train.values, np.ravel(y_train))

#getting the best model
gb = gb_cv.best_estimator_
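
To get a feel for the cost of this search, it helps to count what the grid expands to. The sketch below enumerates the Cartesian product of the Gradient Boosting grid with `itertools.product`, which is the set of candidates an exhaustive grid search evaluates; it does not train anything.

```python
from itertools import product

# Same grid as the Gradient Boosting search in this article.
params = {
    "learning_rate": [0.01, 0.1, 0.5],
    "min_samples_split": [5, 10],
    "min_samples_leaf": [3, 5],
    "max_depth": [3, 5, 10],
    "max_features": ["sqrt"],
    "n_estimators": [100, 200],
}

names = list(params)
candidates = [dict(zip(names, combo)) for combo in product(*params.values())]

cv_folds = 3  # matches cv=3 in the article's GridSearchCV call
print(len(candidates), "candidates,", len(candidates) * cv_folds, "cross-validation fits")
```

With 3 x 2 x 2 x 3 x 1 x 2 = 72 candidates and 3 folds, the search runs 216 cross-validation fits, plus one final refit of the best candidate on the full training set (the default `refit=True`, which is what makes `best_estimator_` available).
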

We tune the random forest in the same way:

params_rf = {"max_depth": [20],
                "min_samples_split": [5, 10],
                "max_leaf_nodes": [175, 200],
                "min_samples_leaf": [5, 10],
                "n_estimators": [250],
                 "max_features": ["sqrt"],
                }

rf = RandomForestClassifier(random_state=1)
rf_cv = GridSearchCV(rf, params_rf, cv = 3, n_jobs = -1, verbose = False)
rf_cv.fit(X_train.values, np.ravel(y_train))

rf = rf_cv.best_estimator_

Output:

GridSearchCV(cv=3, estimator=RandomForestClassifier(random_state=1), n_jobs=-1,
             param_grid={'max_depth': [20], 'max_features': ['sqrt'],
                         'max_leaf_nodes': [175, 200],
                         'min_samples_leaf': [5, 10],
                         'min_samples_split': [5, 10], 'n_estimators': [250]},
             verbose=False)

We analyze the models with a confusion matrix and the ROC curve with its AUC:

from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

def analyze(model):
    fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test.values)[:,1]) #test AUC
    plt.figure(figsize=(15,10))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr, label="test")

    fpr_train, tpr_train, _ = roc_curve(y_train, model.predict_proba(X_train.values)[:,1]) #train AUC
    plt.plot(fpr_train, tpr_train, label="train")
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test.values)[:,1])
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train.values)[:,1])
    plt.legend()
    plt.title('AUC score is %.2f on test and %.2f on training'%(auc_test, auc_train))
    plt.show()
    
    plt.figure(figsize=(15, 10))
    cm = confusion_matrix(y_test, model.predict(X_test.values))
    sns.heatmap(cm, annot=True, fmt="d")

analyze(gb)

The same analysis for the random forest:

analyze(rf)

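The random forest performs slightly better, but it overfits somewhat. Judging by its ROC-AUC, the Gradient Boosting model is the lower-risk choice, so we select it as the final model.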
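
The overfitting judgment rests on the gap between training AUC and test AUC. As a reminder of what the number means, AUC equals the fraction of (positive, negative) pairs that the model ranks in the right order. The sketch below computes it that way on hypothetical scores; `roc_auc` here is an illustrative helper, not scikit-learn's `roc_auc_score`.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of class 1.
y = [0, 0, 1, 1, 0, 1]
train_scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]  # classes perfectly separated
test_scores = [0.2, 0.6, 0.7, 0.4, 0.3, 0.8]   # one pair out of order

gap = roc_auc(y, train_scores) - roc_auc(y, test_scores)
print(roc_auc(y, train_scores), roc_auc(y, test_scores), gap)
```

A large positive gap is the overfitting signal: the model orders the training examples better than unseen ones.
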

Applying the model


Next we use this model to predict the World Cup. We first fetch the list of participating teams with pandas' read_html:

dfs = pd.read_html(r"https://en.wikipedia.org/wiki/2022_FIFA_World_Cup#Teams")

from collections.abc import Iterable

for i in range(len(dfs)):
    df = dfs[i]
    cols = list(df.columns.values)
    
    if isinstance(cols[0], Iterable):
        if any("Tie-breaking criteria" in c for c in cols):
            start_pos = i+1

        if any("Match 46" in c for c in cols):
            end_pos = i+1
matches = []
groups = ["A", "B", "C", "D", "E", "F", "G", "H"]
group_count = 0 

table = {}
# table -> team, points, win probabilities (tie-break criterion)
table[groups[group_count]] = [[a.split(" ")[0], 0, []] for a in list(dfs[start_pos].iloc[:, 1].values)]

for i in range(start_pos+1, end_pos, 1):
    if len(dfs[i].columns) == 3:
        team_1 = dfs[i].columns.values[0]
        team_2 = dfs[i].columns.values[-1]
        
        matches.append((groups[group_count], team_1, team_2))
    else:
        group_count+=1
        table[groups[group_count]] = [[a, 0, []] for a in list(dfs[i].iloc[:, 1].values)]

table
matches[:10]

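Our model classifies a match as "home team wins" versus "away team wins or draw". How, then, do we identify draws? We predict each match in both orientations:

  • Team A x Team B (simulation 1)
  • Team B x Team A (simulation 2)

If both predictions favor the same team, that team is declared the winner. If one prediction favors Team A and the other favors Team B, the result is a draw. The code below simulates the matches one by one and tallies the points.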
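
The decision rule can be isolated from the simulation loop as a small function. This is a sketch under stated assumptions: `decide` is a hypothetical helper, the probability pairs stand in for `predict_proba` output, exact ties between the two probabilities are ignored, and class 1 (home draw or loss) is treated as a win for the other team, as in the article's code.

```python
def decide(probs_g1, probs_g2):
    """Combine two mirrored predictions into a winner or a draw.

    probs_g1: [P(class 0), P(class 1)] with team 1 listed as home.
    probs_g2: [P(class 0), P(class 1)] with team 2 listed as home.
    Class 0 means the listed home team wins.
    """
    team_1_g1, team_2_g1 = probs_g1[0], probs_g1[1]
    team_2_g2, team_1_g2 = probs_g2[0], probs_g2[1]

    favors_1_g1 = team_1_g1 > team_2_g1
    favors_1_g2 = team_1_g2 > team_2_g2
    if favors_1_g1 != favors_1_g2:
        return "draw", None  # the two simulations disagree

    # Both simulations agree: average each team's two probabilities.
    team_1 = (team_1_g1 + team_1_g2) / 2
    team_2 = (team_2_g1 + team_2_g2) / 2
    return ("team_1", team_1) if team_1 > team_2 else ("team_2", team_2)

print(decide([0.7, 0.3], [0.4, 0.6]))  # both orientations favor team 1
print(decide([0.6, 0.4], [0.7, 0.3]))  # each orientation favors its home side
```

Note that a "draw" here really means the model leans toward whichever team is listed first, so it flags matches where the ordering effect outweighs the difference between the teams.
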

def find_stats(team_1):
#team_1 = "Qatar"
    past_games = team_stats_raw[(team_stats_raw["team"] == team_1)].sort_values("date")
    last5 = team_stats_raw[(team_stats_raw["team"] == team_1)].sort_values("date").tail(5)

    team_1_rank = past_games["rank"].values[-1]
    team_1_goals = past_games.score.mean()
    team_1_goals_l5 = last5.score.mean()
    team_1_goals_suf = past_games.suf_score.mean()
    team_1_goals_suf_l5 = last5.suf_score.mean()
    team_1_rank_suf = past_games.rank_suf.mean()
    team_1_rank_suf_l5 = last5.rank_suf.mean()
    team_1_gp_rank = past_games.points_by_rank.mean()
    team_1_gp_rank_l5 = last5.points_by_rank.mean()

    return [team_1_rank, team_1_goals, team_1_goals_l5, team_1_goals_suf, team_1_goals_suf_l5, team_1_rank_suf, team_1_rank_suf_l5, team_1_gp_rank, team_1_gp_rank_l5]

def find_features(team_1, team_2):
    rank_dif = team_1[0] - team_2[0]
    goals_dif = team_1[1] - team_2[1]
    goals_dif_l5 = team_1[2] - team_2[2]
    goals_suf_dif = team_1[3] - team_2[3]
    goals_suf_dif_l5 = team_1[4] - team_2[4]
    goals_per_ranking_dif = (team_1[1]/team_1[5]) - (team_2[1]/team_2[5])
    dif_rank_agst = team_1[5] - team_2[5]
    dif_rank_agst_l5 = team_1[6] - team_2[6]
    dif_gp_rank = team_1[7] - team_2[7]
    dif_gp_rank_l5 = team_1[8] - team_2[8]
    
    return [rank_dif, goals_dif, goals_dif_l5, goals_suf_dif, goals_suf_dif_l5, goals_per_ranking_dif, dif_rank_agst, dif_rank_agst_l5, dif_gp_rank, dif_gp_rank_l5, 1, 0]

from operator import itemgetter  # used by sorted(..., key=itemgetter(1, 2)) below

advanced_group = []
last_group = ""

for k in table.keys():
    for t in table[k]:
        t[1] = 0
        t[2] = []
        
for teams in matches:
    draw = False
    team_1 = find_stats(teams[1])
    team_2 = find_stats(teams[2])

    features_g1 = find_features(team_1, team_2)
    features_g2 = find_features(team_2, team_1)

    probs_g1 = gb.predict_proba([features_g1])
    probs_g2 = gb.predict_proba([features_g2])
    
    team_1_prob_g1 = probs_g1[0][0]
    team_1_prob_g2 = probs_g2[0][1]
    team_2_prob_g1 = probs_g1[0][1]
    team_2_prob_g2 = probs_g2[0][0]

    team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
    team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2
    
    if ((team_1_prob_g1 > team_2_prob_g1) & (team_2_prob_g2 > team_1_prob_g2)) | ((team_1_prob_g1 < team_2_prob_g1) & (team_2_prob_g2 < team_1_prob_g2)):
        draw=True
        for i in table[teams[0]]:
            if i[0] == teams[1] or i[0] == teams[2]:
                i[1] += 1
                
    elif team_1_prob > team_2_prob:
        winner = teams[1]
        winner_proba = team_1_prob
        for i in table[teams[0]]:
            if i[0] == teams[1]:
                i[1] += 3
                
    elif team_2_prob > team_1_prob:  
        winner = teams[2]
        winner_proba = team_2_prob
        for i in table[teams[0]]:
            if i[0] == teams[2]:
                i[1] += 3
    
    for i in table[teams[0]]:  # record the tie-break criterion (per-match win probability)
            if i[0] == teams[1]:
                i[2].append(team_1_prob)
            if i[0] == teams[2]:
                i[2].append(team_2_prob)

    if last_group != teams[0]:
        if last_group != "":
            print("\n")
            print("Group %s advanced: "%(last_group))
            
            for i in table[last_group]:  # average the tie-break probabilities
                i[2] = np.mean(i[2])
            
            final_points = table[last_group]
            final_table = sorted(final_points, key=itemgetter(1, 2), reverse = True)
            advanced_group.append([final_table[0][0], final_table[1][0]])
            for i in final_table:
                print("%s -------- %d"%(i[0], i[1]))
        print("\n")
        print("-"*10+" Starting Analysis for Group %s "%(teams[0])+"-"*10)
        
    if draw == False:
        print("Group %s - %s vs. %s: Winner %s with %.2f probability"%(teams[0], teams[1], teams[2], winner, winner_proba))
    else:
        print("Group %s - %s vs. %s: Draw"%(teams[0], teams[1], teams[2]))
    last_group =  teams[0]


print("\n")
print("Group %s advanced: "%(last_group))

for i in table[last_group]:  # average the tie-break probabilities
    i[2] = np.mean(i[2])
            
final_points = table[last_group]
final_table = sorted(final_points, key=itemgetter(1, 2), reverse = True)
advanced_group.append([final_table[0][0], final_table[1][0]])
for i in final_table:
    print("%s -------- %d"%(i[0], i[1]))
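
The group standings are ordered by points first and by mean win probability second, which is what `sorted(..., key=itemgetter(1, 2), reverse=True)` does. A small standalone example: the points follow the Group A output in this article, while the probabilities are hypothetical values chosen to show the tie-break.

```python
from operator import itemgetter

# [team, points, mean win probability]
group_a = [
    ["Qatar", 0, 0.33],
    ["Ecuador", 4, 0.45],
    ["Senegal", 4, 0.47],
    ["Netherlands", 9, 0.71],
]

# Sort by points, then by mean win probability, both descending.
standings = sorted(group_a, key=itemgetter(1, 2), reverse=True)
advancing = [standings[0][0], standings[1][0]]
print(advancing)
```

Senegal and Ecuador are level on 4 points, so the higher mean probability decides who advances. The real tournament uses goal difference and goals scored instead; the probability tie-break is a substitute that the model can supply.
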

The results:

---------- Starting Analysis for Group A ----------
Group A - Qatar vs. Ecuador: Winner Ecuador with 0.62 probability
Group A - Senegal vs. Netherlands: Winner Netherlands with 0.62 probability
Group A - Qatar vs. Senegal: Winner Senegal with 0.60 probability
Group A - Netherlands vs. Ecuador: Winner Netherlands with 0.73 probability
Group A - Ecuador vs. Senegal: Draw
Group A - Netherlands vs. Qatar: Winner Netherlands with 0.78 probability


Group A advanced: 
Netherlands -------- 9
Senegal -------- 4
Ecuador -------- 4
Qatar -------- 0




---------- Starting Analysis for Group B ----------
Group B - England vs. Iran: Winner England with 0.62 probability
Group B - United States vs. Wales: Draw
Group B - Wales vs. Iran: Draw
Group B - England vs. United States: Winner England with 0.61 probability
Group B - Wales vs. England: Winner England with 0.64 probability
Group B - Iran vs. United States: Winner United States with 0.58 probability




Group B advanced: 
England -------- 9
United States -------- 4
Wales -------- 2
Iran -------- 1




---------- Starting Analysis for Group C ----------
Group C - Argentina vs. Saudi Arabia: Winner Argentina with 0.79 probability
Group C - Mexico vs. Poland: Draw
Group C - Poland vs. Saudi Arabia: Winner Poland with 0.70 probability
Group C - Argentina vs. Mexico: Winner Argentina with 0.67 probability
Group C - Poland vs. Argentina: Winner Argentina with 0.71 probability
Group C - Saudi Arabia vs. Mexico: Winner Mexico with 0.71 probability




Group C advanced: 
Argentina -------- 9
Poland -------- 4
Mexico -------- 4
Saudi Arabia -------- 0




---------- Starting Analysis for Group D ----------
Group D - Denmark vs. Tunisia: Winner Denmark with 0.68 probability
Group D - France vs. Australia: Winner France with 0.71 probability
Group D - Tunisia vs. Australia: Draw
Group D - France vs. Denmark: Draw
Group D - Australia vs. Denmark: Winner Denmark with 0.71 probability
Group D - Tunisia vs. France: Winner France with 0.69 probability




Group D advanced: 
France -------- 7
Denmark -------- 7
Tunisia -------- 1
Australia -------- 1




---------- Starting Analysis for Group E ----------
Group E - Germany vs. Japan: Winner Germany with 0.62 probability
Group E - Spain vs. Costa Rica: Winner Spain with 0.76 probability
Group E - Japan vs. Costa Rica: Winner Japan with 0.63 probability
Group E - Spain vs. Germany: Draw
Group E - Japan vs. Spain: Winner Spain with 0.67 probability
Group E - Costa Rica vs. Germany: Winner Germany with 0.65 probability




Group E advanced: 
Spain -------- 7
Germany -------- 7
Japan -------- 3
Costa Rica -------- 0




---------- Starting Analysis for Group F ----------
Group F - Morocco vs. Croatia: Winner Croatia with 0.58 probability
Group F - Belgium vs. Canada: Winner Belgium with 0.75 probability
Group F - Belgium vs. Morocco: Winner Belgium with 0.67 probability
Group F - Croatia vs. Canada: Winner Croatia with 0.64 probability
Group F - Croatia vs. Belgium: Winner Belgium with 0.64 probability
Group F - Canada vs. Morocco: Draw




Group F advanced: 
Belgium -------- 9
Croatia -------- 6
Morocco -------- 1
Canada -------- 1




---------- Starting Analysis for Group G ----------
Group G - Switzerland vs. Cameroon: Winner Switzerland with 0.69 probability
Group G - Brazil vs. Serbia: Winner Brazil with 0.72 probability
Group G - Cameroon vs. Serbia: Winner Serbia with 0.66 probability
Group G - Brazil vs. Switzerland: Draw
Group G - Serbia vs. Switzerland: Winner Switzerland with 0.57 probability
Group G - Cameroon vs. Brazil: Winner Brazil with 0.81 probability




Group G advanced: 
Brazil -------- 7
Switzerland -------- 7
Serbia -------- 3
Cameroon -------- 0




---------- Starting Analysis for Group H ----------
Group H - Uruguay vs. South Korea: Winner Uruguay with 0.62 probability
Group H - Portugal vs. Ghana: Winner Portugal with 0.81 probability
Group H - South Korea vs. Ghana: Winner South Korea with 0.76 probability
Group H - Portugal vs. Uruguay: Winner Portugal with 0.60 probability
Group H - Ghana vs. Uruguay: Winner Uruguay with 0.77 probability
Group H - South Korea vs. Portugal: Winner Portugal with 0.67 probability




Group H advanced: 
Portugal -------- 9
Uruguay -------- 6
South Korea -------- 3
Ghana -------- 0

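Some of the model's results are interesting, such as the draws between Brazil and Switzerland and between Denmark and France.

The knockout stage follows the same idea, except that there are no draws: the team with the higher averaged probability advances.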

advanced = advanced_group


playoffs = {"Round of 16": [], "Quarter-Final": [], "Semi-Final": [], "Final": []}


for p in playoffs.keys():
    playoffs[p] = []


actual_round = ""
next_rounds = []


for p in playoffs.keys():
    if p == "Round of 16":
        control = []
        for a in range(0, len(advanced*2), 1):
            if a < len(advanced):
                if a % 2 == 0:
                    control.append((advanced*2)[a][0])
                else:
                    control.append((advanced*2)[a][1])
            else:
                if a % 2 == 0:
                    control.append((advanced*2)[a][1])
                else:
                    control.append((advanced*2)[a][0])


        playoffs[p] = [[control[c], control[c+1]] for c in range(0, len(control)-1, 1) if c%2 == 0]
        
        for i in range(0, len(playoffs[p]), 1):
            game = playoffs[p][i]
            
            home = game[0]
            away = game[1]
            team_1 = find_stats(home)
            team_2 = find_stats(away)


            features_g1 = find_features(team_1, team_2)
            features_g2 = find_features(team_2, team_1)
            
            probs_g1 = gb.predict_proba([features_g1])
            probs_g2 = gb.predict_proba([features_g2])
            
            team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
            team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2
            
            if actual_round != p:
                print("-"*10)
                print("Starting simulation of %s"%(p))
                print("-"*10)
                print("\n")
            
            if team_1_prob < team_2_prob:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, away, team_2_prob))
                next_rounds.append(away)
            else:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, home, team_1_prob))
                next_rounds.append(home)
            
            game.append([team_1_prob, team_2_prob])
            playoffs[p][i] = game
            actual_round = p
        
    else:
        playoffs[p] = [[next_rounds[c], next_rounds[c+1]] for c in range(0, len(next_rounds)-1, 1) if c%2 == 0]
        next_rounds = []
        for i in range(0, len(playoffs[p])):
            game = playoffs[p][i]
            home = game[0]
            away = game[1]
            team_1 = find_stats(home)
            team_2 = find_stats(away)
            
            features_g1 = find_features(team_1, team_2)
            features_g2 = find_features(team_2, team_1)
            
            probs_g1 = gb.predict_proba([features_g1])
            probs_g2 = gb.predict_proba([features_g2])
            
            team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
            team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2
            
            if actual_round != p:
                print("-"*10)
                print("Starting simulation of %s"%(p))
                print("-"*10)
                print("\n")
            
            if team_1_prob < team_2_prob:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, away, team_2_prob))
                next_rounds.append(away)
            else:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, home, team_1_prob))
                next_rounds.append(home)
            game.append([team_1_prob, team_2_prob])
            playoffs[p][i] = game
            actual_round = p

The results:

----------
Starting simulation of Round of 16
----------




Netherlands vs. United States: Netherlands advances with prob 0.54
Argentina vs. Denmark: Argentina advances with prob 0.59
Spain vs. Croatia: Spain advances with prob 0.61
Brazil vs. Uruguay: Brazil advances with prob 0.64
Senegal vs. England: England advances with prob 0.64
Poland vs. France: France advances with prob 0.67
Germany vs. Belgium: Belgium advances with prob 0.53
Switzerland vs. Portugal: Portugal advances with prob 0.57
----------
Starting simulation of Quarter-Final
----------




Netherlands vs. Argentina: Netherlands advances with prob 0.51
Spain vs. Brazil: Brazil advances with prob 0.54
England vs. France: England advances with prob 0.51
Belgium vs. Portugal: Portugal advances with prob 0.52
----------
Starting simulation of Semi-Final
----------




Netherlands vs. Brazil: Brazil advances with prob 0.55
England vs. Portugal: England advances with prob 0.51
----------
Starting simulation of Final
----------




Brazil vs. England: Brazil advances with prob 0.56

We present the results as a diagram.

import networkx as nx
from networkx.drawing.nx_pydot import graphviz_layout

plt.figure(figsize=(15, 10))
G = nx.balanced_tree(2, 3)


labels = []


for p in playoffs.keys():
    for game in playoffs[p]:
        label = f"{game[0]}({round(game[2][0], 2)}) \n {game[1]}({round(game[2][1], 2)})"
        labels.append(label)
    
labels_dict = {}
labels_rev = list(reversed(labels))

for l in range(len(list(G.nodes))):
    labels_dict[l] = labels_rev[l]

pos = graphviz_layout(G, prog='twopi')
labels_pos = {n: (k[0], k[1]-0.08*k[1]) for n,k in pos.items()}
center  = pd.DataFrame(pos).mean(axis=1).mean()
    

nx.draw(G, pos = pos, with_labels=False, node_color=range(15), edge_color="#bbf5bb", width=10, font_weight='bold',cmap=plt.cm.Greens, node_size=5000)
nx.draw_networkx_labels(G, pos = labels_pos, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="black", lw=.5, alpha=1),
                        labels=labels_dict)
texts = ["Round \nof 16", "Quarter \n Final", "Semi \n Final", "Final\n"]
pos_y = pos[0][1] + 55
for text in reversed(texts):
    pos_x = center
    pos_y -= 75 
    plt.text(pos_y, pos_x, text, fontsize = 18)

plt.axis('equal')
plt.show()

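That is the simulated World Cup: our model predicts that Brazil wins the title, beating England in the final with a probability of 56%. The biggest upsets in the predictions are Belgium beating Germany, and England reaching the final after eliminating France in the quarter-finals. It is also interesting to see matches decided by very narrow probability margins, such as Netherlands vs. Argentina (0.51).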


Summary


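In this article, ShowMeAI applied machine learning to analyze and model the World Cup teams, and to simulate and predict the tournament's results. It covers data preprocessing, data analysis, feature engineering, model building and hyperparameter tuning, applying the model, and visualizing the results. Of course, the fun of the World Cup is that things change in an instant on the pitch and any result is possible. Let's follow the World Cup together and enjoy every match!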
