Gearing up for the World Cup: playing FIFA 18 with deep learning and reinforcement learning

Published by 機器之心 (Synced) on 2018-06-07

Original URL: https://juejin.im/post/5b18d952e51d4506825f1314

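The author of this article, a Chelsea Football Club fan, wrote an English-language blog series on teaching an agent to take better free kicks in FIFA 18. It comes in two parts: playing FIFA 18 with a neural network trained by supervised learning, and playing FIFA 18 with Q-learning, a reinforcement learning method.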

How the agent plays FIFA

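An agent that plays FIFA is not the same as the bots built into the game: it has no access to internal program state and sees only what a human sees, the screen output. Screenshots of the game window are the only data fed to the agent's game engine. The agent processes this visual information, outputs the action it wants to take, and a key-press simulator passes that action to the game.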
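The screenshot, decision, key-press cycle described above can be sketched as a small loop. This is only an illustration: `run_agent`, `fake_grab_frame`, and `fake_policy` are hypothetical names, and the stand-in functions replace the real screen grabber, model, and key simulator so the sketch runs without the game.

```python
import random
from typing import Callable, List


def run_agent(grab_frame: Callable[[], List[int]],
              choose_action: Callable[[List[int]], str],
              press_key: Callable[[str], None],
              steps: int) -> List[str]:
    """Run the screenshot -> decision -> key press cycle for a fixed number of steps."""
    history = []
    for _ in range(steps):
        frame = grab_frame()           # the only input: pixels from the game window
        action = choose_action(frame)  # any model that maps pixels to an action
        press_key(action)              # the only output: a simulated key press
        history.append(action)
    return history


# Hypothetical stand-ins so the loop can run without a game.
def fake_grab_frame() -> List[int]:
    return [random.randint(0, 255) for _ in range(16)]


def fake_policy(frame: List[int]) -> str:
    return "move_left" if sum(frame) % 2 == 0 else "move_right"


pressed: List[str] = []
actions = run_agent(fake_grab_frame, fake_policy, pressed.append, steps=5)
print(actions)
```

Keeping the three pieces behind plain callables means the learning method can change (supervised or reinforcement learning) without touching the input and output plumbing.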

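This gives us a basic framework for feeding the agent its input and letting its output control the game. What remains is how the agent learns to play. This article covers two approaches. The first builds the agent with deep neural networks trained in a supervised way, using a convolutional neural network to interpret the screenshots and a long short-term memory network to predict the action sequence. The second trains a stronger agent by reinforcement learning, using deep Q-learning. Both implementations are open source: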

Supervised deep learning agent: https://github.com/ChintanTrivedi/DeepGamingAI_FIFA
Reinforcement learning agent: https://github.com/ChintanTrivedi/DeepGamingAI_FIFARL

The supervised learning agent

Step 1: Training the convolutional neural network (CNN)

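CNNs are known for detecting objects in images with high accuracy. Combined with fast GPUs and efficient network architectures, they can be built to run in real time.

So that the agent can understand its input images, we use MobileNet, a compact, lightweight convolutional network. The feature maps it extracts represent the agent's high-level semantic understanding of the image, such as where the players and other objects are. These feature maps are then passed to a single-shot multibox detector to detect the players, the ball, and the goal on the pitch.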

Step 2: Training the long short-term memory network (LSTM)

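With the image understood, the next job is to decide what to do. We do not want to act after seeing a single frame, though; we first want to look at a short sequence of frames. This is where LSTMs come in, since they are known for modeling sequential data well. Consecutive frames serve as the time steps of a sequence, and the CNN extracts a feature map from each frame. The feature maps are then fed to two LSTM networks simultaneously.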
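To make the idea of frames as time steps concrete, here is a minimal sketch of a sliding window that collects the most recent per-frame feature vectors into one sequence. The class name `FrameWindow` and the window length are assumptions for illustration; the article does not state how many frames the LSTMs see.

```python
from collections import deque

SEQUENCE_LENGTH = 5  # hypothetical window size, not taken from the article


class FrameWindow:
    """Keep the most recent feature vectors as one fixed-length sequence."""

    def __init__(self, length=SEQUENCE_LENGTH):
        # A deque with maxlen drops the oldest frame automatically.
        self.frames = deque(maxlen=length)

    def push(self, feature_vector):
        self.frames.append(feature_vector)

    def ready(self):
        # A full window is needed before a sequence model can be queried.
        return len(self.frames) == self.frames.maxlen

    def as_sequence(self):
        return list(self.frames)


window = FrameWindow(length=3)
for t in range(5):
    window.push([float(t)] * 4)  # stand-in for the CNN features of frame t
    if window.ready():
        print(t, [features[0] for features in window.as_sequence()])
```

Each time a new frame arrives the window slides forward by one step, so the sequence model always sees the latest few frames in order.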

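The first LSTM learns how the player should move, so it is a multi-class classifier. The second LSTM receives the same input and decides whether to cross, dribble, pass, or shoot, which makes it another multi-class classifier. The outputs of the two classifiers are then converted into key presses that control the game.

These networks were trained on data collected by playing the game manually while recording the input images and the target key presses. It is one of the few tasks where collecting labeled data is not a chore.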
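As a rough sketch of that last conversion step, the snippet below takes one probability vector from each classifier, picks the most likely class from each, and looks up a key for it. The movement classes and every key binding here are hypothetical placeholders; only the four action names come from the text.

```python
MOVEMENTS = ["up", "down", "left", "right"]      # hypothetical movement classes
ACTIONS = ["cross", "dribble", "pass", "shoot"]  # the four actions named in the text

# Hypothetical key bindings, for illustration only.
MOVE_KEYS = {"up": "up_arrow", "down": "down_arrow",
             "left": "left_arrow", "right": "right_arrow"}
ACTION_KEYS = {"cross": "Q", "dribble": "W", "pass": "S", "shoot": "D"}


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def decode(move_probs, action_probs):
    """Turn the two classifier outputs into a (movement key, action key) pair."""
    move = MOVEMENTS[argmax(move_probs)]
    action = ACTIONS[argmax(action_probs)]
    return MOVE_KEYS[move], ACTION_KEYS[action]


keys = decode([0.1, 0.2, 0.6, 0.1], [0.05, 0.05, 0.2, 0.7])
print(keys)  # ('left_arrow', 'D')
```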

The reinforcement learning agent

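In the previous part I introduced an AI bot trained to play FIFA with supervised learning. Trained that way, the bot quickly picked up basic moves such as passing and shooting. Collecting the training data needed to improve it further, however, became cumbersome, and progress was slow and laborious. For that reason I decided to switch to reinforcement learning. This part briefly explains what reinforcement learning is and how to apply it to this game. A major challenge is that we have no access to the game's code, so we can only work with what we see on the game screen. As a result I could not train the agent on the full game, but I found a workaround: letting the agent play the skill games in practice mode. In this tutorial I try to teach the bot to take free kicks from 30 yards; with some modifications you can have it play other skill games. Let us first look at reinforcement learning and at how to formulate the free-kick problem so that the technique fits.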

What is reinforcement learning (and deep Q-learning)?

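Unlike supervised learning, reinforcement learning needs no manually labeled training data. Instead, the agent interacts with the environment and observes the outcome of each interaction. Repeating this many times yields positive and negative experiences that serve as training data. We therefore learn by experimenting rather than by imitating.

Suppose the environment is in a particular state s and, when action a is taken, it moves to state s'. The immediate reward you observe from the environment for this action is r. Every action that follows has its own immediate reward, until the interaction ends because of a positive or negative outcome; these are called future rewards. For the current state s, we therefore try to estimate which of the possible actions will yield the largest immediate plus future reward, written Q(s, a) and called the Q-function. This gives Q(s, a) = r + γ * max_a' Q(s', a'), the expected final reward of taking action a in state s, where the maximum is taken over the actions a' available in s'. Because predicting the future is uncertain, the discount factor γ is introduced so that we trust the present more than the future.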

Image source: http://people.csail.mit.edu/hongzi/content/publications/DeepRM-HotNets16.pdf
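A small worked example may help with the Q-function relation described above. The helper `q_target` is a hypothetical name, and the discount of 0.9 is an assumption that matches the default used by the replay-memory class later in this article.

```python
GAMMA = 0.9  # discount factor; assumed to match the replay-memory default used later


def q_target(reward, next_q_values, game_over, gamma=GAMMA):
    """Immediate reward plus the discounted best estimate for the next state."""
    if game_over:
        # No future after the final step, so only the immediate reward counts.
        return reward
    return reward + gamma * max(next_q_values)


# Moving left earns r = 0 now, but the best follow-up action is estimated at 0.5.
print(q_target(0.0, [0.5, -0.2, 0.1, 0.0], game_over=False))  # 0.45
# A shot that scores ends the episode with r = +1.
print(q_target(1.0, [0.0, 0.0, 0.0, 0.0], game_over=True))    # 1.0
```

With γ below 1, a reward that is one step away is worth slightly less than the same reward received now, which is how the preference for the present over the future enters the estimate.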

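Deep Q-learning is a particular reinforcement learning technique in which the Q-function is learned by a deep neural network. Given the state of the environment as image input, the network tries to predict the expected final reward of every possible action, as in a regression problem. The action with the highest predicted Q-value is the one we take in the environment, hence the name "deep Q-learning".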

Framing FIFA free kicks as a Q-learning problem

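State: a game screenshot processed by the MobileNet CNN, giving a flattened 128-dimensional feature map.
Actions: four possible actions, shoot_low, shoot_high, move_left, and move_right.
Reward: if the in-game score rises by more than 200 points after pressing shoot, we have scored a goal and r = +1. If the ball misses and the score stays the same, r = -1. For moving left or right, r = 0.
Policy: a two-layer dense network that takes the feature map as input and predicts the final reward of each of the 4 actions.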
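The reward rule in the list above can be isolated from the screen-reading code and sketched as a pure function. The function names are hypothetical, and the text passed in stands for whatever the OCR step returns; action indices 0 and 1 are assumed to be the two shooting actions, as in the action order listed above.

```python
GOAL_THRESHOLD = 200    # a goal raises the in-game score by more than 200 points
SHOOT_ACTIONS = {0, 1}  # assumed indices of shoot_low and shoot_high


def parse_score(ocr_text):
    """Keep only the digits of the OCR output; return None if there are none."""
    digits = "".join(c for c in ocr_text if c.isdigit())
    return int(digits) if digits else None


def action_reward(action, previous_score, ocr_text):
    """Return +1 for a goal, -1 for a missed shot, and 0 for a sideways move."""
    score = parse_score(ocr_text)
    if score is not None and score - previous_score > GOAL_THRESHOLD:
        return 1
    if action in SHOOT_ACTIONS:
        # A shot was taken but the score did not jump (or could not be read).
        return -1
    return 0


print(action_reward(1, 1000, "1,350"))  # 1
print(action_reward(0, 1000, "1000"))   # -1
print(action_reward(2, 1000, "1000"))   # 0
```

Treating an unreadable score after a shot as a miss is a pessimistic choice: it can mislabel a real goal, but it never rewards a shot that was not confirmed.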

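The reinforcement learning process in which the agent interacts with the game environment. The Q-learning model is the core of this process and is responsible for predicting the future reward of every action the agent could take. The model is trained and updated continuously throughout.

Note: if FIFA's kick-off mode had a performance meter like the one in practice mode, we might be able to frame the whole game as a Q-learning problem rather than just free kicks. Otherwise we would need access to the game's internal code, which we do not have. Either way, we should make the most of what is available.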

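Code implementation

We implement this in Python using deep learning tools such as TensorFlow (Keras).

GitHub repository: https://github.com/ChintanTrivedi/DeepGamingAI_FIFARL

Below I walk through four key parts of the code to help you follow the tutorial. Some lines have been removed for brevity, so use the complete code from the repository when you run it.

1. Interacting with the game environment

We have no ready-made API for accessing the game's code, so we make our own. We use screenshots of the game to observe the state, simulated key presses to take actions in the game environment, and optical character recognition (OCR) to read the in-game reward. Our FIFA class has three main methods, observe(), act(), and _get_reward(), plus a helper, _is_over(), that checks whether the free kick has been taken.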

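import time

import pytesseract as pt  # assumed: `pt` is the pytesseract OCR wrapper
from PIL import Image

# grab_screen, PressKey, ReleaseKey, the key codes (spacebar, leftarrow, rightarrow) and CNN
# are helpers defined in the full repository code; their imports are omitted in this excerpt.


class FIFA(object):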
    """
    This class acts as the intermediate "API" to the actual game. Double quotes API because we are not touching the
    game's actual code. It interacts with the game simply using screen-grab (input) and keypress simulation (output)
    using some clever python libraries.
    """

    # Define actions that our agent can take and the corresponding keys to press for taking that action.
    actions_display_name = ['shoot low', 'shoot high', 'move left', 'move right']
    key_to_press = [spacebar, spacebar, leftarrow, rightarrow]

    def __init__(self):
        # Create a CNN graph object that will process screenshot images of the game.
        self.cnn_graph = CNN()

        # Initialize reward that will act as feedback from our interactions with the game.
        self.reward = 0


    # Observe our game environment by taking screenshot of the game.
    def observe(self):
        # Get current state s from screen using screen-grab and narrow it down to the game window.
        screen = grab_screen(region=None)
        game_screen = screen[25:-40, 1921:]

        # Process through CNN to get the feature map from the raw image. This will act as our current state s.
        return self.cnn_graph.get_image_feature_map(game_screen)


    # Press the appropriate key based on the action our agent decides to take.
    def act(self, action):
        # If we are shooting low (action=0) then press spacebar for just 0.05s for low power.
        # In all other cases press the key for a longer time.
        PressKey(self.key_to_press[action])
        time.sleep(0.05 if action == 0 else 0.2)
        ReleaseKey(self.key_to_press[action])

        # Wait until some time after taking action for the game's animation to complete.
        # Taking a shot requires 5 seconds of animation, otherwise the game responds immediately.
        time.sleep(5 if action in (0, 1) else 1)

        # Once our environment has reacted to our agent's actions, we fetch the reward
        # and check if the game is over or not (i.e., it is over once the shot has been taken).
        reward = self._get_reward(action)
        game_over = self._is_over(action)
        return self.observe(), reward, game_over


    # Get feedback from the game - uses OCR on "performance meter" in the game's top right corner. 
    # We will assign +1 reward to a shot if it ends up in the net, a -1 reward if it misses the net 
    # and 0 reward for a left or right movement.
    def _get_reward(self, action):
        screen = grab_screen(region=None)
        game_screen = screen[25:-40, 1921:]

        # Narrow down to the reward meter at top right corner of game screen to get the feedback.
        reward_meter = game_screen[85:130, 1650:1730]
        i = Image.fromarray(reward_meter.astype('uint8'), 'RGB')
        try:
            # Use OCR to recognize the reward obtained after taking the action.
            ocr_result = pt.image_to_string(i)
            ingame_reward = int(''.join(c for c in ocr_result if c.isdigit()))

            # Determine if the feedback is positive or not based on the reward observed. 
            # Also update our reward object with latest observed reward.
            if ingame_reward - self.reward > 200:
                # If ball goes into the net, our ingame performance meter increases by more than 200 points.
                self.reward = ingame_reward
                action_reward = 1
            elif self._is_over(action):
                # If ball has been shot but performance meter has not increased the score, ie, we missed the goal.
                self.reward = ingame_reward
                action_reward = -1
            else:
                # If ball hasn't been shot yet, we are only moving left or right.
                self.reward = ingame_reward
                action_reward = 0
        except Exception:
            # Sometimes OCR fails (for example, no digits are recognized); assume we haven't scored in this scenario.
            action_reward = -1 if self._is_over(action) else 0
        return action_reward


    def _is_over(self, action):
        # Check if the ball is still there to be hit. If shoot action has been initiated,
        # the game is considered over since you cannot influence it anymore.
        return action in (0, 1)


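2. Collecting training data

Throughout training we store every experience and the reward observed, and use them as training data for the Q-learning model. For each action we take, we store the experience <s, a, r, s'> together with a game_over flag. The target label the model tries to learn is the final reward of each action, a real number, which makes this a regression problem.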

import numpy as np


class ExperienceReplay(object):
    """
    During gameplay all the experiences < s, a, r, s’ > are stored in a replay memory.
    In training, batches of randomly drawn experiences are used to generate the input and target for training.
    """

    def __init__(self, max_memory=100000, discount=.9):
        """
        Setup
        max_memory: the maximum number of experiences we want to store
        memory: a list of experiences
        discount: the discount factor for future experience

        In the memory the information whether the game ended at the state is stored separately in a nested array
        [...
        [experience, game_over]
        [experience, game_over]
        ...]
        """
        self.max_memory = max_memory
        self.memory = list()
        self.discount = discount

    def remember(self, states, game_over):
        # Save a state to memory
        self.memory.append([states, game_over])
        # We don't want to store infinite memories, so if we have too many, we just delete the oldest one
        if len(self.memory) > self.max_memory:
            del self.memory[0]

    def get_batch(self, model, batch_size=10):

        # How many experiences do we have?
        len_memory = len(self.memory)

        # Calculate the number of actions that can possibly be taken in the game.
        num_actions = model.output_shape[-1]

        # Dimensions of our observed states, ie, the input to our model.
        env_dim = self.memory[0][0][0].shape[1]

        # We want to return an input and target vector with inputs from an observed state.
        inputs = np.zeros((min(len_memory, batch_size), env_dim))

        # ...and the target r + gamma * max Q(s’,a’)
        # Note that our target is a matrix, with possible fields not only for the action taken but also
        # for the other possible actions. The actions not taken keep the same value as the prediction so that
        # training does not affect them.
        targets = np.zeros((inputs.shape[0], num_actions))

        # We draw states to learn from randomly
        for i, idx in enumerate(np.random.randint(0, len_memory,
                                                  size=inputs.shape[0])):
            """
            Here we load one transition <s, a, r, s’> from memory
            state_t: initial state s
            action_t: action taken a
            reward_t: reward earned r
            state_tp1: the state that followed s’
            """
            state_t, action_t, reward_t, state_tp1 = self.memory[idx][0]

            # We also need to know whether the game ended at this state
            game_over = self.memory[idx][1]

            # add the state s to the input
            inputs[i:i + 1] = state_t

            # First we fill the target values with the predictions of the model.
            # They will not be affected by training (since the training loss for them is 0)
            targets[i] = model.predict(state_t)[0]

            """
            If the game ended, the expected reward Q(s,a) should be the final reward r.
            Otherwise the target value is r + gamma * max Q(s’,a’)
            """
            #  Here Q_sa is max_a'Q(s', a')
            Q_sa = np.max(model.predict(state_tp1)[0])

            # if the game ended, the reward is the final reward
            if game_over:  # if game_over is True
                targets[i, action_t] = reward_t
            else:
                # r + gamma * max Q(s’,a’)
                targets[i, action_t] = reward_t + self.discount * Q_sa
        return inputs, targets
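To see why only the action that was actually taken contributes to the training loss, the sketch below builds one target row the way get_batch does, using plain lists in place of model predictions. The function name `build_target` and the numbers are made up for illustration.

```python
def build_target(q_state, q_next_state, action, reward, game_over, discount=0.9):
    """Copy the model's predictions and overwrite only the entry of the action taken."""
    target = list(q_state)
    if game_over:
        target[action] = reward
    else:
        target[action] = reward + discount * max(q_next_state)
    return target


# Hypothetical predictions for state s and for the state s' that followed it.
q_s = [0.2, -0.1, 0.4, 0.0]
q_s_next = [0.3, 0.1, 0.5, 0.2]

target = build_target(q_s, q_s_next, action=2, reward=0, game_over=False)

# Squared error per action: zero everywhere except for the action that was taken.
errors = [(t - q) ** 2 for t, q in zip(target, q_s)]
print(target)
print(errors)
```

Because the untouched entries equal the predictions, their error is exactly zero and the gradient step adjusts only the Q-value of the chosen action.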

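3. The training process

We can now interact with the game and store those interactions in memory, so let us start training the Q-learning model. To do so we balance exploration (taking random actions in the game) against exploitation (taking the actions the model predicts). This lets us gather varied game experience by trial and error.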

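The parameter epsilon serves exactly this purpose: it is a decaying factor that balances exploration and exploitation (the code computes it as 4 / sqrt(e + 1) for epoch e). At the start we know nothing and want to explore more, but as the epochs go by we learn more and want to exploit more and explore less. The value of epsilon therefore decays over time.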
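The effect of this schedule is easier to see with numbers. The sketch below evaluates the same formula the training code uses, 4 / sqrt(e + 1), at a few epochs and shows an epsilon-greedy choice built on it; the function names are hypothetical.

```python
import random


def epsilon_for_epoch(epoch):
    """Exploration rate used by the training loop: 4 / sqrt(epoch + 1)."""
    return 4 / (epoch + 1) ** 0.5


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the best-rated one."""
    if rng.random() <= epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


for epoch in (0, 15, 99, 999):
    eps = epsilon_for_epoch(epoch)
    # Values above 1 simply mean the agent always explores.
    print(epoch, round(eps, 3), min(1.0, eps))
```

Since epsilon stays at or above 1 through the first 16 epochs, the agent acts purely at random there; by epoch 100 it explores 40% of the time, and by epoch 1000 roughly 13% of the time.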

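For this tutorial, because of time and performance constraints, the model was trained for only 1000 epochs, but in future I would like to train it for at least 5000.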

import numpy as np

# parameters
max_memory = 1000  # Maximum number of experiences we are storing
batch_size = 1  # Number of experiences we use for training per batch

exp_replay = ExperienceReplay(max_memory=max_memory)


# Train a model on the given game
def train(game, model, epochs, verbose=1):
    num_actions = len(game.key_to_press)  # 4 actions [shoot_low, shoot_high, left_arrow, right_arrow]
    # Resetting the win counter
    win_cnt = 0
    # We want to keep track of the progress of the AI over time, so we save its win count history 
    # indicated by number of goals scored
    win_hist = []
    # Epochs is the number of games we play
    for e in range(epochs):
        loss = 0.
        # epsilon for exploration - inversely proportional to the square root of the training epoch
        epsilon = 4 / ((e + 1) ** 0.5)
        game_over = False
        # get current state s by observing our game environment
        input_t = game.observe()

        while not game_over:
            # The learner is acting on the last observed game screen
            # input_t is a vector representing the game screen
            input_tm1 = input_t

            # We choose our action from either exploration (random) or exploitation (model).
            if np.random.rand() <= epsilon:
                # Explore a random action
                action = np.random.randint(0, num_actions)
            else:
                # Choose action from the model's prediction
                # q contains the expected rewards for the actions
                q = model.predict(input_tm1)
                # We pick the action with the highest expected reward
                action = np.argmax(q[0])

            # apply action, get rewards r and new state s'
            input_t, reward, game_over = game.act(action)
            # If we managed to score a goal we add 1 to our win counter
            if reward == 1:
                win_cnt += 1

            """
            The experiences < s, a, r, s’ > we make during gameplay are our training data.
            Here we first save the last experience, and then load a batch of experiences to train our model
            """
            # store experience
            exp_replay.remember([input_tm1, action, reward, input_t], game_over)

            # Load batch of experiences
            inputs, targets = exp_replay.get_batch(model, batch_size=batch_size)

            # train model on experiences
            batch_loss = model.train_on_batch(inputs, targets)

            loss += batch_loss

        if verbose > 0:
            print("Epoch {:03d}/{:03d} | Loss {:.4f} | Win count {}".format(e, epochs, loss, win_cnt))

        # Track win history to later check if our model is improving at the game over time.
        win_hist.append(win_cnt)
    return win_hist

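4. Defining the model and starting training

At the core of the Q-learning process is a fully connected (dense) network with two ReLU hidden layers. It takes the 128-dimensional feature map as the input state and outputs 4 Q-values, one per possible action. The action with the highest predicted Q-value is the one the network's policy prescribes for the given state.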

from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD

# Number of games played in training.
# Trained on 1000 epochs till now, but would ideally like to train for 5000 epochs at least.
epochs = 1000
game = FIFA()

# Our model's architecture parameters
input_size = 128  # The input shape for the model - this comes from the output shape of the MobileNet CNN
num_actions = len(game.key_to_press)
hidden_size = 512

# Setting up the model with Keras: two hidden ReLU layers and a linear output with one Q-value per action.
model = Sequential()
model.add(Dense(hidden_size, input_shape=(input_size,), activation='relu'))
model.add(Dense(hidden_size, activation='relu'))
model.add(Dense(num_actions))
# Plain SGD with mean squared error loss (older Keras releases spell the argument `lr`).
model.compile(optimizer=SGD(learning_rate=0.01), loss="mse")

# Training the model
hist = train(game, model, epochs, verbose=1)

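As a quick sanity check on the size of this network, the arithmetic below counts the trainable parameters of the three dense layers (128 inputs, two hidden layers of 512 units, 4 outputs). The helper name is hypothetical; the layer sizes are the ones given in the code above.

```python
def dense_params(n_in, n_out):
    """A dense layer has one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out


layer_sizes = [128, 512, 512, 4]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [66048, 262656, 2052]
print(sum(per_layer))  # 330756
```

At roughly 330,000 parameters the Q-network is small; most of the per-step cost is likely to come from the CNN that produces the 128-dimensional state and from waiting on the game's animations.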

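This is the starting point for running the code, but you must make sure FIFA 18 is running in windowed mode on a second monitor, with the free-kick practice mode loaded from the Skill Games: Shooting menu. Make sure the game controls match the keys hard-coded in the FIFA.py script.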

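Results

Although the agent has not mastered every kind of free kick, it has learned some situations well. It almost always scores when there is no wall of players, but struggles when a wall is present. Also, because it did not often encounter situations such as not facing the goal during training, it behaves rather foolishly in them. However, I noticed this behavior declining as the number of training epochs increased.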

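The figure above shows the average number of goals scored per free-kick attempt, averaged over 1000 epochs. For example, a value of 0.45 at epoch 700 means that, on average, 45% of attempts around that epoch ended in a goal.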

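As the figure shows, after 1000 epochs of training the average scoring rate rose from 30% to 50%. The bot now scores on about half of its free-kick attempts (for reference, an average human scores 75-80% of the time). FIFA's gameplay is not very deterministic, however, which makes learning harder.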
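One way such a scoring-rate curve could be derived from the win history returned by train() is sketched below. It assumes each epoch is a single free-kick attempt and that the history holds the cumulative goal count, as in the training code; the window size and sample data are made up, and this is not necessarily how the original figure was produced.

```python
def goals_per_epoch(win_hist):
    """Turn a cumulative goal count into per-epoch goals (0 or 1 per attempt)."""
    previous = 0
    goals = []
    for total in win_hist:
        goals.append(total - previous)
        previous = total
    return goals


def moving_average(values, window):
    """Average of the last `window` values, using fewer at the start of the list."""
    averages = []
    running = 0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        averages.append(running / min(i + 1, window))
    return averages


win_hist = [0, 1, 1, 2, 3, 3, 4, 5]  # hypothetical cumulative goals after each epoch
goals = goals_per_epoch(win_hist)
print(goals)
print(moving_average(goals, window=4))
```

A value of 0.75 at the end of this toy run means three of the last four attempts were goals, which is the same reading as the 0.45 example in the text.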

More results: https://www.youtube.com/c/DeepGamingAI

https://v.qq.com/x/page/r06768j6yrv.html

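Conclusion

Overall, I think the results are quite satisfying even though the agent falls short of human-level performance. Moving from supervised learning to reinforcement learning eased the pain of collecting training data. Given enough time to explore, this approach does very well on problems such as learning to play simple games. The reinforcement learning setup seems to fail in unfamiliar situations, however, which leads me to think that formulating the task as a regression problem does not extrapolate to unseen situations as well as the classification formulation used in supervised learning. Perhaps combining the two could address the weaknesses of both approaches, and that may be where we see the best results in building AI for games.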

Original articles: https://towardsdatascience.com/building-a-deep-neural-network-to-play-fifa-18-dce54d45e675

https://towardsdatascience.com/using-deep-q-learning-in-fifa-18-to-perfect-the-art-of-free-kicks-f2e4e979ee66
