Reinforcement Learning in Practice | Playing Tic-Tac-Toe with Tabular Q-Learning (Part 1)

Posted by 埠默笙聲聲聲脈 on 2021-12-07

Original URL: https://www.cnblogs.com/wsy950409/p/15658371.html

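In "Reinforcement Learning in Practice | A Custom Gym Environment for Tic-Tac-Toe", we built a tic-tac-toe environment and tested it. We can now train an agent to play using various reinforcement learning methods. One of the simpler ones is Q-learning. Here Q stands for Q(S, a), the state-action value: the total future reward expected from taking action a in state S. The Q-learning algorithm is as follows:

[Figure: the Q-learning algorithm]

As the algorithm shows, after the agent takes action a in state S, it receives a reward R from the environment and moves to state S'. It then takes the largest Q(S', a') over the actions a' available in S' and uses it to update Q(S, a). Tabular Q-learning simply means building a table of Q(S, a) that stores every state-action value. A good example is "Q-learning plays Flappy Bird": as the game goes on, the Q-table records more and more states, the state-action values become more accurate, and the bird flies better and better.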
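The update described above can be sketched in a few lines of Python. This is a minimal illustration of the standard tabular Q-learning rule, Q(S, a) ← Q(S, a) + α[R + γ·max Q(S', a') − Q(S, a)], and not the code used later in this series. The function name q_update, the learning rate ALPHA, and the discount factor GAMMA are assumptions chosen for the example.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (assumed value, for illustration only)
GAMMA = 0.9  # discount factor (assumed value, for illustration only)


def q_update(Q, s, a, r, s_next, next_actions, done):
    """Apply one tabular Q-learning update for the transition (s, a, r, s_next)."""
    # Value of the best action available in the next state.
    # A terminal state (or one with no legal actions) contributes 0.
    if done or not next_actions:
        best_next = 0.0
    else:
        best_next = max(Q[s_next][a2] for a2 in next_actions)
    td_target = r + GAMMA * best_next
    # Move Q(s, a) a fraction ALPHA of the way toward the target.
    Q[s][a] += ALPHA * (td_target - Q[s][a])
    return Q[s][a]


# Q[state][action] -> value; unseen entries default to 0.0.
Q = defaultdict(lambda: defaultdict(float))
new_value = q_update(Q, 's0', 'a0', 1.0, 's1', ['a0', 'a1'], done=False)
print(new_value)  # 0.1
```

The state and action labels here are placeholder strings; in the tic-tac-toe setting they would be the board and the chosen cell.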

We want to build such a Q-table and look up its stored state-action values as Q_table[state][action]. A dictionary does the job:

'[1, 0, -1, 0, 0, 0, 1, -1, 0]'	{'(0,1)':0, '(1,0)':0, '(1,1)':0, '(1,2)':0, '(2,2)':0}
'[0, 1, 0, -1, 0, 0, -1, 0, 1]'	......
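As a small illustration of this layout, the sketch below builds a one-state table by hand using the first row above, then reads, writes, and ranks its entries. The value 0.5 is made up purely to show the lookup.

```python
# A hand-written Q-table holding a single state, keyed as in the rows above:
# the outer key is the board as a string, the inner keys are the empty cells.
state = '[1, 0, -1, 0, 0, 0, 1, -1, 0]'
Q_table = {
    state: {'(0,1)': 0, '(1,0)': 0, '(1,1)': 0, '(1,2)': 0, '(2,2)': 0},
}

# Write and read one state-action value (0.5 is an arbitrary example value).
Q_table[state]['(1,1)'] = 0.5
value = Q_table[state]['(1,1)']

# Pick the action with the highest stored value for this state.
best_action = max(Q_table[state], key=Q_table[state].get)
print(value, best_action)  # 0.5 (1,1)
```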

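In this article we aim to:

(1) Rewrite the test code from "Reinforcement Learning in Practice | A Custom Gym Environment for Tic-Tac-Toe" so that it is more logically organized and makes the reinforcement learning concepts of agent and environment stand out.
(2) Have the agent move by choosing a random empty cell, and update the Q-table before each move: if the current state is not in the table, add it together with its action values.
(3) Play 50,000 games and check the number of states in the Q-table.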

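Step 1: Create the file

In any directory, create a new file named Table QLearning play TicTacToe.py

Step 2: Create the Agent() class

The Agent() class needs (1) an action-generating function that places a mark at random, (2) a Q-table, and (3) a function that updates the Q-table, with every newly added state-action value set to 0. The code is as follows: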

import random

class Agent():
    def __init__(self):
        self.Q_table = {}

    def getEmptyPos(self, env_): # return the coordinates of the empty cells
        action_space = []
        for i, row in enumerate(env_.state):
            for j, one in enumerate(row):
                if one == 0: action_space.append((i,j))
        return action_space

    def randomAction(self, env_, mark): # choose a random empty cell to play
        actions = self.getEmptyPos(env_)
        action_pos = random.choice(actions)
        action = {'mark':mark, 'pos':action_pos}
        return action

    def updateQtable(self, env_): # update the Q-table
        state = env_.state
        if str(state) not in self.Q_table: # add a state that has not been seen before
            self.Q_table[str(state)] = {}
            actions = self.getEmptyPos(env_)
            for action in actions:
                self.Q_table[str(state)][str(action)] = 0 # new state-action values start at 0
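To see what updateQtable does on a first and a second visit to the same position, here is a stripped-down sketch that needs no Gym environment. It assumes the board is a nested 3x3 list with 0 for an empty cell; the stand-in object env and the helper names empty_positions and add_state are hypothetical, and the real env.state may be a different type.

```python
from types import SimpleNamespace


def empty_positions(board):
    """Return the (row, col) coordinates of every empty cell (value 0)."""
    return [(i, j) for i, row in enumerate(board)
            for j, cell in enumerate(row) if cell == 0]


def add_state(q_table, board):
    """Add the board to the table with zero-valued actions, only if it is new."""
    key = str(board)
    if key not in q_table:
        q_table[key] = {str(pos): 0 for pos in empty_positions(board)}
    return q_table[key]


# Hypothetical stand-in for the environment: only the .state attribute is used.
env = SimpleNamespace(state=[[1, 0, -1], [0, 0, 0], [1, -1, 0]])

q = {}
add_state(q, env.state)                 # first visit: state and actions are added
q[str(env.state)]['(1, 1)'] = 0.5       # pretend a value was learned
add_state(q, env.state)                 # second visit: nothing is overwritten
print(q[str(env.state)])
```

Because the keys are produced with str(), the tuple (1, 1) is stored as '(1, 1)' with a space after the comma, so lookups should build keys the same way instead of typing them by hand.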

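Step 3: Create the Game() class

The Game() class needs (1) attributes that toggle whether the game is displayed and set the interval between moves, (2) a random choice of who moves first at the start of each game, (3) a function that switches the side to move, and (4) the ability to start a new game when one ends. The code is as follows: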

import random
import time

class Game():
    def __init__(self, env):
        self.INTERVAL = 0 # interval between moves, in seconds
        self.RENDER = False # whether to display the game
        self.first = 'blue' if random.random() > 0.5 else 'red' # randomly pick who moves first
        self.currentMove = self.first # side to move
        self.env = env
        self.agent = Agent()

    def switchMove(self): # switch the side to move
        move = self.currentMove
        if move == 'blue': self.currentMove = 'red'
        elif move == 'red': self.currentMove = 'blue'

    def newGame(self): # start a new game
        self.first = 'blue' if random.random() > 0.5 else 'red'
        self.currentMove = self.first
        self.env.reset()

    def run(self): # play one game
        self.env.reset() # reset the environment before the first step, otherwise an error is raised
        while True:
            if self.currentMove == 'blue': self.agent.updateQtable(self.env) # only record positions from blue's point of view
            action = self.agent.randomAction(self.env, self.currentMove)
            state, reward, done, info = self.env.step(action)
            if self.RENDER: self.env.render()
            self.switchMove()
            time.sleep(self.INTERVAL)
            if done:
                self.newGame()
                if self.RENDER: self.env.render()
                time.sleep(self.INTERVAL)
                break

Step 4: Test

(1) Play one game, then print the Q-table and the number of states stored in it:

env = gym.make('TicTacToeEnv-v0')
game = Game(env)
for i in range(1):
    game.run()

for state in game.agent.Q_table:
    print(state)
    for action in game.agent.Q_table[state]:
        print(action, ': ', game.agent.Q_table[state][action])
    print('--------------')

print('dim of state: ', len(game.agent.Q_table))

Output:

[Figure: console output]

(2) Play 50,000 games and check the number of states stored in the Q-table:

env = gym.make('TicTacToeEnv-v0')
game = Game(env)
for i in range(50000):
    game.run()

print('dim of state: ', len(game.agent.Q_table))

Output:

[Figure: console output]
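How many states the table ends up holding depends on which positions the random games happen to reach. As a rough cross-check, the sketch below enumerates every non-terminal board in which it is blue's turn. It assumes standard tic-tac-toe rules, that either side may move first, and that blue is stored as 1 and red as -1; the mark encoding is an assumption, since the environment's actual representation is defined in the earlier article. The helper names winner and blue_to_move_states are hypothetical. Under those assumptions, the resulting count is an upper bound on the number of entries the Q-table can hold.

```python
# The eight winning lines on a board flattened to 9 cells (row-major order).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

BLUE, RED = 1, -1  # assumed encoding of the two marks; 0 means empty


def winner(board):
    """Return BLUE or RED if that side has three in a row, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


def blue_to_move_states():
    """Enumerate all non-terminal boards reachable with blue to move."""
    visited = set()    # (board, side to move) pairs already expanded
    recorded = set()   # boards in which blue is about to move
    empty = (0,) * 9
    stack = [(empty, BLUE), (empty, RED)]  # either side may open the game
    while stack:
        board, player = stack.pop()
        if (board, player) in visited:
            continue
        visited.add((board, player))
        # Terminal positions (win or full board) are never recorded or expanded.
        if winner(board) != 0 or 0 not in board:
            continue
        if player == BLUE:
            recorded.add(board)
        for i in range(9):
            if board[i] == 0:
                nxt = board[:i] + (player,) + board[i + 1:]
                stack.append((nxt, -player))
    return recorded


print('blue-to-move states:', len(blue_to_move_states()))
```

If the count printed after 50,000 random games is close to this figure, the random play has visited nearly every position the agent can face.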

The complete code is as follows:

import gym
import random
import time

# List all registered environments
# from gym import envs
# print(envs.registry.all()) 

class Game():
    def __init__(self, env):
        self.INTERVAL = 0 # interval between moves, in seconds
        self.RENDER = False # whether to display the game
        self.first = 'blue' if random.random() > 0.5 else 'red' # randomly pick who moves first
        self.currentMove = self.first
        self.env = env
        self.agent = Agent()

    def switchMove(self): # switch the side to move
        move = self.currentMove
        if move == 'blue': self.currentMove = 'red'
        elif move == 'red': self.currentMove = 'blue'

    def newGame(self): # start a new game
        self.first = 'blue' if random.random() > 0.5 else 'red'
        self.currentMove = self.first
        self.env.reset()

    def run(self): # play one game
        self.env.reset() # reset the environment before the first step, otherwise an error is raised
        while True:
            if self.currentMove == 'blue': self.agent.updateQtable(self.env) # only record positions from blue's point of view
            action = self.agent.randomAction(self.env, self.currentMove)
            state, reward, done, info = self.env.step(action)
            if self.RENDER: self.env.render()
            self.switchMove()
            time.sleep(self.INTERVAL)
            if done:
                self.newGame()
                if self.RENDER: self.env.render()
                time.sleep(self.INTERVAL)
                break
                    
class Agent():
    def __init__(self):
        self.Q_table = {}

    def getEmptyPos(self, env_): # return the coordinates of the empty cells
        action_space = []
        for i, row in enumerate(env_.state):
            for j, one in enumerate(row):
                if one == 0: action_space.append((i,j))
        return action_space

    def randomAction(self, env_, mark): # choose a random empty cell to play
        actions = self.getEmptyPos(env_)
        action_pos = random.choice(actions)
        action = {'mark':mark, 'pos':action_pos}
        return action

    def updateQtable(self, env_): # add unseen states to the Q-table with zero-valued actions
        state = env_.state
        if str(state) not in self.Q_table:
            self.Q_table[str(state)] = {}
            actions = self.getEmptyPos(env_)
            for action in actions:
                self.Q_table[str(state)][str(action)] = 0
            

env = gym.make('TicTacToeEnv-v0')
game = Game(env)
for i in range(1):
    game.run()

for state in game.agent.Q_table:
    print(state)
    for action in game.agent.Q_table[state]:
        print(action, ': ', game.agent.Q_table[state][action])
    print('--------------')

print('dim of state: ', len(game.agent.Q_table))


