RL Introduction

Posted by Blackteaxx on 2024-06-06

References

Hands-on Reinforcement Learning (動手學強化學習)
drl

MDP

A Markov Decision Process (MDP) is a five-tuple \(\langle S, A, T, R, \gamma \rangle\):

  • \(S\) is the state space
  • \(A\) is the action space
  • \(T: S \times A \times S \to \mathbb{R}\) is the transition probability function; \(T(s, a, s')\) is the probability of transitioning to state \(s'\) after taking action \(a\) in state \(s\)
  • \(R: S \times A \times S \to \mathbb{R}\) is the reward function; \(R(s, a, s')\) is the reward for taking action \(a\) in state \(s\) and landing in state \(s'\)
  • \(\gamma\) is the discount factor, which balances the importance of immediate rewards against future rewards

The goal of an MDP is to find a policy \(\pi: S \to A\) that maximizes the expected return \(U_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\). Note that a trajectory of the MDP is a stochastic process, so we must work with the expectation of the return.
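
As a concrete illustration, the discounted return of a single trajectory can be accumulated backwards over its reward sequence; below is a minimal sketch (the reward values are made up for illustration):

def discounted_return(rewards, gamma=0.9):
    """Compute U_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one trajectory."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# made-up reward sequence: 0 + 0.9*0 + 0.9**2 * 1 + 0.9**3 * 10 = 8.1
print(discounted_return([0, 0, 1, 10], gamma=0.9))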

Bellman Equation:

\[Q(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a] = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V(s')] \]

\[V(s) = \mathbb{E}[U_t \mid S_t = s] = \sum_{a} \pi(a|s) \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V(s')] \]
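
To make the expectation backup concrete, here is a minimal sketch on a made-up two-state, two-action MDP (the arrays T and R and the uniform policy below are my own illustrative assumptions, not from the text):

import numpy as np

# toy MDP: T[s, a, s'] is the transition probability, R[s, a, s'] the reward
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.0]]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # pi[s, a]: uniform random policy
gamma = 0.9
V = np.zeros(2)

# one Bellman expectation backup:
# V(s) = sum_a pi(a|s) sum_s' T(s, a, s') [R(s, a, s') + gamma V(s')]
V_new = np.array([
    sum(pi[s, a] * sum(T[s, a, sp] * (R[s, a, sp] + gamma * V[sp]) for sp in range(2))
        for a in range(2))
    for s in range(2)
])
print(V_new)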

Bellman Optimality Equation:

\[Q^*(s, a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')] \]

\[V^*(s) = \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')] \]

From these, we can derive the following form of the optimality equations:

\[Q^*(s, a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \max_{a'} Q^*(s', a')] \]

\[V^*(s) = \max_a Q^*(s, a) \]
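
Continuing the toy MDP from the previous sketch (reusing its T, R, gamma, and V), a single optimality backup and the identity \(V^*(s) = \max_a Q^*(s, a)\) look like this; again only an illustrative sketch:

# one Bellman optimality backup on the toy MDP above (not yet converged)
Q1 = np.array([
    [sum(T[s, a, sp] * (R[s, a, sp] + gamma * V[sp]) for sp in range(2)) for a in range(2)]
    for s in range(2)
])
V1 = Q1.max(axis=1)         # at the fixed point this equals V*(s) = max_a Q*(s, a)
greedy = Q1.argmax(axis=1)  # greedy action in each state
print(Q1, V1, greedy)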

When the reward function and the transition probabilities are both known (the model-based setting), several solution methods are available:

Value Iteration

\[V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V_k(s')] = \max_a Q_k(s, a) \]

Following this equation, each iteration sweeps over all states, computes the maximum estimated Q-value for every state, and uses it to update V, repeating until convergence.

import numpy as np

def value_iteration(env:GridWorld, gamma=0.9, theta=1e-6):
    """
    Value iteration algorithm for solving a given environment.

    Parameters:
    env: GridWorld
        An instance of the environment, which includes the state space, action space, rewards, and transition dynamics.

    gamma: float, optional, default=0.9
        The discount factor. It balances the importance of immediate rewards versus future rewards, ranging from 0 to 1. A higher gamma value makes the agent focus more on long-term rewards.

    theta: float, optional, default=1e-6
        The convergence threshold. It determines the stopping criterion for the value iteration algorithm. The algorithm stops when the maximum change in the value function is less than theta.

    Returns:
    policy: numpy.ndarray
        A policy matrix of shape (env.size, env.size), storing the optimal action for each state.
    """
    # initialize the value function and policy
    V = np.zeros((env.size, env.size))
    policy = np.zeros((env.size, env.size), dtype=int)
    ####
    # Implement the value iteration algorithm here

    iterations = 0

    while True:
        updated_V = V.copy()

        iterations += 1

        for now_state_x in range(env.size):
            for now_state_y in range(env.size):
                Q_values = []
                env.state = (now_state_x, now_state_y)

                for action in range(4):

                    # get s' and reward
                    next_state, reward = env.step(action=action)

                    next_state_x, next_state_y = next_state

                    # calc Q_value
                    Q_value = reward + gamma * V[next_state_x, next_state_y]

                    Q_values.append(Q_value)

                    # reset now_state
                    env.state = (now_state_x, now_state_y)

                # find max Q
                max_Q = max(Q_values)

                updated_V[now_state_x, now_state_y] = max_Q
                policy[now_state_x, now_state_y] = Q_values.index(max_Q)

        # stop once the largest change over all states is below theta
        delta = np.amax(np.fabs(updated_V - V))
        V = updated_V
        if delta <= theta:
            print('Value iteration converged at iteration #%d.' % iterations)
            break
    ####

    env.reset()
    return policy
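
A hypothetical usage, assuming the GridWorld class from the assignment scaffold can be constructed with no arguments (the constructor call below is my assumption):

env = GridWorld()                         # assumed constructor from the scaffold
optimal_policy = value_iteration(env, gamma=0.9, theta=1e-6)
print(optimal_policy)                     # one action index (0-3) per grid cell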

Policy Iteration

Starting from an initial policy, we first evaluate it, then improve it, evaluate the improved policy, improve it again, and so on. Repeating this until the policy converges is known as policy iteration.

  • Policy Evaluation

From the Bellman expectation equation we obtain the iterative update

\[V_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V_k(s')] \]

We know that \(V^k = V^\pi\) is a fixed point of this update,

so when the iteration converges we obtain the state-value function under the given policy.

def policy_evaluation(policy:np.ndarray, env:GridWorld, gamma=0.9, theta=1e-6):
    """
    Evaluate a policy given an environment.

    Parameters:
    policy: numpy.ndarray
        A matrix representing the policy. Each entry contains an action to take at that state.

    env: ComplexGridWorld
        An instance of the environment, which includes the state space, action space, rewards, and transition dynamics.

    gamma: float, optional, default=0.9
        The discount factor. It balances the importance of immediate rewards versus future rewards, ranging from 0 to 1.

    theta: float, optional, default=1e-6
        A threshold for the evaluation's convergence. When the change in value function is less than theta for all states, the evaluation stops.

    Returns:
    V: numpy.ndarray
        A value function representing the expected return for each state under the given policy.
    """
    V = np.zeros((env.size, env.size))
    ####
    # Implement the policy evaluation algorithm here

    while True:

        updated_V = V.copy()

        for now_state_x in range(env.size):
            for now_state_y in range(env.size):

                env.state = (now_state_x, now_state_y)

                # take the action prescribed by the policy and observe s' and reward
                action = policy[now_state_x, now_state_y]

                next_state, reward = env.step(action=action)

                updated_V[now_state_x, now_state_y] = reward + gamma * V[next_state[0], next_state[1]]

        # stop once the largest change over all states is below theta
        delta = np.amax(np.fabs(updated_V - V))
        V = updated_V
        if delta <= theta:
            break

    ####

    return V
  • Policy Improvement

Suppose that, on top of the current state-value function, for every state we can find an action \(a\) such that \(Q^\pi(s, a) \geq V^\pi(s)\); acting this way yields a higher return.

If we can then find a new policy \(\pi'\) such that \(V^{\pi'}(s) \geq V^\pi(s)\) for every state, we have obtained a better (or at least as good) policy.

We can therefore greedily pick, in each state, the action with the largest action value:

\[\pi'(s) = \arg \max_a Q^\pi(s, a) = \arg \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^\pi(s')] \]

def policy_iteration(env:GridWorld, gamma=0.9, theta=1e-6):
    """
    Perform policy iteration to find the optimal policy for a given environment.

    Parameters:
    env: ComplexGridWorld
        An instance of the environment, which includes the state space, action space, rewards, and transition dynamics.

    gamma: float, optional, default=0.9
        The discount factor. It balances the importance of immediate rewards versus future rewards, ranging from 0 to 1.

    theta: float, optional, default=1e-6
        A threshold for the evaluation's convergence. When the change in value function is less than theta for all states, the evaluation stops.

    Returns:
    policy: numpy.ndarray
        A matrix representing the optimal policy. Each entry contains the best action to take at that state.
    """
    policy = np.zeros((env.size, env.size), dtype=int)

    ####
    # Implement the policy iteration algorithm here

    while True:

        V = policy_evaluation(policy=policy, env=env, gamma=gamma, theta=theta)

        policy_stable = True

        for now_state_x in range(env.size):
            for now_state_y in range(env.size):
                Q_values = []
                env.state = (now_state_x, now_state_y)

                for action in range(4):

                    # get s' and reward
                    next_state, reward = env.step(action=action)

                    next_state_x, next_state_y = next_state

                    # calc Q_value
                    Q_value = reward + gamma * V[next_state_x, next_state_y]

                    Q_values.append(Q_value)

                    # reset now_state
                    env.state = (now_state_x, now_state_y)

                # update policy
                max_Q = max(Q_values)
                now_action = policy[now_state_x, now_state_y]
                new_action = Q_values.index(max_Q)

                if now_action != new_action:
                    policy_stable = False
                    policy[now_state_x, now_state_y] = new_action

        if policy_stable:
            break

    ####
    env.reset()
    return policy

State-Action-Reward-State-Action (SARSA)

A table is indexed by all states and actions; each Q-value in it represents the value of taking a given action in a given state, and we obtain the optimal policy by repeatedly updating this table.

The values in the table are determined by the policy: when the policy changes, the table values change as well.

\[Q^\pi(s_t, a_t) = \mathbb{E}[R_{t} + \gamma Q^\pi(s_{t+1}, a_{t+1}) | S_t = s_t, A_t = a_t] \]

Both sides of this equation can be computed, and both are estimates of the Q-value, so we can update the table iteratively.

That is, using the observed \(r_t\) and \(s_{t+1}\), together with the sampled next action \(a_{t+1}\), we form the TD target \(r_t + \gamma q(s_{t+1}, a_{t+1})\).

Following the TD idea, the update is \(q(s_t, a_t) \leftarrow (1-\alpha)\, q(s_t, a_t) + \alpha \left[ r_t + \gamma\, q(s_{t+1}, a_{t+1}) \right]\), i.e. \(q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha \left[ r_t + \gamma\, q(s_{t+1}, a_{t+1}) - q(s_t, a_t) \right]\).

SARSA uses the five-tuple \((s_t, a_t, r_t, s_{t+1}, a_{t+1})\), hence its name, and the table is updated by repeating this step.

When sampling actions, an \(\epsilon\)-greedy policy is used: with probability \(\epsilon\) a random action is chosen, and with probability \(1-\epsilon\) the current greedy action is chosen.

\[a = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg \max_a Q(s, a) & \text{with probability } 1-\epsilon \end{cases} \]
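
A minimal sketch of such an action-selection helper (the function name, signature, and the 4-action assumption are mine, for illustration only):

import random
import numpy as np

def epsilon_greedy(q_table, state, epsilon=0.1, n_actions=4):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.uniform(0, 1) <= epsilon:
        return random.randint(0, n_actions - 1)
    x, y = state
    return int(np.argmax(q_table[x, y]))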

def sarsa(env:GridWorld, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """
    SARSA algorithm for training an agent in a given environment.

    Parameters:
    env: ComplexGridWorld
        An instance of the environment, which includes the state space, action space, rewards, and transition dynamics.

    episodes: int, optional, default=1000
        The number of episodes for training. In each episode, the agent starts from the initial state and interacts with the environment until it reaches the goal or the episode terminates.

    alpha: float, optional, default=0.1
        The learning rate. It determines the step size for updating the Q-values, ranging from 0 to 1. A higher alpha value means faster learning but may lead to instability.

    gamma: float, optional, default=0.9
        The discount factor. It balances the importance of immediate rewards versus future rewards, ranging from 0 to 1. A higher gamma value makes the agent focus more on long-term rewards.

    epsilon: float, optional, default=0.1
        The exploration rate. It determines the probability of the agent taking a random action, used to balance exploration and exploitation. A higher epsilon value makes the agent explore more.

    Returns:
    q_table: numpy.ndarray
        A Q-value table of shape (env.size, env.size, 4), storing the Q-values for each state-action pair.
    """
    q_table = np.zeros((env.size, env.size, 4))

    ####
    # Implement the SARSA algorithm here

    import random

    for _ in range(episodes):

        x, y = env.start
        policy = extract_policy(q_table=q_table)
        action = policy[x, y]
        # 抽樣
        # epsilon-greedy: with probability epsilon take a random action instead
        if random.uniform(0, 1) <= epsilon:
            action = random.randint(0, 3)

        while True:
            policy = extract_policy(q_table=q_table)

            env.state = (x, y)
            new_state, reward = env.step(action=action)

            new_action = policy[new_state[0], new_state[1]]

            if random.uniform(0, 1) <= epsilon:
                new_action = random.randint(0, 3)

            q_table[x, y, action] += alpha * (reward + gamma * q_table[new_state[0], new_state[1], new_action] - q_table[x, y, action])

            x, y = new_state
            action = new_action

            # stop when the goal state is reached
            if new_state == env.goal:
                break

    ####

    env.reset()
    return q_table
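
Both sarsa above and q_learning below call extract_policy, which is assumed to be provided by the assignment scaffold. A minimal sketch of what it presumably does (the greedy action for each grid cell) is:

def extract_policy(q_table):
    """Greedy policy: for each cell, pick the action with the largest Q-value."""
    return np.argmax(q_table, axis=2)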

Q-Learning

Q-learning is a model-free method: it does not need the environment's transition probabilities, only the rewards observed from interaction.

Based on the TD idea, the Q-values are updated iteratively toward the Bellman optimality target \(r_t + \gamma \max_{a'} Q(s_{t+1}, a')\):

\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \]

import random

def q_learning(env:GridWorld, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """
    Q-learning algorithm for training an agent in a given environment.

    Parameters:
    env: ComplexGridWorld
        An instance of the environment, which includes the state space, action space, rewards, and transition dynamics.

    episodes: int, optional, default=1000
        The number of episodes for training. In each episode, the agent starts from the initial state and interacts with the environment until it reaches the goal or the episode terminates.

    alpha: float, optional, default=0.1
        The learning rate. It determines the step size for updating the Q-values, ranging from 0 to 1. A higher alpha value means faster learning but may lead to instability.

    gamma: float, optional, default=0.9
        The discount factor. It balances the importance of immediate rewards versus future rewards, ranging from 0 to 1. A higher gamma value makes the agent focus more on long-term rewards.

    epsilon: float, optional, default=0.1
        The exploration rate. It determines the probability of the agent taking a random action, used to balance exploration and exploitation. A higher epsilon value makes the agent explore more.

    Returns:
    q_table: numpy.ndarray
        A Q-value table of shape (env.size, env.size, 4), storing the Q-values for each state-action pair.
    """
    q_table = np.zeros((env.size, env.size, 4))
    ####
    # Implement the Q-learning algorithm here

    for _ in range(episodes):
        x, y = env.start

        while True:
            policy = extract_policy(q_table=q_table)
            action = policy[x, y]

            if random.uniform(0,1) <= epsilon:
                action = random.randint(0,3)

            env.state = (x, y)
            new_state, reward = env.step(action=action)

            q_table[x, y, action] += alpha * (reward + gamma *
                                              np.amax(q_table[new_state[0], new_state[1]]) - q_table[x, y, action])

            x, y = new_state

            if new_state == env.goal:
                break
    ####

    env.reset()
    return q_table
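
A hypothetical end-to-end usage, again assuming the no-argument GridWorld constructor from the scaffold and the extract_policy helper sketched above:

env = GridWorld()                           # assumed constructor
q_table = q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1)
learned_policy = extract_policy(q_table)    # greedy policy from the learned Q-table
print(learned_policy)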
