RL Introduction

Posted by Blackteaxx on 2024-06-06

References

Hands-on Reinforcement Learning (動手學強化學習)
drl

MDP

A Markov Decision Process (MDP) is a five-tuple \(\langle S, A, T, R, \gamma \rangle\):

  • \(S\) is the state space
  • \(A\) is the action space
  • \(T: S \times A \times S \to \mathbb{R}\) is the transition probability function; \(T(s, a, s')\) is the probability of transitioning to state \(s'\) after taking action \(a\) in state \(s\)
  • \(R: S \times A \times S \to \mathbb{R}\) is the reward function; \(R(s, a, s')\) is the reward for taking action \(a\) in state \(s\) and landing in state \(s'\)
  • \(\gamma\) is the discount factor, which balances the importance of immediate rewards against future rewards

The goal of an MDP is to find a policy \(\pi: S \to A\) that maximizes the expected return \(U_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\). Note that a trajectory of the MDP is a stochastic process, so we must work with the expectation of the return.
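
As a concrete illustration, the discounted return of a single trajectory can be accumulated backwards over its reward sequence; below is a minimal sketch (the reward values are made up for illustration):

def discounted_return(rewards, gamma=0.9):
    """Compute U_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one trajectory."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# made-up reward sequence: 0 + 0.9*0 + 0.9**2 * 1 + 0.9**3 * 10 = 8.1
print(discounted_return([0, 0, 1, 10], gamma=0.9))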

Bellman Equation:

\[Q(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a] = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V(s')] \]

\[V(s) = \mathbb{E}[U_t \mid S_t = s] = \sum_{a} \pi(a|s) \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V(s')] \]
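
To make the expectation backup concrete, here is a minimal sketch on a made-up two-state, two-action MDP (the arrays T and R and the uniform policy below are my own illustrative assumptions, not from the text):

import numpy as np

# toy MDP: T[s, a, s'] is the transition probability, R[s, a, s'] the reward
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.0]]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # pi[s, a]: uniform random policy
gamma = 0.9
V = np.zeros(2)

# one Bellman expectation backup:
# V(s) = sum_a pi(a|s) sum_s' T(s, a, s') [R(s, a, s') + gamma V(s')]
V_new = np.array([
    sum(pi[s, a] * sum(T[s, a, sp] * (R[s, a, sp] + gamma * V[sp]) for sp in range(2))
        for a in range(2))
    for s in range(2)
])
print(V_new)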

Bellman Optimality Equation:

\[Q^*(s, a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')] \]

\[V^*(s) = \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')] \]

From these, we can derive the following form of the optimality equations:

\[Q^*(s, a) = \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma \max_{a'} Q^*(s', a')] \]

\[V^*(s) = \max_a Q^*(s, a) \]
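
Continuing the toy MDP from the previous sketch (reusing its T, R, gamma, and V), a single optimality backup and the identity \(V^*(s) = \max_a Q^*(s, a)\) look like this; again only an illustrative sketch:

# one Bellman optimality backup on the toy MDP above (not yet converged)
Q1 = np.array([
    [sum(T[s, a, sp] * (R[s, a, sp] + gamma * V[sp]) for sp in range(2)) for a in range(2)]
    for s in range(2)
])
V1 = Q1.max(axis=1)         # at the fixed point this equals V*(s) = max_a Q*(s, a)
greedy = Q1.argmax(axis=1)  # greedy action in each state
print(Q1, V1, greedy)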

When the reward function and the transition probabilities are both known (the model-based setting), several solution methods are available:

Value Iteration

\[V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V_k(s')] = \max_a Q_k(s, a) \]

Following this equation, each iteration sweeps over all states, computes the maximum estimated Q-value for every state, and uses it to update V, repeating until convergence.

import numpy as np

def value_iteration(env:GridWorld, gamma=0.9, theta=1e-6):
    """
    Value iteration algorithm for solving a given environment.

    Parameters:
    env: GridWorld
        An instance of the environment, which includes the state space, action space, rewards, and transition dynamics.

    gamma: float, optional, default=0.9
        The discount factor. It balances the importance of immediate rewards versus future rewards, ranging from 0 to 1. A higher gamma value makes the agent focus more on long-term rewards.

    theta: float, optional, default=1e-6
        The convergence threshold. It determines the stopping criterion for the value iteration algorithm. The algorithm stops when the maximum change in the value function is less than theta.

    Returns:
    policy: numpy.ndarray
        A policy matrix of shape (env.size, env.size), storing the optimal action for each state.
    """
    # initialize the value function and policy
    V = np.zeros((env.size, env.size))
    policy = np.zeros((env.size, env.size), dtype=int)
    ####
    # Implement the value iteration algorithm here

    iterations = 0

    while True:
        updated_V = V.copy()

        iterations += 1

        for now_state_x in range(env.size):
            for now_state_y in range(env.size):
                Q_values = []
                env.state = (now_state_x, now_state_y)

                for action in range(4):

                    # get s' and reward
                    next_state, reward = env.step(action=action)

                    next_state_x, next_state_y = next_state

                    # calc Q_value
                    Q_value = reward + gamma * V[next_state_x, next_state_y]

                    Q_values.append(Q_value)

                    # reset now_state
                    env.state = (now_state_x, now_state_y)

                # find max Q
                max_Q = max(Q_values)

                updated_V[now_state_x, now_state_y] = max_Q
                policy[now_state_x, now_state_y] = Q_values.index(max_Q)

        # stop once the largest change over all states is below theta
        delta = np.amax(np.fabs(updated_V - V))
        V = updated_V
        if delta <= theta:
            print('Value iteration converged at iteration #%d.' % iterations)
            break
    ####

    env.reset()
    return policy
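
A hypothetical usage, assuming the GridWorld class from the assignment scaffold can be constructed with no arguments (the constructor call below is my assumption):

env = GridWorld()                         # assumed constructor from the scaffold
optimal_policy = value_iteration(env, gamma=0.9, theta=1e-6)
print(optimal_policy)                     # one action index (0-3) per grid cell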

Policy Iteration

Starting from an initial policy, we first evaluate it, then improve it, evaluate the improved policy, improve it again, and so on. Repeating this until the policy converges is known as policy iteration.

  • Policy Evaluation

From the Bellman expectation equation we obtain the iterative update

\[V_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V_k(s')] \]

We know that \(V^k = V^\pi\) is a fixed point of this update,

so when the iteration converges we obtain the state-value function under the given policy.

def policy_evaluation(policy:np.ndarray, env:GridWorld, gamma=0.9, theta=1e-6):
    """
    Evaluate a policy given an environment.

    Parameters:
    policy: numpy.ndarray
        A matrix representing the policy. Each entry contains an action to take at that state.

    env: ComplexGridWorld
        An instance of the environment, which includes the state space, action space, rewards, and transition dynamics.

    gamma: float, optional, default=0.9
        The discount factor. It balances the importance of immediate rewards versus future rewards, ranging from 0 to 1.

    theta: float, optional, default=1e-6
        A threshold for the evaluation's convergence. When the change in value function is less than theta for all states, the evaluation stops.

    Returns:
    V: numpy.ndarray
        A value function representing the expected return for each state under the given policy.
    """
    V = np.zeros((env.size, env.size))
    ####
    # Implement the policy evaluation algorithm here

    while True:

        updated_V = V.copy()

        for now_state_x in range(env.size):
            for now_state_y in range(env.size):

                env.state = (now_state_x, now_state_y)

                # take the action prescribed by the policy and observe s' and reward
                action = policy[now_state_x, now_state_y]

                next_state, reward = env.step(action=action)

                updated_V[now_state_x, now_state_y] = reward + gamma * V[next_state[0], next_state[1]]

        # stop once the largest change over all states is below theta
        delta = np.amax(np.fabs(updated_V - V))
        V = updated_V
        if delta <= theta:
            break

    ####

    return V
  • Policy Improvement

Suppose that, on top of the current state-value function, for every state we can find an action \(a\) such that \(Q^\pi(s, a) \geq V^\pi(s)\); acting this way yields a higher return.

If we can then find a new policy \(\pi'\) such that \(V^{\pi'}(s) \geq V^\pi(s)\) for every state, we have obtained a better (or at least as good) policy.

We can therefore greedily pick, in each state, the action with the largest action value:

\[\pi'(s) = \arg \max_a Q^\pi(s, a) = \arg \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^\pi(s')] \]

def policy_iteration(env:GridWorld, gamma=0.9, theta=1e-6):
    """
    Perform policy iteration to find the optimal policy for a given environment.

    Parameters:
    env: ComplexGridWorld
        An instance of the environment, which includes the state space, action space, rewards, and transition dynamics.

    gamma: float, optional, default=0.9
        The discount factor. It balances the importance of immediate rewards versus future rewards, ranging from 0 to 1.

    theta: float, optional, default=1e-6
        A threshold for the evaluation's convergence. When the change in value function is less than theta for all states, the evaluation stops.

    Returns:
    policy: numpy.ndarray
        A matrix representing the optimal policy. Each entry contains the best action to take at that state.
    """
    policy = np.zeros((env.size, env.size), dtype=int)

    ####
    # Implement the policy iteration algorithm here

    while True:

        V = policy_evaluation(policy=policy, env=env, gamma=gamma, theta=theta)

        policy_stable = True

        for now_state_x in range(env.size):
            for now_state_y in range(env.size):
                Q_values = []
                env.state = (now_state_x, now_state_y)

                for action in range(4):

                    # get s' and reward
                    next_state, reward = env.step(action=action)

                    next_state_x, next_state_y = next_state

                    # calc Q_value
                    Q_value = reward + gamma * V[next_state_x, next_state_y]

                    Q_values.append(Q_value)

                    # reset now_state
                    env.state = (now_state_x, now_state_y)

                # update policy
                max_Q = max(Q_values)
                now_action = policy[now_state_x, now_state_y]
                new_action = Q_values.index(max_Q)

                if now_action != new_action:
                    policy_stable = False
                    policy[now_state_x, now_state_y] = new_action

        if policy_stable:
            break

    ####
    env.reset()
    return policy

State-Action-Reward-State-Action (SARSA)

A table is indexed by all states and actions; each Q-value in it represents the value of taking a given action in a given state, and we obtain the optimal policy by repeatedly updating this table.

The values in the table are determined by the policy: when the policy changes, the table values change as well.

\[Q^\pi(s_t, a_t) = \mathbb{E}[R_{t} + \gamma Q^\pi(s_{t+1}, a_{t+1}) | S_t = s_t, A_t = a_t] \]

Both sides of this equation can be computed, and both are estimates of the Q-value, so we can update the table iteratively.

That is, using the observed \(r_t\) and \(s_{t+1}\), together with the sampled next action \(a_{t+1}\), we form the TD target \(r_t + \gamma q(s_{t+1}, a_{t+1})\).

Following the TD idea, the update is \(q(s_t, a_t) \leftarrow (1-\alpha)\, q(s_t, a_t) + \alpha \left[ r_t + \gamma\, q(s_{t+1}, a_{t+1}) \right]\), i.e. \(q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha \left[ r_t + \gamma\, q(s_{t+1}, a_{t+1}) - q(s_t, a_t) \right]\).

SARSA uses the five-tuple \((s_t, a_t, r_t, s_{t+1}, a_{t+1})\), hence its name, and the table is updated by repeating this step.

When sampling actions, an \(\epsilon\)-greedy policy is used: with probability \(\epsilon\) a random action is chosen, and with probability \(1-\epsilon\) the current greedy action is chosen.

\[a = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg \max_a Q(s, a) & \text{with probability } 1-\epsilon \end{cases} \]
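
A minimal sketch of such an action-selection helper (the function name, signature, and the 4-action assumption are mine, for illustration only):

import random
import numpy as np

def epsilon_greedy(q_table, state, epsilon=0.1, n_actions=4):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.uniform(0, 1) <= epsilon:
        return random.randint(0, n_actions - 1)
    x, y = state
    return int(np.argmax(q_table[x, y]))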

def sarsa(env:GridWorld, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """
    SARSA algorithm for training an agent in a given environment.

    Parameters:
    env: ComplexGridWorld
        An instance of the environment, which includes the state space, action space, rewards, and transition dynamics.

    episodes: int, optional, default=1000
        The number of episodes for training. In each episode, the agent starts from the initial state and interacts with the environment until it reaches the goal or the episode terminates.

    alpha: float, optional, default=0.1
        The learning rate. It determines the step size for updating the Q-values, ranging from 0 to 1. A higher alpha value means faster learning but may lead to instability.

    gamma: float, optional, default=0.9
        The discount factor. It balances the importance of immediate rewards versus future rewards, ranging from 0 to 1. A higher gamma value makes the agent focus more on long-term rewards.

    epsilon: float, optional, default=0.1
        The exploration rate. It determines the probability of the agent taking a random action, used to balance exploration and exploitation. A higher epsilon value makes the agent explore more.

    Returns:
    q_table: numpy.ndarray
        A Q-value table of shape (env.size, env.size, 4), storing the Q-values for each state-action pair.
    """
    q_table = np.zeros((env.size, env.size, 4))

    ####
    # Implement the SARSA algorithm here

    import random

    for _ in range(episodes):

        x, y = env.start
        policy = extract_policy(q_table=q_table)
        action = policy[x, y]
        # 抽樣
        # epsilon-greedy: with probability epsilon take a random action instead
        if random.uniform(0, 1) <= epsilon:
            action = random.randint(0, 3)

        while True:
            policy = extract_policy(q_table=q_table)

            env.state = (x, y)
            new_state, reward = env.step(action=action)

            new_action = policy[new_state[0], new_state[1]]

            if random.uniform(0, 1) <= epsilon:
                new_action = random.randint(0, 3)

            q_table[x, y, action] += alpha * (reward + gamma * q_table[new_state[0], new_state[1], new_action] - q_table[x, y, action])

            x, y = new_state
            action = new_action

            # stop when the goal state is reached
            if new_state == env.goal:
                break

    ####

    env.reset()
    return q_table
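
Both sarsa above and q_learning below call extract_policy, which is assumed to be provided by the assignment scaffold. A minimal sketch of what it presumably does (the greedy action for each grid cell) is:

def extract_policy(q_table):
    """Greedy policy: for each cell, pick the action with the largest Q-value."""
    return np.argmax(q_table, axis=2)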

Q-Learning

Q-learning is a model-free method: it does not need the environment's transition probabilities, only the rewards observed from interaction.

Based on the TD idea, the Q-values are updated iteratively toward the Bellman optimality target \(r_t + \gamma \max_{a'} Q(s_{t+1}, a')\):

\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \]

import random

def q_learning(env:GridWorld, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """
    Q-learning algorithm for training an agent in a given environment.

    Parameters:
    env: ComplexGridWorld
        An instance of the environment, which includes the state space, action space, rewards, and transition dynamics.

    episodes: int, optional, default=1000
        The number of episodes for training. In each episode, the agent starts from the initial state and interacts with the environment until it reaches the goal or the episode terminates.

    alpha: float, optional, default=0.1
        The learning rate. It determines the step size for updating the Q-values, ranging from 0 to 1. A higher alpha value means faster learning but may lead to instability.

    gamma: float, optional, default=0.9
        The discount factor. It balances the importance of immediate rewards versus future rewards, ranging from 0 to 1. A higher gamma value makes the agent focus more on long-term rewards.

    epsilon: float, optional, default=0.1
        The exploration rate. It determines the probability of the agent taking a random action, used to balance exploration and exploitation. A higher epsilon value makes the agent explore more.

    Returns:
    q_table: numpy.ndarray
        A Q-value table of shape (env.size, env.size, 4), storing the Q-values for each state-action pair.
    """
    q_table = np.zeros((env.size, env.size, 4))
    ####
    # Implement the Q-learning algorithm here

    for _ in range(episodes):
        x, y = env.start

        while True:
            policy = extract_policy(q_table=q_table)
            action = policy[x, y]

            if random.uniform(0,1) <= epsilon:
                action = random.randint(0,3)

            env.state = (x, y)
            new_state, reward = env.step(action=action)

            q_table[x, y, action] += alpha * (reward + gamma *
                                              np.amax(q_table[new_state[0], new_state[1]]) - q_table[x, y, action])

            x, y = new_state

            if new_state == env.goal:
                break
    ####

    env.reset()
    return q_table
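
A hypothetical end-to-end usage, again assuming the no-argument GridWorld constructor from the scaffold and the extract_policy helper sketched above:

env = GridWorld()                           # assumed constructor
q_table = q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1)
learned_policy = extract_policy(q_table)    # greedy policy from the learned Q-table
print(learned_policy)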
