Reinforcement Learning Week 11 Course Notes
This week
- Watching *Options*.
- The readings are Sutton, Precup, Singh (1999) and Jong, Hester, Stone (2008) (including the slides from the resources link).
- Delayed reward: the agent gets weak feedback; the reward is a moving target.
- We need exploration to learn the model, or the action-reward pairs, for all (or at least a good number of) states.
- Computationally, the complexity of RL depends on the number of states and the number of actions.
- Function approximation over the value function (V(s), Q(s,a)) is abstraction over states, not actions.
- The focus of this class is abstraction over actions.
Temporal Abstraction
- The original grid-world problem: walls separate a big square into four rooms, and the agent can be in any location. The agent's actions are move L, R, U, D, and its goal is to reach the small square in the bottom-left room.
- A set of new actions (goto_doorway_X, where X is an orientation) can be generated to represent sequences of the original actions.
- Temporal abstraction is representing many actions with one or a few actions (without doing anything to the states).
- Temporal abstraction collapses many actions into one and thus makes a lot of states equivalent.
- Temporal abstraction has computational benefits.
Temporal Abstraction: Options
- An option is a tuple <I, π, β> (a small code sketch follows this list):
- I is the initiation set: the set of states in which the option can be started.
- π is the option's policy: the probability of taking action a in state s, (s, a) → [0, 1].
- β is the termination condition: for each state s, β(s) is the probability that the option terminates in s, s → [0, 1].
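As a minimal sketch (in Python, assuming a grid world like the four-rooms example; the coordinates, the doorway cell, and helper names such as `doorway_policy` are made up for illustration), an option can be packaged directly as the tuple <I, π, β>:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

State = Tuple[int, int]   # (row, col) in the grid world
Action = str              # 'L', 'R', 'U', 'D'

@dataclass
class Option:
    """An option is the tuple <I, pi, beta>."""
    initiation: Set[State]                          # I: states where the option can start
    policy: Callable[[State], Dict[Action, float]]  # pi(s) -> distribution over actions
    termination: Callable[[State], float]           # beta(s) -> probability of terminating in s

# Hypothetical goto_doorway option for one room of the four-rooms world.
ROOM = {(r, c) for r in range(5) for c in range(5)}   # made-up room coordinates
DOORWAY = (2, 5)                                      # made-up doorway cell

def doorway_policy(s: State) -> Dict[Action, float]:
    # Deterministically head toward the doorway.
    r, c = s
    if c < DOORWAY[1]:
        return {'R': 1.0}
    return {'U': 1.0} if r > DOORWAY[0] else {'D': 1.0}

goto_doorway_east = Option(
    initiation=ROOM,                                      # can start anywhere in the room
    policy=doorway_policy,
    termination=lambda s: 1.0 if s == DOORWAY else 0.0,   # terminate exactly at the doorway
)
```

Here β happens to be deterministic (terminate exactly at the doorway), but in general it can be any function into [0, 1].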
Temporal Abstraction: Option Value Function
- The value function is a rewrite of the Bellman equation,
- using "o" to replace "a" (O is the generalization of A).
- V(s) (and V(s')) is the value function to be updated, R(s, o) is the reward for choosing option o in state s, and F works like a transition function; the discount factor is hidden inside R and F (see the reconstructed equation after this list).
- Using options kind of violates the temporal assumptions of MDPs:
- MDPs have atomic actions, so the reward can easily be discounted at each step.
- Using options gives us variable-length actions, so the discount factor is hidden.
- If o represents k steps, R and F are actually discounted accordingly; this is a Semi-MDP (SMDP).
- We can turn non-Markovian things into Markovian ones by keeping track of the history.
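A hedged reconstruction of the option-level Bellman equation described above, following the SMDP formulation of Sutton, Precup, Singh (1999); here k is the (random) number of primitive steps the option o lasts, and the discount γ is folded into R and F:

```latex
V(s) \;=\; \max_{o \in O_s} \Big[\, R(s,o) \;+\; \sum_{s'} F(s,o,s')\, V(s') \,\Big]

% where, for an option o started in state s and lasting k primitive steps,
R(s,o)    \;=\; \mathbb{E}\big[\, r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \,\big]
F(s,o,s') \;=\; \mathbb{E}\big[\, \gamma^{k} \, \mathbf{1}\{ s_{t+k} = s' \} \,\big]
```

Because γ^k is absorbed into F, no explicit discount factor appears in the first equation, which is why the notes say it is "hidden".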
Pac-Man Problems
- We can learn two things from the example:
- If done improperly, temporal abstraction might not reach the optimal policy.
- Temporal abstraction might induce state abstraction (reducing the state space), so the problem can be solved more efficiently.
How It Comes Together
- We can see options and state representations as high-level representations. In fact, actions and states are also somewhat made up. The agent's goal is to make decisions with respect to those descriptions of the world, whether they are actions or options, states or abstract states.
- If we construct options smartly, we might be able to ignore some states (shrinking the state space) and so reduce the computational resources required.
Goal Abstraction
- The agent has multiple parallel goals (e.g., a predator-prey scenario); at any given time, only one goal dominates (is more important).
- β (the probability of terminating in a state) can be seen as the probability of accomplishing a goal, or the probability that another goal becomes more important (switching goals).
- Options give us a way to think of actions as accomplishing parallel goals.
- The goals do not have to be on the same scale.
- Modular RL sees options as sub-agents with different goals. There are three ways of choosing among them (a small sketch follows after this list):
- Greatest-mass Q-learning: each module has its own Q function. For each action, we sum the modules' Q values and execute the action with the largest sum. (We might end up with an action that every module rates as only mediocre.)
- Top Q-learning: choose the action with the highest single Q value across all modules. (But one module might put a high Q on many actions.)
- Negotiated W-learning: minimize loss, e.g., pick the action that minimizes the worst loss any module suffers relative to its own preferred action.
- Modular RL is often impossible to get right because a fair voting system is hard to construct; e.g., the modules might have incompatible goals.
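A minimal Python sketch of the three arbitration schemes, assuming each module exposes its own Q values for the current state; the module names (`eat`, `avoid`), the numbers, and the simplified reading of "minimize loss" for negotiated W-learning are all made up for illustration:

```python
from typing import Dict, List

QTable = Dict[str, float]   # action -> Q value for one module in the current state

def greatest_mass(modules: List[QTable]) -> str:
    """Greatest-mass Q-learning: sum Q values across modules, pick the largest sum."""
    actions = modules[0].keys()
    return max(actions, key=lambda a: sum(m[a] for m in modules))

def top_q(modules: List[QTable]) -> str:
    """Top Q-learning: follow the single most confident module's favorite action."""
    best_module = max(modules, key=lambda m: max(m.values()))
    return max(best_module, key=best_module.get)

def negotiated_w(modules: List[QTable]) -> str:
    """Negotiated W-learning (simplified): minimize the worst loss any module
    suffers relative to its own preferred action."""
    actions = modules[0].keys()
    def worst_loss(a: str) -> float:
        return max(max(m.values()) - m[a] for m in modules)
    return min(actions, key=worst_loss)

# Hypothetical predator-prey modules with made-up Q values.
eat   = {'L': 0.9, 'R': 0.0, 'U': 0.5}
avoid = {'L': 0.1, 'R': 0.8, 'U': 0.6}
print(greatest_mass([eat, avoid]), top_q([eat, avoid]), negotiated_w([eat, avoid]))
```

On these made-up numbers, greatest mass and the negotiated scheme both compromise on 'U', while top Q follows the single most confident module and picks 'L'.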
Monte Carlo Tree Search
In the figure above, circles are states and edges are transitions. π, derived from Q̂(s, a), is the policy for the known part of the tree: in these states, we know what action to take by following π (pink edges). When we reach an unknown state, we apply the rollout policy πr and simulate actions deep into the tree; we then back up and update π to figure out what to select at each state, including the unknown state where we started the simulation. π gets expanded as we figure out the policy at the unknown state. Then we repeat the "Select, Expand, Simulate, Back up" process.
- Initial policy π: we SELECT actions by following it; it can be updated at each iteration of the tree search.
- Rollout policy πr: starting from the unknown state, we SIMULATE by sampling actions from it.
- After the simulation, we BACK UP the simulated result to update π.
- Once we have figured out what action to take at the unknown state (and the states above it), we EXPAND π to cover the previously unknown state.
- Now we can repeat the process to search deeper.
- When to stop? Search deep enough before hitting the computational resource cap.
- The rollout policy πr can be random: we know an action is good because we get better results even by behaving randomly from that point on.
- Instead of purely random, one can behave randomly subject to constraints (e.g., not being eaten by a ghost).
- Constraints: defined by failure.
- Goals: defined by success.
- MCTS is compatible with options for performing the tree search; in this case, π is derived from Q̂(s, o).
- Monte Carlo Tree Search can be seen as policy search: when we reach a state where we are not confident, an inner loop is executed to do some RL. (A minimal code sketch follows below.)
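A minimal Python sketch of the Select / Expand / Simulate / Back-up loop described above, with a random rollout policy; the `env.actions(s)` / `env.step(s, a)` interface is an assumption made for illustration (not a real library API), and UCB-style selection is one common, but not the only, way to pick actions inside the known tree:

```python
import math
import random
from collections import defaultdict

class MCTS:
    """Minimal Select / Expand / Simulate / Back-up loop.

    Assumed (illustrative) environment interface -- not a real library API:
        env.actions(s) -> list of legal actions in state s
        env.step(s, a) -> (next_state, reward, done)   # generative model / simulator
    States are assumed hashable.
    """

    def __init__(self, env, gamma=0.95, c_ucb=1.4, max_depth=20):
        self.env, self.gamma, self.c_ucb, self.max_depth = env, gamma, c_ucb, max_depth
        self.Q = defaultdict(float)   # Q_hat(s, a): value estimates for the known tree
        self.N = defaultdict(int)     # visit counts for (s, a)
        self.Ns = defaultdict(int)    # visit counts for s
        self.known = set()            # states already expanded into the tree

    def search(self, root, n_iterations=1000):
        """Run the loop for a fixed budget, then act greedily w.r.t. Q_hat at the root."""
        for _ in range(n_iterations):
            self._simulate(root, depth=0)
        return max(self.env.actions(root), key=lambda a: self.Q[(root, a)])

    def _simulate(self, s, depth):
        if depth >= self.max_depth:
            return 0.0
        if s not in self.known:                # EXPAND: add the unknown state to the tree...
            self.known.add(s)
            return self._rollout(s, depth)     # ...and SIMULATE with the rollout policy pi_r
        a = self._select(s)                    # SELECT inside the known part of the tree
        s2, r, done = self.env.step(s, a)
        ret = r if done else r + self.gamma * self._simulate(s2, depth + 1)
        self.N[(s, a)] += 1                    # BACK UP the simulated return into Q_hat
        self.Ns[s] += 1
        self.Q[(s, a)] += (ret - self.Q[(s, a)]) / self.N[(s, a)]
        return ret

    def _select(self, s):
        # UCB-style selection: exploit Q_hat but keep exploring rarely tried actions.
        def score(a):
            if self.N[(s, a)] == 0:
                return float("inf")
            bonus = self.c_ucb * math.sqrt(math.log(self.Ns[s] + 1) / self.N[(s, a)])
            return self.Q[(s, a)] + bonus
        return max(self.env.actions(s), key=score)

    def _rollout(self, s, depth):
        # Rollout policy pi_r: behave randomly from the unknown state onward.
        ret, discount = 0.0, 1.0
        for _ in range(self.max_depth - depth):
            a = random.choice(self.env.actions(s))
            s, r, done = self.env.step(s, a)
            ret += discount * r
            discount *= self.gamma
            if done:
                break
        return ret
```

Calling `MCTS(env).search(root_state)` would run the loop for a fixed budget and return the greedy action at the root; plugging in options instead of primitive actions (so that π comes from Q̂(s, o)) works the same way.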
MCTS Properties
- MCTS is useful for large state spaces, but it needs lots of samples to get good estimates.
- Planning time is independent of |S|.
- Running time is exponential in the horizon: roughly O((|A| · x)^H).
- H is how many steps (how deep) the search needs to go, x is the number of next states sampled at each step, and |A| is the size of the action space.
- The tradeoff is between the number of states and how far ahead we search (a rough numeric illustration follows below).
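As a rough illustration with made-up numbers: with |A| = 4 actions, x = 10 sampled next states per action, and horizon H = 5, the search examines on the order of

```latex
(|A| \cdot x)^{H} \;=\; (4 \cdot 10)^{5} \;\approx\; 1.0 \times 10^{8}
```

nodes, regardless of how large |S| is.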
Recap
2015-10-28 first draft
2015-11-01 finished.
2015-12-04 reviewed and revised