offline RL | Pessimistic Bootstrapping (PBRL): penalizing uncertainty in the Q update to pull down OOD Q values

Posted by MoonOut on 2023-12-17

Original URL: https://www.cnblogs.com/moonout/p/17909147.html

Paper title: Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning, ICLR 2022, review scores 6 6 8 8, spotlight.
PDF version: https://arxiv.org/abs/2202.11566
HTML version: https://ar5iv.labs.arxiv.org/html/2202.11566
OpenReview: https://openreview.net/forum?id=Y4cs1Z3HnqL
GitHub: https://github.com/Baichenjia/PBRL

0 abstract

Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment. Directly applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions. Previous methods tackle such problem by penalizing the Q-values of OOD actions or constraining the trained policy to be close to the behavior policy. Nevertheless, such methods typically prevent the generalization of value functions beyond the offline data and also lack precise characterization of OOD data. In this paper, we propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints. Specifically, PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To tackle the extrapolating error, we further propose a novel OOD sampling method. We show that such OOD sampling and pessimistic bootstrapping yields provable uncertainty quantifier in linear MDPs, thus providing the theoretical underpinning for PBRL. Extensive experiments on D4RL benchmark show that PBRL has better performance compared to the state-of-the-art algorithms.

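background:
- Offline RL learns a policy from a previously collected dataset without exploring the environment. Applying off-policy RL directly to the offline setting usually fails because of the extrapolation error caused by OOD actions.
- Prior work addresses this by penalizing the Q values of OOD actions, or by constraining the trained policy to stay close to the behavior policy.
- However, these methods typically prevent the value function from generalizing beyond the offline dataset, and they lack a precise characterization of OOD data.
method:
- We propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm with no explicit policy constraint.
- Specifically, PBRL quantifies uncertainty through the disagreement among bootstrapped Q functions, and performs pessimistic updates by penalizing the value function according to the estimated uncertainty.
- To handle extrapolation error, we further propose a new OOD sampling method.
- Theory: OOD sampling combined with pessimistic bootstrapping provably yields an uncertainty quantifier in linear MDPs.
experiments:
- Extensive experiments on the D4RL benchmark show that PBRL outperforms state-of-the-art algorithms.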

3 method

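3.1 Uncertainty quantification with bootstrapped Q functions

Maintain K Q functions, each updated independently via bootstrapping.
The uncertainty is \(U(s,a)=\mathrm{std}(Q^k(s,a))=\sqrt{\frac1K\sum_{k=1}^K(Q^k(s,a)-\bar Q(s,a))^2}\), where \(\bar Q\) is the ensemble mean. (Judging from Figure 1(a) of the paper, this definition seems reasonable.)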
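As a minimal sketch of the definition above, the following NumPy snippet computes the ensemble standard deviation for a batch of state-action pairs. The function name `ensemble_uncertainty` and the `(K, batch)` array layout are assumptions made for illustration; they are not taken from the paper or its repository.

```python
import numpy as np

def ensemble_uncertainty(q_values: np.ndarray) -> np.ndarray:
    """Return U(s, a) for each column of a (K, batch) array of Q^k(s, a)."""
    # np.std uses ddof=0 by default, which matches the 1/K factor in the formula.
    return q_values.std(axis=0)

# Hypothetical ensemble of K=3 heads evaluated on 2 state-action pairs.
# The heads agree on the first pair and disagree on the second.
q_values = np.array([[1.0, 5.0],
                     [1.0, 7.0],
                     [1.0, 9.0]])
u = ensemble_uncertainty(q_values)
print(u)  # zero for the first pair, sqrt(8/3) for the second
```

Pairs on which the heads agree get zero uncertainty, which is the behaviour the penalty terms in Section 3.2 rely on.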

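3.2 pessimistic learning

idea: penalize the Q function according to its uncertainty.
The PBRL loss function has two parts: ① the TD error on ID data, and ② a pseudo TD error on OOD data.
① TD error on ID data: see Eq. (4) of the paper, roughly \(\hat T^{in}Q^k(s,a):=r+\gamma \hat E\big[Q^k(s',a')-\beta_{in}U(s',a')\big]\), which penalizes the uncertainty of the \((s',a')\) being transitioned to.
- (The ID tuple (s, a, r, s', a') above comes from the offline dataset.)
② Pseudo TD error on OOD data: the state \(s^{ood}\) appears to be an ID state, while the action \(a^{ood}\) is generated by the policy and may therefore be OOD.
- The idea of the penalty: \(\hat T^{ood}Q^k(s^{ood},a^{ood}):=Q^k(s^{ood},a^{ood})-\beta_{ood}U(s^{ood},a^{ood})\), i.e. directly subtract the uncertainty.
- (If (s,a) is an ID state-action pair, its uncertainty will be small.)
- Related implementation details: early in training the target is truncated as \(\max[0, \hat T^{ood}Q^k(s,a)]\); a large \(\beta_{ood}\) is used at the start of training to penalize OOD actions strongly, and \(\beta_{ood}\) is gradually decreased as training proceeds.
- (This also feels like a SARSA-style update...)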
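The two targets above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `(K, batch)` layout, the single-sample estimate of the expectation, the omission of terminal-state handling, and especially the linear `beta_ood_schedule` with its start and end values are all hypothetical. The notes only say that \(\beta_{ood}\) starts large and decreases.

```python
import numpy as np

def uncertainty(q: np.ndarray) -> np.ndarray:
    """Std over the ensemble axis; q has shape (K, batch)."""
    return q.std(axis=0)

def in_target(reward, next_q, gamma, beta_in):
    """ID target: r + gamma * (Q^k(s', a') - beta_in * U(s', a'))."""
    return reward + gamma * (next_q - beta_in * uncertainty(next_q))

def ood_target(q_ood, beta_ood, truncate):
    """OOD pseudo-target: Q^k - beta_ood * U, optionally clipped at 0."""
    target = q_ood - beta_ood * uncertainty(q_ood)
    return np.maximum(target, 0.0) if truncate else target

def beta_ood_schedule(step, total_steps, beta_start=5.0, beta_end=0.2):
    """Hypothetical linear decay; the actual schedule is not given in the notes."""
    frac = min(step / total_steps, 1.0)
    return beta_start + frac * (beta_end - beta_start)

# K=2 heads, one ID transition: the heads disagree on (s', a').
reward = np.array([1.0])
next_q = np.array([[2.0], [4.0]])
t_in = in_target(reward, next_q, gamma=0.5, beta_in=1.0)

# K=2 heads, two OOD pairs: disagreement on the first, agreement on the second.
q_ood = np.array([[1.0, 10.0], [3.0, 10.0]])
t_ood = ood_target(q_ood, beta_ood=2.0, truncate=True)
print(t_in, t_ood)
```

In the second example the pair on which the heads agree keeps its value, while the disputed pair is pushed down and, for one head, clipped at zero.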
loss function:
\[L_{critic}=\hat E_{(s,a,r,s')\sim D_{in}}\bigg[(\hat T^{in}Q^k-Q^k)^2\bigg] + \hat E_{s^{ood}\sim D_{in},~a^{ood}\sim\pi(s^{ood})}\bigg[(\hat T^{ood}Q^k-Q^k)^2\bigg] \]
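To make the structure of \(L_{critic}\) concrete, here is a small NumPy sketch that evaluates it from precomputed predictions and targets. It assumes the targets are treated as constants, that \(\hat E\) is a batch mean, and that the per-head losses are summed over the K heads. The summation over heads and the name `critic_loss` are assumptions for illustration, since the formula above is written for a single \(Q^k\).

```python
import numpy as np

def critic_loss(q_in, target_in, q_ood, target_ood) -> float:
    """Squared ID TD error plus squared OOD pseudo TD error.

    All arrays have shape (K, batch). The batch mean stands in for the
    empirical expectation; summing over heads is an assumption.
    """
    loss_in = ((target_in - q_in) ** 2).mean(axis=1)     # one value per head
    loss_ood = ((target_ood - q_ood) ** 2).mean(axis=1)  # one value per head
    return float((loss_in + loss_ood).sum())

# Single head (K=1), two samples in each batch, with made-up numbers.
q_in = np.array([[1.0, 2.0]])
target_in = np.array([[2.0, 2.0]])
q_ood = np.array([[3.0, 3.0]])
target_ood = np.array([[1.0, 3.0]])
loss = critic_loss(q_in, target_in, q_ood, target_ood)
print(loss)  # 0.5 from the ID term plus 2.0 from the OOD term
```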
policy: the policy aims to maximize the Q function; specifically, it maximizes the minimum value over the Q ensemble.
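The min-over-ensemble objective can be illustrated with a toy example. The sketch below scores a handful of candidate actions for one state and picks the one with the best worst-case value. Using a discrete candidate set with `argmax` is a simplification assumed here for readability; a continuous-control policy would instead be trained by gradient ascent on the same quantity.

```python
import numpy as np

def pessimistic_value(q_values: np.ndarray) -> np.ndarray:
    """Minimum over the ensemble axis of a (K, n_actions) array."""
    return q_values.min(axis=0)

# K=2 heads scoring 3 hypothetical candidate actions in one state.
# The heads disagree strongly on the second action.
q_values = np.array([[1.0, 4.0, 2.0],
                     [1.2, 0.5, 2.1]])
v = pessimistic_value(q_values)
best = int(np.argmax(v))
print(v, best)
```

The second action has the highest score under one head, but the minimum rejects it because the other head rates it poorly; the third action wins on worst-case value.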

Section 3.3 is theory.

