Trust Region Policy Optimization
Introduction
In this article, we first prove that minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes. Then we make a series of approximations to the theoretically justified algorithm, yielding a practical algorithm, which we call trust region policy optimization (TRPO).
Preliminaries
Note that L_π uses the visitation frequency ρ_π rather than ρ_π̃, ignoring changes in state visitation density due to changes in the policy.
Let π_old denote the current policy, and let π′ = arg max_π′ L_π_old(π′). The new policy π_new was defined to be the following mixture:
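For reference, the mixture update of conservative policy iteration is usually written as below. The mixing coefficient α ∈ [0, 1] does not appear in the excerpt above and is taken here from the standard statement of the method:

```latex
\pi_{\text{new}}(a \mid s) = (1 - \alpha)\,\pi_{\text{old}}(a \mid s) + \alpha\,\pi'(a \mid s)
```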
Kakade and Langford derived the following lower bound:
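As a reminder of its shape, the bound is commonly stated in the form below. This is the form quoted in the TRPO paper; here α is the weight given to π′ in the mixture policy, γ is the discount factor, and A is the advantage function of the current policy. None of these symbols is defined in the excerpt above, so treat the notation as an assumption:

```latex
\eta(\pi_{\text{new}}) \;\ge\; L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^{2}}\,\alpha^{2},
\qquad
\epsilon = \max_{s}\,\Bigl|\,\mathbb{E}_{a \sim \pi'(\cdot \mid s)}\bigl[A_{\pi_{\text{old}}}(s, a)\bigr]\Bigr|
```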
Monotonic Improvement Guarantee for General Stochastic Policies
We note the following relationship between the total variation divergence and the KL divergence:
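The relationship in question is D_TV(p ‖ q)² ≤ D_KL(p ‖ q), with D_TV(p ‖ q) = ½ Σ_i |p_i − q_i| for discrete distributions. A minimal sketch that checks it numerically on random distributions is given below; the function names and the Dirichlet sampling are illustrative choices, not part of the original text.

```python
import numpy as np


def total_variation(p, q):
    """D_TV(p || q) = 0.5 * sum_i |p_i - q_i| for discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())


def kl_divergence(p, q):
    """D_KL(p || q); assumes every entry of p and q is strictly positive."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def check_tv_kl_bound(n_trials=1000, n_outcomes=5, seed=0):
    """Return True if D_TV^2 <= D_KL holds on all sampled pairs."""
    rng = np.random.default_rng(seed)
    alpha = np.full(n_outcomes, 2.0)  # keeps sampled probabilities away from 0
    for _ in range(n_trials):
        p = rng.dirichlet(alpha)
        q = rng.dirichlet(alpha)
        # small tolerance guards against floating-point round-off
        if total_variation(p, q) ** 2 > kl_divergence(p, q) + 1e-12:
            return False
    return True
```

A numerical check like this is not a proof, but it is a quick way to confirm that the direction of the inequality and the ½ factor in D_TV have been written down correctly.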
Trust region policy optimization, which we propose in the following section, is an approximation to Algorithm 1.
Optimization of Parameterized Policies
In practice, if we used the penalty coefficient C recommended by the theory above, the step sizes would be very small. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:
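Written out, the trust region problem takes the form below. The symbol δ for the trust region radius and the superscript "max" (the maximum of the KL divergence over states) follow the TRPO paper's notation and are not defined in the excerpt above:

```latex
\max_{\theta}\; L_{\theta_{\text{old}}}(\theta)
\quad \text{subject to} \quad
D_{\mathrm{KL}}^{\max}(\theta_{\text{old}}, \theta) \le \delta
```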
Instead, we can use a heuristic approximation which considers the average KL divergence:
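To make the difference between the two constraints concrete, the sketch below computes both the maximum and the average KL divergence over a batch of states. It assumes discrete (categorical) policies stored as probability tables of shape (n_states, n_actions); the function names are hypothetical.

```python
import numpy as np


def categorical_kl(p_old, p_new):
    """Row-wise D_KL(p_old(.|s) || p_new(.|s)).

    Both arrays have shape (n_states, n_actions) and strictly positive rows
    that sum to 1 (an assumption of this sketch).
    """
    return np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=1)


def kl_constraints(p_old, p_new):
    """Return (max KL over states, average KL over states)."""
    kl = categorical_kl(p_old, p_new)
    return float(kl.max()), float(kl.mean())
```

The average is never larger than the maximum, so a policy update can satisfy the average-KL constraint while still changing a lot in a few states. That is why the text calls the average a heuristic approximation.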
Sample-Based Estimation of the Objective and Constraint
The previous section proposed a constrained optimization problem on the policy parameters (Equation (12)), which optimizes an estimate of the expected total reward η subject to a constraint on the change in the policy at each update. This section describes how the objective and constraint functions can be approximated using Monte Carlo simulation.
All that remains is to replace the expectations by sample averages and replace the Q-value by an empirical estimate. The following sections describe two different schemes for performing this estimation.
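As a rough illustration of "replacing expectations by sample averages", the sketch below estimates the surrogate objective from sampled state-action pairs with an importance sampling ratio. It assumes the actions were drawn from the old policy and that empirical Q-value estimates (for example, discounted returns) are already available; the function and argument names are hypothetical.

```python
import numpy as np


def surrogate_estimate(logp_new, logp_old, q_values):
    """Sample average of (pi_new(a|s) / pi_old(a|s)) * Q_hat(s, a).

    logp_new, logp_old: log-probabilities of the sampled actions under the
        new and old policies, shape (n_samples,).
    q_values: empirical Q-value estimates for the same samples.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    ratio = np.exp(logp_new - logp_old)  # importance weights
    return float(np.mean(ratio * q_values))
```

When the new policy equals the old one, every ratio is 1 and the estimate reduces to the plain average of the Q-value estimates.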
Single Path
Vine
Practical Algorithm
Connections with Prior Work
Natural Policy Gradient
The natural policy gradient (Kakade, 2002) can be obtained as a special case of the update in Equation (12) by using a linear approximation to L and a quadratic approximation to the D_KL constraint.
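Under those two approximations the problem becomes: maximize gᵀΔ subject to ½ ΔᵀFΔ ≤ δ, where g is the gradient of L and F is the Hessian of the KL term (the Fisher information matrix). The sketch below solves that small problem. It is a toy with an explicit matrix F, uses conjugate gradient to avoid forming F⁻¹, and scales the step to saturate the constraint, which differs from the fixed step size of the classical natural gradient update. All names are hypothetical.

```python
import numpy as np


def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b for symmetric positive definite A."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b.copy()   # residual b - A x (x starts at 0)
    p = r.copy()   # search direction
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


def natural_gradient_step(grad, fisher, delta):
    """Step maximizing grad @ step subject to 0.5 * step @ F @ step <= delta.

    Assumes grad is nonzero and fisher is symmetric positive definite.
    """
    direction = conjugate_gradient(lambda v: fisher @ v, grad)  # ~ F^{-1} g
    step_size = np.sqrt(2.0 * delta / (direction @ fisher @ direction))
    return step_size * direction
```

For example, with `fisher = np.diag([2.0, 0.5])`, `grad = np.array([1.0, 1.0])` and `delta = 0.01`, the returned step points along F⁻¹g and satisfies ½ ΔᵀFΔ = δ up to numerical error.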