Trust Region Policy Optimization
Introduction
In this article, we first prove that minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes. Then we make a series of approximations to the theoretically-justified algorithm, yielding a practical algorithm, which we call trust region policy optimization (TRPO).
Preliminaries
Note that Lπ uses the visitation frequency ρπ rather than ρπ̃, ignoring changes in state visitation density due to changes in the policy.
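For reference (as recalled from the paper; η denotes the expected discounted return, ρ the discounted state visitation frequency, and Aπ the advantage function), the true objective and the surrogate are

\eta(\tilde\pi) = \eta(\pi) + \sum_s \rho_{\tilde\pi}(s) \sum_a \tilde\pi(a \mid s)\, A_\pi(s, a),
\qquad
L_\pi(\tilde\pi) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde\pi(a \mid s)\, A_\pi(s, a),

so the two expressions differ only in which state distribution weights the advantages.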
Let πold denote the current policy, and let π′ = arg maxπ′ Lπold(π′). The new policy πnew was defined to be the following mixture: πnew(a|s) = (1 − α) πold(a|s) + α π′(a|s).
Kakade and Langford (2002) derived the following lower bound for this mixture update.
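As recalled from the paper, with α the mixture coefficient and ε = maxs |Ea∼π′(·|s)[Aπold(s, a)]|, the bound reads

\eta(\pi_{\mathrm{new}}) \;\ge\; L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) \;-\; \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2.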
Monotonic Improvement Guarantee for General Stochastic Policies
We note the following relationship between the total variation divergence and the KL divergence.
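The relationship used is that the squared total variation divergence is bounded by the KL divergence,

D_{\mathrm{TV}}(p \,\|\, q)^2 \;\le\; D_{\mathrm{KL}}(p \,\|\, q),

which (as recalled from the paper, with ε = maxs,a |Aπ(s, a)| and DKLmax the maximum per-state KL divergence) yields a lower bound valid for general stochastic policies:

\eta(\tilde\pi) \;\ge\; L_\pi(\tilde\pi) \;-\; C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde\pi),
\qquad C = \frac{4\epsilon\gamma}{(1-\gamma)^2}.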
Trust region policy optimization, which we propose in the following section, is an approximation to Algorithm 1, which uses a constraint on the KL divergence rather than a penalty to robustly allow large updates.
Optimization of Parameterized Policies
In practice, if we used the penalty coefficient C recommended by the theory above, the step sizes would be very small. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint.
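Writing the policy parameters as θ and the trust region size as δ, this constrained problem (as recalled from the paper) is

\underset{\theta}{\operatorname{maximize}}\;\; L_{\theta_{\mathrm{old}}}(\theta)
\quad \text{subject to} \quad D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta) \le \delta,

where the maximum is taken over states.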
Instead, we can use a heuristic approximation which considers the average KL divergence.
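With the average KL divergence defined as

\bar{D}_{\mathrm{KL}}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho}\bigl[ D_{\mathrm{KL}}\bigl( \pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s) \bigr) \bigr],

the problem referred to elsewhere in this article as Equation (12) becomes (again as recalled from the paper)

\underset{\theta}{\operatorname{maximize}}\;\; L_{\theta_{\mathrm{old}}}(\theta)
\quad \text{subject to} \quad \bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \le \delta.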
Sample-Based Estimation of the Objective and Constraint
The previous section proposed a constrained optimization problem on the policy parameters (Equation (12)), which optimizes an estimate of the expected total reward η subject to a constraint on the change in the policy at each update. This section describes how the objective and constraint functions can be approximated using Monte Carlo simulation.
All that remains is to replace the expectations by sample averages and replace the Q values by an empirical estimate. The following sections describe two different schemes for performing this estimation.
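As a minimal illustration of this sample-based estimation (a sketch, not the paper's code), the following NumPy snippet computes the importance-weighted surrogate and the average KL divergence from a batch of samples for a discrete-action policy. The array names, and the choice of the old policy as the sampling distribution q, are assumptions of the sketch.

```python
import numpy as np

def surrogate_and_average_kl(old_probs, new_probs, actions, advantages):
    """old_probs, new_probs: (N, A) action probabilities of the old/new policy at
    the N sampled states; actions: (N,) indices of the sampled actions (drawn here
    from the old policy, which plays the role of q); advantages: (N,) empirical
    advantage (or Q-value) estimates."""
    idx = np.arange(len(actions))
    # Importance-weighted surrogate: mean of pi_new(a|s) / q(a|s) * advantage estimate
    ratio = new_probs[idx, actions] / old_probs[idx, actions]
    surrogate = np.mean(ratio * advantages)
    # Average KL divergence D_KL(pi_old || pi_new) over the sampled states
    avg_kl = np.mean(np.sum(old_probs * np.log(old_probs / new_probs), axis=1))
    return surrogate, avg_kl

# Example usage with random placeholder data (4 actions, 32 sampled states):
rng = np.random.default_rng(0)
old_probs = rng.dirichlet(np.ones(4), size=32)
new_probs = rng.dirichlet(np.ones(4), size=32)
actions = rng.integers(0, 4, size=32)
advantages = rng.normal(size=32)
print(surrogate_and_average_kl(old_probs, new_probs, actions, advantages))
```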
Single Path
Vine
Practical Algorithm
Connections with Prior Work
Natural Policy Gradient
The natural policy gradient (Kakade, 2002) can be obtained as a special case of the update in Equation (12) by using a linear approximation to L and a quadratic approximation to the DKL constraint.
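To make that connection concrete, here is a minimal sketch assuming a small, explicitly formed Fisher matrix F and the gradient g of the surrogate L at θold (a real TRPO implementation instead uses Fisher-vector products with the conjugate gradient method); all names below are illustrative.

```python
import numpy as np

def natural_gradient_step(theta_old, g, F, delta):
    """Linearize L and quadratize the KL constraint around theta_old:
        maximize   g^T (theta - theta_old)
        subject to 0.5 (theta - theta_old)^T F (theta - theta_old) <= delta.
    The maximizer is a step along F^{-1} g, scaled to the constraint boundary.
    The natural policy gradient instead takes a fixed-size step along F^{-1} g,
    whereas here the trust region size delta determines the scaling."""
    direction = np.linalg.solve(F, g)                     # F^{-1} g
    scale = np.sqrt(2.0 * delta / (direction @ F @ direction))
    return theta_old + scale * direction
```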