Trust Region Policy Optimization

Posted by weixin_34124651 on 2018-08-05

Introduction

In this article, we first prove that minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes. Then we make a series of approximations to the theoretically-justified algorithm, yielding a practical algorithm, which we call trust region policy optimization (TRPO).

Preliminaries

The expected discounted reward of a stochastic policy π is

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big], \quad \text{where } s_0 \sim \rho_0,\; a_t \sim \pi(a_t \mid s_t),\; s_{t+1} \sim P(s_{t+1} \mid s_t, a_t).$$

The state-action value function, value function, and advantage function are

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big], \quad V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big], \quad A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s).$$

The expected return of another policy π̃ can be expressed in terms of its advantage over π, accumulated over timesteps:

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]. \tag{1}$$

Rewriting this with the (unnormalized) discounted state visitation frequencies ρ_π(s) = P(s_0 = s) + γP(s_1 = s) + γ²P(s_2 = s) + ⋯,

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a). \tag{2}$$

A local approximation to η is obtained by replacing ρ_π̃ with ρ_π:

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a). \tag{3}$$

Note that Lπ uses the visitation frequency ρπ rather than ρπ̃, ignoring changes in state visitation density due to changes in the policy.

For a differentiable parameterized policy π_θ, Lπ matches η to first order: for any parameter value θ₀,

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta=\theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_0}. \tag{4}$$
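
The equality part of Equation (4) follows from the identity Σ_a π(a|s) A_π(s,a) = 0 at the current policy, so the surrogate's advantage-weighted term vanishes at π_old. A tiny numeric illustration of that identity (toy numbers only, not from the paper):

```python
# Illustration: at the current policy, the expected advantage is zero at every state,
# since sum_a pi(a|s) A_pi(s,a) = sum_a pi(a|s) (Q_pi(s,a) - V_pi(s)) = V_pi(s) - V_pi(s) = 0.
import numpy as np

rng = np.random.default_rng(6)
num_actions = 4

pi = rng.dirichlet(np.ones(num_actions))   # current policy at some state s
Q = rng.normal(size=num_actions)           # arbitrary action values Q_pi(s, .)
V = pi @ Q                                 # V_pi(s) = E_{a~pi}[Q_pi(s, a)]
A = Q - V                                  # advantages A_pi(s, .)

print(pi @ A)   # ~0 up to floating-point error: L_{pi_old}(pi_old) = eta(pi_old)
```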

Let π_old denote the current policy, and let π′ = argmax_{π′} L_{π_old}(π′). The new policy π_new was defined to be the following mixture:

$$\pi_{\text{new}}(a \mid s) = (1 - \alpha)\,\pi_{\text{old}}(a \mid s) + \alpha\,\pi'(a \mid s). \tag{5}$$

Kakade and Langford (2002) derived the following lower bound:

$$\eta(\pi_{\text{new}}) \ge L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2, \quad \text{where } \epsilon = \max_s \big|\mathbb{E}_{a \sim \pi'(a \mid s)}[A_\pi(s, a)]\big|. \tag{6}$$
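
Sampling an action from this mixture policy is straightforward; the sketch below (purely illustrative, with made-up action probabilities for a single state) draws from π′ with probability α and from π_old otherwise:

```python
# Drawing an action from pi_new(a|s) = (1 - alpha) * pi_old(a|s) + alpha * pi_prime(a|s)
# for a discrete action space; the probabilities here are arbitrary placeholders.
import numpy as np

rng = np.random.default_rng(5)
num_actions, alpha = 4, 0.2

pi_old = rng.dirichlet(np.ones(num_actions))     # current policy at some state s
pi_prime = rng.dirichlet(np.ones(num_actions))   # surrogate-maximizing policy at s

def sample_mixture(rng):
    # with probability alpha act according to pi_prime, otherwise according to pi_old
    probs = pi_prime if rng.random() < alpha else pi_old
    return rng.choice(num_actions, p=probs)

print([sample_mixture(rng) for _ in range(10)])
```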

Monotonic Improvement Guarantee for General Stochastic Policies

The policy improvement bound above extends from mixture policies to general stochastic policies by replacing α with a distance measure between π and π̃. Using the total variation divergence D_TV(p ‖ q) = ½ Σ_i |p_i − q_i| for discrete distributions, define

$$D_{TV}^{\max}(\pi, \tilde{\pi}) = \max_s D_{TV}\big(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\big). \tag{7}$$

Theorem 1. Let α = D_TV^max(π_old, π_new). Then

$$\eta(\pi_{\text{new}}) \ge L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2, \quad \text{where } \epsilon = \max_{s, a} |A_\pi(s, a)|. \tag{8}$$

We note the following relationship between the total variation divergence and the KL divergence:

$$D_{TV}(p \,\|\, q)^2 \le D_{KL}(p \,\|\, q).$$

Let D_KL^max(π, π̃) = max_s D_KL(π(·|s) ‖ π̃(·|s)). Theorem 1 then implies

$$\eta(\tilde{\pi}) \ge L_\pi(\tilde{\pi}) - C\, D_{KL}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^2}. \tag{9}$$

Algorithm 1 is the approximate policy iteration scheme based on this bound: at each iteration i, compute the advantages A_{π_i}, then set π_{i+1} = argmax_π [ L_{π_i}(π) − C·D_KL^max(π_i, π) ].

Algorithm 1 is guaranteed to generate a monotonically improving sequence of policies. Let M_i(π) = L_{π_i}(π) − C·D_KL^max(π_i, π). Then η(π_{i+1}) ≥ M_i(π_{i+1}) and η(π_i) = M_i(π_i), so

$$\eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i) \ge 0,$$

since π_{i+1} maximizes M_i. Thus maximizing M_i at each iteration guarantees that the true objective η is non-decreasing.

Trust region policy optimization, which we propose in the following section, is an approximation to Algorithm 1, which uses a constraint on the KL divergence rather than a penalty to robustly allow large updates.
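
To make the bound in Equation (9) concrete, the following sketch builds a small random tabular MDP and checks numerically that η(π_new) ≥ L_{π_old}(π_new) − C·D_KL^max. Everything here (the random MDP, the helper names) is an illustrative assumption, not code from the paper:

```python
# Numerical sanity check of eta(pi_new) >= L_{pi_old}(pi_new) - C * D_KL^max(pi_old, pi_new)
# on a small random tabular MDP, using exact (matrix-inverse) evaluation of eta, A_pi, rho_pi.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']: transition probabilities
r = rng.uniform(0.0, 1.0, size=(nS, nA))        # r[s, a]: rewards
rho0 = np.ones(nS) / nS                         # start-state distribution

def value(pi):
    """Exact state values: V_pi = (I - gamma * P_pi)^{-1} r_pi."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

def eta(pi):
    """Expected discounted return from the start distribution."""
    return rho0 @ value(pi)

def advantages(pi):
    V = value(pi)
    Q = r + gamma * np.einsum("sat,t->sa", P, V)
    return Q - V[:, None]

def visitation(pi):
    """Unnormalized discounted state-visitation frequencies rho_pi."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)

pi_old = rng.dirichlet(np.ones(nA), size=nS)    # two arbitrary stochastic policies
pi_new = rng.dirichlet(np.ones(nA), size=nS)

A_old = advantages(pi_old)
# surrogate L_{pi_old}(pi_new) from Equation (3)
L = eta(pi_old) + visitation(pi_old) @ np.einsum("sa,sa->s", pi_new, A_old)
eps = np.abs(A_old).max()
C = 4 * eps * gamma / (1 - gamma) ** 2
kl_max = (pi_old * np.log(pi_old / pi_new)).sum(axis=1).max()   # D_KL^max(pi_old, pi_new)

print("eta(pi_new)    =", eta(pi_new))
print("L - C * KL_max =", L - C * kl_max)   # lower bound from Equation (9)
assert eta(pi_new) >= L - C * kl_max
```

As the printout shows, the bound is typically very loose for an arbitrary π_new, which is why using C as a penalty coefficient leads to tiny steps in practice.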

Optimization of Parameterized Policies

Overloading the notation to act on the policy parameters directly (η(θ) := η(π_θ), L_θ(θ̃) := L_{π_θ}(π_{θ̃}), and so on), the previous section showed that the following penalized maximization is guaranteed to improve the true objective η:

$$\underset{\theta}{\text{maximize}} \;\; \Big[ L_{\theta_{\text{old}}}(\theta) - C\, D_{KL}^{\max}(\theta_{\text{old}}, \theta) \Big]. \tag{10}$$

In practice, if we used the penalty coefficient C recommended by the theory above, the step sizes would be very small. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:

$$\begin{aligned} &\underset{\theta}{\text{maximize}} \;\; L_{\theta_{\text{old}}}(\theta) \\ &\text{subject to} \;\; D_{KL}^{\max}(\theta_{\text{old}}, \theta) \le \delta. \end{aligned} \tag{11}$$

This problem imposes a KL divergence constraint at every point in the state space, which is impractical to solve due to the large number of constraints. Instead, we can use a heuristic approximation which considers the average KL divergence:

$$\bar{D}_{KL}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho}\big[ D_{KL}\big( \pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s) \big) \big].$$

We therefore solve the following optimization problem to generate a policy update:

$$\begin{aligned} &\underset{\theta}{\text{maximize}} \;\; L_{\theta_{\text{old}}}(\theta) \\ &\text{subject to} \;\; \bar{D}_{KL}^{\rho_{\theta_{\text{old}}}}(\theta_{\text{old}}, \theta) \le \delta. \end{aligned} \tag{12}$$
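
As a rough sketch of how the average-KL constraint is evaluated in practice, the snippet below estimates the mean KL between an old and a proposed categorical policy over a batch of sampled states; the logits and the threshold value are placeholders, not the paper's setup:

```python
# Sample-based estimate of E_{s ~ rho_old}[ D_KL(pi_old(.|s) || pi_new(.|s)) ]
# for a categorical policy, given per-state action logits under old and new parameters.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(logits_old, logits_new):
    p_old, p_new = softmax(logits_old), softmax(logits_new)
    kl_per_state = (p_old * (np.log(p_old) - np.log(p_new))).sum(axis=-1)
    return kl_per_state.mean()

# toy batch of per-state logits standing in for the outputs of a policy network
rng = np.random.default_rng(1)
logits_old = rng.normal(size=(64, 4))
logits_new = logits_old + 0.05 * rng.normal(size=(64, 4))

delta = 0.01   # trust-region radius on the average KL (an assumed value)
kl = mean_kl(logits_old, logits_new)
print("mean KL:", kl, "constraint satisfied:", kl <= delta)
```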

Sample-Based Estimation of the Objective and Constraint

The previous section proposed a constrained optimization problem on the policy parameters (Equation (12)), which optimizes an estimate of the expected total reward η subject to a constraint on the change in the policy at each update. This section describes how the objective and constraint functions can be approximated using Monte Carlo simulation.

We begin by expressing the objective in Equation (12) in expanded form:

$$\sum_s \rho_{\theta_{\text{old}}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\theta_{\text{old}}}(s, a). \tag{13}$$

We first replace the sum over states $\sum_s \rho_{\theta_{\text{old}}}(s)[\ldots]$ by an expectation over $s \sim \rho_{\theta_{\text{old}}}$ (which only rescales the objective by a constant). Next we replace the advantages $A_{\theta_{\text{old}}}$ by the Q-values $Q_{\theta_{\text{old}}}$, which again only changes the objective by a constant. Last, we replace the sum over actions by an importance sampling estimator with sampling distribution $q$:

$$\sum_a \pi_\theta(a \mid s_n)\, A_{\theta_{\text{old}}}(s_n, a) = \mathbb{E}_{a \sim q}\Big[ \frac{\pi_\theta(a \mid s_n)}{q(a \mid s_n)}\, A_{\theta_{\text{old}}}(s_n, a) \Big].$$

The optimization problem in Equation (12) is then exactly equivalent to the following, written in terms of expectations:

$$\begin{aligned} &\underset{\theta}{\text{maximize}} \;\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim q}\Big[ \frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\text{old}}}(s, a) \Big] \\ &\text{subject to} \;\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\big[ D_{KL}\big( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \big] \le \delta. \end{aligned} \tag{14}$$

All that remains is to replace the expectations by sample averages and replace the Q-values by an empirical estimate. The following sections describe two different schemes for performing this estimation.
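
A minimal sketch of these sample averages (the helper name and the toy numbers are assumptions, not the paper's implementation): given log-probabilities, empirical Q estimates, and per-state KL values gathered from rollouts, the estimated objective and constraint of Equation (14) are just means:

```python
# Sample-based estimates for Equation (14): importance-weighted surrogate objective
# and the sample-average KL constraint, computed from quantities gathered in rollouts.
import numpy as np

def surrogate_and_kl(logp_new, logp_old, q_values, kl_per_state):
    """logp_new/logp_old: log pi_theta(a_n|s_n) and log q(a_n|s_n) for sampled pairs;
    q_values: empirical Q estimates; kl_per_state: D_KL(pi_old || pi_theta) at sampled states."""
    ratio = np.exp(logp_new - logp_old)      # pi_theta(a|s) / q(a|s)
    surrogate = np.mean(ratio * q_values)    # sample average of the objective
    mean_kl = np.mean(kl_per_state)          # sample average of the constraint
    return surrogate, mean_kl

# toy numbers standing in for quantities collected by the single path procedure
rng = np.random.default_rng(2)
N = 256
logp_old = rng.normal(-1.5, 0.3, size=N)
logp_new = logp_old + rng.normal(0.0, 0.05, size=N)
q_hat = rng.normal(1.0, 0.5, size=N)         # Monte Carlo returns as Q estimates
kl_samples = np.abs(rng.normal(0.0, 0.005, size=N))

obj, kl = surrogate_and_kl(logp_new, logp_old, q_hat, kl_samples)
print(f"surrogate estimate = {obj:.4f}, mean KL = {kl:.5f}")
```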

Single Path

In the single path procedure, individual trajectories are generated by sampling s_0 ∼ ρ_0 and simulating the policy π_{θ_old} for some number of timesteps; all state-action pairs along each trajectory are used, and the Q-values are estimated from the discounted sums of future rewards along the trajectory.

Vine

In the vine procedure, a "rollout set" of states is chosen along the sampled trajectories, and several short rollouts branching from each of these states (with different actions) are used to estimate the Q-values.

Practical Algorithm

The practical algorithm repeatedly performs the following steps:

1. Use the single path or vine procedure to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.
2. By averaging over samples, construct the estimated objective and constraint in Equation (14).
3. Approximately solve this constrained optimization problem to update the policy parameter vector θ, using the conjugate gradient algorithm followed by a line search.
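
Step 3 is where the trust region enters: under a quadratic model of the KL constraint, the radius δ fixes the maximal step along the search direction, and a backtracking line search enforces both improvement of the surrogate and the constraint. The sketch below illustrates this on a toy quadratic problem, with a small dense matrix standing in for the Fisher/KL-Hessian matrix (all of it assumed for illustration, not the paper's code):

```python
# Toy illustration of step-size selection and backtracking line search under a
# quadratic KL model: 0.5 * step^T F step <= delta determines the maximal step.
import numpy as np

rng = np.random.default_rng(3)
n, delta = 6, 0.01

g = rng.normal(size=n)               # stand-in for the surrogate gradient
M = rng.normal(size=(n, n))
F = M @ M.T + n * np.eye(n)          # stand-in for the Fisher / KL-Hessian matrix (SPD)

def surrogate(step):
    """Toy surrogate improvement model: linear term minus curvature penalty."""
    return g @ step - 0.5 * step @ F @ step

def kl(step):
    """Quadratic approximation of the mean KL: 0.5 * step^T F step."""
    return 0.5 * step @ F @ step

direction = np.linalg.solve(F, g)                        # natural-gradient search direction
beta = np.sqrt(2 * delta / (direction @ F @ direction))  # largest step with kl(step) <= delta
step = beta * direction

# backtracking line search: shrink until the surrogate improves and the KL constraint holds
while surrogate(step) <= 0 or kl(step) > delta:
    step *= 0.5

print("surrogate improvement:", surrogate(step), "KL:", kl(step))
```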

Connections with Prior Work

Natural Policy Gradient

The natural policy gradient (Kakade, 2002) can be obtained as a special case of the update in Equation (12) by using a linear approximation to L and a quadratic approximation to the D_KL constraint:

$$\begin{aligned} &\underset{\theta}{\text{maximize}} \;\; \Big[ \nabla_\theta L_{\theta_{\text{old}}}(\theta)\big|_{\theta=\theta_{\text{old}}} \cdot (\theta - \theta_{\text{old}}) \Big] \\ &\text{subject to} \;\; \tfrac{1}{2}(\theta_{\text{old}} - \theta)^T A(\theta_{\text{old}}) (\theta_{\text{old}} - \theta) \le \delta, \end{aligned}$$

where $A(\theta_{\text{old}})_{ij} = \frac{\partial}{\partial \theta_i}\frac{\partial}{\partial \theta_j} \mathbb{E}_{s \sim \rho_\pi}\big[ D_{KL}\big( \pi(\cdot \mid s, \theta_{\text{old}}) \,\|\, \pi(\cdot \mid s, \theta) \big) \big] \big|_{\theta = \theta_{\text{old}}}$.

The resulting update is $\theta_{\text{new}} = \theta_{\text{old}} + \frac{1}{\lambda} A(\theta_{\text{old}})^{-1} \nabla_\theta L(\theta)\big|_{\theta=\theta_{\text{old}}}$, where the natural policy gradient typically treats the step size $\frac{1}{\lambda}$ as an algorithm parameter, whereas trust region policy optimization enforces the KL constraint at each update.
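
In practice A(θ_old) is never formed or inverted explicitly; the paper computes the search direction approximately with the conjugate gradient algorithm, using Fisher-vector products. A minimal conjugate gradient sketch (assumed illustrative code, with a small explicit positive-definite matrix standing in for A):

```python
# Conjugate gradient for A x = g given only the map v -> A v, as used to obtain the
# (approximate) natural-gradient direction without explicitly inverting A.
import numpy as np

def conjugate_gradient(matvec, g, iters=20, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()          # residual g - A x (x = 0 initially)
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

rng = np.random.default_rng(4)
n = 8
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)      # stand-in for the KL Hessian / Fisher matrix
g = rng.normal(size=n)           # stand-in for the policy gradient

direction = conjugate_gradient(lambda v: A @ v, g)
print("max |A d - g| =", np.abs(A @ direction - g).max())
```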
