Deterministic Policy Gradient Algorithms

Posted by weixin_33850890 on 2019-01-08

Background

Optimization objective

$$
J(\pi_\theta) = \int_\mathcal{S} \rho^\pi(s) \int_\mathcal{A} \pi_\theta(a \mid s)\, r(s, a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\left[ r(s, a) \right]
$$

Stochastic policy gradient theorem

$$
\nabla_\theta J(\pi_\theta) = \int_\mathcal{S} \rho^\pi(s) \int_\mathcal{A} \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s, a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right]
$$

This formula reduces the stochastic policy gradient to the simple computation of an expectation.
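As a concrete illustration, here is a minimal Monte-Carlo sketch of that expectation, assuming a one-dimensional Gaussian policy with linear mean, a synthetic state distribution standing in for $\rho^\pi$, and a placeholder critic `q_hat`; all of these concrete choices are illustrative, not part of the paper.

```python
import numpy as np

# Minimal sketch: estimate E_{s ~ rho^pi, a ~ pi_theta}[ grad_theta log pi(a|s) Q(s, a) ]
# by sampling.  The Gaussian policy, state distribution, and critic are assumptions.

rng = np.random.default_rng(0)
state_dim, sigma = 4, 0.5
theta = rng.normal(size=state_dim)

def phi(s):                       # state features (here simply the raw state)
    return s

def q_hat(s, a):                  # placeholder for the critic Q^pi(s, a)
    return -(a - s.sum()) ** 2

def grad_log_pi(theta, s, a):     # score function of the Gaussian policy N(theta . phi(s), sigma^2)
    return (a - theta @ phi(s)) / sigma**2 * phi(s)

n_samples = 1000
grad = np.zeros_like(theta)
for _ in range(n_samples):
    s = rng.normal(size=state_dim)                 # s ~ rho^pi (stand-in)
    a = theta @ phi(s) + sigma * rng.normal()      # a ~ pi_theta(.|s)
    grad += grad_log_pi(theta, s, a) * q_hat(s, a)
grad /= n_samples
```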

Off-Policy Actor-Critic

Following Degris et al. (2012b), the off-policy policy gradient is approximated as

$$
\nabla_\theta J_\beta(\pi_\theta) \approx \int_\mathcal{S} \int_\mathcal{A} \rho^\beta(s)\, \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s, a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^\beta,\, a \sim \beta}\left[ \frac{\pi_\theta(a \mid s)}{\beta_\theta(a \mid s)}\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right]
$$
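A minimal sketch of a single off-policy actor-gradient sample under the same kind of toy assumptions (Gaussian target and behaviour policies with a shared mean, a placeholder critic value): the score-function term is reweighted by the importance ratio $\pi_\theta(a \mid s)/\beta(a \mid s)$ because actions come from $\beta$.

```python
import numpy as np

# One off-policy actor-gradient sample.  The Gaussian policies (same mean, different
# noise scales) and the placeholder critic value are illustrative assumptions.

rng = np.random.default_rng(1)
state_dim, sigma_pi, sigma_beta = 4, 0.3, 1.0
theta = rng.normal(size=state_dim)

def gaussian_pdf(a, mean, sigma):
    return np.exp(-0.5 * ((a - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

s = rng.normal(size=state_dim)
a = theta @ s + sigma_beta * rng.normal()          # a ~ beta(.|s): broader exploration noise

ratio = gaussian_pdf(a, theta @ s, sigma_pi) / gaussian_pdf(a, theta @ s, sigma_beta)
score = (a - theta @ s) / sigma_pi**2 * s          # grad_theta log pi_theta(a|s)
q_value = -(a - s.sum()) ** 2                      # placeholder critic Q^pi(s, a)
actor_grad_sample = ratio * score * q_value        # one sample of the expectation above
```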

Gradients of Deterministic Policies

Action-Value Gradients

In the continuous case, a simple alternative is to move the policy parameters in the direction of the gradient of $Q$:

$$
\theta^{k+1} = \theta^{k} + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^k}}\left[ \nabla_\theta Q^{\mu^k}\!\big(s, \mu_\theta(s)\big) \right]
$$

Applying the chain rule, this decomposes into the gradient of the action-value with respect to actions and the gradient of the policy with respect to its parameters:

$$
\theta^{k+1} = \theta^{k} + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^k}}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu^k}(s, a)\big|_{a = \mu_\theta(s)} \right]
$$

However, the theory below shows that, like the stochastic policy gradient theorem, there is no need to compute the gradient of the state distribution; and that the intuitive update outlined above is following precisely the gradient of the performance objective.
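A minimal sketch of this update, assuming a linear deterministic policy $\mu_\Theta(s) = \Theta s$ and a critic that is quadratic in the action, so that both gradients are available in closed form; the matrices `A`, `B` and the state distribution are illustrative assumptions.

```python
import numpy as np

# Move the policy parameters in the direction of grad_Theta Q(s, mu_Theta(s)),
# written via the chain rule grad_theta mu * grad_a Q.  All concrete choices are assumptions.

rng = np.random.default_rng(2)
state_dim, action_dim, alpha = 4, 2, 0.01
Theta = rng.normal(size=(action_dim, state_dim))   # deterministic policy parameters
A = -np.eye(action_dim)                            # Q(s, a) = a^T A a + a^T B s (concave in a)
B = rng.normal(size=(action_dim, state_dim))

def grad_a_Q(s, a):                                # grad_a Q(s, a) for the quadratic critic
    return 2 * A @ a + B @ s

for _ in range(100):
    s = rng.normal(size=state_dim)                 # stand-in for s ~ rho^{mu^k}
    a = Theta @ s                                  # a = mu_Theta(s)
    # chain rule: grad_Theta Q(s, mu_Theta(s)) = outer(grad_a Q(s, a)|_{a=mu_Theta(s)}, s)
    Theta += alpha * np.outer(grad_a_Q(s, a), s)
```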

Taking the limit of a stochastic policy

$$
\lim_{\sigma \downarrow 0} \nabla_\theta J(\pi_{\mu_\theta, \sigma}) = \nabla_\theta J(\mu_\theta)
$$

Here $\pi_{\mu_\theta, \sigma}$ is a stochastic policy whose variance parameter $\sigma$ shrinks to zero around the deterministic policy $\mu_\theta$, so the deterministic policy gradient is a limiting case of the stochastic policy gradient.

Deterministic Policy Gradient Theorem

Under suitable regularity conditions, the deterministic policy gradient exists and is given by

$$
\nabla_\theta J(\mu_\theta) = \int_\mathcal{S} \rho^\mu(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)}\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^\mu}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \right]
$$
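A quick numerical sanity check of the theorem on a one-dimensional toy problem (linear policy, quadratic critic, an assumed standard-normal state distribution): the theorem's expectation should agree with a finite-difference gradient of $J(\mu_\theta)$.

```python
import numpy as np

# With mu_theta(s) = theta * s and Q(s, a) = -(a - s)^2, compare the theorem's
# expression to a finite-difference gradient of J.  The setup is an assumption.

rng = np.random.default_rng(3)
states = rng.normal(size=10_000)          # samples standing in for s ~ rho^mu
theta = 0.3

def J(theta):
    a = theta * states
    return np.mean(-(a - states) ** 2)

# Theorem: grad_theta J = E_s[ grad_theta mu_theta(s) * grad_a Q(s, a)|_{a=mu_theta(s)} ]
#                       = E_s[ s * (-2) * (theta * s - s) ]
grad_theorem = np.mean(states * (-2.0) * (theta * states - states))

eps = 1e-5
grad_fd = (J(theta + eps) - J(theta - eps)) / (2 * eps)
print(grad_theorem, grad_fd)              # the two estimates agree on this sample of states
```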

Deterministic Actor-Critic Algorithms

On-Policy Deterministic Actor-Critic

The critic estimates the action-value function by SARSA-style temporal-difference updates, while the actor ascends the gradient of the estimated action-value:

$$
\begin{aligned}
\delta_t &= r_t + \gamma Q^w(s_{t+1}, a_{t+1}) - Q^w(s_t, a_t) \\
w_{t+1} &= w_t + \alpha_w\, \delta_t\, \nabla_w Q^w(s_t, a_t) \\
\theta_{t+1} &= \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^w(s_t, a_t)\big|_{a = \mu_\theta(s)}
\end{aligned}
$$
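A minimal sketch of these three updates, assuming a scalar action, a linear policy $\mu_\theta(s) = \theta \cdot s$, a critic linear in hand-crafted features, and a toy environment; none of these concrete choices come from the paper, and exploration here relies entirely on environment noise, as the paper itself cautions.

```python
import numpy as np

# On-policy deterministic actor-critic with a SARSA critic.  Policy, critic features,
# environment, and step sizes are illustrative assumptions.

rng = np.random.default_rng(4)
state_dim, gamma, alpha_w, alpha_th = 3, 0.9, 0.01, 0.001
theta = 0.5 * rng.normal(size=state_dim)           # actor parameters
w = np.zeros(2 * state_dim + 1)                    # critic parameters

def psi(s, a):                                     # critic features, differentiable in a
    return np.concatenate([s, [a], a * s])

def grad_a_q(w, s):                                # d/da of w . psi(s, a); independent of a here
    return w[state_dim] + w[state_dim + 1:] @ s

s = rng.normal(size=state_dim)
a = theta @ s                                      # on-policy: act with the deterministic policy
for t in range(1000):
    s_next = s + 0.1 * rng.normal(size=state_dim)  # toy transition; environment noise is the
    r = -(a - s.sum()) ** 2                        # only source of exploration here
    a_next = theta @ s_next
    delta = r + gamma * w @ psi(s_next, a_next) - w @ psi(s, a)   # SARSA TD error
    w += alpha_w * delta * psi(s, a)                              # critic update
    theta += alpha_th * s * grad_a_q(w, s)         # actor: grad_theta mu * grad_a Q^w at a=mu(s)
    s, a = s_next, a_next
```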

Off-Policy Deterministic Actor-Critic

The performance objective becomes the value of the target policy, averaged over the state distribution of the behaviour policy:

$$
J_\beta(\mu_\theta) = \int_\mathcal{S} \rho^\beta(s)\, V^\mu(s)\, \mathrm{d}s = \int_\mathcal{S} \rho^\beta(s)\, Q^\mu\!\big(s, \mu_\theta(s)\big)\, \mathrm{d}s
$$

Differentiating and dropping the term that depends on the gradient of the action-value function gives the off-policy deterministic policy gradient:

$$
\nabla_\theta J_\beta(\mu_\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \right]
$$

The off-policy deterministic actor-critic (OPDAC) updates the critic by Q-learning and the actor along this gradient:

$$
\begin{aligned}
\delta_t &= r_t + \gamma Q^w\!\big(s_{t+1}, \mu_\theta(s_{t+1})\big) - Q^w(s_t, a_t) \\
w_{t+1} &= w_t + \alpha_w\, \delta_t\, \nabla_w Q^w(s_t, a_t) \\
\theta_{t+1} &= \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^w(s_t, a_t)\big|_{a = \mu_\theta(s)}
\end{aligned}
$$

We note that stochastic off-policy actor-critic algorithms typically use importance sampling for both actor and critic (Degris et al., 2012b). However, because the deterministic policy gradient removes the integral over actions, we can avoid importance sampling in the actor; and by using Q-learning, we can avoid importance sampling in the critic.
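A matching sketch of OPDAC under the same toy assumptions as the on-policy sketch above: actions are drawn from a broader behaviour policy, the critic target evaluates the target policy's action $\mu_\theta(s_{t+1})$ (Q-learning style), and no importance weights appear in either update.

```python
import numpy as np

# Off-policy deterministic actor-critic (OPDAC).  Policy, critic features, behaviour
# noise, environment, and step sizes are illustrative assumptions.

rng = np.random.default_rng(5)
state_dim, gamma, alpha_w, alpha_th = 3, 0.9, 0.01, 0.001
theta = 0.5 * rng.normal(size=state_dim)           # target (deterministic) policy parameters
w = np.zeros(2 * state_dim + 1)                    # critic weights for Q^w(s, a) = w . psi(s, a)

def psi(s, a):                                     # critic features, differentiable in a
    return np.concatenate([s, [a], a * s])

def grad_a_q(w, s):                                # d/da of w . psi(s, a); independent of a here
    return w[state_dim] + w[state_dim + 1:] @ s

s = rng.normal(size=state_dim)
for t in range(1000):
    a = theta @ s + rng.normal()                   # behaviour policy beta: target action + broad noise
    s_next = s + 0.1 * rng.normal(size=state_dim)  # toy transition (assumed)
    r = -(a - s.sum()) ** 2                        # toy reward (assumed)
    # Q-learning critic: the TD target uses the *target* policy's action mu_theta(s')
    delta = r + gamma * w @ psi(s_next, theta @ s_next) - w @ psi(s, a)
    w += alpha_w * delta * psi(s, a)
    # actor: grad_theta mu_theta(s) * grad_a Q^w(s, a)|_{a=mu_theta(s)}, no importance ratio
    theta += alpha_th * s * grad_a_q(w, s)
    s = s_next
```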

Compatible Function Approximation

A function approximator $Q^w(s, a)$ is compatible with the deterministic policy $\mu_\theta$, i.e. $\nabla_\theta J_\beta(\theta) = \mathbb{E}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^w(s, a)\big|_{a = \mu_\theta(s)} \right]$, if

1. $\nabla_a Q^w(s, a)\big|_{a = \mu_\theta(s)} = \nabla_\theta \mu_\theta(s)^{\top} w$, and
2. $w$ minimises the mean-squared error $\mathrm{MSE}(\theta, w) = \mathbb{E}\left[ \epsilon(s; \theta, w)^{\top} \epsilon(s; \theta, w) \right]$, where $\epsilon(s; \theta, w) = \nabla_a Q^w(s, a)\big|_{a = \mu_\theta(s)} - \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)}$.
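A minimal sketch of a critic of the compatible form $Q^w(s, a) = \big(a - \mu_\theta(s)\big)^{\top} \nabla_\theta \mu_\theta(s)^{\top} w + V^v(s)$ for a scalar-action linear policy with a linear baseline (both of which are illustrative assumptions); the finite-difference check verifies condition 1.

```python
import numpy as np

# Compatible critic sketch: Q^w(s, a) = (a - mu_theta(s)) * grad_theta mu_theta(s)^T w + V^v(s).
# The linear policy mu_theta(s) = theta . s and baseline V^v(s) = v . s are assumptions.

rng = np.random.default_rng(6)
state_dim = 3
theta = rng.normal(size=state_dim)    # policy parameters
w = rng.normal(size=state_dim)        # "advantage" weights, same shape as theta
v = rng.normal(size=state_dim)        # baseline weights

def mu(s):
    return theta @ s

def q_compatible(s, a):
    # grad_theta mu_theta(s) = s for the linear policy, so grad mu^T w = s . w
    return (a - mu(s)) * (s @ w) + v @ s

s = rng.normal(size=state_dim)
eps = 1e-6
grad_a = (q_compatible(s, mu(s) + eps) - q_compatible(s, mu(s) - eps)) / (2 * eps)
print(np.isclose(grad_a, s @ w))      # condition 1: grad_a Q^w at a = mu(s) equals grad mu^T w
```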
