Deterministic Policy Gradient Algorithms

2019-01-08  初七123

Background

Performance objective
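In the paper's notation, with $\rho^\pi(s)$ the discounted state distribution and $r(s,a)$ the reward, the objective can be written as an expectation:

$$ J(\pi_\theta) = \int_S \rho^\pi(s) \int_A \pi_\theta(a|s)\, r(s,a)\, da\, ds = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\big[ r(s,a) \big] $$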


Stochastic policy gradient theorem
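The stochastic policy gradient theorem (Sutton et al., 1999), as restated in the DPG paper, is:

$$ \nabla_\theta J(\pi_\theta) = \int_S \rho^\pi(s) \int_A \nabla_\theta \pi_\theta(a|s)\, Q^\pi(s,a)\, da\, ds = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s,a) \big] $$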


This formula reduces the stochastic policy gradient to the simple computation of an expectation, which can be estimated from samples.
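As a concrete illustration, here is a minimal sampled estimate of that expectation. It assumes a Gaussian policy with a linear mean, a fixed standard deviation, and an externally supplied `q_value` estimate of $Q^\pi$; all of these are illustrative assumptions, not details from the paper.

```python
import numpy as np

def score_function_gradient(theta, states, sigma, q_value, rng):
    """Monte-Carlo estimate of E[ grad_theta log pi_theta(a|s) * Q(s, a) ].

    Illustrative assumptions: scalar actions, a Gaussian policy
    a ~ N(theta . s, sigma^2), and q_value(s, a) approximating Q^pi.
    """
    grad = np.zeros_like(theta)
    for s in states:                             # states drawn from rho^pi
        mean = theta @ s                         # policy mean, linear in state features
        a = rng.normal(mean, sigma)              # sample an action from pi_theta(.|s)
        grad_log_pi = (a - mean) / sigma**2 * s  # grad_theta log N(a; theta . s, sigma^2)
        grad += grad_log_pi * q_value(s, a)
    return grad / len(states)
```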

Off-Policy Actor-Critic
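In the off-policy setting of Degris et al. (2012b), performance is measured under the behaviour policy's state distribution $\rho^\beta$, and the gradient is approximated with a single importance-sampling ratio:

$$ J_\beta(\pi_\theta) = \int_S \rho^\beta(s)\, V^\pi(s)\, ds = \int_S \int_A \rho^\beta(s)\, \pi_\theta(a|s)\, Q^\pi(s,a)\, da\, ds $$

$$ \nabla_\theta J_\beta(\pi_\theta) \approx \mathbb{E}_{s \sim \rho^\beta,\, a \sim \beta}\!\left[ \frac{\pi_\theta(a|s)}{\beta_\theta(a|s)}\, \nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s,a) \right] $$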

Gradients of Deterministic Policies

Action-Value Gradients

For continuous action spaces, greedy policy improvement is impractical because it requires a global maximisation over actions at every step. Instead, the policy parameters are moved in a direction proportional to the gradient of the action value, $\nabla_\theta Q^{\mu^k}(s, \mu_\theta(s))$. Applying the chain rule then gives the update below.
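In expectation over the state distribution of the current policy $\mu^k$, the update and its chain-rule factorisation are:

$$ \theta^{k+1} = \theta^k + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^k}}\big[ \nabla_\theta Q^{\mu^k}(s, \mu_\theta(s)) \big] $$

$$ \theta^{k+1} = \theta^k + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^k}}\big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu^k}(s,a)\big|_{a=\mu_\theta(s)} \big] $$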

From the paper: "However, the theory below shows that, like the stochastic policy gradient theorem, there is no need to compute the gradient of the state distribution; and that the intuitive update outlined above is following precisely the gradient of the performance objective."

Taking the limit of a stochastic policy: Theorem 2 of the paper shows that as the variance of a stochastic policy $\pi_{\mu_\theta,\sigma}$ shrinks to zero, its policy gradient converges to the deterministic policy gradient, so the deterministic gradient is a limiting case of the stochastic one.

Deterministic Policy Gradient Theorem
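Theorem 1 of the paper states that, under mild regularity conditions, the gradient of the performance objective for a deterministic policy $\mu_\theta$ is:

$$ \nabla_\theta J(\mu_\theta) = \int_S \rho^\mu(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}\, ds = \mathbb{E}_{s \sim \rho^\mu}\big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)} \big] $$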

Deterministic Actor-Critic Algorithms

On-Policy Deterministic Actor-Critic
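The on-policy algorithm replaces the true $Q^\mu$ with a differentiable critic $Q^w$ learned by Sarsa, giving the updates:

$$ \delta_t = r_t + \gamma Q^w(s_{t+1}, a_{t+1}) - Q^w(s_t, a_t) $$

$$ w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^w(s_t, a_t) $$

$$ \theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^w(s_t, a_t)\big|_{a=\mu_\theta(s_t)} $$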

Off-Policy Deterministic Actor-Critic

The performance objective becomes the value of the target policy, averaged over the state distribution of the behaviour policy:
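Because the target policy is deterministic, the inner integral over actions disappears:

$$ J_\beta(\mu_\theta) = \int_S \rho^\beta(s)\, V^\mu(s)\, ds = \int_S \rho^\beta(s)\, Q^\mu(s, \mu_\theta(s))\, ds $$

$$ \nabla_\theta J_\beta(\mu_\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)} \big] $$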

From the paper: "We note that stochastic off-policy actor-critic algorithms typically use importance sampling for both actor and critic (Degris et al., 2012b). However, because the deterministic policy gradient removes the integral over actions, we can avoid importance sampling in the actor; and by using Q-learning, we can avoid importance sampling in the critic."
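A minimal sketch of one off-policy deterministic actor-critic (OPDAC) step illustrating this point: no importance-sampling ratio appears anywhere, and the critic bootstraps with the target policy's action (Q-learning). The scalar action, linear policy, and the particular linear critic parametrisation are illustrative assumptions, not the paper's prescription.

```python
import numpy as np

def opdac_step(theta, w, transition, phi, gamma=0.99, alpha_theta=1e-4, alpha_w=1e-3):
    """One OPDAC update from a single behaviour-policy transition (sketch).

    Assumes mu_theta(s) = theta . phi(s) (scalar action) and a critic that is
    linear in [phi(s), a * phi(s)]: Q^w(s, a) = w1 . phi(s) + (w2 . phi(s)) * a.
    """
    s, a, r, s_next = transition           # generated by an arbitrary behaviour policy
    f, f_next = phi(s), phi(s_next)
    w1, w2 = np.split(w, 2)

    def q(feats, act):                     # Q^w(s, a) under the assumed parametrisation
        return w1 @ feats + (w2 @ feats) * act

    a_next = theta @ f_next                # a' = mu_theta(s'): target-policy action

    # Q-learning critic: no importance sampling, bootstrap with mu_theta(s').
    delta = r + gamma * q(f_next, a_next) - q(f, a)
    grad_w_q = np.concatenate([f, a * f])  # grad_w Q^w(s, a)
    w = w + alpha_w * delta * grad_w_q

    # Deterministic actor update: grad_theta mu(s) * grad_a Q^w(s, a)|_{a = mu_theta(s)}
    grad_a_q = w2 @ f
    theta = theta + alpha_theta * f * grad_a_q

    return theta, w
```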

Compatible Function Approximation
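The paper calls a critic $Q^w$ compatible with $\mu_\theta$ when the two conditions below hold; in that case substituting $Q^w$ for the true $Q^\mu$ in the deterministic policy gradient leaves the gradient unchanged:

$$ \nabla_a Q^w(s,a)\big|_{a=\mu_\theta(s)} = \nabla_\theta \mu_\theta(s)^{\top} w $$

$$ w \text{ minimises } \operatorname{MSE}(\theta, w) = \mathbb{E}\big[ \epsilon(s;\theta,w)^{\top} \epsilon(s;\theta,w) \big], \qquad \epsilon(s;\theta,w) = \nabla_a Q^w(s,a)\big|_{a=\mu_\theta(s)} - \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)} $$

One family satisfying the first condition is $Q^w(s,a) = (a - \mu_\theta(s))^{\top} \nabla_\theta \mu_\theta(s)^{\top} w + V^v(s)$, which is the form used in the paper's compatible off-policy deterministic actor-critic (COPDAC) algorithms.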
