Deep Reinforcement Learning: From Fundamentals to the Frontier

Model-Free Prediction in Reinforcement Learning

2020-03-14 · 小小何先生
[Figure: outline of this section]

  In most reinforcement learning (RL) problems the model of the environment is unknown, so dynamic programming cannot be applied directly. One option is to learn the MDP first; the post 理解强化学习中的策略迭代和值迭代 in this series describes that idea in detail. But this approach runs into trouble during sampling: if you sample randomly, some states are very hard to reach, while if you sample according to some policy, it is hard to recover the true transition probabilities. And once the learned model is wrong, value iteration and policy iteration both go wrong with it.

  Hence Model-free Reinforcement Learning: interact with the environment directly and learn the value function or the policy directly from data, without learning a model.

Model-free Reinforcement Learning

  Model-free Reinforcement Learning estimates values (of states or state-action pairs) from data, i.e., the expected cumulative reward. Once these estimates are available, model-free control optimizes the current policy so as to maximize the value function.

  So how do we do prediction of the value function in the model-free setting?

  In model-free RL we have no access to the state transitions or the reward function; all we have are some episodes. Previously these episodes were used to learn a model; in model-free methods the episodes are used to learn the value function or the policy directly, without learning the MDP. There are two key steps: 1. estimate the value function; 2. optimize the policy.

Value Function Estimate

  In model-based RL (where the MDP is known), the value function can be calculated by dynamic programming:

v_{\pi}(s)=\mathbb{E}_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_{t}=s]
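
  As a reference point, here is a minimal sketch of iterative policy evaluation for this Bellman expectation equation, assuming the full model is available; the data layout (P[s][a] as a list of (prob, next_state, reward) tuples, policy[s][a] as action probabilities) is my own assumption for illustration:

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation with a known model (dynamic programming).

    Assumed layout: P[s][a] is a list of (prob, next_state, reward) tuples,
    and policy[s][a] is the probability of taking action a in state s.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, reward in P[s][a]:
                    # Bellman expectation backup: E_pi[R_{t+1} + gamma * v(S_{t+1})]
                    v_new += pi_sa * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```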

  In model-free methods we do not know the state transitions, so the expectation in the Bellman equation above cannot be computed directly.

Monte-Carlo Methods

  Monte-Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. For example, consider estimating the area of a circle, as shown in the following figure:

[Figure: Monte-Carlo estimation of a circle's area]

  Scatter points uniformly over the square in the figure, then compute the area with the following equation:

\text{Circle Surface} = \text{Square Surface} \times \frac{\text{ points in circle}}{\text{points in total}}
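
  A minimal Python sketch of this idea, estimating the area of a unit circle from its bounding square (the sample count is arbitrary):

```python
import random

def estimate_circle_area(n_samples=100_000, radius=1.0):
    """Estimate a circle's area by sampling points uniformly in its bounding square."""
    square_area = (2 * radius) ** 2
    in_circle = 0
    for _ in range(n_samples):
        # Sample a point uniformly in the square [-r, r] x [-r, r].
        x = random.uniform(-radius, radius)
        y = random.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            in_circle += 1
    # Circle Surface = Square Surface * (points in circle / points in total)
    return square_area * in_circle / n_samples

print(estimate_circle_area())  # close to pi for radius 1
```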

Monte-Carlo Value Estimation

  We have many episodes; from these episodes we compute the total discounted reward:

G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots=\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}

  The value function, i.e., the expected return, can be written as:

V^{\pi}(s) = \mathbb{E} [G_{t} \mid s_{t}=s,\pi] \approx \frac{1}{N} \sum_{i=1}^{N} G_{t}^{(i)}

  The above method can be summarized in two steps: 1. sample N episodes starting from state s under policy \pi; 2. compute the average of the cumulative rewards. This sampling-based approach computes the value directly in one go, without needing the MDP's state transition probabilities at all.
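
  A minimal sketch of this sample-and-average procedure; the episode format (a list of (state, reward) pairs, where the reward is the one received after leaving that state) is an assumption made here for illustration:

```python
from collections import defaultdict

def mc_value_estimation(episodes, gamma=0.99):
    """Every-visit Monte-Carlo value estimation: V(s) = mean of observed returns G_t.

    Assumes `episodes` is a list of episodes sampled under a fixed policy, each
    episode being a list of (state, reward) pairs.
    """
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)
    # value = mean return
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```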

  A more detailed, incremental version of the same idea can be written as follows:

Incremental Monte-Carlo Updates

\begin{array}{l} {N\left(S_{t}\right) \leftarrow N\left(S_{t}\right)+1} \\ {V\left(S_{t}\right) \leftarrow V\left(S_{t}\right)+\frac{1}{N\left(S_{t}\right)}\left(G_{t}-V\left(S_{t}\right)\right)} \end{array}

  If the environment's state transitions or reward function keep changing over time, the environment is called non-stationary (the environment itself is of course stochastic either way); if the distributions do not change, it is a stationary environment. When the distributions do change, older episodes need to be forgotten, but with plain averaging an old episode carries the same weight as a new one, so old episodes are never forgotten. In that case a constant step size \alpha is used instead:

V(S_{t}) \leftarrow V(S_{t}) + \alpha (G_{t} - V(S_{t}))
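
  A minimal sketch of these incremental updates, contrasting the running-mean step size 1/N(S_t) with a constant \alpha for non-stationary problems (the class and data layout are assumptions for illustration):

```python
from collections import defaultdict

class IncrementalMC:
    """Incremental Monte-Carlo prediction: update V after each completed episode."""

    def __init__(self, gamma=0.99, alpha=None):
        self.gamma = gamma
        self.alpha = alpha          # None -> step size 1/N(S_t); float -> constant alpha
        self.V = defaultdict(float)
        self.N = defaultdict(int)

    def update(self, episode):
        """`episode` is a list of (state, reward) pairs for one terminated episode."""
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + self.gamma * G
            self.N[state] += 1
            step = self.alpha if self.alpha is not None else 1.0 / self.N[state]
            # V(S_t) <- V(S_t) + step * (G_t - V(S_t))
            self.V[state] += step * (G - self.V[state])
```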

  Some properties of Monte-Carlo value estimation:

  1. MC methods learn directly from episodes of experience
  2. MC is model-free: no knowledge of MDP transitions / rewards
  3. MC learns from complete episodes: no bootstrapping (discussed later)
  4. MC uses the simplest possible idea: value = mean return
  5. Caveat: can only apply MC to episodic MDPs i.e., all episodes must terminate

Temporal-Difference Learning

  TD methods introduce an estimate of the future value function:

G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots \approx R_{t+1}+\gamma V(S_{t+1})

V(S_{t}) \leftarrow V(S_{t}) + \alpha(R_{t+1}+\gamma V(S_{t+1}) - V(S_{t}))
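
  A minimal TD(0) prediction sketch. It assumes a simplified environment interface where env.reset() returns a state and env.step(action) returns (next_state, reward, done), plus a behaviour policy policy(state); these names are assumptions, not a specific library API:

```python
from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, gamma=0.99, alpha=0.1):
    """TD(0) prediction: bootstrap from V(S_{t+1}) after every single step."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target: R_{t+1} + gamma * V(S_{t+1}); terminal states have value 0.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```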

  TD methods have the following four characteristics:

  1. TD methods learn directly from episodes of experience
  2. TD is model-free: no knowledge of MDP transitions / rewards
  3. TD learns from incomplete episodes, by bootstrapping
  4. TD updates a guess towards a guess

Monte Carlo vs. Temporal Difference

  The Monte Carlo and Temporal Difference updates are compared as follows. MC updates the value toward the actual return G_{t}:

V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}-V(S_{t}))

whereas TD(0) updates the value toward the estimated return R_{t+1}+\gamma V(S_{t+1}):

V(S_{t}) \leftarrow V(S_{t}) + \alpha(R_{t+1}+\gamma V(S_{t+1})-V(S_{t}))

Advantages and Disadvantages of MC vs. TD
Bias/Variance Trade-Off

  Sampling under the current policy and averaging the returns gives an unbiased estimate: the return G_{t} is an unbiased estimate of V^{\pi}(S_{t}).

  The TD target contains the estimate V(S_{t+1}) of the future value. If this estimate were exact, the TD target would also be an unbiased estimate, but since V(S_{t+1}) is hard to estimate exactly, the TD target is a biased estimate.

  On the other hand, the TD target generally has much lower variance than the return G_{t}: the return G_{t} depends on many random actions, transitions and rewards, whereas the TD target depends on only one random action, transition and reward.

Advantages and Disadvantages of MC vs. TD (2)

  MC has good convergence properties (even with function approximation), is not very sensitive to initial values, and is very simple to understand and use. However, it needs many samples to bring its high variance down.

  TD is usually more efficient than MC. TD converges to V^{\pi}(s), though not always with function approximation, and it is more sensitive to initial values than MC.

n-step model-free prediction

  (Due to time constraints, this n-step prediction section may be skipped in favour of going directly to model-free control.) The n-step return is defined as:

G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots +\gamma^{n-1}R_{t+n} + \gamma^{n} V(S_{t+n})

  The corresponding n-step TD update is:

V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}^{(n)} - V(S_{t}))
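
  A minimal sketch of n-step TD prediction computed from a stored episode; the (state, reward) episode format is the same assumption as in the sketches above:

```python
def n_step_td_update(V, episode, n=3, gamma=0.99, alpha=0.1):
    """One pass of n-step TD prediction over a stored episode.

    `episode` is a list of (state, reward) pairs; `V` is a dict of state values,
    updated in place and returned.
    """
    T = len(episode)
    states = [s for s, _ in episode]
    rewards = [r for _, r in episode]
    for t in range(T):
        # G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})
        G = 0.0
        for k in range(min(n, T - t)):
            G += (gamma ** k) * rewards[t + k]
        if t + n < T:
            # Bootstrap from V(S_{t+n}); beyond the terminal state the term is zero.
            G += (gamma ** n) * V.get(states[t + n], 0.0)
        v_old = V.get(states[t], 0.0)
        V[states[t]] = v_old + alpha * (G - v_old)
    return V
```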

  Once the value function has been estimated, the next step is policy improvement, i.e., model-free control.


