Lecture 4: Model-Free Prediction
2020-04-13 魏鹏飞
Author: David Silver
Outline
- Introduction
- Monte-Carlo Learning
- Temporal-Difference Learning
- TD(λ)
Model-Free Reinforcement Learning
- Last lecture: Planning by dynamic programming
  - Solve a known MDP
- This lecture: Model-free prediction
  - Estimate the value function of an unknown MDP
- Next lecture: Model-free control
  - Optimise the value function of an unknown MDP
Monte-Carlo Reinforcement Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs
  - All episodes must terminate
Monte-Carlo Policy Evaluation
- Goal: learn $v_\pi$ from episodes of experience under policy $\pi$: $S_1, A_1, R_2, \ldots, S_k \sim \pi$
- Recall that the return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$
- Recall that the value function is the expected return: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- **Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return**
First-Visit Monte-Carlo Policy Evaluation
- To evaluate state $s$
- The **first** time-step $t$ that state $s$ is visited in an episode:
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- By law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
Every-Visit Monte-Carlo Policy Evaluation
- To evaluate state $s$
- **Every** time-step $t$ that state $s$ is visited in an episode:
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- Again, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
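As an illustration (this code is not from the lecture), here is a minimal Python sketch of Monte-Carlo prediction supporting both first-visit and every-visit counting; the `sample_episode` callable and its episode format are assumptions made for the example.

```python
from collections import defaultdict

def mc_prediction(sample_episode, num_episodes, gamma=1.0, first_visit=True):
    """Monte-Carlo policy evaluation from complete episodes.

    sample_episode() is assumed to return a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    N = defaultdict(int)      # visit counters N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s) = S(s) / N(s)

    for _ in range(num_episodes):
        episode = sample_episode()

        # Compute the return G_t for every time-step by scanning backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        visited = set()
        for state, G in returns:
            if first_visit and state in visited:
                continue              # first-visit MC: count only the first visit per episode
            visited.add(state)
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V
```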
Blackjack Example

Blackjack Value Function after Monte-Carlo Learning

Incremental Mean

Incremental Monte-Carlo Updates
- Update $V(s)$ incrementally after each episode
- For each state $S_t$ with return $G_t$:
  - $N(S_t) \leftarrow N(S_t) + 1$
  - $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)$
- In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:
  - $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$
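As a tiny sketch of the running-mean update (the names `V`, `s`, `G` are illustrative, not from the lecture):

```python
def incremental_update(V, s, G, alpha=0.1):
    """Constant step-size MC update: V(s) <- V(s) + alpha * (G - V(s)).

    With a fixed alpha, old episodes are forgotten geometrically,
    which is useful when the environment is non-stationary.
    """
    V[s] += alpha * (G - V[s])
    return V
```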
Temporal-Difference Learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from *incomplete* episodes, by **bootstrapping**
- TD updates a guess towards a guess
MC and TD

Driving Home Example

Driving Home Example: MC vs. TD

Advantages and Disadvantages of MC vs. TD
- TD can learn *before* knowing the final outcome
  - TD can learn online after every step
  - MC must wait until the end of the episode before the return is known
- TD can learn *without* the final outcome
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
- Return $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$ is an **unbiased** estimate of $v_\pi(S_t)$
- True TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an **unbiased** estimate of $v_\pi(S_t)$
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a **biased** estimate of $v_\pi(S_t)$
- TD target has much **lower variance** than the return:
  - Return depends on **many** random actions, transitions, rewards
  - TD target depends on **one** random action, transition, reward
Advantages and Disadvantages of MC vs. TD (2)
- MC has high variance, zero bias
  - Good convergence properties
    - (even with function approximation)
  - Not very sensitive to initial value
  - Very simple to understand and use
- TD has low variance, some bias
  - Usually more efficient than MC
  - TD(0) converges to $v_\pi(s)$
    - (but not always with function approximation)
  - More sensitive to initial value
Random Walk Example

Random Walk: MC vs. TD

Batch MC and TD

AB Example

AB Example

Certainty Equivalence

Advantages and Disadvantages of MC vs. TD (3)
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more effective in non-Markov environments
Monte-Carlo Backup

Temporal-Difference Backup

Dynamic Programming Backup

Bootstrapping and Sampling

Unified View of Reinforcement Learning

n-Step Prediction

n-Step Return

Large Random Walk Example

Averaging n-Step Returns

λ-return

TD(λ) Weighting Function

Forward-view TD(λ)

Forward-View TD(λ) on Large Random Walk

Backward View TD(λ)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
Eligibility Traces

Backward View TD(λ)

TD(λ) and TD(0)

TD(λ) and MC

MC and TD(1)

Telescoping in TD(1)

TD(λ) and TD(1)
- TD(1) is roughly equivalent to every-visit Monte-Carlo
- Error is accumulated online, step-by-step
- If value function is only updated offline at end of episode
- Then total update is exactly the same as MC
Telescoping in TD(λ)

Forwards and Backwards TD(λ)

Offline Equivalence of Forward and Backward TD
**Offline updates**
- Updates are accumulated within episode
- but applied in batch at the end of episode
Online Equivalence of Forward and Backward TD
**Online updates**
- TD(λ) updates are applied online at each step within the episode
- Forward and backward-view TD(λ) are slightly different
- NEW: Exact online TD(λ) achieves perfect equivalence
  - By using a slightly different form of eligibility trace
  - Sutton and van Seijen, ICML 2014
Summary of Forward and Backward TD(λ)

Reference: *UCL Course on RL* (David Silver)