Lecture 4: Model-Free Prediction
2020-04-13 魏鹏飞
Author: David Silver
Outline
- Introduction
- Monte-Carlo Learning
- Temporal-Difference Learning
- TD(λ)
Model-Free Reinforcement Learning
- Last lecture: Planning by dynamic programming
  - Solve a known MDP
- This lecture: Model-free prediction
  - Estimate the value function of an unknown MDP
- Next lecture: Model-free control
  - Optimise the value function of an unknown MDP
Monte-Carlo Reinforcement Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs
  - All episodes must terminate
Monte-Carlo Policy Evaluation
- Goal: learn $v_\pi$ from episodes of experience under policy $\pi$: $S_1, A_1, R_2, \ldots, S_k \sim \pi$
- Recall that the return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$
- Recall that the value function is the expected return: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- **Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return**
First-Visit Monte-Carlo Policy Evaluation
- To evaluate state $s$
- The **first** time-step $t$ that state $s$ is visited in an episode:
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- By law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
Every-Visit Monte-Carlo Policy Evaluation
- To evaluate state $s$
- **Every** time-step $t$ that state $s$ is visited in an episode:
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- Again, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
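As an illustration (this code is not from the lecture), here is a minimal Python sketch of Monte-Carlo prediction supporting both first-visit and every-visit counting; the `sample_episode` callable and its episode format are assumptions made for the example.

```python
from collections import defaultdict

def mc_prediction(sample_episode, num_episodes, gamma=1.0, first_visit=True):
    """Monte-Carlo policy evaluation from complete episodes.

    sample_episode() is assumed to return a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    N = defaultdict(int)      # visit counters N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s) = S(s) / N(s)

    for _ in range(num_episodes):
        episode = sample_episode()

        # Compute the return G_t for every time-step by scanning backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        visited = set()
        for state, G in returns:
            if first_visit and state in visited:
                continue              # first-visit MC: count only the first visit per episode
            visited.add(state)
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V
```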
Blackjack Example

Blackjack Value Function after Monte-Carlo Learning

Incremental Mean

Incremental Monte-Carlo Updates
- Update $V(s)$ incrementally after each episode
- For each state $S_t$ with return $G_t$:
  - $N(S_t) \leftarrow N(S_t) + 1$
  - $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)$
- In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:
  - $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$
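As a tiny sketch of the running-mean update (the names `V`, `s`, `G` are illustrative, not from the lecture):

```python
def incremental_update(V, s, G, alpha=0.1):
    """Constant step-size MC update: V(s) <- V(s) + alpha * (G - V(s)).

    With a fixed alpha, old episodes are forgotten geometrically,
    which is useful when the environment is non-stationary.
    """
    V[s] += alpha * (G - V[s])
    return V
```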
Temporal-Difference Learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from *incomplete* episodes, by **bootstrapping**
- TD updates a guess towards a guess
MC and TD

Driving Home Example

Driving Home Example: MC vs. TD

Advantages and Disadvantages of MC vs. TD
- TD can learn *before* knowing the final outcome
  - TD can learn online after every step
  - MC must wait until the end of the episode before the return is known
- TD can learn *without* the final outcome
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
- Return $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$ is an **unbiased** estimate of $v_\pi(S_t)$
- True TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an **unbiased** estimate of $v_\pi(S_t)$
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a **biased** estimate of $v_\pi(S_t)$
- TD target has much **lower variance** than the return:
  - Return depends on **many** random actions, transitions, rewards
  - TD target depends on **one** random action, transition, reward
Advantages and Disadvantages of MC vs. TD (2)
- MC has high variance, zero bias
  - Good convergence properties
    - (even with function approximation)
  - Not very sensitive to initial value
  - Very simple to understand and use
- TD has low variance, some bias
  - Usually more efficient than MC
  - TD(0) converges to $v_\pi(s)$
    - (but not always with function approximation)
  - More sensitive to initial value
Random Walk Example

Random Walk: MC vs. TD

Batch MC and TD

AB Example

AB Example

Certainty Equivalence

Advantages and Disadvantages of MC vs. TD (3)
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more effective in non-Markov environments
Monte-Carlo Backup

Temporal-Difference Backup

Dynamic Programming Backup

Bootstrapping and Sampling

Unified View of Reinforcement Learning

n-Step Prediction

n-Step Return

Large Random Walk Example

Averaging n-Step Returns

λ-return

TD(λ) Weighting Function

Forward-view TD(λ)

Forward-View TD(λ) on Large Random Walk

Backward View TD(λ)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
Eligibility Traces

Backward View TD(λ)

TD(λ) and TD(0)

TD(λ) and MC

MC and TD(1)

Telescoping in TD(1)

TD(λ) and TD(1)
- TD(1) is roughly equivalent to every-visit Monte-Carlo
- Error is accumulated online, step-by-step
- If value function is only updated offline at end of episode
- Then total update is exactly the same as MC
Telescoping in TD(λ)

Forwards and Backwards TD(λ)

Offline Equivalence of Forward and Backward TD
**Offline updates**
- Updates are accumulated within episode
- but applied in batch at the end of episode
Online Equivalence of Forward and Backward TD
**Online updates**
- TD(λ) updates are applied online at each step within the episode
- Forward and backward-view TD(λ) are slightly different
- NEW: Exact online TD(λ) achieves perfect equivalence
  - By using a slightly different form of eligibility trace
  - Sutton and van Seijen, ICML 2014
Summary of Forward and Backward TD(λ)

Reference: *UCL Course on RL* (David Silver)