Lecture 4: Model-Free Prediction
2020-04-13 魏鹏飞
Author: David Silver
Outline
- Introduction
- Monte-Carlo Learning
- Temporal-Difference Learning
- TD(λ)
Model-Free Reinforcement Learning
- Last lecture: Planning by dynamic programming
  - Solve a known MDP
- This lecture: Model-free prediction
  - Estimate the value function of an unknown MDP
- Next lecture: Model-free control
  - Optimise the value function of an unknown MDP
Monte-Carlo Reinforcement Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs
  - All episodes must terminate
Monte-Carlo Policy Evaluation
- Goal: learn $v_\pi$ from episodes of experience under policy $\pi$: $S_1, A_1, R_2, \ldots, S_k \sim \pi$
- Recall that the return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$
- Recall that the value function is the expected return: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- **Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return**
First-Visit Monte-Carlo Policy Evaluation
- To evaluate state $s$
- The **first** time-step $t$ that state $s$ is visited in an episode:
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- By law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
Every-Visit Monte-Carlo Policy Evaluation
- To evaluate state $s$
- **Every** time-step $t$ that state $s$ is visited in an episode:
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- Again, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
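As an illustration (this code is not from the lecture), here is a minimal Python sketch of Monte-Carlo prediction supporting both first-visit and every-visit counting; the `sample_episode` callable and its episode format are assumptions made for the example.

```python
from collections import defaultdict

def mc_prediction(sample_episode, num_episodes, gamma=1.0, first_visit=True):
    """Monte-Carlo policy evaluation from complete episodes.

    sample_episode() is assumed to return a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    N = defaultdict(int)      # visit counters N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s) = S(s) / N(s)

    for _ in range(num_episodes):
        episode = sample_episode()

        # Compute the return G_t for every time-step by scanning backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        visited = set()
        for state, G in returns:
            if first_visit and state in visited:
                continue              # first-visit MC: count only the first visit per episode
            visited.add(state)
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V
```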
Blackjack Example

Blackjack Value Function after Monte-Carlo Learning

Incremental Mean

Incremental Monte-Carlo Updates
- Update $V(s)$ incrementally after each episode
- For each state $S_t$ with return $G_t$:
  - $N(S_t) \leftarrow N(S_t) + 1$
  - $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)$
- In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:
  - $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$
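As a tiny sketch of the running-mean update (the names `V`, `s`, `G` are illustrative, not from the lecture):

```python
def incremental_update(V, s, G, alpha=0.1):
    """Constant step-size MC update: V(s) <- V(s) + alpha * (G - V(s)).

    With a fixed alpha, old episodes are forgotten geometrically,
    which is useful when the environment is non-stationary.
    """
    V[s] += alpha * (G - V[s])
    return V
```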
Temporal-Difference Learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from *incomplete* episodes, by **bootstrapping**
- TD updates a guess towards a guess
MC and TD

Driving Home Example

Driving Home Example: MC vs. TD

Advantages and Disadvantages of MC vs. TD
- TD can learn *before* knowing the final outcome
  - TD can learn online after every step
  - MC must wait until the end of the episode before the return is known
- TD can learn *without* the final outcome
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
- Return $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$ is an **unbiased** estimate of $v_\pi(S_t)$
- True TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an **unbiased** estimate of $v_\pi(S_t)$
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a **biased** estimate of $v_\pi(S_t)$
- TD target has much **lower variance** than the return:
  - Return depends on **many** random actions, transitions, rewards
  - TD target depends on **one** random action, transition, reward
Advantages and Disadvantages of MC vs. TD (2)
- MC has high variance, zero bias
  - Good convergence properties
    - (even with function approximation)
  - Not very sensitive to initial value
  - Very simple to understand and use
- TD has low variance, some bias
  - Usually more efficient than MC
  - TD(0) converges to $v_\pi(s)$
    - (but not always with function approximation)
  - More sensitive to initial value
Random Walk Example

Random Walk: MC vs. TD

Batch MC and TD

AB Example

AB Example

Certainty Equivalence

Advantages and Disadvantages of MC vs. TD (3)
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more effective in non-Markov environments
Monte-Carlo Backup

Temporal-Difference Backup

Dynamic Programming Backup

Bootstrapping and Sampling

Unified View of Reinforcement Learning

n-Step Prediction

n-Step Return

Large Random Walk Example

Averaging n-Step Returns

λ-return

TD(λ) Weighting Function

Forward-view TD(λ)

Forward-View TD(λ) on Large Random Walk

Backward View TD(λ)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
Eligibility Traces

Backward View TD(λ)

TD(λ) and TD(0)

TD(λ) and MC

MC and TD(1)

Telescoping in TD(1)

TD(λ) and TD(1)
- TD(1) is roughly equivalent to every-visit Monte-Carlo
- Error is accumulated online, step-by-step
- If value function is only updated offline at end of episode
- Then total update is exactly the same as MC
Telescoping in TD(λ)

Forwards and Backwards TD(λ)

Offline Equivalence of Forward and Backward TD
**Offline updates**
- Updates are accumulated within episode
- but applied in batch at the end of episode
Online Equivalence of Forward and Backward TD
**Online updates**
- TD(λ) updates are applied online at each step within the episode
- Forward and backward-view TD(λ) are slightly different
- NEW: Exact online TD(λ) achieves perfect equivalence
  - By using a slightly different form of eligibility trace
  - Sutton and van Seijen, ICML 2014
Summary of Forward and Backward TD(λ)

Reference: *UCL Course on RL* (David Silver)