2021-01-10  MasterXiong

Chapter 7: n-step Bootstrapping

n-step TD methods span a spectrum with MC methods at one end and one-step TD methods at the other.

n-step TD Prediction

The target of the n-step TD update combines the sampled rewards of the first n steps with a bootstrapped value estimate of the state reached after n steps, i.e., G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}). Correspondingly, the update rule is V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha [G_{t:t+n} - V_{t+n-1}(S_t)]. Note that only the value of S_t changes in this update (which is made at time t+n), while the values of all the other states remain unchanged, i.e., V_{t+n}(s) = V_{t+n-1}(s) for all s \neq S_t. An issue here is that the bootstrapped estimate V_{t+n-1}(S_{t+n}) may have been updated long ago and thus may not reflect the true value of S_{t+n} under the current policy \pi_{t+n-1}. This is not covered in the RL book and I'm not sure whether it causes any problem in practice.
One-step TD (as introduced in the last chapter) can be seen as the special case of n-step TD with n=1, i.e., G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1}). MC can be seen as the extreme of n-step TD in the opposite direction, where n equals the episode length, i.e., G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T.
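To make this concrete, here is a minimal sketch of tabular n-step TD prediction following the book's tau-indexed bookkeeping. The environment interface (`env.reset()` / `env.step()` returning next state, reward, and a done flag), the `policy` callable, and the dict-like value table `V` are my own assumptions, not the book's pseudocode.

```python
def n_step_td_prediction(env, policy, V, n, alpha, gamma, num_episodes):
    """Tabular n-step TD prediction (sketch); V maps state -> value, updated in place."""
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]   # rewards[i] holds R_i; index 0 is a placeholder
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1            # time whose state value gets updated now
            if tau >= 0:
                # n-step return: up to n sampled rewards plus a bootstrapped tail
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```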

n-step Sarsa

n-step Sarsa is a natural generalization of one-step Sarsa, with the target G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}) and the corresponding update rule Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)]. The RL book gives a gridworld example, shown below, to illustrate the advantage of n-step Sarsa over one-step Sarsa: when the reward is sparse, n-step Sarsa speeds up the propagation of reward information to earlier states.

n-step Sarsa example.
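A corresponding sketch of on-policy n-step Sarsa, under the same assumed environment interface as the prediction sketch above; `epsilon_greedy(Q, s)` is a hypothetical helper that returns an epsilon-greedy action from the current Q table.

```python
def n_step_sarsa_episode(env, Q, epsilon_greedy, n, alpha, gamma):
    """One episode of n-step Sarsa (sketch); Q maps (state, action) -> value."""
    s = env.reset()
    a = epsilon_greedy(Q, s)
    states, actions, rewards = [s], [a], [0.0]   # rewards[i] holds R_i
    T, t = float('inf'), 0
    while True:
        if t < T:
            s_next, r, done = env.step(actions[t])
            states.append(s_next)
            rewards.append(r)
            if done:
                T = t + 1
            else:
                actions.append(epsilon_greedy(Q, s_next))
        tau = t - n + 1
        if tau >= 0:
            # n-step return bootstrapped with Q(S_{tau+n}, A_{tau+n})
            G = sum(gamma ** (i - tau - 1) * rewards[i]
                    for i in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:
                G += gamma ** n * Q[(states[tau + n], actions[tau + n])]
            sa = (states[tau], actions[tau])
            Q[sa] += alpha * (G - Q[sa])
        if tau == T - 1:
            break
        t += 1
    return Q
```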

n-step Off-policy Control

For n-step off-policy control, we need to take importance sampling into consideration, which leads to the following update rule: Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)], where \rho_{t:h} = \prod_{k=t}^{\min(h,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}. Note that the ratio starts at step t+1: we do not have to care how likely we were to select the action A_t; now that we have selected it, we want to learn fully from what happens, with importance sampling only for subsequent actions. This also explains why one-step Q-learning needs no ratio term: it bootstraps with an expectation over actions under the (greedy) target policy rather than with the sampled next action, so, as with Expected Sarsa, its ratio has one fewer factor, \rho_{t+1:t+n-1}, which for n=1 is an empty product and equals 1.
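To make the correction concrete, here is a sketch of the ratio computation, assuming `pi(a, s)` and `b(a, s)` return the target- and behavior-policy probabilities and the trajectory lists follow the Sarsa sketch above; these names are my own.

```python
def importance_ratio(pi, b, states, actions, start, end, T):
    """rho_{start:end} = prod_{k=start}^{min(end, T-1)} pi(A_k|S_k) / b(A_k|S_k).

    pi(a, s) and b(a, s) give the target/behavior probability of action a in state s;
    states and actions are the stored trajectory lists. Empty products return 1.0.
    """
    rho = 1.0
    for k in range(start, min(end, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    return rho

# Inside the tau-update of the n-step Sarsa sketch, the only change is:
#   rho = importance_ratio(pi, b, states, actions, tau + 1, tau + n, T)
#   Q[sa] += alpha * rho * (G - Q[sa])
```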

Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

Importance sampling is required in off-policy n-step Sarsa because its target is built from sampled actions, so we have to multiply by the importance sampling ratio to make the update's expectation unbiased w.r.t. the target policy. A natural way to avoid importance sampling is therefore to perform an expected update w.r.t. the target policy \pi at each bootstrapping step, which gives the n-step tree-backup algorithm.
In its simplest case with n=1, tree backup is exactly expected Sarsa, i.e., G_{t:t+1} = R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q_t(S_{t+1}, a).
For a = A_{t+1}, we can further expand the corresponding Q_t(S_{t+1}, a) term in the equation above to get a two-step target. Recursively, we can get the tree backup target as follows: G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1}) Q_{t+n-1}(S_{t+1},a) + \gamma \pi(A_{t+1}|S_{t+1}) G_{t+1:t+n}. And the update rule without importance sampling is Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)].
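The recursion can be written directly as a function of a stored trajectory. This is only a sketch under the same assumed trajectory representation as the earlier sketches (states/actions/rewards lists with rewards[i] = R_i); `pi(a, s)` returns the target-policy probability \pi(a|s) and `action_space` enumerates all actions.

```python
def tree_backup_return(Q, pi, action_space, states, actions, rewards,
                       t, n, T, gamma):
    """G_{t:t+n} for the n-step tree-backup algorithm (no importance sampling)."""
    if t + 1 >= T:
        # last step of the episode: the return is just the final reward R_T
        return rewards[t + 1]
    s_next, a_next = states[t + 1], actions[t + 1]
    # expected value of all actions NOT taken, weighted by the target policy
    expected = sum(pi(a, s_next) * Q[(s_next, a)]
                   for a in action_space if a != a_next)
    if n == 1:
        # base case: one-step tree backup is exactly Expected Sarsa,
        # so the taken action also contributes through its Q value
        expected += pi(a_next, s_next) * Q[(s_next, a_next)]
        return rewards[t + 1] + gamma * expected
    # recursive case: the taken action's branch is expanded one more level
    tail = tree_backup_return(Q, pi, action_space, states, actions, rewards,
                              t + 1, n - 1, T, gamma)
    return rewards[t + 1] + gamma * expected + gamma * pi(a_next, s_next) * tail
```

In practice the book computes this return iteratively backwards from the end of the n-step window, but the recursive form above mirrors the equation directly.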
