Chapter 2

2020-09-02  本文已影响0人  MasterXiong

Chapter 2: Multi-armed Bandits

Multi-armed bandits can be seen as the simplest form of reinforcement learning, where there is only a single state. The key point here is how to estimate the action values. This chapter mainly aims to introduce some key concepts in RL, such as value estimation, exploration-exploitation dilemma and policy gradient, under a simplified setting.

Different methods for value estimation:

  1. Sample average: Q_{n+1} = \frac{1}{n} \sum_{i=1}^n R_i
    Pros: Easy to understand and implement; unbiased estimation; convergence guanrantee with infinite time step under stationary environment
    Cons: Time and memory complexity increases with the step number
  2. Incremental form: Q_{n+1} = Q_n + \frac{1}{n} (R_n - Q_n)
    This is equivalent to sample average but introduces an important update rule in RL, i.e., NewEstimate \leftarrow OldEstimate + StepSize \ [ Target - OldEstimate ] However, we need to note that the Target here is not the true action value q_*(a), but an unbiased estimation R_n, thus will introduce noise into the update rule.
    Pros: constant time and momory cost
    Cons: step size decreases with time step, thus is not suitable for nonstationary environments
  3. Constant step size: Q_{n+1} = Q_n + \alpha \ (R_n - Q_n)
    Pros: exponential weighted average over historical rewards, thus is suitable for nonstationary environments
    Cons: biased estimation of action values; no convergence guarantee

Different methods for exploration:

  1. \epsilon-greedy: the effect of different values of \epsilon on the learning curve; time-variant \epsilon scheme
  2. Optimistic intial values: encourage exploration at the early stage, but not a general method
    initial values as prior knowledge (MAML)
  3. Upper-confidence-bound (UCB): Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}
    Closely related to other methods like Bayesian optimization

Gradient Bandit Algorithms

Instead of selecting action based on the learned action values, directly learn a numerical preference for each action.
Follow the derivation in the book, which formulates it as a SGD method.
Baseline can help reduce variance.

Contextual Bandits

At each time step, the agent is presented with a random bandit from a set of different bandits, and the identity of the bandit is known.
It is different from multi-armed bandits, as different states (bandits) are involved.
It is different from RL as the action in each step only influences the immediate reward, but has no effect on the next state and the rewards in the future.

上一篇 下一篇

猜你喜欢

热点阅读