Introduction to MAB

2018-07-26  朱小虎XiaohuZhu

Multi-armed bandits (MAB)

First, define the model:

n arms: 1, ..., n; each arm i has a reward distribution \mathscr{D}_i supported on [0,1] that is unknown to the player, with mean:

\mu_i = \mathbb{E} \mathscr{D}_i

Without loss of generality, assume \mu_1 \geq \mu_2 \geq \mu_3 \geq \cdots \geq \mu_n

The idea behind solving the multi-armed bandit problem is to gradually identify the best arm through repeated trials.
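To make the model concrete, here is a minimal sketch of such an environment, assuming Bernoulli rewards (the text only requires distributions supported on [0,1]); the class and method names are illustrative, not from the original.

```python
import random

class BernoulliBandit:
    """n arms; arm i pays reward 1 with unknown probability mu_i, else 0."""
    def __init__(self, mus):
        self.mus = mus            # true means mu_1, ..., mu_n, hidden from the player
        self.n = len(mus)

    def pull(self, i):
        """Pull arm i (0-indexed) and observe a reward in {0, 1}."""
        return 1.0 if random.random() < self.mus[i] else 0.0
```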


Next we introduce an (\varepsilon, \delta)-Probably Approximately Correct (PAC) algorithm.

Goal: let j be the arm returned by the algorithm; we want
Pr[\mu_1 - \mu_j \leq \varepsilon] \geq 1 - \delta

Uniform sampling

  1. Pull each arm \frac{4 \ln (n/\delta)}{\varepsilon^2} times
  2. Return \arg\max_{i\in[n]} \{\hat{\mu}_i\}
    Here \hat{\mu}_i denotes the empirical mean of arm i (a runnable sketch is given below).
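A minimal sketch of the uniform-sampling procedure, reusing the BernoulliBandit environment sketched above; the function name and signature are illustrative, not a reference implementation.

```python
import math

def uniform_sampling(bandit, eps, delta):
    """Pull each arm ceil(4 ln(n/delta) / eps^2) times and return the index
    of the arm with the largest empirical mean."""
    n = bandit.n
    m = math.ceil(4 * math.log(n / delta) / eps ** 2)      # pulls per arm
    mu_hat = [sum(bandit.pull(i) for _ in range(m)) / m    # empirical means
              for i in range(n)]
    return max(range(n), key=lambda i: mu_hat[i])          # arg max_i of mu_hat_i
```

For instance, `uniform_sampling(BernoulliBandit([0.9, 0.5, 0.3]), eps=0.1, delta=0.05)` should, with probability at least 1 - \delta, return an arm whose mean is within \varepsilon of 0.9.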

Scenario 1: \mu_1 = 0.9, \mu_2 = 0.89999
Scenario 2: \mu_1 = 0.9, \mu_2 = 0.5
(In scenario 1 the top two arms are nearly indistinguishable, but the PAC goal only requires an arm within \varepsilon of the best; in scenario 2 the gap is large and the best arm is easy to identify.)

Correctness:
Pr[|\mu_i - \hat{\mu}_i| \leq \frac{\varepsilon}{2}] \geq 1 - 2\exp\Big(-2 \cdot \frac{\varepsilon^2}{4} \cdot \frac{4\ln \frac{n}{\delta}}{\varepsilon^2}\Big) = 1 - \frac{2\delta^2}{n^2}

Hoeffding's inequality
Suppose x_1, x_2, \ldots, x_m \in [0,1] are independent; then
Pr\Big[\Big|\frac{1}{m}\sum_{i=1}^{m} x_i - \frac{1}{m}\mathbb{E}\sum_{i=1}^{m} x_i\Big| > \gamma\Big] \leq 2\exp(-2\gamma^2 m)

So, by a union bound over the n arms:
Pr[\forall i \in [n]: \mu_i \in \hat{\mu}_i \pm \frac{\varepsilon}{2}] \geq 1 - \frac{2\delta^2}{n} \geq 1 - \delta

When this event holds, let j = \arg\max_{i} \{\hat{\mu}_i\}; then
\mu_j \geq \hat{\mu}_j - \frac{\varepsilon}{2} \geq \hat{\mu}_1 - \frac{\varepsilon}{2} \geq \mu_1 - \frac{\varepsilon}{2} - \frac{\varepsilon}{2} = \mu_1 - \varepsilon

Sample complexity: \mathcal{O}\big(\frac{n}{\varepsilon^2} \ln \frac{n}{\delta}\big)
Lower bound: \Omega\big(\frac{n}{\varepsilon^2} \ln \frac{1}{\delta}\big)
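For a sense of scale, plug in some illustrative numbers (not from the original), say n = 10, \varepsilon = 0.1, \delta = 0.05:

\frac{4 \ln (n/\delta)}{\varepsilon^2} = \frac{4 \ln 200}{0.01} \approx 2120 \text{ pulls per arm}, \qquad n \cdot \frac{4 \ln (n/\delta)}{\varepsilon^2} \approx 21200 \text{ pulls in total}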


Exploration-and-commit for regret minimization

Algorithm:

  1. Run uniform sampling (US) as pure exploration with parameters \varepsilon and \delta = \frac{1}{T} (exploration step)
  2. Commit to the returned arm and keep pulling it until time T (exploitation step); a code sketch follows this list
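A minimal sketch of the two steps, assuming the BernoulliBandit environment and the `math` import from the earlier sketches; the names and bookkeeping are illustrative.

```python
def explore_then_commit(bandit, T, eps):
    """Phase 1: pull every arm 4 ln(nT)/eps^2 times (uniform sampling with
    delta = 1/T).  Phase 2: commit to the empirically best arm until time T."""
    n = bandit.n
    m = math.ceil(4 * math.log(n * T) / eps ** 2)   # pulls per arm, delta = 1/T
    mu_hat = [0.0] * n
    rewards, t = 0.0, 0
    for i in range(n):                               # exploration step
        for _ in range(m):
            r = bandit.pull(i)
            mu_hat[i] += r / m
            rewards += r
            t += 1
    best = max(range(n), key=lambda i: mu_hat[i])
    while t < T:                                     # exploitation step
        rewards += bandit.pull(best)
        t += 1
    return rewards
```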

Analysis:

Phase 1 makes n \cdot \frac{4\ln(nT)}{\varepsilon^2} pulls, each incurring regret at most 1:
\text{Regret}_{\text{phase 1}} \leq \frac{4 n \ln (nT)}{\varepsilon^2}

\text{Regret}_{\text{phase 2}} \leq \Big(T - \frac{4 n \ln (nT)}{\varepsilon^2}\Big)\varepsilon + \Pr[\text{step 1 failed}] \cdot T \leq T\varepsilon + 1
(the last step uses \Pr[\text{step 1 failed}] \leq \delta = \frac{1}{T})

Total regret: \frac{4n\ln(nT)}{\varepsilon^2} + T\varepsilon + 1
Choosing \varepsilon = \Big(\frac{n\ln(nT)}{T}\Big)^{1/3} gives
\text{Regret}_{\text{total}} \leq \mathcal{O}\big(n^{1/3}\,T^{2/3}\,\ln^{1/3}(nT)\big)
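This choice of \varepsilon balances the exploration and exploitation terms (dropping the constant 4):

\frac{n\ln(nT)}{\varepsilon^2} = T\varepsilon \iff \varepsilon = \Big(\frac{n\ln(nT)}{T}\Big)^{1/3}, \qquad T\varepsilon = n^{1/3}\, T^{2/3}\, \ln^{1/3}(nT)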

\text{Regret}_{\text{average}}: \mathcal{O}\Big(\big(\tfrac{n \ln(nT)}{T}\big)^{1/3}\Big)
