[Notes] Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

2018-10-22 · 半山来客

Full paper: https://arxiv.org/pdf/1204.5721.pdf

Introduction

Regret after n plays:

R_n = \max_{i=1,\dots,K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}

where K is the number of arms, X_{i,t} is the reward of arm i at round t, and I_t is the arm chosen by the forecaster at round t.

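As a concrete reading of this definition, here is a minimal sketch (the function name and setup are mine, not from the paper) that computes the realized regret R_n from a table of rewards and a sequence of choices:

```python
import numpy as np

def realized_regret(X, choices):
    """R_n for a K x n reward table X and a length-n sequence of
    chosen arms (0-indexed): best fixed arm in hindsight minus
    the total reward actually collected."""
    n = X.shape[1]
    collected = X[choices, np.arange(n)].sum()   # forecaster's total reward
    best_fixed = X.sum(axis=1).max()             # max_i sum_t X_{i,t}
    return best_fixed - collected
```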
In general, both the rewards X_{i,t} and the forecaster's choices I_t may be stochastic. This suggests distinguishing between the following two notions of averaged regret:

Expected regret:

\mathbb{E}[R_n] = \mathbb{E}\left[\max_{i=1,\dots,K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}\right]

The expectation of the regret, taken with respect to the action which is optimal on the realized sequence of rewards.
Pseudo-regret:

\bar{R}_n = \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}\right]

A weaker notion of regret, since one competes against the action which is optimal only in expectation.

? At first this explanation did not feel right to me: can the chosen arm i vary at each round t? It cannot: in both notions the comparator is a single arm i held fixed for all n rounds; the two definitions differ only in whether the \max over i sits inside the expectation (expected regret) or outside it (pseudo-regret).

Since a maximum of expectations never exceeds the expectation of the maximum, \bar{R}_n \le \mathbb{E}[R_n], i.e. Pseudo-regret ≤ Expected regret.
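A quick Monte Carlo check of this inequality (toy Bernoulli rewards and a uniformly random forecaster; all names and parameters below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, runs = 2, 100, 5000
means = np.array([0.5, 0.6])                  # Bernoulli arm means

gaps = []                                     # per-run reward gaps, one per arm i
for _ in range(runs):
    X = (rng.random((K, n)) < means[:, None]).astype(float)  # reward table
    I = rng.integers(K, size=n)               # uniformly random forecaster
    collected = X[I, np.arange(n)].sum()
    gaps.append(X.sum(axis=1) - collected)    # sum_t X_{i,t} - sum_t X_{I_t,t}

gaps = np.array(gaps)                         # shape (runs, K)
print("expected regret ~", gaps.max(axis=1).mean())  # E[max_i ...]
print("pseudo-regret   ~", gaps.mean(axis=0).max())  # max_i E[...]
```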

Stochastic bandits (UCB algorithm)

Setting: K probability distributions \nu_1, \dots, \nu_K on [0,1], unknown to the forecaster. At each round t, the forecaster picks an arm I_t and receives a reward X_{I_t,t} \sim \nu_{I_t}, drawn independently of the past. Here \mu_i denotes the mean of \nu_i (the mean reward of arm i), and \mu^* = \max_{i=1,\dots,K} \mu_i.

Pseudo-regret in this setting:

\bar{R}_n = n\mu^* - \mathbb{E}\left[\sum_{t=1}^{n} \mu_{I_t}\right]


The technique of upper confidence bounds for the asymptotic analysis of regret was introduced by Lai and Robbins [125]:
[125] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, pp. 4–22, 1985.
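A minimal UCB1-style sketch on Bernoulli arms (a sketch under my own assumptions: the classic index \hat\mu_i + \sqrt{2\ln t / T_i(t)}, Bernoulli rewards; not the survey's exact pseudo-code):

```python
import numpy as np

def ucb1(means, n, seed=0):
    """Classic UCB1 on Bernoulli arms; returns the pseudo-regret
    n * mu_star - sum_t mu_{I_t} for this single run."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)                       # T_i(t): pulls of arm i so far
    sums = np.zeros(K)                         # cumulative reward of arm i
    mu_pulled = 0.0                            # sum_t mu_{I_t}
    for t in range(n):
        if t < K:
            i = t                              # pull each arm once to initialize
        else:
            index = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            i = int(np.argmax(index))          # optimism in the face of uncertainty
        sums[i] += float(rng.random() < means[i])   # Bernoulli reward
        counts[i] += 1
        mu_pulled += means[i]
    return n * max(means) - mu_pulled

print(ucb1([0.3, 0.5, 0.7], n=10_000))         # grows like sum_i log(n)/Delta_i
```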

Adversarial bandits (Exp3 randomized algorithm)

Setting: at each round t, the forecaster picks an arm I_t (possibly at random) while, simultaneously, an adversary selects the reward vector (X_{1,t}, \dots, X_{K,t}) \in [0,1]^K; the forecaster then observes only the reward X_{I_t,t} of the chosen arm.

In this adversarial setting, the goal is to obtain regret bounds in high probability or in expectation with respect to any possible randomization in the strategies used by the forecaster or the opponent, and irrespective of the opponent. In the case of a non-oblivious adversary this is not an easy task, and for this reason we usually start by bounding the pseudo-regret:


\bar{R}_n = \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}\right]

where the expectation is with respect to the random draws of both the forecaster and the adversary.
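A minimal Exp3 sketch, in the loss-based form with importance-weighted estimates (the learning rate \eta = \sqrt{2\ln K/(nK)} matches the survey's tuning; everything else here, including the toy adversary, is illustrative):

```python
import numpy as np

def exp3(losses, eta, seed=0):
    """Exp3 on a K x n table of adversarial losses in [0, 1]:
    exponential weights over importance-weighted loss estimates."""
    rng = np.random.default_rng(seed)
    K, n = losses.shape
    L_hat = np.zeros(K)                        # cumulative estimated losses
    total = 0.0
    for t in range(n):
        w = np.exp(-eta * (L_hat - L_hat.min()))  # shifted for numerical stability
        p = w / w.sum()
        i = rng.choice(K, p=p)                 # sample I_t ~ p_t
        total += losses[i, t]
        L_hat[i] += losses[i, t] / p[i]        # unbiased estimate of l_{i,t}
    return total - losses.sum(axis=1).min()    # regret vs best fixed arm

# Toy oblivious adversary: arm 0 has slightly smaller losses on average.
rng = np.random.default_rng(1)
K, n = 3, 5000
losses = rng.random((K, n))
losses[0] *= 0.9
print(exp3(losses, eta=np.sqrt(2 * np.log(K) / (n * K))))
```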

Markovian bandits (Gittins indices)


Terminology
