- A-priori model of the environment: a model that is revealed to the agent before its interaction with the environment starts.
- Generative model: given a state and an action, it returns the reward and a random successor state sampled according to the transition function.

- Declarative model
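Here we distinguish learning agents (which are given no a-priori model) from planning agents (which have access to a generative model or a declarative model).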
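As a rough illustration of what a generative model offers a planning agent, the sketch below samples a reward and a successor state for a given state-action pair. The `TRANSITIONS` table, the state and action names, and all numbers are hypothetical and chosen only for illustration; they do not come from these notes.

```python
import random

# Hypothetical transition function (illustrative only):
# (state, action) -> list of (probability, reward, next_state) outcomes.
TRANSITIONS = {
    ("s0", "a"): [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s1", "a"): [(1.0, 5.0, "s1")],
}


def generative_model(state, action, rng=random):
    """Sample one (reward, next_state) pair according to the transition function."""
    outcomes = TRANSITIONS[(state, action)]
    weights = [p for p, _, _ in outcomes]
    # Draw a single outcome with probability proportional to its weight.
    _, reward, next_state = rng.choices(outcomes, weights=weights, k=1)[0]
    return reward, next_state
```

A planning agent could call `generative_model("s0", "a")` as often as it likes to simulate outcomes before acting, whereas a learning agent only observes the outcomes of actions it actually executes.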
Figuratively, the agent faces a slot machine (a bandit) with multiple arms, and it has to decide which arm to pull based only on the payouts it received so far. Each arm provides a random reward according to a probability distribution that is specific to the arm and constant over time, but the agent has no initial information on the distribution.
Exploration-exploitation dilemma (the tension between exploring and collecting reward)
On the one hand, since the agent has no a-priori knowledge of the environment, it has to explore its possibilities to learn from the feedback the environment provides. On the other hand, since it aims to maximize the accumulated reward over all runs, it has an incentive to exploit the knowledge it has gathered by executing the action it believes to be best. In other words, the agent learns from its trial-and-error interaction with the environment, and it has to make sure that the balance between exploration and exploitation is such that it learns the best action without sacrificing too much reward in early decisions.
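One simple way to strike this balance is the epsilon-greedy rule, sketched below. It is only one of several possible strategies and is not claimed to be the method these notes go on to use. The function name `epsilon_greedy`, the `pull` callback, and the default `epsilon=0.1` are assumptions made for this example.

```python
import random


def epsilon_greedy(pull, n_arms, n_rounds, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a bandit.

    pull(arm) is a hypothetical callback that returns the reward of one pull.
    Returns the per-arm reward estimates, pull counts, and total reward.
    """
    rng = random.Random(seed)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, counts, total
```

With `epsilon = 0` the agent never explores and may lock onto the first arm it tries. A larger `epsilon` learns all arms more reliably but gives up more reward on random pulls.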

An example MAB problem with three arms. Edges are labeled p:r, where p is the transition probability and r is the reward.
Q*(a_l) = 10, Q*(a_m) = 52, Q*(a_r) = 50
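The value of an arm is its expected reward, that is, the sum of p times r over its outgoing edges. The sketch below shows that computation. The figure's edge labels are not reproduced in these notes, so the p:r pairs in `ARMS` are hypothetical and were chosen only so that the results match the three values stated above.

```python
# Hypothetical p:r edge labels per arm (illustrative only; the real figure's
# edges are not available here). Each entry is (probability, reward).
ARMS = {
    "a_l": [(1.0, 10.0)],
    "a_m": [(0.5, 100.0), (0.5, 4.0)],
    "a_r": [(1.0, 50.0)],
}


def expected_reward(edges):
    """Expected reward of an arm: sum of probability * reward over its edges."""
    return sum(p * r for p, r in edges)


q_values = {arm: expected_reward(edges) for arm, edges in ARMS.items()}
best_arm = max(q_values, key=q_values.get)
```

Under these assumed edges, `best_arm` is `a_m`, in line with the stated values where 52 is the largest.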