03-07 Dyna

2017-12-21 woodwood2000

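hallucinate: to experience something that is not real. In Dyna, the agent "hallucinates" experiences by simulating them from its learned model.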

Dyna-Q: a blend of model-free and model-based learning


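After each interaction with the real world, the agent performs 100 more updates on its own, using experiences simulated from its model.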


T'[s,a,s']: the probability of ending up in state s' after taking action a in state s
R'[s,a]: the reward for taking action a in state s
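
As a minimal sketch of how these two tables might be stored and used to produce one simulated experience: the problem size, the uniform initialization, and the helper name `hallucinate` are assumptions made for illustration, not part of the lecture.

```python
import random

NUM_STATES, NUM_ACTIONS = 4, 2  # hypothetical problem size

# T[s][a][s_next]: estimated probability of reaching s_next after taking a in s.
# Start uniform so every T[s][a] is a valid distribution before any data arrives.
T = [[[1.0 / NUM_STATES] * NUM_STATES for _ in range(NUM_ACTIONS)]
     for _ in range(NUM_STATES)]

# R[s][a]: estimated reward for taking action a in state s.
R = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]


def hallucinate(T, R, rng):
    """Draw one simulated experience tuple (s, a, s_next, r) from the model."""
    s = rng.randrange(len(T))        # pick a random state
    a = rng.randrange(len(T[s]))     # pick a random action
    # Sample the next state according to the model's distribution T[s][a].
    s_next = rng.choices(range(len(T[s][a])), weights=T[s][a])[0]
    return s, a, s_next, R[s][a]


rng = random.Random(0)
experience = hallucinate(T, R, rng)
```

The tuple returned by `hallucinate` has the same shape as a real experience tuple, so it can be fed to the same Q-learning update.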


T is updated from counts of how often each transition actually occurred in the real world.


Exercise: How to Evaluate T?


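Correction: The expression should be the number of times taking action a in state s led to state s', divided by the total number of times action a was taken in state s.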

Computing transition probabilities using counts
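
One way to compute this, as a sketch: keep a count table, increment it on every real transition, and divide by the total over next states when a probability is needed. The table name `Tc`, the helper names, and the tiny initial count `EPS` (used only to avoid dividing by zero) are assumptions of this example.

```python
NUM_STATES, NUM_ACTIONS = 3, 2  # hypothetical problem size
EPS = 1e-5  # small initial count so the division below never divides by zero

# Tc[s][a][s_next]: how often taking a in s led to s_next in the real world.
Tc = [[[EPS] * NUM_STATES for _ in range(NUM_ACTIONS)]
      for _ in range(NUM_STATES)]


def record_transition(Tc, s, a, s_next):
    """Count one real-world transition (s, a) -> s_next."""
    Tc[s][a][s_next] += 1


def transition_prob(Tc, s, a, s_next):
    """Estimate T[s, a, s_next] as its count over all counts for (s, a)."""
    return Tc[s][a][s_next] / sum(Tc[s][a])


# Example: from state 0, action 1 led to state 2 three times and to state 1 once.
for observed_next in (2, 2, 1, 2):
    record_transition(Tc, 0, 1, observed_next)

p = transition_prob(Tc, 0, 1, 2)  # close to 3/4; EPS shifts it very slightly
```

Because every estimate for a given (s, a) shares the same denominator, the probabilities over all next states sum to 1.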


R: the reward stored in the model
r: the actual immediate reward
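
The exact update for R is not reproduced in these notes. A commonly used form, given here as an assumption, nudges the model's reward toward each observed immediate reward, where α (a symbol introduced for this sketch) is a learning rate between 0 and 1:

```latex
R'[s,a] = (1 - \alpha)\, R[s,a] + \alpha\, r
```

With α = 0.2, for example, a stored R[s,a] of 0 and an observed r of 1 would give R'[s,a] = 0.2.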


The Dyna architecture consists of a combination of:

direct reinforcement learning from real experience tuples gathered by acting in an environment,
updating an internal model of the environment, and,
using the model to simulate experiences.

Sutton and Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. [web]
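
The three components listed above can be sketched in one loop. This is an illustrative toy, not the course's implementation: the five-state chain world, the hyperparameter values, the learning-rate form of the reward update, and the choice to replay only previously seen (s, a) pairs are all assumptions.

```python
import random

N_STATES, N_ACTIONS, GOAL = 5, 2, 4    # toy chain world (hypothetical)
ALPHA, GAMMA, EPSILON = 0.2, 0.9, 0.1  # assumed hyperparameters


def step(s, a):
    """Real environment: action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == GOAL else 0.0)


def q_update(Q, s, a, s_next, r):
    """Standard Q-learning update for one experience tuple."""
    target = r + GAMMA * max(Q[s_next])
    Q[s][a] = (1 - ALPHA) * Q[s][a] + ALPHA * target


def dyna_q(episodes=30, planning_steps=100, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    Tc = [[[0] * N_STATES for _ in range(N_ACTIONS)] for _ in range(N_STATES)]
    R = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    seen = []  # (s, a) pairs that have occurred in the real world
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # Epsilon-greedy action choice; ties are broken at random.
            if rng.random() < EPSILON or Q[s][0] == Q[s][1]:
                a = rng.randrange(N_ACTIONS)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s_next, r = step(s, a)
            # 1) Direct reinforcement learning from the real experience.
            q_update(Q, s, a, s_next, r)
            # 2) Update the internal model: transition counts and reward.
            if sum(Tc[s][a]) == 0:
                seen.append((s, a))
            Tc[s][a][s_next] += 1
            R[s][a] = (1 - ALPHA) * R[s][a] + ALPHA * r
            # 3) Use the model to simulate experiences and learn from them.
            for _ in range(planning_steps):
                ps, pa = rng.choice(seen)
                pn = rng.choices(range(N_STATES), weights=Tc[ps][pa])[0]
                q_update(Q, ps, pa, pn, R[ps][pa])
            s = s_next
    return Q


Q = dyna_q()
greedy_policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a])
                 for s in range(GOAL)]
```

The `planning_steps=100` default mirrors the 100 simulated updates per real interaction mentioned earlier; lowering it to 0 reduces the loop to plain Q-learning.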

Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, Austin, TX, 1990. [pdf]

RL course by David Silver (videos, slides), Lecture 8: Integrating Learning and Planning [pdf]