CS234 Study Notes I

2020-03-11 专注挖坑的汪

Prerequisites

What you will get out of the course

Lecture I Introduction

Reinforcement Learning involves：

Optimization
Delayed Consequences
Exploration
Generalization

The following part introduces the key characteristics of RL and distinguishes RL from other kinds of AI learning:

Optimization:

Goal is to find an optimal way to make decisions

Delayed Consequences:

The challenge is that decisions are made now, but you might not realize whether they were good decisions until much later.
Two challenges:
1. When planning: decisions involve reasoning about not just the immediate benefit of a decision but also its longer-term ramifications
2. When learning: temporal credit assignment is hard (what caused later high or low rewards?)

Exploration:

Learning about the world by making decisions. The data is censored: we only observe the reward for the decision we actually made. Decisions impact what we learn about.

Generalization:

Policy is a mapping from past experience to action
Why not just pre-program a policy? Because the space of possible policies is far too large.

Another thing that comes up a lot in artificial intelligence is planning:
Example: AlphaGo is an AI planning problem. It involves Optimization, Generalization, and Delayed Consequences, but it does not involve Exploration.
The idea of planning is that you're given a model of how the world works!

Supervised Learning:
Compared with RL, supervised learning involves Optimization and Generalization, but generally not Exploration or Delayed Consequences. It is given the correct labels, and it typically makes a single decision.

Imitation Learning:
Compared with RL, imitation learning involves Optimization, Generalization, and Delayed Consequences.
Its defining feature is learning from the experience of others: instead of our intelligent agent getting to take experiences from the world and make its own decisions, it might watch another intelligent agent (which might be a person) make decisions and observe outcomes, and then use that experience to figure out how it wants to act.
Benefits: great tools for supervised learning; avoids the exploration problem; with big data, there is lots of data about the outcomes of decisions
Limitations: can be expensive to capture; limited by the data collected
Imitation Learning + RL is promising!

Next comes an introduction to sequential decision making under uncertainty

Sequential decision making
These settings are an interactive closed-loop process: we have some agent (hopefully an intelligent one) taking actions that affect the state of the world, and the world then gives back an observation and a reward.

Goal: select actions to maximize total expected future reward (expected, because the process involves randomness)
May require balancing immediate and long-term rewards
May require strategic behavior to achieve high rewards
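
To make the closed loop and the "total expected future reward" objective concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration and does not come from the lecture: the toy environment `NoisyRewardWorld`, its `reset`/`step` interface, and the helper names. Because rewards are random, the expected total reward is estimated by averaging over many episodes.

```python
import random

class NoisyRewardWorld:
    """Hypothetical toy environment (not from the lecture).

    Action 1 earns reward 1 with probability p; action 0 never earns reward.
    An episode lasts `horizon` steps.
    """
    def __init__(self, p=0.5, horizon=10, seed=0):
        self.p = p
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is simply the time step

    def step(self, action):
        reward = 1.0 if action == 1 and self.rng.random() < self.p else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done

def run_episode(env, policy):
    """One pass through the loop: act, then receive an observation and a reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
    return total

def estimate_expected_return(env, policy, episodes=1000):
    """Average the total reward over many episodes (a Monte Carlo estimate)."""
    return sum(run_episode(env, policy) for _ in range(episodes)) / episodes
```

For example, `estimate_expected_return(NoisyRewardWorld(p=0.5), lambda obs: 1)` should come out close to 5, since each of the 10 steps pays off half of the time.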

The example is an artificial tutor

Artificial tutor
The student is a kindergartner who doesn't know anything about math, and we're trying to figure out how to teach the student math. The reward structure for the teaching agent is that it gets +1 every time the student gets something right.
The lecture also gives a machine teaching example. Machine teaching can be very efficient because the two intelligent agents act cooperatively: for instance, one agent knows that the other agent is trying to teach it which region of a line is + and which is -. In general you need some number of samples: points along the line for which you have to get positive or negative labels. In an active learning setting, you can reduce that to roughly log n by being strategic about which points on the line you ask people to label. One of the really cool things about machine teaching is that if I know you are trying to teach me where to divide this line, you only need one point, or at most two points, so essentially a constant number. If I'm trying to teach you, there's no way I'm going to label things randomly; I'll just give you a single plus and a single minus, and that tells you exactly where the boundary goes.
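
The three label budgets in the line example can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not code from the course: the line is discretized into points `0..n-1`, every point left of the boundary is `'-'` and every point from the boundary onward is `'+'`, and all function names are hypothetical.

```python
def passive_learn_threshold(label, n):
    """Label every point: n queries."""
    labels = [label(i) for i in range(n)]
    boundary = labels.index('+') if '+' in labels else n
    return boundary, n

def active_learn_threshold(label, n):
    """Binary search for the first '+' point: roughly log2(n) queries."""
    lo, hi = 0, n  # the boundary lies in [lo, hi]; hi == n means "no '+' at all"
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if label(mid) == '+':
            hi = mid        # boundary is at mid or to its left
        else:
            lo = mid + 1    # boundary is to the right of mid
    return lo, queries

def teach_threshold(boundary, n):
    """A cooperative teacher shows only the points that straddle the boundary."""
    examples = []
    if boundary > 0:
        examples.append((boundary - 1, '-'))
    if boundary < n:
        examples.append((boundary, '+'))
    return examples  # at most two labeled points

def learn_from_teacher(examples, n):
    """A learner that knows the teacher is cooperative reads the boundary off directly."""
    for x, y in examples:
        if y == '+':
            return x
    return n
```

With `n = 1024`, the passive learner uses 1024 labels, the active learner at most 11, and the taught learner at most 2, regardless of `n`.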

We are almost always going to think about time as being discrete

General model
A very important step is defining a state space. Whenever you're in a real application, this is exactly what you have to define: how to write down the representation of the world. What we are going to assume in this class is that the state is a function of the history!

state

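Assumptions of the Markov model
Some problems can be modeled as Markov, but others cannot. For example, in hypertension control, let the state be the current blood pressure and the action be whether to take medication or not. This system is not Markov, because the blood pressure may be affected by things that happened in the past, for example you have just eaten.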

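Why the Markov model is popular
Question: can you always treat a problem as Markov? Yes, if you include the entire history in the state.
In practice, the most recent 4 observations are often used as a reasonably sufficient state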
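
One way to build a state from the most recent observations is a fixed-length buffer. The sketch below is only an illustration: the class name `HistoryState`, the padding value, and the tuple representation are assumptions, not something given in the lecture.

```python
from collections import deque

class HistoryState:
    """Keeps the last k observations and exposes them as the state."""
    def __init__(self, k=4, pad=None):
        # Before k observations have arrived, the missing slots hold `pad`.
        self.buffer = deque([pad] * k, maxlen=k)

    def update(self, obs):
        # Appending to a full deque with maxlen drops the oldest observation.
        self.buffer.append(obs)
        return tuple(self.buffer)
```

Calling `update` with observations 1 through 6 in turn leaves the state as `(3, 4, 5, 6)`: the state is still a function of the history, just a truncated one.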

When the environment is only partially observable

An ad recommendation system on a web page is a kind of bandit problem

how the world changes

One way the world can change is deterministic; the other is stochastic
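
The difference can be shown with two toy transition functions. This is a sketch under assumptions made for illustration: the state is an integer position, the action is a step size, and the `slip` probability is a made-up parameter.

```python
import random

def deterministic_step(state, action):
    """The same state and action always give the same next state."""
    return state + action

def stochastic_step(state, action, rng, slip=0.2):
    """With probability `slip` the action has no effect; otherwise it moves as intended."""
    if rng.random() < slip:
        return state
    return state + action
```

Calling `deterministic_step(3, 1)` always returns 4, while `stochastic_step(3, 1, random.Random(0))` returns either 3 or 4 depending on the random draw.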

Components of an RL algorithm

Model

Policy

Value

Types of RL agents

exploration and exploitation

An example of exploration and exploitation

Evaluation and Control

The two fundamental problems: Evaluation and Control
A great feature of RL is that we can often do this evaluation off-policy, which means we can use data gathered from other policies to evaluate the counterfactual of what different policies might do.
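
One standard way to reuse data from another policy is importance sampling. The notes do not say which method the course uses at this point, so the sketch below is only an illustration, restricted to the simplest one-step (bandit-like) case. The data format and function name are assumptions.

```python
import random

def importance_sampling_estimate(data, target_probs):
    """Estimate the expected reward of a target policy from logged data.

    data: list of (action, reward, behavior_prob) tuples, where behavior_prob
          is the probability with which the logging policy chose that action.
    target_probs: dict mapping each action to its probability under the
          policy we want to evaluate.
    """
    total = 0.0
    for action, reward, behavior_prob in data:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take that action.
        total += (target_probs[action] / behavior_prob) * reward
    return total / len(data)

# Usage: log data with a uniform policy, then evaluate a different policy.
rng = random.Random(0)
logged = []
for _ in range(10000):
    a = rng.choice([0, 1])      # behavior policy: uniform over two actions
    r = 1.0 if a == 1 else 0.0  # toy rewards: only action 1 pays off
    logged.append((a, r, 0.5))
estimate = importance_sampling_estimate(logged, {0: 0.1, 1: 0.9})
```

In this toy setup the target policy picks the paying action 90% of the time, so its true value is 0.9, and `estimate` should land close to that even though the data came from the uniform policy.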

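This lecture is mainly a brief overview of the course. The notes are all taken from the slides, and at this point it all feels a bit abstract.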