Lecture 2: Markov Decision Proce

2020-03-23 本文已影响0人魏鹏飞

Author：David Silver

Outline

Markov Processes
Markov Reward Processes
Markov Decision Processes
Extensions to MDPs

Introduction to MDPs

Markov decision processes formally describe anenvironment` for reinforcement learning
Where the environment is fully observable
i.e. The current state completely characterises the process
Almost all RL problems can be formalised as MDPs, e.g.
- Optimal control primarily deals with continuous MDPs
- Partially observable problems can be converted into MDPs
- Bandits are MDPs with one state

Markov Property

“The future is independent of the past given the present”

The state captures all relevantinformation from the history
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future

State Transition Matrix

For a Markov state s and successor state s' , the state transition probability is defined by

$P_{ss'}=P[S_{t+1}=s'|S_t=s]$

State transition matrix $P$ defines transition probabilities from all states s to all successor states s',

where each row of the matrix sums to 1.

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, ...$ with the Markov property.

Example: Student Markov Chain

Example: Student Markov Chain Episodes

Example: Student Markov Chain Transition Matrix

Markov Reward Process

A Markov reward process is a Markov chain with values.

Example: Student MRP

Return

The discount $\gamma \in [0, 1]$ is the present value of future rewards
The value of receiving reward $R$ after $k + 1$ time-steps is $\gamma^kR$ .
This values immediate reward above delayed reward.
- $\gamma$ close to 0 leads to ”myopic” evaluation
$\gamma$ close to 1 leads to ”far-sighted” evaluation

Why discount?

Most Markov reward and decision processes are discounted. Why?

Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future may not be fully represented
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Animal/human behaviour shows preference for immediate reward
It is sometimes possible to use undiscounted Markov reward processes (i.e. $\gamma = 1$ ), e.g. if all sequences terminate.

Value Function

The value function $v(s)$ gives the long-term value of state $s$ .

Example: Student MRP Returns

Example: State-Value Function for Student MRP (1)

Example: State-Value Function for Student MRP (2)

Example: State-Value Function for Student MRP (3)

Bellman Equation for MRPs

The value function can be decomposed into two parts:

immediate reward $R_{t+1}$
discounted value of successor state $\gamma v(S_{t+1})$

Bellman Equation for MRPs (2)

Example: Bellman Equation for Student MRP

Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using matrices,

$v = R+\gamma Pv$

where $v$ is a column vector with one entry per state

$\begin{bmatrix} v(1) \\ \vdots\\ v(n) \end{bmatrix}=\begin{bmatrix} R_1 \\ \vdots\\ R_n \end{bmatrix} + \gamma \begin{bmatrix} P_{11} & \dots & P_{1n}\\ \vdots \\ P_{n1} & \dots & P_{nn} \end{bmatrix}\begin{bmatrix} v(1) \\ \vdots\\ v(n) \end{bmatrix}$

Solving the Bellman Equation

The Bellman equation is a linear equation
It can be solved directly:

$v=R+\gamma Pv$
$(I-\gamma P)v=R$
$v=(I-\gamma P)^{-1}R$

Computational complexity is $O(n^3)$ for n states
Direct solution only possible for small MRPs
There are many iterative methods for large MRPs, e.g.
- Dynamic programming
- Monte-Carlo evaluation
- Temporal-Difference learning

Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

Example: Student MDP

Policies (1)

A policy fully defines the behaviour of an agent
MDP policies depend on the current state (not the history)
i.e. Policies are stationary (time-independent), $A_t \sim \pi(\cdot|S_t),\forall t>0$

Policies (2)

Given an MDP $M = <S,A,P,R,\gamma>$ and a policy $\pi$
The state sequence $S_1, S_2, ...$ is a Markov process $<S, P^{\pi}>$
The state and reward sequence $S_1, R_2, S_2, ...$ is a Markov reward process $<S, P^{\pi}, R^{\pi}, \gamma>$
where

$P^{\pi}_{s,s'}=\sum_{a\in A}\pi(a|s)P_{ss'}^a$
$R_s^{\pi}=\sum_{a\in A}\pi(a|s)R_s^a$

Value Function

Example: State-Value Function for Student MDP

Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus discounted value of successor state,

$v_{\pi}(s)=E_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_t=s]$

The action-value function can similarly be decomposed,

$q_{\pi}(s,a)=E_{\pi}[R_{t+1}+\gamma q_{\pi}(S_{t+1},A_{t+1})|S_t=s,A_t=a]$

Bellman Expectation Equation for $V^{\pi}$

Bellman Expectation Equation for $Q^{\pi}$

Bellman Expectation Equation for $v_{\pi}$ (2)

Bellman Expectation Equation for $q_{\pi}$ (2)

Example: Bellman Expectation Equation in Student MDP

Bellman Expectation Equation (Matrix Form)

The Bellman expectation equation can be expressed concisely using the induced MRP,

$v_{\pi}=R^{\pi}+\gamma P^{\pi}v_{\pi}$

with direct solution

$v_{\pi}=(I-\gamma P^{\pi})^{-1}R^{\pi}$

Optimal Value Function

The optimal value function specifies the best possible performance in the MDP.
An MDP is “solved” when we know the optimal value fn (v+q).

Example: Optimal Value Function for Student MDP

Example: Optimal Action-Value Function for Student MDP

Optimal Policy

Define a partial ordering over policies:

$\pi \geq \pi' \mbox{ if } v_{\pi}(s)\geq v_{\pi'}(s), \forall_s$

Finding an Optimal Policy

An optimal policy can be found by maximising over $q_∗(s,a)$ ,

$\pi_{*}(a|s) = \begin{cases} 1, & \mbox{if }a=\argmax_{a\in A}q_{*}(s,a) \\ 0, & \mbox{otherwise } \end{cases}$

There is always a deterministic optimal policy for any MDP
If we know $q_∗(s,a)$ , we immediately have the optimal policy

Example: Optimal Policy for Student MDP

Bellman Optimality Equation for $v_*$

Bellman Optimality Equation for $Q_*$

Bellman Optimality Equation for $V^*$ (2)

Bellman Optimality Equation for $Q^*$ (2)

Example: Bellman Optimality Equation in Student MDP

Solving the Bellman Optimality Equation

Bellman Optimality Equation is non-linear
No closed form solution (in general)
Many iterative solution methods
- Value Iteration
- Policy Iteration
- Q-learning
- Sarsa

Extensions to MDPs (no exam)

Infinite and continuous MDPs
Partially observable MDPs
Undiscounted, average reward MDPs

Infinite MDPs (no exam)

The following extensions are all possible:

Countably infinite state and/or action spaces
- Straightforward
Continuous state and/or action spaces
- Closed form for linear quadratic model (LQR)
Continuous time
- Requires partial differential equations
- Hamilton-Jacobi-Bellman (HJB) equation
- Limiting case of Bellman equation as time-step → 0

POMDPs (no exam)

Belief States (no exam)

Reductions of POMDPs (no exam)

Ergodic Markov Process (no exam)

Ergodic MDP (no exam)

Average Reward Value Function (no exam)

Questions?

The only stupid question is the one you were afraid to ask but never did.
-Rich Sutton

Reference：《UCL Course on RL》

Lecture 2: Markov Decision Proce

Outline

Introduction to MDPs

Markov Property

State Transition Matrix

Markov Process

Example: Student Markov Chain

Example: Student Markov Chain Episodes

Example: Student Markov Chain Transition Matrix

Markov Reward Process

Example: Student MRP

Return

Why discount?

Value Function

Example: Student MRP Returns

Example: State-Value Function for Student MRP (1)

Example: State-Value Function for Student MRP (2)

Example: State-Value Function for Student MRP (3)

Bellman Equation for MRPs

Bellman Equation for MRPs (2)

Example: Bellman Equation for Student MRP

Bellman Equation in Matrix Form

Solving the Bellman Equation

Markov Decision Process

Example: Student MDP

Policies (1)

Policies (2)

Value Function

Example: State-Value Function for Student MDP

Bellman Expectation Equation

Bellman Expectation Equation for

Bellman Expectation Equation for

Bellman Expectation Equation for (2)

Bellman Expectation Equation for (2)

Example: Bellman Expectation Equation in Student MDP

Bellman Expectation Equation (Matrix Form)

Optimal Value Function

Example: Optimal Value Function for Student MDP

Example: Optimal Action-Value Function for Student MDP

Optimal Policy

Finding an Optimal Policy

Example: Optimal Policy for Student MDP

Bellman Optimality Equation for

Bellman Optimality Equation for

Bellman Optimality Equation for (2)

Bellman Optimality Equation for (2)

Example: Bellman Optimality Equation in Student MDP

Solving the Bellman Optimality Equation

Extensions to MDPs (no exam)

Infinite MDPs (no exam)

POMDPs (no exam)

Belief States (no exam)

Reductions of POMDPs (no exam)

Ergodic Markov Process (no exam)

Ergodic MDP (no exam)

Average Reward Value Function (no exam)

Questions?

猜你喜欢

热点阅读

Bellman Expectation Equation for $V^{\pi}$

Bellman Expectation Equation for $Q^{\pi}$

Bellman Expectation Equation for $v_{\pi}$ (2)

Bellman Expectation Equation for $q_{\pi}$ (2)

Bellman Optimality Equation for $v_*$

Bellman Optimality Equation for $Q_*$

Bellman Optimality Equation for $V^*$ (2)

Bellman Optimality Equation for $Q^*$ (2)