Lecture 3: Planning by Dynamic Programming

2020-04-04  魏鹏飞

Author: David Silver

Outline

  1. Introduction
  2. Policy Evaluation
  3. Policy Iteration
  4. Value Iteration
  5. Extensions to Dynamic Programming
  6. Contraction Mapping

What is Dynamic Programming?

Dynamic: sequential or temporal component to the problem
Programming: optimising a “program”, i.e. a policy

Requirements for Dynamic Programming

Dynamic Programming is a very general solution method for problems which have two properties:

  1. Optimal substructure: the principle of optimality applies, so the optimal solution can be decomposed into solutions to subproblems
  2. Overlapping subproblems: subproblems recur many times, so their solutions can be cached and reused

Planning by Dynamic Programming

Markov decision processes satisfy both properties: the Bellman equation gives a recursive decomposition, and the value function stores and reuses the solutions to subproblems.

Other Applications of Dynamic Programming

Dynamic programming is used to solve many other problems, e.g.

  - Scheduling algorithms
  - String algorithms (e.g. sequence alignment)
  - Graph algorithms (e.g. shortest path algorithms)
  - Graphical models (e.g. Viterbi algorithm)
  - Bioinformatics (e.g. lattice models)

Iterative Policy Evaluation

Iterative Policy Evaluation (2)
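
For reference, the backup applied to every state at each iteration $k$ is the Bellman expectation equation used as an update rule:

$$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_k(s') \right)$$

or, in matrix form, $\mathbf{v}^{k+1} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} \mathbf{v}^{k}$.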

Evaluating a Random Policy in the Small Gridworld

Iterative Policy Evaluation in Small Gridworld

Iterative Policy Evaluation in Small Gridworld (2)
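
As a concrete illustration, here is a minimal Python sketch of iterative policy evaluation on the lecture's small gridworld (4×4 grid, terminal corner states, reward −1 per step, uniform random policy, undiscounted); the helper names and layout are mine, not from the slides:

```python
import numpy as np

N = 4
TERMINALS = {0, N * N - 1}                     # the shaded corner states
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    """Deterministic transition; moves that leave the grid keep s unchanged."""
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    return nr * N + nc if 0 <= nr < N and 0 <= nc < N else s

def evaluate_random_policy(gamma=1.0, theta=1e-6):
    v = np.zeros(N * N)
    while True:
        v_new = np.zeros_like(v)
        for s in range(N * N):
            if s in TERMINALS:
                continue
            # Bellman expectation backup under the uniform random policy
            v_new[s] = sum(0.25 * (-1.0 + gamma * v[step(s, a)]) for a in ACTIONS)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

print(evaluate_random_policy().reshape(N, N).round(1))
```

This reproduces the converged values shown on the slide, e.g. −14 in the states next to the terminal corners and −22 in the opposite corners.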

How to Improve a Policy

Policy Iteration
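
Below is a minimal sketch of the full policy iteration loop for a generic finite MDP. The array convention (`P[a, s, s2]` for transitions, `R[a, s]` for expected rewards) is an assumption of this sketch, and the evaluation step solves the linear Bellman system directly rather than iterating, which requires γ < 1:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P[a, s, s2]: transition probabilities; R[a, s]: expected rewards."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v exactly
        P_pi = P[pi, np.arange(n_states)]
        R_pi = R[pi, np.arange(n_states)]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to q(s, a)
        q = R + gamma * P @ v                     # shape (n_actions, n_states)
        pi_new = q.argmax(axis=0)
        if np.array_equal(pi_new, pi):            # policy stable => optimal
            return pi, v
        pi = pi_new
```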

Jack’s Car Rental

Policy Iteration in Jack’s Car Rental

Policy Improvement
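
The improvement step in the lecture is to act greedily with respect to $v_\pi$:

$$\pi'(s) = \operatorname*{argmax}_{a \in \mathcal{A}} q_\pi(s, a)$$

This improves the value from any state over one step, since $q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$.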

Policy Improvement (2)

Modified Policy Iteration

Generalised Policy Iteration

Principle of Optimality

Deterministic Value Iteration

Example: Shortest Path

Value Iteration
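
Value iteration applies the Bellman optimality backup iteratively, starting from an arbitrary $v_1$:

$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_k(s') \right)$$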

Value Iteration (2)
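
A matching sketch of synchronous value iteration, using the same (assumed) `P`, `R` array convention as the policy iteration sketch above; note there is no explicit policy during the iterations:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # Bellman optimality backup applied to all states in one sweep
        v_new = (R + gamma * P @ v).max(axis=0)
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    pi = (R + gamma * P @ v).argmax(axis=0)       # greedy policy at the end
    return pi, v
```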

Example of Value Iteration in Practice

Synchronous Dynamic Programming Algorithms

Asynchronous Dynamic Programming

Three simple ideas for asynchronous dynamic programming:

  1. In-place dynamic programming
  2. Prioritised sweeping
  3. Real-time dynamic programming

In-Place Dynamic Programming
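
A sketch of the in-place variant: a single value array is overwritten during the sweep, so later backups in the same sweep already use the freshest estimates (the synchronous version above keeps old and new values in separate arrays). Same assumed `P`, `R` convention:

```python
import numpy as np

def value_iteration_in_place(P, R, gamma=0.9, theta=1e-8):
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            old = v[s]
            v[s] = (R[:, s] + gamma * P[:, s] @ v).max()   # reads fresh values
            delta = max(delta, abs(v[s] - old))
        if delta < theta:
            return v
```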

Prioritised Sweeping
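
A sketch of prioritised sweeping: back up the state with the largest remaining Bellman error first, then re-prioritise its predecessors. The heap bookkeeping here (stale entries are recomputed and skipped on pop) is a simplification of mine, not the lecture's:

```python
import heapq
import numpy as np

def prioritised_sweeping(P, R, gamma=0.9, theta=1e-8):
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)

    def bellman_error(s):
        return abs((R[:, s] + gamma * P[:, s] @ v).max() - v[s])

    heap = [(-bellman_error(s), s) for s in range(n_states)]  # max-heap
    heapq.heapify(heap)
    while heap:
        _, s = heapq.heappop(heap)
        err = bellman_error(s)            # recompute: heap entry may be stale
        if err < theta:
            continue
        v[s] = (R[:, s] + gamma * P[:, s] @ v).max()
        # States that can transition into s may now have a larger error
        for pred in np.nonzero(P[:, :, s].any(axis=0))[0]:
            heapq.heappush(heap, (-bellman_error(pred), pred))
    return v
```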

Real-Time Dynamic Programming

Full-Width Backups

Sample Backups

Approximate Dynamic Programming

Some Technical Questions

Value Function Space

Value Function ∞-Norm
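
The lecture measures the distance between value functions $u$ and $v$ by the $\infty$-norm, i.e. the largest difference between state values:

$$\|u - v\|_\infty = \max_{s \in \mathcal{S}} |u(s) - v(s)|$$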

Bellman Expectation Backup is a Contraction
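
Writing the backup as the operator $T^\pi(v) = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v$, it is a $\gamma$-contraction in this norm:

$$\|T^\pi(u) - T^\pi(v)\|_\infty = \gamma \|\mathcal{P}^\pi (u - v)\|_\infty \le \gamma \|u - v\|_\infty$$

where the inequality holds because each row of $\mathcal{P}^\pi$ is a probability distribution.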

Contraction Mapping Theorem

Convergence of Iter. Policy Evaluation and Policy Iteration

Bellman Optimality Backup is a Contraction
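
Similarly, the Bellman optimality backup $T^*(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a + \gamma \mathcal{P}^a v \right)$ satisfies

$$\|T^*(u) - T^*(v)\|_\infty \le \gamma \|u - v\|_\infty$$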

Convergence of Value Iteration

Reference: David Silver, *UCL Course on RL*
