增强学习Reinforcement Learning

Model-based RL

2018-09-19  本文已影响13人  海街diary

注:以下内容基于CS598.

1. Estimate Model

给定数据集D=\{(s_1, a_1, r_1, s_1^\prime), (s_2, a_2, r_2, s_2^\prime) \cdots (s_n, a_n, r_n, s_n^\prime) \}, 采用极大似然对模型进行估计。用|D_{s,a}|表示(s_i=s, a_i=a, \cdots)的样本数。

\hat P(s^\prime|s, a) = \frac{1}{|D_{s,a}|} \sum_{(r, s^\prime) \in D_{s,a} }\mathbb{I}_{s^\prime}

\hat R(s,a) = \frac{1}{|D_{s,a}|} \sum_{(r, s^\prime) \in D_{s, a}} r

2. Analysis of Certainty-Equivalence RL

2.1 Naive analysis

根据Hoeffding's Inequality: With probability at least 1-\delta,

|\frac{S_n}{n} - \frac{\mathbb{E}[S_n|}{n}| \leq (b-a) \sqrt{\frac{1}{2n}\ln{\frac{2}{\delta}}}

将失败率\delta分别平摊到|S \times A||S \times A \times S|个事件上,有:

\max_{s,a} |\hat R(s,a) - R(s,a) | \leq R_{max} \sqrt{\frac{1}{2n}\ln{\frac{4|S| \times |A|}{\delta}}}

\max_{s,a, s^\prime} |\hat P(s^\prime|s,a) - P(s^\prime|s, a) | \leq \sqrt{\frac{1}{2n}\ln{\frac{4|S| \times |A| \times |S|}{\delta}}}

所以, 定义P(s,a) = P(\cdot|s,a)为一个|S|维的vector,有:

\max_{s, a} ||\hat P(s, a) - P(s, a)||_1 \leq \max_{s,a} |S| \cdot ||\hat P(s, a) - P(s, a)||_{\infty} \leq |S| \cdot \sqrt{\frac{1}{2n}\ln{\frac{4|S| \times |A| \times |S|}{\delta}}}

If \max_{s,a}|\hat R(s,a) - R(s,a)| \leq \epsilon_R and \max_{s,a}|\hat P(s,a) - P(s,a)| \leq \epsilon_P, then for any policy \pi: S \to A, we have \forall s \in S
|| V^{\pi}_{\hat M} - V^{\pi}_{M} ||_{\infty} \leq \frac{\epsilon_R}{1 - \gamma} + \frac{\gamma \epsilon_P R_{max}}{2(1-\gamma)^2}

Proof: \forall s \in S
\begin{align*} | V^{\pi}_{\hat M}(s) - V^{\pi}_{M}(s) | &= | \hat R(s,a) + \gamma \hat {P}(s'|s,a) \cdot V^\pi_{\hat M}(s') - R(s,a) - \gamma P(s'|s,a) \cdot V^\pi_M(s')| \\ & \leq |\hat R(s,a)-R(s,a)| + \gamma | \hat P(s'|s,a) \cdot V^\pi_{\hat M}(s') - P(s'|s,a) \cdot V^\pi_M(s')| \\ &\leq \epsilon_R + \gamma |\hat P(s,a) \cdot V^\pi_{\hat M} - P(s,a) \cdot V^\pi_{\hat M} + P(s,a) \cdot V^\pi_{\hat M} - P(s,a) \cdot V^\pi_M| \quad (ignore \, s')\\ &\leq \epsilon_R +\gamma |(\hat P(s,a) - P(s,a)) \cdot V^\pi_{\hat M}| +\gamma | P(s,a) \cdot (V^\pi_{\hat M} - V^\pi_M) | \\ &\leq \epsilon_R + \gamma |(\hat P(s,a) - P(s,a)) \cdot (V^\pi_{\hat M} - \frac{R_{max}}{2(1-\gamma)} {\mathbf{1}} )| + \gamma ||(V^\pi_{\hat M} - V^\pi_M)||_{\infty} \\ &\leq \epsilon_R + \gamma ||\hat P(s,a) - P(s,a)||_{1} \times ||V^\pi_{\hat M} - \frac{R_{max}}{2(1-\gamma)} {\mathbf{1}} ||_{\infty} + \gamma |(V^\pi_{\hat M} - V^\pi_M)|_{\infty} \\ &\leq \epsilon_R + \gamma \frac{ \epsilon_P R_{max}}{2(1-\gamma)} + \gamma ||(V^\pi_{\hat M} - V^\pi_M)||_{\infty}\\ Thus,\\ (1-\gamma)& ||(V^\pi_{\hat M} - V^\pi_M)||_{\infty} \leq \epsilon_R + \gamma \frac{ \epsilon_P R_{max}}{2(1-\gamma)} \\ &||(V^\pi_{\hat M} - V^\pi_M)||_{\infty} \leq \frac{\epsilon_R}{1 - \gamma} + \frac{\gamma \epsilon_P R_{max}}{2(1-\gamma)^2} \end{align*}

\forall s \in S, V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) \leq 2 \sup_{\pi: S \to A}||V^\pi_{\hat M} - V^\pi_{M}||_{\infty}.

Proof: \forall s \in S,
\begin{align*} V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) &= V^*_M(s) - V^{\pi^*_M}_{\hat M}(s) + V^{\pi^*_M}_{\hat M}(s) - V^{\pi^*_{\hat M}}_M(s) \\ &\leq V^*_M(s) - V^{\pi^*_M}_{\hat M}(s) + V^{\pi^*_{\hat M}}_{\hat M}(s) - V^{\pi^*_{\hat M}}_M(s) \qquad ( \pi^*_{\hat M} \,maxmizes \quad v_{\hat M}) \\ & \leq || V^*_M(s) - V^{\pi^*_M}_{\hat M}(s)||_{\infty} + || V^{\pi^*_{\hat M}}_{\hat M}(s) - V^{\pi^*_{\hat M}}_M(s)||_{\infty}. \\ & \leq 2[\frac{\epsilon_R}{1 - \gamma} + \frac{\gamma \epsilon_P R_{max}}{2(1-\gamma)^2}] \qquad(Lemma \,1) \\ Thus, \\ V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) &= \tilde{O} (\frac{|S|}{\sqrt{n}(1 - \gamma)^2}) \qquad \forall s \in S. \end{align*}
Here \tilde{O}(\cdot)supresses poly-logarithmic dependences on |S| and |A|.

2.2 Improving S to \sqrt{|S|}

对于任意向量v \in \mathbb{R}^{|S|}, 有
||v||_1 = \sup_{u \in {-1, 1}^{|S|}} u^T v.

所以对于任意给定的(s,a) 和任意给定的u \in \{-1, 1\}^{|S|}, u^T \hat P(s,a) 是以[-1, 1]为界的随机变量,以至少1 - \delta/(2|S\times A| \cdot 2^{|S|}), 有
u^T (\hat P(s,a) - P(s, a)) \leq 2 \sqrt{\frac{1}{2n} \ln \frac{2|S \times A| \cdot 2^{|S|}}{\delta}}

所以, 以至少1 - \delta/2的概率,有
\max_{s, a} ||\hat P(s,a) - P(s, a)||_1 = \max_{s,a} \max_{u \in {-1, 1}^{|S|}} u^T (\hat P(s,a) - P(s, a)) \leq 2 \sqrt{\frac{1}{2n} \ln \frac{2|S \times A| \cdot 2^{|S|}}{\delta}}

所以,
V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) = \tilde{O} (\frac{ \sqrt{|S|}}{\sqrt{n}(1 - \gamma)^2}) \qquad \forall s \in S

上一篇 下一篇

猜你喜欢

热点阅读