Efficient Reinforcement Learning

2019-08-20  太捉急

This is an old paper from 1994, and a classic one on analyzing the complexity of reinforcement learning algorithms. Without further ado, let's dive in.
First, the basic definitions for RL:

Optimal Policy:
\pi^* = \arg\max_\pi \mathbb{E}_{\tau \sim \pi} \{R(\tau) \}
Optimal Value:
V^*(s) = V^{\pi^*}(s) = \max_\pi \mathbb{E}_{\tau \sim \pi, s_0 = s} \{R(\tau) \} \\ Q^*(s, a) = Q^{\pi^*}(s,a) = \max_\pi \mathbb{E}_{\tau \sim \pi, s_0 = s, a_0=a} \{R(\tau) \}

Obviously for any policy \pi, \quad V^{\pi} \leq V^*.
V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi, s_0 = s} \{R(\tau) \} = \sum_{a \in \mathcal{A}} \pi(s,a) \{ \mathbb{E}[\mathcal{R}(s,a)] + \gamma \sum_{s' \in \mathcal{S}} V^\pi(s')\mathcal{T}(s,a,s') \}

Here we define the paper's central concept, the PAC learner (Probably Approximately Correct):

An algorithm \mathcal{L} is a PAC learner if, for any environment \mathcal{E} = (\mathcal{S, A, T, R}, d_0) and any \epsilon > 0, \delta > 0, \gamma \in (0, 1), it produces a policy \pi such that:
\Pr \{ |V^{*}(s_0) - V^{\pi}(s_0)| \leq \epsilon \} \geq 1-\delta
in time polynomial in N = |\mathcal{S}|, K=|\mathcal{A}|, r_\max = \max \mathcal{R}(s,a), 1/\epsilon, 1/\delta, 1/(1-\gamma)

In other words, an algorithm is a PAC learner if, within a certain amount of time, it produces a policy \pi such that the error between V^{\pi} and V^* is at most \epsilon with probability at least 1-\delta, where that time is polynomial in N, K, r_\max, 1/\epsilon, 1/\delta, 1/(1-\gamma).
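
As a concrete, purely illustrative reading of this condition, one could estimate V^{\pi}(s_0) by Monte Carlo rollouts and compare it against a known V^*(s_0); the `env` object with `reset()` and `step(a) -> (s, r, done)`, the `policy` function, and the scalar `V_star_s0` below are assumptions for the sketch, not the paper's algorithm.

```python
import numpy as np

def estimate_value(env, policy, gamma, n_episodes=1000, horizon=200):
    """Monte Carlo estimate of V^pi(s_0): average discounted return over rollouts."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()                        # hypothetical environment interface
        G, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env.step(policy(s))
            G += discount * r
            discount *= gamma
            if done:
                break
        returns.append(G)
    return float(np.mean(returns))

# PAC-style check at the initial state (V_star_s0 assumed known for illustration):
# abs(V_star_s0 - estimate_value(env, policy, gamma)) <= epsilon, with probability >= 1 - delta
```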


We first consider the MDP truncated at horizon M: \tau = (s_0, a_0, r_1, \dots, s_M), \quad R(\tau) = \sum_{t=0}^{M-1}\gamma^t r_{t+1}, \quad V_M^\pi(s_M) = 0, \quad V_M^\pi(s_0) = \mathbb{E}_{\tau \sim \pi}\{R(\tau)\}, then
V^{\pi}(s_0) = \lim_{M\rightarrow \infty}V_M^{\pi}(s_0)

For any s_0 \in \mathcal{S},
|V^\pi(s_0) - V_M^{\pi}(s_0) | \leq \gamma^{M}r_\max / (1 - \gamma)

This is easy to prove: for the tail trajectory \tau_{>M} = (s_M, a_M, r_{M+1}, s_{M+1}, \dots),
|V^\pi(s_0) - V_M^{\pi}(s_0) | = \gamma^M |\mathbb{E} [ R(\tau_{>M}) ]| \leq \gamma^M v_\max, \quad v_\max = r_\max / (1 - \gamma)
We require

|V^\pi(s_0) - V_M^{\pi}(s_0) | \leq \gamma^M v_\max < \epsilon_c
Then

M \geq \log_\gamma (\epsilon_c / v_\max) = \ln(v_\max / \epsilon_c) / \ln(1/\gamma)
Using \ln x \leq x - 1 for x \in (0, 1], we have \ln(1/\gamma) \geq 1 - \gamma, so it suffices to take
M \geq \ln(v_\max / \epsilon_c) / (1-\gamma)
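
As a quick numerical sketch (the numbers are illustrative assumptions), the exact requirement \log_\gamma(\epsilon_c / v_\max) and the looser but simpler bound \ln(v_\max / \epsilon_c)/(1-\gamma) can be compared directly:

```python
import math

def horizon_bounds(r_max, gamma, eps_c):
    """Return (exact, loose) horizon requirements for truncation error below eps_c."""
    v_max = r_max / (1.0 - gamma)
    M_exact = math.log(v_max / eps_c) / math.log(1.0 / gamma)   # = log_gamma(eps_c / v_max)
    M_loose = math.log(v_max / eps_c) / (1.0 - gamma)           # uses ln(1/gamma) >= 1 - gamma
    return math.ceil(M_exact), math.ceil(M_loose)

# e.g. r_max = 1, gamma = 0.95, eps_c = 0.01: the loose bound is only slightly larger than the exact one
print(horizon_bounds(r_max=1.0, gamma=0.95, eps_c=0.01))
```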
