A Review of Game AI Fundamentals

2019-06-09 YukiRain

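While reading documentation at work, I put together brief notes on some of the background it relies on; most of the ideas come from DeepMind or OpenAI.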

The PPO objective functions

PPO has two forms of objective function. The first is usually called adaptive KL:

$\theta_{k+1}=\arg\max_{\theta}\mathbb{E}_{\pi'}[\sum_{t=0}^{\infty}\gamma^{t}\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi'(a_{t}|s_{t})}A^{\pi}(s_{t},a_{t})-\beta_{k}D_{KL}(\pi'||\pi_{\theta})]$

The second is usually called the clipped surrogate:

$\theta_{k+1}=\arg\max_{\theta}\mathbb{E}_{\pi'}[\sum_{t=0}^{\infty}[\min(\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi'(a_{t}|s_{t})}A^{\pi}(s_{t},a_{t}),\ \text{clip}(\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi'(a_{t}|s_{t})},1-\epsilon,1+\epsilon)A^{\pi}(s_{t},a_{t}))]]$

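where

$\theta$ denotes the parameters of the policy model, and $\pi_{\theta}$ is the model being trained.
$\pi'$ is the old policy from before the update. The usual practice is to initialize two networks with the same architecture, update $\pi_{\theta}$ with the training data (trajectories) obtained by running $\pi'$ in the environment, and after a number of steps copy all parameters of $\pi_{\theta}$ to $\pi'$.
$A^{\pi}(s_{t},a_{t})=Q^{\pi}(s_{t},a_{t})-V^{\pi}(s_{t})\approx R_{t}+\gamma{V^{\pi}(s_{t+1})}-V^{\pi}(s_{t})$ is the advantage function; the last expression is its one-step estimate. In the objectives above it is evaluated under the data-collecting policy $\pi'$.
$\epsilon$ is a small constant, typically around 0.1.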
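$D_{KL}(\pi'||\pi)=\mathbb{E}_{s_{t}\sim{d^{\pi'}(s)}}\mathbb{E}_{a_{t}\sim\pi'}[\log(\frac{\pi'(a_{t}|s_{t})}{\pi(a_{t}|s_{t})})]$ is the familiar KL divergence. For a discrete action space it can be computed directly as the cross-entropy between $\pi'$ and $\pi$ minus the entropy of $\pi'$. For a continuous action space the policy is usually parameterized, via the reparameterization trick, as a Gaussian distribution (the network outputs two vectors, one for $\mu$ and one for $\sigma$, and actions are sampled from the resulting distribution). The KL between two Gaussians has a closed form: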
$D_{KL}(\mathcal{N}(\mu_{1},\sigma_{1}^{2})||\mathcal{N}(\mu_{2},\sigma_{2}^{2}))=\log(\frac{\sigma_{2}}{\sigma_{1}})+\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}}-\frac{1}{2}$
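$V^{\pi}(s)=\mathbb{E}_{\pi}[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\mid s_{0}=s]$; the quantity on the right-hand side is the total discounted reward, the ultimate objective that reinforcement learning algorithms optimize. It is usually parameterized as a value network and trained directly with supervised learning. Most reinforcement learning algorithms of recent years rely on GAE for this estimate.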
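
As a rough illustration of the clipped surrogate, the sketch below evaluates it on a small batch in plain Python. The function name `clipped_surrogate`, the use of log-probabilities as inputs, and averaging over the batch (instead of summing over an infinite trajectory) are assumptions made for this example, not part of the original formulation.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.1):
    """Batch average of the clipped surrogate objective (to be maximized).

    logp_new:   log pi_theta(a_t|s_t) under the policy being trained
    logp_old:   log pi'(a_t|s_t) under the policy that collected the data
    advantages: advantage estimates for the same (s_t, a_t) pairs
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                # pi_theta / pi'
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(ratio, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)         # keep the pessimistic term
    return total / len(advantages)
```

With a positive advantage, a ratio above $1+\epsilon$ earns no extra objective, so the gradient stops pushing the policy further away from $\pi'$; with a negative advantage the same happens below $1-\epsilon$.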

Why the objective is designed this way

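Both objectives aim to approximate the natural gradient $\tilde{g}=F^{-1}g=(\nabla^{2}_{\theta}KL(\pi'||\pi_{\theta}))^{-1}g$, where $F$ is the Fisher information matrix. $F$ defines a Riemannian metric on the space of probability distributions that is invariant to reparameterization, which makes $F^{-1}g$ a contravariant vector; i.e., the natural gradient given by $F^{-1}g$ does not depend on how $\pi_{\theta}$ is parameterized, and training therefore tends to have lower variance.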
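Iterating within a bounded KL region gives a (weak) monotonic improvement bound. As in the TRPO analysis, the penalty is stated with the maximum KL over states; in practice it is replaced by the average KL:
$J(\pi_{\theta})-J(\pi')\geq \mathbb{E}_{\pi'}[\sum_{t=0}^{\infty}\gamma^{t}\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi'(a_{t}|s_{t})}A^{\pi}(s_{t},a_{t})]-\frac{4\gamma\max_{s,a}|A^{\pi}(s,a)|}{(1-\gamma)^{2}}\max_{s}D_{KL}(\pi'(\cdot|s)||\pi_{\theta}(\cdot|s))$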
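i.e., updating the policy inside a bounded KL ball keeps convergence stable.
Off-policyness: trajectories sampled from $\pi'$ can be used to optimize $\pi_{\theta}$, which is convenient for distributed training frameworks. The iteration cannot be expected to be very fast, though, because the form of the PPO objective limits the size of $D_{KL}(\pi'||\pi_{\theta})$ at every step, and after each update the parameters of $\pi_{\theta}$ must be copied back to $\pi'$.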

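From personal experience, the KL deserves particular attention during training. If the KL is well bounded, policy improvement has a theoretical guarantee; if it is not, the policy sometimes degrades and gets worse the longer it trains.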
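
A minimal sketch of how the KL might be monitored during training, assuming scalar Gaussians or small discrete distributions. The helper names and the `kl_target` parameter are hypothetical; the halve-or-double rule for $\beta$ with the factor 1.5 follows the adaptive-KL heuristic described in the PPO paper.

```python
import math

def kl_categorical(p_old, p_new):
    """D_KL(p_old || p_new) for two discrete action distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for scalars."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

def adapt_beta(beta, kl, kl_target):
    """Adjust the KL penalty coefficient after a policy update."""
    if kl < kl_target / 1.5:
        return beta / 2.0   # KL well below target: relax the penalty
    if kl > kl_target * 1.5:
        return beta * 2.0   # KL well above target: tighten the penalty
    return beta
```

Logging the measured KL after every update makes it easy to spot the moment the policy starts drifting outside the intended trust region.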


A brief explanation of GAE

GAE stands for generalized advantage estimator and comes from the paper High-Dimensional Continuous Control Using Generalized Advantage Estimation. In a word, its goal is to find a good point in the bias-variance tradeoff among the various ways of estimating the advantage function $A(s,a)=Q(s,a)-V(s)$.

[Figure unavailable: six forms of the policy gradient]

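Of the six forms of the policy gradient above, form 3 has the lowest theoretical variance, but because of computational complexity form 5 is generally used in practice.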

One way to estimate $A^{\pi}(s_{t},a_{t})$ is to take the rewards of the whole trajectory into account; this estimate has low bias and high variance:
$\hat{A}^{\pi}(s_{t},a_{t})=\sum_{k=0}^{T-t}\gamma^{k}R_{t+k}+\gamma^{T-t+1}V(s_{T+1})-V(s_{t})$

Another way is to lean on the existing $V(s)$ function; this estimate has higher bias and lower variance:
$\hat{A}(s_{t},a_{t})=R_{t}+\gamma V(s_{t+1})-V(s_{t})$

Forms in between the two can also be constructed. In summary:

$\hat{A}_t^{(1)} = R_t + \gamma V(s_{t+1}) - V(s_t) \\ \hat{A}_t^{(2)} = R_t + \gamma R_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t) \\ \hat{A}_{t}^{(n)} = \sum_{k=0}^{n-1}\gamma^{k}R_{t+k}+\gamma^{n}V(s_{t+n})-V(s_{t}) \\ \hat{A}_t^{(\infty)} = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots - V(s_t)$
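
The n-step family can be written as one small function. This is only a sketch: the name `n_step_advantage` and the convention that `values` holds one more entry than `rewards` (the value of the state reached after the last reward) are assumptions of this example.

```python
def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """n-step estimate: sum_{k<n} gamma^k R_{t+k} + gamma^n V(s_{t+n}) - V(s_t).

    rewards[i] is R_i and values[i] is V(s_i); len(values) == len(rewards) + 1.
    """
    assert t + n <= len(rewards), "not enough steps left in the trajectory"
    discounted = sum(gamma ** k * rewards[t + k] for k in range(n))
    return discounted + gamma ** n * values[t + n] - values[t]
```

Small `n` leans on the value function (more bias, less variance); large `n` leans on sampled rewards (less bias, more variance).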

The way to strike a bias-variance balance across all of these advantage estimators is to take an exponentially weighted average of $\hat{A}_{t}^{(1)}$ through $\hat{A}_{t}^{(\infty)}$:

$\hat{A}_t^{GAE(\gamma,\lambda)} = (1-\lambda)\Big(\hat{A}_{t}^{(1)} + \lambda \hat{A}_{t}^{(2)} + \lambda^2 \hat{A}_{t}^{(3)} + \cdots \Big) = \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l}^{V}$
where $\delta_{t}^{V}=R_{t}+\gamma V(s_{t+1})-V(s_{t})$ is the TD residual.
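
In practice the sum over TD residuals is usually computed with a backward recursion over a finite trajectory. The sketch below assumes the sum is truncated at the end of the trajectory and that `values` carries one extra bootstrap entry; the function name `gae` and the default $\lambda=0.95$ are choices made for this example.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards[i] is R_i and values[i] is V(s_i); len(values) == len(rewards) + 1.
    Uses A_t = delta_t + gamma * lam * A_{t+1}, with A = 0 past the last step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the one-step estimate $\hat{A}_{t}^{(1)}$, and `lam=1` recovers the full-trajectory estimate, so $\lambda$ is the knob that moves along the bias-variance tradeoff.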

AlphaStar: Mastering the Real-Time Strategy Game StarCraft II

[Figure unavailable: AlphaStar training overview, including the baseline model 001]

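This is the framework DeepMind built for StarCraft II, and it covers a great deal of ground.

The model uses off-policy actor-critic together with experience replay, self-imitation learning, and policy distillation.
To keep the strategies diverse, a baseline model is first trained with supervised learning (001 in the figure above). At the start of each iteration, several identical copies are made of the models from the previous iteration and trained through self-play; each copy has different hyperparameters, and even a different reward definition, and they are trained with PBT. Models from the previous iteration are no longer updated and are called frozen competitors. Opponent matchmaking in self-play is designed along the lines of the ladder matchmaking system for human players.
A transformer architecture outputs the action for each unit, combined with a pointer network and a centralized value baseline.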

Population based training of neural networks

PBT for short, from the DeepMind paper Population based training of neural networks; see also the DeepMind blog post.

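In essence it applies the idea of a genetic algorithm to hyperparameter tuning: train N models at the same time, each with its own set of hyperparameters; after training for a while, take some of the better-performing models, explore around their hyperparameters, and continue training.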

This also helps keep the models from getting stuck in local optima.
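
The exploit-and-explore loop described above might look roughly like the sketch below. Everything here is illustrative: the dictionary layout of a population member, the fraction `frac` of members replaced, and the perturbation factors are assumptions of this example, and training and evaluation are left out.

```python
import copy
import random

def pbt_step(population, rng, frac=0.25, factors=(0.8, 1.2)):
    """One exploit-and-explore step; members are mutated in place.

    Each member is a dict with 'hparams' (name -> float), 'weights', 'score'.
    Assumes frac <= 0.5 so the top and bottom groups do not overlap.
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    n_cut = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:n_cut], ranked[-n_cut:]
    for loser in bottom:
        winner = rng.choice(top)
        # Exploit: copy the weights of a better-performing member.
        loser["weights"] = copy.deepcopy(winner["weights"])
        # Explore: perturb each of the winner's hyperparameters by a random factor.
        loser["hparams"] = {name: value * rng.choice(factors)
                            for name, value in winner["hparams"].items()}
    return population
```

Calling `pbt_step(population, random.Random(0))` between training intervals would replace the weakest quarter of the population with perturbed copies of the strongest quarter.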

Policy distillation


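In the original policy distillation paper the teacher is a DQN, which is very different from our setting: we generally use policy-gradient methods, which can be distilled directly, so the question of which loss to use, discussed at length in the paper, does not arise.

For distilling into the student network, the authors tried three different losses:

Parameterize the student as $\pi_{\theta}:\mathcal{S}\rightarrow{\mathcal{A}}$ and use past data from the DQN replay memory directly as action labels.
Parameterize the student as $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ and use mean squared error as the loss.
Parameterize the student as $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, pass the Q-values output by both the student and the teacher through a softmax, and use the KL divergence from Hinton's paper as the loss.

The experiments show that the third loss works best.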
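
A minimal sketch of the third loss for a single state, in plain Python. The function names are hypothetical, and the temperature `tau` applied to the teacher's Q-values is an assumption of this example; with `tau=1.0` it matches the description above literally.

```python
import math

def log_softmax(xs, tau=1.0):
    """Numerically stable log-softmax of xs / tau."""
    scaled = [x / tau for x in xs]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def distill_kl_loss(q_teacher, q_student, tau=1.0):
    """KL(softmax(q_teacher / tau) || softmax(q_student)) for one state."""
    log_p = log_softmax(q_teacher, tau)
    log_q = log_softmax(q_student)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

Because the softmax ignores a constant offset, the student only has to match the relative ordering and spacing of the teacher's Q-values, not their absolute scale.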

Centralized value baseline

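From the AAAI-2018 paper Counterfactual Multi-Agent Policy Gradients (Foerster et al., University of Oxford). The idea is simple: in a multi-agent setting, all actors share a single critic.

The idea looks like a weaker version of OpenAI's multi-agent actor-critic paper. That paper at least took seriously the non-stationary MDP problem that arises in the multi-agent setting, whereas this one ignores it outright, which indirectly suggests that the non-stationarity that may exist in theory does not actually hurt model performance in engineering practice.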

Transformer

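The following draws on the blog post The Illustrated Transformer and the original Google Brain paper "Attention is All You Need".

The Transformer is the foundation of BERT, and with it NLP started down the road of doing away with RNNs.

In short, the paper achieves strong results on many tasks using only attention and feed-forward networks. It consists of the following main parts: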

Self-attention, encoder-decoder attention, and multi-head attention
Feed-forward network
Positional encoding
Residual blocks

First, self-attention.

[Figure unavailable: self-attention]

Self-attention has the form below, where each row of $X$ is one input vector:
$Attention(Q,K,V)=softmax(\frac{QK^{T}}{\sqrt{d_{k}}})V \\ \text{where}\ Q=XW_{Q}, K=XW_{K}, V=XW_{V}$
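
The formula can be traced step by step with a small, dependency-free sketch. Matrices are lists of rows, each row of `x` is one input vector, and the helper names are made up for this example; a real implementation would use a tensor library and batched operations.

```python
import math

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention; returns (output, attention weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]                  # K^T
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]            # one distribution per query
    return matmul(weights, v), weights
```

Each output row is a weighted average of the rows of $V$, with the weights for query $i$ given by how strongly $q_i$ matches every key.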

A picture is worth a thousand words:
[Figure unavailable: illustration of the self-attention computation]

Next, encoder-decoder attention.

This attention is really the same as the attention used in earlier seq2seq models:

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence.

Multi-head attention: put simply, it runs several self-attention heads side by side within the same layer. It serves two main purposes:

It expands the model’s ability to focus on different positions

It gives the attention layer multiple "representation subspaces"

Positional Encoding

To be updated...

In addition, the transformer network uses residual blocks to prevent the gradient degradation that comes with very deep networks.
