Improved Training of Wasserstein GANs (translation)
4 Gradient penalty
We now propose an alternative way to enforce the Lipschitz constraint. A differentiable function is 1-Lipschitz if and only if it has gradients with norm at most 1 everywhere, so we consider directly constraining the gradient norm of the critic's output with respect to its input. To circumvent tractability issues, we enforce a soft version of the constraint with a penalty on the gradient norm for random samples $\hat{x} \sim \mathbb{P}_{\hat{x}}$. Our new objective is
$$
L = \underset{\tilde{x} \sim \mathbb{P}_g}{\mathbb{E}}\big[D(\tilde{x})\big] - \underset{x \sim \mathbb{P}_r}{\mathbb{E}}\big[D(x)\big] + \lambda\, \underset{\hat{x} \sim \mathbb{P}_{\hat{x}}}{\mathbb{E}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big]
$$

where $\mathbb{P}_{\hat{x}}$ is defined by sampling uniformly along straight lines between pairs of points sampled from the data distribution $\mathbb{P}_r$ and the generator distribution $\mathbb{P}_g$, and $\lambda$ is the penalty coefficient.
**Penalty coefficient.** All experiments in this paper use $\lambda = 10$, which we found to work well across a variety of architectures and datasets ranging from toy tasks to large ImageNet CNNs.
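To make the penalty term concrete, here is a minimal PyTorch-style sketch (our illustration, not the authors' reference implementation); `critic`, `real`, and `fake` are placeholder names for a critic network and matching batches of real and generated samples, and the interpolation follows the straight-line sampling of $\mathbb{P}_{\hat{x}}$ described above:

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """Gradient penalty term: lam * E[(||grad_xhat D(xhat)||_2 - 1)^2]."""
    # epsilon ~ U[0, 1], one value per example, broadcast over the remaining dims
    eps = torch.rand((real.size(0),) + (1,) * (real.dim() - 1), device=real.device)
    xhat = eps * real + (1.0 - eps) * fake.detach()   # points on straight lines between real and fake
    xhat.requires_grad_(True)
    scores = critic(xhat)
    grads = torch.autograd.grad(
        outputs=scores.sum(),   # summing gives every example's input gradient in one backward pass
        inputs=xhat,
        create_graph=True,      # keep the graph so the penalty itself can be backpropagated
    )[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```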
**No critic batch normalization.** Most prior GAN implementations [22, 23, 2] use batch normalization in both the generator and the discriminator to help stabilize training, but batch normalization changes the form of the discriminator's problem from mapping a single input to a single output to mapping from an entire batch of inputs to a batch of outputs [23]. Our penalized training objective is no longer valid in this setting, since we penalize the norm of the critic's gradient with respect to each input independently, and not the entire batch. To resolve this, we simply omit batch normalization in the critic in our models, finding that they perform well without it. Our method works with normalization schemes which don't introduce correlations between examples. In particular, we recommend layer normalization [3] as a drop-in replacement for batch normalization.
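As a hedged illustration of such a critic block (our sketch, not the paper's exact architecture): `GroupNorm` with a single group normalizes each example over all of its channels and spatial positions, so it behaves like layer normalization for convolutional features and introduces no cross-example statistics.

```python
import torch.nn as nn

def critic_block(in_ch, out_ch, use_norm=True):
    """One downsampling block of a batch-norm-free critic (illustrative layer sizes)."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)]
    if use_norm:
        # per-example normalization only; no statistics are shared across the batch
        layers.append(nn.GroupNorm(1, out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)
```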
**Two-sided penalty.** We encourage the norm of the gradient to go towards 1 (two-sided penalty) instead of just staying below 1 (one-sided penalty). Empirically this seems not to constrain the critic too much, likely because the optimal WGAN critic anyway has gradients with norm 1 almost everywhere under $\mathbb{P}_r$ and $\mathbb{P}_g$.
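For reference, a self-contained sketch of how the two variants differ (our illustration; `grad_norm` stands for the per-example gradient norms computed as in the earlier `gradient_penalty` sketch):

```python
import torch

grad_norm = torch.tensor([0.7, 1.2, 1.0])  # example per-sample gradient norms
lam = 10.0

two_sided = lam * ((grad_norm - 1.0) ** 2).mean()                      # paper default: push norms toward 1
one_sided = lam * (torch.clamp(grad_norm - 1.0, min=0.0) ** 2).mean()  # penalize only norms above 1
```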
5 Experiments
5.1 Training random architectures within a set
We experimentally demonstrate our model's ability to train a large number of architectures which we think are useful to be able to train. Starting from the DCGAN architecture, we define a set of architecture variants by changing model settings to random corresponding values in Table 1. We believe that reliable training of many of the architectures in this set is a useful goal, but we do not claim that our set is an unbiased or representative sample of the whole space of useful architectures: it is designed to demonstrate a successful regime of our method, and readers should evaluate whether it contains architectures similar to their intended application.
Table 1: We evaluate WGAN-GP's ability to train the architectures in this set.
Table 2: Outcomes of training 200 random architectures, for different success thresholds. For comparison, our standard DCGAN scored 7.24.
5.2 Training varied architectures on LSUN bedrooms

To demonstrate our model's ability to train many architectures with its default settings, we train six different GAN architectures on the LSUN bedrooms dataset [31]. In addition to the baseline DCGAN architecture from [22], we choose six architectures whose successful training we demonstrate: (1) no BN and a constant number of filters in the generator, as in [2], (2) a 4-layer 512-dim ReLU MLP generator, as in [2], (3) no normalization in either the discriminator or generator, (4) gated multiplicative nonlinearities, as in [24], (5) tanh nonlinearities, and (6) a 101-layer ResNet generator and discriminator.
Figure 2: Different GAN architectures trained with different methods. We only succeeded in training every architecture with a shared set of hyperparameters using WGAN-GP.
Although we do not claim it is impossible without our method, to the best of our knowledge this is the first time very deep residual networks were successfully trained in a GAN setting. For each architecture, we train models using four different GAN methods: WGAN-GP, WGAN with weight clipping, DCGAN [22], and Least-Squares GAN [18]. For each objective, we used the default set of optimizer hyperparameters recommended in that work (except LSGAN, where we searched over learning rates).
For WGAN-GP, we replace any batch normalization in the discriminator with layer normalization (see section 4). We train each model for 200K iterations and present samples in Figure 2. We only succeeded in training every architecture with a shared set of hyperparameters using WGAN-GP. For every other training method, some of these architectures were unstable or suffered from mode collapse.
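For readers who want to reproduce this setup, here is a hedged sketch of a WGAN-GP training loop using commonly cited defaults (five critic updates per generator update, $\lambda = 10$, Adam with learning rate 1e-4 and betas (0, 0.9)). It is our illustration, not the authors' code; `generator`, `critic`, `data_loader`, and `latent_dim` are placeholders, and `gradient_penalty` is the earlier sketch.

```python
import torch

def train_wgan_gp(generator, critic, data_loader, latent_dim,
                  n_critic=5, lam=10.0, device="cpu"):
    """Sketch of one WGAN-GP training epoch (illustrative defaults, not the authors' code)."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
    opt_d = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.0, 0.9))

    for real in data_loader:                     # assumes the loader yields image tensors
        real = real.to(device)

        # Critic: n_critic updates per generator update.
        # (The paper samples a fresh real batch for each critic step; reusing one is a simplification.)
        for _ in range(n_critic):
            z = torch.randn(real.size(0), latent_dim, device=device)
            fake = generator(z).detach()
            d_loss = (critic(fake).mean() - critic(real).mean()
                      + gradient_penalty(critic, real, fake, lam))
            opt_d.zero_grad()
            d_loss.backward()
            opt_d.step()

        # Generator: maximize the critic's score on generated samples.
        z = torch.randn(real.size(0), latent_dim, device=device)
        g_loss = -critic(generator(z)).mean()
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
```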
5.3 Improved performance over weight clipping
One advantage of our method over weight clipping is improved training speed and sample quality. To demonstrate this, we train WGANs with weight clipping and with our gradient penalty on CIFAR-10 [13] and plot Inception scores [23] over the course of training in Figure 3. For WGAN-GP, we train one model with the same optimizer (RMSProp) and learning rate as WGAN with weight clipping, and another model with Adam and a higher learning rate. Even with the same optimizer, our method converges faster and to a better score than weight clipping. Using Adam further improves performance. We also plot the performance of DCGAN [22] and find that our method converges more slowly (in wall-clock time) than DCGAN, but its score is more stable at convergence.
Figure 3: CIFAR-10 Inception score over generator iterations (left) or wall-clock time (right) for four models: WGAN with weight clipping, WGAN-GP with RMSProp and Adam (to control for the optimizer), and DCGAN. WGAN-GP significantly outperforms weight clipping and performs comparably to DCGAN.
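Since the Inception score is the metric reported throughout this section, here is a small sketch of its definition, $\exp(\mathbb{E}_x[\mathrm{KL}(p(y|x)\,\|\,p(y))])$, computed from classifier softmax outputs on generated samples. This is a simplified illustration; the published protocol additionally averages the score over several sample splits.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, num_classes) softmax outputs p(y|x) of the Inception network on N samples."""
    probs = np.asarray(probs, dtype=np.float64)
    p_y = probs.mean(axis=0, keepdims=True)                                  # marginal p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)   # KL(p(y|x) || p(y))
    return float(np.exp(kl.mean()))
```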
5.4 Sample quality on CIFAR-10 and LSUN bedrooms
For equivalent architectures, our method achieves comparable sample quality to the standard GAN objective. However, the increased stability allows us to improve sample quality by exploring a wider range of architectures. To demonstrate this, we find an architecture which establishes a new state-of-the-art Inception score on unsupervised CIFAR-10 (Table 3). When we add label information (using the method in [20]), the same architecture outperforms all other published models except for SGAN.
Table 3: Inception scores on CIFAR-10. Our unsupervised model achieves state-of-the-art performance, and our conditional model outperforms all others except SGAN.
We also train a deep ResNet on 128 × 128 LSUN bedrooms and show samples in Figure 4. We believe these samples are at least competitive with the best reported so far on any resolution for this dataset.

Figure 4: Samples of 128 × 128 LSUN bedrooms. We believe these samples are at least comparable to the best published results so far.
5.5 Modeling discrete data with a continuous generator
To demonstrate our method's ability to model degenerate distributions, we consider the problem of modeling a complex discrete distribution with a GAN whose generator is defined over a continuous space. As an instance of this problem, we train a character-level GAN language model on the Google Billion Word dataset [6]. Our generator is a simple 1D CNN which deterministically transforms a latent vector into a sequence of 32 one-hot character vectors through 1D convolutions. We apply a softmax nonlinearity at the output, but use no sampling step: during training, the softmax output is passed directly into the critic (which, likewise, is a simple 1D CNN). When decoding samples, we just take the argmax of each output vector.
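A minimal sketch of such a generator follows (our illustration; the layer sizes and the plain convolutional stack are assumptions, not the paper's exact architecture). The output is a (batch, vocab_size, 32) tensor of per-position character distributions that is fed directly to the critic during training.

```python
import torch
import torch.nn as nn

class CharGenerator(nn.Module):
    """Maps a latent vector to 32 per-position character distributions (soft one-hot vectors)."""
    def __init__(self, latent_dim=128, hidden=512, vocab_size=96, seq_len=32):
        super().__init__()
        self.hidden, self.seq_len = hidden, seq_len
        self.fc = nn.Linear(latent_dim, hidden * seq_len)
        self.convs = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, vocab_size, kernel_size=1),
        )

    def forward(self, z):
        h = self.fc(z).view(z.size(0), self.hidden, self.seq_len)
        return torch.softmax(self.convs(h), dim=1)   # soft one-hot characters, no sampling step

# Decoding a sample: take the argmax character at each of the 32 positions.
# char_ids = CharGenerator()(torch.randn(1, 128)).argmax(dim=1)   # shape (1, 32)
```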
We present samples from the model in Table 4. Our model makes frequent spelling errors (likely because it has to output each character independently) but nonetheless manages to learn quite a lot about the statistics of language. We were unable to produce comparable results with the standard GAN objective, though we do not claim that doing so is impossible.
Table 4: Samples from a WGAN-GP character-level language model trained on sentences from the Billion Word dataset, truncated to 32 characters. The model learns to directly output one-hot character embeddings from a latent vector without any discrete sampling step. We were unable to achieve comparable results with the standard GAN objective and a continuous generator.
Figure 5: (a) The negative critic loss of our model on LSUN bedrooms converges toward a minimum as the network trains. (b) WGAN training and validation losses on a random 1000-digit subset of MNIST show overfitting when using either our method (left) or weight clipping (right). In particular, with our method, the critic overfits faster than the generator, causing the training loss to increase gradually over time even as the validation loss drops.
Other attempts at language modeling with GANs [32, 14, 30, 5, 15, 10] typically use discrete models and gradient estimators [28, 12, 17]. Our approach is simpler to implement, though whether it scales beyond a toy language model is unclear.
5.6 Meaningful loss curves and detecting overfitting
An important benefit of weight-clipped WGANs is that their loss correlates with sample quality and converges toward a minimum. To show that our method preserves this property, we train a WGAN-GP on the LSUN bedrooms dataset [31] and plot the negative of the critic's loss in Figure 5a. We see that the loss converges as the generator minimizes $W(\mathbb{P}_r, \mathbb{P}_g)$.
Given enough capacity and too little training data, GANs will overfit. To explore the loss curve's behavior when the network overfits, we train large unregularized WGANs on a random 1000-image subset of MNIST and plot the negative critic loss on both the training and validation sets in Figure 5b. In both WGAN and WGAN-GP, the two losses diverge, suggesting that the critic overfits and provides an inaccurate estimate of $W(\mathbb{P}_r, \mathbb{P}_g)$, at which point all bets are off regarding correlation with sample quality. However, in WGAN-GP, the training loss gradually increases even while the validation loss drops.
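A small sketch of this overfitting check (our illustration; `critic`, `generator`, and the batches are placeholders): the same negative critic loss is evaluated on training and held-out real data, and a growing gap between the two estimates indicates that the critic is overfitting.

```python
import torch

@torch.no_grad()
def negative_critic_loss(critic, real_batch, fake_batch):
    """E[D(real)] - E[D(fake)]: the negated critic loss (without the penalty term)."""
    return (critic(real_batch).mean() - critic(fake_batch).mean()).item()

# fake = generator(torch.randn(batch_size, latent_dim))
# train_estimate = negative_critic_loss(critic, real_train_batch, fake)
# valid_estimate = negative_critic_loss(critic, real_valid_batch, fake)
```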
[29] also measure overfitting in GANs by estimating the generator's log-likelihood. Compared to that work, our method detects overfitting in the critic (rather than the generator) and measures overfitting against the same loss that the network minimizes.
6 Conclusion

In this work, we demonstrated problems with weight clipping in WGAN and introduced an alternative in the form of a penalty term in the critic loss which does not exhibit the same problems. Using our method, we demonstrated strong modeling performance and stability across a variety of architectures. Now that we have a more stable algorithm for training GANs, we hope our work opens the path for stronger modeling performance on large-scale image datasets and language. Another interesting direction is adapting our penalty term to the standard GAN objective function, where it might stabilize training by encouraging the discriminator to learn smoother decision boundaries.
Acknowledgements

We would like to thank Mohamed Ishmael Belghazi, Léon Bottou, Zihang Dai, Stefan Doerr, Ian Goodfellow, Kyle Kastner, Kundan Kumar, Luke Metz, Alec Radford, Colin Raffel, Sai Rajeshwar, Aditya Ramesh, Tom Sercu, Zain Shah and Jake Zhao for insightful comments.
Original paper: http://tongtianta.site/paper/3418
Edited by Lornatang
Proofread by Lornatang