Improved Training of Wasserstein GANs (translation, part 1)
4 Gradient penalty
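We now propose an alternative way to enforce the Lipschitz constraint. A differentiable function is 1-Lipschitz if and only if it has gradients with norm at most 1 everywhere, so we consider directly constraining the gradient norm of the critic’s output with respect to its input. To circumvent tractability issues, we enforce a soft version of the constraint with a penalty on the gradient norm for random samples $\hat{x} \sim \mathbb{P}_{\hat{x}}$. Our new objective is
$$L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim \mathbb{P}_r}\left[D(x)\right] + \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right].$$
Sampling distribution. We implicitly define $\mathbb{P}_{\hat{x}}$ by sampling uniformly along straight lines between pairs of points sampled from the data distribution $\mathbb{P}_r$ and the generator distribution $\mathbb{P}_g$.
Penalty coefficient. All experiments in this paper use $\lambda = 10$, which we found to work well across a variety of architectures and datasets ranging from toy tasks to large ImageNet CNNs.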
No critic batch normalization. Most prior GAN implementations [22, 23, 2] use batch normalization in both the generator and the discriminator to help stabilize training, but batch normalization changes the form of the discriminator’s problem from mapping a single input to a single output to mapping from an entire batch of inputs to a batch of outputs [23]. Our penalized training objective is no longer valid in this setting, since we penalize the norm of the critic’s gradient with respect to each input independently, and not the entire batch. To resolve this, we simply omit batch normalization in the critic in our models, finding that they perform well without it. Our method works with normalization schemes which don’t introduce correlations between examples. In particular, we recommend layer normalization [3] as a drop-in replacement for batch normalization.
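To make the point about correlations between examples concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes simplified normalization layers without learned scale and shift, and the helper names `layer_norm` and `batch_norm` and the toy batches are hypothetical. It checks whether the normalized output for one example changes when a different example in the same batch changes.

```python
import math


def layer_norm(batch, eps=1e-5):
    """Normalize each example over its own features (no cross-example statistics)."""
    out = []
    for x in batch:
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        out.append([(v - mean) / math.sqrt(var + eps) for v in x])
    return out


def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the whole batch (training-mode statistics)."""
    n, dim = len(batch), len(batch[0])
    means = [sum(x[j] for x in batch) / n for j in range(dim)]
    variances = [sum((x[j] - means[j]) ** 2 for x in batch) / n for j in range(dim)]
    return [
        [(x[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
        for x in batch
    ]


# The same example x0 placed in two batches that differ only in the second example.
x0 = [1.0, 2.0, 4.0]
batch_a = [x0, [0.0, 0.0, 0.0]]
batch_b = [x0, [5.0, -3.0, 9.0]]

# Layer normalization: the output for x0 ignores the rest of the batch.
ln_unchanged = layer_norm(batch_a)[0] == layer_norm(batch_b)[0]
# Batch normalization: the output for x0 depends on the other example.
bn_unchanged = batch_norm(batch_a)[0] == batch_norm(batch_b)[0]
```

In this sketch `ln_unchanged` is true and `bn_unchanged` is false, which is why a per-input gradient penalty stays well defined under layer normalization but not under batch normalization.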
Two-sided penalty. We encourage the norm of the gradient to go towards 1 (two-sided penalty) instead of just staying below 1 (one-sided penalty). Empirically this seems not to constrain the critic too much, likely because the optimal WGAN critic anyway has gradients with norm 1 almost everywhere under $\mathbb{P}_r$ and $\mathbb{P}_g$ and in large portions of the region in between.
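The penalized critic loss described in this section can be sketched in a few lines. This is an illustrative pure-Python version under stated assumptions, not the authors' code: the critic is any Python function of a single input vector, its gradient is estimated by central differences rather than automatic differentiation, the random samples are assumed to be drawn by interpolating between one real and one generated point, and the one-sided variant is assumed to take the form max(0, norm - 1)^2. The names `numerical_gradient` and `wgan_gp_critic_loss` are hypothetical.

```python
import math
import random


def numerical_gradient(critic, x, h=1e-5):
    """Central-difference estimate of the critic's gradient at a single input x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((critic(up) - critic(down)) / (2 * h))
    return grad


def wgan_gp_critic_loss(critic, real_batch, fake_batch, lam=10.0,
                        two_sided=True, rng=None):
    """Mean D(fake) - mean D(real) + lam * mean gradient penalty."""
    rng = rng or random.Random()
    penalties = []
    for x_real, x_fake in zip(real_batch, fake_batch):
        # Assumed sampling scheme: a random point on the line between the pair.
        eps = rng.random()
        x_hat = [eps * r + (1 - eps) * f for r, f in zip(x_real, x_fake)]
        # The gradient is taken with respect to each input independently.
        grad = numerical_gradient(critic, x_hat)
        norm = math.sqrt(sum(g * g for g in grad))
        if two_sided:
            penalties.append((norm - 1.0) ** 2)            # push the norm towards 1
        else:
            penalties.append(max(0.0, norm - 1.0) ** 2)    # only penalize norm > 1
    mean_fake = sum(critic(x) for x in fake_batch) / len(fake_batch)
    mean_real = sum(critic(x) for x in real_batch) / len(real_batch)
    return mean_fake - mean_real + lam * sum(penalties) / len(penalties)


# Toy usage: a linear critic whose gradient has norm 0.5 everywhere.
def toy_critic(x):
    return 0.3 * x[0] + 0.4 * x[1]


real = [[1.0, 0.0]]
fake = [[0.0, 0.0]]
loss_two_sided = wgan_gp_critic_loss(toy_critic, real, fake, rng=random.Random(0))
loss_one_sided = wgan_gp_critic_loss(toy_critic, real, fake, two_sided=False,
                                     rng=random.Random(0))
```

For this toy critic the gradient norm is below 1, so only the two-sided form adds a penalty. A practical implementation would compute the gradient with an automatic differentiation framework instead of finite differences.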
5 Experiments
5.1 Training random architectures within a set
We experimentally demonstrate our model’s ability to train a large number of architectures which we think are useful to be able to train. Starting from the DCGAN architecture, we define a set of architecture variants by changing model settings to random corresponding values in Table 1. We believe that reliable training of many of the architectures in this set is a useful goal, but we do not claim that our set is an unbiased or representative sample of the whole space of useful architectures: it is designed to demonstrate a successful regime of our method, and readers should evaluate whether it contains architectures similar to their intended application.
Table 1: We evaluate WGAN-GP’s ability to train the architectures in this set.
Table 2: Outcomes of training 200 random architectures, for different success thresholds. For comparison, our standard DCGAN scored 7.24.
5.2 Training varied architectures on LSUN bedrooms
To demonstrate our model’s ability to train many architectures with its default settings, we train six different GAN architectures on the LSUN bedrooms dataset [31]. In addition to the baseline DCGAN architecture from [22], we choose six architectures whose successful training we demonstrate: (1) no BN and a constant number of filters in the generator, as in [2], (2) 4-layer 512-dim ReLU MLP generator, as in [2], (3) no normalization in either the discriminator or generator, (4) gated multiplicative nonlinearities, as in [24], (5) tanh nonlinearities, and (6) 101-layer ResNet generator and discriminator.
Figure 2: Different GAN architectures trained with different methods. We only succeeded in training every architecture with a shared set of hyperparameters using WGAN-GP.
Although we do not claim it is impossible without our method, to the best of our knowledge this is the first time very deep residual networks were successfully trained in a GAN setting. For each architecture, we train models using four different GAN methods: WGAN-GP, WGAN with weight clipping, DCGAN [22], and Least-Squares GAN [18]. For each objective, we used the default set of optimizer hyperparameters recommended in that work (except LSGAN, where we searched over learning rates).
For WGAN-GP, we replace any batch normalization in the discriminator with layer normalization (see section 4). We train each model for 200K iterations and present samples in Figure 2. We only succeeded in training every architecture with a shared set of hyperparameters using WGAN-GP. For every other training method, some of these architectures were unstable or suffered from mode collapse.
5.3 Improved performance over weight clipping
One advantage of our method over weight clipping is improved training speed and sample quality. To demonstrate this, we train WGANs with weight clipping and our gradient penalty on CIFAR10 [13] and plot Inception scores [23] over the course of training in Figure 3. For WGAN-GP, we train one model with the same optimizer (RMSProp) and learning rate as WGAN with weight clipping, and another model with Adam and a higher learning rate. Even with the same optimizer, our method converges faster and to a better score than weight clipping. Using Adam further improves performance. We also plot the performance of DCGAN [22] and find that our method converges more slowly (in wall-clock time) than DCGAN, but its score is more stable at convergence.
Figure 3: CIFAR-10 Inception score over generator iterations (left) or wall-clock time (right) for four models: WGAN with weight clipping, WGAN-GP with RMSProp and Adam (to control for the optimizer), and DCGAN. WGAN-GP significantly outperforms weight clipping and performs comparably to DCGAN.
5.4 Sample quality on CIFAR-10 and LSUN bedrooms
For equivalent architectures, our method achieves comparable sample quality to the standard GAN objective. However, the increased stability allows us to improve sample quality by exploring a wider range of architectures. To demonstrate this, we find an architecture which establishes a new state-of-the-art Inception score on unsupervised CIFAR-10 (Table 3). When we add label information (using the method in [20]), the same architecture outperforms all other published models except for SGAN.
Table 3: Inception scores on CIFAR-10. Our unsupervised model achieves state-of-the-art performance, and our conditional model outperforms all others except SGAN.
We also train a deep ResNet on LSUN bedrooms and show samples in Figure 4. We believe these samples are at least competitive with the best reported so far on any resolution for this dataset.
5.5 Modeling discrete data with a continuous generator
Figure 4: Samples of LSUN bedrooms. We believe these samples are at least comparable to the best published results so far.
To demonstrate our method’s ability to model degenerate distributions, we consider the problem of modeling a complex discrete distribution with a GAN whose generator is defined over a continuous space. As an instance of this problem, we train a character-level GAN language model on the Google Billion Word dataset [6]. Our generator is a simple 1D CNN which deterministically transforms a latent vector into a sequence of 32 one-hot character vectors through 1D convolutions. We apply a softmax nonlinearity at the output, but use no sampling step: during training, the softmax output is passed directly into the critic (which, likewise, is a simple 1D CNN). When decoding samples, we just take the argmax of each output vector.
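The output stage described above, a softmax per character position during training and an argmax when decoding, can be illustrated with a small sketch. It is pure Python and makes several assumptions: the four-symbol `CHARSET`, the hand-written logits standing in for the generator's final layer, and the helper names are all hypothetical, and a real model would produce 32 positions over a full character vocabulary.

```python
import math

CHARSET = "abc "  # hypothetical toy vocabulary, not the one used in the paper


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generator_output(logit_rows):
    """Per-position softmax. During training this continuous matrix is what
    the critic receives; no character is sampled."""
    return [softmax(row) for row in logit_rows]


def decode(prob_rows, charset=CHARSET):
    """At decoding time, take the argmax of each output vector."""
    indices = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    return "".join(charset[i] for i in indices)


# Stand-in for the generator's final-layer logits: 4 positions, 4 symbols.
logits = [
    [2.0, 0.1, -1.0, 0.0],
    [0.0, 0.0, 3.0, 0.5],
    [-1.0, 0.2, 0.1, 2.5],
    [0.3, 1.7, 0.0, -0.4],
]
probs = generator_output(logits)
text = decode(probs)
```

Because the critic sees `probs` rather than sampled characters, the whole pipeline stays differentiable, and the discrete string `text` is only produced when inspecting samples.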
We present samples from the model in Table 4. Our model makes frequent spelling errors (likely because it has to output each character independently) but nonetheless manages to learn quite a lot about the statistics of language. We were unable to produce comparable results with the standard GAN objective, though we do not claim that doing so is impossible.
Table 4: Samples from a WGAN-GP character-level language model trained on sentences from the Billion Word dataset, truncated to 32 characters. The model learns to directly output one-hot character embeddings from a latent vector without any discrete sampling step. We were unable to achieve comparable results with the standard GAN objective and a continuous generator.
Figure 5: (a) The negative critic loss of our model on LSUN bedrooms converges toward a minimum as the network trains. (b) WGAN training and validation losses on a random 1000-digit subset of MNIST show overfitting when using either our method (left) or weight clipping (right). In particular, with our method, the critic overfits faster than the generator, causing the training loss to increase gradually over time even as the validation loss drops.
Other attempts at language modeling with GANs [32, 14, 30, 5, 15, 10] typically use discrete models and gradient estimators [28, 12, 17]. Our approach is simpler to implement, though whether it scales beyond a toy language model is unclear.
5.6 Meaningful loss curves and detecting overfitting
An important benefit of weight-clipped WGANs is that their loss correlates with sample quality and converges toward a minimum. To show that our method preserves this property, we train a WGAN-GP on the LSUN bedrooms dataset [31] and plot the negative of the critic’s loss in Figure 5a. We see that the loss converges as the generator minimizes $W(\mathbb{P}_r, \mathbb{P}_g)$.
Given enough capacity and too little training data, GANs will overfit. To explore the loss curve’s behavior when the network overfits, we train large unregularized WGANs on a random 1000-image subset of MNIST and plot the negative critic loss on both the training and validation sets in Figure 5b. In both WGAN and WGAN-GP, the two losses diverge, suggesting that the critic overfits and provides an inaccurate estimate of $W(\mathbb{P}_r, \mathbb{P}_g)$, at which point all bets are off regarding correlation with sample quality. However, in WGAN-GP, the training loss gradually increases even while the validation loss drops.
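The overfitting check described above amounts to evaluating the same negative critic loss on training and held-out real data and watching the two values diverge. The sketch below is a hedged illustration in pure Python: the penalty term is omitted from the reported quantity as a simplifying assumption, the function names are hypothetical, and the memorizing critic is an artificial extreme case rather than a trained network.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def negative_critic_loss(critic, real_batch, fake_batch):
    """Mean D(real) - mean D(fake): the critic's estimate of the Wasserstein
    distance (gradient penalty term left out for simplicity)."""
    return mean(critic(x) for x in real_batch) - mean(critic(x) for x in fake_batch)


def overfitting_gap(critic, train_real, val_real, fake_batch):
    """Return (train estimate, validation estimate, difference between them)."""
    train = negative_critic_loss(critic, train_real, fake_batch)
    val = negative_critic_loss(critic, val_real, fake_batch)
    return train, val, train - val


# Artificial extreme case: a critic that has memorized the training points.
train_real = [(0.0,), (1.0,)]
val_real = [(0.5,), (2.0,)]
fake = [(0.25,), (0.75,)]
memorized = set(train_real)


def memorizing_critic(x):
    return 1.0 if x in memorized else 0.0


train_est, val_est, gap = overfitting_gap(memorizing_critic, train_real, val_real, fake)
```

Here the training estimate is high while the validation estimate is zero, so the gap is large. In practice one would log both values every few hundred iterations and treat a growing gap as a sign that the critic is overfitting.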
[29] also measure overfitting in GANs by estimating the generator’s log-likelihood. Compared to that work, our method detects overfitting in the critic (rather than the generator) and measures overfitting against the same loss that the network minimizes.
6 Conclusion
In this work, we demonstrated problems with weight clipping in WGAN and introduced an alternative in the form of a penalty term in the critic loss which does not exhibit the same problems. Using our method, we demonstrated strong modeling performance and stability across a variety of architectures. Now that we have a more stable algorithm for training GANs, we hope our work opens the path for stronger modeling performance on large-scale image datasets and language. Another interesting direction is adapting our penalty term to the standard GAN objective function, where it might stabilize training by encouraging the discriminator to learn smoother decision boundaries.
We would like to thank Mohamed Ishmael Belghazi, Léon Bottou, Zihang Dai, Stefan Doerr, Ian Goodfellow, Kyle Kastner, Kundan Kumar, Luke Metz, Alec Radford, Colin Raffel, Sai Rajeshwar, Aditya Ramesh, Tom Sercu, Zain Shah and Jake Zhao for insightful comments.
Edited by Lornatang
Proofread by Lornatang