CVPR paper translation

Are GANs Created Equal? A Large-Scale Study

2019-04-10 · Lornatang

Are GANs Created Equal? A Large-Scale Study (Translation, Part 1)

5. Metrics


In this work we focus on two sets of metrics. We first analyze the recently proposed FID in terms of robustness (of the metric itself), and conclude that it has desirable properties and can be used in practice. Nevertheless, this metric, as well as the Inception Score (IS), is incapable of detecting overfitting: a memory GAN which simply stores all training samples would score perfectly under both measures. Based on these shortcomings, we propose an approximation to precision and recall for GANs and show that it can be used to quantify the degree of overfitting. We stress that the proposed method should be viewed as complementary to IS or FID, rather than a replacement.


5.1. Fréchet Inception Distance


FID was shown to be robust to noise [10]. Here we quantify the bias and variance of FID, its sensitivity to the encoding network, and its sensitivity to mode dropping. To this end, we partition the data set into two groups. We then define the data distribution as the empirical distribution on a random subsample of the first group, and the model distribution as the empirical distribution on a random subsample of the second group. For a random partition, this "model distribution" should follow the data distribution.

Bias and variance. We evaluate the bias and variance of FID on four classic data sets used in the GAN literature. We start by using the default train vs. test partition and compute the FID between the test set (limited to 10000 samples for CelebA) and a sample of size N from the train set. The sampling from the train set is repeated several times. The optimistic estimates of FID are reported in Table 2. We observe that FID has rather high bias, but small variance. From this perspective, estimating the full covariance matrix might be unnecessary and counter-productive, and a constrained version might suffice.

2Furthermore, while we present the results which were obtained by a random search, we have also investigated sequential Bayesian optimization, which resulted in comparable results.


Table 2: Bias and variance of FID. If the data distribution matches the model distribution, FID should evaluate to zero. However, we observe some bias and low variance on samples of size 10000.


To test the sensitivity to this initial choice of train vs. test partitioning, we consider 50 random partitions (keeping the relative sizes fixed, i.e. 6:1 for MNIST) and compute the FID in the same way. We observe results similar to Table 2, which is to be expected if the train and test data sets are drawn from the same distribution.
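
A minimal sketch (not the authors' code) of how FID between two sets of samples can be estimated under the protocol above: it fits a Gaussian to each feature sample and computes the Fréchet distance between the two Gaussians. The feature arrays below are random stand-ins for activations of the Inception encoder, and the split sizes are illustrative.

```python
import numpy as np
from scipy import linalg

def fid(feats_a, feats_b, eps=1e-6):
    """Frechet distance between Gaussians fitted to two feature samples.

    feats_a, feats_b: arrays of shape (n_samples, feature_dim), e.g. encoder
    activations. Returns ||mu_a - mu_b||^2 + Tr(Ca + Cb - 2 (Ca Cb)^{1/2}).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; eps * I added for numerical stability.
    covmean, _ = linalg.sqrtm(
        (cov_a + eps * np.eye(len(cov_a))) @ (cov_b + eps * np.eye(len(cov_b))),
        disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * np.trace(covmean))

# Illustrative bias/variance probe: FID between two random halves of one sample
# should be close to zero, up to the bias discussed in the text.
rng = np.random.default_rng(0)
feats = rng.normal(size=(20000, 64))      # stand-in for encoded real images
scores = []
for _ in range(10):                       # repeat the random subsampling
    idx = rng.permutation(len(feats))
    scores.append(fid(feats[idx[:10000]], feats[idx[10000:]]))
print(np.mean(scores), np.std(scores))
```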


Detecting mode dropping with FID. To simulate missing modes, we fix a partition of the data set and subsample one part, keeping only the samples from the first k classes, increasing k from 1 to 10. For each k, we consider 50 random subsamples. Figure 1 shows that FID is heavily influenced by the missing modes.

Figure 1: As the sample captures more classes, the FID with respect to the reference data set decreases. We observe that FID drastically increases under mode dropping.

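A sketch of this mode-dropping probe, reusing the fid helper from the sketch in Section 5.1 above. The class-dependent synthetic features are stand-ins so that the trend is visible; with a real labeled data set the features would come from the encoder instead.

```python
import numpy as np

rng = np.random.default_rng(1)
labels = rng.integers(0, 10, size=10000)                      # 10 classes, MNIST-style
ref_feats = rng.normal(size=(10000, 64)) + labels[:, None]    # class-dependent stand-in features

for k in range(1, 11):                     # keep only the first k classes
    kept = ref_feats[labels < k]
    scores = []
    for _ in range(50):                    # 50 random subsamples per k
        idx = rng.choice(len(kept), size=min(2000, len(kept)), replace=False)
        scores.append(fid(ref_feats, kept[idx]))
    # With real features, FID drops toward its bias floor as k approaches 10.
    print(k, round(float(np.mean(scores)), 2))
```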

Sensitivity to the encoding network. Suppose we compute FID using a different network and encoding layer. Would the ranking of models change? To test this, we apply a VGG network trained on ImageNet and consider the layer FC7 of dimension 4096. Figure 2 shows the resulting distribution. We observe a high Spearman rank correlation, which encourages the use of the default coding layer suggested by the authors. Of course, a natural comparison would be to apply a VGG trained on some other data set, which we leave for future work.
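
The rank-correlation check itself is a one-liner with scipy; the two arrays below are simulated stand-ins for the per-model FID values computed with the Inception and VGG encoders.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
fid_inception = rng.uniform(20, 200, size=100)             # FID per model, Inception features
fid_vgg = 1.7 * fid_inception + rng.normal(0, 10, 100)     # same models scored with VGG FC7 features
rho, pval = spearmanr(fid_inception, fid_vgg)
print(f"Spearman rank correlation: {rho:.3f} (p = {pval:.2g})")
```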


Figure 2: The difference between the FID score computed with InceptionNet and the FID computed with VGG for the CELEBA data set (for the interesting range: FID < 200). We observe a high Spearman rank correlation, which encourages the use of the default coding layer suggested by the authors.


5.2. Precision, Recall and F1 Score


Precision, recall and F1 score are proven and widely adopted techniques for quantitatively evaluating the quality of discriminative models. Precision measures the fraction of relevant instances among the retrieved instances, while recall measures the fraction of the retrieved instances among all relevant instances. The F1 score is the harmonic mean of precision and recall.

Notice that IS only captures precision: It will not penalize the model for not producing all modes of the data distribution — it will only penalize the model for not producing all classes. On the other hand, FID captures both precision and recall. Indeed, a model which fails to recover different modes of the data distribution will suffer in terms of FID.


We propose a simple and effective data set for evaluating (and comparing) generative models. Our main motivation is that the currently used data sets are either too simple (e.g. simple mixtures of Gaussians, or MNIST) or too complex (e.g. ImageNet). We argue that it is critical to be able to increase the complexity of the task in a relatively smooth and controlled fashion. To this end, we present a set of tasks for which we can approximate the precision and recall of each model. As a result, we can compare different models based on established metrics.


Manifold of convex polygons. The main idea is to construct a data manifold such that the distances from samples to the manifold can be computed efficiently. As a result, the problem of evaluating the quality of the generative model is effectively transformed into a problem of computing the distance to the manifold. This enables an intuitive approach for defining the quality of the model. Namely, if the samples from the model distribution are (on average) close to the manifold, its precision is high. Similarly, high recall implies that the generator can recover (i.e. generate something close to) any sample from the manifold.



Figure 3: Samples from models with (a) high recall and precision, (b) high precision, but low recall (lacking in diversity), (c) low precision, but high recall (can decently reproduce triangles, but fails to capture convexity), and (d) low precision and low recall.


For general data sets, this reduction is impractical as one has to compute the distance to the manifold which we are trying to learn. However, if we construct a manifold such that this distance is efficiently computable, the precision and recall can be efficiently evaluated.


To this end, we propose a set of toy data sets for which such computation can be performed efficiently: the manifold of convex polygons. As the simplest example, let us focus on gray-scale triangles represented as one-channel images, as in Figure 3. These triangles belong to a low-dimensional manifold embedded in the image space. Intuitively, the coordinate system of this manifold represents the axes of variation (e.g. rotation, translation, minimum angle size, etc.). A good generative model should be able to capture these factors of variation and recover the training samples. Furthermore, it should recover any sample from this manifold, from which we can efficiently sample, as illustrated in Figure 3.

Computing the distance to the manifold. Let us consider the simplest case: single-channel gray-scale images represented as vectors x. The distance of a sample x to the manifold M is defined as the squared Euclidean distance to the closest sample from the manifold, i.e.

d(x, M) = min_{x' ∈ M} ||x − x'||²

Figure 4: How does the minimum FID behave as a function of the budget? The plot shows the distribution of the minimum FID achievable for a fixed budget along with one standard deviation interval. For each budget, we estimate the mean and variance using 5000 bootstrap resamples out of 100 runs. We observe that, given a relatively low budget (say less than 15 hyperparameter settings), all models achieve a similar minimum FID. Furthermore, for a fixed FID, “bad” models can outperform “good” models given enough computational budget. We argue that the computational budget to search over hyperparameters is an important aspect of the comparison between algorithms.


This is a non-convex optimization problem. We find an approximate solution by gradient descent on the vertices of the triangle (more generally, a convex polygon), ensuring that each iterate is a valid triangle (more generally, a convex polygon). To reduce the false-negative rate, we repeat the algorithm several times from random initial solutions. To compute the latent representation of a sample x, we invert the generator, i.e. we solve

z*(x) = argmin_z ||G(z) − x||²

using gradient descent on z while keeping G fixed [15].
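
A minimal sketch of this inversion step in PyTorch, assuming a trained generator mapping a 64-dimensional latent code to a flattened image. The tiny stand-in network, step count, and learning rate are illustrative rather than the study's settings; only the optimization structure (gradient descent on z with G frozen) follows the text.

```python
import torch

latent_dim, img_dim = 64, 28 * 28
# Stand-in generator; in practice this is the trained GAN generator.
G = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, img_dim), torch.nn.Tanh(),
)
for p in G.parameters():
    p.requires_grad_(False)               # keep G fixed; only z is optimized

def invert(x, steps=200, lr=0.05):
    """Approximate z*(x) = argmin_z ||G(z) - x||^2 by gradient descent on z."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((G(z) - x) ** 2).sum()
        loss.backward()
        opt.step()
    return z.detach(), loss.item()        # latent code and its reconstruction error

x = torch.rand(1, img_dim) * 2 - 1        # a stand-in image scaled to [-1, 1]
z_star, recon_err = invert(x)
print(recon_err)                          # squared distance used later for recall
```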


6. Large-scale Experimental Evaluation


We consider two budget-constrained experimental setups: (i) a wide one-shot setup, in which one may select 100 hyperparameter samples per model and the range for each hyperparameter is wide, and (ii) a narrow two-shot setup, in which one may select 50 samples from narrower ranges that were manually chosen by first performing the wide hyperparameter search on a specific data set. For the exact ranges and hyperparameter search details we refer the reader to Appendix A. In the second set of experiments we evaluate the models based on the "novel" metric: the F1 score on the proposed data set. Finally, we included the Variational Autoencoder [13] in the experiments as a popular alternative.


6.1. Experimental Setup


To ensure a fair comparison, we made the following choices: (i) we use the generator and discriminator architecture from INFO GAN [5], as the resulting function space is rich enough and none of the considered GANs was originally designed for this architecture. Furthermore, it is similar to the proven architecture used in DCGAN [20]. The exception is BEGAN, where an autoencoder is used as the discriminator. We maintain similar expressive power to INFO GAN by using identical convolutional layers in the encoder and approximately matching the total number of parameters.


For all experiments we fix the latent code size to 64 and the prior distribution over the latent space to be uniform, except for VAE where it is a Gaussian N(0, I). We choose Adam [12] as the optimization algorithm as it was the most popular choice in the GAN literature.3 We apply the same learning rate for both generator and discriminator. We set the batch size to 64 and perform optimization for 20 epochs on MNIST and FASHION MNIST, 40 on CELEBA, and 100 on CIFAR.4

Finally, we allow for recent suggestions, such as batch normalization in the discriminator and imbalanced update frequencies of the generator and discriminator. We explore these possibilities, together with the learning rate, the parameters of Adam, and the hyperparameters of each model. We report the hyperparameter ranges and other details in Appendix A.
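
For concreteness, a sketch of the shared training configuration in PyTorch. The generator and discriminator below are placeholders for the INFO GAN architecture, the learning rate is a single illustrative value from a searched range, and the epoch counts follow the paragraph above; nothing here should be read as the study's exact code.

```python
import torch

latent_dim, batch_size = 64, 64
epochs = {"mnist": 20, "fashion_mnist": 20, "celeba": 40, "cifar": 100}

# Placeholder networks standing in for the INFO GAN generator/discriminator.
G = torch.nn.Sequential(torch.nn.Linear(latent_dim, 784), torch.nn.Tanh())
D = torch.nn.Sequential(torch.nn.Linear(784, 1))

lr = 2e-4                                          # illustrative value, not a recommendation
opt_g = torch.optim.Adam(G.parameters(), lr=lr)    # same learning rate for the generator ...
opt_d = torch.optim.Adam(D.parameters(), lr=lr)    # ... and the discriminator

def sample_prior(n):
    # Uniform prior over the latent space (the exact range here is illustrative).
    return torch.rand(n, latent_dim) * 2 - 1
```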


6.2. A Large Hyperparameter Search


We perform hyperparameter optimization and, for each run, look for the best FID across the training run (simulating early stopping). To choose the best model, every 5 epochs we compute the FID between the 10k samples generated by the model and the 10k samples from the test set. We have performed this computationally expensive search for each data set. We present the sensitivity of models to the hyperparameters in Figure 5 and the best FID achieved by each model in Table 3.
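
An illustrative sketch of this selection protocol. Both sample_hyperparameters and train_and_score are hypothetical stand-ins: the real ranges are listed in Appendix A and the real score is the FID between 10k generated and 10k test samples computed every 5 epochs, so the snippet only demonstrates the bookkeeping of random search with early stopping.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_hyperparameters():
    # Hypothetical wide ranges, purely illustrative (the actual ranges are in Appendix A).
    return {
        "lr": 10 ** rng.uniform(-5, -2),
        "beta1": rng.uniform(0.0, 0.9),
        "disc_steps_per_gen_step": int(rng.integers(1, 6)),
        "batch_norm_in_disc": bool(rng.integers(0, 2)),
    }

def train_and_score(config, epochs=20, eval_every=5):
    # Stand-in for training a GAN with `config` and computing FID every 5 epochs;
    # here a noisy trajectory is simulated instead of real training.
    return [float(rng.uniform(20, 300)) for _ in range(0, epochs, eval_every)]

budget = 100                                  # hyperparameter samples per model (wide setup)
best_fid, best_config = np.inf, None
for _ in range(budget):
    config = sample_hyperparameters()
    fid_trace = train_and_score(config)
    run_best = min(fid_trace)                 # best FID across the run, i.e. early stopping
    if run_best < best_fid:
        best_fid, best_config = run_best, config
print(best_fid, best_config)
```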


3An empirical comparison to RMSProp is provided in Appendix F.

4Those four data sets are a popular choice for generative modeling. They are of simple to medium complexity, making it possible to run many experiments as well as getting decent results.



Figure 5: A wide range hyperparameter search (100 hyperparameter samples per model). Black stars indicate the performance of suggested hyperparameter settings. We observe that GAN training is extremely sensitive to hyperparameter settings and there is no model which is significantly more stable than others. The importance of hyperparameter search is further highlighted in Figure 15.


Table 3: Best FID obtained in a large-scale hyperparameter search for each data set. The scores were computed in two phases: first, we run a large-scale search on a wide range of hyperparameters, and select the best model. Then, we re-run the training of the selected model 50 times with different initialization seeds, to estimate the stability of the training and report the mean FID and standard deviation, excluding outliers. The asterisk (*) on some combinations of models and data sets indicates the presence of significant outlier runs, usually severe mode collapses or training failures (* indicates up to 20% failures). We observe that the performance of each model heavily depends on the data set and no model strictly dominates the others. We note that VAE is heavily penalized due to the blurriness of the generated images. Note that these results are not “state-of-the-art”: (i) larger architectures could improve all models, (ii) authors often report the best FID which opens the door for random seed optimization.



Critically, we consider the mean FID as the computational budget increases, as shown in Figure 4. There are three important observations. Firstly, there is no algorithm which clearly dominates the others. Secondly, for an interesting range of FIDs, a “bad” model trained on a large budget can outperform a “good” model trained on a small budget. Finally, when the budget is limited, any statistically significant comparison of the models is unattainable.
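
A sketch of the budget analysis behind Figure 4: given one FID value per hyperparameter sample (simulated below), the minimum FID achievable with a fixed budget is estimated by bootstrap resampling, mirroring the 5000 resamples out of 100 runs described in the caption.

```python
import numpy as np

rng = np.random.default_rng(4)
run_fids = rng.uniform(25, 300, size=100)     # placeholder: one FID per hyperparameter sample

def min_fid_for_budget(fids, budget, n_boot=5000):
    """Bootstrap the minimum FID achievable when only `budget` runs are affordable."""
    mins = np.empty(n_boot)
    for b in range(n_boot):
        picked = rng.choice(fids, size=budget, replace=True)
        mins[b] = picked.min()
    return mins.mean(), mins.std()

for budget in (5, 15, 50, 100):
    mean, std = min_fid_for_budget(run_fids, budget)
    print(f"budget={budget:3d}  min FID = {mean:6.1f} +/- {std:.1f}")
```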


6.3. Impact of Limited Computational Budget


In some cases, the computational budget available to a practitioner is too small to perform such a large-scale hyperparameter search. Instead, one can tune the range of hyperparameters on one data set and interpolate the good hyperparameter ranges for other data sets. We now consider this setting, in which we allow only 50 samples from a set of narrow ranges, which were selected based on the wide hyperparameter search on the FASHION-MNIST data set. We report the narrow hyperparameter ranges in Appendix A. Figure 15 shows the variance of FID per model, where the hyperparameters were selected from the narrow ranges. From the practical point of view, there are significant differences between the models: in some cases the hyperparameter ranges transfer from one data set to the others (e.g. NS GAN), while others are more sensitive to this choice (e.g. WGAN). We note that better scores can be obtained by a wider hyperparameter search. These results support the conclusion that discussing the best score obtained by a model on a data set is not a meaningful way to distinguish between these models. One should instead discuss the distribution of the obtained scores.


6.4. Robustness to Random Initialization


For a fixed model, hyperparameters, training algorithm, and order in which the data is presented to the model, one would expect similar model performance. To test this hypothesis, we re-train the best models from the limited hyperparameter range considered in the previous section, while changing the initial weights of the generator and discriminator networks (i.e. by varying a random seed). Table 3 and Figure 16 show the results for each data set. Most models are relatively robust to random initialization, except LSGAN, even though for all of them the variance is significant and should be taken into account when comparing models.


6.5. Precision, recall, and F1


We perform a search over the wide range of hyperparameters and compute precision and recall. In particular, we compute the precision of the model as the fraction of generated samples whose distance to the manifold lies below a threshold δ. We then consider n samples from the test set, invert each sample x to compute z*(x), and compute the squared Euclidean distance between x and G(z*(x)). We define the recall as the fraction of samples with squared Euclidean distance below δ. Figure 6 shows the results, where we select the best F1 score for a fixed model and hyperparameters and vary the budget. We observe that even for this seemingly simple task, many models struggle to achieve a high F1 score. Analogous plots where we instead maximize precision or recall for various thresholds are presented in Appendix E.
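
A sketch of the final bookkeeping for precision, recall and F1. The two arrays are random stand-ins: in the real evaluation the first would come from the distance-to-manifold computation of Section 5.2 applied to generated samples, and the second from the reconstruction errors of inverted test samples.

```python
import numpy as np

rng = np.random.default_rng(5)
delta = 0.05                                  # squared-distance threshold

# Stand-ins for the two distance computations described in the text.
gen_dist_to_manifold = rng.exponential(0.04, size=10000)    # d(G(z), manifold) per generated sample
test_reconstruction_err = rng.exponential(0.06, size=1000)  # ||x - G(z*(x))||^2 per test sample

precision = float(np.mean(gen_dist_to_manifold < delta))
recall = float(np.mean(test_reconstruction_err < delta))
f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```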


Figure 6: How does the F1 score vary with the computational budget? The plot shows the distribution of the maximum F1 score achievable for a fixed budget with a 95% confidence interval. For each budget, we estimate the mean and confidence interval (of the mean) using 5000 bootstrap resamples out of 100 runs. When optimizing for the F1 score, both NS GAN and WGAN enjoy high precision and recall. The underwhelming performance of BEGAN and VAE on this particular data set merits further investigation.


7. Conclusion & Open Problems


In this paper we have started a discussion on how to neutrally and fairly compare GANs. We focus on two sets of evaluation metrics: (i) the Fréchet Inception Distance, and (ii) precision, recall and F1 score. We provide empirical evidence that FID is a reasonable metric due to its robustness with respect to mode dropping and encoding network choices.


Comparison based on FID. Our main insight is that to compare models it is meaningless to report the minimum FID achieved. Instead, distributions of the FID for a fixed computational budget should be compared. Indeed, the empirical evidence presented herein implies that algorithmic differences in state-of-the-art GANs become less relevant as the computational budget increases. Furthermore, given a limited budget (say a month of compute time), a “good” algorithm might be outperformed by a “bad” algorithm.


Comparison based on precision, recall and F1 score. Our simple triangle data set allows us to compute well-understood precision and recall metrics, and consequently the F1 score. We observe that even for this seemingly simple task, many models struggle to achieve a high F1 score. When optimizing for the F1 score, both NS GAN and WGAN enjoy both high precision and recall. Other models, such as DRAGAN and WGAN GP, fail to reach high recall values. Finally, we observe that it is possible to achieve high precision and high recall on this task (cf. Appendix E).


Comparison with respect to the original GAN. While many algorithms have claimed superiority over the original GAN model [8], we found no empirical evidence which supports such claims across all data sets. In fact, the NS GAN performs on par with most other models and achieves the best overall FID on MNIST. Furthermore, it outperforms other models in terms of the F1 score on TRIANGLES.


Open problems. It remains to be examined whether FID is stable under a more radical change of the encoding, e.g. using a network trained on a different task. Also, FID cannot detect overfitting to the training data set, and an algorithm that just remembers all the training examples would perform very well. Finally, FID can probably be “fooled” by artifacts that are not detected by the embedding network.


The triangles data set can be made progressively more complex by: (i) introducing multiple convex polygons at once, (ii) providing color or texture inside the polygon, and (iii) gradually increasing the resolution. While the performance of existing models might be improved given a bigger computational budget and larger model capacity, we argue that algorithmic improvements should drive better performance. Having such a series of tasks of increasing complexity should greatly benefit the research community.


As discussed in Section 4, many dimensions have to be taken into account when comparing different models, and this work only explores a subset of the options. We cannot exclude the possibility that some models significantly outperform others under currently unexplored conditions.


Finally, this work strongly suggests that future GAN research should be more experimentally systematic and that models should be compared on neutral ground.


Source: http://tongtianta.site/paper/3092
Edited by Lornatang
Proofread by Lornatang
