Very Deep Convolutional Networks for Large-Scale Image Recognition
4 CLASSIFICATION EXPERIMENTS
Dataset. In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for ILSVRC 2012–2014 challenges). The dataset includes images of 1000 classes, and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). The classification performance is evaluated using two measures: the top-1 and top-5 error. The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories.
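To make the two error measures concrete, here is a minimal sketch (ours, in plain NumPy; not part of the original paper) of computing top-k error from a matrix of class scores; the array names and shapes are illustrative.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of images whose ground-truth class is outside the k
    highest-scoring predicted categories."""
    # Indices of the k largest scores per row (order within the top k
    # does not matter for the error measure).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# Illustrative shapes: 50K validation images, 1000 classes.
scores = np.random.randn(50000, 1000)
labels = np.random.randint(0, 1000, size=50000)
print(top_k_error(scores, labels, 1))  # top-1 error
print(top_k_error(scores, labels, 5))  # top-5 error
```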
For the majority of experiments, we used the validation set as the test set. Certain experiments were also carried out on the test set and submitted to the official ILSVRC server as a “VGG” team entry to the ILSVRC-2014 competition (Russakovsky et al., 2014).
4.1 SINGLE SCALE EVALUATION
We begin with evaluating the performance of individual ConvNet models at a single scale with the layer configurations described in Sect. 2.2. The test image size was set as follows: Q = S for fixed S, and Q = 0.5(S_min + S_max) for jittered S ∈ [S_min; S_max]. The results are shown in Table 3.

[Table 3: ConvNet performance at a single test scale.]
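As a quick illustration of this rule (our sketch, not the authors' code):

```python
def single_test_scale(s_min, s_max=None):
    """Test scale Q for single-scale evaluation: Q = S for a fixed
    training scale S, and Q = 0.5 * (S_min + S_max) for a network
    trained with jittered S in [S_min, S_max]."""
    if s_max is None:  # fixed training scale
        return s_min
    return (s_min + s_max) // 2

print(single_test_scale(256))       # 256
print(single_test_scale(384))       # 384
print(single_test_scale(256, 512))  # 384
```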
First, we note that using local response normalisation (A-LRN network) does not improve on the model A without any normalisation layers. We thus do not employ normalisation in the deeper architectures (B–E).
Second, we observe that the classification error decreases with the increased ConvNet depth: from 11 layers in A to 19 layers in E. Notably, in spite of the same depth, the configuration C (which contains three 1×1 conv. layers) performs worse than the configuration D, which uses 3×3 conv. layers throughout the network. This indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C). The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets. We also compared the net B with a shallow net with five 5×5 conv. layers, derived from B by replacing each pair of 3×3 conv. layers with a single 5×5 conv. layer (which has the same receptive field, as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters.
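The receptive-field equivalence behind this comparison (Sect. 2.3) is easy to verify; the sketch below (ours) checks that two stacked 3×3 conv. layers cover the same 5×5 window as a single 5×5 layer while using fewer weights, with C = 64 channels as an illustrative choice.

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 conv. layers."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([3, 3]))  # 5 -- same window as a single 5x5 layer
print(receptive_field([5]))     # 5

# Weight counts for C input and C output channels (biases ignored):
C = 64
print(2 * 3 * 3 * C * C)  # 73728  (two 3x3 layers)
print(5 * 5 * C * C)      # 102400 (one 5x5 layer)
```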
Finally, scale jittering at training time (S ∈ [256; 512]) leads to significantly better results than training on images with a fixed smallest side (S = 256 or S = 384), even though a single scale is used at test time. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.
4.2 MULTI-SCALE EVALUATION
Having evaluated the ConvNet models at a single scale, we now assess the effect of scale jittering at test time. It consists of running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. Considering that a large discrepancy between training and testing scales leads to a drop in performance, the models trained with fixed S were evaluated over three test image sizes, close to the training one: Q = {S − 32, S, S + 32}. At the same time, scale jittering at training time allows the network to be applied to a wider range of scales at test time, so the model trained with variable S ∈ [S_min; S_max] was evaluated over a larger range of sizes Q = {S_min, 0.5(S_min + S_max), S_max}.
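A minimal sketch of this test-time procedure (assuming `model` accepts variable-sized inputs and returns class logits, as a fully-convolutional net with the dense evaluation of Sect. 3.2 would; the scale values are the ones quoted above):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_posteriors(model, image, scales):
    """Average the class posteriors over several rescaled copies of one
    test image. `image` is a (3, H, W) tensor; `scales` are values of Q
    (the smallest image side after rescaling)."""
    posteriors = []
    for q in scales:
        h, w = image.shape[1:]
        ratio = q / min(h, w)  # isotropic rescaling to smallest side Q
        rescaled = F.interpolate(image[None], scale_factor=ratio,
                                 mode='bilinear', align_corners=False)
        posteriors.append(F.softmax(model(rescaled), dim=1))
    return torch.stack(posteriors).mean(dim=0)

# Fixed S = 384:            Q in {352, 384, 416}
# Jittered S in [256, 512]: Q in {256, 384, 512}
# probs = multi_scale_posteriors(model, image, [256, 384, 512])
```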
The results, presented in Table 4, indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale, shown in Table 3). As before, the deepest configurations (D and E) perform the best, and scale jittering is better than training with a fixed smallest side S. Our best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error (highlighted in bold in Table 4). On the test set, the configuration E achieves 7.3% top-5 error.
[Table 4: ConvNet performance at multiple test scales.]
4.3 MULTI-CROP EVALUATION
In Table 5 we compare dense ConvNet evaluation with multi-crop evaluation (see Sect. 3.2 for details). We also assess the complementarity of the two evaluation techniques by averaging their softmax outputs. As can be seen, using multiple crops performs slightly better than dense evaluation, and the two approaches are indeed complementary, as their combination outperforms each of them. As noted above, we hypothesize that this is due to a different treatment of convolution boundary conditions.
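As an illustration of the multi-crop side and of the combination (a simplified sketch: we use torchvision's 10-crop transform at a single scale in place of the paper's 50 crops per scale, and `dense_probs` stands in for the dense-evaluation posterior):

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import ten_crop

@torch.no_grad()
def multi_crop_posteriors(model, image, crop_size=224):
    """Average class posteriors over 10 crops of one rescaled test image
    (4 corners + centre, each with its horizontal flip)."""
    crops = torch.stack(ten_crop(image, crop_size))  # (10, 3, 224, 224)
    return F.softmax(model(crops), dim=1).mean(dim=0, keepdim=True)

# Complementarity: average the posteriors of the two techniques.
# combined = 0.5 * (dense_probs + multi_crop_posteriors(model, image))
```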
[Table 5: ConvNet evaluation techniques comparison.]
4.4 CONVNET FUSION
Up until now, we evaluated the performance of individual ConvNet models. In this part of the experiments, we combine the outputs of several models by averaging their soft-max class posteriors. This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
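A sketch of this fusion (assuming `models` is a list of trained networks that each return logits for the same preprocessed batch; illustrative, not the original implementation):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_posteriors(models, batch):
    """Combine several ConvNets by averaging their soft-max class
    posteriors over a batch of images."""
    probs = [F.softmax(m(batch), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0)  # (N, 1000)

# e.g. the post-submission two-model ensemble (configurations D and E):
# preds = ensemble_posteriors([model_D, model_E], images).argmax(dim=1)
```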
The results are shown in Table 6. By the time of ILSVRC submission we had only trained the single-scale networks, as well as a multi-scale model D (by fine-tuning only the fully-connected layers rather than all layers). The resulting ensemble of 7 networks has 7.3% ILSVRC test error. After the submission, we considered an ensemble of only two best-performing multi-scale models (configurations D and E), which reduced the test error to 7.0% using dense evaluation and 6.8% using combined dense and multi-crop evaluation. For reference, our best-performing single model achieves 7.1% error (model E, Table 5).
4.5 COMPARISON WITH THE STATE OF THE ART
Finally, we compare our results with the state of the art in Table 7. In the classification task of the ILSVRC-2014 challenge (Russakovsky et al., 2014), our “VGG” team secured the 2nd place with 7.3% test error using an ensemble of 7 models. After the submission, we decreased the error rate to 6.8% using an ensemble of 2 models.
[Table 7: Comparison with the state of the art in ILSVRC classification.]
5 CONCLUSION
In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth. In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. Our results yet again confirm the importance of depth in visual representations.
ACKNOWLEDGEMENTS
This work was supported by ERC grant VisRec no. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.
REFERENCES
Bell, S., Upchurch, P., Snavely, N., and Bala, K. Material recognition in the wild with the materials in context database. CoRR, abs/1412.0623, 2014.
Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC., 2014.
Cimpoi, M., Maji, S., and Vedaldi, A. Deep convolutional filter banks for texture recognition and segmentation. CoRR, abs/1411.6836, 2014.
Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. Flexible, high performance convolutional neural networks for image classification. In IJCAI, pp. 1237–1242, 2011.
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., and Ng, A. Y. Large scale distributed deep networks. In NIPS, pp. 1232–1240, 2012.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013.
Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C., Winn, J., and Zisserman, A. The Pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In IEEE CVPR Workshop of Generative Model Based Vision, 2004.
Girshick, R. B., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524v5, 2014. Published in Proc. CVPR, 2014.
Gkioxari, G., Girshick, R., and Malik, J. Actions and attributes from wholes and parts. CoRR, abs/1412.2604, 2014.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, volume 9, pp. 249–256, 2010.
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In Proc. ICLR, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR, abs/1406.4729v2, 2014.
Hoai, M. Regularized max pooling for image categorization. In Proc. BMVC., 2014.
Howard, A. G. Some improvements on deep convolutional neural network based image classification. In Proc. ICLR, 2014.
Jia, Y. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306, 2014.
Kiros, R., Salakhutdinov, R., and Zemel, R. S. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014.
Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
Lin, M., Chen, Q., and Yan, S. Network in network. In Proc. ICLR, 2014.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and transferring mid-level image representations using convolutional neural networks. In Proc. CVPR, 2014.
Perronnin, F., Sánchez, J., and Mensink, T. Improving the Fisher kernel for large-scale image classification. In Proc. ECCV, 2010.
Razavian, A., Azizpour, H., Sullivan, J., and Carlsson, S. CNN features off-the-shelf: an astounding baseline for recognition. CoRR, abs/1403.6382, 2014.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. In Proc. ICLR, 2014.
Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. CoRR, abs/1406.2199, 2014. Published in Proc. NIPS, 2014.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
Wei, Y., Xia, W., Huang, J., Ni, B., Dong, J., Zhao, Y., and Yan, S. CNN: Single-label to multi-label. CoRR, abs/1406.5726, 2014.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. Published in Proc. ECCV, 2014.
Source: http://tongtianta.site/paper/122
Editor: Lornatang
Proofreader: Lornatang