Going Deeper with Convolutions
Paper: http://arxiv.org/pdf/1409.4842v1.pdf
Abstract
We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
1 Introduction
In the last three years, mainly due to the advances of deep learning, more concretely convolutional networks [10], the quality of image recognition and object detection has been progressing at a dramatic pace. One encouraging piece of news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12× fewer parameters than the winning architecture of Krizhevsky et al [9] from two years ago, while being significantly more accurate. The biggest gains in object detection have not come from the utilization of deep networks alone or bigger models, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al [6].
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up being a purely academic curiosity, but could be put to real-world use, even on large datasets, at a reasonable cost.
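To make the 1.5-billion multiply-add budget concrete, the sketch below counts the approximate multiply-adds of a single convolutional layer; this is a generic back-of-the-envelope estimate, not a calculation from the paper, and the layer shape used in the example is hypothetical.

```python
def conv_multiply_adds(h_out, w_out, c_in, c_out, k):
    """Approximate multiply-adds of one k x k convolution producing a
    c_out x h_out x w_out output from c_in input channels (biases ignored)."""
    return h_out * w_out * c_out * c_in * k * k

# Hypothetical layer: a 3x3 convolution with 192 input and 256 output channels
# on a 28x28 feature map already costs about 0.35 billion multiply-adds,
# a sizeable slice of a 1.5-billion inference budget.
print(conv_multiply_adds(28, 28, 192, 256, 3) / 1e9)  # ~0.35
```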
In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word “deep” is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the “Inception module” and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, on which it significantly outperforms the current state of the art.
2 Related Work
Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure – stacked convolutional layers (optionally followed by contrast normalization and maxpooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to-date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21]. For larger datasets such as Imagenet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.
Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19]. Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] use a series of fixed Gabor filters of different sizes in order to handle multiple scales, similarly to the Inception model. However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception model are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.
Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. When applied to convolutional layers, the method could be viewed as additional 1×1 convolutional layers followed typically by the rectified linear activation [9]. This enables it to be easily integrated in the current CNN pipelines. We use this approach heavily in our architecture. However, in our setting, 1×1 convolutions have a dual purpose: most critically, they are used mainly as dimension-reduction modules to remove computational bottlenecks that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without a significant performance penalty.
The current leading approach for object detection is the Regions with Convolutional Neural Networks (R-CNN) method proposed by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: to first utilize low-level cues such as color and superpixel consistency for potential object proposals in a category-agnostic fashion, and to then use CNN classifiers to identify object categories at those locations. Such a two-stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the highly powerful classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.
3 Motivation and High Level Considerations
The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth – the number of levels – of the network and its width: the number of units at each level. This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.
Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This can become a major bottleneck, since the creation of high-quality training sets can be tricky and expensive, especially if expert human raters are necessary to distinguish between fine-grained visual categories like those in ImageNet (even in the 1000-class ILSVRC subset), as demonstrated by Figure 1.
Figure 1: Two distinct classes from the 1000 classes of the ILSVRC 2014 classification challenge: (a) Siberian husky, (b) Eskimo dog.
Another drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up to be close to zero), then a lot of computation is wasted. Since in practice the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of results.
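As a back-of-the-envelope illustration of this quadratic effect (the layer sizes below are hypothetical, not taken from the paper): when the filter counts of two chained 3×3 convolutional layers are both scaled by a factor s, the cost of the second layer grows by s², since both its input and output channel counts grow.

```python
def chained_conv_cost(h, w, c_in, c_mid, c_out, k=3):
    """Multiply-adds of two chained k x k convolutions c_in -> c_mid -> c_out
    on an h x w feature map (stride 1, 'same' padding)."""
    first = h * w * c_mid * c_in * k * k
    second = h * w * c_out * c_mid * k * k
    return first, second

f1, s1 = chained_conv_cost(28, 28, 64, 128, 128)
f2, s2 = chained_conv_cost(28, 28, 64, 256, 256)  # both filter counts doubled
print(s2 / s1)                # 4.0: the second layer's cost grows quadratically
print((f2 + s2) / (f1 + s1))  # ~3.3x overall for this particular example
```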
The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well known Hebbian principle – neurons that fire together, wire together – suggests that the underlying idea is applicable even under less strict conditions, in practice.
On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses is so dominant that switching to sparse matrices would not pay off. The gap is widened even further by the use of steadily improving, highly tuned numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision-oriented machine learning systems utilize sparsity in the spatial domain just by virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning; the trend changed back to full connections with [9] in order to better optimize parallel computing. The uniformity of the structure, a large number of filters and a greater batch size allow for utilizing efficient dense computation.
This raises the question whether there is any hope for a next, intermediate step: an architecture that makes use of the extra sparsity, even at filter level, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.
The Inception architecture started out as a case study of the first author for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, after only two iterations on the exact choice of topology we could already see modest gains over the reference architecture based on [12]. After further tuning of learning rate, hyperparameters and improved training methodology, we established that the resulting Inception architecture was especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly, they turned out to be at least locally optimal.
One must be cautious though: although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have led to its construction. Making sure would require much more thorough analysis and verification: for example, checking whether automated tools based on the principles described below would find similar, but better, topologies for vision networks. The most convincing proof would be if an automated system would create network topologies resulting in similar gains in other domains using the same algorithm but with a very differently looking global architecture. At the very least, the initial success of the Inception architecture yields firm motivation for exciting future work in this direction.
4 Architectural Details
The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggest a layer-by-layer construction in which one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from the earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. This means we would end up with a lot of clusters concentrated in a single region, and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; however, this decision was based more on convenience than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current state-of-the-art convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have an additional beneficial effect, too (see Figure 2(a)).
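The following is a minimal PyTorch sketch of the naïve module of Figure 2(a): parallel 1×1, 3×3 and 5×5 convolutions plus a 3×3 max-pooling path, concatenated along the channel dimension. The class name and channel counts are illustrative choices, not specified by the paper.

```python
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    """Naive Inception module (Figure 2(a))."""

    def __init__(self, in_ch, out_1x1, out_3x3, out_5x5):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, out_1x1, kernel_size=1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, out_3x3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, out_5x5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        # The pooling path passes all in_ch channels straight through.
        self.branch_pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# Hypothetical usage: output channels = 64 + 128 + 32 + 192 = 416,
# so the channel count inevitably grows from stage to stage.
y = NaiveInception(192, 64, 128, 32)(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 416, 28, 28])
```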
As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease, suggesting that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.
One big problem with the above modules, at least in this naïve form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: their number of output filters equals the number of filters in the previous stage. The merging of the output of the pooling layer with the outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. Even while this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow-up within a few stages.
Figure 2: Inception module. (a) Inception module, naïve version; (b) Inception module with dimension reductions.
This leads to the second idea of the proposed architecture: judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low-dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form, and compressed information is harder to model. We would like to keep our representation sparse at most places (as required by the conditions of [2]) and compress the signals only whenever they have to be aggregated en masse. That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation, which makes them dual-purpose. The final result is depicted in Figure 2(b).
In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.
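Below is a minimal PyTorch sketch of the module of Figure 2(b), with 1×1 reductions placed before the 3×3 and 5×5 convolutions and a 1×1 projection after the built-in max-pooling; the helper and class names are ours, and the parameterization is only one reasonable reading of the figure.

```python
import torch
import torch.nn as nn

def conv_relu(in_ch, out_ch, **kwargs):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, **kwargs), nn.ReLU(inplace=True))

class InceptionModule(nn.Module):
    """Inception module with dimension reductions (Figure 2(b))."""

    def __init__(self, in_ch, out_1x1, red_3x3, out_3x3, red_5x5, out_5x5, pool_proj):
        super().__init__()
        self.branch1 = conv_relu(in_ch, out_1x1, kernel_size=1)
        self.branch3 = nn.Sequential(
            conv_relu(in_ch, red_3x3, kernel_size=1),               # 1x1 reduction
            conv_relu(red_3x3, out_3x3, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(
            conv_relu(in_ch, red_5x5, kernel_size=1),               # 1x1 reduction
            conv_relu(red_5x5, out_5x5, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            conv_relu(in_ch, pool_proj, kernel_size=1))             # 1x1 projection

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)
```

Stacking such modules, with an occasional stride-2 max-pooling layer between groups to halve the grid resolution, gives the Inception network described above.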
One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.
The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. Another way to utilize the Inception architecture is to create slightly inferior, but computationally cheaper versions of it. We have found that all the included knobs and levers allow for a controlled balancing of computational resources that can result in networks that are 2-3× faster than similarly performing networks with non-Inception architecture; however, this requires careful manual design at this point.
5 GoogLeNet
We chose GoogLeNet as our team name in the ILSVRC14 competition. This name is an homage to Yann LeCun's pioneering LeNet-5 network [10]. We also use GoogLeNet to refer to the particular incarnation of the Inception architecture used in our submission for the competition. We have also used a deeper and wider Inception network, the quality of which was slightly inferior, but adding it to the ensemble seemed to improve the results marginally. We omit the details of that network, since our experiments have shown that the influence of the exact architectural parameters is relatively minor. Here, the most successful particular instance (named GoogLeNet) is described in Table 1 for demonstrational purposes. The exact same topology (trained with different sampling methods) was used for 6 out of the 7 models in our ensemble.
Table 1: GoogLeNet incarnation of the Inception architecture
All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224, taking RGB color channels with mean subtraction. “#3×3 reduce” and “#5×5 reduce” stand for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.
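To connect the column names to the module sketched after Figure 2(b): “#1×1”, “#3×3 reduce”, “#3×3”, “#5×5 reduce”, “#5×5” and “pool proj” are exactly the per-branch filter counts. As an illustration, the values below are those reported in the paper's Table 1 for the first Inception block, inception(3a); the snippet reuses the hypothetical InceptionModule class from that earlier sketch.

```python
# inception(3a): 192 input channels ->
#   64 (#1x1) + 128 (#3x3) + 32 (#5x5) + 32 (pool proj) = 256 output channels
inception_3a = InceptionModule(
    in_ch=192,
    out_1x1=64,
    red_3x3=96, out_3x3=128,
    red_5x5=16, out_5x5=32,
    pool_proj=32)
```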
Source: http://tongtianta.site/paper/237
Edited by Lornatang
Proofread by Lornatang