A Comprehensive Study on Deep Le

2019-10-16 慎独墨写

Abstract

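1. This paper studies the characteristics of bugs in software that uses deep learning libraries.

The libraries covered are Caffe, Keras, Tensorflow, Theano, and Torch.

2. Data sources: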

2716 high-quality posts from Stack Overflow
500 bug fix commits from Github

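All were drawn from posts and commits related to the libraries above.

3. Bug characteristics studied:

Types of bugs
Root causes of bugs
Impacts of bugs
Stages of the deep learning pipeline that are prone to bugs
Antipatterns in buggy software

In software engineering, an anti-pattern (or antipattern) is a design pattern that is clearly used in practice but is inefficient or leaves room for improvement: a common yet poor approach to solving a recurring problem.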

Introduction

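1. Bugs related to deep learning libraries generally fall into two categories:

Bugs in the DL library itself
Bugs made when using a DL library

This paper focuses on bugs made when using the libraries.

2. Each deep learning library serves a different purpose: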

Tensorflow : low-level, highly configurable facilities
Keras: aims to provide high-level abstractions hiding the low-level details
Theano and Torch : focus on easing the use of GPU computing

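3. What each data source shows

Stack Overflow: the problems developers run into when using DL libraries
Github: how bugs in open-source software are found and fixed

Research questions (RQs)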

RQ1: (Bug Type) What type of bugs are more frequent?
RQ2: (Root cause) What are the root causes of bugs?
RQ3: (Bug Impact) What are the frequent impacts of bugs?
RQ4: (Bug prone stages) Which deep learning pipeline stages are more vulnerable to bugs?
RQ5: (Commonality) Do the bugs follow a common pattern?
RQ6: (Bug evolution) How did the bug pattern change over time?

Methodology

Data collection

Criterion for a high-quality post: the difference between its number of upvotes and its number of downvotes is greater than 5.

The post must also contain code.
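The two selection criteria can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Post` class, its fields, and the idea that code appears inside `<code>` tags are hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical, simplified view of a Stack Overflow post."""
    upvotes: int
    downvotes: int
    body: str  # raw HTML body of the post


# Assumption: code in a post body is wrapped in <code> tags.
CODE_TAG = re.compile(r"<code>.*?</code>", re.DOTALL)


def is_high_quality(post: Post, min_score: int = 5) -> bool:
    """Keep posts whose score is greater than min_score and that contain code."""
    score = post.upvotes - post.downvotes
    return score > min_score and CODE_TAG.search(post.body) is not None


posts = [
    Post(upvotes=12, downvotes=1, body="<p>Shape error</p><code>model.fit(x, y)</code>"),
    Post(upvotes=6, downvotes=1, body="<code>import keras</code>"),  # score is 5, not > 5
    Post(upvotes=40, downvotes=2, body="<p>No code here</p>"),       # no code
]
selected = [p for p in posts if is_high_quality(p)]
```

Only the first post passes both checks; the second has a score of exactly 5 and the third contains no code.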

Classification

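The paper carries many concepts from traditional software engineering over to AI testing and gives them fairly new meanings.

This part of the paper describes how bug types and root causes are classified.

The seven-stage deep learning pipeline: the stages are data collection, data preparation, choice of model, training, evaluation, hyperparameter tuning, and prediction.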

Labeling the bugs

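The labels assigned by the two authors need to be checked for agreement.

Cohen's Kappa coefficient is used as the agreement measure.

Agreement was low at first and rose after the raters were trained.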

Cohen's Kappa Coefficient

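In data analysis we often need to test agreement: whether different models or analysis methods agree in their predictions, whether a model's output agrees with the actual outcome, and so on. Agreement tests are also widely used in clinical trials. Testing agreement between the diagnoses that two or more clinicians give for the same patient is called interrater reliability; testing agreement among repeated diagnoses by the same clinician is called intrarater reliability.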
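For two raters, Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch follows; the label lists are made up for illustration and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for everything
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two raters for six bug reports.
rater_1 = ["API", "Data", "Data", "Structure", "API", "Data"]
rater_2 = ["API", "Data", "Structure", "Structure", "API", "API"]
kappa = cohen_kappa(rater_1, rater_2)
```

Here the raters agree on 4 of 6 items (p_o = 24/36) while chance agreement is p_e = 11/36, so kappa = 13/25 = 0.52.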

Bug Types for DL

1.API bug

The prime causes for triggering of deep learning API bugs can be because of the change of API definition with different versions, lack of inter-API compatibility and sometimes wrong or confused documentation.

In short: API changes across versions, lack of compatibility, and unclear documentation.

2.Coding bug

Programming mistakes such as syntax errors, which usually lead to wrong results or runtime errors.

3.Data bug

Caused by input data that is malformed or not properly cleaned; these bugs occur before the data is fed into the model.
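Data bugs of this kind can often be caught with a check that runs before training. The sketch below is a minimal illustration rather than anything prescribed by the paper; `find_bad_rows` and the sample records are hypothetical.

```python
import math


def find_bad_rows(rows, expected_width):
    """Return indices of rows that are malformed or contain unusable values."""
    bad = []
    for i, row in enumerate(rows):
        wrong_width = len(row) != expected_width
        not_numeric = any(not isinstance(v, (int, float)) for v in row)
        # Only test for NaN/inf once every value is known to be a number.
        non_finite = not not_numeric and any(not math.isfinite(v) for v in row)
        if wrong_width or not_numeric or non_finite:
            bad.append(i)
    return bad


samples = [
    [0.1, 0.2, 0.3],
    [0.4, float("nan"), 0.6],  # missing value encoded as NaN
    [0.7, 0.8],                # truncated record
    [1.0, "2.0", 3.0],         # string that was never parsed
]
bad_rows = find_bad_rows(samples, expected_width=3)
```

Rows 1, 2, and 3 are reported, so they can be repaired or dropped before the data reaches the model.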

4.Structure bug

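The vast majority of deep learning bugs are caused by an incorrect definition of the model's structure. These include size mismatches between layers of the model, anomalies between the training and test datasets, and use of an incorrect data structure when implementing a particular function.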
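A size mismatch between layers is the easiest of these to picture. The following sketch checks a stack of fully connected layers before any training happens; the tuple-based model description and the function name are hypothetical stand-ins for a real model definition.

```python
def check_layer_sizes(input_size, layers):
    """Verify that each layer's input size equals the previous output size.

    `layers` is a list of (name, in_features, out_features) tuples.
    """
    current = input_size
    for name, in_features, out_features in layers:
        if in_features != current:
            raise ValueError(
                f"layer '{name}' expects {in_features} inputs but receives {current}"
            )
        current = out_features
    return current  # size of the final output


good_model = [("fc1", 784, 128), ("fc2", 128, 10)]
bad_model = [("fc1", 784, 128), ("fc2", 256, 10)]  # fc2 does not match fc1

output_size = check_layer_sizes(784, good_model)
try:
    check_layer_sizes(784, bad_model)
    mismatch = None
except ValueError as err:
    mismatch = str(err)
```

The well-formed model yields an output size of 10, while the second one is rejected with a message naming the offending layer.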

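Control and Sequence bug

In many cases the model does not execute as expected because of an incorrect if-else or loop.

Data Flow bug

A bug that occurs after the input has been fed to the deep learning model, because the type or shape of the input data does not match, is called a data flow bug.

Initialization Bug

In deep learning, an initialization bug means that a parameter or function was not properly initialized before use.

Logic Bug

Usually caused by a faulty model structure. These bugs typically arise when the code lacks proper guard conditions, or when it tries to implement functionality that the given model structure cannot support.

Processing Bug

The data types of each layer must follow the contracts between layers. Processing bugs occur when these contracts are violated or the wrong algorithm is chosen.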

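5.Non Model Structural Bug (NMSB)

Bugs that occur outside the model. They are similar to structure bugs, but arise in the parts of the code outside modeling.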

Control and Sequence bug
Initialization Bug
Logic Bug
Processing Bug

Root causes for bugs

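Absence of inter API compatibility: the libraries are incompatible with each other, e.g. NumPy functions cannot be used directly in Keras.
Absence of type checking: type mismatches, which are especially common when calling an API.
API Change: the API changes between versions.
API Misuse: an API is used incorrectly.
Confusion with Computation Model: the user mixes up the phases of the computation model.
Incorrect Model Parameter or Structure (IPS): errors in building the model, such as a wrong structure or wrong parameters.
Others: errors unrelated to the model.
Structure Inefficiency (SI): a problem in the model structure leads to poor model performance. SI causes bad performance, whereas IPS causes a crash.
Unaligned Tensor (UT): the dimensions of tensors do not match.
Wrong Documentation: bugs caused by incorrect documentation.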
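To illustrate the "absence of type checking" root cause, here is a small plain-Python sketch (no DL library involved; both functions are hypothetical). Without a check, a value of the wrong type can pass through silently and only surface later as a wrong result.

```python
def scale(values, factor):
    """Multiply every element by factor, with no type checking."""
    return [v * factor for v in values]


def scale_checked(values, factor):
    """Same as scale, but reject non-numeric elements up front."""
    for v in values:
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise TypeError(f"expected a number, got {type(v).__name__}: {v!r}")
    return [v * factor for v in values]


raw = [1.5, "2", 3]     # "2" was read from a file and never converted
silent = scale(raw, 2)  # no error: the string is repeated instead of doubled
try:
    scale_checked(raw, 2)
    error = None
except TypeError as err:
    error = str(err)
```

The unchecked version returns `[3.0, "22", 6]` without complaint, while the checked version fails immediately at the call site.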

Effects of bugs

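Bad performance: training yields poor results.
Crash: the program crashes.
Data Corruption: data is corrupted while being used.
Hang: training runs for a long time without any improvement in accuracy.
Incorrect Functionality: the program behaves incorrectly.
Memory out of bound: memory resources are exhausted.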

FREQUENT BUG TYPES

Data bugs

Statistics of Bug Types in Stack Overflow and Github

Finding 1: Data Bugs appear more than 26% of the time.

Data bugs are common in the preprocessing stage.

Structural Logic Bugs

Finding 2: Caffe has 43% Structural Logic Bugs.

Most Caffe bugs occur while building the model structure.

API Bugs

Finding 3: API Bugs: Torch, Keras, Tensorflow have 16%, 11% and 11% API bugs respectively.

API bugs are fairly widespread, and are comparatively more severe in Keras and Tensorflow.

Bugs in Github

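Finding 4: All the bug types have a similar pattern in Github and Stack Overflow for all the libraries.

For each bug type, a t-test was run on its distribution across the five libraries in Github versus Stack Overflow. For every bug type except NMSLB, the P value is greater than 5%, which indicates that the distributions are similar.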
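The comparison can be sketched as a two-sample t-test over five per-library values. The notes do not say which t-test variant the paper used, so Welch's version is assumed here, and the percentages are invented for illustration. Only the statistic and degrees of freedom are computed; turning them into a P value requires the t distribution, which a statistics package would provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors, built from the sample variances.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical share (%) of one bug type in each of the five libraries.
stack_overflow = [20.0, 25.0, 30.0, 22.0, 28.0]
github = [22.0, 24.0, 31.0, 20.0, 29.0]
t_stat, df = welch_t(stack_overflow, github)
```

A statistic this close to zero corresponds to a large P value, which is the situation the notes describe as "similar distributions".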

ROOT CAUSE

Statistics of the Root Causes of Bugs

IPS

Has the most severe consequences.

Finding 5: IPS is the most malicious root cause resulting in average 24% of the bugs across the libraries.

SI

Finding 6: Keras, Caffe have 25% and 37% bugs that are resulted from SI.

SI problems mostly affect non-functional requirements, with a large impact on quality of service (QoS).

UT

Finding 7: Torch has 28% of the bugs due to UT.

Absence of Type checking

Finding 8: Theano has 30% of the bugs due to the absence of type checking.

API Change

Finding 9: Tensorflow and Keras have 9% and 7% bugs due to API change.

Root Causes in Github data

Finding 10: Except API Misuse all other root causes have similar patterns in both Github and Stack Overflow root causes of bugs.

Relation of Root Cause with Bug Type

Finding 11: SI contributes 3% - 52% and IPS contributes 24% - 62% of the bugs related to model.

IMPACTS FROM BUGS

Effects of Bugs in Stack Overflow and Github

Crash

The most significant impact.

Finding 12: In average more than 66% of the bugs cause crash of the programs.

Bad Performance

Finding 13: In Caffe, Keras, Tensorflow, Theano, Torch 31%, 16%, 8%, 11%, and 8% bugs caused bad performance respectively.

Incorrect Functionality

Finding 14: 12% of the bugs in average in the libraries cause Incorrect Functionality.

Effects of Bugs in Github

Finding 15: For all the libraries the P value for Stack Overflow and Github bug effects reject the null hypothesis to confirm that the bugs have similar effects from Stack Overflow as well as Github bugs.

In other words, for every library the Stack Overflow and Github data show similar bug effects.

DIFFICULT DEEP LEARNING STAGES

Data Preparation

Finding 16: 32% of the bugs are in the data preparation stage of the deep learning pipeline.

The largest share.

Training stage

Finding 17: 27% of the bugs are seen during the training stage.

Many IPS and SI bugs originate in this stage.

Choice of model

Finding 18: Choice of model stage shows 23% of the bugs.

IPS, SI, and UT bugs all arise in this stage.

COMMONALITY OF BUG

Correlation of Bug Types among the libraries

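Several of the libraries are strongly correlated in their bug type distributions, with correlation coefficients close to 1. Torch, by contrast, correlates only weakly with the other libraries.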
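A correlation of this kind can be computed between the per-bug-type counts of two libraries. The sketch below uses the Pearson coefficient, which is an assumption since the notes do not name the measure, and the counts are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical bug counts per bug type, one list per library.
library_a = [10, 20, 30, 40]
library_b = [12, 22, 29, 41]
library_c = [40, 10, 35, 15]

r_ab = pearson(library_a, library_b)  # similar profiles, close to 1
r_ac = pearson(library_a, library_c)  # dissimilar profiles
```

Libraries A and B have nearly the same profile and correlate above 0.99, while A and C correlate at roughly -0.44.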

Method: 30 posts containing code were randomly sampled from each library and examined for antipatterns that might explain the strong correlation of bug types.

Distribution of different antipatterns

Finding 19: Tensorflow and Caffe have a similar distribution of antipatterns while Torch has different distributions of antipatterns.

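The distributions for Tensorflow and Caffe are similar, while Torch's is markedly different.

In Tensorflow and Caffe, more than 30% of the antipatterns are Input Kludge. In Torch, on the other hand, 40% of the bugs are due to Cut-and-Paste Programming. This helps explain the correlation results.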

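Types of antipattern:

Continuous Obsolescence: use of deprecated APIs

Cut-and-Paste Programming: programming by copying and pasting code

Dead Code: code that is never used

Golden Hammer: developers and managers are comfortable with an existing approach and unwilling to learn and apply a more suitable one

Input Kludge: unexpected input is not handled

Mushroom Management: in some cases developers must make assumptions to get their work done, which can lead to pseudo-analysis, that is, object-oriented analysis carried out without the involvement of end users. Some mushroom management projects skip analysis entirely and go straight from high-level requirements to design and coding.

Spaghetti Code: the software structure lacks clarity.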

EVOLUTION OF BUGS

Positive growth of Structural Logic Bugs

Finding 20: In Keras, Caffe, Tensorflow Structural logic bugs are showing increasing trend.

Structural logic bugs are on the rise in Keras, Caffe, and Tensorflow.

Decreasing trend of Data Bugs

Finding 21: Data Bugs slowly decreased since 2015 except Torch.

Data bugs are declining in every library except Torch.

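Thoughts

1. The paper redefines bug type and root cause for the DL domain, and summarizes the existing categories of bug impact.

2. The seven-stage deep learning pipeline: the stages are data collection, data preparation, choice of model, training, evaluation, hyperparameter tuning, and prediction.

It can serve as a dimension for analysis.

3. The bug distributions of Tensorflow and Caffe are similar. If Tensorflow and Caffe behave differently from Torch in future experiments, this finding can help explain the difference.

4. This kind of analysis can be transferred to specific domains (autonomous driving, NLP, etc.).

5. Detection and repair of IPS and SI bugs (as specific bug types).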