Deep Learning | 2 Improving Deep Neural Networks

2018-03-13  shawn233

0 Introduction

Many thanks to Professor Andrew Ng for his deep learning materials. This note is based on his course videos.

If you are a beginner in deep learning, I strongly recommend watching these videos: they are easy to understand and completely free.

The rest of this note is a brief summary of the materials; take a look if you like.

1 Setting Up Your ML Application

1.1 Train/Dev/Test Sets


Data = Training set + Hold-out Cross Validation Set / Development Set + Test Set

Workflow

Data split

1.2 Bias and Variance


In machine learning, by looking at the training set error and dev set error, we could get the sense of bias and variance.

Example Assume the optimal (Bayes) error is 0, and the training set and dev set come from the same distribution. Then we have the following cases:

| Case | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| Training set error | 1% | 15% | 15% | 0.5% |
| Dev set error | 11% | 16% | 30% | 1% |
| Conclusion | high variance | high bias | high bias and high variance | low bias and low variance |

In the 2-dimensional case, we could draw the following graphs:

1.3 Basic recipe for machine learning


1.4 Regularization


Regularization prevents over-fitting and reduces variance.

1) L2 Regularization (or Weight Decay) in neural networks
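
The cost with L2 regularization adds (lambda / 2m) times the sum of the squared Frobenius norms of all weight matrices, and backprop gains an extra (lambda / m) * W[l] term in each gradient, which is why it is also called weight decay. A minimal sketch (the `weights` list and the unregularized gradient `dW` are hypothetical stand-ins for a real network):

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """L2 regularization term added to the cost: (lambda / 2m) * sum ||W||_F^2."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def decayed_update(W, dW, lambd, m, alpha):
    """Gradient step with the extra (lambda / m) * W term -- the 'weight decay'."""
    return W - alpha * (dW + (lambd / m) * W)
```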

2) How does regularization prevent over-fitting?

This is easiest to see with a simple figure.

When the regularization parameter lambda is set very small, the system approaches the unregularized case, i.e. the over-fitting situation (the right graph);

When lambda is set very large, the weights of most nodes become extremely small, so the network degenerates into a simple form, similar to the under-fitting situation (the left graph);

Therefore some intermediate value of lambda puts the system in the situation of the middle graph.

So what we need to do is find an appropriate lambda that reduces the variance while keeping the bias small.

In short, regularization pushes an over-fitted network back toward a more linear form, thereby reducing over-fitting.

3) Dropout Regularization

Dropout Regularization: during training, for each example, randomly shut off (drop) some nodes between the computations of each layer. In other words, each example passes through a randomly generated sub-network of the original network.

# Illustrate inverted dropout with layer l = 3
import numpy as np

keep_prob = 0.8

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
# boolean matrix d3; False marks the nodes to be dropped out

a3 = np.multiply(a3, d3)
# element-wise multiplication; True behaves as 1 and False as 0

a3 /= keep_prob
# scale up so the dropped nodes do not change the expectation of a3

4) Other Regularization Methods

(For more detail, please watch the video.)

1.5 Normalizing Input Features


Two steps

  1. zero out the means
  2. normalize the variances
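
The two steps can be sketched with NumPy (X holds one example per column, as in the course; the mu and sigma^2 computed on the training set should be reused on the dev and test sets):

```python
import numpy as np

def normalize_inputs(X):
    """Zero out the mean and normalize the variance of each input feature.
    X has shape (n_features, m_examples); returns (X_norm, mu, sigma2)."""
    mu = np.mean(X, axis=1, keepdims=True)           # step 1: subtract the mean
    X = X - mu
    sigma2 = np.mean(X ** 2, axis=1, keepdims=True)  # step 2: divide by the std
    X = X / np.sqrt(sigma2 + 1e-8)                   # small epsilon avoids division by zero
    return X, mu, sigma2
```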

Why normalize inputs?

The benefit of normalizing the inputs is that the cost function becomes much more evenly shaped, so gradient descent reaches the optimum smoothly no matter which learning rate alpha we pick.

Conversely, if the inputs are not normalized, the cost function has the elongated contours shown on the left (taking two-dimensional input as an example), and we are forced to pick a very small learning rate alpha so that many iterations do not overshoot the optimum.

1.6 Vanishing / Exploding Gradients


1.7 Weight Initialization for Deep Networks


# weight matrix initialization for layer l
W[l] = np.random.randn(<shape>) * np.sqrt(2 / n[l-1])

# Notes on the expression inside np.sqrt:
#
# if the activation function is ReLU,
#     use np.sqrt(2 / n[l-1])   (He initialization)
#
# if the activation function is tanh,
#     use np.sqrt(1 / n[l-1])   (Xavier initialization)
#     or np.sqrt(2 / (n[l-1] + n[l]))

1.8 Gradient Checking
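
As a reminder of the video's content: gradient checking compares the analytic gradient with the two-sided difference (f(theta + eps) - f(theta - eps)) / (2 eps). A minimal sketch on a generic function (epsilon = 1e-7 and the 1e-7 / 1e-3 thresholds follow the course's rules of thumb):

```python
import numpy as np

def grad_check(f, grad_f, theta, epsilon=1e-7):
    """Compare an analytic gradient with the two-sided numerical approximation.
    Returns the relative difference; values around 1e-7 or below suggest the
    gradient is correct, values near 1e-3 or above suggest a bug."""
    theta = theta.astype(float)
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus.flat[i] += epsilon
        minus.flat[i] -= epsilon
        approx.flat[i] = (f(plus) - f(minus)) / (2 * epsilon)
    analytic = grad_f(theta)
    return (np.linalg.norm(analytic - approx)
            / (np.linalg.norm(analytic) + np.linalg.norm(approx)))
```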


2 Optimization Algorithms

2.1 Mini-batch Gradient Descent


Notation

| notation | name | explanation |
| --- | --- | --- |
| (i) | round brackets | index of different training examples |
| [l] | square brackets | index of different layers of the neural network |
| {t} | curly brackets | index of different mini-batches |
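
The split into mini-batches X{t}, Y{t} can be sketched as follows (column-per-example layout as in the course; the shuffle and the batch size of 64 are conventional choices, not fixed requirements):

```python
import numpy as np

def make_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the m examples (columns) and cut them into mini-batches
    X{t}, Y{t}; the last batch may be smaller than mini_batch_size."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, t:t + mini_batch_size], Y[:, t:t + mini_batch_size])
            for t in range(0, m, mini_batch_size)]
```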

Intuition

Further Understanding

(Watch the full video for more details.)

2.2 Exponentially Weighted Average


Bias Correction
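
The exponentially weighted average v_t = beta * v_{t-1} + (1 - beta) * theta_t and its bias correction v_t / (1 - beta^t) can be sketched as:

```python
def ewa(values, beta=0.9, correct_bias=True):
    """Exponentially weighted average: v = beta * v + (1 - beta) * theta.
    Bias correction divides by (1 - beta**t) to fix the cold start at v0 = 0."""
    v, out = 0.0, []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if correct_bias else v)
    return out
```

On a constant series the corrected average recovers the constant immediately, while the uncorrected one starts far too low, which is exactly the cold-start problem the correction fixes.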

2.3 Gradient Descent With Momentum


Implementation

First, initialize the matrices vdW and vdb to all zeros.

Gradient descent with momentum will almost always work better than the straightforward gradient descent without momentum.
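
A minimal sketch of one parameter update with momentum (the same update applies to b with vdb; beta = 0.9 is the common default):

```python
import numpy as np

def momentum_step(W, dW, vdW, beta=0.9, alpha=0.01):
    """One gradient-descent-with-momentum update:
    v = beta * v + (1 - beta) * dW;  W = W - alpha * v."""
    vdW = beta * vdW + (1 - beta) * dW
    W = W - alpha * vdW
    return W, vdW
```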

2.4 RMSprop Optimization Algorithm


RMS stands for "Root mean square".

Implementation

Initialize SdW and Sdb to zero matrices.

epsilon is a small value added to keep the denominator from being zero; a choice such as 1e-8 works well.
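
Putting the pieces together, one RMSprop update might look like this (shown for W; the same applies to b with Sdb):

```python
import numpy as np

def rmsprop_step(W, dW, SdW, beta2=0.999, alpha=0.01, epsilon=1e-8):
    """One RMSprop update: S = beta2 * S + (1 - beta2) * dW**2;
    W = W - alpha * dW / (sqrt(S) + epsilon)."""
    SdW = beta2 * SdW + (1 - beta2) * dW ** 2
    W = W - alpha * dW / (np.sqrt(SdW) + epsilon)
    return W, SdW
```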

2.5 Adam Optimization Algorithm


Adam stands for "Adaptive Moment Estimation".

Algorithm

Default Parameter Setting
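
Adam combines the momentum term (first moment) with the RMSprop term (second moment), both bias-corrected. A sketch of one update with the default parameters recommended in the course (beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8; alpha still needs tuning):

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update on iteration t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * dW        # momentum-like first moment
    s = beta2 * s + (1 - beta2) * dW ** 2   # RMSprop-like second moment
    v_hat = v / (1 - beta1 ** t)            # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + epsilon)
    return W, v, s
```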

2.6 Learning Rate Decay


Implementation
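
One common schedule from the video sets alpha = alpha0 / (1 + decay_rate * epoch_num); a minimal sketch:

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Learning rate decay: alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)
```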


Professor Ng points out that the real problem in deep-learning optimization is not getting stuck in local optima: in high dimensions, a local optimum requires the curvature to agree in every dimension, so most zero-gradient points are saddle points.

The real problem is the effect of plateaus on learning speed: once learning reaches a plateau, a good optimization algorithm is needed to leave it quickly.

3 Hyperparameter Tuning

3.1 Tuning Process


# randomly sample the learning rate on a log scale
import numpy as np

r = -4 * np.random.rand()   # r in (-4, 0]
alpha = 10 ** r             # alpha in (10^-4, 1]

3.2 Batch Norm (BN)


Batch norm generalizes input normalization: normalize the inputs to every layer, not just the input layer.

In practice it is usually enough to enable BN directly in the framework; for the detailed principles and implementation, refer to the videos:

The computation for a given layer is:

The parameters beta and gamma are learned by gradient descent. Notably, with BN the bias parameter b of each layer can be omitted: whatever constant is added to z is subtracted away during normalization.
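
A sketch of that per-layer computation (mean and variance taken over the mini-batch; epsilon is a small constant for numerical stability):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    """Normalize Z over the mini-batch (axis 1), then scale and shift:
    z_norm = (z - mu) / sqrt(sigma2 + eps);  z_tilde = gamma * z_norm + beta.
    Note there is no bias b: any constant added to Z is removed by the mean."""
    mu = np.mean(Z, axis=1, keepdims=True)
    sigma2 = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(sigma2 + epsilon)
    return gamma * Z_norm + beta
```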

  • It limits the amount to which updating the parameters in the earlier layers can affect the distribution of values that a later layer sees and learns on.

  • Batch norm reduces the problem of the input values shifting; it makes these values more stable.

  • Batch norm adds some noise to each hidden layer's activations. And so similar to dropout, batch norm therefore has a slight regularization effect.

3.3 Softmax Regression


Math of the Softmax Layer
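
The softmax activation computes t = e^z element-wise and then normalizes: a_i = t_i / sum_j t_j. A minimal sketch (subtracting max(z) is a standard numerical-stability trick, not part of the formula itself):

```python
import numpy as np

def softmax(z):
    """Softmax activation: a_i = exp(z_i) / sum_j exp(z_j).
    Subtracting max(z) does not change the result but avoids overflow."""
    t = np.exp(z - np.max(z))
    return t / np.sum(t)
```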

Softmax Examples

Training a Softmax Classifier

And for the gradient descent, the first expression should be:

3.4 Deep Learning Frameworks

