【机器学习快速入门教程5】DIY神经网络

2018-08-07 本文已影响66人 Doooer

章节5：DIY神经网络

神经网络是本教程最主要的内容，理解神经网络的工作原理十分有必要。
本文我们将介绍和实现一个具有有两层结构和sigmoid激活函数的简单的神经网络，我们将从上一章节提到的线性回归的例子的基础上开始构建。

线性回归的回顾

假如我们有一组点的集合(X,Y)，我们要找到一条直线y=mx+b来拟合这些数据点。我们期望找到参数m和b，使得误差平方和J最小。

使用梯度下降方法来寻找适合的参数m和b，我们需要计算损失函数J的梯度，如下：

接着我们随机选一组初始化参数m和b，然后定义一个更新规则，用来调整参数m和b，直到损失函数收敛。

神经网络

我们试着把线性回归问题改造成一个多层的神经网络。首先，我们增加一个输入变量，这样我们的输入就变成二维的了，回顾下我们之前的Iris数据集，将问题转化为：给定花萼长度和宽度，预测花瓣长度。让我们先导入数据集，

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
data, labels = iris.data[:,0:2], iris.data[:,2]

num_samples = len(labels)  # size of our dataset

# shuffle the dataset
shuffle_order = np.random.permutation(num_samples)
data = data[shuffle_order, :]
labels = labels[shuffle_order]

就像之前的一维(一个输入变量)问题一样，我们还是可以做线性回归，只不过现在我们有两个变量(x1和x2)，因此就有两个权重(系数变量)。所以，我们得到以下函数方程，

现在，我们有3个参数w1，w2和b。我们可以用神经元图表示我们的函数方程f(x)，如下：

假设，w=[0.2,0.6]，b=-0.3，然后将着函数方程定义成weighted_sum，

def weighted_sum(x, w, b):
    return b + np.dot(w, x)

# set our paramters
w = [0.2, 0.6]
b = -0.3

# for example, let's use the first data point
X, y = data, labels

pred_y = [weighted_sum(x, w, b) for x in X]

# let's print out the first prediction
print("for x=[%0.2f, %0.2f], predicted = %0.2f, actual = %0.2f" % (X[0][0], X[0][1], pred_y[0], y[0]))

for x=[4.70, 3.20], predicted = 2.56, actual = 1.60

使用误差平方和函数评估我们预测的效果，

# sum squared error
def cost(y_pred, y_actual):
    return 0.5 * np.sum((y_actual-y_pred)**2)

error = cost(pred_y, y)
print(error)

313.0096

在本问题中，我们不能将损失函数的损失平面画出来了，因为我们的函数方程中有3个参数(w1,w2,b)，所以如果要画出损失平面，则是4维的。但我们可以像上章节中的方法一样，使用微积分计算代价函数针对所有参数的偏导数，来优化我们的3个参数。

因为我们有两个w和x，所以我们使用下标来区分它们。我们用上标来区分当前数据是数据集中的第几个数据。所以我们的偏导数如下，

我们使用以下的更新规则，来计算调整参数，

代码实现如下，

import random

# grab our data
X, y = data, labels

# always a good idea to normalize
X = X / np.amax(X, axis=0)
y = y / np.amax(y, axis=0)

# choose a random initial m, b
w, b = [random.random(), random.random()], random.random()

# our function w1 * x1 + w2 * x2 + b
def F(X, w, b):
    return np.sum(w*X, axis=1) + b

# what is our error?
y_pred = F(X, w, b)
init_cost = cost(y_pred, y)

print("initial parameters: w1=%0.3f, w2=%0.3f, b=%0.3f"%(w[0], w[1], b))
print("initial cost = %0.3f" % init_cost)

# implement partial derivatives of our parameters
def dJdw1(X, y, w, b):
    return -np.dot(X[:,0], y - F(X, w, b))

def dJdw2(X, y, w, b):
    return -np.dot(X[:,1], y - F(X, w, b))

def dJdb(X, y, w, b):
    return -np.sum(y - F(X, w, b))

# choose the alpha parameter and number of iterations
alpha = 0.001
n_iters = 2000

# run through gradient descent
errors = []
for i in range(n_iters):
    w[0] = w[0] - alpha * dJdw1(X, y, w, b)
    w[1] = w[1] - alpha * dJdw2(X, y, w, b)
    b = b - alpha * dJdb(X, y, w, b)
    y_pred = F(X, w, b)
    j = cost(y_pred, y)
    errors.append(j)
    
# plot the error
plt.figure(figsize=(16, 3))
plt.plot(range(n_iters), errors, linewidth=2)
plt.title("Cost by iteration")
plt.ylabel("Cost")
plt.xlabel("iterations")

# what is our final error rate
y_pred = F(X, w, b)
final_cost = cost(y_pred, y)

print("final parameters: w1=%0.3f, w2=%0.3f, b=%0.3f"%(w[0], w[1], b))
print("final cost = %0.3f" % final_cost)

initial parameters: w1=0.080, w2=0.220, b=0.259
initial cost = 5.384
final parameters: w1=1.902, w2=-0.923, b=-0.221
final cost = 0.665

我们可以画出我们预测的函数f(X)，输入变量为[x1,x2]，如下：

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.gca(projection='3d')
x1, x2 = np.meshgrid(np.arange(-10, 10, 1), np.arange(-10, 10, 1))
y = b + w[0]*x1 + w[1]*x2
ax.plot_surface(x1, x2, y, rstride=1, cstride=1, cmap=plt.cm.coolwarm, linewidth=0, antialiased=False)

至此，在该数据集上应用线性回归，我们找到了一个可以接受的代价，但我们仍然可以找到更小的代价。问题在于，线性回归是线性的，但有时候，我们的数据不能很好的像上面这张图一样拟合出一个平面。在真实的数据集中，数据是不规则的，明显分布在不同的曲面上。为了使我们的方程f(X) 更“灵活”，我们引入一个非线性转换方程，Sigmoid函数：

我们来看下Sigmoid函数和它的函数图像，

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.arange(-10.0, 10.0, 0.2)
sig = sigmoid(x)
plt.plot(x, sig)
plt.title('Sigmoid function')

基本上，一个sigmoid函数能将任何输入都转换成0到1之间的输出。所以，现在我们将修改f(X)函数，不单是简单的权重值相加，而是将它传入sigmoid函数，如下图：

因此，函数变为，

假如我们有跟之前一样的参数：

我们可以计算出预测值和误差，

def weighted_sum(x, w, b):
    return b + np.dot(w, x)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# reset our parameters
w = [0.2, 0.6]
b = -0.3

X, y = data, labels

# get weighted sum like before
Z = [weighted_sum(x, w, b) for x in X]

# now transform the weighted sums with a sigmoid
y_pred = [sigmoid(z) for z in Z]

# evaluate error
error = cost(y_pred, y)
print(error)

829.4187739184515

这结果看起来很糟糕，但我们还没进行优化！
在上述所给的权重值下，我们可以画出我们的神经元活动函数：

fig = plt.figure()
ax = fig.gca(projection='3d')
x1, x2 = np.meshgrid(np.arange(-10, 10, 1), np.arange(-10, 10, 1))
y = sigmoid(b + w[0]*x1 + w[1]*x2)
ax.plot_surface(x1, x2, y, rstride=1, cstride=1, cmap=plt.cm.coolwarm, linewidth=0, antialiased=False)

上述函数已经可以考虑成一个简单的神经网络。让我们增加一个隐藏层，使我们的神经网络更复杂一些。我们在输入和输出层之间，添加一个有3个神经元的隐藏层：

跟之前的步骤一样，我们从左边的输入层开始计算中间的每个神经元，当中间的神经元全部计算完成，我们就可以在输出层执行相同的计算，得到最后的结果。注意：通常不对输出层进行非线性转换，所以我们对三个隐藏层的神经元进行sigmoid变化，而不是输出层的神经元。
同之前的线性回归的神经网络比较，我们在该神经网络增加了两个变化，1）添加了对每个神经元激活函数的非线性sigmoid处理（除了输出神经元）；2）添加了中间的隐藏层。变化后的神经网络比之前更“灵活”，根据参数值的不同，它的函数图像可以各种弯曲变化，但在这网络上进行梯度下降也会变得更复杂。
代码实现如下，我们有两层权重值，分别为W1，W2。W1有2x3个权重系数，W2有3x1个权重系数，

W1 = np.random.randn(2, 3)
W2 = np.random.randn(3, 1)

print("W1=", W1)
print("W2=", W2)

W1= [[ 0.55604015 -1.30425457 -1.51214943]
 [-0.06828511 -0.63519787 -0.59453199]]
W2= [[-0.09451191]
 [-2.21387541]
 [-0.4512179 ]]

为了简单起见，我们把偏差项b先初始化为0。从输入层到输出层传递的计算
叫做“前向传播”，我们先实现“前向传播”：

X, y = data, labels

# first layer weighted sum z
z = np.dot(X, W1)

# project z through non-linear sigmoid
z = sigmoid(z)

# do another dot product at end (sigmoid is omitted)
y_pred = np.dot(z, W2)

# what is our cost
error = cost(y_pred, y)

print('predicted %0.2f for example 0, actual %0.2f, total cost %0.2f'%(pred_y[0], y[0], error))

predicted 2.56 for example 0, actual 1.60, total cost 201434.60

我们把这些操作整合成一个类：

class Neural_Network(object):
    def __init__(self, n0, n1, n2):        
        self.n0 = n0
        self.n1 = n1
        self.n2 = n2
        
        # initialize weights
        self.W1 = np.random.randn(self.n0, self.n1)
        self.W2 = np.random.randn(self.n1 ,self.n2)
        
    def predict(self, x):
        z = np.dot(x, self.W1)
        z = sigmoid(z)
        y = np.dot(z, self.W2)
        return y

实例化一个有2个输入神经元，2个隐藏神经元和1个输出神经元的神经网络。

net = Neural_Network(2, 3, 1)

调用predict函数就能执行神经网络的前向传播：

X, y = data, labels
y_pred = net.predict(X)
error = cost(y_pred, y)

print('predicted %0.2f for example 0, actual %0.2f, total cost %0.2f'%(pred_y[0], y[0], error))

predicted 2.56 for example 0, actual 1.60, total cost 211597.43

梯度下降

回顾下，我们有一个2x3x1的神经网络，有9个权重系数和4个偏差系数，共13个参数。那怎样优化这些参数使得代价函数最小呢？
通过上一章线性回归的例子，我们知道可以通过梯度下降的方法来解决该问题。实际上，对于神经网络我们现在只能通过梯度下降来解决了。因为不像线性回归，除了梯度下降方法还可以用分析法。而神经网络是非线性的，所以分析法已经不适用了。
梯度下降的目标就是寻找损失函数J对于每个参数w（此处的w包含偏差系数b）的梯度。

在此基础上，对每个参数应用可迭代使用的更新规则：

让我们用最简单的方法——数值法（定义法）来计算梯度，梯度的定义如下：

因此，损失函数对于某个参数的偏导数计算如下：

注意该计算方法是最简单的，也是计算最慢、效率最低的。因为该方法要求每次更新时对网络中每个参数都做一次前向传播。如果我们的网络有1,000,000个参数，则至少要进行1,000,000次前向传播，每次传播至少运行1,000,000次乘法。后面我们会介绍一个更快、更有效率的计算梯度的方法——反向传播算法。现在，为了更容易理解梯度下降，我们用数值计算的方法来实现。

import itertools

def get_gradient(net, X, y):
    w_delta = 1e-8

    # get the current value of the loss, wherever the parameters are
    y_pred_current = net.predict(X)
    error_current = cost(y_pred_current, y)

    # grab the current weights and copy them (so we can restore them after modification)
    dw1, dw2 = np.zeros((net.n0, net.n1)), np.zeros((net.n1, net.n2))
    W1, W2 = np.copy(net.W1), np.copy(net.W2) 
    
    # for the first layer, iterate through each weight, 
    # perturb it slightly, and calculate the numerical 
    # slope between that loss and the original loss
    for i, j in itertools.product(range(net.n0), range(net.n1)):
        net.W1 = np.copy(W1)
        net.W1[i][j] += w_delta
        y_pred = net.predict(X)
        error = cost(y_pred, y)
        dw1[i][j] = (error - error_current) / w_delta

    # do the same thing for the second layer
    for i, j in itertools.product(range(net.n1), range(net.n2)):
        net.W2 = np.copy(W2)
        net.W2[i][j] += w_delta
        y_pred = net.predict(X)
        error = cost(y_pred, y)
        dw2[i][j] = (error - error_current) / w_delta

    # restore the original weights
    net.W1, net.W2 = np.copy(W1), np.copy(W2)

    return dw1, dw2

上述代码中的get_gradient计算在数据集X，y下的2个网络层之间的梯度，接下来，我们将演示网络中的学习过程，首先，导入数据集，实例化网络，然后用梯度下降方法训练数据集，

# load the data and labels
X, y = data, labels.reshape((len(labels),1))

# it's always a good idea to normalize the data between 0 and 1
X = X/np.amax(X, axis=0)
y = y/np.amax(y, axis=0)

# create a 2x3x1 neural net
net = Neural_Network(2, 3, 1)    

# what is the current cost?
y_orig = net.predict(X)
init_cost = cost(y_orig, y)
print("initial cost = %0.3f" % init_cost)

# Set the learning rate, and how many epochs (updates) to try
n_epochs = 2000
learning_rate = 0.01

# for each epoch, calculate the gradient, then subtract it from the parameters, and save the cost
errors = []
for i in range(n_epochs):
    dw1, dw2 = get_gradient(net, X, y)
    net.W1 = net.W1 - learning_rate * dw1
    net.W2 = net.W2 - learning_rate * dw2
    y_pred = net.predict(X)
    error = cost(y_pred, y)
    errors.append(error)
        
# plot it
plt.plot(range(0, len(errors)), errors, linewidth=2)

# what is the final cost?
y_pred = net.predict(X)
final_cost = cost(y_pred, y)
print("final cost = %0.3f" % final_cost)

initial cost = 85.579
final cost = 0.562

至此，我们使用数值梯度下降方法实现了一个神经网络。在下一章节中，我们将学习Keras，一个高级的机器学习库，并在接下来的教程中会一直使用它。

【机器学习快速入门教程5】DIY神经网络

章节5：DIY神经网络

线性回归的回顾

神经网络

梯度下降

猜你喜欢

热点阅读