动手学梯度下降

2020-03-05 本文已影响0人老居搞机

梯度下降是机器学习中使用最广泛的优化方法(没有之一), 今天我们用一个线性回归作为例子, 一边撸代码, 一边认识一下梯度下降究竟是怎么工作的吧

线性回归

所谓回归就是对连续性的变量做出预测, 比如年龄、房屋价格、花费时间, 都是有数值大小的, 如果像男、女这样没有数值大小的叫分类, 就不是回归了

线性回归 是回归中的一种, 可以通俗的理解为我们中学时候学过的一元多次方程:

h(x) = θx + b

新建线性回归函数:

import numpy as np
def liner(x, theta):
    return np.dot(x, theta.T)

看一下效果:

x = np.array([[0],[1],[2],[4],[9],[15]])
theta = np.array([[2]])
y = liner(x, theta)

损失函数

如果我们有了x和y的数据，需要求出参数θ的时候，为此需要定义一个损失函数, 线性回归的损失函数一般使用最小二乘损失函数:

m是有多少条数据集
i是第i条数据
y是第i条数据对应的值
前面加1/2是为了后面求导数的时候方便和二乘做抵消, 没有任何实际意义

我们的目标是最小化这个损失函数： minimize L(θ)

最小二乘是一个凸函数, 我们可以采用梯度下降的方式来优化它

新建损失函数:

def loss(x, y, theta, f):
    square = 0.5  * (f(x, theta) - y) ** 2 / x.shape[0]
    return square.sum()

梯度下降(Batch Gradient Descent)

想象一下当我们站在山顶上, 要去山脚的时候怎么办? 我们会先环顾一下四周, 找一个最快下山的方向, 往这个方向上走一步, 再停下来接着看一下找到目前位置最快下山的方向再走一步, 如此重复直到山脚下

同样道理, 所谓梯度就是函数变化最快的方向(即导数), 沿着梯度相反的方向, 那就是最快变小的方向, 函数的梯度方向(假设有2个参数θ0, θ1)为:

通过不停的迭代梯度下降，最终来找到各个参数的最优解:

α 是走的梯度步长, 如果太小会收敛的很慢(步子太小到山脚可能几年以后了),如果太大容易越过最小值的点走到对面去(步子太大跨到另一座山上去了, 容易扯到蛋), 所以α是需要根据实际数据不断调参来获得的

根据链式法则对最小二乘求导, 第j个参数梯度为:

翻译成代码:

def grad(x, y, theta, axis, f):
    x_index = np.array([x[:, axis]]).T
    return ((f(x, theta) - y) * x_index).sum() / float(x.shape[0])

新建梯度下降迭代方法:

def BGD(input_x, input_y, f, loss_fun, grad, epochs=1000, alpha=1e-3, loss_min=1e-3):
    x = input_x
    y = input_y
    x_size = x.shape
    theta_num = x_size[1]
    theta = np.zeros((1, theta_num))
    for epoch in range(epochs):
        for axis, theta_val in enumerate(theta[0]):
            theta[0][axis] = theta_val - alpha * grad(x, y, theta, axis, f)

        loss_sum = loss_fun(input_x, input_y, theta, f)
        if (epoch + 1) % 100 == 0:
            print "BGD epoch: %s, theta: %s, loss: %s" % (epoch+1, theta[0], loss_sum)
        if loss_sum < loss_min:
            break

    return theta[0]

测试一下这个梯度下降, 随机生成50条数据:

def get_data(size=50):
    x = np.random.randint(1, 50, size=size).reshape(size, 1)
    # 参数为10, 2, 一会我们就是寻找这个参数
    theta_data = np.array([[10, 2]])
    x_0 = np.ones((x.shape[0], 1))
    x_with_x0 = np.concatenate((x_0, x), axis=1)
    # 对Y做一些随机扰动
    y = liner(x_with_x0, theta_data) + np.random.randn(size, 1) * 5
    return x, y
    
x, y = get_data()

对数据做梯度下降:

x_0 = np.ones((x.shape[0], 1))
input_x = np.concatenate((x_0, x), axis=1)
theta = BGD(input_x, y, f=liner, loss_fun=loss, grad=grad, epochs=10000)
print "theta:", theta

梯度下降的拟合过程:

BGD epoch: 100, theta: [0.30646073 2.27916268], loss: 24.931265083939444
BGD epoch: 200, theta: [0.54211823 2.27247087], loss: 24.375932686582402
BGD epoch: 300, theta: [0.77209253 2.26594044], loss: 23.847062499995722
...
BGD epoch: 9900, theta: [9.18488859 2.02704782], loss: 13.374519777851393
BGD epoch: 10000, theta: [9.20643074 2.0264361 ], loss: 13.369879239216598

theta: [9.20643074 2.0264361 ]

可以看到获得的参数结果theta跟我们初始化的[10, 2]已经比较接近了, 蓝点为我们随机生成的数据, 红线就是是我们拟合参数的线性回归

随机梯度下降(Stochastic Gradient Descent)

刚才我们的梯度下降每次迭代都需要使用全量的数据, 如果在数据量很大的时候(比如100w条), 每次都计算一次所有的数据会非常慢, 可以每次迭代的时候随机取一条数据(随机梯度下降), 或者随机取少量的几条数据来做梯度计算(小批量梯度下降), 来提高我们的迭代效率

随机梯度下降代码:


def SGD(input_x, input_y, f, loss_fun, grad, epochs=1000, alpha=1e-3, loss_min=1e-3):
    x_size = input_x.shape
    theta_num = x_size[1]
    theta = np.zeros((1, theta_num))
    data_num = x_size[0]
    for epoch in range(epochs):
        index = random.randint(0, data_num - 1)
        x = np.array([input_x[index]])
        y = np.array([input_y[index]])
        for axis, theta_val in enumerate(theta[0]):
            theta[0][axis] = theta_val - alpha * grad(x, y, theta, axis, f)

        loss_sum = loss_fun(input_x, input_y, theta, f)
        if (epoch + 1) % 100 == 0:
            print "SGD epoch: %s, theta: %s, loss: %s" % (epoch+1, theta[0], loss_sum)
        if loss_sum < loss_min:
            break

    return theta[0]

小批量梯度下降(Mini-batch Gradient Descent)

def MBGD(input_x, input_y, f, loss_fun, grad, bsize=10, epochs=1000, alpha=1e-3, loss_min=1e-3):
    x_size = input_x.shape
    theta_num = x_size[1]
    theta = np.zeros((1, theta_num))
    start = 0
    data_num = x_size[0]
    for epoch in range(epochs):
        end = start + bsize
        if end <= data_num:
            x = input_x[start:end]
            y = input_y[start:end]
        else:
            x = np.append(input_x[start:data_num], input_x[0:end - data_num])
            y = np.append(input_y[start:data_num], input_y[0:end - data_num])
        start += bsize
        start = start % data_num
        for axis, theta_val in enumerate(theta[0]):
            theta[0][axis] = theta_val - alpha * grad(x, y, theta, axis, f)

        loss_sum = loss_fun(input_x, input_y, theta, f)
        if (epoch + 1) % 100 == 0:
            print "MBGD epoch: %s, theta: %s, loss: %s" % (epoch + 1, theta[0], loss_sum)
        if loss_sum < loss_min:
            break

    return theta[0]

后记

随机梯度下降虽然加快的训练速度, 但是因为数据量少在梯度方向上就不一定正确了, 这会引起梯度方向上来回震荡(小批量梯度下降稍微好一点, 但也会震荡), 有没有一种方法既能够像随机梯度下降一样速度快, 又能够像全量梯度下降一样保证正确的方向呢?

答案是有的, 这是我们下期要分享的带动量的梯度下降, 敬请期待

参考:

[1] 吴恩达机器学习视频
[2] https://www.cnblogs.com/geo-will/p/10468253.html

关注公众号