Machine Learning: Linear Regression

2019-05-28, by 倔犟的贝壳

Linear Regression

Scenario: use linear regression to predict Boston housing prices.

Overview

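Linear regression learns a linear equation that fits the relationship between the features X and the target Y.
For example, we might predict the price y of a house from its floor area x1, number of rooms x2, location x3, and so on.
So the equation we want to learn is:
$y=w_1x_1+w_2x_2+w_3x_3 + b$
This equation is the model function of linear regression: the function we ultimately use to predict y.
Here $w_1,w_2,w_3,b$ are the parameters to be learned.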
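
To make the model function concrete, here is a minimal sketch that evaluates it for a single house. The weights, bias, and feature values are hypothetical numbers chosen only for illustration; they do not come from the Boston data.

```python
import numpy as np

# Hypothetical example values, chosen only for illustration.
w = np.array([2.0, 0.5, -1.0])   # w_1, w_2, w_3
b = 4.0                          # bias term
x = np.array([3.0, 2.0, 1.0])    # x_1, x_2, x_3 for one house

# Model function: y = w_1*x_1 + w_2*x_2 + w_3*x_3 + b
y = float(np.dot(w, x) + b)
print(y)
```

Writing the weighted sum as a dot product is what lets the same code scale from three features to any number of them.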

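How do we learn $w_1,w_2,w_3,b$, and what values must they take for the model to count as good?
Our goal is to make the predictions as close as possible to the true values. Let $y'$ be the predicted value and $y$ the true value; we want $|y-y'|$ to be as small as possible.
So we introduce a cost function that measures the overall gap between the predictions and the true values across all samples:
$J(W,b) = \frac{1}{2m}\sum_{i=1}^{m} (y'^{(i)}-y^{(i)})^2=\frac{1}{2m}\sum_{i=1}^{m} (W \cdot X^{(i)}+b-y^{(i)})^2$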

Our goal is to minimize J(W,b). The method used here to minimize it is gradient descent.
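
As a small worked example of the cost function, the sketch below evaluates J on three made-up samples with one feature. The data and the helper name `cost` are assumptions for illustration only. The targets follow y = 2x exactly, so the cost should be zero at W = 2 and positive elsewhere.

```python
import numpy as np

# Toy data (hypothetical): m = 3 samples, 1 feature, targets are exactly 2 * x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [6.0]])
m = X.shape[0]

def cost(W, b):
    """J(W, b) = 1/(2m) * sum of squared prediction errors."""
    y_pred = X @ W + b                      # predictions y', shape (m, 1)
    return float(np.sum((y_pred - y) ** 2) / (2 * m))

J_off = cost(np.array([[1.5]]), 0.0)        # errors -0.5, -1.0, -1.5
J_exact = cost(np.array([[2.0]]), 0.0)      # perfect fit
print(J_off, J_exact)
```

With W = 1.5 the squared errors sum to 3.5, so J = 3.5 / 6; with W = 2 every error vanishes and J = 0.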

Notation

All variables used below are defined here in one place for easy reference.

Setting $w_0 = b$ and $x_0 = 1$, rewrite $y=w_1x_1+w_2x_2+w_3x_3 + b$ as:
$y=w_0x_0+w_1x_1+w_2x_2+w_3x_3$

Let:
m: number of samples
n_x: number of features
θ: $(w_0,w_1,w_2,w_3, \dots)$
Then:
X has shape (m, n_x+1)
y has shape (m, 1)
θ has shape (n_x+1, 1)
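
The following sketch checks these shapes and shows that folding the bias into θ leaves the predictions unchanged. The sizes and random values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_x = 5, 3                          # hypothetical sizes
X_raw = rng.normal(size=(m, n_x))      # features without the x_0 column
w = rng.normal(size=(n_x, 1))          # w_1 .. w_3
b = 0.7                                # bias

# Prepend a column of ones (x_0 = 1) and stack b on top of w (w_0 = b).
X = np.insert(X_raw, 0, 1, axis=1)     # shape (m, n_x + 1)
theta = np.vstack([[b], w])            # shape (n_x + 1, 1)

print(X.shape, theta.shape)
# Both formulations give the same predictions.
same = np.allclose(X @ theta, X_raw @ w + b)
print(same)
```

Absorbing b into θ means the gradient and the update rule can each be written as a single matrix expression, with no separate case for the bias.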

Implementation

Package

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
import seaborn as sb

Load the data

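# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this call needs scikit-learn < 1.2.
X,y = datasets.load_boston(return_X_y=True)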
y = y.reshape(-1,1)

# Split the data into training and test sets
train_X,test_X,train_y,test_y = train_test_split(X,y,test_size = 0.15,random_state = 1)
print(f"train_X shape: {train_X.shape}")
print(f"train_y shape: {train_y.shape}")
print(f"test_X shape: {test_X.shape}")
print(f"test_y shape: {test_y.shape}")

train_X shape: (430, 13)
train_y shape: (430, 1)
test_X shape: (76, 13)
test_y shape: (76, 1)

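# Standardization: subtract the mean and divide by the standard deviation
def nomalize(X,axis):
    mean = np.mean(X,axis)
    std = np.std(X,axis)
    print(mean.shape)
    return (X-mean)/std, mean,std

# Standardize the training set, then apply the same training-set mean and std to the test set
train_X,mean,std = nomalize(train_X,axis=0)
test_X = (test_X-mean)/std

# Insert a column of ones to represent x0
train_X = np.insert(train_X,0,1,axis=1)
test_X = np.insert(test_X,0,1,axis=1)
print(train_X.shape)
print(test_X.shape)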

(13,)
(430, 14)
(76, 14)

Initialize the parameters

def init_parameters(n):
    theta = np.random.randn(n,1)
    return theta

Define the cost function

def compute_cost(y_,y):
    m = y.shape[0]
    cost = np.sum(np.square(y_-y))/(2*m)
    return cost

Gradient descent

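The cost function J(·) is convex, so any local minimum is also a global minimum.
Gradient descent repeatedly steps in the direction opposite to the gradient of the cost function, moving closer to that minimum each time.
The steps are:
1. Compute the gradient with respect to θ:
$d_\theta = \frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m}X^T(X\theta-y)$
2. Update θ using $d_\theta$:
$\theta = \theta-\alpha d_\theta$
α is the learning rate, chosen by hand.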
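
For readers who want the intermediate step, the gradient formula can be derived as sketched below. The notation follows the definitions above; $x^{(i)}$ (row i of X) and $x_j^{(i)}$ (feature j of sample i) are symbols introduced here only for the derivation.

```latex
\begin{aligned}
J(\theta) &= \frac{1}{2m}\sum_{i=1}^{m}\left(x^{(i)}\theta - y^{(i)}\right)^2 \\
\frac{\partial J}{\partial \theta_j}
  &= \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}\theta - y^{(i)}\right)x_j^{(i)} \\
\frac{\partial J}{\partial \theta}
  &= \frac{1}{m}X^T\left(X\theta - y\right)
\end{aligned}
```

The factor 2 produced by differentiating the square cancels the 1/2 in the cost, which is why the cost is defined with 1/(2m) rather than 1/m.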

def gradient_desent(X,y,theta,learning_rate):
    m = y.shape[0]
    y_ = np.dot(X,theta)
    d_theta = np.dot(X.T,y_-y)/m
    theta = theta - learning_rate*d_theta
    return theta
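
One way to gain confidence in the gradient formula is to compare it against a finite-difference approximation. The sketch below does this on synthetic random data; the helper names `cost`, `analytic_grad`, and `numeric_grad` are hypothetical and not part of the original code.

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum of squared errors."""
    m = y.shape[0]
    return float(np.sum((X @ theta - y) ** 2) / (2 * m))

def analytic_grad(X, y, theta):
    """Gradient from the closed-form expression (1/m) * X^T (X theta - y)."""
    m = y.shape[0]
    return X.T @ (X @ theta - y) / m

def numeric_grad(X, y, theta, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[j, 0] = eps
        grad[j, 0] = (cost(X, y, theta + step) - cost(X, y, theta - step)) / (2 * eps)
    return grad

# Synthetic data (assumption): 20 samples, 3 features plus the x_0 column.
rng = np.random.default_rng(0)
X = np.insert(rng.normal(size=(20, 3)), 0, 1, axis=1)
y = rng.normal(size=(20, 1))
theta = rng.normal(size=(4, 1))

max_diff = float(np.max(np.abs(analytic_grad(X, y, theta) - numeric_grad(X, y, theta))))
print(max_diff)
```

If the two gradients disagreed by more than rounding error, that would point to a sign or transpose mistake in the analytic expression.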

Prediction

Use the model function to make predictions.

def predict(X,theta):
    return  np.dot(X,theta)

Optimization

def optimizer(train_X,train_y,theta,learning_rate,steps):
    costs = []
    for step in range(steps):
        theta = gradient_desent(train_X,train_y,theta,learning_rate)
        y_ = predict(train_X,theta)
        loss = compute_cost(y_,train_y)
        costs.append(loss)
        if step % 100 == 0:
            print(f"\nAfter {step} step(s),cost is :{loss}")
    return theta,costs
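
A quick sanity check for a loop like this is to run it on synthetic data whose true parameters are known. The sketch below is self-contained and inlines the same update rule; `true_theta` and the noise-free data are assumptions made for the test, not part of the Boston experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
true_theta = np.array([[3.0], [2.0], [-1.0]])          # hypothetical ground truth
X = np.insert(rng.normal(size=(m, 2)), 0, 1, axis=1)   # x_0 = 1 plus 2 features
y = X @ true_theta                                      # noise-free targets

theta = np.zeros((3, 1))
learning_rate = 0.05
costs = []
for step in range(800):
    d_theta = X.T @ (X @ theta - y) / m                 # gradient of J
    theta = theta - learning_rate * d_theta             # update rule
    costs.append(float(np.sum((X @ theta - y) ** 2) / (2 * m)))

print(costs[0], costs[-1])
print(theta.ravel())
```

Because the features are roughly standardized, a learning rate of 0.05 is small enough for the cost to fall steadily, and θ should end up very close to `true_theta`.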

Computing accuracy

Given an error tolerance, a prediction counts as correct if its difference from the true value falls within that tolerance.

def calc_accuracy(y_pred,y,error_ratio):   
    '''
    y_pred -- predicted values
    y -- true values
    error_ratio -- error tolerance as a fraction of the true value, e.g. 0.1 or 0.05
    '''
    y = y.reshape(-1,1)
    m = y.shape[0]
    correct_num = np.sum(np.fabs(y_pred-y) < error_ratio*y)
    return correct_num/m
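
To show how this tolerance-based accuracy behaves, here is a small sketch on four made-up predictions. The helper `within_tolerance_accuracy` is a hypothetical stand-in that applies the same rule; it also reshapes the predictions, on the assumption that a 1-D input should not broadcast against the (m,1) targets into an (m,m) comparison.

```python
import numpy as np

def within_tolerance_accuracy(y_pred, y, error_ratio):
    """Fraction of predictions whose absolute error is below error_ratio * y."""
    y = y.reshape(-1, 1)
    y_pred = y_pred.reshape(-1, 1)   # keep both as column vectors
    return float(np.mean(np.abs(y_pred - y) < error_ratio * y))

# Made-up values for illustration.
y_true = np.array([20.0, 30.0, 50.0, 10.0])
y_hat = np.array([22.0, 45.0, 48.0, 10.2])

acc_loose = within_tolerance_accuracy(y_hat, y_true, 0.30)
acc_strict = within_tolerance_accuracy(y_hat, y_true, 0.05)
print(acc_loose, acc_strict)
```

At a 30% tolerance only the second prediction (off by 15 on a true value of 30) fails, giving 0.75; at 5% the first prediction fails as well, giving 0.5. The metric therefore depends heavily on the chosen tolerance.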

Putting it together: training the model

def model(train_X,train_y,test_X,test_y,learning_rate=0.05,steps=1):
    m,n_x = train_X.shape
    print(learning_rate)
    # Initialize parameters
    theta = init_parameters(n_x)
    theta,costs = optimizer(train_X,train_y,theta,learning_rate,steps)

    error_ratio = 0.30 # i.e. the error must not exceed 30% of the true value
    print("==== Training set evaluation ====")
    y_pred = predict(train_X,theta)
    corr_ratio = calc_accuracy(y_pred,train_y,error_ratio)
    print(f"Training set accuracy: {corr_ratio}")

    print("==== Test set evaluation ====")
    y_pred = predict(test_X,theta)
    corr_ratio = calc_accuracy(y_pred,test_y,error_ratio)
    print(f"Test set accuracy: {corr_ratio}")
    cost = compute_cost(y_pred,test_y)
    print(f"Test set cost: {cost}")

    # Plot the cost curve
    plt.xlim(0,steps)
    plt.plot(costs)
    plt.xlabel("step(s)")
    plt.ylabel("costs")
    plt.show()

model(train_X,train_y,test_X,test_y,learning_rate=0.05,steps=800)

After 600 step(s),cost is :11.010287620444073

After 700 step(s),cost is :11.008066076099043
==== Training set evaluation ====
Training set accuracy: 0.872093023255814
==== Test set evaluation ====
Test set accuracy: 0.8289473684210527
Test set cost: 10.975677786706013

Figure: cost function curve

Source code: https://github.com/huanhuang/housePrices.git