
Principles of the Linear Regression Model and Its scikit-learn Implementation

2018-12-02  水之心

Suppose we have a dataset \{ x^{(1)}, x^{(2)}, \ldots, x^{(m)} \}, where each sample x^{(i)} \in \mathbb{R}^{n}. Let

\begin{cases} X = \begin{pmatrix} x^{(1)}\\ x^{(2)}\\ \vdots\\ x^{(m)} \end{pmatrix}\\ Y = \begin{pmatrix} y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(m)} \end{pmatrix}\\ \end{cases}

X is called the design matrix of the dataset \{(x^{(i)}, y^{(i)})\}_{i=1}^m, where y^{(i)} is the label corresponding to x^{(i)}.

Note: each sample x^{(i)} is treated as a row vector here, so that X \in \mathbb{R}^{m \times n}.

The linear regression model

Consider first a single sample x^{(i)}, with w \in \mathbb{R}^{n} and b \in \mathbb{R}:

\begin{aligned} &\hat{y}^{(i)} = x^{(i)} w + b\\ &\ell_i = \frac{1}{2} (y^{(i)} -\hat{y}^{(i)})^2 \end{aligned}

Taking all m samples together,

\ell = \frac{1}{2m} \sum_{i=1}^m \ell_i = \frac{1}{2m} ||Xw + b \cdot \mathbb{1} - Y||_2^2
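For readers who want to compute this directly, here is a minimal NumPy sketch of the vectorized loss (the function name and the shape conventions X: (m, n), w: (n, 1), Y: (m, 1) are my own assumptions, not from the original post):

import numpy as np

def mse_loss(X, w, b, Y):
    """Vectorized loss (1/2m)‖Xw + b·1 − Y‖₂² with X: (m, n), w: (n, 1), Y: (m, 1)."""
    m = X.shape[0]
    return np.sum((X @ w + b - Y) ** 2) / (2 * m)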

Next, let's see how the parameters are updated in order to minimize \ell.

1. Gradient descent

First, compute the gradients:

\begin{aligned} &\nabla_{w} = \frac{\partial \ell}{\partial w} = \frac{1}{m} X^T (Xw + b \cdot \mathbb{1} - Y)\\ &\nabla_b = \frac{\partial \ell}{\partial b} = \frac{1}{m} \mathbb{1}^T \cdot (Xw + b \cdot \mathbb{1} - Y) \end{aligned}

Then update the parameters:

\begin{aligned} &w = w - \alpha \nabla_{w}\\ &b = b - \alpha \nabla_b \end{aligned}

where \alpha is called the learning rate (or step size).
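As a concrete illustration, here is a minimal NumPy sketch of these update rules (the function name, zero initialization, and hyperparameter defaults are illustrative choices, not from the original post):

import numpy as np

def gd_linear_regression(X, Y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for linear regression, following the updates above."""
    m, n = X.shape
    w = np.zeros((n, 1))   # initialize w ∈ R^n
    b = 0.0                # initialize b ∈ R
    for _ in range(n_iters):
        residual = X @ w + b - Y       # Xw + b·1 − Y, shape (m, 1)
        grad_w = X.T @ residual / m    # ∇_w = (1/m) Xᵀ(Xw + b·1 − Y)
        grad_b = residual.sum() / m    # ∇_b = (1/m) 1ᵀ(Xw + b·1 − Y)
        w = w - alpha * grad_w
        b = b - alpha * grad_b
    return w, b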

Note that scikit-learn's implementation does not use gradient descent; it uses least squares.

2. Least squares

Let \theta = \begin{pmatrix} w \\ b \end{pmatrix} and \overline{X} = \begin{pmatrix} X & \mathbb{1} \end{pmatrix}; then

\ell = \frac{1}{2m} ||\overline{X} \theta - Y||_2^2

\frac{\partial \ell}{\partial \theta} = 0 可得最小二乘解

\theta^{*} = (\overline{X}^T \overline{X})^{\dagger} Y
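A sketch of this closed-form solution in NumPy (the function name is illustrative; it assumes X has shape (m, n) and Y has shape (m, 1), and uses np.linalg.pinv for the pseudoinverse †):

import numpy as np

def lstsq_linear_regression(X, Y):
    """Solve the normal equations X̄ᵀX̄θ = X̄ᵀY for θ = (w; b)."""
    m = X.shape[0]
    X_bar = np.hstack([X, np.ones((m, 1))])               # X̄ = (X  1)
    theta = np.linalg.pinv(X_bar.T @ X_bar) @ X_bar.T @ Y
    return theta[:-1], theta[-1]                          # w, b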

Implementation in sklearn

In this post we use the USA_Housing dataset from Kaggle for a linear regression task: predicting house prices.

import pandas as pd
import numpy as np
name = '../dataset/USA_Housing.csv'
dataset = pd.read_csv(name)

train = dataset.iloc[:3000,:]
test = dataset.iloc[3000:,:]

print(train.shape)
print(test.shape)
(3000, 7)
(2000, 7)

Check for missing values:

print(np.unique(train.isnull().any()))
print(np.unique(test.isnull().any()))
[False]
[False]
dataset.columns  # list all feature names
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')

We will not use the 'Address' feature; instead we predict 'Price' from the features 'Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', and 'Area Population'.

features_column = [
    name for name in dataset.columns if name not in ['Price', 'Address']
]
label_column = ['Price']

x_train = train[features_column]
y_train = train[label_column]
x_test = test[features_column]
y_test = test[label_column]

Train the linear model:

from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Make predictions:

pred_test = regr.predict(x_test)
print('Predictions:\n', pred_test)
Predictions:
 [[1448272.69412661]
 [1301968.70364148]
 [1355317.66349181]
 ...
 [1026886.81679043]
 [1261208.34730572]
 [1301748.28071761]]

Compute the loss:

from sklearn.metrics import mean_squared_error, r2_score
print('mean square error:', mean_squared_error(y_test, pred_test))
mean square error: 10305125663.199516

Compute the r2_score:

print(r2_score(y_test, pred_test))
0.9174114106593986
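To see that sklearn really does solve the least-squares problem from Section 2, we can cross-check its fitted parameters against a direct solve with np.linalg.lstsq (a quick sanity check, assuming the variables from the training step above are still in scope):

X_bar = np.hstack([x_train.values, np.ones((len(x_train), 1))])  # X̄ = (X  1)
theta, *_ = np.linalg.lstsq(X_bar, y_train.values, rcond=None)   # least-squares θ*
print(np.allclose(theta[:-1].ravel(), regr.coef_.ravel()))       # weights should agree
print(np.allclose(theta[-1], regr.intercept_))                   # intercept should agree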

For how to implement this with gradient descent, see: Linear Regression Implemented in MXNet and TensorFlow.
