Linear Regression（线性回归）

2020-01-02 本文已影响0人没天赋的学琴

Linear Regression

线性回归(Linear Regression)，就是希望通过构建线性模型 $h_\theta(x) = \theta_0+\theta_1x_1+...+\theta_nx_n$ 来对连续值进行预测。例如有一些房子面积与其售价的数据，下面图中红色交叉点代表房子的数据，然后通过这些数据，学习到一个线性函数 $h_\theta(x)=\theta_0+\theta_1x$ ，然后当我输入不同的房子面积( $feet^2$ )就会得到相应的房价预测。

房价预测
回归就是对一个连续值进行预测，而由于我们最后得到的模型是一个线性模型，因此也称之为线性回归。
从优化角度来看，要得到一个线性回归模型主要是三步：

设立模型形式
设置损失函数
利用优化方法迭代更新得到模型

1. 模型

线性回归中，模型的表示可以将偏置项也归纳进 $\theta$ 中，得到： $h(x) = \theta^Tx$ 这就是线性回归的模型表示，数据属性上的线性组合。

2. 损失函数

利用迭代的方法来计算求解得到模型参数 $\theta$ ，本质上就是求解一个最优化问题： $\begin{aligned} & {arg min}_\theta f(\theta) \quad || \quad {arg max}_\theta f(\theta) \\ s.t. & \quad h_i (\theta) = 0 (i = 1, ..., k) \\ & \quad g_i (\theta) \leq 0 (i = 1, ..., z) \end{aligned}$
损失函数(cost function 或loss function)是用来衡量模型得到的预测值 $h(x)$ 与真值 $y$ 之间的差距。是在通过迭代法求解模型参数时，优化问题的目标函数 $f(\theta)$ 。线性回归中，常用的损失函数有：

Mean Square Error / L2 loss	$\sum _{i=1} ^m {(y^{(i)} - h(x^{(i)}))}^2$
Mean Absolute Error / L1 loss	$\sum _{i=1} ^m { \| y^{(i)} - h(x^{(i)}) \| }$
Huber Loss	$l = \begin{cases} {1 \over 2} {(y - h(x))}^2, & \| y - h(x) \| \leq \delta \\ \delta(\| y - h(x) \| - {1 \over 2} \delta), & O.W. \end{cases}$
Log cosh Loss	$\sum_{i=1} ^{m} {log(cosh(h(x^{(i)}) - y^{(i)}))}$
Quantile Loss	$\sum_{i: y^{(i)} < h(x^{(i)}) } {(1 - \gamma) \| y^{(i)} - h(x^{(i)}) \| } + \sum_{i: y^{(i)} \geq h(x^{(i)}) } {\gamma \| y^{(i)} - h(x^{(i)}) \| }$

不同的损失函数，具体会有不同的效用；引入损失函数后，模型训练就变成了下述最优化问题： $\begin{aligned} {arg min} _\theta \: loss(\theta) \end{aligned}$ 一个无约束最优化问题，该问题的意义是，尽可能地减少模型的预测值和真值的差距。

3. 迭代优化

模型参数的求解是一个无约束最优化问题，因此可以使用一些迭代法来求解；梯度下降法是较为常用的训练方法。迭代的规则如下： $\theta = \theta - \alpha * {\partial L \over \partial \theta}$ $\alpha$ 为学习率，即每次迭代的步长。下面以MSE作为例子，损失函数为： $\begin{aligned} L(\theta) & = {1 \over 2} (y - \theta^Tx)^2 \\ {\partial L \over \partial \theta} & = (y - \theta^Tx) * (-x) \end{aligned}$ 具体实现代码如下：

def SGD(X, y, MaxIter, alpha):
    
    """
    compute theta by stochastic gradient descent
    
    Inputs:
    - X: A numpy array containing datas(with bias), of shape(m, n)
    - y: A numpy array containing result value, of shape(m, 1)
    - MaxIter: A int of maximum of iteration
    - alpha: A float, learning rate
    
    Return:
    - theta: A numpy array of weight, of shape(n, 1)
    """
    m, n = X.shape
    theta = np.zeros( (n, 1) )
    
    for i in range(MaxIter):
        theta = theta - alpha * np.dot(X.T, np.dot(X, theta) - y) 
        
    return theta

代码中实现的是SGD(Stochastic Gradient Descent)，而BGD(Batch Gradient Descent)或MBGD(Mini-Batch Gradient Descent)的实现只需要除以相应的样本容量即可。
上述就是通过梯度下降法来得到线性回归模型的参数的整个过程；整个过程正如上述所说：设置模型、设置损失函数和利用迭代法求参数。其他机器学习算法也大致遵守上述流程。

概率角度

下面将从一个参数估计的角度，来说明上述迭代法为什么可以求得线性回归模型的参数。
任何一个预测值与真值之间可以有这样的表示： $y^{(i)}=\theta^Tx^{(i)}+\xi^{(i)}$ 预测值和真值之间会有一个误差的存在，而这里 $\xi^{(i)} \sim N(0, \sigma^2)$ ；因此就有： $p(\xi^{(i)})=\frac{1}{\sqrt{2\pi} \sigma}e^{-\frac{(\xi^{(i)})^2}{2\sigma^2}}$ 又因为 $\xi^{(i)}=y^{(i)}-\theta^Tx^{(i)}$ 等价于： $y^{(i)} | x^{(i)};\theta \sim N(0, \sigma^2)$ 服从于正态分布，对该正态分布进行参数估计，其极大似然方程为： $L(\theta)=\prod_{i=1}^{m} p(y^{(i)} | x^{(i)};\theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi} \sigma}e^{-\frac{y^{(i)}-\theta^Tx^{(i)}}{2\sigma^2}}$ 其对数形式：
$l(\theta)=logL(\theta) \\ =log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi} \sigma}e^{-\frac{y^{(i)}-\theta^Tx^{(i)}}{2\sigma^2}} \\ =\sum_{i=1}^{m}log\frac{1}{\sqrt{2\pi} \sigma}e^{-\frac{y^{(i)}-\theta^Tx^{(i)}}{2\sigma^2}} \\ =mlog\frac{1}{\sqrt{2\pi}} - \frac{1}{\sigma^2}.\frac{1}{2}\sum_{i=1}^{m} {({y^{(i)}} - \theta^Tx^{(i)})}^2$
当我们需要最大化 $l(\theta)$ 时，等价于最小化 $\frac{1}{2}\sum_{i=1}^{m} {({y^{(i)}} - \theta^Tx^{(i)})}^2$ ；这部分就是上述所使用的MSE，就是在做线性回归的优化目标函数。而可以选择其他损失函数作为线性回归的损失函数，是因为其他损失函数的图像趋势与 $\frac{1}{2}\sum_{i=1}^{m} {({y^{(i)}} - \theta^Tx^{(i)})}^2$ 是一致的。

Normal Equation

对于上述的 $\partial L \over \partial \theta$ ，若严格按照矩阵形式来写： ${\partial L \over \partial \theta} = -x^T * (y - \theta^Tx)$ 由于需要计算损失函数 $L$ 的最小值，除了使用上述迭代法之外，根据数学知识，令其导数为0，求到相应的驻点则可确定使得 $L$ 取最小的 $\theta$ 的值，即： $\begin{aligned} {\partial L \over \partial \theta} = -X^T * (y - \theta^TX) & = 0 \\ -X^T * y + X^T \theta^TX & = 0 \\ X^T \theta^T X & = X^T * y \\ \theta & = {(X^TX)}^{-1} X^T y \end{aligned}$ 这就直接算得参数 $\theta$ 的值；这种方法被称为Normal Equation
具体代码如下：

def NormalEquation(X, y):
    
    """
    compute theta by Matrix equation
    
    Inputs:
    - X: numpy array containing datas, of shape(m, n)
    - y: numpy array containing result value, of shape(m, 1)
    
    Return:
    - theta: numpy array about model parameter, of shape(n, 1)
    """
    
    xTx = np.dot(X.T, X)
    
    if np.linalg.det( xTx ) == 0:
        raise Exception('xTx is not sigular')
        
    temp = np.dot(np.linalg.inv(xTx), X.T)
    theta = np.dot(temp, y)
    
    return  theta

这里有个需要注意的地方，就是要确保矩阵 $X^TX$ 是非奇异矩阵，这样可以求得该矩阵的逆。

局部线性加权回归(Locally weighted linear regression)

局部线性加权回归简单来说就是在训练的时候，给予每个样本点相应的权重，在训练的时候会带上其权重大小一起训练，因此参数 $\theta$ 的更新形式如下：
$\sum_{i=1}^{m}\omega^{(i)}{(y^{(i)} - \theta^Tx^{(i)})}^2$
而权重 $\omega^{(i)}$ 的其中一种计算方法为：
$\omega^{(i)} = e^{-\frac{{(x^{(i)} - x)}^2} {2k^2} }$
其中 $x$ 为需要进行预测的点；上式有点类似于svm中的kernel trick，若 $x^{(i)}$ 离参考点 $x$ 越近，则其权重越大；若 $x^{(i)}$ 离参考点 $x$ 越远，则其权重越小。下面是加入lwlr后的梯度下降法

def lwlrBatchGD(testPoint, X, Y, alpha, k):

    Xmat = np.mat(X);
    Ymat = np.mat(Y);
    Tmat = np.mat(testPoint);
    m = Xmat.shape[0];
    n = Xmat.shape[1];

    weights = np.mat( np.zeros((m,n)) );
    for i in range(m):
        weights[i] = np.exp( (Xmat[i] - Tmat) * (Xmat[i] - Tmat).T / (-2 * k * k) );
    wXmat = np.multiply(weights, Xmat);

    theta = np.mat( np.zeros((n,1)) );

    for i in range(1000):
      theta = theta - alpha * ((Xmat * theta - Ymat).T * wXmat).T / m;

    return Tmat * theta;

加入权重后，参数 $\theta$ 的解析解也可以写成：
$\theta = (X^T \omega X)^{-1}X^T\omega y$
下面是加入了lwlr的Normal Equation的代码：

def WeightNE(X, y, query, k):
    
    """
    implement locally weighted linear regression by Matrix equation
    
    Inputs:
    - X: numpy array containing datas, of shape(m, n)
    - y: numpy array containing result value, of shape(m, 1)
    - query: numpy array of query point, of shape(1, n)
    
    Return:
    - res: float, predict value about query,
    """
    
    weight = kernel(X, query, k)
    xTx = np.dot(X.T, weight * X)
    
    if np.linalg.det( xTx ) == 0:
        raise Exception('xTx is not sigular')
        
    temp = np.dot( np.linalg.inv(xTx), weight.T * X.T)
    theta = np.dot(temp, y)
    res = np.dot(query.T, theta)
    
    return res

局部线性加权回归这种方法，每次进行预测时，都需要带入待预测的点作为参考点，然后重新训练模型，再计算出相应的预测值。这种方法是为了解决使用线性回归时，模型欠拟合的问题。
个人觉得，局部线性加权回归与微积分中的以直代曲的思想类似；就是使用待预测点附近的样本来训练，得到相应的直线模型后再作预测。如下图，是在梯度下降法的基础上，使用局部线性加权回归得到了，其中k的值为0.025，可以看到最后得到的回归模型与数据分布有一个较好的相似趋势。

局部线性加权回归

通过控制k值来决定以参考点“附近”范围多大的点来做线性回归训练，然后通过一段段小直线来拼接成整体的曲线。
局部线性加权回归可以解决模型欠拟合的问题，但是需要注意的是，当k取得过小时，有可能造成模型的过拟合问题。

L2正则化

正则化也是解决模型过拟合常用的方法，进行L2正则化后，损失函数变为： $\begin{aligned} L(\theta) &= {1 \over 2} (X \theta - y)^2 + {\lambda \over 2} || \theta ||_2 ^2 \\ {\partial L \over \partial \theta} &= X^T (X\theta - y) + \lambda \theta \end{aligned}$ 加入L2正则化后，这种线性回归也叫做岭回归(Ridge Regression)。

L1正则化

L1正则化使用的是 $\theta$ 的1范数，损失函数为： $L(\theta) = {1 \over 2} (X \theta - y)^2 + {\lambda \over 2} || \theta ||_1$ 上式关于 $\theta$ 的求导，前半部分保持不变，后半部分根据 $\theta$ 的值生成同尺寸，取值范围为 $\{ 1, -1 \}$ 的矩阵，再乘以 $\lambda \over 2$ 相加即可得到。加入L1正则化后得到的参数 $\theta$ 通常是一个稀疏向量，因此L1正则化也有一种特征选择的作用。加入L1正则化后，这种线性回归也叫做LASSO回归(least absolute shrinkage and selection operator)。
正则化是一种缓解过拟合的一种技术，并不能说完全根除过拟合这一问题。