Polynomial Regression, Bias and Variance

2019-02-26  此间不留白

Preface

In previous lessons we studied regularized linear regression, bias, and variance in detail. In this exercise we will implement regularized linear regression in Python and use it to study the properties of models with different bias and variance.

Regularized Linear Regression

Problem: use the change in a reservoir's water level to predict the amount of water flowing out of the dam. The variable x holds historical measurements of the change in water level, and the variable y holds the corresponding outflow from the dam.
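Concretely, the cost function implemented below is the usual regularized squared-error cost, together with its gradient (note that the regularization term skips \theta_0):

J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2, \qquad h_\theta(x) = \theta^T x

\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j \quad (\text{no } \tfrac{\lambda}{m}\theta_j \text{ term for } j = 0)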

After importing the required modules and loading the data, the scatter plot of the training set looks like this:

import matplotlib.pyplot as plt
import numpy as np
import scipy.io as scio
import scipy.optimize as opt  # used later by train_linear_reg (fmin_cg)

# Load the data and plot the training set
plt.ion()
np.set_printoptions(formatter={'float': '{: 0.6f}'.format})
print('Loading and Visualizing data ...')
data = scio.loadmat('ex5data1.mat')
X = data['X']
y = data['y'].flatten()
Xval = data['Xval']
yval = data['yval'].flatten()
Xtest = data['Xtest']
ytest = data['ytest'].flatten()

m = y.size
plt.figure()
plt.scatter(X, y, c='r', marker="*")
plt.xlabel('Change in water level (x)')
plt.ylabel('Water flowing out of the dam (y)')
def linear_reg_cost_function(theta, x, y, lmd):
    # Initialize some useful values
    m = y.size

    # Regularized squared-error cost (theta_0 is not regularized)
    hypo = np.dot(x, theta)
    cost = 1 / (2 * m) * np.sum(np.subtract(hypo, y) ** 2) + lmd / (2 * m) * np.sum(theta[1:] ** 2)

    # Gradient: (1/m) * X^T (h - y) plus the regularization term (zero for theta_0)
    reg = theta * (lmd / m)
    reg[0] = 0
    grad = 1 / m * np.dot(x.T, np.subtract(hypo, y)) + reg

    return cost, grad
theta = np.ones(2)
cost, _ = linear_reg_cost_function(theta, np.c_[np.ones(m), X], y, 1)

print('Cost at theta = [1  1]: {:0.6f}\n(this value should be about 303.993192)'.format(cost))

Running the code above prints the cost at theta = [1, 1], which should be about 303.993192.
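As an extra sanity check, the gradient at the same point can be inspected too (a minimal sketch reusing the function above; the expected values are taken from the original ex5 exercise):

_, grad = linear_reg_cost_function(theta, np.c_[np.ones(m), X], y, 1)
print('Gradient at theta = [1  1]: {}'.format(grad))
# Expected per the original exercise: about [-15.303016, 598.250744]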



def train_linear_reg(x, y, lmd):
    initial_theta = np.ones(x.shape[1])

    def cost_func(t):
        return linear_reg_cost_function(t, x, y, lmd)[0]

    def grad_func(t):
        return linear_reg_cost_function(t, x, y, lmd)[1]

    theta, *unused = opt.fmin_cg(cost_func, initial_theta, grad_func, maxiter=200, disp=False, full_output=True)

    return theta

# Fit and plot the linear regression line
lmd = 0
theta = train_linear_reg(np.c_[np.ones(m), X], y, lmd)
plt.scatter(X, y, c='r', marker="x")
# Plot fit over the data
plt.plot(X, np.dot(np.c_[np.ones(m), X], theta))

Running the code above yields the figure below; clearly, the straight line fits the scattered data poorly.


Bias and Variance

Bias and variance are central concepts in machine learning: high bias leads to underfitting, while high variance leads to overfitting. We will plot the training and cross-validation errors on a learning curve to diagnose bias and variance problems.

Note: when computing the training error for the learning curve, do not use the whole training set but only the subset of the first n examples, X[:n] and y[:n] (X[:i+1] and y[:i+1] in the code below). The errors themselves are computed without regularization, as the formulas below show.
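That is, the errors plotted on the learning curve are the unregularized costs:

J_{train}(\theta) = \frac{1}{2n}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2

J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}\left(h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)}\right)^2

where n is the size of the current training subset and the cross-validation error is always computed over the full validation set.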

The code is as follows:

def learning_curve(X, y, Xval, yval, lmd):
    # Number of training examples
    m = X.shape[0]

    error_train = np.zeros(m)
    error_val = np.zeros(m)
    for i in range(m):
        x_i = X[:i+1]
        y_i = y[:i+1]
        theta = train_linear_reg(x_i, y_i, lmd)

        error_train[i] = linear_reg_cost_function(theta, x_i, y_i, 0)[0]
        error_val[i] = linear_reg_cost_function(theta, Xval, yval, 0)[0]

    return error_train, error_val

lmd = 0
error_train, error_val = learning_curve(np.c_[np.ones(m), X], y, np.c_[np.ones(Xval.shape[0]), Xval], yval, lmd)

plt.figure()
plt.plot(np.arange(1, m + 1), error_train, np.arange(1, m + 1), error_val)
plt.title('Learning Curve for Linear Regression')
plt.legend(['Train', 'Cross Validation'])
plt.xlabel('Number of Training Examples')
plt.ylabel('Error')
plt.axis([0, 13, 0, 150])

Running the code above produces the learning curve below. Both the training and cross-validation errors remain high as the number of training examples grows, which is the signature of a high-bias (underfitting) model.


Polynomial Regression

The experiments above show that a linear hypothesis underfits this data (high bias); the scatter plot makes it clear that the relationship between x and y is not linear. Polynomial regression avoids this by mapping the single feature onto its powers, giving the hypothesis

h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_p x^p

which is still linear regression in the mapped features x_j = x^j.

# Map X onto polynomial features, producing an m x p matrix
def poly_features(X, p):
    P = np.arange(1, p + 1)
    X_poly = X.reshape((X.size, 1)) ** P
    return X_poly

# Feature normalization (cf. multivariate linear regression)
def feature_normalize(X):
    mu = np.mean(X, 0)
    sigma = np.std(X, 0, ddof=1)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma
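As a quick sanity check of the feature mapping (a hypothetical example, not part of the original exercise):

# Mapping the single value 2.0 up to the 3rd power should give [2, 4, 8]
print(poly_features(np.array([2.0]), 3))   # -> [[ 2.  4.  8.]]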

# Build the polynomial feature matrices
p = 5  # polynomial degree

# Map and normalize the training set
X_poly = poly_features(X, p)
X_poly, mu, sigma = feature_normalize(X_poly)
X_poly = np.c_[np.ones(m), X_poly]

# Map and normalize the test set (reusing the training set's mu and sigma)
X_poly_test = poly_features(Xtest, p)
X_poly_test -= mu
X_poly_test /= sigma
X_poly_test = np.c_[np.ones(X_poly_test.shape[0]), X_poly_test]

# Map and normalize the validation set (reusing the training set's mu and sigma)
X_poly_val = poly_features(Xval, p)
X_poly_val -= mu
X_poly_val /= sigma
X_poly_val = np.c_[np.ones(X_poly_val.shape[0]), X_poly_val]

print('Normalized Training Example 1 : \n{}'.format(X_poly[0, :]))

Running the code above prints the first normalized training example.


def plot_fit(min_x, max_x, mu, sigma, theta, p):
    x = np.arange(min_x - 15, max_x + 25, 0.05)

    X_poly = poly_features(x, p)
    X_poly -= mu
    X_poly /= sigma

    X_poly = np.c_[np.ones(x.size), X_poly]

    plt.plot(x, np.dot(X_poly, theta))


lmd = 0
theta = train_linear_reg(X_poly, y, lmd)

# Plot training data and fit
plt.figure()
plt.scatter(X, y, c='r', marker="x")
plot_fit(np.min(X), np.max(X), mu, sigma, theta, p)
plt.xlabel('Change in water level (x)')
plt.ylabel('Water flowing out of the dam (y)')
plt.ylim([0, 60])
plt.title('Polynomial Regression Fit (lambda = {})'.format(lmd))

error_train, error_val = learning_curve(X_poly, y, X_poly_val, yval, lmd)
plt.figure()
plt.plot(np.arange(1, m + 1), error_train, np.arange(1, m + 1), error_val)
plt.title('Polynomial Regression Learning Curve (lambda = {})'.format(lmd))
plt.legend(['Train', 'Cross Validation'])
plt.xlabel('Number of Training Examples')
plt.ylabel('Error')
plt.axis([0, 13, 0, 150])

print('Polynomial Regression (lambda = {})'.format(lmd))
print('# Training Examples\tTrain Error\t\tCross Validation Error')
for i in range(m):
    print('  \t{}\t\t{}\t{}'.format(i + 1, error_train[i], error_val[i]))

With \lambda initially set to 0, running the code above gives the following result:


The polynomial model clearly overfits: it tracks the training points almost perfectly (near-zero training error) while the cross-validation error stays much larger. By trying a range of increasing \lambda values and recording the training and validation errors for each, we can select a suitable regularization parameter:
def validation_curve(X, y, Xval, yval):  
    lambda_vec = np.array([0., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10])   
    error_train = np.zeros(lambda_vec.size)
    error_val = np.zeros(lambda_vec.size)

    for i in range(lambda_vec.size):
        lmd = lambda_vec[i]
        theta = train_linear_reg(X, y, lmd)

        error_train[i] = linear_reg_cost_function(theta, X, y, 0)[0]
        error_val[i] = linear_reg_cost_function(theta, Xval, yval, 0)[0]
    return lambda_vec, error_train, error_val


lambda_vec, error_train, error_val = validation_curve(X_poly, y, X_poly_val, yval)

plt.figure()
plt.plot(lambda_vec, error_train, lambda_vec, error_val)
plt.legend(['Train', 'Cross Validation'])
plt.xlabel('lambda')
plt.ylabel('Error')

Running the code above produces the curve below. A suitable \lambda is the one that minimizes the cross-validation error (roughly where the training and validation curves come closest); in the original exercise this turns out to be about \lambda = 3.
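As a final check (a minimal sketch that goes slightly beyond the original post), the best \lambda can be selected programmatically and the resulting model evaluated on the held-out test set; according to the original ex5 exercise, the test error should be about 3.8599 for \lambda = 3:

# Pick the lambda with the lowest cross-validation error
best_lambda = lambda_vec[np.argmin(error_val)]
theta = train_linear_reg(X_poly, y, best_lambda)

# Report the unregularized error on the test set, as with the other curves
error_test = linear_reg_cost_function(theta, X_poly_test, ytest, 0)[0]
print('Best lambda: {}'.format(best_lambda))
print('Test error: {:0.4f} (about 3.8599 for lambda = 3 per the exercise)'.format(error_test))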
