一元（多元）线性回归分析之Python语言实现

2018-06-12 本文已影响14人天善智能

欢迎关注天善智能，我们是专注于商业智能BI，人工智能AI，大数据分析与挖掘领域的垂直社区，学习，问答、求职一站式搞定！

对商业智能BI、大数据分析挖掘、机器学习，python，R等数据领域感兴趣的同学加微信：tsaiedu，并注明消息来源，邀请你进入数据爱好者交流群，数据爱好者们都在这儿。

618狂欢，天善学院全场课程6.18折等你来！投资知识，赋能自我，蝶变在即！ https://www.hellobi.com/618

本文来自天善智能社区专栏作者：okajun
相关文章链接：《一元（多元）线性回归分析之Excel实现》
《一元（多元）线性回归分析之R语言实现》

写《一元（多元）线性回归分析之Excel实现》的时候就说还要写《一元（多元）线性回归分析之R语言实现》和在Python中的实现，其实本篇的文档早就准备好，但是一直没有找到关于模型的检验方法，所以一直迟迟没有发布，今天先把我知道的分享出来，希望看到的各位多多指导，不吝赐教。

本文案例依然使用women数据集和salary数据集，请查阅上篇博文下载。

1.一元线性回归

1.1 使用sklearn，全部样本数据

先导入要使用的库和模块

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

导入数据集

women = pd.read_csv('C:\\Users\\HuZhanPeng\\Desktop\\Regression\\women.csv')
women.head()

Clipboard Image.png

先做数据探索，做散点图

X = women[['height']] # 注意这种写法，得到的X仍然是dataframe类型
y = women['weight'].values # 得到的y是numpy.ndarray类型

# 数据探索：画散点图查看分布情况
%pylab inline
plt.scatter(X, y, color='black')
plt.xlabel('height')
plt.ylabel('weight')

Clipboard Image.png

可以看到有weight和height有很明显的线性正相关关系，下面我们来建立线性模型。

regr = linear_model.LinearRegression()
regr.fit(X, y)

查看模型结果：

print('Intercept:{}'.format(regr.intercept_))
print('Coefficient:{}'.format(regr.coef_))

得到一元线性方程：weight=-87.52+3.45height

查看预测值(直线) vs. 真实值(散点)

plt.scatter(X, y, color='black')
plt.plot(X, regr.predict(X), linewidth=3, color='blue')
plt.xlabel('height')
plt.ylabel('weight')

Clipboard Image.png

1.2使用sklearn，训练集测试集

上文1.1我们使用了全部样本数据建模，得到了回归方程，通过上面的散点图来看模型效果不错，但是如何精确评估方程的效果呢？我并没有找到合适的方法。

下面我们使用sklearn官网上的案例，将women数据集拆分成训练集和测试集，并查看模型的效果如何。

建立模型：

# 使用测试集、训练集
trian_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=123)
regr1 = linear_model.LinearRegression()
regr1.fit(trian_X, train_y)

输出模型结果：

print('Intercept:{}'.format(regr1.intercept_))
print('Coefficient:{}'.format(regr1.coef_))

得到一元线性方程：weight=-87.77+3.46height，和1.1中的模型结果稍有区别，因为我们只用了80%的数据建模。

使用测试集来验证模型效果：

# 使用test_X预测y，得到y_pred1
y_pred1 = regr1.predict(test_X)
# Variance score越接近1越好
print("Mean squared error: %.2f"      % mean_squared_error(test_y, y_pred1))
print('Variance score: %.2f' % r2_score(test_y, y_pred1))

得到：Variance score: 0.97，说明模型效果很好。

1.3改进模型（增加二次项）

引入一个二次项x²进入模型：

from sklearn.preprocessing import PolynomialFeatures
ploy_reg = PolynomialFeatures(degree=2)
X_ = ploy_reg.fit_transform(X)
regr2 = linear_model.LinearRegression()
regr2.fit(X_, y)

X2 = X.sort_values(['height'])
X2_ = ploy_reg.fit_transform(X2)

作图查看模型拟合效果：

plt.scatter(X, y, color='black')
plt.plot(X2, regr2.predict(X2_), color='blue', linewidth=3)

Clipboard Image.png

查看模型检验结果：

y_pred2 = regr2.predict(ploy_reg.fit_transform(test_X))

print("Mean squared error: %.2f"      % mean_squared_error(test_y, y_pred2))
print('Variance score: %.2f' % r2_score(test_y, y_pred2))

得到：Variance score: 1.00，模型效果进一步提升。

1.4使用statsmodels

以上1.1-1.3都是使用的sklearn建模，下面我们使用statsmodels建模，使用此module建模的输出结果跟R类似。

import statsmodels.api as sm
X3 = sm.add_constant(X)
est = sm.OLS(y, X3)
est2 =est.fit()
print(est2.summary())

Clipboard Image.png

可以看到输出结果非常类似于R，结果不再解读，参见上一篇在R中实现的博文。

同样：将sm.add_constant(X)中的X，换成变换后的X_，可以得到加入二项式的回归方程，不再展示。

2.多元线性回归

同样适用salary数据集，导入数据并查看：

salary = pd.read_csv('C:\\Users\\HuZhanPeng\\Desktop\\Regression\\salary.csv')
salary.head()

Clipboard Image.png

哑变量处理，将gend转化为数值型：

salary = pd.concat([salary, pd.get_dummies(salary['gend'])], axis=1)del 
salary['gend']del 
salary['female']
salary.head()

Clipboard Image.png

在R中我们直接主观地把gend这边变量剔除了，现在Python中我们先保留，让模型选择是否留下此变量。

y = salary['salary'].values
X = salary[['age', 'company_age', 'education', 'male']]

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

Clipboard Image.png

可以看到male没有通过t检验。

我们使用AIC法则，来选择最优的模型：

predictorcols = ['age', 'company_age', 'education', 'male']
import itertools
AICs = {}
for k in range(1, len(predictorcols)+1):    
    for variables in itertools.combinations(predictorcols, k):        
        predictors = X[list(variables)]
        predictors2 = sm.add_constant(predictors)
        est = sm.OLS(y, predictors2)
        res = est.fit()
        AICs[variables] = res.aic

按照AIC值有小到大输出前10种模型组合：

from collections import Counter
c = Counter(AICs)
c.most_common()[::-10]

输出如下：

Clipboard Image.png

可以看到：最优的模型是将'age', 'company_age', 'education'这3个变量引入模型。

重新建模：

X_new = salary[['age', 'company_age', 'education']]
X3 = sm.add_constant(X_new)
est = sm.OLS(y, X3)
est2 = est.fit()
print(est2.summary())

Clipboard Image.png

可以看到新方程通过了F检验，系数都通过了t检验。

得到回归方程： salary = -4.463e+04 + 2303.8365age + 1952.7198company_age + 8052.9691*education

如果想使用岭回归，可以查看sklearn官网中岭回归的介绍。

补充：Python中如何进行正太分布检验、齐方差检验、共线性检验等我暂时还没有找到很好的方法。如果哪位知道，还请不吝赐教，谢谢~

一元（多元）线性回归分析之Python语言实现

1.一元线性回归

1.1 使用sklearn，全部样本数据

1.2使用sklearn，训练集测试集

1.3改进模型（增加二次项）

1.4使用statsmodels

2.多元线性回归

猜你喜欢

热点阅读