Pythoner集中营机器学习与数据挖掘大数据,机器学习,人工智能

python数据挖掘预测Boston房价

2018-08-24  本文已影响56人  在做算法的巨巨

一、数据介绍
此数据是一份源于美国某经济学杂志上,分析研究波士顿房价( Boston House
Price)的数据集。数据集中的每一行数据都是对波士顿周边或城镇房价的描述:
CRIM: 城镇人均犯罪率
ZN: 住宅用地所占比例
INDUS: 城镇中非住宅用地所占比例
CHAS: CHAS 虚拟变量,用于回归分析
NOX: 环保指数
RM: 每栋住宅的房间数
AGE: 1940 年以前建成的自住单位的比例
DIS: 距离 5 个波士顿的就业中心的加权距离。
RAD: 距离高速公路的便利指数
TAX: 每一万美元的不动产税率
PRTATIO: 城镇中的教师学生比例
B: 城镇中的黑人比例
LSTAT: 地区中有多少房东属于低收入人群
MEDV: 自住房屋房价中位数(也就是均价)

二、任务介绍
1、通过数据挖掘对影响波士顿房价的因素进行分析。
2、搭建一个波士顿房价预测模型。


数据预处理与特征分析

MEDV作为目标变量,也就是因变量,其他变量作为自变量。

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
data = pd.read_csv('C:\\RAM\\housing.csv')

检查有没有数据中有没有空值

data.isnull().any().sum()

反馈是0,说明该数据不需要对其进行空值处理,可以放心的进行后续工作了。

查看各个特征的散点分布

pd.plotting.scatter_matrix(data, alpha=0.7, figsize=(10,10), diagonal='kde')

scatter_matrix


corr = data.corr()

结合相关性变量,大体了解下数据的整体分布。


特征选择

特征维度较大,为了保证模型的高效预测,需要进行特征选择。每种特征都有自己含义和数据量级,单纯地依靠方差来判断可能效果不好,直接使用与目标变量的相关性强的变量作为最终的特征变量。

通过相关系数法进行特征选择

x = data[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT ']]
y = data[['MEDV']]
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
SelectKBest = SelectKBest(f_regression, k=3)
bestFeature = SelectKBest.fit_transform(x,y)
SelectKBest.get_support()
x.columns[SelectKBest.get_support()]

这里我们看出和波士顿房价相关性最强的三个因素,分别是,RM(每栋住宅的房间数),PTRATIO(城镇中的教师学生比例),LSTAT(地区中有多少房东属于低收入人群)。
还是具备一定逻辑性的,首先,房子越大房价自然高(不管在哪个地域),其次,师生比与房价成反比,教育的重视,教育资源越是富裕的地方,生源就会大,师生比自然会降低,周边的房价会升高,这就是所谓的“学区房”概念吧。关于有多少房东属于低收入人群和房价的负相关关系,这个也比较好理解,各种原因吧。

features = data[['RM', 'PTRATIO', 'LSTAT ']]
pd.plotting.scatter_matrix(features, alpha=0.7, figsize=(6,6), diagonal='hist')

特征归一化

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
for feature in features.columns:
    features['标准化'+feature] = scaler.fit_transform(features[[feature]])

#散点可视化,查看特征归一化后的数据
font={
      'family':'SimHei'
      }
matplotlib.rc('font', **font)
pd.plotting.scatter_matrix(features[['标准化RM', '标准化PTRATIO', '标准化LSTAT ']], alpha=0.7, figsize=(6,6), diagonal='hist')

数据集拆分

将数据拆分为训练数据及测试数据。

from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features[['标准化RM', '标准化PTRATIO', '标准化LSTAT ']], y, test_size=0.3,random_state=33)

模型选择

采用交叉验证的方法对模型进行评估

from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score

房价预测方面,打算尝试以下方法:

  1. 线性回归
    这应该是最简单也是最好理解的一种方法。
  2. 使用支持向量回归模型SVR
    之前仅仅是用SVM做过一些二分类问题(SVC),这次尝试下解决回归问题。
  3. KNN模型
    思路上感觉KNN可以做,但是不确定效果,可以尝试下。
  4. 决策树模型
    和KNN一样,感觉是可以做,但是具体效果不太确定。
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr_predict = cross_val_predict(lr,x_train, y_train, cv=5)
lr_score = cross_val_score(lr, x_train, y_train, cv=5)
lr_meanscore = lr_score.mean()
#SVR
from sklearn.svm import SVR
linear_svr = SVR(kernel = 'linear')
linear_svr_predict = cross_val_predict(linear_svr, x_train, y_train, cv=5)
linear_svr_score = cross_val_score(linear_svr, x_train, y_train, cv=5)
linear_svr_meanscore = linear_svr_score.mean()

poly_svr = SVR(kernel = 'poly')
poly_svr_predict = cross_val_predict(poly_svr, x_train, y_train, cv=5)
poly_svr_score = cross_val_score(poly_svr, x_train, y_train, cv=5)
poly_svr_meanscore = poly_svr_score.mean()

rbf_svr = SVR(kernel = 'rbf')
rbf_svr_predict = cross_val_predict(rbf_svr, x_train, y_train, cv=5)
rbf_svr_score = cross_val_score(rbf_svr, x_train, y_train, cv=5)
rbf_svr_meanscore = rbf_svr_score.mean()
from sklearn.neighbors import KNeighborsRegressor
score=[]
for n_neighbors in range(1,21):
    knn = KNeighborsRegressor(n_neighbors, weights = 'uniform' )
    knn_predict = cross_val_predict(knn, x_train, y_train, cv=5)
    knn_score = cross_val_score(knn, x_train, y_train, cv=5)
    knn_meanscore = knn_score.mean()
    score.append(knn_meanscore)
plt.plot(score)
plt.xlabel('n-neighbors')
plt.ylabel('mean-score')
image.png

从上图可以发现,随着n_neighbors的逐渐增大,模型预测能力逐渐增强,但是当n_neighbors大于2以后,模型评分趋于下降。因此我们选取n_neighbors=2。

n_neighbors=2
knn = KNeighborsRegressor(n_neighbors, weights = 'uniform' )
knn_predict = cross_val_predict(knn, x_train, y_train, cv=5)
knn_score = cross_val_score(knn, x_train, y_train, cv=5)
knn_meanscore = knn_score.mean()
#Decision Tree
from sklearn.tree import DecisionTreeRegressor
score=[]
for n in range(1,11):
    dtr = DecisionTreeRegressor(max_depth = n)
    dtr_predict = cross_val_predict(dtr, x_train, y_train, cv=5)
    dtr_score = cross_val_score(dtr, x_train, y_train, cv=5)
    dtr_meanscore = dtr_score.mean()
    score.append(dtr_meanscore)
plt.plot(np.linspace(1,10,10), score)
plt.xlabel('max_depth')
plt.ylabel('mean-score')

发现当max_depth为[3, 4, 5]时,决策时模型评分处于极值的样子。
因此取max_depth为4。

n=4
dtr = DecisionTreeRegressor(max_depth = n)
dtr_predict = cross_val_predict(dtr, x_train, y_train, cv=5)
dtr_score = cross_val_score(dtr, x_train, y_train, cv=5)
dtr_meanscore = dtr_score.mean()

模型评估

接下来,汇总下评分。

evaluating = {
        'lr':lr_y_score,
        'linear_svr':linear_svr_score,
        'poly_svr':poly_svr_score,
        'rbf_svr':rbf_svr_score,
        'knn':knn_score,
        'dtr':dtr_score
        }
evaluating = pd.DataFrame(evaluating)
evaluating.plot.kde(alpha=0.6,figsize=(8,7))
所有模型的得分分布密度
evaluating.hist(color='k',alpha=0.6,figsize=(8,7))
所有模型的得分分布柱状图

从以上两张图发现,kernerl为poly的SVR得分受数据影响明显,而且得分偏低,其他的几个模型类似linear/rbf的SVR,dtr,都呈现出相同的趋势,KNN模型应该是算是截至在现在得分最高的模型。

模型优化

接下来想想办法,看看SVR还能不能被拯救。
首先对线性核进行最优化。
线性核我们通过更改惩罚系数C来查看对模型的影响。

lSVR_score=[]
for i in [1,10,1e2,1e3,1e4]:
    linear_svr = SVR(kernel = 'linear', C=i)
    linear_svr_predict = cross_val_predict(linear_svr, x_train, y_train, cv=5)
    linear_svr_score = cross_val_score(linear_svr, x_train, y_train, cv=5)
    linear_svr_meanscore = linear_svr_score.mean()
    lSVR_score.append(linear_svr_meanscore)
plt.plot(lSVR_score)
线性核随C的变化而变化

观察C为[1,10,100,1000]时对模型的影响,发现当C为10时,模型评分处于极值状态,随着C的增大,模型得分趋于稳定变化不大,因此认为C为10时模型最优。而sklearn关于线性核的默认状态C为1。

linear_svr = SVR(kernel = 'linear', C=10)
linear_svr_predict = cross_val_predict(linear_svr, x_train, y_train, cv=5)
linear_svr_score = cross_val_score(linear_svr, x_train, y_train, cv=5)
linear_svr_meanscore = linear_svr_score.mean()

'poly'核优化,通过尝试修改惩罚系数C和degree。

polySVR_score=[]
for i in [1,10,1e2,1e3,1e4]:
    for j in np.linspace(1,10,10):
        poly_svr = SVR(kernel = 'poly', C=i, degree=j)
        poly_svr_predict = cross_val_predict(poly_svr, x_train, y_train, cv=5)
        poly_svr_score = cross_val_score(poly_svr, x_train, y_train, cv=5)
        poly_svr_meanscore = poly_svr_score.mean()
        polySVR_score.append(poly_svr_meanscore)
plt.plot(polySVR_score)
poly核最优化

从上图看出,对变量C来说,当C>10时,poly核模型得分普遍较高,当degree=2时,模型得分最高。我们查看SVR的API,发现,poly核默认的degree为3,C为1,可以从图中看出,优化后的模型得分增加不少。

poly_svr = SVR(kernel = 'poly', C=1000, degree=2)
poly_svr_predict = cross_val_predict(poly_svr, x_train, y_train, cv=5)
poly_svr_score = cross_val_score(poly_svr, x_train, y_train, cv=5)
poly_svr_meanscore = poly_svr_score.mean()

‘rbf'核优化,优化惩罚系数C和gamma。

for i in [1,10,1e2,1e3,1e4]:
    rbfSVR_score=[]
    for j in np.linspace(0.1,1,10):
        rbf_svr = SVR(kernel = 'rbf', C=i, gamma=j)
        rbf_svr_predict = cross_val_predict(rbf_svr, x_train, y_train, cv=5)
        rbf_svr_score = cross_val_score(rbf_svr, x_train, y_train, cv=5)
        rbf_svr_meanscore = rbf_svr_score.mean()
        rbfSVR_score.append(rbf_svr_meanscore)
    plt.plot(np.linspace(0.1,1,10),rbfSVR_score,label='C='+str(i))
    plt.legend()
plt.xlabel('gamma')
plt.ylabel('score')
rbf核最优化

从上图发现,gamma从0.1递增到1.0,步长为0.1,模型得分有递增的趋势,当C>=10时,gamma对模型的影响较小,当C=100时,gamma=0.5时,’rbf‘模型评分更高。

rbf_svr = SVR(kernel = 'rbf', C=100, gamma=0.5)
rbf_svr_predict = cross_val_predict(rbf_svr, x_train, y_train, cv=5)
rbf_svr_score = cross_val_score(rbf_svr, x_train, y_train, cv=5)
rbf_svr_meanscore = rbf_svr_score.mean()

接下来,对最优化的模型进行二次归类。

optimizer = {
        'lr':lr_score,
        'linear_svr':linear_svr_score,
        'poly_svr':poly_svr_score,
        'rbf_svr':rbf_svr_score,
        'knn':knn_score,
        'dtr':dtr_score
        }
optimizer= pd.DataFrame(optimizer)

可视化

optimizer.plot.kde(alpha=0.6,figsize=(8,7))
plt.xlabel('score')
最优化后的模型分数密度分布曲线
optimizer.hist(color='k',alpha=0.6,figsize=(8,7))
最优化后的模型得分分布柱状图

最优模型确定

对每个经过优化后的模型进行得分对比

此时发现,rbf核的SVR模型在优化后变成了最好的模型。线性核的SVR和线性回归因为策略的局限性,模型能力排在最后。

接下来对测试数据集进行预测。

#rbf
rbf_svr.fit(x_train,y_train)
rbf_svr_y_predict = rbf_svr.predict(x_test)
rbf_svr_y_predict_score=rbf_svr.score(x_test, y_test)
#KNN
knn.fit(x_train,y_train)
knn_y_predict = knn.predict(x_test)
knn_y_predict_score = knn.score(x_test, y_test)
#poly_svr
poly_svr.fit(x_train,y_train)
poly_svr_y_predict = poly_svr.predict(x_test)
poly_svr_y_predict_score = poly_svr.score(x_test, y_test)
#dtr
dtr.fit(x_train, y_train)
dtr_y_predict = dtr.predict(x_test)
dtr_y_predict_score = dtr.score(x_test, y_test)
#lr
lr.fit(x_train, y_train)
lr_y_predict = lr.predict(x_test)
lr_y_predict_score = lr.score(x_test, y_test)
#linear_svr
linear_svr.fit(x_train, y_train)
linear_svr_y_predict = linear_svr.predict(x_test)
linear_svr_y_predict_score = linear_svr.score(x_test, y_test)
predict_score = {
        'lr':lr_y_predict_score,
        'linear_svr':linear_svr_y_predict_score,
        'poly_svr':poly_svr_y_predict_score,
        'rbf_svr':rbf_svr_y_predict_score,
        'knn':knn_y_predict_score,
        'dtr':dtr_y_predict_score
        }
predict_score = pd.DataFrame(predict_score, index=['score']).transpose()
predict_score.sort_values(by='score',ascending = False)

预测结果排名


对各个模型的预测值整理

plt.scatter(np.linspace(0,151,152),y_test, label='predict data')
labelname=[
        'rbf_svr_y_predict',
        'knn_y_predict',
        'poly_svr_y_predict',
        'dtr_y_predict',
        'lr_y_predict',
        'linear_svr_y_predict']
y_predict={
        'rbf_svr_y_predict':rbf_svr_y_predict,
        'knn_y_predict':knn_y_predict[:,0],
        'poly_svr_y_predict':poly_svr_y_predict,
        'dtr_y_predict':dtr_y_predict,
        'lr_y_predict':lr_y_predict[:,0],
        'linear_svr_y_predict':linear_svr_y_predict
        }
y_predict=pd.DataFrame(y_predict)
for name in labelname:
    plt.plot(y_predict[name],label=name)
plt.xlabel('predict data index')
plt.ylabel('target')
plt.legend()

综合表现来看,rbf_svr的鲁棒性更强一点。

上一篇 下一篇

猜你喜欢

热点阅读