
Case Study 1: Kaggle House Price Prediction

2020-05-30  粉红狐狸_dhf

Source: https://www.bilibili.com/video/BV19b411z73K?p=2
Dataset: https://pan.baidu.com/s/1yZ1QuLaO6lz7sic40UHGvg (extraction code: 0wbt)

1 Reading the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df_train=pd.read_csv('E:/jupyter_lab/leetcode/data/house-prices-kaggle/train.csv',index_col=0)
df_test=pd.read_csv('E:/jupyter_lab/leetcode/data/house-prices-kaggle/test.csv',index_col=0)  
#index_col=0 sets the first column (Id) as the index

print('train.shape:',df_train.shape)
print('test.shape:',df_test.shape)
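
With index_col=0 moving Id into the index, this should print (1460, 80) for train and (1459, 79) for test, which is consistent with the (2919, 79) shape of the concatenated frame once SalePrice is removed below.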

2 Processing y

For regression models, the target should be made as close to normally distributed as possible. Transform y with log1p, i.e. log(x+1), and map predictions back later with expm1.

df_y=pd.DataFrame({'y_':df_train.SalePrice,'y_log1p':np.log1p(df_train.SalePrice)})
df_y.hist()
[Figure: histograms of the raw SalePrice and of log1p(SalePrice)]
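
As a quick sanity check (a minimal sketch), expm1 inverts log1p exactly, so predictions made on the log scale can always be mapped back:

y_raw=df_train.SalePrice
# np.expm1 computes exp(x)-1, undoing log1p's log(x+1)
assert np.allclose(np.expm1(np.log1p(y_raw)), y_raw)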

3 Merging train and test for Preprocessing

(1) Handling categorical variables

y=df_train.pop('SalePrice')
all_df=pd.concat((df_train,df_test),axis=0)
#axis=0 stacks the frames vertically (row-wise)
all_df.shape #(2919, 79)
# MSSubClass is a class label, i.e. categorical, so convert it to str
print('MSSubClass.type:',all_df.MSSubClass.dtypes)
all_df.MSSubClass=all_df.MSSubClass.astype(str)
all_df.MSSubClass.value_counts() #count occurrences of each level
#one-hot encode all categorical variables
dummy_all_df=pd.get_dummies(all_df)
dummy_all_df.head()
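
For intuition, here is a minimal sketch (on a hypothetical toy frame, not the real data) of what get_dummies does:

toy=pd.DataFrame({'MSSubClass':['20','60','20']})
print(pd.get_dummies(toy))
# one indicator column per level, e.g. MSSubClass_20 and MSSubClass_60,
# with 1 (or True, depending on the pandas version) marking each row's level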

(2) Handling numerical variables

Fill missing values with the column means.

#count missing values per column, sorted in descending order
dummy_all_df.isnull().sum().sort_values(ascending=False).head()
#fill missing values with the column means
cols_mean=dummy_all_df.mean()
dummy_all_df=dummy_all_df.fillna(cols_mean)
dummy_all_df.isnull().sum().sum()

Standardize the numerical columns. This step is not strictly required, but regression models generally work better on standardized inputs.

#select the numerical columns from all_df (before get_dummies): after get_dummies
#no column has dtype object, so the filter would also pick up the 0/1 dummy columns
numerical_cols=all_df.columns[all_df.dtypes != 'object']

numerical_cols_mean=dummy_all_df.loc[:,numerical_cols].mean()
numerical_cols_std=dummy_all_df.loc[:,numerical_cols].std()
dummy_all_df.loc[:,numerical_cols]=(dummy_all_df.loc[:,numerical_cols]-numerical_cols_mean)/numerical_cols_std
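
Equivalently, sklearn's StandardScaler performs the same z-scoring (a sketch, not the original code; note it uses the population standard deviation, while pandas' .std() uses the sample standard deviation, so the results differ slightly):

from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()
# subtract each column's mean and divide by its standard deviation
dummy_all_df.loc[:,numerical_cols]=scaler.fit_transform(dummy_all_df.loc[:,numerical_cols])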

4 Building Models

Split the combined data back into train and test sets.

dummy_train=dummy_all_df.loc[df_train.index]
dummy_test=dummy_all_df.loc[df_test.index]
dummy_train.shape

(1) Ridge Regression

Ridge regression is least squares plus an L2 penalty: a biased estimator that deliberately gives up a little accuracy to lower model complexity, i.e. regularization to prevent overfitting. The regularization term grows monotonically with model complexity, so when the loss is small (the fit is very tight) complexity tends to be high, and adding the penalty pushes complexity back down at the cost of some training accuracy. (L2 norm: the square root of the sum of squared parameters; L1 norm: the sum of the parameters' absolute values.)
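
To make the objective concrete, here is a minimal sketch (with hypothetical toy data) of the ridge loss, i.e. the least-squares term plus alpha times the squared L2 norm of the weights:

# hypothetical toy data, for illustration only
rng=np.random.default_rng(0)
X=rng.normal(size=(100,5))
w=rng.normal(size=5)
y_toy=X@w

def ridge_loss(w,X,y,alpha):
    # residual sum of squares + alpha * squared L2 penalty
    return np.sum((y-X@w)**2)+alpha*np.sum(w**2)

print(ridge_loss(w,X,y_toy,alpha=1.0))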

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score #cross-validation scoring
import matplotlib.pyplot as plt
%matplotlib inline

y_train=df_y.y_log1p
x_train=dummy_train.values
x_test=dummy_test.values
#convert the DataFrames to NumPy arrays for sklearn

#tune alpha
alphas=np.logspace(-2,3,50)# 50 values spaced logarithmically from 0.01 to 1000
test_scores=[]
for alpha in alphas:
    clf=Ridge(alpha)
    # per-fold RMSE: negate sklearn's negative MSE, then take the square root
    test_score=np.sqrt(-cross_val_score(clf,x_train,y_train,cv=10,scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

plt.plot(alphas,test_scores)
plt.title('Alpha vs CV Error')
plt.show()
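
To read the best alpha off programmatically instead of eyeballing the plot (a small sketch):

best_alpha=alphas[np.argmin(test_scores)] # alpha with the lowest mean CV RMSE
print('best alpha:',best_alpha)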
(2) RandomForestRegressor

from sklearn.ensemble import RandomForestRegressor


max_features=[.1,.3,.5,.7,.9,.99]
test_scores=[]
for max_feature in max_features :
    rlf=RandomForestRegressor(n_estimators=200,max_features=max_feature)
    test_score=np.sqrt(-cross_val_score(rlf,x_train,y_train,cv=5,scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(max_features,test_scores)
plt.title('Max_features vs CV Error')
plt.show()

These models will not necessarily give ever-smaller losses; the point of this article is the analysis workflow, and you can swap in models of your own.

5 Ensemble Models

Combine the tuned models in the spirit of stacking.

ridge=Ridge(alpha=400)
rlf=RandomForestRegressor(n_estimators=200,max_features=0.3) #keep the n_estimators used during tuning

ridge.fit(x_train,y_train)
rlf.fit(x_train,y_train)

y_ridge=np.expm1(ridge.predict(x_test))
y_rlf=np.expm1(rlf.predict(x_test))

#a stacking ensemble would feed the models' predictions to a second-stage learner; here we simply average them
y_en=(y_ridge+y_rlf)/2 #pseudo-ensemble
#submission format
submission_df=pd.DataFrame({'Id':df_test.index,'SalePrice':y_en})
submission_df.head()
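
To write the submission file, and as a sketch of what true stacking would look like (using sklearn's StackingRegressor, which is not part of the original tutorial):

submission_df.to_csv('submission.csv',index=False)

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression

# the second-stage learner is fit on the base models' out-of-fold predictions
stack=StackingRegressor(
    estimators=[('ridge',Ridge(alpha=400)),('rf',RandomForestRegressor(max_features=0.3))],
    final_estimator=LinearRegression(),
    cv=5)
stack.fit(x_train,y_train)
y_stack=np.expm1(stack.predict(x_test))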

6 More Advanced Ensemble Models

(1) Bagging: base learners are trained in parallel and their predictions are combined by voting (averaging, for regression).

The base learner, Ridge(15), is one of the models tuned earlier.

from sklearn.ensemble import BaggingRegressor  
from sklearn.model_selection import cross_val_score

ridge=Ridge(15)#base learner

params=[1,10,15,20,25,30,40]
test_scores=[]

for param in params :
    rlf=BaggingRegressor(n_estimators=param,base_estimator=ridge)#base_estimator defaults to a decision tree (renamed to estimator in sklearn>=1.2)
    test_score=np.sqrt(-cross_val_score(rlf,x_train,y_train,cv=5,scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(params,test_scores)
plt.title('Params vs CV Error')
plt.show()
(2) Boosting: sequential; each new base learner focuses on the samples the previous ones predicted poorly.

from sklearn.ensemble import AdaBoostRegressor  
from sklearn.model_selection import cross_val_score

ridge=Ridge(15)#base learner

params=[1,3,5,7,9,10,11,12,15]
test_scores=[]

for param in params :
    rlf=AdaBoostRegressor(n_estimators=param,base_estimator=ridge)#base_estimator defaults to a decision tree (renamed to estimator in sklearn>=1.2)
    test_score=np.sqrt(-cross_val_score(rlf,x_train,y_train,cv=5,scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

plt.plot(params,test_scores)
plt.title('Params vs CV Error')
plt.show()
(3) XGBoost: an improved take on boosting.

from xgboost import XGBRegressor
import warnings
warnings.filterwarnings("ignore")

params=[1,3,5,7,9,10,11,12,15]
test_scores=[]

for param in params :
    rlf=XGBRegressor(max_depth=param)#sweep tree depth; recent xgboost releases require keyword arguments
    test_score=np.sqrt(-cross_val_score(rlf,x_train,y_train,cv=5,scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

plt.plot(params,test_scores)
plt.title('Params vs CV Error')
plt.show()
[Figure: max_depth vs CV error for XGBRegressor]

The XGBoost predictions scored best here, which shows the power of ensemble methods.
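
To turn the tuned XGBoost model into a submission (a sketch; the best depth is read off the CV sweep above):

best_depth=params[np.argmin(test_scores)] # depth with the lowest mean CV RMSE
xgb=XGBRegressor(max_depth=best_depth)
xgb.fit(x_train,y_train)
y_xgb=np.expm1(xgb.predict(x_test)) # invert the log1p transform
submission_xgb=pd.DataFrame({'Id':df_test.index,'SalePrice':y_xgb})
submission_xgb.to_csv('submission_xgb.csv',index=False)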
