Case Study 1: Kaggle House Price Prediction
2020-05-30
粉红狐狸_dhf
Source: https://www.bilibili.com/video/BV19b411z73K?p=2
Dataset: https://pan.baidu.com/s/1yZ1QuLaO6lz7sic40UHGvg (extraction code: 0wbt)
1 Reading the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df_train=pd.read_csv('E:/jupyter_lab/leetcode/data/house-prices-kaggle/train.csv',index_col=0)
df_test=pd.read_csv('E:/jupyter_lab/leetcode/data/house-prices-kaggle/test.csv',index_col=0)
#index_col=0 uses the first column (Id) as the index
print('train.shape:',df_train.shape)
print('test.shape:',df_test.shape)
2 Processing the target y
Regression models work best when the target is roughly normally distributed. Transform y with log1p, i.e. log(x+1), to make it closer to normal, and use expm1 to transform predictions back.
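A quick sanity check (not in the original) that expm1 exactly undoes log1p:
import numpy as np
x = np.array([0.0, 100.0, 200000.0])
assert np.allclose(np.expm1(np.log1p(x)), x)  # expm1 inverts log1p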
df_y=pd.DataFrame({'y_':df_train.SalePrice,'y_log1p':np.log1p(df_train.SalePrice)})
df_y.hist()
(Histograms of SalePrice before and after the log1p transform)
3 Concatenating train and test for joint preprocessing
- Categorical columns: convert to one-hot encoding with pd.get_dummies() (see the toy example after this list)
- Numerical columns: handle missing values, skew, and so on
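A toy example (illustrative values only, not from the dataset) of what pd.get_dummies does:
import pandas as pd
toy = pd.DataFrame({'MSZoning': ['RL', 'RM', 'RL']})
print(pd.get_dummies(toy))
# Each level becomes its own indicator column: MSZoning_RL, MSZoning_RM
# (shown as 0/1 or True/False depending on the pandas version)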
(1) Handling categorical variables
y=df_train.pop('SalePrice')
all_df=pd.concat((df_train,df_test),axis=0)
#axis=0: concatenate vertically (stack rows)
all_df.shape #(2919, 79)
# MSSubClass is a class label, i.e. categorical; convert it to str
print('MSSubClass.type:',all_df.MSSubClass.dtypes)
all_df.MSSubClass=all_df.MSSubClass.astype(str)
all_df.MSSubClass.value_counts() #count occurrences of each level
#one-hot encode all categorical variables
dummy_all_df=pd.get_dummies(all_df)
dummy_all_df.head()
(2) Handling numerical variables
Fill missing values with the column means.
#number of missing values per column, in descending order
dummy_all_df.isnull().sum().sort_values(ascending=False).head()
#fill missing values with the column means
cols_mean=dummy_all_df.mean()
dummy_all_df=dummy_all_df.fillna(cols_mean)
dummy_all_df.isnull().sum().sum()
Standardize the numerical columns. This step is not mandatory, but regression models generally work better with standardized inputs.
#select all numerical columns
numerical_cols=dummy_all_df.columns[dummy_all_df.dtypes != 'object']
numerical_cols_mean=dummy_all_df.loc[:,numerical_cols].mean()
numerical_cols_std=dummy_all_df.loc[:,numerical_cols].std()
dummy_all_df.loc[:,numerical_cols]=(dummy_all_df.loc[:,numerical_cols]-numerical_cols_mean)/numerical_cols_std
4 Building models
Split the combined data back into a training set and a test set.
dummy_train=dummy_all_df.loc[df_train.index]
dummy_test=dummy_all_df.loc[df_test.index]
dummy_train.shape
(1) Ridge Regression
Ridge regression is least squares plus an L2 penalty. It is a biased estimator that deliberately gives up a little fitting accuracy to reduce model complexity, i.e. regularization, as a way to prevent overfitting. The regularization term grows with model complexity: when the loss is small (the fit is very tight), the model tends to be complex, so adding the penalty pushes complexity back down at the cost of some accuracy, which guards against overfitting. (L2 norm: the square root of the sum of the squared parameters; L1 norm: the sum of the absolute values of the parameters.)
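For reference, sklearn's Ridge minimizes the objective ||y - Xw||^2 + alpha * ||w||^2, so a larger alpha shrinks the weights w more strongly.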
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score  # cross-validation scoring
import matplotlib.pyplot as plt
%matplotlib inline
y_train=df_y.y_log1p
x_train=dummy_train.values
x_test=dummy_test.values
#convert the DataFrames to NumPy arrays to match sklearn's API
#try different regularization strengths
alphas=np.logspace(-2,3,50)  # 0.01 ~ 1000
test_scores=[]
for alpha in alphas:
    clf = Ridge(alpha=alpha)
    test_score = np.sqrt(-cross_val_score(clf, x_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(alphas, test_scores)
plt.title('Alpha vs CV Error')
plt.show()
(2) RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
max_features=[.1,.3,.5,.7,.9,.99]
test_scores=[]
for max_feature in max_features:
    rlf = RandomForestRegressor(n_estimators=200, max_features=max_feature)
    test_score = np.sqrt(-cross_val_score(rlf, x_train, y_train, cv=5, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(max_features, test_scores)
plt.title('max_features vs CV Error')
plt.show()
The loss does not necessarily keep shrinking from one model to the next; the point of this article is the analysis workflow, and you can substitute models of your own.
5 Ensemble models
Combine the tuned models using the idea of stacking.
ridge=Ridge(alpha=400)
rlf=RandomForestRegressor(max_features=0.3)
ridge.fit(x_train,y_train)
rlf.fit(x_train,y_train)
y_ridge=np.expm1(ridge.predict(x_test))
y_rlf=np.expm1(rlf.predict(x_test))
#an ensemble feeds the base models' predictions into a further model; here we simply average them
y_en=(y_ridge+y_rlf)/2  # a "pseudo-ensemble": plain averaging
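For comparison, a true stacking ensemble trains a meta-model on the base models' predictions. A minimal sketch (not part of the original tutorial) using sklearn's StackingRegressor, available in sklearn >= 0.22:
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge, LinearRegression
stack = StackingRegressor(
    estimators=[('ridge', Ridge(alpha=400)), ('rf', RandomForestRegressor(max_features=0.3))],
    final_estimator=LinearRegression())  # meta-model fit on out-of-fold base predictions
stack.fit(x_train, y_train)
y_stack = np.expm1(stack.predict(x_test))  # back-transform the log1p target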
#submission format
submission_df=pd.DataFrame({'Id':df_test.index,'SalePrice':y_en})
submission_df.head()
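To actually submit to Kaggle, write this DataFrame to a CSV file (the file name here is just an example):
submission_df.to_csv('submission.csv', index=False)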
6 More advanced ensembles
(1) Bagging: base learners are trained in parallel and their predictions are combined by voting/averaging.
The base learner here is Ridge(15), the ridge model with the alpha tuned earlier.
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
ridge=Ridge(15)  # base learner
params=[1,10,15,20,25,30,40]
test_scores=[]
for param in params:
    rlf = BaggingRegressor(n_estimators=param, base_estimator=ridge)  # base_estimator defaults to a decision tree
    test_score = np.sqrt(-cross_val_score(rlf, x_train, y_train, cv=5, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title('n_estimators vs CV Error')
plt.show()
(2) Boosting: base learners are trained sequentially, and the examples that are hard to predict are emphasized for the next learner.
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import cross_val_score
ridge=Ridge(15)  # base learner
params=[1,3,5,7,9,10,11,12,15]
test_scores=[]
for param in params:
    rlf = AdaBoostRegressor(n_estimators=param, base_estimator=ridge)  # base_estimator defaults to a decision tree
    test_score = np.sqrt(-cross_val_score(rlf, x_train, y_train, cv=5, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
plt.title('n_estimators vs CV Error')
plt.show()
(3) XGBoost: an improved implementation of boosting.
from xgboost import XGBRegressor
import warnings
warnings.filterwarnings("ignore")
params=[1,3,5,7,9,10,11,12,15]
test_scores=[]
for param in params:
    rlf = XGBRegressor(max_depth=param)  # tune the tree depth
    test_score = np.sqrt(-cross_val_score(rlf, x_train, y_train, cv=5, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
plt.title('max_depth vs CV Error')
plt.show()
(Plot: CV error against max_depth for XGBoost)
Among these, the XGBoost predictions give the best result, which shows the power of ensemble methods.
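As a final step (not shown in the original), you could refit XGBoost on the full training set with the max_depth chosen from the plot above and build a submission in the same format as before; a minimal sketch, assuming max_depth=5 (a hypothetical choice):
xgb = XGBRegressor(max_depth=5)
xgb.fit(x_train, y_train)
y_xgb = np.expm1(xgb.predict(x_test))  # undo the log1p transform on the target
submission_xgb = pd.DataFrame({'Id': df_test.index, 'SalePrice': y_xgb})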