Kaggle Case Study: Predicting Titanic Survival with sklearn
Titanic
Background: the Kaggle Titanic competition
- Competition Description
In the 1912 Titanic disaster, 1502 of the 2224 passengers and crew aboard were killed. With the benefit of hindsight, we have data on the passengers, and for some of them we also know whether they survived. We want to explore this data for some hidden insights, and along the way predict whether the remaining passengers were rescued!
# -*- coding: UTF-8 -*-
import os
# data processing
import pandas as pd
import numpy as np
import random
import sklearn.preprocessing as preprocessing
# visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
titanic = pd.read_csv('train.csv')
titanic_test = pd.read_csv('test.csv')
#submission_sample = pd.read_csv('gender_submission.csv')
print('_'*40)
print(titanic.describe())
titanic.info()
print(titanic.head())
print(titanic.tail())
print(titanic_test.describe())
titanic_test.info()
print(titanic_test.head())
print(titanic_test.tail())
print('_'*40)
print(titanic.columns)
print(titanic_test.columns)
The data contain 12 variables in total: 7 numeric and 5 categorical. Let's take a closer look at each.
PassengerId: the passenger's index. It obviously says nothing about survival and only distinguishes rows, so we won't consider it.
Survived: the passenger's survival outcome, the target variable we want to predict. From its mean we can see that roughly 38% of passengers survived.
Pclass: socio-economic class. This is clearly related to survival: wealthier passengers in higher-class cabins probably enjoyed better service and were often given priority in the evacuation, so this is definitely a variable we keep.
Name: at first glance the name seems useless, since a name alone can't tell you whether someone was rescued. But looking closely at the data, every name contains a title such as Mr, Mrs or Miss, which hints at sex and age. So let's simply turn the name into a categorical variable with the three states Mr, Mrs and Miss. To feed it to a machine-learning model we need an encoding; the most direct idea is 0, 1, 2, but is that really reasonable? In terms of distance, Mr would then be closer to Mrs than to Miss, which is clearly inappropriate, because we regard the three states as equal.
So, key point: for categorical variables like this we usually apply one-hot encoding. What does that mean? With n states we use an n-bit code, and each state maps to a code with a single 1 and 0s everywhere else. From a vector point of view these are the n basis vectors of an n-dimensional space, so the states are obviously on an equal footing. Here we encode Mr, Mrs and Miss as 100, 010 and 001 respectively.
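As a minimal sketch of this encoding (on a toy `Title` column, not the real data), `pandas.get_dummies` produces exactly this kind of one-indicator-column-per-state layout:

```python
import pandas as pd

# toy Title column with the three states discussed above
df = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Miss', 'Mr']})

# one indicator column per state; every row has a single 1
dummies = pd.get_dummies(df['Title'], prefix='Title', dtype=int)
print(dummies.columns.tolist())  # ['Title_Miss', 'Title_Mr', 'Title_Mrs']
print(dummies.iloc[0].tolist())  # Mr -> [0, 1, 0]
```

Each row sums to exactly 1, which is what makes the three states equidistant basis vectors.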
Sex: sex is surely important. After all, "ladies first" is a universal rule, so when disaster struck, the gentlemen would have let the ladies escape first, and female survival should be much higher. Like the titles, sex is a set of equal states, so we will one-hot encode it as well.
Age: like sex, this will obviously play an important role, since respect for the old and care for the young are always valued, but age affects survival mainly through which age bracket a person falls into. So we turn it into a categorical variable: under 18 is child, 18 to 50 is adult, over 50 is elder, and then one-hot encode it. One more problem: only 714 of the age values are present. Such an important variable cannot stay incomplete, so we have to find a way to fill it in.
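The bracketing just described can be sketched with `pandas.cut` (18 and 50 are the cut-offs from the text; the 0 and 120 outer bounds are just illustrative assumptions):

```python
import pandas as pd

ages = pd.Series([5, 23, 67, 15, 40])
# bins: (0, 18] -> child, (18, 50] -> adult, (50, 120] -> elder
age_group = pd.cut(ages, bins=[0, 18, 50, 120],
                   labels=['child', 'adult', 'elder'])
print(age_group.tolist())  # ['child', 'adult', 'elder', 'child', 'adult']
```

The resulting categorical column can then be one-hot encoded exactly like the titles.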
Key point again: how do we handle missing values? The simplest approach is to throw away the samples with missing values; this suits situations where there are plenty of samples and the loss is affordable — the information isn't fully used, but no extra error is introduced. A half-hearted approach is to fill the gaps with the mean or the median; this is usually the most convenient option, but it tends to introduce a fair amount of error. The most responsible approach is to estimate the missing values from the other variables, which is usually more reliable — though, to be fair, any estimate inevitably carries some error; it just feels better...
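The first two strategies can be sketched on a toy series (the third, model-based strategy needs more than one column, and is what we do for Age later):

```python
import pandas as pd

s = pd.Series([22.0, 38.0, None, 35.0, None])

dropped = s.dropna()           # strategy 1: discard samples with missing values
filled = s.fillna(s.median())  # strategy 2: fill with the median

print(len(dropped), filled.iloc[2])  # 3 35.0
```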
SibSp: the number of siblings or spouses aboard. I honestly can't say how this affects the final outcome, but it may well be useful when predicting age.
Parch: the number of parents or children aboard. Similar to the previous variable: I haven't found a particularly good direct use for it, but again it should be handy when predicting age.
Ticket: the ticket number. Frankly, this cryptic string tells me nothing useful, so I drop it without hesitation.
Fare: the ticket price. Its effect is similar to social class: the higher the fare, the better the service, so the chance of being rescued should be relatively higher. This variable must go into the model.
Cabin: the cabin number. It might leak a little cabin-class information, but honestly it has far too many missing values; imputing it would probably introduce more error than the information it provides, so I reluctantly say goodbye to it.
Embarked: the port of embarkation. In principle this shouldn't matter much, but since it is a categorical variable with only three states, we may as well process it and put it into the model; who knows, it might help. It also has two missing values; rather than making a big fuss predicting them, we simply set them to S, the most common port.
Now we have a rough plan for every variable: one-hot encode the categorical ones. What about the numeric ones — can we use them directly?
Key point! Numeric variables are usually normalized first. Bringing all dimensions into a similar range speeds up convergence and is a huge benefit for distance-based classifiers, although it makes no difference for tree-based algorithms. Here we will simply normalize all the numeric variables.
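A minimal sketch with sklearn's `StandardScaler` (toy fare values, not the real column):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

fares = np.array([[7.25], [71.28], [8.05], [53.1]])
fares_scaled = StandardScaler().fit_transform(fares)

# after scaling, the column has mean 0 and unit standard deviation
print(fares_scaled.mean(), fares_scaled.std())
```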
The plan should be clear by now; let's lay out the steps:
1 Drop the PassengerId and Ticket variables; we don't use them.
2 Fill in Embarked, then one-hot encode Survived, Name, Sex and Embarked.
3 Normalize Pclass, Fare, SibSp and Parch.
4 Predict Age from Name, Sex, SibSp and Parch and fill it in.
5 Normalize Age.
6 Take the unencoded Survived column out as the target variable.
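The steps above can be sketched end-to-end on a tiny made-up frame (column names mirror the Titanic data; the values are invented):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Ticket': ['A1', 'B2', 'C3'],
    'Pclass': [3, 1, 2],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', None, 'C'],
    'Fare': [7.25, 71.28, 8.05],
    'Survived': [0, 1, 1],
})
df = df.drop(['PassengerId', 'Ticket'], axis=1)  # step 1: drop unused columns
df['Embarked'] = df['Embarked'].fillna('S')      # step 2: fill, then one-hot encode
df = pd.get_dummies(df, columns=['Sex', 'Embarked'])
df[['Pclass', 'Fare']] = StandardScaler().fit_transform(df[['Pclass', 'Fare']])  # step 3
y = df.pop('Survived')                           # step 6: separate the target
print(sorted(df.columns))  # ['Embarked_C', 'Embarked_S', 'Fare', 'Pclass', 'Sex_female', 'Sex_male']
```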
Let's look at the correlations between the numeric variables.
print('_'*40)
sns.set(context="paper", font="monospace")
sns.set(style="white")
f, ax = plt.subplots(figsize=(10,6))
train_corr = titanic.drop('PassengerId',axis=1).corr()
sns.heatmap(train_corr, ax=ax, vmax=.9, square=True)
ax.set_xticklabels(train_corr.index, size=15)
ax.set_yticklabels(train_corr.columns[::-1], size=15)
ax.set_title('train feature corr', fontsize=20)
Data processing: missing-value imputation, normalization, and encoding.
Missing-value imputation
print('_'*40)
print('***********Train*************')
print(titanic.isnull().sum())
print('***********Test*************')
print(titanic_test.isnull().sum())
print('_'*40)
# fill missing Age values with the median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
print(titanic.describe())
Predicting Age to fill in the missing values
from sklearn.ensemble import RandomForestRegressor
def set_missing_age(data):
    # use the title, sex and family-size features to predict the missing ages
    train = data[['Age','SibSp_scaled','Parch_scaled','Name_0','Name_1','Name_2','Sex_0','Sex_1']]
    known_age = train[train.Age.notnull()].values
    unknown_age = train[train.Age.isnull()].values
    y = known_age[:, 0]
    x = known_age[:, 1:]
    rf = RandomForestRegressor(random_state=0, n_estimators=200, n_jobs=-1)
    rf.fit(x, y)
    print(rf.score(x, y))
    predictage = rf.predict(unknown_age[:, 1:])
    data.loc[data.Age.isnull(), 'Age'] = predictage
    return data, rf
# 'data' is the training frame produced by the encoding and scaling cells below
data, rf = set_missing_age(data)
Age_scale = StandardScaler().fit(data['Age'].values.reshape(-1, 1))
data['Age_scaled'] = Age_scale.transform(data['Age'].values.reshape(-1, 1))
train_x = data[['Sex_0','Sex_1','Embarked_0','Embarked_1','Embarked_2','Name_0','Name_1','Name_2','Pclass_scaled','Age_scaled','Fare_scaled']].values
train_y = data['Survived'].values
Normalization
from sklearn.preprocessing import StandardScaler
Pclass_scale = StandardScaler().fit(data['Pclass'].values.reshape(-1, 1))
data['Pclass_scaled'] = Pclass_scale.transform(data['Pclass'].values.reshape(-1, 1))
Fare_scale = StandardScaler().fit(data['Fare'].values.reshape(-1, 1))
data['Fare_scaled'] = Fare_scale.transform(data['Fare'].values.reshape(-1, 1))
SibSp_scale = StandardScaler().fit(data['SibSp'].values.reshape(-1, 1))
data['SibSp_scaled'] = SibSp_scale.transform(data['SibSp'].values.reshape(-1, 1))
Parch_scale = StandardScaler().fit(data['Parch'].values.reshape(-1, 1))
data['Parch_scaled'] = Parch_scale.transform(data['Parch'].values.reshape(-1, 1))
One-hot encoding
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
#ohe_pclass=OneHotEncoder(sparse=False).fit(data[['Pclass']])
#Pclass_ohe=ohe_pclass.transform(data[['Pclass']])
le_sex=LabelEncoder().fit(data['Sex'])
Sex_label=le_sex.transform(data['Sex'])
ohe_sex=OneHotEncoder(sparse=False).fit(Sex_label.reshape(-1,1))
Sex_ohe=ohe_sex.transform(Sex_label.reshape(-1,1))
le_embarked=LabelEncoder().fit(data['Embarked'])
Embarked_label=le_embarked.transform(data['Embarked'])
ohe_embarked=OneHotEncoder(sparse=False).fit(Embarked_label.reshape(-1,1))
Embarked_ohe=ohe_embarked.transform(Embarked_label.reshape(-1,1))
def replace_name(x):
    # check 'Mrs' before 'Mr', since 'Mr' is a substring of 'Mrs'
    if 'Mrs' in x:
        return 'Mrs'
    elif 'Mr' in x:
        return 'Mr'
    else:
        return 'Miss'
data['Name'] = data['Name'].map(replace_name)
le_name=LabelEncoder().fit(data['Name'])
Name_label=le_name.transform(data['Name'])
ohe_name=OneHotEncoder(sparse=False).fit(Name_label.reshape(-1,1))
Name_ohe=ohe_name.transform(Name_label.reshape(-1,1))
data['Sex_0']=Sex_ohe[:,0]
data['Sex_1']=Sex_ohe[:,1]
data['Embarked_0']=Embarked_ohe[:,0]
data['Embarked_1']=Embarked_ohe[:,1]
data['Embarked_2']=Embarked_ohe[:,2]
data['Name_0']=Name_ohe[:,0]
data['Name_1']=Name_ohe[:,1]
data['Name_2']=Name_ohe[:,2]
Converting strings to numeric values
titanic_test.loc[titanic_test.Fare.isnull(), 'Fare'] = titanic_test[(titanic_test.Pclass==1)&(titanic_test.Embarked=='S')&(titanic_test.Sex=='male')].dropna().Fare.mean()
print(titanic_test['Fare'].unique())
titanic_test.loc[titanic_test['Sex'] == 'male', 'Sex'] = 0  # loc selects the rows where Sex == 'male' and sets their Sex to 0
titanic_test.loc[titanic_test['Sex'] == 'female', 'Sex'] = 1
print(titanic_test['Sex'].unique())
print(titanic_test['Embarked'].unique())
titanic_test['Embarked'] = titanic_test['Embarked'].fillna('S')  # fill with the most common port
titanic_test.loc[titanic_test['Embarked'] == 'S', 'Embarked'] = 0
titanic_test.loc[titanic_test['Embarked'] == 'C', 'Embarked'] = 1
titanic_test.loc[titanic_test['Embarked'] == 'Q', 'Embarked'] = 2
print(titanic_test['Embarked'].unique())
Model construction
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
x_tr, x_te, y_tr, y_te = train_test_split(train_x, train_y, test_size=0.3, random_state=0)
lr = LogisticRegression(C=1.0, tol=1e-6)
lr.fit(x_tr, y_tr)
print(lr.score(x_te, y_te))
from sklearn.svm import SVC
svc = SVC(C=2, kernel='rbf', decision_function_shape='ovo')
svc.fit(x_tr, y_tr)
print(svc.score(x_te, y_te))
from sklearn.ensemble import RandomForestClassifier
randomf = RandomForestClassifier(n_estimators=500, max_depth=5, random_state=0)
randomf.fit(x_tr, y_tr)
print(randomf.score(x_te, y_te))
from sklearn.ensemble import GradientBoostingClassifier
gdbt = GradientBoostingClassifier(n_estimators=600, max_depth=5, random_state=0)
gdbt.fit(x_tr, y_tr)
print(gdbt.score(x_te, y_te))
Predicting on the test data
data_test=pd.read_csv('test.csv')
data_test.drop(['Ticket'],axis=1,inplace=True)
data_test.loc[data_test.Embarked.isnull(),'Embarked']='S'
Sex_label_test=le_sex.transform(data_test['Sex'])
Sex_ohe_test=ohe_sex.transform(Sex_label_test.reshape(-1,1))
Embarked_label_test=le_embarked.transform(data_test['Embarked'])
Embarked_ohe_test=ohe_embarked.transform(Embarked_label_test.reshape(-1,1))
data_test['Name']=data_test['Name'].map(lambda x:replace_name(x))
Name_label_test=le_name.transform(data_test['Name'])
Name_ohe_test=ohe_name.transform(Name_label_test.reshape(-1,1))
data_test['Sex_0']=Sex_ohe_test[:,0]
data_test['Sex_1']=Sex_ohe_test[:,1]
data_test['Embarked_0']=Embarked_ohe_test[:,0]
data_test['Embarked_1']=Embarked_ohe_test[:,1]
data_test['Embarked_2']=Embarked_ohe_test[:,2]
data_test['Name_0']=Name_ohe_test[:,0]
data_test['Name_1']=Name_ohe_test[:,1]
data_test['Name_2']=Name_ohe_test[:,2]
data_test['Pclass_scaled'] = Pclass_scale.transform(data_test['Pclass'].values.reshape(-1, 1))
data_test.loc[data_test.Fare.isnull(), 'Fare'] = 0
data_test['Fare_scaled'] = Fare_scale.transform(data_test['Fare'].values.reshape(-1, 1))
data_test['SibSp_scaled'] = SibSp_scale.transform(data_test['SibSp'].values.reshape(-1, 1))
data_test['Parch_scaled'] = Parch_scale.transform(data_test['Parch'].values.reshape(-1, 1))
train_test = data_test[['Age','SibSp_scaled','Parch_scaled','Name_0','Name_1','Name_2','Sex_0','Sex_1']]
unknown_age_test = train_test[train_test.Age.isnull()].values
x_test = unknown_age_test[:, 1:]
predictage = rf.predict(x_test)
data_test.loc[data_test.Age.isnull(), 'Age'] = predictage
data_test['Age_scaled'] = Age_scale.transform(data_test['Age'].values.reshape(-1, 1))
test_x = data_test[['Sex_0','Sex_1','Embarked_0','Embarked_1','Embarked_2','Name_0','Name_1','Name_2','Pclass_scaled','Age_scaled','Fare_scaled']].values
predictions = model.predict(test_x).astype(np.int32)
result = pd.DataFrame({'PassengerId': data_test['PassengerId'].values, 'Survived': predictions})
result.to_csv('svc.csv', index=False)
LinearRegression
print('_'*40)
# Classifiers
from sklearn.linear_model import LinearRegression  # linear regression
from sklearn.model_selection import KFold  # cross-validation: split the training set into folds and average the scores
# the linear model needs numeric inputs, so apply the same Sex/Embarked conversion used on the test set above
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2
predictors = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']  # the features we use
alg = LinearRegression()
kf = KFold(n_splits=3)  # split the samples into 3 folds for cross-validation
predictions = []
for train, test in kf.split(titanic):
    train_predictors = titanic[predictors].iloc[train, :]  # training features for this fold
    train_target = titanic['Survived'].iloc[train]
    alg.fit(train_predictors, train_target)
    test_prediction = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_prediction)
import numpy as np
# linear regression outputs a value in [0, 1]; threshold it at 0.5 to get a 0/1 label
predictions = np.concatenate(predictions, axis=0)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
accuracy = (predictions == titanic['Survived'].values).mean()  # test accuracy
print(accuracy)
plot_learning_curve
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
%matplotlib inline
# use sklearn's learning_curve to get the training and cv scores, then draw the learning curve with matplotlib
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):
    """
    Plot the learning curve of a model on the given data.

    Parameters
    ----------
    estimator : the classifier to evaluate
    title : plot title
    X : input features, numpy array
    y : target vector
    ylim : (ymin, ymax) tuple setting the y-axis limits
    cv : number of cross-validation folds; one fold is the cv set, the rest are used for training (default 3)
    n_jobs : number of parallel jobs (default 1)
    """
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    if plot:
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel(u'train_num_of_samples')
        plt.ylabel(u'score')
        plt.gca().invert_yaxis()
        plt.grid()
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std,
                         alpha=0.1, color="b")
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std,
                         alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u'train score')
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u'testCV score')
        plt.legend(loc="best")
        plt.draw()
        plt.gca().invert_yaxis()
        plt.show()
    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff
#plot_learning_curve(clf, u"learning_rate", X, y)
LogisticRegression
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.model_selection import cross_val_score
LR = LogisticRegression(random_state=1)
scores = cross_val_score(LR, titanic[predictors], titanic['Survived'], cv=3)
print(scores.mean())
plot_learning_curve(LR, u'learning_rate', titanic[predictors], titanic['Survived'])
submission.csv
LR.fit(titanic[predictors], titanic['Survived'])
# the test set's Age column also has missing values; fill it with the training median
titanic_test['Age'] = titanic_test['Age'].fillna(titanic['Age'].median())
gender_submission = pd.DataFrame({'PassengerId': titanic_test['PassengerId'], 'Survived': LR.predict(titanic_test[predictors])})
gender_submission.to_csv('gender_submission.csv', index=None)
RandomForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
predictors = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
RFC = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
kf = KFold(n_splits=3)
scores = cross_val_score(RFC, titanic[predictors], titanic['Survived'], cv=kf)
print(scores.mean())
plot_learning_curve(RFC, u"learning_rate", titanic[predictors], titanic['Survived'])
submission.csv
RFC.fit(titanic[predictors], titanic['Survived'])
gender_submission = pd.DataFrame({'PassengerId': titanic_test['PassengerId'], 'Survived': RFC.predict(titanic_test[predictors])})
gender_submission.to_csv('gender_submission.csv', index=None)
Model tuning
We compare the performance of logistic regression, support vector machine, k-nearest neighbors, decision tree, random forest, GBDT and XGBoost.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
lr = LogisticRegression()
svc = SVC()
knn = KNeighborsClassifier(n_neighbors=3)
dt = DecisionTreeClassifier()
rf = RandomForestClassifier(n_estimators=300, min_samples_leaf=4, class_weight={0: 0.745, 1: 0.255})
gbdt = GradientBoostingClassifier(n_estimators=500, learning_rate=0.03, max_depth=3)
xgbGBDT = XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)
clfs = [lr, svc, knn, dt, rf, gbdt, xgbGBDT]
X_all, Y_all = train_x, train_y  # the features and target prepared earlier
kfold = 10
cv_results = []
for classifier in clfs:
    cv_results.append(cross_val_score(classifier, X_all, y=Y_all, scoring="accuracy", cv=kfold, n_jobs=4))
cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())
cv_res = pd.DataFrame({"CrossValMeans": cv_means, "CrossValerrors": cv_std,
                       "Algorithm": ["LR", "SVC", "KNN", "decision_tree", "random_forest", "GBDT", "xgbGBDT"]})
g = sns.barplot(x="CrossValMeans", y="Algorithm", data=cv_res, palette="Set3", orient="h", **{'xerr': cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")
Looking at the feature importances, the models differ quite a bit — would combining them work even better?
from sklearn.metrics import precision_score

class Ensemble(object):
    def __init__(self, estimators):
        self.estimator_names = []
        self.estimators = []
        for i in estimators:
            self.estimator_names.append(i[0])
            self.estimators.append(i[1])
        self.clf = LogisticRegression()

    def fit(self, train_x, train_y):
        # fit every base estimator, then train a logistic regression
        # on their stacked predictions (a simple form of stacking)
        for i in self.estimators:
            i.fit(train_x, train_y)
        x = np.array([i.predict(train_x) for i in self.estimators]).T
        y = train_y
        self.clf.fit(x, y)

    def predict(self, x):
        x = np.array([i.predict(x) for i in self.estimators]).T
        return self.clf.predict(x)

    def score(self, x, y):
        s = precision_score(y, self.predict(x))
        return s
The stacking framework is ready; let's drop the base classifiers in.
bag = Ensemble([('xgb', xgbGBDT), ('lr', lr), ('rf', rf), ('svc', svc), ('gbdt', gbdt)])
score = 0
for i in range(0, 10):
    num_test = 0.20
    X_train, X_cv, Y_train, Y_cv = train_test_split(X_all, Y_all, test_size=num_test)
    bag.fit(X_train, Y_train)
    acc_xgb = round(bag.score(X_cv, Y_cv) * 100, 2)
    score += acc_xgb
print(score / 10)  # about 87.86
Voting ensemble
from sklearn.ensemble import VotingClassifier
model = VotingClassifier(estimators=[('lr', lr), ('svc', svc), ('rf', randomf), ('GDBT', gdbt)], voting='hard', weights=[0.5, 1.5, 0.6, 0.6])
model.fit(x_tr, y_tr)
print(model.score(x_te, y_te))
############## Feature engineering ##############
titanic['Familysize'] = titanic['SibSp'] + titanic['Parch']  # total family size aboard
titanic['NameLength'] = titanic['Name'].apply(lambda x: len(x))  # length of the name
import re
def get_title(name):
    # extract the title: the first word that ends with a period
    title_search = re.search(r'([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""
titles = titanic['Name'].apply(get_title)
print(pd.value_counts(titles))
# map each title to a numeric code
title_mapping = {"Mr":1,"Miss":2,"Mrs":3,"Master":4,"Dr":5,"Rev":6,"Col":7,"Major":8,"Mlle":9,"Countess":10,"Ms":11,"Lady":12,"Jonkheer":13,"Don":14,"Mme":15,"Capt":16,"Sir":17}
for k, v in title_mapping.items():
    titles[titles == k] = v
print(pd.value_counts(titles))
titanic["titles"] = titles  # add the title feature
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif  # use feature_selection to gauge how important each feature is
import matplotlib.pyplot as plt
predictors = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Familysize','NameLength','titles']
selector = SelectKBest(f_classif,k=5)
selector.fit(titanic[predictors],titanic['Survived'])
scores = -np.log10(selector.pvalues_)
plt.bar(range(len(predictors)),scores)
plt.xticks(range(len(predictors)),predictors,rotation='vertical')
plt.show()
########## Ensembling classifiers ##########
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
import numpy as np
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Familysize','NameLength','titles']],
    [LogisticRegression(random_state=1), ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Familysize','NameLength','titles']]
]
kf = KFold(n_splits=3)
predictions = []
for train, test in kf.split(titanic):
    train_target = titanic['Survived'].iloc[train]
    full_test_predictions = []
    for alg, predictors in algorithms:
        alg.fit(titanic[predictors].iloc[train, :], train_target)
        test_prediction = alg.predict_proba(titanic[predictors].iloc[test, :].astype(float))[:, 1]
        full_test_predictions.append(test_prediction)
    # weighted average: GBDT gets weight 3, logistic regression weight 1
    test_predictions = (full_test_predictions[0] * 3 + full_test_predictions[1]) / 4
    test_predictions[test_predictions > .5] = 1
    test_predictions[test_predictions <= .5] = 0
    predictions.append(test_predictions)
predictions = np.concatenate(predictions, axis=0)
accuracy = (predictions == titanic['Survived'].values).mean()  # test accuracy
print(accuracy)
References
Kaggle_Titanic
Machine Learning Series (3): Logistic Regression Applied to the Kaggle Titanic Disaster
Predicting Titanic Survival with sklearn: a Kaggle Case Study
Breaking into the Kaggle Top 5% in No Time, Part 1
Breaking into the Kaggle Top 5% in No Time, Part 2
Jason Brownlee - How to Get Started with Kaggle's Big Data Competitions
Jason Brownlee - Applied Machine Learning
https://www.kaggle.com/c/titanic#tutorials