Machine Learning in Practice

Using sklearn on the Kaggle Titanic case: predicting passenger survival

2018-01-03  jiandanjinxin

Titanic
Background: the Kaggle Titanic competition

The sinking of the Titanic in 1912 killed 1502 of the 2224 passengers and crew on board. With the benefit of hindsight, we have some data about the passengers, along with the survival outcome for part of them. By exploring this data we hope to uncover a few hidden patterns, and along the way predict whether the remaining passengers survived!

# -*- coding: UTF-8 -*-
import os
# data handling
import pandas as pd
import numpy as np
import random
import sklearn.preprocessing as preprocessing
# visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
titanic = pd.read_csv('train.csv')
titanic_test = pd.read_csv('test.csv')
#submission_sample = pd.read_csv('gender_submission.csv')
print('_'*40)
print(titanic.describe())
print(titanic.info())
print(titanic.head())
print(titanic.tail())
print(titanic_test.describe())
print(titanic_test.info())
print(titanic_test.head())
print(titanic_test.tail())
print('_'*40)
print(titanic.columns)
print(titanic_test.columns)

The data contains 12 variables in total, 7 numeric and 5 categorical. Let's look at each of them.

PassengerId: the passenger's ID. It obviously has no bearing on survival and exists only to tell rows apart, so we won't consider it.

Survived: whether the passenger ultimately survived; this is the target variable we want to predict. From the mean we can see that the overall survival rate was about 38%.
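As a quick sanity check, the mean of a 0/1 column is exactly the fraction of 1s (a one-liner against the titanic DataFrame loaded above):

print(titanic['Survived'].mean())  # about 0.3838 on the Kaggle training set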

Pclass: socio-economic class. This is clearly related to survival: wealthier passengers in higher-class cabins enjoyed better service and were likely given priority when disaster struck, so it is definitely a variable we want to consider.

Name: at first glance this looks useless, since a name by itself can't tell you whether someone was rescued. But looking at the data more closely, every name contains a title such as Mr, Mrs or Miss, which hints at gender and age. So we turn Name into a categorical variable with the three states Mr, Mrs and Miss. To feed it into a machine learning model we need to encode it; the most direct idea is 0, 1, 2, but is that really reasonable? In terms of distance it would put Mr closer to Mrs than to Miss, which is clearly wrong, since we regard the three states as equal.
So, pay attention, here comes a key point: for this kind of categorical variable the usual approach is one-hot encoding. What does that mean? With n states we use an n-bit code in which exactly one bit is 1 and the rest are 0, one code per state. From a vector point of view these are the n basis vectors of an n-dimensional space, so the states are obviously on an equal footing. In this case we encode Mr, Mrs and Miss as 100, 010 and 001 respectively.
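As a minimal sketch of the idea (using pd.get_dummies; the main code below reaches the same result with LabelEncoder + OneHotEncoder):

import pandas as pd

titles = pd.Series(['Mr', 'Mrs', 'Miss', 'Mr'])
# one column per state, exactly one 1 (True/False in newer pandas) per row
print(pd.get_dummies(titles))
#    Miss  Mr  Mrs
# 0     0   1    0
# 1     0   0    1
# 2     1   0    0
# 3     0   1    0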

Sex: gender is certainly important. The whole world believes in "ladies first", so in an emergency the gentlemen would let the women escape first, and women's survival rate should be much higher. Like Name, Sex is a categorical variable with equal states, so we will one-hot encode it as well.

Age: like Sex, this clearly plays a big role, since protecting the elderly and the young is a widely held norm. But age affects survival mainly through the age band a person falls into, so we turn it into a categorical variable: under 18 is child, 18 to 50 is adult, over 50 is elder, and then apply one-hot encoding. One problem remains: only 714 of the Age values are present. Something this important can't be left incomplete, so we have to find a way to fill it in.
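A quick sketch of the binning idea with pd.cut (the 18/50 cut points come from the text; the upper bound of 120 is just a safe maximum):

import pandas as pd

ages = pd.Series([5, 22, 47, 63])
bands = pd.cut(ages, bins=[0, 18, 50, 120], labels=['child', 'adult', 'elder'])
print(pd.get_dummies(bands))  # then one-hot encode the three bands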

Key point again: how do we handle missing values? The simplest method is to throw away the samples with missing values. That works when samples are plentiful and you can afford to discard some; it doesn't use the information fully, but it also introduces no extra error. The go-through-the-motions method is to fill missing values with the mean or median; it is usually the most convenient fix, but it tends to introduce a fair amount of error. Finally, the more responsible method is to estimate the missing value from the other variables. That is usually more reliable, though not guaranteed: any estimate inevitably carries error, but it at least feels better...
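In pandas/sklearn terms the three options look roughly like this (a sketch; this article uses the median fill first, then a random-forest estimate for Age):

# 1. Drop rows with missing values -- acceptable when samples are plentiful.
dropped = titanic.dropna(subset=['Age'])

# 2. Fill with a central value -- simple, but distorts the distribution.
filled = titanic['Age'].fillna(titanic['Age'].median())

# 3. Estimate from the other variables -- e.g. regress Age on the Name/Sex/
#    SibSp/Parch features, as set_missing_age() does below with a
#    RandomForestRegressor.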

SibSp: the number of siblings or spouses aboard. I honestly can't say how this affects the final outcome, but it may come in handy when predicting age.

Parch: the number of parents or children aboard. Similar to SibSp: I haven't thought of a particularly good way to use it directly, but it should be quite useful for predicting age.

Ticket: the ticket number. Frankly, I have no idea what this cryptic string could possibly be good for, so it is dropped without hesitation.

Fare: the ticket price. This variable plays much the same role as social class: the higher the fare, the better the service, and presumably the better the odds of being rescued, so it must go into the model.

Cabin: the cabin number. It may leak a little cabin-class information, but honestly it has far too many missing values; imputing it would likely introduce more error than the information it provides. So, reluctantly, we say goodbye.

Embarked: the port of embarkation. In principle this shouldn't matter much, but since it is a categorical variable with only three states, we process it and put it into the model anyway; who knows, it might help. It also has two missing values; rather than making a big production of predicting them, we simply set them to S, the port where the most passengers boarded.
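Picking the most common port programmatically is a one-liner (a sketch; the main code below simply hard-codes 'S'):

most_common = titanic['Embarked'].mode()[0]   # 'S' on this dataset
titanic['Embarked'] = titanic['Embarked'].fillna(most_common)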

Good. By now we roughly know how to handle every variable: categorical variables get one-hot encoded. But what about the numeric variables? Do we use them as-is?

Key point! For numeric variables we usually normalize first. This speeds up convergence and keeps all dimensions within comparable ranges, which is a big win for distance-based classifiers; for decision-tree-style algorithms it makes no real difference. Here we will simply standardize all the numeric variables.
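Concretely, standardization is the z-score transform x' = (x - mean) / std; with sklearn's StandardScaler (a minimal sketch):

import numpy as np
from sklearn.preprocessing import StandardScaler

fares = np.array([[7.25], [71.28], [8.05]])   # a column vector, as sklearn expects
scaler = StandardScaler().fit(fares)          # learns the mean and standard deviation
print(scaler.transform(fares))                # output has zero mean and unit variance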

By now the plan should be fairly clear. Let's lay out the steps:

1 Drop PassengerId and Ticket; we won't use them.

2 Fill in Embarked, then one-hot encode Survived, Name, Sex and Embarked.

3 Standardize Pclass, Fare, SibSp and Parch.

4 Predict Age from Name, Sex, SibSp and Parch and fill in the gaps.

5 Standardize Age.

6 Pull out the unencoded Survived column as the target variable.

Let's look at the correlations between the numeric variables (corr() computes Pearson correlation):

print('_'*40)

sns.set(context="paper", font="monospace")
sns.set(style="white")
f, ax = plt.subplots(figsize=(10, 6))
# on pandas >= 2.0 you may need corr(numeric_only=True)
train_corr = titanic.drop('PassengerId', axis=1).corr()
# let seaborn label both axes from the DataFrame itself so the ticks can't get misordered
sns.heatmap(train_corr, ax=ax, vmax=.9, square=True)
ax.set_title('train feature corr', fontsize=20)




Data processing: missing-value imputation, normalization, and encoding.

Missing-value imputation

As discussed above, there are three options: drop the samples with missing values (fine when samples are plentiful), fill with the mean or median (simple but error-prone), or estimate the missing value from the other variables (usually the most reliable).

print('_'*40)
print('***********Train*************')
print(titanic.isnull().sum())
print('***********Test*************')
print(titanic_test.isnull().sum())

print('_'*40)

# fill the missing Age values with the median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
print(titanic.describe())

Predicting Age to fill in the gaps

(Note: data below is a working copy of the training frame, created as data = titanic.copy() in the Normalization section; set_missing_age also relies on the scaled and one-hot columns built in the next two sections, so run those first.)

from sklearn.ensemble import RandomForestRegressor

def set_missing_age(data):
    # uses the scaled/one-hot columns built in the next two sections
    train = data[['Age','SibSp_scaled','Parch_scaled','Name_0','Name_1','Name_2','Sex_0','Sex_1']]
    known_age = train[train.Age.notnull()].values
    unknown_age = train[train.Age.isnull()].values
    y = known_age[:, 0]
    x = known_age[:, 1:]
    rf = RandomForestRegressor(random_state=0, n_estimators=200, n_jobs=-1)
    rf.fit(x, y)
    print(rf.score(x, y))
    predictage = rf.predict(unknown_age[:, 1:])
    data.loc[data.Age.isnull(), 'Age'] = predictage
    return data, rf

data, rf = set_missing_age(data)
Age_scale = StandardScaler().fit(data[['Age']])
data['Age_scaled'] = Age_scale.transform(data[['Age']])

train_x = data[['Sex_0','Sex_1','Embarked_0','Embarked_1','Embarked_2','Name_0','Name_1','Name_2','Pclass_scaled','Age_scaled','Fare_scaled']].values

train_y = data['Survived'].values

Normalization


from sklearn.preprocessing import StandardScaler

# work on a copy of the training frame so the original Name/Sex/Embarked
# columns stay intact for the later sections
data = titanic.copy()

Pclass_scale = StandardScaler().fit(data[['Pclass']])
data['Pclass_scaled'] = Pclass_scale.transform(data[['Pclass']])

Fare_scale = StandardScaler().fit(data[['Fare']])
data['Fare_scaled'] = Fare_scale.transform(data[['Fare']])

SibSp_scale = StandardScaler().fit(data[['SibSp']])
data['SibSp_scaled'] = SibSp_scale.transform(data[['SibSp']])

Parch_scale = StandardScaler().fit(data[['Parch']])
data['Parch_scaled'] = Parch_scale.transform(data[['Parch']])

One-hot encoding

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

#ohe_pclass=OneHotEncoder(sparse=False).fit(data[['Pclass']])
#Pclass_ohe=ohe_pclass.transform(data[['Pclass']])

# Embarked has two missing values in the training set; fill them before encoding
data.loc[data.Embarked.isnull(), 'Embarked'] = 'S'

le_sex = LabelEncoder().fit(data['Sex'])
Sex_label = le_sex.transform(data['Sex'])
ohe_sex = OneHotEncoder(sparse=False).fit(Sex_label.reshape(-1, 1))  # use sparse_output=False on sklearn >= 1.2
Sex_ohe = ohe_sex.transform(Sex_label.reshape(-1, 1))

le_embarked = LabelEncoder().fit(data['Embarked'])
Embarked_label = le_embarked.transform(data['Embarked'])
ohe_embarked = OneHotEncoder(sparse=False).fit(Embarked_label.reshape(-1, 1))
Embarked_ohe = ohe_embarked.transform(Embarked_label.reshape(-1, 1))

def replace_name(x):
    if 'Mrs' in x:
        return 'Mrs'
    elif 'Mr' in x:
        return 'Mr'
    else:
        return 'Miss'

data['Name'] = data['Name'].map(lambda x: replace_name(x))

le_name = LabelEncoder().fit(data['Name'])
Name_label = le_name.transform(data['Name'])
ohe_name = OneHotEncoder(sparse=False).fit(Name_label.reshape(-1, 1))
Name_ohe = ohe_name.transform(Name_label.reshape(-1, 1))

data['Sex_0'] = Sex_ohe[:, 0]
data['Sex_1'] = Sex_ohe[:, 1]

data['Embarked_0'] = Embarked_ohe[:, 0]
data['Embarked_1'] = Embarked_ohe[:, 1]
data['Embarked_2'] = Embarked_ohe[:, 2]

data['Name_0'] = Name_ohe[:, 0]
data['Name_1'] = Name_ohe[:, 1]
data['Name_2'] = Name_ohe[:, 2]

Converting strings to numeric values

# Fill the one missing Fare in the test set with the mean fare of comparable passengers
titanic_test.loc[titanic_test.Fare.isnull(), 'Fare'] = titanic_test[
    (titanic_test.Pclass == 1) & (titanic_test.Embarked == 'S') & (titanic_test.Sex == 'male')
].Fare.dropna().mean()
print(titanic_test['Fare'].unique())

# fill the missing test-set ages so the prediction sections below can run
titanic_test['Age'] = titanic_test['Age'].fillna(titanic['Age'].median())

# Map Sex and Embarked to integers. Apply the same mapping to BOTH frames,
# since the cross-validation sections below train on titanic directly.
for df in (titanic, titanic_test):
    df.loc[df['Sex'] == 'male', 'Sex'] = 0    # .loc selects the rows where Sex == 'male' and sets Sex to 0
    df.loc[df['Sex'] == 'female', 'Sex'] = 1
    df['Embarked'] = df['Embarked'].fillna('S')   # fill with the most common port
    df.loc[df['Embarked'] == 'S', 'Embarked'] = 0
    df.loc[df['Embarked'] == 'C', 'Embarked'] = 1
    df.loc[df['Embarked'] == 'Q', 'Embarked'] = 2

print(titanic_test['Sex'].unique())
print(titanic_test['Embarked'].unique())

Building the models

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

x_tr, x_te, y_tr, y_te = train_test_split(train_x, train_y, test_size=0.3, random_state=0)

lr = LogisticRegression(C=1.0, tol=1e-6)
lr.fit(x_tr, y_tr)
print(lr.score(x_te, y_te))

from sklearn.svm import SVC

svc = SVC(C=2, kernel='rbf', decision_function_shape='ovo')
svc.fit(x_tr, y_tr)
print(svc.score(x_te, y_te))

from sklearn.ensemble import RandomForestClassifier

randomf = RandomForestClassifier(n_estimators=500, max_depth=5, random_state=0)
randomf.fit(x_tr, y_tr)
print(randomf.score(x_te, y_te))

from sklearn.ensemble import GradientBoostingClassifier

gdbt = GradientBoostingClassifier(n_estimators=600, max_depth=5, random_state=0)
gdbt.fit(x_tr, y_tr)
print(gdbt.score(x_te, y_te))

Predicting on the test data



data_test = pd.read_csv('test.csv')
data_test.drop(['Ticket'], axis=1, inplace=True)
data_test.loc[data_test.Embarked.isnull(), 'Embarked'] = 'S'

# reuse the encoders fitted on the training data
Sex_label_test = le_sex.transform(data_test['Sex'])
Sex_ohe_test = ohe_sex.transform(Sex_label_test.reshape(-1, 1))
Embarked_label_test = le_embarked.transform(data_test['Embarked'])
Embarked_ohe_test = ohe_embarked.transform(Embarked_label_test.reshape(-1, 1))

data_test['Name'] = data_test['Name'].map(lambda x: replace_name(x))
Name_label_test = le_name.transform(data_test['Name'])
Name_ohe_test = ohe_name.transform(Name_label_test.reshape(-1, 1))

data_test['Sex_0'] = Sex_ohe_test[:, 0]
data_test['Sex_1'] = Sex_ohe_test[:, 1]
data_test['Embarked_0'] = Embarked_ohe_test[:, 0]
data_test['Embarked_1'] = Embarked_ohe_test[:, 1]
data_test['Embarked_2'] = Embarked_ohe_test[:, 2]
data_test['Name_0'] = Name_ohe_test[:, 0]
data_test['Name_1'] = Name_ohe_test[:, 1]
data_test['Name_2'] = Name_ohe_test[:, 2]

# reuse the scalers fitted on the training data
data_test['Pclass_scaled'] = Pclass_scale.transform(data_test[['Pclass']])
data_test.loc[data_test.Fare.isnull(), 'Fare'] = 0
data_test['Fare_scaled'] = Fare_scale.transform(data_test[['Fare']])
data_test['SibSp_scaled'] = SibSp_scale.transform(data_test[['SibSp']])
data_test['Parch_scaled'] = Parch_scale.transform(data_test[['Parch']])

# fill the missing ages with the random forest trained on the training data
train_test = data_test[['Age','SibSp_scaled','Parch_scaled','Name_0','Name_1','Name_2','Sex_0','Sex_1']]
unknown_age_test = train_test[train_test.Age.isnull()].values
x_test = unknown_age_test[:, 1:]
predictage = rf.predict(x_test)
data_test.loc[data_test.Age.isnull(), 'Age'] = predictage
data_test['Age_scaled'] = Age_scale.transform(data_test[['Age']])

test_x = data_test[['Sex_0','Sex_1','Embarked_0','Embarked_1','Embarked_2','Name_0','Name_1','Name_2','Pclass_scaled','Age_scaled','Fare_scaled']].values

# model is the VotingClassifier built in the "Ensemble voting" section below
predictions = model.predict(test_x).astype(np.int32)
result = pd.DataFrame({'PassengerId': data_test['PassengerId'].values, 'Survived': predictions})
result.to_csv('svc.csv', index=False)

LinearRegression

print('_'*40)

# Classifiers
from sklearn.linear_model import LinearRegression  # linear regression
from sklearn.model_selection import KFold  # cross-validation: split the training set, validate, average

predictors = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']  # features used
alg = LinearRegression()
kf = KFold(n_splits=3)  # split the m samples into 3 sequential folds
predictions = []
for train, test in kf.split(titanic):
    train_predictors = titanic[predictors].iloc[train, :]  # training features for this fold
    train_target = titanic['Survived'].iloc[train]
    alg.fit(train_predictors, train_target)
    test_prediction = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_prediction)

# Linear regression outputs a value in roughly [0, 1]; threshold it at 0.5 to get 0 or 1.
predictions = np.concatenate(predictions, axis=0)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
accuracy = (predictions == titanic['Survived'].values).mean()  # test accuracy
print(accuracy)

plot_learning_curve

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Use sklearn's learning_curve to get training and CV scores, then plot the learning curve with matplotlib.
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):
    """
    Plot the learning curve of a model on the given data.

    Parameters
    ----------
    estimator : the classifier to evaluate
    title : plot title
    X : input features (numpy array)
    y : target vector
    ylim : (ymin, ymax) tuple fixing the y-axis limits
    cv : number of cross-validation folds; one fold is the CV set, the rest are training (default 3)
    n_jobs : number of parallel jobs (default 1)
    """
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    if plot:
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel(u'train_num_of_samples')
        plt.ylabel(u'score')
        plt.gca().invert_yaxis()
        plt.grid()
    
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, 
                         alpha=0.1, color="b")
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, 
                         alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u'train score')
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u'testCV score')
    
        plt.legend(loc="best")
        
        plt.draw()
        plt.gca().invert_yaxis()
        plt.show()
    
    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff

#plot_learning_curve(clf, u"learning_rate", X, y)

LogisticRegression

from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.model_selection import cross_val_score

LR = LogisticRegression(random_state=1)
scores = cross_val_score(LR, titanic[predictors], titanic['Survived'], cv=3)
print(scores.mean())
plot_learning_curve(LR, u'learning_rate', titanic[predictors], titanic['Survived'])

submission.csv

LR.fit(titanic[predictors], titanic['Survived'])
gender_submission = pd.DataFrame({'PassengerId': titanic_test['PassengerId'],
                                  'Survived': LR.predict(titanic_test[predictors])})
gender_submission.to_csv('gender_submission.csv', index=None)

RandomForest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

predictors = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
RFC = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(RFC, titanic[predictors], titanic['Survived'], cv=kf)
print(scores.mean())
plot_learning_curve(RFC, u"learning_rate", titanic[predictors], titanic['Survived'])

submission.csv

RFC.fit(titanic[predictors], titanic['Survived'])
gender_submission = pd.DataFrame({'PassengerId': titanic_test['PassengerId'],
                                  'Survived': RFC.predict(titanic_test[predictors])})
gender_submission.to_csv('gender_submission.csv', index=None)

Model tuning

Let's compare the performance of logistic regression, support vector machines, nearest neighbors, decision trees, random forests, GBDT and XGBoost.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# X_all / Y_all are assumed to be the processed feature matrix and labels built above
X_all, Y_all = train_x, train_y

lr = LogisticRegression()
svc = SVC()
knn = KNeighborsClassifier(n_neighbors=3)
dt = DecisionTreeClassifier()
rf = RandomForestClassifier(n_estimators=300, min_samples_leaf=4, class_weight={0: 0.745, 1: 0.255})
gbdt = GradientBoostingClassifier(n_estimators=500, learning_rate=0.03, max_depth=3)
xgbGBDT = XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)
clfs = [lr, svc, knn, dt, rf, gbdt, xgbGBDT]

kfold = 10
cv_results = []
for classifier in clfs :
    cv_results.append(cross_val_score(classifier, X_all, y = Y_all, scoring = "accuracy", cv = kfold, n_jobs=4))

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

cv_res = pd.DataFrame({"CrossValMeans":cv_means,"CrossValerrors": cv_std,
                       "Algorithm":["LR","SVC",'KNN','decision_tree',"random_forest","GBDT","xgbGBDT"]})

g = sns.barplot(x="CrossValMeans", y="Algorithm", data=cv_res, palette="Set3", orient="h", **{'xerr': cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")

Looking at the results, the feature importances of the different models differ quite a bit. Would combining them do even better?

from sklearn.metrics import precision_score

class Ensemble(object):
    """A simple stacking ensemble: the base estimators' predictions become the inputs of a logistic regression."""

    def __init__(self, estimators):
        self.estimator_names = []
        self.estimators = []
        for i in estimators:
            self.estimator_names.append(i[0])
            self.estimators.append(i[1])
        self.clf = LogisticRegression()

    def fit(self, train_x, train_y):
        for i in self.estimators:
            i.fit(train_x, train_y)
        x = np.array([i.predict(train_x) for i in self.estimators]).T
        y = train_y
        self.clf.fit(x, y)

    def predict(self, x):
        x = np.array([i.predict(x) for i in self.estimators]).T
        return self.clf.predict(x)

    def score(self, x, y):
        # note: this reports precision, not accuracy
        s = precision_score(y, self.predict(x))
        return s

The stacking framework is ready; now we drop the base classifiers in.

bag = Ensemble([('xgb', xgbGBDT), ('lr', lr), ('rf', rf), ('svc', svc), ('gbdt', gbdt)])
score = 0
for i in range(0, 10):
    num_test = 0.20
    X_train, X_cv, Y_train, Y_cv = train_test_split(X_all, Y_all, test_size=num_test)
    bag.fit(X_train, Y_train)
    acc_xgb = round(bag.score(X_cv, Y_cv) * 100, 2)
    score += acc_xgb
print(score / 10)  # ~87.86 in the original run

Ensemble voting

from sklearn.ensemble import VotingClassifier

model = VotingClassifier(estimators=[('lr', lr), ('svc', svc), ('rf', randomf), ('GDBT', gdbt)],
                         voting='hard', weights=[0.5, 1.5, 0.6, 0.6])
model.fit(x_tr, y_tr)
print(model.score(x_te, y_te))

############## Feature engineering ######################
titanic['Familysize'] = titanic['SibSp'] + titanic['Parch']  # total family members aboard
titanic['NameLength'] = titanic['Name'].apply(lambda x: len(x))  # length of the name

import re

def get_title(name):
    title_research = re.search(r'([A-Za-z]+)\.', name)
    if title_research:
        return title_research.group(1)
    return ""

titles = titanic['Name'].apply(get_title)
print(titles.value_counts())

# map each title to a numeric code
title_mapping = {"Mr":1,"Miss":2,"Mrs":3,"Master":4,"Dr":5,"Rev":6,"Col":7,"Major":8,"Mlle":9,
                 "Countess":10,"Ms":11,"Lady":12,"Jonkheer":13,"Don":14,"Mme":15,"Capt":16,"Sir":17}
for k, v in title_mapping.items():
    titles[titles == k] = v
print(titles.value_counts())
titanic["titles"] = titles.astype(int)  # add the title feature

from sklearn.feature_selection import SelectKBest, f_classif  # rank each feature's importance

predictors = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Familysize','NameLength','titles']
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic['Survived'])
scores = -np.log10(selector.pvalues_)
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

########## Ensemble of classifiers #############
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
     ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Familysize','NameLength','titles']],
    [LogisticRegression(random_state=1),
     ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Familysize','NameLength','titles']]
]
kf = KFold(n_splits=3)  # sequential folds, so the concatenated predictions line up with the original rows
predictions = []
for train, test in kf.split(titanic):
    train_target = titanic['Survived'].iloc[train]
    full_test_predictions = []
    for alg, predictors in algorithms:
        alg.fit(titanic[predictors].iloc[train, :], train_target)
        test_prediction = alg.predict_proba(titanic[predictors].iloc[test, :].astype(float))[:, 1]
        full_test_predictions.append(test_prediction)
    # weighted average: GBDT gets 3x the weight of logistic regression
    test_predictions = (full_test_predictions[0] * 3 + full_test_predictions[1]) / 4
    test_predictions[test_predictions > .5] = 1
    test_predictions[test_predictions <= .5] = 0
    predictions.append(test_predictions)
predictions = np.concatenate(predictions, axis=0)
accuracy = (predictions == titanic['Survived'].values).mean()  # test accuracy
print(accuracy)

