Python from Scratch, Chapter 6: Machine Learning ① Logistic Regression in Practice (1)

2018-12-15  柳叶刀与小鼠标

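In this section you will use a machine learning algorithm, logistic regression, to tackle the Titanic prediction problem. Logistic regression is a classification algorithm that predicts the outcome of an event, for example whether a passenger survived the Titanic disaster.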

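On April 15, 1912, during her maiden voyage, the Titanic sank after striking an iceberg, killing 1,502 of the 2,224 passengers and crew. The tragedy shocked the international community. One reason for the high death toll was that there were not enough lifeboats for the passengers and crew. Survival involved some luck, but some groups were still more likely to survive than others, such as women, children, and the upper class. In this tutorial we use logistic regression to predict which passengers were more likely to survive. The programming language used is Python.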

%reset -f
%clear
# In[*]

import pandas as pd
from sklearn import linear_model
from sklearn import preprocessing
import os
os.chdir('D:\\train\\all')
# In[*]
# read the data
df = pd.read_csv("train.csv")

My personal habit is to look at the data frame after every step to verify that the data was loaded correctly.
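A minimal sketch of such a check is shown here. The `sample` frame is a made-up stand-in for `train.csv` (its rows are invented for illustration), so the snippet runs without the real file; on the real data you would call the same methods on `df`.

```python
import pandas as pd

# Hypothetical stand-in for train.csv: a few invented rows with the same
# kind of columns, so the sketch runs without the real file.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 54.0],
    "Embarked": ["S", "C", "S", None],
})

print(sample.head())    # first rows: do the values look sensible?
print(sample.shape)     # (number of rows, number of columns)
print(sample.dtypes)    # column types as pandas inferred them

# Count missing values per column; gaps matter for later encoding steps.
missing = sample.isnull().sum()
print(missing)
```

Checking `isnull().sum()` early is useful because missing values in a column affect how that column can be encoded or fed to a model later on.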

# drop the columns that are not useful to us
df = df.drop('PassengerId', axis=1)
# axis=1 means column
df = df.drop('Name', axis=1)
df = df.drop('Ticket', axis=1)
df = df.drop('Cabin', axis=1)
# initialize label encoder
label_encoder = preprocessing.LabelEncoder()
# convert Sex and Embarked features to numeric
sex_encoded = label_encoder.fit_transform(df["Sex"])
print(sex_encoded)
# 0 = female
# 1 = male
df['Sex'] = sex_encoded
embarked_encoded = label_encoder.fit_transform(df["Embarked"])
print(embarked_encoded)
# 0 = C
# 1 = Q
# 2 = S
df['Embarked'] = embarked_encoded
print(df.head())

Note that the values of the Sex and Embarked fields have now been replaced by their encoded values.
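To confirm which integer stands for which original value, you can read the mapping back from the encoder instead of relying on comments. The sketch below uses a small invented list rather than the real column, and the names `encoder`, `codes`, `mapping`, and `restored` are illustrative only.

```python
from sklearn import preprocessing

# Invented values standing in for the Sex column.
values = ["male", "female", "female", "male"]

encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(values)
print(codes)             # one integer code per input value
print(encoder.classes_)  # index in this array == assigned code

# Build an explicit label -> code dictionary from the fitted encoder.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels.
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because the same `LabelEncoder` object is refitted for each column in the tutorial code, `classes_` always reflects the most recently fitted column; using one encoder per column is one way to keep every mapping available.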

To make fields categorical, use the Categorical class in Pandas:

# In[*]
# make fields categorical
df["Pclass"] = pd.Categorical(df["Pclass"])
df["Sex"] = pd.Categorical(df["Sex"])
df["Embarked"] = pd.Categorical(df["Embarked"])
df["Survived"] = pd.Categorical(df["Survived"])
print(df.dtypes)  # examine the datatypes for each feature
Survived    category
Pclass      category
Sex         category
Age          float64
SibSp          int64
Parch          int64
Fare         float64
Embarked    category
dtype: object
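As a small illustration of what the conversion does, the sketch below applies `pd.Categorical` to an invented series of class numbers and inspects the result. The variable names are hypothetical and the data is not taken from `train.csv`.

```python
import pandas as pd

# Invented passenger-class values.
pclass = pd.Series([3, 1, 3, 2])

as_category = pd.Series(pd.Categorical(pclass))
print(as_category.dtype)  # category

# The distinct values become the categories (sorted for numeric data) ...
categories = list(as_category.cat.categories)
print(categories)

# ... and each row stores an integer code pointing into that list.
category_codes = list(as_category.cat.codes)
print(category_codes)
```

The stored values are unchanged; what changes is that pandas now treats the column as a fixed set of levels rather than as arbitrary numbers.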
# In[*]

# we use all columns except Survived as
# features for training
# (axis must be passed by keyword; the positional form was removed in pandas 2.0)
features = df.drop('Survived', axis=1)
# the label is Survived
label = df['Survived']

from sklearn.model_selection import train_test_split
# split the dataset into train and test sets
train_features, test_features, train_label, test_label = train_test_split(
    features,
    label,
    test_size=0.25,      # split ratio
    random_state=1,      # set random seed
    stratify=df["Survived"])
# Training set
print(train_features.head())
print(train_label)  
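The `stratify` argument asks `train_test_split` to keep the class proportions of the label roughly equal in both parts. The following sketch checks this on invented data (40 rows, 16 of them positive); the `toy` frame and its column names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 40 rows, 16 positives (40%) and 24 negatives (60%).
toy = pd.DataFrame({"x": range(40), "y": [1] * 16 + [0] * 24})

X_train, X_test, y_train, y_test = train_test_split(
    toy[["x"]],
    toy["y"],
    test_size=0.25,     # 10 rows for testing, 30 for training
    random_state=1,
    stratify=toy["y"],  # preserve the share of each class in both parts
)

# Compare the share of positives in each part with the full data.
print(toy["y"].mean(), y_train.mean(), y_test.mean())
```

On the Titanic data the same comparison can be made with `train_label` and `test_label`, for example by comparing their `value_counts(normalize=True)`.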