Kaggle Learn Track 4: Introduction to Machine Learning

2019-06-03 · Python解决方案

This track mainly addresses the following topics:


1. tackle data types often found in real-world datasets (missing values, categorical variables),

2. design pipelines to improve the quality of your machine learning code,

3. use advanced techniques for model validation (cross-validation),

4. build state-of-the-art models that are widely used to win Kaggle competitions (XGBoost), and

5. avoid common and important data science mistakes (leakage).


Lessons 1 & 2: Introduction and Exercises


This lesson mostly reviews the material from Track 2. The difference is that, building on that track, the model-prediction workflow is wrapped in a function, and it also shows how to upload the generated predictions to Kaggle.
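The snippets below assume that X, y, X_test and the train/validation splits already exist. Here is a minimal sketch of that setup, assuming the Housing Prices competition files and an illustrative subset of numeric features (file paths and column choices are assumptions, not part of the original post):

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the competition data (paths are assumptions)
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Separate the target from the predictors
y = X_full.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
            'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_full[features].copy()
X_test = X_test_full[features].copy()

# Break off a validation set from the training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8,
                                                      test_size=0.2, random_state=0)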


from sklearn.ensemble import RandomForestRegressor

# Define the models
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='mae', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]



from sklearn.metrics import mean_absolute_error

# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

for i in range(0, len(models)):
    mae = score_model(models[i])
    print("Model %d MAE: %d" % (i+1, mae))


# Define the final model; here we simply reuse one of the models above
# (pick whichever scored best in the comparison)
my_model = model_3

# Fit the model to the training data
my_model.fit(X, y)

# Generate test predictions
preds_test = my_model.predict(X_test)

# Save predictions in format used for competition scoring
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)


Lessons 3 & 4: Handling Missing Values


Approach 1: drop the columns that contain missing values.

cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

You can also use dropna(), e.g.:

X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)

Approach 2: imputation. You can simply fill with the mean, or use more sophisticated methods such as regression imputation.

from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

Approach 3: an extension to imputation. In addition to imputing, add indicator columns marking which values were originally missing; note again how the column names need to be restored afterwards.

# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

Along the way, the Kaggle dataset Housing price in Beijing is used to introduce some other feature-preprocessing techniques.

Handling outliers

To follow along, first take a look at the raw data. Inspection shows that the features livingRoom, price, constructionTime, and floor all contain abnormal values: livingRoom contains "#NAME?", constructionTime contains "未知" (unknown), price contains some implausibly small values (e.g. 1-1000 yuan), and floor contains "钢混结构" (steel-concrete structure).

For the first three, the strategy is simply to throw the offending rows away.

with_NAME_row = house_select_columns.index[house_select_columns["livingRoom"] == "#NAME?"].tolist()
house_select_columns = house_select_columns.drop(with_NAME_row, axis=0)

with_weizhi_row = house_select_columns.index[house_select_columns["constructionTime"] == "未知"].tolist()
house_select_columns = house_select_columns.drop(with_weizhi_row, axis=0)

with_lowprice_row = house_select_columns.index[house_select_columns["price"] < 10000].tolist()  # drop anything priced below 10,000 yuan per square meter
house_select_columns = house_select_columns.drop(with_lowprice_row, axis=0)

One more note on the floor feature: apart from the abnormal value "钢混结构", its values are strings like "高 27" or "低 7", so they still need processing before being fed to a model. The approach here is to extract the number (the floor count) as the value of the floor feature and discard the Chinese characters. Of course, you could also create a separate feature to distinguish 顶/高/中/低/底 (top/high/middle/low/bottom), which would use the categorical-variable techniques introduced later.

import re

floor_list = []
floor_copy = house_data["floor"]

for i in range(len(floor_copy)):
    if re.findall(r'(\d+)', str(floor_copy[i])):
        f1 = re.findall(r'(\d+)', str(floor_copy[i]))
        f1_int = int(f1[0])
        floor_list.append(f1_int)
    else:
        floor_list.append(-1)  # 32 such rows

house_data["floor"] = floor_list
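A more compact pandas alternative (a sketch; it keeps the same "-1 when no digits are found" convention):

# Pull the first run of digits out of each floor string; -1 when none is found
house_data["floor"] = (house_data["floor"].astype(str)
                       .str.extract(r'(\d+)', expand=False)
                       .fillna(-1)
                       .astype(int))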

Data type conversion

Since livingRoom, drawingRoom, and bathRoom are all of dtype object, they need to be converted to float.

house_select_columns["drawingRoom"] = house_select_columns["drawingRoom"].astype("float")
house_select_columns["bathRoom"] = house_select_columns["bathRoom"].astype("float")
house_select_columns["livingRoom"] = house_select_columns["livingRoom"].astype("float")

To select only the non-"object" columns:

X = X_full.select_dtypes(exclude=['object'])

Handling dates

The tradeTime feature is a string of the form "2017-01-01". To feed it to the model it is converted to a timestamp. Raw timestamps are large numbers, which adds computational cost, so the timestamp of a fixed date such as "2017-01-01" is subtracted and the result divided by 86400 (the number of seconds in a day); what you actually get is the number of days between tradeTime and January 1, 2017.

import time

timeStamp = []
timeSub = time.strptime("2017-01-01", "%Y-%m-%d")
stampSub = int(time.mktime(timeSub))

tradeTime_copy = house_data["tradeTime"]

for i in range(len(house_data["tradeTime"])):
    timeArray = time.strptime(str(tradeTime_copy[i]), "%Y-%m-%d")
    stamp = int(stampSub - time.mktime(timeArray)) / 86400
    timeStamp.append(stamp)

house_data["tradeTime"] = timeStamp
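The same transformation can be written more compactly with pandas (a sketch, assuming tradeTime parses cleanly as dates):

# Days from tradeTime back to the reference date 2017-01-01
house_data["tradeTime"] = (pd.Timestamp("2017-01-01")
                           - pd.to_datetime(house_data["tradeTime"])).dt.days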

Of course there are plenty of other tricks for handling features; it is worth getting fluent with the various operations on pandas DataFrames and with IPython's Tab completion.


Lessons 5 & 6: Categorical Variables


A categorical variable is one that takes only a fixed, limited set of values, and such variables are usually not numeric, so they cannot be fed to a model directly. Three solutions are shown below, in order: dropping the categorical columns, label encoding, and one-hot encoding.
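The snippets below all rely on object_cols, the list of columns holding categorical data; it is assumed to be built along these lines:

# Columns with object (string) dtype are treated as categorical
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]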

# Approach 1: drop columns with categorical data
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

# Approach 2: label encoding
from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])
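One caveat with the snippet above: LabelEncoder.transform raises an error if the validation data contains a category that was never seen during fitting. A common safeguard (a sketch, following the same naming style) is to label-encode only the columns whose validation categories are a subset of the training categories:

# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols
                   if set(X_valid[col]).issubset(set(X_train[col]))]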
# Approach 3: one-hot encoding
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
# handle_unknown='ignore' avoids errors when the validation data contains
# classes that aren't represented in the training data, and sparse=False
# ensures the encoded columns are returned as a numpy array (instead of a sparse matrix)

OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

To get the category names and the number of unique categories of each categorical variable, as a dictionary:

# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])
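These counts help decide which columns to one-hot encode, because high-cardinality columns add a very large number of new columns. A common rule of thumb (the cutoff of 10 is an illustrative choice):

# One-hot encode only the columns with relatively few categories
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
high_cardinality_cols = list(set(object_cols) - set(low_cardinality_cols))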

Lessons 7 & 8: Pipelines


A pipeline wraps all of the steps into one streamed workflow (streaming workflows with pipelines), so that the fitted parameters can be reused on new data such as the test set. Although beginners use it less often, it has clear advantages: cleaner code, fewer bugs, and easier productionization and validation. Building one takes three steps.

1. Define the preprocessing steps

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
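The ColumnTransformer above expects the column lists numerical_cols and categorical_cols to exist already; a typical way to build them (the cardinality cutoff of 10 is an illustrative choice, mirroring the rule of thumb from the previous lesson):

# Categorical columns with relatively low cardinality
categorical_cols = [col for col in X_train.columns
                    if X_train[col].nunique() < 10 and X_train[col].dtype == "object"]

# Numerical columns
numerical_cols = [col for col in X_train.columns
                  if X_train[col].dtype in ['int64', 'float64']]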

2. Define the model

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)

3. Create and evaluate the pipeline

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
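To score the fitted pipeline on the validation set, compare preds with y_valid using the mean_absolute_error imported above:

# Evaluate the pipeline
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)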

A Pipeline object accepts a list of 2-tuples. The first element of each tuple is an arbitrary identifier string, used to access the individual elements of the Pipeline object; the second element is a compatible scikit-learn transformer or estimator.
The execution flow is shown below:

[Figure: pipeline execution flow]

It is similar to the layered structure of a deep neural network.

Lessons 9 & 10: Cross-Validation


How it works: the data is divided into folds; each fold in turn is held out for validation while the model is trained on the remaining folds, and the resulting scores are averaged.

[Figure: cross-validation with 5 folds]

When to use it: cross-validation is most worthwhile on smaller datasets, where the extra computation is not a burden; with a large dataset a single validation split is usually enough. In scikit-learn:

from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,  # set the number of folds with the cv parameter
                              scoring='neg_mean_absolute_error')
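cross_val_score returns one MAE per fold; the usual summary is their mean:

print("MAE scores:\n", scores)
print("Average MAE (across folds):", scores.mean())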

Lessons 11 & 12: XGBoost


[Figure: flowchart for choosing an ML method]

Ensemble methods include bagging and boosting. The RandomForest introduced earlier is a bagging method; the XGBoost covered in this lesson is a boosting method, short for extreme gradient boosting.

[Figure: XGBoost workflow]

from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

Note that xgboost is not part of sklearn; sklearn has its own GradientBoostingClassifier.
Several XGBRegressor parameters can be tuned: n_estimators, learning_rate, and n_jobs; a sketch follows.
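Here is what tuning those parameters can look like, including early stopping via the fit() arguments supported by xgboost versions of that era (the concrete numbers are illustrative, not prescriptive):

my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train,
             early_stopping_rounds=5,        # stop once the validation score stops improving
             eval_set=[(X_valid, y_valid)],  # data used to monitor that score
             verbose=False)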


Lessons 13 & 14: Data Leakage


Definition: Data leakage (or leakage) happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction.
Types: leakage comes in two kinds: target leakage and train-test contamination.
Target leakage means that some features in the training data effectively already reveal the target. For example, when predicting house prices, if both the area and the unit price (price per square meter) are fed to the model as features, the model will score very well on the validation data but do terribly on the test data.
Train-test contamination is best understood through an example:
For example, imagine you run preprocessing (like fitting an imputer for missing values) before calling train_test_split(). The end result? Your model may get good validation scores, giving you great confidence in it, but perform poorly when you deploy it to make decisions.
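A minimal sketch of how a pipeline avoids that kind of contamination (reusing names from the earlier sections): because all preprocessing lives inside the pipeline, the imputer is re-fit on the training portion of every fold and never sees the validation data.

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Preprocessing happens inside the pipeline, so cross_val_score fits the
# imputer only on each training fold; no train-test contamination
leak_free_model = make_pipeline(SimpleImputer(),
                                RandomForestRegressor(n_estimators=100, random_state=0))
scores = -1 * cross_val_score(leak_free_model, X, y, cv=5,
                              scoring='neg_mean_absolute_error')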

Reference: sklearn 中的 Pipeline 机制 (the Pipeline mechanism in sklearn)
