Learn Topic 4: Introduction to Machine Learning
This topic mainly covers the following:
1. tackle data types often found in real-world datasets (missing values, categorical variables),
2. design pipelines to improve the quality of your machine learning code,
3. use advanced techniques for model validation (cross-validation),
4. build state-of-the-art models that are widely used to win Kaggle competitions (XGBoost), and
5. avoid common and important data science mistakes (leakage).
Lessons 1 & 2: Introduction and Exercises
These lessons mainly review the material from Topic 2. The difference is that, building on Topic 2, the model prediction workflow is wrapped in a function; they also show how to upload the generated predictions to Kaggle.
from sklearn.ensemble import RandomForestRegressor
# Define the models
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='mae', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)
models = [model_1, model_2, model_3, model_4, model_5]
from sklearn.metrics import mean_absolute_error
# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

for i in range(0, len(models)):
    mae = score_model(models[i])
    print("Model %d MAE: %d" % (i+1, mae))
import pandas as pd

# Fit the model to the training data
my_model.fit(X, y)
# Generate test predictions
preds_test = my_model.predict(X_test)
# Save predictions in format used for competition scoring
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
Lessons 3 & 4: Handling Missing Values
- After loading and reading in any dataset, the usual first step is to display its head() and tail(), then call describe() on the whole thing to get an intuitive feel for the data.
- Use head() and tail() to get familiar with each feature; for a real dataset make sure you understand what each one means.
- Use describe() for basic statistics such as count, mean, std, min, and max. Note that describe() and head() do not necessarily show the same columns: describe() only covers numeric columns (int, float), because statistics such as mean and std cannot be computed for str columns, so those are omitted.
- You can also describe a single feature with data["feature_"].describe() rather than data.feature.describe(). The per-feature description tells you whether the feature has missing data and whether its dtype is float64 or object, both of which matter for later processing.
- Once you have confirmed that a feature contains missing data (a quick exploration sketch follows this list), the lesson offers three ways to deal with it.
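A minimal exploration sketch along these lines; the file name train.csv and the column name LotArea are placeholders, not from the lesson:
import pandas as pd

# Placeholder file name; replace with the dataset you are working on
data = pd.read_csv("train.csv")

print(data.head())        # first few rows
print(data.tail())        # last few rows
print(data.describe())    # count/mean/std/min/max, numeric columns only

# Per-feature description: dtype, count (reveals missing values), statistics
print(data["LotArea"].describe())  # placeholder column name

# Number of missing values in each column, largest first
missing_counts = data.isnull().sum()
print(missing_counts[missing_counts > 0].sort_values(ascending=False))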
Drop: i.e., simply drop the columns that contain missing values.
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
You can also use dropna(), e.g. to drop the rows whose SalePrice is missing:
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
Impute: i.e., imputation. You can simply fill with the mean, or use more sophisticated methods such as regression imputation.
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
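For the regression-style imputation mentioned above, one option is scikit-learn's IterativeImputer. This is a sketch, not part of the course code, and it assumes X_train and X_valid contain only numeric columns:
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
from sklearn.impute import IterativeImputer

# Each column with missing values is modeled as a function of the other columns
reg_imputer = IterativeImputer(random_state=0)
reg_X_train = pd.DataFrame(reg_imputer.fit_transform(X_train),
                           columns=X_train.columns, index=X_train.index)
reg_X_valid = pd.DataFrame(reg_imputer.transform(X_valid),
                           columns=X_valid.columns, index=X_valid.index)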
Extend: i.e., extension. The extension is done on top of imputation, so imputation is still required, and note again how the columns are handled: the column names need to be added back.
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
Along the way, the Kaggle dataset Housing price in Beijing is used to introduce some other ways of preprocessing features.
Handling outliers
Before reading on, take a look at the raw data. Inspection shows that the features livingRoom, price, constructionTime, and floor all contain abnormal values: livingRoom contains "#NAME?", constructionTime contains "未知" (unknown), price contains implausibly small values such as 1-1000 yuan, and floor contains "钢混结构" (steel-concrete structure).
For the first three, use the drop strategy.
with_NAME_row=house_select_columns.index[house_select_columns["livingRoom"]=="#NAME?"].tolist()
house_select_columns=house_select_columns.drop(with_NAME_row,axis=0)
with_weizhi_row=house_select_columns.index[house_select_columns["constructionTime"]=="未知"].tolist()
house_select_columns=house_select_columns.drop(with_weizhi_row,axis=0)
with_lowprice_row=house_select_columns.index[house_select_columns["price"]<10000].tolist()  # drop everything priced below 10,000 yuan per square meter
house_select_columns=house_select_columns.drop(with_lowprice_row,axis=0)
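An equivalent and slightly more idiomatic pandas version of the same three drops, as a sketch (same DataFrame and thresholds as above):
keep_mask = (
    (house_select_columns["livingRoom"] != "#NAME?")
    & (house_select_columns["constructionTime"] != "未知")
    & (house_select_columns["price"] >= 10000)
)
house_select_columns = house_select_columns[keep_mask]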
One more note about the floor feature. Besides the abnormal value "钢混结构", the other values are strings like "高 27" or "低 7" (a level word plus the number of floors), so further processing is needed before they can be fed to the model. The approach here is to extract the number (the floor count) as the value of the floor feature and discard the Chinese characters. Of course, you could also build a separate feature to distinguish 顶/高/中/低/底 (top/high/middle/low/bottom), which would use the categorical-variable techniques introduced later.
import re

floor_list=[]
floor_copy=house_data["floor"]
for i in range(len(floor_copy)):
    if re.findall(r'(\d+)',str(floor_copy[i])):
        f1=re.findall(r'(\d+)',str(floor_copy[i]))
        f1_int=int(f1[0])
        floor_list.append(f1_int)
    else:
        floor_list.append(-1)  # 32 rows contain no digits; mark them with -1
house_data["floor"]=floor_list
Data type conversion
Since livingRoom, drawingRoom, and bathRoom all have dtype object, they need to be converted to float.
house_select_columns["drawingRoom"]=house_select_columns["drawingRoom"].astype("float")
house_select_columns["bathRoom"]=house_select_columns["bathRoom"].astype("float")
house_select_columns["livingRoom"]=house_select_columns["livingRoom"].astype("float")
To select only the non-object columns:
X = X_full.select_dtypes(exclude=['object'])
Handling dates
The tradeTime feature is a string like "2017-01-01". To feed it to the model it needs to be converted to a timestamp. Since raw timestamps are large numbers that add to the computation, we subtract the timestamp of a fixed date such as "2017-01-01" and divide by the number of seconds in a day (86400). What we actually get is the number of days between tradeTime and January 1, 2017.
import time

timeStamp=[]
timeSub=time.strptime("2017-01-01", "%Y-%m-%d")
stampSub= int(time.mktime(timeSub))
tradeTime_copy=house_data["tradeTime"]
for i in range(len(house_data["tradeTime"])):
    timeArray = time.strptime(str(tradeTime_copy[i]), "%Y-%m-%d")
    stamp= (int(stampSub-time.mktime(timeArray)))/86400
    timeStamp.append(stamp)
house_data["tradeTime"]=timeStamp
Of course, there are many other tricks for handling features. It pays to be fluent with the various operations on the pandas DataFrame type and with IPython's Tab completion.
Lessons 5 & 6: Categorical Variables
A categorical variable is a variable that takes only a fixed, small set of values. Such variables are usually not numeric, so they cannot be fed to the model as-is. The lesson introduces several ways to handle them.
- Drop them (Drop Categorical Variables).
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
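The label-encoding and one-hot-encoding snippets below refer to object_cols, the list of columns with dtype object. One common way to build it, as in the course notebooks:
# List of categorical (object-dtype) columns
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)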
- Label encoding (Label Encoding).
from sklearn.preprocessing import LabelEncoder
# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])
- One-hot encoding (One-Hot Encoding).
from sklearn.preprocessing import OneHotEncoder
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
# We set handle_unknown='ignore' to avoid errors when the validation data
# contains classes that aren't represented in the training data, and setting
# sparse=False ensures that the encoded columns are returned as a numpy array
# (instead of a sparse matrix).
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
To get the name of each categorical column and the number of categories it contains, stored in a dict:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))
# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])
Lessons 7 & 8: Pipelines
A pipeline bundles and manages all the steps as one streaming workflow (streaming workflows with pipelines), so that the fitted parameters can be reused on new data such as the test set. Although many people manage without pipelines, they have the following advantages:
- Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
- Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
- Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
- More Options for Model Validation: You will see an example in the next tutorial, which covers cross-validation.
An example:
1. Define the preprocessing steps
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
2. Define the model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)
3. Create and evaluate the pipeline
from sklearn.metrics import mean_absolute_error
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                              ])
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
A Pipeline object accepts a list of 2-tuples. The first element of each tuple is an arbitrary identifier string, which we use to access the individual elements of the Pipeline object; the second element is a compatible scikit-learn transformer or estimator.
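A short sketch of what those identifier strings give you: the fitted steps can be retrieved from the pipeline by name, and the whole pipeline is evaluated like any single estimator (it reuses my_pipeline, preds, and y_valid from above):
from sklearn.metrics import mean_absolute_error

# Access individual fitted steps via their identifier strings
fitted_preprocessor = my_pipeline.named_steps['preprocessor']
fitted_model = my_pipeline.named_steps['model']

# Evaluate the pipeline's predictions on the validation data
print("MAE:", mean_absolute_error(y_valid, preds))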
The execution flow (shown as a figure in the lesson) resembles the layered structure of a deep neural network.
Lessons 9 & 10: Cross-Validation
Principle
In cross-validation the data is split into several folds (for example 5); each fold in turn is held out as the validation set while the model is trained on the remaining folds, and the validation scores are then averaged to give a more reliable estimate of model quality.
When to use cross-validation
- For small datasets, where extra computational burden isn't a big deal, you should run cross-validation.
- For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have enough data that there's little need to re-use some of it for holdout.
There's no simple threshold for what constitutes a large vs. small dataset. But if your model takes a couple minutes or less to run, it's probably worth switching to cross-validation.
Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each experiment yields the same results, a single validation set is probably sufficient.
from sklearn.model_selection import cross_val_score
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,  # set the number of folds with the cv parameter
                              scoring='neg_mean_absolute_error')
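As in the course, a typical follow-up is to inspect the per-fold scores and their average:
print("MAE scores:\n", scores)
print("Average MAE score (across folds):", scores.mean())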
Lessons 11 & 12: XGBoost

Ensemble methods include bagging and boosting. The RandomForest introduced earlier is a form of bagging, while the XGBoost introduced in this lesson is a form of boosting; the full name is extreme gradient boosting.

from xgboost import XGBRegressor
my_model = XGBRegressor()
my_model.fit(X_train, y_train)
Note that xgboost is not part of sklearn; sklearn has its own GradientBoostingClassifier.
Several XGBRegressor parameters can be tuned: n_estimators, learning_rate, and n_jobs.
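A sketch of tuning those three parameters; the values here are illustrative, not recommendations from the lesson:
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Illustrative values for n_estimators, learning_rate and n_jobs
my_model = XGBRegressor(n_estimators=500, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train)
predictions = my_model.predict(X_valid)
print("MAE:", mean_absolute_error(y_valid, predictions))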
Lessons 13 & 14: Data Leakage
Definition: Data leakage (or leakage) happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction.
Types: data leakage comes in two forms: target leakage and train-test contamination.
Target leakage means that some features in the training data effectively already reveal the target. For example, when predicting house prices, if both the floor area and the price per square meter are fed to the model as features, the model will look excellent on the validation data but perform terribly on the test data, because the price per square meter will not be available when the model is actually used for prediction.
Train-test contamination is easiest to understand through an example:
For example, imagine you run preprocessing (like fitting an imputer for missing values) before calling train_test_split(). The end result? Your model may get good validation scores, giving you great confidence in it, but perform poorly when you deploy it to make decisions.
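A minimal sketch of the fix: split first, then fit any preprocessing on the training portion only (or, equivalently, put the preprocessing inside a pipeline as in Lessons 7 & 8). It assumes X and y are the full feature matrix and target:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Split before any preprocessing
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)  # fit on training data only
X_valid_imputed = imputer.transform(X_valid)      # reuse the training-set statistics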