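Pipelines
Pipelines bundle preprocessing and modeling steps together, which keeps code simpler and better organized. Although some data scientists do not use pipelines, they offer some important benefits:
- Cleaner code: Accounting for the data at each step of preprocessing can get messy. With a pipeline, you do not need to track it manually at each step.
- Easier to productionize: It can be hard to move a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
- More options for model validation: for example, cross-validation.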
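To make the last benefit concrete, here is a minimal sketch of cross-validating a pipeline. The synthetic data, the `mean` imputation strategy, and the fold count are assumptions chosen only for illustration; the pieces used here are introduced in the steps that follow.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration): 200 rows, 3 numerical features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan  # blank out roughly 10% of the values

pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn scorers follow "higher is better", so MAE comes back negated
scores = -cross_val_score(pipeline, X, y, cv=5,
                          scoring='neg_mean_absolute_error')
print('MAE per fold:', scores)
print('Mean MAE:', scores.mean())
```

Because the imputer lives inside the pipeline, it is refit on the training portion of each fold, so no information from the held-out fold leaks into preprocessing.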
Step 1: Define Preprocessing Steps
Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps.
The code below:
- imputes missing values in numerical data, and
- imputes missing values and applies a one-hot encoding to categorical data.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
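# Bundle preprocessing for numerical and categorical data
# (numerical_cols and categorical_cols are lists of column names,
# assumed to be defined before this point)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])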
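The code above relies on two lists of column names, numerical_cols and categorical_cols. As a rough sketch, one way to build them is by dtype. The toy DataFrame and its column names below are hypothetical and exist only so that the preprocessor can be run on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the column names are made up for illustration
X_toy = pd.DataFrame({
    'Rooms': [2.0, 3.0, np.nan, 4.0],
    'Landsize': [150.0, np.nan, 300.0, 420.0],
    'Type': pd.Series(['h', 'u', np.nan, 'h'], dtype='object'),
})

# One possible way to choose the columns: split by dtype
numerical_cols = X_toy.select_dtypes(include='number').columns.tolist()
categorical_cols = X_toy.select_dtypes(exclude='number').columns.tolist()

# strategy='constant' fills missing numerical values with 0 by default
numerical_transformer = SimpleImputer(strategy='constant')

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols),
    ])

# Two numerical columns plus one one-hot column per category ('h', 'u')
X_prepared = preprocessor.fit_transform(X_toy)
print(X_prepared.shape)  # (4, 4)
```

Selecting by dtype is a convenience, not a requirement: any explicit list of column names works equally well.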
Step 2: Define the Model
Next, we define a random forest model with the familiar RandomForestRegressor class.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)
Step 3: Create and Evaluate the Pipeline
Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:
- With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)
- With the pipeline, we supply the unprocessed features in X_valid to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)
from sklearn.metrics import mean_absolute_error
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
MAE: 160679.18917034855
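To see what the pipeline saves us, the sketch below runs the same workflow twice, once by hand and once through a pipeline, and checks that the predictions agree. The synthetic data, the column names, and the make_preprocessor helper are assumptions made for illustration; they are not the dataset behind the MAE shown above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data with hypothetical column names
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    'Rooms': rng.integers(1, 6, size=n).astype(float),
    'Landsize': rng.uniform(50, 500, size=n),
    'Type': pd.Series(rng.choice(['h', 'u', 't'], size=n), dtype='object'),
})
y = 100_000 * X['Rooms'] + 500 * X['Landsize'] + rng.normal(scale=10_000, size=n)
X.loc[rng.random(n) < 0.1, 'Landsize'] = np.nan  # add missing values
X.loc[rng.random(n) < 0.1, 'Type'] = np.nan

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

def make_preprocessor():
    """Hypothetical helper: build a fresh, unfitted preprocessor."""
    return ColumnTransformer(transformers=[
        ('num', SimpleImputer(strategy='constant'), ['Rooms', 'Landsize']),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore')),
        ]), ['Type']),
    ])

# Without a pipeline: every step is carried out by hand
manual_prep = make_preprocessor()
manual_model = RandomForestRegressor(n_estimators=100, random_state=0)
X_train_prepared = manual_prep.fit_transform(X_train)  # fit on training data only
X_valid_prepared = manual_prep.transform(X_valid)      # easy to forget this step
manual_model.fit(X_train_prepared, y_train)
manual_preds = manual_model.predict(X_valid_prepared)

# With a pipeline: the same work in one fit() and one predict()
pipeline = Pipeline(steps=[
    ('preprocessor', make_preprocessor()),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0)),
])
pipeline.fit(X_train, y_train)
pipeline_preds = pipeline.predict(X_valid)

print('Manual MAE:  ', mean_absolute_error(y_valid, manual_preds))
print('Pipeline MAE:', mean_absolute_error(y_valid, pipeline_preds))
print('Same predictions:', np.allclose(manual_preds, pipeline_preds))
```

Both routes do the same computation; the pipeline version simply removes the chance of skipping or misordering the preprocessing of the validation data.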
Conclusion
Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.
Your Turn
In the next exercise, use a pipeline to apply advanced data preprocessing techniques and improve your predictions!