Kaggle|Courses|Pipelines

2020-04-22  十二支箭

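Pipelines.
A pipeline bundles the preprocessing and modeling steps, which keeps code simpler and better organized. Some data scientists do not use pipelines, but pipelines offer several important benefits:
-Cleaner code: accounting for the data at every preprocessing step gets messy. With a pipeline, you do not need to keep track of it manually at each step.
-Easier to productionize: it can be hard to move a model from a prototype to something deployable at scale. We will not go into the many related concerns here, but pipelines can help.
-More options for model validation: for example, cross-validation.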
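To illustrate the last benefit, here is a minimal sketch of cross-validating a whole pipeline with scikit-learn's cross_val_score. The data is synthetic and the name toy_pipeline is hypothetical; neither comes from the course. Because the imputer sits inside the pipeline, it is refit on the training part of each fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration only)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(60, 3))
y = X[:, 0] * 3.0 + rng.normal(0, 0.5, size=60)
X[::5, 1] = np.nan  # inject some missing values into the second column

# Imputation and model bundled so they can be validated as one unit
toy_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports negated MAE (higher is better), so flip the sign back
scores = -1 * cross_val_score(toy_pipeline, X, y, cv=5,
                              scoring='neg_mean_absolute_error')
print('MAE per fold:', scores)
print('Average MAE:', scores.mean())
```

Without a pipeline, the imputation would have to be repeated by hand for every fold.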
Step 1: Define Preprocessing Steps
Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps.
The code below:

-imputes missing values in numerical data, and

-imputes missing values and applies a one-hot encoding to categorical data.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data: fill missing values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
# (numerical_cols and categorical_cols are lists of column names, assumed to be defined earlier)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
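
To see what such a preprocessor produces, the following sketch applies the same transformers to a tiny hand-made table. The DataFrame toy and its column names 'rooms' and 'type' are hypothetical, and the column lists are written out by hand rather than derived from a real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy table with one missing value in each column
toy = pd.DataFrame({
    'rooms': [3.0, np.nan, 2.0, 4.0],
    'type': pd.Series(['house', 'flat', np.nan, 'house'], dtype=object),
})
numerical_cols = ['rooms']      # assumed column lists for this toy table
categorical_cols = ['type']

numerical_transformer = SimpleImputer(strategy='constant')
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

result = preprocessor.fit_transform(toy)
# The output may be a sparse matrix; convert to a dense array for printing
if hasattr(result, 'toarray'):
    result = result.toarray()
print(result)
```

In the result, the first column holds 'rooms' with the missing value replaced by 0, and the remaining columns are the one-hot encoding of 'type' after the missing entry was filled with the most frequent value, 'house'.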

Step 2: Define the Model

Next, we define a random forest model with the familiar RandomForestRegressor class.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)
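
As a side note on the random_state argument, the sketch below fits two forests with the same seed on synthetic data and checks that their predictions match. The names X_toy, y_toy, model_a and model_b are hypothetical and used only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for illustration only)
rng = np.random.RandomState(42)
X_toy = rng.uniform(0, 10, size=(50, 2))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(0, 1, size=50)

# Two forests built with the same settings and the same seed
model_a = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_toy, y_toy)
model_b = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_toy, y_toy)

same = np.allclose(model_a.predict(X_toy), model_b.predict(X_toy))
print('Identical predictions with the same random_state:', same)
```

Fixing the seed makes a reported score such as an MAE reproducible from run to run.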

Step 3: Create and Evaluate the Pipeline

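Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:

-With the pipeline, a single call to fit() preprocesses the training data and fits the model.

-We pass the unprocessed features in X_valid to predict(), and the pipeline preprocesses them automatically before generating predictions.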

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocess the training data and fit the model
my_pipeline.fit(X_train, y_train)

# Preprocess the validation data and get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)


MAE: 160679.18917034855
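
For readers without the course dataset, here is a self-contained sketch of the same three steps on hypothetical toy data. The columns 'area' and 'kind' and all values are invented for illustration, so the resulting MAE says nothing about the real data. The validation set deliberately contains a missing number and a category never seen in training, to show that the pipeline handles raw input.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-ins for X_train, y_train, X_valid, y_valid
X_train = pd.DataFrame({
    'area': [50.0, 80.0, np.nan, 120.0, 65.0, 150.0],
    'kind': pd.Series(['flat', 'house', 'flat', 'house', np.nan, 'house'], dtype=object),
})
y_train = pd.Series([100.0, 180.0, 140.0, 260.0, 120.0, 320.0])
X_valid = pd.DataFrame({
    'area': [70.0, np.nan],
    'kind': pd.Series(['flat', 'cabin'], dtype=object),  # 'cabin' is unseen in training
})
y_valid = pd.Series([130.0, 200.0])

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='constant'), ['area']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['kind']),
])
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0)),
])

# One call preprocesses the training data and fits the model
my_pipeline.fit(X_train, y_train)

# Raw validation rows go straight in; the unseen category is encoded as all zeros
preds = my_pipeline.predict(X_valid)
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
```

The setting handle_unknown='ignore' is what lets predict() accept the unseen 'cabin' value instead of raising an error.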

Conclusion

Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.

Your Turn

Use a pipeline in the next exercise to use advanced data preprocessing techniques and improve your predictions!
