Hyperparameter Tuning for Machine Learning Models in Python

2022-01-11  JeremyL

1. Building a random forest regression model with scikit-learn's RandomForestRegressor

sklearn.ensemble.RandomForestRegressor — scikit-learn 1.0.2 documentation

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=10,
                       random_state=0, shuffle=False)
model = RandomForestRegressor(random_state=0)
model.fit(X, y)

print(model.predict([list(range(10))]))

# Inspect the hyperparameters currently in use (all defaults except random_state)
model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 0,
 'verbose': 0,
 'warm_start': False}
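
Before tuning any of these parameters, it helps to have a baseline to compare against. The following is a minimal sketch, assuming 5-fold cross-validation and a reduced `n_estimators=20` chosen only to keep the run short; neither value comes from the original example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Same synthetic data as in the example above.
X, y = make_regression(n_samples=1000, n_features=10,
                       random_state=0, shuffle=False)

# Assumption: fewer trees than the default 100, only to shorten the run.
baseline = RandomForestRegressor(n_estimators=20, random_state=0)

# cross_val_score uses the estimator's default scorer, which is R^2 for regressors.
scores = cross_val_score(baseline, X, y, cv=5)
print("mean R^2:", scores.mean(), "std:", scores.std())
```

Any tuned model in the following sections can be compared against this cross-validated score rather than against a score computed on the training data.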

2. Grid search with GridSearchCV() in scikit-learn

class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)

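Parameters:
estimator: the model, usually a scikit-learn estimator; a custom model must implement the scikit-learn estimator interface
param_grid: the parameter space, as a dict mapping parameter names to lists of candidate values
scoring: the metric used to evaluate each candidate on the held-out fold during cross-validation
n_jobs: number of jobs to run in parallel; -1 means use all processors
refit: whether to retrain the model on the whole dataset with the best parameter combination once the search is finished
cv: the cross-validation splitting strategy, usually the number of folds
verbose: controls how much is printed; >1: the computation time for each fold and parameter candidate; >2: the score as well; >3: the fold and candidate indexes together with the start time of the computation
pre_dispatch: number of jobs dispatched during parallel execution; default='2*n_jobs'
error_score: the value assigned to the score if an error occurs while fitting the estimator
return_train_score: whether to also report scores on the training folds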

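Attributes:
cv_results_: details and scores for every parameter combination
best_estimator_: the best-performing model; only available when refit=True
best_score_: the mean cross-validated score of best_estimator_
best_params_: the parameter combination of best_estimator_
...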
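from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)
model = RandomForestRegressor(random_state=0)

# 'poisson' is left out of the grid: it requires non-negative targets,
# and make_regression produces negative y values, so those fits would fail.
param_grid = {'criterion': ['squared_error', 'absolute_error'],
              'n_estimators': range(10, 12),
              'max_depth': [5, 10, 15],
              'min_samples_leaf': [2, 3, 5]}

# Evaluate every combination in the grid with 6-fold cross-validation
model_grid = GridSearchCV(model, param_grid=param_grid, cv=6)
model_grid.fit(X, y)

print(model_grid.best_estimator_, model_grid.best_params_,
      model_grid.best_score_, sep="\n")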

RandomForestRegressor(max_depth=15, min_samples_leaf=2, random_state=0)
{'min_samples_leaf': 2, 'max_depth': 15, 'criterion': 'squared_error'}
0.715165941558444
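
The `scoring`, `return_train_score`, and `cv_results_` entries described above are easier to follow with a concrete run. The sketch below is illustrative only: the small grid, `cv=3`, and `n_estimators=10` are assumptions chosen to keep it fast, not values from the original article.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)

# A deliberately small grid (assumed values) so the search finishes quickly.
param_grid = {"max_depth": [5, 10], "min_samples_leaf": [2, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # scorers are "higher is better", so MSE is negated
    cv=3,
    return_train_score=True,           # adds mean_train_score to cv_results_
)
search.fit(X, y)

results = search.cv_results_
# rank_test_score is 1 for the best candidate, so argsort lists the best first.
for i in np.argsort(results["rank_test_score"]):
    print(results["params"][i],
          "test:", round(results["mean_test_score"][i], 1),
          "train:", round(results["mean_train_score"][i], 1))
```

A large gap between the train and test columns for a candidate is a hint that the corresponding setting overfits.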

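3. Random search with RandomizedSearchCV() in scikit-learn

RandomizedSearchCV() is used in much the same way as GridSearchCV(). Two parameters need attention: param_distributions, which gives the lists or distributions to sample each parameter from, and n_iter, which sets how many parameter combinations are sampled.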

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)
model = RandomForestRegressor(random_state=0)
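# 'poisson' is left out: it requires non-negative targets,
# and make_regression produces negative y values.
param_grid = {'criterion': ['squared_error', 'absolute_error'],
              'n_estimators': range(10, 12),
              'max_depth': [5, 10, 15],
              'min_samples_leaf': [2, 3, 5]}

# Sample n_iter=10 combinations at random instead of trying all of them
model_grid = RandomizedSearchCV(model, param_distributions=param_grid,
                                cv=6, n_iter=10)

model_grid.fit(X, y)

print(model_grid.best_estimator_, model_grid.best_params_,
      model_grid.best_score_, sep="\n")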

RandomForestRegressor(max_depth=5, min_samples_leaf=2, n_estimators=11,
                      random_state=0)
{'n_estimators': 11, 'min_samples_leaf': 2, 'max_depth': 5, 'criterion': 'squared_error'}
0.6628360955299456
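
The example above samples from fixed lists, but `param_distributions` also accepts distribution objects that expose an `rvs` method, such as those in `scipy.stats`. The following sketch assumes integer ranges picked for illustration, and it fixes `random_state` on the search itself so that the same candidates are drawn on every run.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)

# randint(low, high) draws integers from low to high - 1 (assumed ranges).
param_distributions = {
    "n_estimators": randint(10, 31),
    "max_depth": randint(3, 16),
    "min_samples_leaf": randint(1, 6),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,        # number of sampled parameter combinations
    cv=3,
    random_state=0,  # makes the sampled candidates reproducible
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

With distributions, the cost of the search is controlled by `n_iter` alone and no longer grows with the size of the parameter space.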

4. Bayesian optimization

#Install
pip install bayesian-optimization

#Import
from bayes_opt import BayesianOptimization
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)

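# Objective function: bayes_opt passes every parameter as a float,
# so integer hyperparameters are cast with int().
def objective(n_estimators, max_depth, min_samples_leaf):
    forest = RandomForestRegressor(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        min_samples_leaf=int(min_samples_leaf),
        random_state=10
    )
    forest.fit(X, y)
    # Note: this is R^2 on the training data, so it rewards overfitting.
    return forest.score(X, y)

optimizer = BayesianOptimization(
    f=objective,
    pbounds={'n_estimators': (10, 31),
             'max_depth': (5, 13),
             'min_samples_leaf': (5, 31)}
)

optimizer.maximize(
    init_points=5,  # number of random initial points
    n_iter=10,      # number of Bayesian optimization steps
)

# Best target value and parameter combination found
print(optimizer.max)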

{'target': 0.885985055893347, 'params': {'max_depth': 13.0, 'min_samples_leaf': 5.0, 'n_estimators': 31.0}}

# optimizer.res lists every evaluated point (optimizer.max holds only the best one)
for i, res in enumerate(optimizer.res):
    print("Iteration {}: \n\t{}".format(i, res))
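
The objective in this section scores the forest on the same data it was trained on, which is why the best result sits at the largest `max_depth` and smallest `min_samples_leaf` allowed by the bounds. A possible alternative, sketched here under the assumption that 3-fold cross-validation is an acceptable cost, returns the mean cross-validated R^2 instead. The name `cv_objective` is hypothetical, and the function is called directly rather than through bayes_opt so that the sketch only depends on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)

def cv_objective(n_estimators, max_depth, min_samples_leaf):
    """Mean 3-fold cross-validated R^2 for one hyperparameter setting."""
    # The optimizer proposes floats, so integer-valued parameters are cast.
    forest = RandomForestRegressor(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        min_samples_leaf=int(min_samples_leaf),
        random_state=10,
    )
    return cross_val_score(forest, X, y, cv=3).mean()

# Called with float inputs, the way an optimizer would call it.
print(cv_objective(n_estimators=20.7, max_depth=8.2, min_samples_leaf=5.9))
```

A function with this signature can be passed to `BayesianOptimization` as its first argument with the same bounds as above; the reported target is then an estimate of out-of-sample performance.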

5. hyperopt

hyperopt/hyperopt: Distributed Asynchronous Hyperparameter Optimization in Python (github.com)

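Hyperopt is built around the fmin() function, which searches for the minimum of the objective, not the maximum. If a larger output is better for your model, return 1 - score or the negated score (-score) instead.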

pip install hyperopt
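import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from hyperopt import fmin, tpe, hp, Trials, space_eval
from sklearn import metrics

np.random.seed(1)
X, y = make_regression(n_samples=2000, n_features=10,
                       random_state=0, shuffle=False)

# Objective: train on the first 1500 samples, return the MSE on the last 500
def hyperparameter_tuning(params):
    # hp.quniform yields floats, but max_depth must be an integer
    params = {**params, "max_depth": int(params["max_depth"])}
    clf = RandomForestRegressor(**params, n_jobs=-1, random_state=0)
    clf.fit(X[:1500], y[:1500])
    mse = metrics.mean_squared_error(y[1500:2000], clf.predict(X[1500:2000]))
    return mse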

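# Initialize the Trials object, which records every evaluation
trials = Trials()

# Search space
space = {
    "n_estimators": hp.choice("n_estimators", list(range(5, 15, 5))),
    "criterion": hp.choice("criterion", ["squared_error", "absolute_error"]),
    "max_depth": hp.quniform("max_depth", 10, 12, 1)
}

best = fmin(
    fn=hyperparameter_tuning,
    space=space,
    algo=tpe.suggest,  # Tree of Parzen Estimators (TPE); atpe.suggest selects Adaptive TPE
    max_evals=10,
    trials=trials
)

# For hp.choice parameters, best holds the index of the chosen option;
# space_eval maps it back to the actual values.
print("Best: {}".format(best))
print(space_eval(space, best))

# Per-evaluation records kept by the Trials object
trials.trials      # full record of each trial
trials.results     # values returned by the objective
trials.losses()    # loss of each trial
trials.statuses()  # status string of each trial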
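
The note at the start of this section about fmin() minimizing its objective matters as soon as the metric is one where higher is better, such as R^2. As a rough sketch, an objective can negate the score and return it in the dictionary form that fmin() also accepts. The name `r2_objective` and the 3-fold cross-validation are assumptions for illustration, and the status is written as the literal string "ok" (the value of `hyperopt.STATUS_OK`) so that the sketch runs without hyperopt installed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10,
                       random_state=0, shuffle=False)

def r2_objective(params):
    """Return a loss dictionary for a higher-is-better metric (R^2)."""
    forest = RandomForestRegressor(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),  # quniform-style floats are cast
        random_state=0,
    )
    r2 = cross_val_score(forest, X, y, cv=3).mean()
    # fmin minimizes "loss", so the R^2 score is negated.
    return {"loss": -r2, "status": "ok"}

# Called directly with one sample point, the way fmin would call it.
print(r2_objective({"n_estimators": 10, "max_depth": 10.0}))
```

Passing such a function as `fn` to fmin() makes the search favor the parameters with the highest cross-validated R^2, and the returned dictionaries are what `trials.results` collects.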

References

sklearn.ensemble
Hyperopt: Distributed Hyperparameter Optimization
Bayesian Optimization
