XGBoost Machine Learning: Grid Search Cross-Validation in Python 3
Parameter tuning with GridSearchCV: based on the results of cross-validation, we can select the model with the best parameters.
–Given a grid of candidate values for the parameters to tune, GridSearchCV evaluates a model for each parameter combination and returns the best model together with its parameters.
Model evaluation summary
•k-fold cross-validation (k = 3, 5, 10) is usually the gold standard for evaluating machine learning models
•When there are many classes, or the number of samples per class is imbalanced, use stratified cross-validation
•When the training set is so large that a train/test split gives a low-bias estimate of model performance, or when model training is slow, use a train/test split
•For a given problem, look for a technique that is fast and still gives a reasonable performance estimate
•If in doubt, use 10-fold cross-validation for regression and stratified 10-fold cross-validation for classification; a short sketch of the latter follows this list
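A minimal sketch of that last rule, assuming scikit-learn's model_selection API and synthetic data from make_classification (neither appears in the original tutorial):
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier
# Synthetic binary classification data, purely for illustration
X, y = make_classification(n_samples=200, random_state=7)
# Stratified 10-fold CV preserves the class proportions of y in every fold
scores = cross_val_score(XGBClassifier(), X, y,
                         scoring='accuracy', cv=StratifiedKFold(n_splits=10))
print("Accuracy: %.2f%% (+/- %.2f%%)" % (scores.mean() * 100, scores.std() * 100))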
# Run the example program that ships with the xgboost package
from xgboost import XGBClassifier
# Module for loading data in LibSVM format
from sklearn.datasets import load_svmlight_file
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives
# in sklearn.model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
# Read in the data. It ships under the demo directory of the xgboost
# installation; here it has been copied into the data directory of the code.
my_workpath = 'C:/Users/zdx/xgboost/demo/data/'
X_train, y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')
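# (Added sketch, not in the original code) load_svmlight_file returns a SciPy
# CSR sparse feature matrix and a dense label array; a quick shape check
# verifies that the agaricus (mushroom) data loaded as expected.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)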
# Specify the model parameters. The sklearn wrapper takes keyword arguments
# (learning_rate rather than the native xgboost name eta); the silent flag is
# dropped because recent xgboost versions use verbosity instead.
bst = XGBClassifier(max_depth=2, learning_rate=0.1,
                    objective='binary:logistic')
# Tune the number of boosting rounds via grid search
param_test = {
    'n_estimators': list(range(1, 51, 1))  # number of weak learners to try
}
clf = GridSearchCV(estimator=bst, param_grid=param_test, scoring='accuracy', cv=5)
clf.fit(X_train, y_train)
# grid_scores_ was removed in scikit-learn 0.20; the per-candidate results are
# now available in cv_results_
print(clf.best_params_, clf.best_score_)
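# (Added sketch, not in the original code) reuse the pyplot import above to
# visualize mean cross-validated accuracy against the number of boosting rounds.
pyplot.plot(param_test['n_estimators'], clf.cv_results_['mean_test_score'])
pyplot.xlabel('n_estimators')
pyplot.ylabel('mean CV accuracy')
pyplot.show()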
# Make predictions on the test set. With the default refit=True, clf.predict
# uses the best estimator refit on the whole training set.
preds = clf.predict(X_test)
# XGBClassifier.predict already returns class labels, so no rounding is needed
test_accuracy = accuracy_score(y_test, preds)
print("Test Accuracy of GridSearchCV: %.2f%%" % (test_accuracy * 100.0))