XGBoost Machine Learning: Grid Search Cross-Validation in Python 3
Parameter tuning with GridSearchCV: based on the results of cross-validation, we can select the model with the best parameters.
–Given a grid of candidate values for the parameters to tune, GridSearchCV evaluates a model for each parameter combination and returns the best model together with its parameters.
Model evaluation summary
•k-fold cross-validation (k = 3, 5, 10) is usually the gold standard for evaluating machine learning models
•When there are many classes, or the number of samples per class is imbalanced, use stratified cross-validation
•When the training set is so large that a train/test split gives a low-bias estimate of model performance, or when model training is slow, use a train/test split
•For a given problem, look for a technique that is fast and still gives a reasonable performance estimate
•If in doubt, use 10-fold cross-validation for regression and stratified 10-fold cross-validation for classification; a short sketch of the latter follows this list
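A minimal sketch of that last rule, assuming scikit-learn's model_selection API and synthetic data from make_classification (neither appears in the original tutorial):
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier
# Synthetic binary classification data, purely for illustration
X, y = make_classification(n_samples=200, random_state=7)
# Stratified 10-fold CV preserves the class proportions of y in every fold
scores = cross_val_score(XGBClassifier(), X, y,
                         scoring='accuracy', cv=StratifiedKFold(n_splits=10))
print("Accuracy: %.2f%% (+/- %.2f%%)" % (scores.mean() * 100, scores.std() * 100))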
# Run the example program that ships with the xgboost package
from xgboost import XGBClassifier
# Module for loading data in LibSVM format
from sklearn.datasets import load_svmlight_file
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives
# in sklearn.model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
# Read in the data. It ships under the demo directory of the xgboost
# installation; here it has been copied into the data directory of the code.
my_workpath = 'C:/Users/zdx/xgboost/demo/data/'
X_train, y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')
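# (Added sketch, not in the original code) load_svmlight_file returns a SciPy
# CSR sparse feature matrix and a dense label array; a quick shape check
# verifies that the agaricus (mushroom) data loaded as expected.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)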
# Specify the model parameters. The sklearn wrapper takes keyword arguments
# (learning_rate rather than the native xgboost name eta); the silent flag is
# dropped because recent xgboost versions use verbosity instead.
bst = XGBClassifier(max_depth=2, learning_rate=0.1,
                    objective='binary:logistic')
# Tune the number of boosting rounds via grid search
param_test = {
    'n_estimators': list(range(1, 51, 1))  # number of weak learners to try
}
clf = GridSearchCV(estimator=bst, param_grid=param_test, scoring='accuracy', cv=5)
clf.fit(X_train, y_train)
# grid_scores_ was removed in scikit-learn 0.20; the per-candidate results are
# now available in cv_results_
print(clf.best_params_, clf.best_score_)
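# (Added sketch, not in the original code) reuse the pyplot import above to
# visualize mean cross-validated accuracy against the number of boosting rounds.
pyplot.plot(param_test['n_estimators'], clf.cv_results_['mean_test_score'])
pyplot.xlabel('n_estimators')
pyplot.ylabel('mean CV accuracy')
pyplot.show()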
# Make predictions on the test set. With the default refit=True, clf.predict
# uses the best estimator refit on the whole training set.
preds = clf.predict(X_test)
# XGBClassifier.predict already returns class labels, so no rounding is needed
test_accuracy = accuracy_score(y_test, preds)
print("Test Accuracy of GridSearchCV: %.2f%%" % (test_accuracy * 100.0))