How the Bagging Algorithm Works

2017-03-01

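Bootstrap aggregation (bagging) relies on a sampling technique called the bootstrap. Bootstrap sampling is commonly used to generate sample statistics from a moderate-sized data set. A (non-parametric) bootstrap sample is a random selection of elements from the data set drawn with replacement (that is, a bootstrap sample can contain multiple copies of the same row from the original data). Bagging draws a series of bootstrap samples from the training data set and then trains a base learner on each one. For a regression problem, the ensemble's result is the average of the base learners' predictions. For a classification problem, the result is either a set of class probabilities derived from the percentage of base learners predicting each class, or the class predicted most often. Listing 6-4 shows how to apply bagging to the synthetic data problem introduced at the beginning of this chapter.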
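The sampling and aggregation steps described above can be sketched in a few lines of plain Python. This is only an illustration: the function names `bootstrap_sample`, `bag_regression`, and `bag_classification` are hypothetical, and drawing as many rows as the data set contains is an assumption (the listing in this section draws only half as many).

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (assumed sample size)."""
    return [rng.choice(rows) for _ in rows]

def bag_regression(predictions):
    """Regression: average the base learners' numeric predictions."""
    return sum(predictions) / len(predictions)

def bag_classification(votes):
    """Classification: per-class vote fractions plus the majority class."""
    counts = Counter(votes)
    probs = {label: n / len(votes) for label, n in counts.items()}
    majority = counts.most_common(1)[0][0]
    return probs, majority

rng = random.Random(0)
rows = list(range(10))

# Because sampling is with replacement, some rows may repeat and others may be absent
sample = bootstrap_sample(rows, rng)

# Three hypothetical base-learner outputs for one regression example
avg = bag_regression([0.25, 0.5, 0.75])

# Four hypothetical base-learner votes for one classification example
probs, majority = bag_classification(["a", "b", "a", "a"])
```

Here `avg` is the ensemble's regression output, `probs` maps each class to the fraction of base learners that voted for it, and `majority` is the class with the most votes.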

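The code holds out 30% of the data as a test set instead of using cross-validation. The parameter numTreesMax sets the maximum number of decision trees in the ensemble. The code builds the ensemble from the first tree alone, then the first two trees, the first three, and so on up to numTreesMax trees, so you can see how prediction accuracy depends on the number of trees. The trained models are stored in a list, along with each model's predictions on the test data, which are used to evaluate the test error. The code produces two plots. The first shows how mean squared error changes as trees are added to the ensemble. The second compares the prediction of the first tree alone with the averaged predictions of the first 10 trees and the first 20 trees. This comparison resembles the earlier plots of prediction curves against the actual labels.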

Listing 6-4: Bootstrap Aggregation (Bagging) - simpleBagging.py

__author__ = 'mike-bowles'

import numpy
import matplotlib.pyplot as plot
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from math import floor
import random

#Build a simple data set with y = x + random
nPoints = 1000

#x values for plotting
xPlot = [(float(i)/float(nPoints) - 0.5) for i in range(nPoints + 1)]

#x needs to be list of lists.
x = [[s] for s in xPlot]

#y (labels) has random noise added to x-value
#set seeds for both random number generators used below
random.seed(1)
numpy.random.seed(1)
y = [s + numpy.random.normal(scale=0.1) for s in xPlot]

#take fixed test set 30% of sample
nSample = int(nPoints * 0.30)
idxTest = random.sample(range(nPoints), nSample)
idxTest.sort()
idxTrain = [idx for idx in range(nPoints) if not(idx in idxTest)]

#Define test and training attribute and label sets
xTrain = [x[r] for r in idxTrain]
xTest = [x[r] for r in idxTest]
yTrain = [y[r] for r in idxTrain]
yTest = [y[r] for r in idxTest]

#train a series of models on random subsets of the training data
#collect the models in a list and check error of composite as list grows

#maximum number of models to generate
numTreesMax = 20

#tree depth - typically at the high end
treeDepth = 1

#initialize a list to hold models
modelList = []
predList = []

#number of samples to draw for stochastic bagging
nBagSamples = int(len(xTrain) * 0.5)

for iTrees in range(numTreesMax):
    #draw row indices with replacement (bootstrap sample)
    idxBag = random.choices(range(len(xTrain)), k=nBagSamples)
    xTrainBag = [xTrain[i] for i in idxBag]
    yTrainBag = [yTrain[i] for i in idxBag]

    modelList.append(DecisionTreeRegressor(max_depth=treeDepth))
    modelList[-1].fit(xTrainBag, yTrainBag)

    #make prediction with latest model and add to list of predictions
    latestPrediction = modelList[-1].predict(xTest)
    predList.append(list(latestPrediction))

#build cumulative prediction from first "n" models
mse = []
allPredictions = []
for iModels in range(len(modelList)):

    #average first "iModels" of the predictions
    prediction = []
    for iPred in range(len(xTest)):
        prediction.append(sum([predList[i][iPred] \
            for i in range(iModels + 1)])/(iModels + 1))

    allPredictions.append(prediction)
    errors = [(yTest[i] - prediction[i]) for i in range(len(yTest))]
    mse.append(sum([e * e for e in errors]) / len(yTest))

nModels = [i + 1 for i in range(len(modelList))]

plot.plot(nModels,mse)
plot.axis('tight')
plot.xlabel('Number of Models in Ensemble')
plot.ylabel('Mean Squared Error')
plot.ylim((0.0, max(mse)))
plot.show()

plotList = [0, 9, 19]
for iPlot in plotList:
    plot.plot(xTest, allPredictions[iPlot])
plot.plot(xTest, yTest, linestyle="--")
plot.axis('tight')
plot.xlabel('x value')
plot.ylabel('Predictions')
plot.show()
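The nested loop in the listing that averages the first n models' predictions could also be written with NumPy's cumulative sum. The sketch below is not part of the original listing. It uses a tiny made-up stand-in for `predList` and `yTest` (the names `pred_list` and `y_test` and their values are illustrative assumptions) so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical stand-in for predList: 3 models, 4 test points each
pred_list = [[1.0, 2.0, 3.0, 4.0],
             [3.0, 2.0, 1.0, 0.0],
             [2.0, 2.0, 2.0, 2.0]]

# Hypothetical stand-in for yTest
y_test = np.array([2.0, 2.0, 2.0, 2.0])

preds = np.array(pred_list)                          # shape (n_models, n_test)
counts = np.arange(1, preds.shape[0] + 1)[:, None]   # 1, 2, ..., n_models as a column

# Row k holds the average prediction of models 0..k
cumulative_avg = np.cumsum(preds, axis=0) / counts

# Mean squared error of each growing ensemble against the test labels
mse = ((cumulative_avg - y_test) ** 2).mean(axis=1)
```

With the real data, `mse` would play the same role as the `mse` list plotted against the number of models in the listing.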

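Figure 6-11 shows how the mean squared error changes as the number of trees increases. The error levels off at around 0.025. This is not a good result. The added noise has a standard deviation of 0.1, so the best mean squared error a prediction algorithm can achieve is the square of that standard deviation, which is 0.01. The single binary decision tree from earlier in this chapter already came close to 0.01. Why does the more sophisticated algorithm perform worse?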
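The 0.01 floor can be checked numerically. The sketch below is an illustration, not code from the book: it generates data the same way as the listing (y = x plus Gaussian noise with standard deviation 0.1), but with a larger sample size chosen here so the estimate is stable. It then scores an "ideal" predictor that already knows the true relationship y = x.

```python
import numpy as np

# Seed and sample size are assumptions made for this sketch
rng = np.random.default_rng(1)
x = np.linspace(-0.5, 0.5, 100_000)

noise_sd = 0.1
y = x + rng.normal(scale=noise_sd, size=x.shape)

# Ideal predictor: returns x itself, so its only error is the added noise
ideal_mse = np.mean((y - x) ** 2)

# The irreducible error predicted by theory is the noise variance
noise_variance = noise_sd ** 2
```

`ideal_mse` should come out very close to `noise_variance`, which is the benchmark the ensemble's error of about 0.025 is being compared against.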

《Python机器学习预测分析核心算法》

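By focusing on two families of machine learning algorithms that make effective predictions, this book shows how to carry out machine learning tasks in the Python programming language. It lowers the barrier to machine learning and makes it accessible to a wider audience.

The author draws on years of machine learning experience to guide readers through designing, building, and implementing their own machine learning solutions. The book explains the algorithms in terms that are as simple as possible, avoids complicated mathematical derivations, and provides sample code to help readers get started quickly. Readers quickly gain insight into the principles behind model building and learn how to find an algorithm for a problem, whether the problem is simple or complex. The detailed examples come with concrete, modifiable code, illustrate how machine learning works, and cover linear regression and ensemble methods, helping readers understand the basic workflow of applying machine learning.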

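This book is written for readers without a background in mathematics or statistics, and explains in detail how to:
● choose the right algorithm for the task;
● apply trained models for different purposes;
● evaluate model performance to make sure it holds up in use;
● master the core Python machine learning packages;
● design and build your own models using the sample code;
● build practical, versatile predictive models.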
