GBDT Binary Classification in Practice
2019-10-29
邵红晓
Key points:
1. Residuals: the residual is approximated by the negative gradient of the loss function, residual(i) = 2*y / (1 + exp(2*y*f(i))), where f(i) is initialized to 0.
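A minimal sketch of this negative-gradient computation, matching the compute_residual call in the fit() code below; the get_instance(id)['label'] accessor and labels in {-1, +1} are assumptions:

from math import exp

def compute_residual(dataset, subset, f):
    # Negative gradient of log(1 + exp(-2yF)) with respect to F:
    # residual(i) = 2*y / (1 + exp(2*y*f(i)))
    residual = {}
    for instance_id in subset:
        y = dataset.get_instance(instance_id)['label']  # assumed: y in {-1, +1}
        residual[instance_id] = 2.0 * y / (1 + exp(2.0 * y * f[instance_id]))
    return residual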
2. Leaf node estimation (node.predict_value): each leaf's value is computed from the residuals of all instances falling in that leaf's region.

Optimizing with a single Newton-Raphson step yields the leaf estimate

gamma = sum(residual(i)) / sum(|residual(i)| * (2 - |residual(i)|))

In code: sum(residual(i)) / sum(abs(residual(i)) * (2 - abs(residual(i))))
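As a sketch, this per-leaf Newton step can be written as follows, where leaf_instance_ids (a hypothetical argument) holds the ids of the instances routed to the leaf and residual is the dict from the sketch above:

def leaf_predict_value(leaf_instance_ids, residual):
    # One Newton-Raphson step on the binomial deviance gives
    # gamma = sum(residual(i)) / sum(|residual(i)| * (2 - |residual(i)|))
    numerator = sum(residual[i] for i in leaf_instance_ids)
    denominator = sum(abs(residual[i]) * (2 - abs(residual[i])) for i in leaf_instance_ids)
    return numerator / denominator if denominator != 0 else 0.0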
3. Prediction update: every instance's prediction is updated as f(i) += learn_rate * node.predict_value.
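A sketch of this step with the same signature as the update_f_value call in fit() below; node.get_idset() and dataset.get_instances_idset() (assumed to return sets of instance ids) are assumptions about the node and dataset objects:

def update_f_value(f, tree, leaf_nodes, subset, dataset, learn_rate):
    # Instances in the sampled subset are updated through the leaf they landed in;
    # the remaining instances are routed through the fitted tree so that every
    # instance's prediction f(i) stays current.
    subset_ids = set(subset)
    for node in leaf_nodes:
        for instance_id in node.get_idset():  # assumed: ids of instances in this leaf
            f[instance_id] += learn_rate * node.predict_value
    for instance_id in dataset.get_instances_idset() - subset_ids:  # assumed accessor
        f[instance_id] += learn_rate * tree.get_predict_value(dataset.get_instance(instance_id))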
4. Iterative optimization: minimize the negative binomial log-likelihood (binomial deviance) loss
loss = L(y, F) = log(1 + exp(-2yF)), y ∈ {-1, 1}
F(x) = 1/2 * log[Pr(y=1|x) / Pr(y=-1|x)]
where F is the log-odds, from which the predicted probability is derived, and y is the true label.
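Summing this deviance over every instance gives the train_loss printed at the end of each iteration in fit() below; here is a minimal sketch, reusing the assumed get_instance(id)['label'] accessor:

from math import exp, log

def compute_loss(dataset, f):
    # Total binomial deviance: sum over all instances of log(1 + exp(-2*y*f(i)))
    total = 0.0
    for instance_id in dataset.get_instances_idset():  # assumed accessor
        y = dataset.get_instance(instance_id)['label']
        total += log(1 + exp(-2.0 * y * f[instance_id]))
    return total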
from random import sample  # needed for the subsampling below

def fit(self, dataset, train_data):
    self.loss = BinomialDeviance(n_classes=dataset.get_label_size())
    # 1. Initialize the predictions: f records F_{m-1}(x_i) for every instance
    f = dict()
    self.loss.initialize(f, dataset)
    # typical settings: max_iter=20, sample_rate=0.8, learn_rate=0.5,
    # max_depth=7, loss_type='binary-classification'
    for iter in range(1, self.max_iter + 1):
        subset = train_data
        if 0 < self.sample_rate < 1:
            # stochastic GBDT: fit this tree on a random fraction of the data
            subset = sample(subset, int(len(subset) * self.sample_rate))
        # 2. Residuals: the negative gradient of the loss serves as the
        #    residual approximation, residual(i) = 2*y / (1 + exp(2*y*f(i)))
        residual = self.loss.compute_residual(dataset, subset, f)
        leaf_nodes = []
        targets = residual
        # 3. Fit a regression tree to the residuals; each leaf is estimated as
        #    sum(residual(i)) / sum(|residual(i)| * (2 - |residual(i)|))
        tree = construct_decision_tree(dataset, subset, targets, 0, leaf_nodes,
                                       self.max_depth, self.loss, self.split_points)
        self.trees[iter] = tree
        # 4. Update predictions: f(i) += learn_rate * node.predict_value
        self.loss.update_f_value(f, tree, leaf_nodes, subset, dataset, self.learn_rate)
        # Training loss: sum over all instances of log(1 + exp(-2*y*f(i))),
        # where f(i) = 1/2 * log(p(y=1)/p(y=-1)) is the log-odds
        train_loss = self.compute_loss(dataset, f)
        print("iter%d : train loss=%f" % (iter, train_loss))
Formula reference:
https://nbviewer.jupyter.org/github/liudragonfly/GBDT/blob/master/GBDT.ipynb