Coding Gradient boosted machines

2018-11-07 本文已影响9人 Liam_ml

Motivation

有很多机器学习算法。你不可能学习所有算法的机制，但是，许多算法都是从最成熟的算法中发展出来的，例如：普通最小二乘法，梯度增强法，支持向量机，基于树的算法和神经网络。在STATWORX，我们每天讨论算法，以评估它们对特定项目或问题的有用性。无论如何，理解这些核心算法是文献中大多数机器学习算法的关键。

虽然我喜欢阅读机器学习研究论文，但数学有时难以理解。这就是为什么我喜欢自己在R中实现算法的原因。

在我随后的两篇博客文章中，我将在不到150行的R代码中介绍两种机器学习算法。算法将涵盖所有核心机制，同时非常通用。你可以在我的GitHub上找到所有代码。

image.png

这篇博文将向您介绍梯度提升机器。当然，有很多伟大的文章在理论上解释了梯度增强，并附有一个动手实例。这不是本博文的目的。如果ni对算法的理论和演变感兴趣，我强烈建议您阅读Bühlmann和Hothorn（2007）的论文。（1）本博文的目的是通过编写简单的R代码来建立算法理论。你不需要任何先前的算法知识。

Gathering all puzzle pieces

# this is our loss function
# y - the target
# yhat - the fitted values
loss <- function(y, yhat) return(mean(1/2*(y-yhat)^2))

我们的目标是找到这个损失函数的最小值，这是对均方误差的适应。该算法利用了这样的事实，即我们的损失函数中至少没有斜率。函数的梯度，即相对于yhat的负偏导数，描述了这个斜率。因此，我们实际上可以重新设计我们的目标，即搜索零或接近零的梯度，以最终实现我们减少损失的目标。

# the derivative of our loss function is the gradient
# y - the target
# yhat - the fitted values
gradient <- function(y, yhat) {return(y - yhat)}

现在是算法的有趣部分。在我们的例子中，梯度与残差u = y - yhat一致。请记住，我们希望渐变为零或接近零。换句话说，我们希望状态y = yhat。该算法通过简单地迭代拟合（增强）残差（梯度）而不是y本身来利用这一事实。这意味着我们在每次迭代中更新渐变以改善我们的拟合。这一步描述了我们的运动到最小的损失函数。一旦我们看到理论在行动，我们将稍微阐述一下这个观察。

The algorithm

# grad_boost
#' Fit a boosted linear model
#'
#' @param formula an object of class formula
#' @param data a data.frame or matrix
#' @param nu a numeric within the range of [0,1], the learning rate
#' @param stop a numeric value determining the total boosting iterations
#' @param loss.fun a convex loss function which is continuously differentiable
#' @param grad.fun a function which computes the gradient of the loss function
#' @param yhat.init a numeric value determining the starting point
#'
#' @return \itemize{
#' \item theta - this is our estimator
#' \item u - this is the last gradient value
#' \item fit - our fitted values, i.e. X %*% theta
#' \item formula - the underlying formula
#' \item data - the underlying data
#' }
#' @export
#'
#' @examples # Complete runthrough see: www.github.com/andrebleier/cheapml
grad_boost <- function(formula, data, nu = 0.01, stop, 
                       grad.fun, loss.fun, yhat.init = 0) {

  # coerce to data.frame
  data <- as.data.frame(data)

  # handle formula
  formula <- terms.formula(formula)

  # get the design matrix
  X <- model.matrix(formula, data)

  # extract target
  y <- data[, as.character(formula)[2]]

  # initial fit
  fit <- yhat.init

  # initialize the gradient with yhat.init
  u <- grad.fun(y = y, yhat = fit)

  # initial working parameter (our theta)
  # this is just an empty body
  theta <- rep(0, ncol(X))

  # track loss
  loss <- c()

  # boost from 1 to stop
  for (i in 1:stop) {

    # estimate working parameter (via OLS)
    # theta_i = (X'X)^-1 X'y
    # This is the (base procedure) where you can literally place 
    # any optimization algorithm
    base_prod <- lm.fit(x = X, y = u)
    theta_i <- coef(base_prod)

    # update our theta
    theta <- theta + nu*as.vector(theta_i)

    # new fitted values
    fit <- fit + nu * fitted(base_prod)

    # update gradient
    u <- grad.fun(y = y, yhat = fit)

    # update loss 
    loss <- append(loss, loss.fun(y = y, yhat = fit))
  }  

  # naming the estimator
  names(theta) <- colnames(X)

  # define return
  return(list(theta = theta, u = u, fit = fit, loss = loss, 
              formula = formula, data = data))
}

for循环使得编码少于30行代码。其他一切只是输入/输出处理等等。

# coerce to data.frame
  data <- as.data.frame(data)
  
  # handle formula
  formula <- terms.formula(formula)
  
  # get the design matrix
  X <- model.matrix(formula, data)
    
  # extract target
  y <- data[, as.character(formula)[2]]

  # initial fit
  fit <- f.init
  
  # initialize the gradient with yhat.init
  u <- grad.fun(y = y, yhat = fit)
  
  # initial working parameter (our theta)
  # this is just an empty body
  theta <- rep(0, ncol(X))
  
  # track loss
  loss <- c()

他的第一部分只是输入处理，但我们可以看到我们必须初始化渐变，从最佳猜测拟合yhat.init开始，例如， 0 初始化之后，我们需要一些容器来存储我们的损失和我们的估算器theta。

下一部分是关于我们的优化，我们迭代地接近我们预先指定的损失函数的最小值。

# boost from 1 to stop
for (i in 1:stop) {
    
    # estimate working parameter (via OLS)
    # theta_i = (X'X)^-1 X'y
    # This is the (base procedure) where you can literally place 
    # any optimization algorithm
    base_prod <- lm.fit(x = X, y = u)
    theta_i <- coef(base_prod)
  
    # update our theta
    theta <- theta + nu*as.vector(theta_i)
    
    # new fitted values
    fit <- fit + nu * fitted(base_prod)
    
    # update gradient
    u <- grad.fun(y = y, yhat = fit)
    
    # update loss 
    loss <- append(loss, loss.fun(y = y, yhat = fit))
}

Unveiling the magic

最初我已经说过梯度增强是机器学习算法的载体，而不是机器学习算法本身。从本质上讲，这是因为我们可以在这里使用任何其他优化函数而不是lm.fit。大多数梯度增强机器使用基于树的算法，例如， xgboost。这使得梯度增强机器成为非常独特的机器学习算法。

References

Bühlmann, P., & Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4), 477-505
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.