Boruta Feature Selection

2019-01-07

In a machine learning task, before formally training a model we usually extract as many features as possible from the raw data to serve as model inputs.

But more features are not always better. Recall the ring gag in God of Gamblers (《赌神》): over his previous 500 games, Gao Jin made a habit of touching his ring whenever he bluffed, leading his opponent to believe he had spotted a tell (a touch of the ring means a bluff is coming), and the opponent fell for it in the final showdown. Ring-touching is a bad feature; feeding it into the model actually makes the model worse.

The Boruta algorithm introduced in this post is one such feature selection method: it selects features based on their feature importances. boruta_py is a Python implementation of Boruta; it behaves like an sklearn extension and is very convenient to use.

The main steps of Boruta are as follows (a minimal single-round sketch comes right after the list):

1. Create shadow features: for each real feature in the matrix R, randomly shuffle its values; the shuffled copies form the shadow feature matrix S, which is appended after the real features to give the new feature matrix N = [R, S].

2. Train a model on the new feature matrix N. Any model that exposes feature_importances_, such as RandomForest, lightgbm, or xgboost, will do; this yields feature importances for both the real and the shadow features.

3. Take the maximum feature importance among the shadow features, S_max. Every real feature whose importance exceeds S_max records one hit.

4. Using the hits accumulated in step 3, mark features as important or unimportant. The original paper uses a Bonferroni correction for the significance test; boruta_py considers Bonferroni too conservative and adds an FDR correction by default. The two_step parameter switches between the two tests.

5. Remove the features marked unimportant and repeat steps 1-4 until every feature has been labeled.
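To make the loop concrete, here is a minimal sketch of a single round (steps 1-3) on a synthetic dataset. It only illustrates the idea and is not boruta_py itself:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
rng = np.random.RandomState(0)

# step 1: shadow features are column-wise shuffled copies of the real ones
X_shadow = X.copy()
for j in range(X_shadow.shape[1]):
    rng.shuffle(X_shadow[:, j])
N = np.hstack([X, X_shadow])  # N = [R, S]

# step 2: any model exposing feature_importances_ works
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(N, y)
imp_real = model.feature_importances_[:X.shape[1]]
imp_shadow = model.feature_importances_[X.shape[1]:]

# step 3: a real feature records a hit when it beats the best shadow feature
s_max = imp_shadow.max()
print("hits this round:", (imp_real > s_max).astype(int))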

After training, boruta_py also exposes ranking_, an importance rank for each feature, which is itself a useful indicator in feature selection. A typical usage sketch is shown below.
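The following usage sketch follows the boruta_py README; X and y are assumed to be pre-loaded numpy arrays:

from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators='auto', alpha=0.05,
                    two_step=True, random_state=1)  # two_step=False -> pure Bonferroni
selector.fit(X, y)                  # X, y: numpy arrays

print(selector.support_)            # boolean mask of the accepted features
print(selector.ranking_)            # per-feature rank; accepted features get rank 1
X_filtered = selector.transform(X)  # keep only the accepted features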

Let's look at the difference between the two tests in code. boruta_py's significance test is implemented as follows (excerpted from the BorutaPy class, where np is numpy and sp is scipy):

def _do_tests(self, dec_reg, hit_reg, _iter):
    active_features = np.where(dec_reg >= 0)[0]
    hits = hit_reg[active_features]
    # get uncorrected p values based on hit_reg
    to_accept_ps = sp.stats.binom.sf(hits - 1, _iter, .5).flatten()
    to_reject_ps = sp.stats.binom.cdf(hits, _iter, .5).flatten()

    if self.two_step:
        # two step multicor process
        # first we correct for testing several features in each round using FDR
        to_accept = self._fdrcorrection(to_accept_ps, alpha=self.alpha)[0]
        to_reject = self._fdrcorrection(to_reject_ps, alpha=self.alpha)[0]

        # second we correct for testing the same feature over and over again
        # using bonferroni
        to_accept2 = to_accept_ps <= self.alpha / float(_iter)
        to_reject2 = to_reject_ps <= self.alpha / float(_iter)

        # combine the two multi corrections, and get indexes
        to_accept *= to_accept2
        to_reject *= to_reject2
    else:
        # as in the original Boruta, we simply do bonferroni correction
        # with the total n_feat in each iteration
        to_accept = to_accept_ps <= self.alpha / float(len(dec_reg))
        to_reject = to_reject_ps <= self.alpha / float(len(dec_reg))

    # find features which are 0 and have been rejected or accepted
    to_accept = np.where((dec_reg[active_features] == 0) * to_accept)[0]
    to_reject = np.where((dec_reg[active_features] == 0) * to_reject)[0]

    # updating dec_reg
    dec_reg[active_features[to_accept]] = 1
    dec_reg[active_features[to_reject]] = -1
    return dec_reg

def _fdrcorrection(self, pvals, alpha=0.05):
    """
    Benjamini/Hochberg p-value correction for false discovery rate, from
    statsmodels package. Included here for decoupling dependency on statsmodels.

    Parameters
    ----------
    pvals : array_like
        set of p-values of the individual tests.
    alpha : float
        error rate

    Returns
    -------
    rejected : array, bool
        True if a hypothesis is rejected, False if not
    pvalue-corrected : array
        pvalues adjusted for multiple hypothesis testing to limit FDR
    """
    pvals = np.asarray(pvals)
    pvals_sortind = np.argsort(pvals)
    pvals_sorted = np.take(pvals, pvals_sortind)
    nobs = len(pvals_sorted)
    ecdffactor = np.arange(1, nobs + 1) / float(nobs)

    reject = pvals_sorted <= ecdffactor * alpha
    if reject.any():
        rejectmax = max(np.nonzero(reject)[0])
        reject[:rejectmax] = True

    pvals_corrected_raw = pvals_sorted / ecdffactor
    pvals_corrected = np.minimum.accumulate(pvals_corrected_raw[::-1])[::-1]
    pvals_corrected[pvals_corrected > 1] = 1
    # reorder p-values and rejection mask to original order of pvals
    pvals_corrected_ = np.empty_like(pvals_corrected)
    pvals_corrected_[pvals_sortind] = pvals_corrected
    reject_ = np.empty_like(reject)
    reject_[pvals_sortind] = reject
    return reject_, pvals_corrected_

In this code, dec_reg marks each feature's state with 1/0/-1 for accepted/tentative/rejected, hit_reg stores each feature's cumulative hit count across iterations, _iter is the number of iterations run so far, and the self.alpha parameter is the significance level, with a default of 0.05.
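For intuition, a small worked example with made-up numbers: under the null hypothesis a feature beats the best shadow feature with probability 0.5 in each iteration, so its hit count is binomial.

from scipy import stats

# hypothetical numbers: 17 hits out of _iter = 20 iterations
p_accept = stats.binom.sf(17 - 1, 20, .5)  # P(hits >= 17) ~ 0.0013, small -> accept
p_reject = stats.binom.cdf(17, 20, .5)     # P(hits <= 17) ~ 0.9998, large -> don't reject
print(p_accept, p_reject)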

Bonferroni correction: when n independent hypotheses are tested simultaneously on the same dataset, the significance level applied to each individual hypothesis should be 1/n of the level used for a single hypothesis; this is why the comparison divides self.alpha by float(len(dec_reg)). For example, with alpha = 0.05 and 100 features, each individual test must pass at 0.05/100 = 0.0005.
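The practical difference between the two corrections is easy to see on a made-up set of p-values. The sketch below uses multipletests from statsmodels, which implements the same Benjamini/Hochberg procedure that _fdrcorrection inlines:

import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.015, 0.02, 0.04, 0.3])
bonferroni = pvals <= 0.05 / len(pvals)                       # threshold 0.01
fdr_bh = multipletests(pvals, alpha=0.05, method='fdr_bh')[0]  # Benjamini/Hochberg
print(bonferroni)  # [ True False False False False]
print(fdr_bh)      # [ True  True  True  True False]

On the same p-values, Bonferroni accepts only the strongest candidate while FDR accepts four, which is why boruta_py calls Bonferroni alone too conservative.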

Note that boruta_py checks the feature matrix for invalid values such as NaN and Inf before selecting features. If you produce feature importances with a model that handles NaN natively, such as lightgbm, you can remove this check.

References

Kursa M B, Rudnicki W R. Feature Selection with the Boruta Package[J]. Journal of Statistical Software, 2010, 36(11):1-13.
