Regularization & Dimension Reduction

2018-10-15  flowncaaaaa

This week has been insanely busy, but this seminar course has really helped a beginner like me: I have learned a lot of SL (statistical learning) algorithms and basics. Since I took the notes anyway, here is the regularization part~
The content combines two books: ISL ("An Introduction to Statistical Learning" by James, Witten, Hastie and Tibshirani) and ESL ("The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Hastie, Tibshirani and Friedman).

Useful R packages:
glmnet
elasticnet
eNetXplorer (a simple running example is attached at the end; strongly recommended, with heavy personal bias: nothing special really, but as a diehard fan I just love it hahaha orz)

Shrinkage/Regularization

Ridge Regression

Ridge coefficient estimates: \hat{\beta}^{ridge} = argmin_{\beta} {RSS + \lambda \sum_{j=1}^{p} \beta_j^2} (an L2 penalty)
The idea of penalizing by the sum-of-squares of the parameters is also used in neural networks, where it is known as weight decay.

or, equivalently, in constrained form: minimize RSS subject to \sum_{j=1}^{p} \beta_j^2 \le s
  • The shrinkage penalty is small when the \beta_j are close to zero, so it has the effect of shrinking the estimates of \beta toward zero.
  • \lambda is important: when \lambda = 0, the penalty has no effect; as \lambda grows, the penalty grows and the coefficients approach zero.
  • Cross-validation can be used to choose \lambda (see the glmnet sketch after this list).
  • Ridge solutions are not equivariant under scaling of the inputs, so apply ridge after standardizing the predictors.
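
To make the cross-validation point concrete, here is a minimal glmnet sketch (the toy data are invented for illustration, not from the seminar): alpha = 0 selects the pure L2 (ridge) penalty, and cv.glmnet chooses \lambda by cross-validation.

library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)  # toy predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)                # toy response

# alpha = 0: ridge; standardize = TRUE rescales the predictors first
cv_ridge <- cv.glmnet(x, y, alpha = 0, standardize = TRUE)
cv_ridge$lambda.min                 # lambda with the smallest CV error
coef(cv_ridge, s = "lambda.min")    # shrunken, but all 20 coefficients are nonzero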

Ridge regression bias-variance trade-off, compared with least squares:

  • As \lambda increases, flexibility decreases, leading to decreased variance but increased bias.
  • When p ≈ n or p > n, ridge regression can outperform least squares by trading a small increase in bias for a large decrease in variance.
  • Substantial computational advantage over best subset selection.

Disadvantage:
Unlike best subset, forward stepwise, and backward stepwise selection, which generally select models that involve just a subset of the variables, ridge regression includes all p predictors in the final model (unless \lambda = \infty). This is a challenge for model interpretation, especially when p is large.

Lasso

Lasso coefficient estimates: \hat{\beta}^{lasso} = argmin_{\beta} {RSS + \lambda \sum_{j=1}^{p} |\beta_j|} (an L1 penalty)


or, equivalently, in constrained form: minimize RSS subject to \sum_{j=1}^{p} |\beta_j| \le s
  • Also shrinks the coefficient estimates toward zero.
  • The L1 penalty has the effect of forcing some of the coefficient estimates to be exactly zero when \lambda is sufficiently large, so the lasso also performs variable selection (see the sketch after this list).
  • Much easier to interpret than ridge regression.
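
Continuing the toy data from the ridge sketch above, the lasso is the same call with alpha = 1; note that some coefficients come out exactly zero.

# alpha = 1: lasso; compare the sparse output with the ridge coefficients above
cv_lasso <- cv.glmnet(x, y, alpha = 1)
coef(cv_lasso, s = "lambda.1se")    # many entries are exactly 0 -- variable selection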

The limitations of the lasso

  • If p > n, the lasso selects at most n variables: the number of selected predictors is bounded by the number of samples.
  • Grouped variables: the lasso fails to do grouped selection; it tends to select one variable from a group and ignore the others (illustrated below).
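
A hedged illustration of the grouping limitation, again on invented data: x1 and x2 are nearly identical, and the lasso typically keeps only one of them.

# x1 and x2 form a "group" of highly correlated predictors
set.seed(2)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)    # almost a copy of x1
x3 <- rnorm(100)
xg <- cbind(x1, x2, x3)
yg <- x1 + x2 + rnorm(100)

fit_l <- cv.glmnet(xg, yg, alpha = 1)
coef(fit_l, s = "lambda.min")       # typically one of x1/x2 is exactly 0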

Which is why I have to mention the now more widely used Elastic Net.

Elastic Net

Reference paper: Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.

Elastic Net coefficient estimates: \hat{\beta}^{enet} = argmin_{\beta} {RSS + \lambda [\alpha \sum_{j=1}^{p} \beta_j^2 + (1-\alpha) \sum_{j=1}^{p} |\beta_j|]}
The elastic net penalty is a convex combination of the lasso and ridge penalties. When \alpha = 1, the naive elastic net becomes simple ridge regression (this \alpha is the opposite of the \alpha in the glmnet package).

  • The L1 part of the penalty generates a sparse model.
  • The quadratic (L2) part of the penalty:
    -- removes the limitation on the number of selected variables;
    -- encourages a grouping effect;
    -- stabilizes the L1 regularization path.
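
A minimal glmnet sketch of the elastic net on the correlated toy data above. Note the convention flip mentioned earlier: glmnet parameterizes the penalty as \lambda [(1-\alpha)/2 ||\beta||_2^2 + \alpha ||\beta||_1], so in glmnet \alpha = 1 is the lasso and \alpha = 0 is ridge.

# alpha = 0.5: an even lasso/ridge mix in glmnet's parameterization
fit_en <- cv.glmnet(xg, yg, alpha = 0.5)
coef(fit_en, s = "lambda.min")      # the grouping effect tends to keep both x1 and x2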

Comparing the Lasso, Ridge and ElasticNet
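
One way to see the comparison is to plot the coefficient paths side by side (a sketch reusing the toy x and y from the ridge example above):

# coefficient paths as lambda varies: ridge (0), elastic net (0.5), lasso (1)
par(mfrow = c(1, 3))
for (a in c(0, 0.5, 1)) {
  plot(glmnet(x, y, alpha = a), xvar = "lambda", main = paste("alpha =", a))
}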

Dimension Reduction

PCR (Principal Components Regression)

Approach:

  1. Construct the first M principal components, Z_1, ..., Z_M (M is typically chosen by cross-validation).
  2. Use these M components as the predictors in a linear regression model that is fit using least squares.
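
A minimal PCR sketch, assuming the pls package (not mentioned above) and reusing the toy data from the ridge example; validation = "CV" cross-validates the number of components M.

library(pls)

df <- data.frame(y = y, x)                    # toy data from the ridge sketch
fit_pcr <- pcr(y ~ ., data = df, scale = TRUE, validation = "CV")
validationplot(fit_pcr, val.type = "MSEP")    # choose M where the CV error bottoms out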

PLS (Partial Least Squares)

Unlike PCR, PLS identifies the new components in a supervised way: it makes use of the response Y in order to find directions that help explain both the predictors and the response.
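
The PLS analogue, with the same pls package and toy data; plsr() has the same interface as pcr().

fit_pls <- plsr(y ~ ., data = df, scale = TRUE, validation = "CV")
validationplot(fit_pls, val.type = "MSEP")    # PLS often needs fewer components than PCR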

A Comparison of the Selection and Shrinkage Methods

See ESL §3.6 (p. 82), which compares the methods on a simple example with two predictors, with the correlation between X1 and X2 equal to 0.5.

Tips: High-Dimensional Data

Simple R example on Elastic Net (using eNetXplorer)

# install.packages("eNetXplorer")
library(eNetXplorer)
data("H1N1_Flow")
H1N1_Flow$predictor_day7[1:3, 1:12]
fit <- eNetXplorer(x = H1N1_Flow$predictor_day7,
                   y = H1N1_Flow$response_numer[rownames(H1N1_Flow$predictor_day7)],
                   family = "gaussian", n_run = 25, n_perm_null = 15)
It even shows a progress bar!

The GLM family argument currently supports two-class logistic and multinomial regression as well; future versions will add Poisson regression and the Cox model for survival data.

plot(fit, plot.type="lambdaVsQF", alpha.index=4)
plot(fit, plot.type = "summary")
plot(fit, plot.type = "featureHeatmap", stat="freq", alpha.index=4)

Reference:
https://www.biorxiv.org/content/early/2018/04/30/305870
