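R - Regression Analysis
Regression analysis is one of the first tools to reach for when analyzing a data set. It estimates the relationships between variables and often gives you immediate insight into the structure of the data. - "R for Data Science" - Dan Toomey
Simple regression
Simple linear regression lives up to its name: it is a very simple approach that predicts a response variable Y from a single predictor variable X. It assumes that X and Y are approximately linearly related, which can be written as Y ≈ intercept + slope * X.
plot() gives a quick look at whether the variables in a data set show any particular patterns. It can also be used to view model diagnostics.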
plot(iris)
# Pass the data frame explicitly instead of using attach()/detach()
cor(iris$Petal.Length, iris$Petal.Width)
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
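# Overall summary of the model
summary(fit)
# Components stored in the fitted model object
names(fit)
# Confidence intervals for the coefficient estimates
confint(fit)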
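To connect the fitted object back to Y ≈ intercept + slope * X, here is a minimal sketch that pulls the two coefficients out of the model and predicts by hand. The petal width of 1.5 is a made-up value chosen only for illustration.

```r
# Refit the model from the text so this block is self-contained
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# coef() returns a named vector with "(Intercept)" and "Petal.Width"
b <- coef(fit)
intercept <- b[["(Intercept)"]]
slope <- b[["Petal.Width"]]

# Predict by hand for a hypothetical petal width (illustrative value)
new_width <- 1.5
by_hand <- intercept + slope * new_width

# predict() applies the same formula to new data
by_predict <- predict(fit, newdata = data.frame(Petal.Width = new_width))

print(c(by_hand = by_hand, by_predict = unname(by_predict)))
```

Both numbers should agree, since predict() on an lm object evaluates the same linear formula.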
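ggResidpanel is a ggplot2-based package for viewing diagnostic plots of models fitted in R. It currently supports models of class "lm", "glm", "lme", "lmer", "glmer", and "lmerTest". The results look much like the plots produced by plot().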
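A minimal sketch of how ggResidpanel might be used on the model above, assuming the package is installed. The call is guarded so the block still runs when it is not.

```r
# Refit the model from the text so this block is self-contained
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# Only draw the panel when ggResidpanel is available (assumption: installed from CRAN)
if (requireNamespace("ggResidpanel", quietly = TRUE)) {
  # resid_panel() draws a grid of residual diagnostic plots for the model
  panel <- ggResidpanel::resid_panel(fit)
  print(panel)
} else {
  message("ggResidpanel is not installed; falling back to plot(fit)")
}
```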
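Assessing model accuracy
Significance tests for the regression equation
- F test: the F test checks whether the predictors X, taken as a whole, have a significant linear relationship with Y. When there is no relationship between the response and the predictors, the F-statistic is expected to be close to 1, and values much larger than 1 are evidence against the null hypothesis. The p-value can also be used to decide whether to reject the null hypothesis: a small value (for example below 0.01) indicates that the predictors as a whole are significantly related to Y.
- t test: the t test checks the significance of an individual predictor X for Y. It is usually judged by its p-value: when the p-value is small (for example below 0.01), the null hypothesis that the coefficient is zero can be rejected and the two variables are considered related.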
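The two tests can be read off the summary object. The sketch below extracts the F-statistic, its p-value, and the per-coefficient t test, using the iris model as an example.

```r
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
s <- summary(fit)

# F-statistic: named vector with the value and its two degrees of freedom
f <- s$fstatistic
f_pvalue <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(s)
t_value <- coef_table["Petal.Width", "t value"]
t_pvalue <- coef_table["Petal.Width", "Pr(>|t|)"]

print(f)
print(coef_table)
```

With a single predictor the two tests are equivalent: the F-statistic is the square of the slope's t value.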
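Once the null hypothesis has been rejected, the natural next question is how well the model fits the data. This is usually measured with residual analysis and R^2 (R squared).
Residual analysis (a residual is the difference between the actual value and the predicted value)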
plot(fit)
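Residuals vs Fitted: if the points are spread evenly and randomly on both sides of y = 0, and the red line is a roughly flat curve with no obvious shape, the residuals are well behaved. The most extreme points are labelled.
Normal Q-Q: shows whether the residuals follow a normal distribution. If the regression model is appropriate, the residuals should be approximately normal and the points should lie close to the reference line. For standardized residuals that are approximately normal, about 95% of the points should fall within [-2, 2]. The most extreme points are labelled.
Scale-Location: plots the square root of the absolute standardized residuals against the fitted values. It is read in the same way as the Residuals vs Fitted plot: the points should be randomly scattered and the red line roughly flat with no obvious shape, which suggests constant variance. The most extreme points are labelled.
Residuals vs Leverage: plots the standardized residuals against leverage. The dashed lines are contours of Cook's distance, which measures how much each point influences the regression. If the red contour lines appear in the plot and points lie near or beyond them, the data contain points that strongly influence the regression result. The most extreme points are labelled.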
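The same checks can be made numerically. The following sketch draws the four plots in one grid, then computes the share of standardized residuals inside [-2, 2] and the largest Cook's distances. Looking at the top three points is an arbitrary choice for illustration.

```r
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# Draw the four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# Share of standardized residuals inside [-2, 2]; roughly 0.95 is expected
std_res <- rstandard(fit)
share_within_2 <- mean(abs(std_res) <= 2)
print(share_within_2)

# Cook's distance per observation; large values flag influential points
cd <- cooks.distance(fit)
most_influential <- head(sort(cd, decreasing = TRUE), 3)
print(most_influential)
```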
Residual analysis generally gives an absolute measure of fit, while R^2 offers a different, relative measure.
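The R^2 statistic
R^2 (R squared), the coefficient of determination: used to judge how well the regression equation fits. R^2 takes values between 0 and 1, and the closer it is to 1 the better the fit. If R^2 equals 0.60, then roughly 60% of the variability in Y is explained by X through the fitted equation.
R^2 measures the linear relationship between X and Y, and correlation also measures the linear relationship between X and Y, so the R^2 statistic and the correlation coefficient r serve the same purpose. In fact, in simple linear regression R^2 equals r^2. In other words, in simple linear regression the squared correlation can stand in for the R^2 statistic. In multiple linear regression, however, correlation only describes a pair of variables and does not carry over directly, so the linear relationship is measured with R^2.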
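The identity R^2 = r^2 for simple linear regression is easy to check. This sketch computes R^2 three ways: from the summary, from the correlation, and from its definition 1 - RSS/TSS.

```r
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# R^2 as reported by summary()
r_squared <- summary(fit)$r.squared

# Squared Pearson correlation between the two variables
r <- cor(iris$Petal.Length, iris$Petal.Width)

# R^2 from its definition: 1 - residual sum of squares / total sum of squares
rss <- sum(residuals(fit)^2)
tss <- sum((iris$Petal.Length - mean(iris$Petal.Length))^2)
r_squared_manual <- 1 - rss / tss

print(c(summary = r_squared, cor_squared = r^2, manual = r_squared_manual))
```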
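Prediction
The model can be validated by fitting it on a random sample of the data and then evaluating it on the remaining observations.
predict() computes confidence intervals and prediction intervals for predictions.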
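A minimal sketch of both ideas, assuming a 70/30 random split and RMSE as the evaluation metric. The split ratio, the seed, the metric and the two new petal widths are illustrative choices, not something the text prescribes.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Random 70/30 split into training and test rows (assumed ratio)
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Fit on the training rows, evaluate on the held-out rows
fit_train <- lm(Petal.Length ~ Petal.Width, data = train)
pred <- predict(fit_train, newdata = test)
rmse <- sqrt(mean((test$Petal.Length - pred)^2))
print(rmse)

# Intervals for two hypothetical new observations
new_obs <- data.frame(Petal.Width = c(0.5, 1.5))
conf_int <- predict(fit_train, newdata = new_obs, interval = "confidence")
pred_int <- predict(fit_train, newdata = new_obs, interval = "prediction")
print(conf_int)
print(pred_int)
```

The prediction interval is always wider than the confidence interval, because it also accounts for the noise of an individual observation and not only the uncertainty in the fitted mean.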
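Multiple regression
lm(y ~ x1 + x2 + x3)
Significance tests for the regression equation
The F test is suitable when the number of predictors is not too large. If there are too many, consider approaches for high-dimensional data such as forward selection.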
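As a concrete instance of lm(y ~ x1 + x2 + x3), the sketch below uses the built-in mtcars data. The choice of mpg, wt, hp and disp is only for illustration. It reads the overall F test from the summary and reproduces it by comparing against an intercept-only model.

```r
# Hypothetical example model: fuel economy explained by weight, power and displacement
fit_multi <- lm(mpg ~ wt + hp + disp, data = mtcars)
s <- summary(fit_multi)

# Overall F test: are all slopes zero?
f <- s$fstatistic
overall_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
print(f)
print(overall_p)

# The same test as a comparison with the intercept-only model
fit_null <- lm(mpg ~ 1, data = mtcars)
anova_table <- anova(fit_null, fit_multi)
print(anova_table)
```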
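- Variance inflation factor, vif() {car}: the ratio of the variance of a coefficient estimate to the variance it would have if the predictors were not linearly correlated with each other. The closer the VIF is to 1, the milder the multicollinearity; the larger it is, the more severe. When multicollinearity is severe, the model should be adjusted accordingly.
"An Introduction to Statistical Learning with Applications in R" - G. James, D. Witten, T. Hastie, R. Tibshirani - Chapter 6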
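The VIF can be computed by hand as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. This is a sketch on the mtcars data with an illustrative set of predictors. The car::vif() call is only made if car is installed.

```r
# Hypothetical example model on built-in data
predictors <- c("wt", "hp", "disp")
fit_multi <- lm(mpg ~ wt + hp + disp, data = mtcars)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})
print(vif_manual)

# Compare with car::vif() when the package is available
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(fit_multi))
}
```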
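Variable selection - screening variables by importance
- Look at the p-value of each predictor. - "An Introduction to Statistical Learning with Applications in R" - G. James, D. Witten, T. Hastie, R. Tibshirani - Chapter 6
- Try the variables one at a time and see which model is better.
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- Adjusted R^2
Established approaches include:
Forward selection
Backward selection
Mixed selection
Tips: if p > n, backward selection cannot be used. Forward selection is a greedy approach that can be used in any situation, but it may include variables early on that later become redundant. Mixed selection can remedy this.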
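The three strategies can be sketched with stats::step(), which selects by AIC by default. The candidate predictors from mtcars are an arbitrary illustration, and passing k = log(n) to get BIC-style selection is one common option.

```r
# Candidate predictors (illustrative choice from the built-in mtcars data)
upper <- ~ wt + hp + disp + qsec + drat
full <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Forward selection: start from the intercept-only model and add terms
forward <- step(null, scope = list(lower = ~ 1, upper = upper),
                direction = "forward", trace = 0)

# Backward selection: start from the full model and drop terms
backward <- step(full, direction = "backward", trace = 0)

# Mixed selection: terms may be added and later dropped again
mixed <- step(null, scope = list(lower = ~ 1, upper = upper),
              direction = "both", trace = 0)

# BIC-style penalty: k = log(n) instead of the default k = 2
backward_bic <- step(full, direction = "backward",
                     k = log(nrow(mtcars)), trace = 0)

print(formula(forward))
print(formula(backward))
print(formula(mixed))
print(formula(backward_bic))
```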
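Assessing model accuracy
RSE and R^2 are the two most common numerical measures of model fit. They are computed and interpreted in the same way as in simple linear regression. Note, however, that R^2 is no longer the square of the correlation r between Y and a single predictor; it equals the squared correlation between Y and the fitted values.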
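A short sketch of both measures for a multiple regression, again using an illustrative mtcars model. RSE is computed from its definition, and R^2 is checked against the squared correlation between the response and the fitted values.

```r
# Hypothetical example model on built-in data
fit_multi <- lm(mpg ~ wt + hp + disp, data = mtcars)
s <- summary(fit_multi)

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse <- s$sigma
rse_manual <- sqrt(sum(residuals(fit_multi)^2) / df.residual(fit_multi))

# R^2 and its equivalent as a squared correlation with the fitted values
r2 <- s$r.squared
r2_from_fitted <- cor(mtcars$mpg, fitted(fit_multi))^2

# Adjusted R^2 penalizes additional predictors
adj_r2 <- s$adj.r.squared

print(c(rse = rse, r2 = r2, adj_r2 = adj_r2))
```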
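Prediction
The model can be validated by fitting it on a random sample of the data and then evaluating it on the remaining observations.
Other considerations
Qualitative predictors:
Qualitative predictors enter the regression as numeric indicator (dummy) variables.
contrasts() returns the dummy-variable coding: which levels were recoded and what values they take.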
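For example, Species in iris is a factor with three levels. The sketch below shows its coding and how the coefficients relate to group means, assuming R's default treatment contrasts are in effect.

```r
# Dummy coding of the factor: rows are levels, columns are dummy variables
coding <- contrasts(iris$Species)
print(coding)

# lm() creates the dummy variables automatically for a factor predictor
fit_q <- lm(Petal.Length ~ Species, data = iris)
print(coef(fit_q))

# With treatment contrasts, the intercept is the mean of the baseline level
# and each other coefficient is a difference from that baseline
group_means <- tapply(iris$Petal.Length, iris$Species, mean)
print(group_means)
```

Here setosa is the baseline level, so it has no dummy variable of its own.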
Potential problems
- Non-linearity of the response-predictor relationships.
- Correlation of error terms.
- Non-constant variance of error terms.
- Outliers.
- High-leverage points.
- Collinearity.
Visualization
ggplot2 provides geom_smooth(), which fits a regression and plots it automatically.
library(ggplot2)
ggplot(data = cars) +
  geom_point(aes(speed, dist)) +
  geom_smooth(aes(speed, dist), method = "lm")
[Figure: summary_fit.png]
The fitted equation is: dist = 3.9324 * speed - 17.5791
If the model needs adjusting, once you have the equation you can also draw it with geom_line().
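ggplot(data = cars) +
  geom_point(aes(speed, dist)) +
  # Draw the fitted line from its intercept and slope
  # (linewidth needs ggplot2 >= 3.4.0; use size on older versions)
  geom_line(aes(speed, -17.5791 + 3.9324 * speed),
            color = "red", linewidth = 1) +
  theme_minimal()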
[Figure: geom_smooth.png]
ggeffects
ggeffects is a package built on ggplot2 for visualizing fitted models. It currently supports: bamlss, bayesx, betabin, betareg, bglmer, blmer, bracl, brglm, brmsfit, brmultinom, clm, clm2, clmm, coxph, fixest, gam (package mgcv), Gam (package gam), gamlss, gamm, gamm4, gee, geeglm, glm, glm.nb, glmer, glmer.nb, glmmTMB, glmmPQL, glmrob, glmRob, gls, hurdle, ivreg, lm, lm_robust, lme, lmer, lmrob, lmRob, logistf, lrm, MixMod, MCMCglmm, multinom, negbin, nlmer, ols, plm, polr, rlm, rlmer, rq, rqss, stanreg, survreg, svyglm, svyglm.nb, tobit, truncreg, vgam, wbm, zeroinfl and zerotrunc.
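A minimal sketch of ggeffects on the cars model, assuming both ggeffects and ggplot2 are installed. The call is guarded so the block still runs when they are not.

```r
fit_cars <- lm(dist ~ speed, data = cars)

# Only run when both packages are available (assumption: installed from CRAN)
if (requireNamespace("ggeffects", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  # ggpredict() returns predicted values with confidence intervals for the chosen term
  pred <- ggeffects::ggpredict(fit_cars, terms = "speed")
  print(pred)
  # The plot() method for the result builds a ggplot2 figure
  print(plot(pred))
} else {
  message("ggeffects or ggplot2 is not installed; skipping the plot")
}
```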
Overview of R functions for regression analysis
R Functions For Regression Analysis - Vito Ricci
Linear model
- Anova: Anova Tables for Linear and Generalized Linear Models (car)
- anova: Compute an analysis of variance table for one or more linear model fits (stats)
- coef: is a generic function which extracts model coefficients from objects returned by modeling functions. coefficients is an alias for it (stats)
- coeftest: Testing Estimated Coefficients (lmtest)
- confint: Computes confidence intervals for one or more parameters in a fitted model. Base has a method for objects inheriting from class "lm" (stats)
- deviance: Returns the deviance of a fitted model object (stats)
- effects: Returns (orthogonal) effects from a fitted model, usually a linear model. This is a generic function, but currently only has methods for objects inheriting from classes "lm" and "glm" (stats)
- fitted: is a generic function which extracts fitted values from objects returned by modeling functions. fitted.values is an alias for it (stats)
- formula: provide a way of extracting formulae which have been included in other objects (stats)
- linear.hypothesis: Test Linear Hypothesis (car)
- lm: is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (stats)
- model.matrix: creates a design matrix (stats)
- predict: Predicted values based on linear model object (stats)
- residuals: is a generic function which extracts model residuals from objects returned by modeling functions (stats)
- summary.lm: summary method for class "lm" (stats)
- vcov: Returns the variance-covariance matrix of the main parameters of a fitted model object (stats)
Model – Variables selection
- add1: Compute all the single terms in the scope argument that can be added to or dropped from the model, fit those models and compute a table of the changes in fit (stats)
- AIC: Generic function calculating the Akaike information criterion for one or several fitted model objects for which a log-likelihood value can be obtained, according to the formula -2*log-likelihood + k*npar, where npar represents the number of parameters in the fitted model, and k = 2 for the usual AIC, or k = log(n) (n the number of observations) for the so-called BIC or SBC (Schwarz's Bayesian criterion) (stats)
- Cpplot: Cp plot (faraway)
- drop1: Compute all the single terms in the scope argument that can be added to or dropped from the model, fit those models and compute a table of the changes in fit (stats)
- extractAIC: Computes the (generalized) Akaike An Information Criterion for a fitted parametric model (stats)
- leaps: Subset selection by `leaps and bounds' (leaps)
- maxadjr: Maximum Adjusted R-squared (faraway)
- offset: An offset is a term to be added to a linear predictor, such as in a generalised linear model, with known coefficient 1 rather than an estimated coefficient (stats)
- step: Select a formula-based model by AIC (stats)
- update.formula: is used to update model formulae. This typically involves adding or dropping terms, but updates can be more general (stats)
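The AIC formula quoted in the list above, -2*log-likelihood + k*npar, can be checked against the built-in functions. This sketch uses the cars model as an example.

```r
fit <- lm(dist ~ speed, data = cars)

# Log-likelihood and the number of estimated parameters
# (intercept, slope and the residual variance)
ll <- logLik(fit)
npar <- attr(ll, "df")
n <- nobs(fit)

# k = 2 gives AIC, k = log(n) gives BIC
aic_manual <- -2 * as.numeric(ll) + 2 * npar
bic_manual <- -2 * as.numeric(ll) + log(n) * npar

print(c(aic_manual = aic_manual, AIC = AIC(fit)))
print(c(bic_manual = bic_manual, BIC = BIC(fit)))
```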
Diagnostics
- cookd: Cook's Distances for Linear and Generalized Linear Models (car)
- cooks.distance: Cook’s distance (stats)
- covratio: covariance ratio (stats)
- dfbeta: DFBETA (stats)
- dfbetas: DFBETAS (stats)
- dffits: DFFITS (stats)
- hat: diagonal elements of the hat matrix (stats)
- hatvalues: diagonal elements of the hat matrix (stats)
- influence.measures: This suite of functions can be used to compute some of the regression (leave-one-out deletion) diagnostics for linear and generalized linear models (stats)
- lm.influence: This function provides the basic quantities which are used in forming a wide variety of diagnostics for checking the quality of regression fits (stats)
- ls.diag: Computes basic statistics, including standard errors, t- and p-values for the regression coefficients (stats)
- outlier.test: Bonferroni Outlier Test (car)
- rstandard: standardized residuals (stats)
- rstudent: studentized residuals (stats)
- vif: Variance Inflation Factor (car)
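Several of the stats diagnostics listed above can be combined in a few lines. In this sketch the 2 * p / n leverage cutoff is a common rule of thumb used here as an assumption, not something the list itself specifies.

```r
fit <- lm(dist ~ speed, data = cars)

# Leverage, influence and studentized residuals per observation
h <- hatvalues(fit)
cd <- cooks.distance(fit)
rs <- rstudent(fit)

# Hat values sum to the number of estimated coefficients
p <- length(coef(fit))
n <- nobs(fit)

# Rule-of-thumb cutoff for high leverage (assumed threshold)
high_leverage <- which(h > 2 * p / n)
print(high_leverage)

# influence.measures() bundles dfbetas, dffits, covratio, Cook's distance and hat values
infl <- influence.measures(fit)
summary(infl)
```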
Graphics
- ceres.plots: Ceres Plots (car)
- cr.plots: Component+Residual (Partial Residual) Plots (car)
- influence.plot: Regression Influence Plot (car)
- leverage.plots: Regression Leverage Plots (car)
- panel.car: Panel Function Coplots (car)
- plot.lm: Four plots (selectable by which) are currently provided: a plot of residuals against fitted values, a Scale-Location plot of sqrt{| residuals |} against fitted values, a Normal Q-Q plot, and a plot of Cook's distances versus row labels (stats)
- prplot: Partial Residual Plot (faraway)
- qq.plot: Quantile-Comparison Plots (car)
- qqline: adds a line to a normal quantile-quantile plot which passes through the first and third quartiles (stats)
- qqnorm: is a generic function the default method of which produces a normal QQ plot of the values in y (stats)
- reg.line: Plot Regression Line (car)
- scatterplot.matrix: Scatterplot Matrices (car)
- scatterplot: Scatterplots with Boxplots (car)
- spread.level.plot: Spread-Level Plots (car)