
R - Regression Analysis

2019-12-07  吴十三和小可爱的札记

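Regression analysis is one of the first tools to reach for when analyzing a dataset. It estimates the relationships between variables and often gives immediate insight into the structure of the data. - "R for Data Science" - Dan Toomey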

Simple regression

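Simple linear regression is what its name says: a very simple approach that predicts a response variable Y from a single predictor X. It assumes that the relationship between X and Y is approximately linear, which can be written as Y ≈ intercept + slope * X.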
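As a minimal sketch of where the intercept and slope come from, the least-squares estimates can be computed by hand and compared with lm(). The built-in iris data and the variable names (x, y, fit_check) are chosen here only for illustration.

```r
# Least-squares estimates for Petal.Length ~ Petal.Width, computed by hand
x <- iris$Petal.Width
y <- iris$Petal.Length

slope <- cov(x, y) / var(x)             # slope = Cov(X, Y) / Var(X)
intercept <- mean(y) - slope * mean(x)  # the line passes through (mean(x), mean(y))

# lm() should return the same two numbers
fit_check <- lm(y ~ x)
print(c(intercept = intercept, slope = slope))
print(coef(fit_check))
```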

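plot() gives a quick look at whether the variables in a dataset show any particular pattern. Applied to a fitted model, it also displays the model diagnostics.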

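# Scatterplot matrix of all variables
plot(iris)
# Correlation between petal length and petal width
cor(iris$Petal.Length, iris$Petal.Width)
# Fit the model; data = iris replaces attach()/detach()
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
# Overall summary
summary(fit)
# Components stored in the fitted object
names(fit)
# Confidence intervals for the coefficient estimates
confint(fit)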

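ggResidpanel is a ggplot2-based package for viewing diagnostic plots of models fitted in R. It currently supports models of type "lm", "glm", "lme", "lmer", "glmer", and "lmerTest". The results are much the same as the plots from plot().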

Assessing model accuracy

Significance tests for the regression equation

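  1. F-test: the F-test checks whether the predictors X, taken together, have a significant linear relationship with Y. When there is no relationship between the response and the predictors, the F-statistic is expected to be close to 1; values much larger than 1 are evidence against the null hypothesis. The p-value can also be used to decide whether to reject the null hypothesis: a p-value of 0.01 or smaller indicates that, overall, the relationship between the predictors and Y is significant.

  2. t-test: the t-test checks whether an individual predictor X is significant for Y. It is usually judged by the p-value: at 0.01 or smaller, the null hypothesis can be rejected and the two variables are taken to be related.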
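Both tests can be read from summary(), and the numbers can also be pulled out of the summary object. The sketch below refits the iris model so that it runs on its own; the names s, f, f_pvalue, coef_table, and t_pvalue are illustrative.

```r
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
s <- summary(fit)

# Overall F-test: statistic plus numerator and denominator degrees of freedom
f <- s$fstatistic
f_pvalue <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Per-coefficient t-tests: estimate, standard error, t value, p-value
coef_table <- coef(s)
t_pvalue <- coef_table["Petal.Width", "Pr(>|t|)"]

print(f)
print(f_pvalue)
print(coef_table)
```

With a single predictor the two tests agree: the F-statistic is the square of the t value for the slope.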

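Once the null hypothesis is rejected, the natural next question is how well the model fits. This is usually measured with residual analysis and R^2 (R squared).

Residual analysis (residuals are the differences between observed and predicted values)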

plot(fit)

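Residuals vs Fitted: the points are spread evenly on both sides of y = 0 in a random pattern, and the red line is a fairly flat curve with no obvious shape, which indicates that the residuals behave well. The most extreme points are labeled.

Normal Q-Q: shows whether the residuals follow a normal distribution. If the model is specified correctly and the errors are normal, the points should lie close to the reference line, although real data are rarely perfectly normal. For standardized residuals that are approximately normal, about 95% of the points should fall within [-2, 2]. The most extreme points are labeled.

Scale-Location: plots the square root of the absolute standardized residuals against the fitted values. It is read like the Residuals vs Fitted plot: the points should be randomly scattered and the red line should be a fairly flat curve with no obvious shape. The most extreme points are labeled.

Residuals vs Leverage: plots the standardized residuals against leverage. The dashed lines are Cook's distance contours, which measure how much each point influences the regression. Points that fall outside these contours have a particularly strong influence on the regression result. The most extreme points are labeled.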
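The quantities behind these four plots can also be inspected numerically. This is a small sketch on the iris model, refitted here so that it is self-contained; the Shapiro-Wilk test is added as one possible numeric complement to the Q-Q plot and is not part of plot(fit).

```r
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# Standardized residuals: share of points inside [-2, 2]
std_res <- rstandard(fit)
share_within_2 <- mean(abs(std_res) <= 2)

# Leverage (hat values) and Cook's distance for every observation
lev <- hatvalues(fit)
cook <- cooks.distance(fit)

print(share_within_2)
# Observations with the largest Cook's distance are the most influential
print(head(sort(cook, decreasing = TRUE), 3))

# Shapiro-Wilk normality test on the standardized residuals
print(shapiro.test(std_res))
```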

Residual analysis usually gives an absolute measure of fit, while R squared provides an alternative measure.

the R2 statistic

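R2 (R squared), the coefficient of determination, measures how well the regression equation fits. R2 lies between 0 and 1, and values closer to 1 indicate a better fit. If R squared equals 0.60, roughly 60% of the variance in Y is explained by X in the fitted equation.

R squared measures the linear relationship between X and Y, and so does correlation analysis, so the R squared statistic and the correlation coefficient r serve the same purpose. In fact, in simple linear regression R squared equals r squared, so the squared correlation can be used in place of the R2 statistic. In multiple linear regression, however, the concept of correlation between a pair of variables does not carry over to the regression as a whole, so the linear relationship can only be measured with R squared.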
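The identity for simple linear regression is easy to check numerically. The sketch below uses the iris model again and also computes R squared from its definition, 1 - RSS/TSS; the variable names are illustrative.

```r
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
r <- cor(iris$Petal.Length, iris$Petal.Width)
r_squared <- summary(fit)$r.squared

# R^2 from its definition: 1 - RSS / TSS
rss <- sum(residuals(fit)^2)
tss <- sum((iris$Petal.Length - mean(iris$Petal.Length))^2)

print(c(r_squared = r_squared, r_sq = r^2, by_hand = 1 - rss / tss))
```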

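Model prediction

A model can be validated by training it on a random sample of the data and then evaluating it on the remaining observations.

predict() computes confidence intervals and prediction intervals for the predictions.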
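A minimal sketch of this workflow on iris follows. The seed, the 100/50 split, and the use of RMSE as the evaluation metric are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the split reproducible
train_idx <- sample(nrow(iris), size = 100)  # 100 of 150 rows for training
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

fit_train <- lm(Petal.Length ~ Petal.Width, data = train)

# Confidence interval: uncertainty about the mean response at each Petal.Width
conf_int <- predict(fit_train, newdata = test, interval = "confidence")
# Prediction interval: uncertainty about an individual new observation (wider)
pred_int <- predict(fit_train, newdata = test, interval = "prediction")

# Root mean squared error on the held-out rows
rmse <- sqrt(mean((test$Petal.Length - pred_int[, "fit"])^2))
print(head(conf_int, 3))
print(head(pred_int, 3))
print(rmse)
```

Both calls return a matrix with columns fit, lwr, and upr; the default level is 95%.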

Multiple regression analysis

lm(y ~ x1 + x2 + x3)
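For a concrete instance of this formula, the sketch below fits three predictors on iris. The choice of columns is an assumption made for illustration and does not come from the original post.

```r
fit_multi <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, data = iris)

# One intercept plus one slope per predictor
print(coef(fit_multi))

# The overall F-statistic now tests all three slopes jointly
print(summary(fit_multi)$fstatistic)
```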

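Significance tests for the regression equation

The F-test is suitable when the number of predictors is not too large relative to the number of observations. With too many predictors, consider approaches for high-dimensional data such as forward selection.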

"An Introduction to Statistical Learning with Applications in R" - G. James, D. Witten, T. Hastie, R. Tibshirani - Chapter 6

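Variable selection - screening variables by importance

  1. Look at the p-value of each predictor. - "An Introduction to Statistical Learning with Applications in R" - G. James, D. Witten, T. Hastie, R. Tibshirani - Chapter 6
  2. Try the variables one at a time and see which model is better.
    • Akaike information criterion (AIC)
    • Bayesian information criterion (BIC)
  3. Adjusted R squared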
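The three criteria in the list above can be compared side by side for candidate models. In this sketch the two models and their names (small, large) are illustrative; lower AIC and BIC are better, and higher adjusted R squared is better.

```r
small <- lm(Petal.Length ~ Petal.Width, data = iris)
large <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, data = iris)

comparison <- data.frame(
  model  = c("small", "large"),
  AIC    = c(AIC(small), AIC(large)),
  BIC    = c(BIC(small), BIC(large)),
  adj_r2 = c(summary(small)$adj.r.squared, summary(large)$adj.r.squared)
)
print(comparison)
```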

Approaches proposed by the experts include

Forward selection

Backward selection

Mixed selection

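Tip: backward selection cannot be used if p > n. Forward selection is a greedy approach that can always be used, but it may include variables early on that later become redundant; mixed selection can remedy this.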
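In base R, step() performs AIC-based stepwise selection in all three directions. The sketch below uses mtcars with four candidate predictors picked only for illustration; it is one possible way to run these procedures, not necessarily the setup the cited book uses.

```r
null_model <- lm(mpg ~ 1, data = mtcars)
full_model <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
search_scope <- list(lower = ~ 1, upper = ~ wt + hp + disp + qsec)

# Forward: start from the intercept-only model and add terms
forward <- step(null_model, scope = search_scope, direction = "forward", trace = 0)
# Backward: start from the full model and drop terms
backward <- step(full_model, direction = "backward", trace = 0)
# Mixed: start from the intercept-only model, allow both adding and dropping
mixed <- step(null_model, scope = search_scope, direction = "both", trace = 0)

print(formula(forward))
print(formula(backward))
print(formula(mixed))
```

Setting trace = 0 suppresses the step-by-step log; remove it to see which term is added or dropped at each step.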

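Assessing model accuracy

RSE and R2 are the two most common numerical measures of model fit; they are computed and interpreted the same way as in simple linear regression. R squared differs from r squared here: in general it no longer equals the squared correlation between Y and any single predictor.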
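Both measures can be read from the fitted model or recomputed by hand. This sketch assumes an illustrative three-predictor model on iris. It also checks a standard identity: in multiple regression, R squared equals the squared correlation between Y and the fitted values.

```r
fit_multi <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, data = iris)

n <- nrow(iris)
p <- length(coef(fit_multi)) - 1  # number of predictors, excluding the intercept

# RSE = sqrt(RSS / (n - p - 1))
rss <- sum(residuals(fit_multi)^2)
rse_by_hand <- sqrt(rss / (n - p - 1))

# R^2 as the squared correlation between Y and the fitted values
r2_from_fitted <- cor(iris$Petal.Length, fitted(fit_multi))^2

print(c(rse = sigma(fit_multi), rse_by_hand = rse_by_hand))
print(c(r2 = summary(fit_multi)$r.squared, r2_from_fitted = r2_from_fitted))
```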

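Model prediction

A model can be validated by training it on a random sample of the data and then evaluating it on the remaining observations.

Other considerations

Qualitative predictors:

Regress on numeric indicator variables, also called dummy variables.

contrasts() returns the dummy-variable coding: which levels were recoded and what numeric values they received.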
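As a short sketch, the Species factor in iris shows how the coding works under R's default treatment contrasts. The model below is illustrative only.

```r
# Species is a factor with three levels; R encodes it with two dummy variables
print(contrasts(iris$Species))

fit_species <- lm(Petal.Length ~ Species, data = iris)
print(coef(fit_species))

# First rows of the design matrix show the 0/1 columns actually used in the fit
print(head(model.matrix(fit_species), 3))
```

With this coding the intercept is the mean of the baseline level (setosa), and each remaining coefficient is the difference between that level's mean and the baseline.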

Potential problems

  1. Non-linearity of the response-predictor relationships.
  2. Correlation of error terms.
  3. Non-constant variance of error terms.
  4. Outliers.
  5. High-leverage points.
  6. Collinearity.
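Of the problems listed above, collinearity is commonly checked with the variance inflation factor (VIF). The sketch below computes it by hand in base R for four mtcars predictors chosen for illustration. A common rule of thumb treats values above 5 or 10 as a sign of problematic collinearity.

```r
predictors <- c("wt", "hp", "disp", "qsec")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the others
vif_by_hand <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})
print(round(vif_by_hand, 2))
```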

Visualization

ggplot2 provides geom_smooth(), which can fit a regression and plot it automatically.

library(ggplot2)
# Scatterplot of the cars data with a fitted least-squares line
ggplot(data = cars) +
  geom_point(aes(speed, dist)) +
  geom_smooth(aes(speed, dist), method = "lm")
(Figure: summary_fit.png)

dist = 3.9324 * speed - 17.5791
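The coefficients in this equation can be reproduced by fitting the model directly; fit_cars is an illustrative name.

```r
fit_cars <- lm(dist ~ speed, data = cars)

# Intercept and slope, rounded to four decimals as in the equation above
print(round(coef(fit_cars), 4))
```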

If the model needs tuning, once the equation is known it can also be drawn with geom_line().

ggplot(data = cars) + 
  geom_point(aes(speed, dist)) +
  geom_line(aes(speed, -17.579 + 3.9324*speed),
            color = "red",size = 1) + theme_minimal()
(Figure: geom_smooth.png)

ggeffects

ggeffects is a ggplot2-based package for visualizing fitted models. It currently supports: bamlss, bayesx, betabin, betareg, bglmer, blmer, bracl, brglm, brmsfit, brmultinom, clm, clm2, clmm, coxph, fixest, gam (package mgcv), Gam (package gam), gamlss, gamm, gamm4, gee, geeglm, glm, glm.nb, glmer, glmer.nb, glmmTMB, glmmPQL, glmrob, glmRob, gls, hurdle, ivreg, lm, lm_robust, lme, lmer, lmrob, lmRob, logistf, lrm, MixMod, MCMCglmm, multinom, negbin, nlmer, ols, plm, polr, rlm, rlmer, rq, rqss, stanreg, survreg, svyglm, svyglm.nb, tobit, truncreg, vgam, wbm, zeroinfl and zerotrunc.

Overview of R functions for regression analysis

R Functions For Regression Analysis - Vito Ricci

Linear model

  1. Anova: Anova Tables for Linear and Generalized Linear Models (car)
  2. anova: Compute an analysis of variance table for one or more linear model fits (stats)
  3. coef: is a generic function which extracts model coefficients from objects returned by modeling functions. coefficients is an alias for it (stats)
  4. coeftest: Testing Estimated Coefficients (lmtest)
  5. confint: Computes confidence intervals for one or more parameters in a fitted model. Base has a method for objects inheriting from class "lm" (stats)
  6. deviance: Returns the deviance of a fitted model object (stats)
  7. effects: Returns (orthogonal) effects from a fitted model, usually a linear model. This is a generic function, but currently only has methods for objects inheriting from classes "lm" and "glm" (stats)
  8. fitted: is a generic function which extracts fitted values from objects returned by modeling functions; fitted.values is an alias for it (stats)
  9. formula: provide a way of extracting formulae which have been included in other objects (stats)
  10. linear.hypothesis: Test Linear Hypothesis (car)
  11. lm: is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (stats)
  12. model.matrix: creates a design matrix (stats)
  13. predict: Predicted values based on linear model object (stats)
  14. residuals: is a generic function which extracts model residuals from objects returned by modeling functions (stats)
  15. summary.lm: summary method for class "lm" (stats)
  16. vcov: Returns the variance-covariance matrix of the main parameters of a fitted model object (stats)

Model – Variables selection

  1. add1: Compute all the single terms in the scope argument that can be added to or dropped from the model, fit those models and compute a table of the changes in fit (stats)
  2. AIC: Generic function calculating the Akaike information criterion for one or several fitted model objects for which a log-likelihood value can be obtained, according to the formula -2*log-likelihood + k*npar, where npar represents the number of parameters in the fitted model, and k = 2 for the usual AIC, or k = log(n) (n the number of observations) for the so-called BIC or SBC (Schwarz's Bayesian criterion) (stats)
  3. Cpplot: Cp plot (faraway)
  4. drop1: Compute all the single terms in the scope argument that can be added to or dropped from the model, fit those models and compute a table of the changes in fit (stats)
  5. extractAIC: Computes the (generalized) Akaike An Information Criterion for a fitted parametric model (stats)
  6. leaps: Subset selection by `leaps and bounds' (leaps)
  7. maxadjr: Maximum Adjusted R-squared (faraway)
  8. offset: An offset is a term to be added to a linear predictor, such as in a generalised linear model, with known coefficient 1 rather than an estimated coefficient (stats)
  9. step: Select a formula-based model by AIC (stats)
  10. update.formula: is used to update model formulae. This typically involves adding or dropping terms, but updates can be more general (stats)

Diagnostics

  1. cookd: Cook's Distances for Linear and Generalized Linear Models (car)
  2. cooks.distance: Cook’s distance (stats)
  3. covratio: covariance ratio (stats)
  4. dfbeta: DFBETA (stats)
  5. dfbetas: DFBETAS (stats)
  6. dffits: DFFITS (stats)
  7. hat: diagonal elements of the hat matrix (stats)
  8. hatvalues: diagonal elements of the hat matrix (stats)
  9. influence.measures: This suite of functions can be used to compute some of the regression (leave-one-out deletion) diagnostics for linear and generalized linear models (stats)
  10. lm.influence: This function provides the basic quantities which are used in forming a wide variety of diagnostics for checking the quality of regression fits (stats)
  11. ls.diag: Computes basic statistics, including standard errors, t- and p-values for the regression coefficients (stats)
  12. outlier.test: Bonferroni Outlier Test (car)
  13. rstandard: standardized residuals (stats)
  14. rstudent: studentized residuals (stats)
  15. vif: Variance Inflation Factor (car)

Graphics

  1. ceres.plots: Ceres Plots (car)
  2. cr.plots: Component+Residual (Partial Residual) Plots (car)
  3. influence.plot: Regression Influence Plot (car)
  4. leverage.plots: Regression Leverage Plots (car)
  5. panel.car: Panel Function Coplots (car)
  6. plot.lm: Four plots (selectable by which) are currently provided: a plot of residuals against fitted values, a Scale-Location plot of sqrt(|residuals|) against fitted values, a Normal Q-Q plot, and a plot of Cook's distances versus row labels (stats)
  7. prplot: Partial Residual Plot (faraway)
  8. qq.plot: Quantile-Comparison Plots (car)
  9. qqline: adds a line to a normal quantile-quantile plot which passes through the first and third quartiles (stats)
  10. qqnorm: is a generic function the default method of which produces a normal QQ plot of the values in y (stats)
  11. reg.line: Plot Regression Line (car)
  12. scatterplot.matrix: Scatterplot Matrices (car)
  13. scatterplot: Scatterplots with Boxplots (car)
  14. spread.level.plot: Spread-Level Plots (car)