【R->练习】生物统计学练习4
2019-05-28 本文已影响0人
莫讠
一、数据如文件1_data.txt所示,x为自变量,y为因变量
(1) 画出y~x散点图
(2) 求y=ax+b的回归方程,并在散点图上画出回归直线;
(3) 计算x和y的相关系数
(4) 求(2)中得到的回归模型的残差并计算是否满足正太分布
(5) 判断所拟合的回归线对于数据集是否够好?(修改版)
解:
(1)
> setwd("/home/kevin/Desktop/biostatistics/2018 homework/homework 6/")
> data_1 <- read.table("1_data.txt",header = T)
# 作散点图
> plot(data_1)
散点图
(2)
# 对data1 进行线性回归
> fit <- lm(data_1$y~data_1$x)
> abline(fit)
> summary(fit)
Call:
lm(formula = data_1$y ~ data_1$x)
Residuals:
Min 1Q Median 3Q Max
-2119.8 -1136.2 -328.5 968.6 2390.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -487.8408 1663.3041 -0.293 0.772
data_1$x 0.7175 0.1255 5.718 9.44e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1327 on 22 degrees of freedom
Multiple R-squared: 0.5978, Adjusted R-squared: 0.5795
F-statistic: 32.7 on 1 and 22 DF, p-value: 9.438e-06
回归方程为:y = -487.8408 + 0.7175
添加回归线(3)
# 计算x与y之间的相关系数
> cor(data_1$x,y = data_1$y)
[1] 0.7731769
(4)
> data1.res <-residuals(fit)
> shapiro.test(data1.res)
Shapiro-Wilk normality test
data: data1.res
W = 0.94895, p-value = 0.2571
p-value = 0.2574 < 0.05, 因此接受原假设,即残差满足正态分布
第二题:
如数据lung_cancer.txt所示,科学家希望通过找出一些和肺癌相关的因素来预测人未来是否会得肺癌。利用logistic回归的方式,以人是否得肺癌为因变量,以基因A的表达、性别、是否吸烟、是否喝酒和年龄这5个因素为自变量,计算自变量的回归系数、回归显著性并尝试是否可以简化模型。
数据中,status 0表示未患病,1表示患病;gender 0 表示男性,1表示女;smoke 1表示抽烟,0表示不抽烟;drink 1表示喝酒,0表示不喝酒。
注:基因A的表达应用看家基因(housekeeping gene)的表达(exp_actin)进行校正,即用基因A的表达值除以看家基因的表达值。
> lung_cancer <- read.table("lung_cancer.txt",header = T)
> head(lung_cancer)
status exp_A exp_actin gender smoke drink age
1 0 105.68 212.13 0 0 0 51
2 0 74.33 208.44 1 0 0 53
3 0 57.71 244.47 1 0 0 55
4 0 94.22 181.19 1 0 0 50
5 0 49.19 187.21 0 0 0 60
6 0 76.93 188.44 1 0 1 47
> lung_cancer$normalized_A <- lung_cancer$exp_A/lung_cancer$exp_actin
> fit2 <- glm(status~lung_cancer$normalized_A+gender+smoke+drink+age,data = lung_cancer,family = binomial())
> summary(fit2)
Call:
glm(formula = status ~ lung_cancer$normalized_A + gender + smoke +
drink + age, family = binomial(), data = lung_cancer)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2550 -0.8264 0.4466 0.8115 1.7073
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.363e+00 2.173e+00 -1.087 0.27689
lung_cancer$normalized_A 6.197e+00 1.545e+00 4.012 6.02e-05 ***
gender 7.298e-03 3.944e-01 0.019 0.98524
smoke 1.262e+00 4.255e-01 2.967 0.00301 **
drink -4.293e-01 4.175e-01 -1.028 0.30373
age -6.139e-05 3.658e-02 -0.002 0.99866
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 209.36 on 169 degrees of freedom
Residual deviance: 170.84 on 164 degrees of freedom
AIC: 182.84
Number of Fisher Scoring iterations: 4
性别,年龄,饮酒对是否患病的影响不显著,去掉这三个变量重新拟合
> fit_re <- glm(status~normalized_A+smoke,data = lung_cancer, family = binomial())
> summary(fit_re)
Call:
glm(formula = status ~ normalized_A + smoke, family = binomial(),
data = lung_cancer)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3523 -0.8975 0.4732 0.7749 1.6852
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.4379 0.7067 -3.449 0.000562 ***
normalized_A 6.0794 1.5263 3.983 6.8e-05 ***
smoke 1.2289 0.4183 2.938 0.003306 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 209.36 on 169 degrees of freedom
Residual deviance: 171.96 on 167 degrees of freedom
AIC: 177.96
Number of Fisher Scoring iterations: 4
> anova(fit2,fit_re,test = "Chisq")
Analysis of Deviance Table
Model 1: status ~ lung_cancer$normalized_A + gender + smoke + drink +
age
Model 2: status ~ normalized_A + smoke
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 164 170.84
2 167 171.96 -3 -1.1221 0.7717
Pr>0.05 两个模型差异不显著,拟合程度一致,基因A和抽烟对患病有显著的影响。