生物统计专题

【R->练习】生物统计学练习4

2019-05-28  本文已影响0人  莫讠

一、数据如文件1_data.txt所示,x为自变量,y为因变量
(1) 画出y~x散点图
(2) 求y=ax+b的回归方程,并在散点图上画出回归直线;
(3) 计算x和y的相关系数
(4) 求(2)中得到的回归模型的残差并计算是否满足正太分布
(5) 判断所拟合的回归线对于数据集是否够好?(修改版)

解:
(1)

> setwd("/home/kevin/Desktop/biostatistics/2018 homework/homework 6/")
> data_1 <- read.table("1_data.txt",header = T)
# 作散点图
> plot(data_1)
散点图

(2)

# 对data1 进行线性回归
> fit <- lm(data_1$y~data_1$x)
> abline(fit)
> summary(fit)

Call:
lm(formula = data_1$y ~ data_1$x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2119.8 -1136.2  -328.5   968.6  2390.8 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -487.8408  1663.3041  -0.293    0.772    
data_1$x       0.7175     0.1255   5.718 9.44e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1327 on 22 degrees of freedom
Multiple R-squared:  0.5978,    Adjusted R-squared:  0.5795 
F-statistic:  32.7 on 1 and 22 DF,  p-value: 9.438e-06

回归方程为:y = -487.8408 + 0.7175

添加回归线

(3)

# 计算x与y之间的相关系数
> cor(data_1$x,y = data_1$y)
[1] 0.7731769

(4)

> data1.res <-residuals(fit)
> shapiro.test(data1.res)

    Shapiro-Wilk normality test

data:  data1.res

W = 0.94895, p-value = 0.2571

p-value = 0.2574 < 0.05, 因此接受原假设,即残差满足正态分布

第二题:

如数据lung_cancer.txt所示,科学家希望通过找出一些和肺癌相关的因素来预测人未来是否会得肺癌。利用logistic回归的方式,以人是否得肺癌为因变量,以基因A的表达、性别、是否吸烟、是否喝酒和年龄这5个因素为自变量,计算自变量的回归系数、回归显著性并尝试是否可以简化模型。
数据中,status 0表示未患病,1表示患病;gender 0 表示男性,1表示女;smoke 1表示抽烟,0表示不抽烟;drink 1表示喝酒,0表示不喝酒。
注:基因A的表达应用看家基因(housekeeping gene)的表达(exp_actin)进行校正,即用基因A的表达值除以看家基因的表达值。

> lung_cancer <- read.table("lung_cancer.txt",header = T)
> head(lung_cancer)
  status  exp_A exp_actin gender smoke drink age
1      0 105.68    212.13      0     0     0  51
2      0  74.33    208.44      1     0     0  53
3      0  57.71    244.47      1     0     0  55
4      0  94.22    181.19      1     0     0  50
5      0  49.19    187.21      0     0     0  60
6      0  76.93    188.44      1     0     1  47
> lung_cancer$normalized_A <- lung_cancer$exp_A/lung_cancer$exp_actin
> fit2 <- glm(status~lung_cancer$normalized_A+gender+smoke+drink+age,data = lung_cancer,family = binomial())

> summary(fit2)

Call:
glm(formula = status ~ lung_cancer$normalized_A + gender + smoke + 
    drink + age, family = binomial(), data = lung_cancer)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2550  -0.8264   0.4466   0.8115   1.7073  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -2.363e+00  2.173e+00  -1.087  0.27689    
lung_cancer$normalized_A  6.197e+00  1.545e+00   4.012 6.02e-05 ***
gender                    7.298e-03  3.944e-01   0.019  0.98524    
smoke                     1.262e+00  4.255e-01   2.967  0.00301 ** 
drink                    -4.293e-01  4.175e-01  -1.028  0.30373    
age                      -6.139e-05  3.658e-02  -0.002  0.99866    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 209.36  on 169  degrees of freedom
Residual deviance: 170.84  on 164  degrees of freedom
AIC: 182.84

Number of Fisher Scoring iterations: 4


性别,年龄,饮酒对是否患病的影响不显著,去掉这三个变量重新拟合

> fit_re <- glm(status~normalized_A+smoke,data = lung_cancer, family = binomial())
> summary(fit_re)

Call:
glm(formula = status ~ normalized_A + smoke, family = binomial(), 
    data = lung_cancer)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3523  -0.8975   0.4732   0.7749   1.6852  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.4379     0.7067  -3.449 0.000562 ***
normalized_A   6.0794     1.5263   3.983  6.8e-05 ***
smoke          1.2289     0.4183   2.938 0.003306 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 209.36  on 169  degrees of freedom
Residual deviance: 171.96  on 167  degrees of freedom
AIC: 177.96

Number of Fisher Scoring iterations: 4

> anova(fit2,fit_re,test = "Chisq")
Analysis of Deviance Table

Model 1: status ~ lung_cancer$normalized_A + gender + smoke + drink + 
    age
Model 2: status ~ normalized_A + smoke
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       164     170.84                     
2       167     171.96 -3  -1.1221   0.7717

Pr>0.05 两个模型差异不显著,拟合程度一致,基因A和抽烟对患病有显著的影响。

上一篇下一篇

猜你喜欢

热点阅读