TCGA | 生存曲线
本教程目录:
-
首先使用cgdsr获取表达数据集临床信息
-
临床资料解读
-
简单的KM生存分析
-
有分类的KM生存分析
-
根据基因表达量对样本进行分组做生存分析
-
cox生存分析
-
某基因突变与否也可以用来分组
-
基因的拷贝数也可以进行分组
-
批量进行分组并且做生存分析
生存分析一般来说是针对RNA表达数据,可以说mRNA-seq的转录组数据,也可以说miRNA-seq数据,或者基因表达芯片的表达量值。
生存分析,大多就是说的KM方法估计生存函数,并且画出生存曲线,然后还可以根据分组检验一下它们的生存曲线是否有显著的差异。
在R中,有个包survival做生存分析就很方便!只需要记住和熟练使用三个函数:
-
Surv:用于创建生存数据对象
-
survfit:创建KM生存曲线或是Cox调整生存曲线
-
survdiff:用于不同组的统计检验
首先使用cgdsr获取表达数据集临床信息
既然是要说明如何对任意癌症的任意基因做生存分析,那么我们首先需要理解cgdsr下载TCGA任意数据的用法(见之前的教程),下面的例子是获取TCGA数据库的乳腺癌的BRCA1和BRCA2基因的表达,以及涉及到的病人的临床资料。
rm(list = ls())library(cgdsr)library(DT) mycgds <- CGDS("http://www.cbioportal.org/public-portal/")mycancerstudy = 'brca_tcga' ## 下面的代码可以不需要运行,因为已经保存好了用来做生存分析的数据。### 但是需要看懂代码,这样才能做任意癌症的任意基因的任意数据的生存分析;if(F){ getCaseLists(mycgds,mycancerstudy)[,1] getGeneticProfiles(mycgds,mycancerstudy)[,1] mycaselist ='brca_tcga_rna_seq_v2_mrna' mygeneticprofile = 'brca_tcga_rna_seq_v2_mrna' choose_genes=c('BRCA1','BRCA2') ## get expression data expr=getProfileData(mycgds,choose_genes, mygeneticprofile,mycaselist) ## get mutation data mut_df <- getProfileData(mycgds, caseList ="brca_tcga_sequenced", geneticProfile = "brca_tcga_mutations", genes = choose_genes ) mut_df <- apply(mut_df,2,as.factor) mut_df[mut_df == "NaN"] = "" mut_df[is.na(mut_df)] = "" mut_df[mut_df != ''] = "MUT" ## get copy number data cna <- getProfileData(mycgds, caseList ="brca_tcga_sequenced", geneticProfile = "brca_tcga_gistic", genes = choose_genes) rn=rownames(cna) cna <- apply(cna,2,function(x) as.character(factor(x, levels = c(-2:2), labels = c("HOMDEL", "HETLOSS", "DIPLOID", "GAIN", "AMP")))) cna[is.na(cna)] = "" cna[cna == 'DIPLOID'] = "" rownames(cna)=rn # Get clinical data for the case list myclinicaldata = getClinicalData(mycgds,mycaselist) save(expr,myclinicaldata,cna,mut_df,file='survival_input.Rdata')}load(file='survival_input.Rdata')DT::datatable(expr)
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: left; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">上述代码取决于网速,我已经下载整理好了:survival_input.Rdata 数据,避免每次重复这个教程重新下载的尴尬</figcaption>
DT::datatable(myclinicaldata, extensions = 'FixedColumns', options = list( #dom = 't', scrollX = TRUE, fixedColumns = TRUE ))
## Warning in instance$preRenderHook(instance): It seems your data is too## big for client-side DataTables. You may consider server-side processing:## http://rstudio.github.io/DT/server.html
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">
</figcaption>
可以看到所谓的表达矩阵就是每个基因在各个样本的表达量,只不过是要注意单位,可以是RPKM,TPM等。
生存分析所需的临床信息一般是生存时间和是否死亡的状态信息。
这两个数据,也可以自己从其它途径(比如很多人喜欢excel表格)整理,然后读入到R语言里面再来做生存分析。
临床资料解读
临床信息是非常复杂的,但是我们并不需要理解那么多,下面是我挑出的一些比较重要的,需要彻底理解:
choose_columns=c('AJCC_METASTASIS_PATHOLOGIC_PM','AJCC_NODES_PATHOLOGIC_PN','AJCC_PATHOLOGIC_TUMOR_STAGE','AJCC_TUMOR_PATHOLOGIC_PT', "AGE",'SEX','VITAL_STATUS','OS_STATUS','OS_MONTHS', "DFS_MONTHS","DFS_STATUS")choose_clinicaldata=myclinicaldata[,choose_columns]colnames(choose_clinicaldata)[1:4]=c('M','N','STAGE','T')str(choose_clinicaldata, no.list = T, vec.len = 2)
## 'data.frame': 1100 obs. of 11 variables:## $ M : chr "M0" "MX" ...## $ N : chr "N0" "N3" ...## $ STAGE : chr "Stage IIA" "Stage IIIC" ...## $ T : chr "T2" "T3" ...## $ AGE : int 62 59 70 42 69 ...## $ SEX : chr "Female" "Female" ...## $ VITAL_STATUS: chr "Alive" "Alive" ...## $ OS_STATUS : chr "LIVING" "LIVING" ...## $ OS_MONTHS : num 10.3 26 ...## $ DFS_MONTHS : num 10.3 26 ...## $ DFS_STATUS : chr "DiseaseFree" "DiseaseFree" ...
虽然上面我挑出的临床信息还有很多,但是我们只需要用到OS_MONTHS和OS_STATUS就可以来估计KM生存函数,画出最简单的生存曲线!
-
无病生存期(Disease-free survival,DFS)的定义是指从随机化开始至疾病复发或由于疾病进展导致患者死亡的时间。该指标也常作 为抗肿瘤药物III期临床试验的主要终点。某些情况下,DFS与OS相比,作为终点比较难以记录,因为它要求认真随访,及时发现疾病复发,而且肿瘤患者的 死亡原因也很难确定。肿瘤患者常有合并症(如,心血管病),这些合并症可能会干扰对DFS的判断。并且,肿瘤患者常死于医院外,不能常规进行尸检。
-
总生存期(Overall survival,OS) 的定义是指从随机化开始至因任何原因引起死亡的时间。该指标常常被认为是肿瘤临床试验中最佳的疗效终点。如果在生存期上有小幅度的提高,可以认为是有意义的临床受益证据。作为一个终点,生存期应每天进行评价,可通过在住院就诊时,通过与患者直接接触或者通过电话与患者交谈,这些相对比较容易 记录。确认死亡的日期通常几乎没有困难,并且死亡的时间有其独立的因果关系。当记录至死亡之前的失访患者,通常截止到最后一次有记录的、与患者接触的时间。
事件时间分布如下:死亡(DECEASED)和存活(LIVING)
dat=choose_clinicaldata[choose_clinicaldata$OS_MONTHS > 0,]table(dat$OS_STATUS)
## ## DECEASED LIVING ## 154 930
attach(dat)
## The following object is masked from package:base:## ## T
library(ggplot2)ggplot(dat, aes(x = OS_MONTHS, group = OS_STATUS,colour = OS_STATUS, fill = OS_STATUS)) + geom_density(alpha = 0.5)
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
detach(dat)
简单的KM生存分析
library(survival)dat=choose_clinicaldata[choose_clinicaldata$OS_MONTHS > 0,]table(dat$OS_STATUS)
## ## DECEASED LIVING ## 154 930
attach(dat)
## The following object is masked from package:base:## ## T
## 估计KM生存曲线my.surv <- Surv(OS_MONTHS,OS_STATUS=='DECEASED')## The status indicator, normally 0=alive, 1=dead. ## Other choices are TRUE/FALSE (TRUE = death) or 1/2 (2=death). kmfit1 <- survfit(my.surv~1)# summary(kmfit1)plot(kmfit1)
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
detach(dat)
上面的生存分析并没有指定样本如何进行分组,是最简单版本的生存分析了。
我们很容易看到,随着感癌症的时间延长,病人的死亡率到了一定程度就不增加了,有些病人熬过了这个癌症,或者说,到我们拿到数据为止,他们还没有死亡!
有分类的KM生存分析
首先根据性别分组,看看不同性别的乳腺癌患者的生存曲线
library(survival)dat=choose_clinicaldata[choose_clinicaldata$OS_MONTHS > 0,]attach(dat)
## The following object is masked from package:base:## ## T
table(SEX,OS_STATUS)
## OS_STATUS## SEX DECEASED LIVING## Female 153 919## Male 1 11
my.surv <- Surv(OS_MONTHS,OS_STATUS=='DECEASED')kmfit2 <- survfit(my.surv~SEX)
plot(kmfit2,col = rainbow(length(unique(SEX))))
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: left; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">这个其实没啥必要,因为乳腺癌本身在女性发病率远高于男性,尤其是TCGA里面,就12个男性样本!</figcaption>
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
# 看这个因子的不同水平是否有显著差异,其中默认用是的logrank test 方法。survdiff(my.surv~(SEX))
## Call:## survdiff(formula = my.surv ~ (SEX))## ## n=1084, 3 observations deleted due to missingness.## ## N Observed Expected (O-E)^2/E (O-E)^2/V## SEX=Female 1072 153 152.8 0.000262 0.0337## SEX=Male 12 1 1.2 0.033305 0.0337## ## Chisq= 0 on 1 degrees of freedom, p= 0.854
# 来检测自己感兴趣的因子是否受其它因子(age,gender等等)的影响。# 用strata来控制协变量的影响, 比如年龄等因素survdiff(my.surv~SEX+strata(AGE))
## Call:## survdiff(formula = my.surv ~ SEX + strata(AGE))## ## n=1084, 3 observations deleted due to missingness.## ## N Observed Expected (O-E)^2/E (O-E)^2/V## SEX=Female 1072 153 153.181 0.000214 0.0437## SEX=Male 12 1 0.819 0.040052 0.0437## ## Chisq= 0 on 1 degrees of freedom, p= 0.834
detach(dat)
这个分析意义真不大,因为乳腺癌女性患者数量要远超过男性患者,但是有非常多有意义的分类呀!
然后根据癌症的stage来进行分类
library(survival)dat=choose_clinicaldata[choose_clinicaldata$OS_MONTHS > 0,]attach(dat)
## The following object is masked from package:base:## ## T
table(STAGE,OS_STATUS)
## OS_STATUS## STAGE DECEASED LIVING## 4 7## Stage I 13 77## Stage IA 3 82## Stage IB 0 5## Stage II 0 6## Stage IIA 34 319## Stage IIB 34 221## Stage III 2 0## Stage IIIA 25 129## Stage IIIB 8 17## Stage IIIC 9 57## Stage IV 15 5## Stage X 7 5
my.surv <- Surv(OS_MONTHS,OS_STATUS=='DECEASED')kmfit2 <- survfit(my.surv~STAGE)
plot(kmfit2,col = rainbow(length(unique(STAGE))))
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
# 检验显著性survdiff(my.surv~(STAGE))
## Call:## survdiff(formula = my.surv ~ (STAGE))## ## n=1084, 3 observations deleted due to missingness.## ## N Observed Expected (O-E)^2/E (O-E)^2/V## STAGE= 11 4 2.890 0.426 0.443## STAGE=Stage I 90 13 20.141 2.532 2.949## STAGE=Stage IA 85 3 11.594 6.370 7.015## STAGE=Stage IB 5 0 0.825 0.825 0.831## STAGE=Stage II 6 0 1.246 1.246 1.266## STAGE=Stage IIA 353 34 45.149 2.753 3.921## STAGE=Stage IIB 255 34 37.445 0.317 0.420## STAGE=Stage III 2 2 0.209 15.375 15.415## STAGE=Stage IIIA 154 25 20.460 1.007 1.169## STAGE=Stage IIIB 25 8 3.874 4.393 4.598## STAGE=Stage IIIC 66 9 4.997 3.208 3.360## STAGE=Stage IV 20 15 2.389 66.572 67.799## STAGE=Stage X 12 7 2.781 6.403 6.632## ## Chisq= 112 on 12 degrees of freedom, p= 0
# 用strata来控制协变量的影响, 比如年龄等因素survdiff(my.surv~STAGE+strata(AGE))
## Call:## survdiff(formula = my.surv ~ STAGE + strata(AGE))## ## n=1084, 3 observations deleted due to missingness.## ## N Observed Expected (O-E)^2/E (O-E)^2/V## STAGE= 11 4 3.943 8.19e-04 0.00214## STAGE=Stage I 90 13 21.507 3.36e+00 6.59955## STAGE=Stage IA 85 3 10.870 5.70e+00 7.79916## STAGE=Stage IB 5 0 0.395 3.95e-01 0.48088## STAGE=Stage II 6 0 1.146 1.15e+00 1.56141## STAGE=Stage IIA 353 34 42.376 1.66e+00 2.93671## STAGE=Stage IIB 255 34 36.959 2.37e-01 0.50934## STAGE=Stage III 2 2 0.545 3.88e+00 4.13017## STAGE=Stage IIIA 154 25 20.534 9.71e-01 1.53489## STAGE=Stage IIIB 25 8 2.585 1.13e+01 14.06707## STAGE=Stage IIIC 66 9 5.054 3.08e+00 3.62911## STAGE=Stage IV 20 15 4.698 2.26e+01 69.87385## STAGE=Stage X 12 7 3.389 3.85e+00 6.94665## ## Chisq= 117 on 12 degrees of freedom, p= 0
detach(dat)
可以看到癌症诊断里面的stage种类太多,生存分析曲线可视化效果不好。后面我们会介绍一个比较好的生存分析结果可视化包。
根据基因表达量对样本进行分组做生存分析
首先探索一下基因的表达量情况,这里选取的是BRCA1,BRCA2这两个基因。
library(survival)dat=cbind(choose_clinicaldata[,c('OS_STATUS','OS_MONTHS')],expr[rownames(choose_clinicaldata),])dat=dat[dat$OS_MONTHS > 0,]dat=dat[!is.na(dat$OS_STATUS),]dat$OS_STATUS=as.character(dat$OS_STATUS)# ggplot(dat,aes(x=OS_STATUS,y=BRCA1))+geom_boxplot()library(ggpubr)
## Loading required package: magrittr
p <- ggboxplot(dat, x="OS_STATUS", y="BRCA1", color = "OS_STATUS", palette = "jco", add = "jitter")p+stat_compare_means(method = "t.test")
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
p <- ggboxplot(dat, x="OS_STATUS", y="BRCA2", color = "OS_STATUS", palette = "jco", add = "jitter")p+stat_compare_means(method = "t.test")
![](https://img.haomeiwen.com/i20722303/95ea4ea13315f342.gif)
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
有表达差异不一定代表有生存分析的显著性,下面分别进行生存分析检验。
dat$brca1_group=ifelse(dat$BRCA1 > median(dat$BRCA1),'high','low')dat$brca2_group=ifelse(dat$BRCA2 > median(dat$BRCA2),'high','low')attach(dat)table(brca1_group,brca2_group)
## brca2_group## brca1_group high low## high 375 167## low 167 375
library(survival)my.surv <- Surv(OS_MONTHS,OS_STATUS=='DECEASED')kmfit1 <- survfit(my.surv~brca1_group,data = dat) plot(kmfit1,col = rainbow( 2 ))
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
kmfit2 <- survfit(my.surv~brca2_group,data = dat) plot(kmfit2,col = rainbow( 2 ))
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
library("survminer")
ggsurvplot(kmfit1,conf.int =F, pval = T,risk.table =T, ncensor.plot = TRUE)
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
ggsurvplot(kmfit2,conf.int =F, pval = T,risk.table =T, ncensor.plot = TRUE)
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
可以看到这个survminer包对生存分析可视化的效果很赞,之所以可以显示P值,是因为我们的survfit函数已经做了检验,返回的kmfit这样的对象里面本身就含有非常丰富的信息,大家可以自行摸索。
当然,即使某个基因在KM生存分析显示有显著性的差异,也还需要排除其它因素的影响,而不仅仅是该基因表达的高低决定了生存的好坏。所以还需要进行下面的 cox生存分析
cox生存分析
Cox比例风险回归模型(Cox’s proportional hazards regression model),简称Cox回归模型。该模型由英国统计学家D.R.Cox于1972年提出,主要用于肿瘤和其它慢性病的预后分析,也可用于队列研究的病因探索。
参考:
http://www.sthda.com/english/wiki/cox-proportional-hazards-model
https://www.r-bloggers.com/cox-proportional-hazards-model/
library(survival)library("survminer")dat=choose_clinicaldata[choose_clinicaldata$OS_MONTHS > 0,]dat=dat[!is.na(dat$OS_STATUS),]dat$OS_STATUS=as.character(dat$OS_STATUS)str(dat, no.list = T, vec.len = 2)
## 'data.frame': 1084 obs. of 11 variables:## $ M : chr "M0" "MX" ...## $ N : chr "N0" "N3" ...## $ STAGE : chr "Stage IIA" "Stage IIIC" ...## $ T : chr "T2" "T3" ...## $ AGE : int 62 59 70 42 69 ...## $ SEX : chr "Female" "Female" ...## $ VITAL_STATUS: chr "Alive" "Alive" ...## $ OS_STATUS : chr "LIVING" "LIVING" ...## $ OS_MONTHS : num 10.3 26 ...## $ DFS_MONTHS : num 10.3 26 ...## $ DFS_STATUS : chr "DiseaseFree" "DiseaseFree" ...
attach(dat)
## The following objects are masked from dat (pos = 4):## ## OS_MONTHS, OS_STATUS
## The following object is masked from package:base:## ## T
my.surv <- Surv( dat$OS_MONTHS,dat$OS_STATUS=='DECEASED')## 可以针对多个变量分别批量进行cox分析,或者直接把多个变量一起放在cox函数里面。m=coxph(my.surv ~ AGE + SEX + N + T + M + STAGE, data = dat)
## Warning in fitter(X, Y, strats, offset, init, control, weights = weights, :## Loglik converged before variable 5,9,18,22,25,32,35,36 ; beta may be## infinite.
ggsurvplot(survfit(m, data = dat), color = "#2E9FDF", ggtheme = theme_minimal())
## Warning: Now, to change color palette, use the argument palette= '#2E9FDF'## instead of color = '#2E9FDF'
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
beta <- coef(m)se <- sqrt(diag(vcov(m)))HR <- exp(beta)HRse <- HR * sedetach(dat)
某基因突变与否也可以用来分组
library(survival)length(intersect(rownames(choose_clinicaldata),rownames(mut_df)))
## [1] 979
dat=cbind(choose_clinicaldata[rownames(mut_df),c('OS_STATUS','OS_MONTHS')],mut_df)dat=dat[dat$OS_MONTHS > 0,]dat=dat[!is.na(dat$OS_STATUS),]dat$OS_STATUS=as.character(dat$OS_STATUS) attach(dat)
## The following objects are masked from dat (pos = 4):## ## BRCA1, BRCA2, OS_MONTHS, OS_STATUS
table(BRCA1,BRCA2)
## BRCA2## BRCA1 MUT## 939 14## MUT 12 0
library(survival)my.surv <- Surv(OS_MONTHS,OS_STATUS=='DECEASED')kmfit1 <- survfit(my.surv~BRCA1,data = dat) kmfit2 <- survfit(my.surv~BRCA2,data = dat) library("survminer")
ggsurvplot(kmfit1,conf.int =F, pval = T,risk.table =T, ncensor.plot = TRUE)
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
ggsurvplot(kmfit2,conf.int =F, pval = T,risk.table =T, ncensor.plot = TRUE)
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
基因的拷贝数也可以进行分组
library(survival)length(intersect(rownames(choose_clinicaldata),rownames(cna)))
## [1] 979
dat=cbind(choose_clinicaldata[rownames(cna),c('OS_STATUS','OS_MONTHS')],cna)dat=dat[dat$OS_MONTHS > 0,]dat=dat[!is.na(dat$OS_STATUS),]dat$OS_STATUS=as.character(dat$OS_STATUS) attach(dat)
## The following objects are masked from dat (pos = 3):## ## BRCA1, BRCA2, OS_MONTHS, OS_STATUS
## The following objects are masked from dat (pos = 5):## ## BRCA1, BRCA2, OS_MONTHS, OS_STATUS
table(BRCA1,BRCA2)
## BRCA2## BRCA1 AMP GAIN HETLOSS HOMDEL## 288 2 21 118 8## AMP 5 0 0 7 2## GAIN 59 4 30 84 3## HETLOSS 110 6 43 163 4## HOMDEL 2 0 1 4 1
library(survival)my.surv <- Surv(OS_MONTHS,OS_STATUS=='DECEASED')kmfit1 <- survfit(my.surv~BRCA1,data = dat) kmfit2 <- survfit(my.surv~BRCA2,data = dat) library("survminer")
ggsurvplot(kmfit1,conf.int =F, pval = T,risk.table =T, ncensor.plot = TRUE)
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
ggsurvplot(kmfit2,conf.int =F, pval = T,risk.table =T, ncensor.plot = TRUE)
![](https://img.haomeiwen.com/i20722303/3e693890ef808bf7.gif)
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
批量进行分组并且做生存分析
这个代码非常经典,所以我可以放在了 生信技能树»生信技能树›互动作业›脚本能力实践›生信人必练的200个数据处理任务 上面,希望所有学习生物信息学的朋友都掌握:http://www.biotrainee.com/thread-929-1-1.html
(阅读原文可以直达)
## 首先模拟 50个病人的 200个基因的表达数据。## 50 patients and 200 genes dat=matrix(rnorm(10000,mean=8,sd=4),nrow = 200)rownames(dat)=paste0('gene_',1:nrow(dat))colnames(dat)=paste0('sample_',1:ncol(dat))os_years=abs(floor(rnorm(ncol(dat),mean = 50,sd=20)))os_status=sample(rep(c('LIVING','DECEASED'),25))library(survival)my.surv <- Surv( os_years,os_status=='DECEASED')## The status indicator, normally 0=alive, 1=dead. Other choices are TRUE/FALSE (TRUE = death) or 1/2 (2=death). ## And most of the time we just care about the time od DECEASED . fit.KM=survfit(my.surv~1)fit.KM
## Call: survfit(formula = my.surv ~ 1)## ## n events median 0.95LCL 0.95UCL ## 50 25 64 56 NA
plot(fit.KM)
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">img</figcaption>
log_rank_p <- apply(dat, 1, function(values1){ group=ifelse(values1>median(values1),'high','low') kmfit2 <- survfit(my.surv~group) #plot(kmfit2) data.survdiff=survdiff(my.surv~group) p.val = 1 - pchisq(data.survdiff$chisq, length(data.survdiff$n) - 1)})names(log_rank_p[log_rank_p<0.05])
## [1] "gene_63" "gene_100" "gene_102" "gene_125" "gene_146" "gene_149"## [7] "gene_150" "gene_193"
gender= ifelse(rnorm(ncol(dat))>1,'male','female')age=abs(floor(rnorm(ncol(dat),mean = 50,sd=20)))## gender and age are similar with group(by gene expression)cox_results <- apply(dat , 1, function(values1){ group=ifelse(values1>median(values1),'high','low') survival_dat <- data.frame(group=group,gender=gender,age=age,stringsAsFactors = F) m=coxph(my.surv ~ age + gender + group, data = survival_dat) beta <- coef(m) se <- sqrt(diag(vcov(m))) HR <- exp(beta) HRse <- HR * se #summary(m) tmp <- round(cbind(coef = beta, se = se, z = beta/se, p = 1 - pchisq((beta/se)^2, 1), HR = HR, HRse = HRse, HRz = (HR - 1) / HRse, HRp = 1 - pchisq(((HR - 1)/HRse)^2, 1), HRCILL = exp(beta - qnorm(.975, 0, 1) * se), HRCIUL = exp(beta + qnorm(.975, 0, 1) * se)), 3) return(tmp['grouplow',])})DT::datatable(cox_results , extensions = 'FixedColumns', options = list( #dom = 't', scrollX = TRUE, fixedColumns = TRUE ))
<figcaption data-darkmode-color-16081790047527="rgb(153, 153, 153)" data-darkmode-original-color-16081790047527="rgb(153, 153, 153)" style="margin: 10px 0px 0px; padding: 0px; max-width: 100%; box-sizing: border-box !important; overflow-wrap: break-word !important; line-height: inherit; text-align: center; color: rgb(153, 153, 153); font-size: 0.7em; user-select: auto;">
</figcaption>
## 有统计学显著性的如下:cox_results[,cox_results[4,]<0.05]## gene_26 gene_63 gene_94 gene_100 gene_102 gene_103 gene_146## coef 0.853 1.109 -0.903 0.935 -0.995 0.890 -1.372## se 0.421 0.437 0.431 0.426 0.464 0.429 0.478## z 2.025 2.534 -2.096 2.196 -2.146 2.073 -2.869## p 0.043 0.011 0.036 0.028 0.032 0.038 0.004## HR 2.346 3.030 0.405 2.548 0.370 2.435 0.254## HRse 0.988 1.326 0.175 1.086 0.171 1.046 0.121## HRz 1.363 1.531 -3.406 1.426 -3.676 1.373 -6.154## HRp 0.173 0.126 0.001 0.154 0.000 0.170 0.000## HRCILL 1.028 1.285 0.174 1.106 0.149 1.050 0.099## HRCIUL 5.354 7.142 0.943 5.874 0.918 5.649 0.647## gene_149 gene_150 gene_155 gene_193## coef 1.011 1.004 -0.924 -1.072## se 0.442 0.479 0.440 0.440## z 2.290 2.096 -2.101 -2.437## p 0.022 0.036 0.036 0.015## HR 2.749 2.728 0.397 0.342## HRse 1.214 1.306 0.175 0.151## HRz 1.441 1.323 -3.456 -4.368## HRp 0.150 0.186 0.001 0.000## HRCILL 1.157 1.068 0.168 0.145## HRCIUL 6.531 6.973 0.940 0.811