一篇数据挖掘文章的图表复现-1

2022-04-07 本文已影响0人小洁忘了怎么分身

0.文章

文章标题：Characterization of an endoplasmic reticulum stress‐related signature to evaluate immune features and predict prognosis in glioma
期刊：J Cell Mol Med
影响因子:5.3
虽然是个小小的5分文章，但涉及到的分析非常丰富，图表也很多样，我把他的数据拿来做例子进行分析，开启一段更新咯。
流程图

构建模型部分的主要结果：

图A是四个数据集中，用两种算法（log-rank test和单因素cox）p<0.05 选出来的基因。B和C是lasso回归的两张经典图表。

图D是结合逐步回归法，选出的16个基因构成的多因素cox模型。可以看到C-index值是0.86。
这篇文章的后续用了两个数据做测试集，效果都很好，那是因为一开始的时候筛选构建lasso回归模型所使用的基因就是从四个数据集、两种算法取出来的交集，这些基因都是与生存关系非常密切的，所以用他们构建出来的模型，在这四个数据集里表现都不会差的，至于这算不算投机取巧嘛，无法定论，自己衡量就好。

这个系列将会有好几篇，今天先更新第一篇，整理数据。

1.数据整理的说明

1.1 下载的数据

四个数据的表达矩阵、生存信息表格。

TCGA的LGG和GBM是分开的。后续用代码合并到一起。

1.2 表达矩阵的要求

(1)就要是取过log、没有异常值的矩阵，标准化过的也可以，因为不做差异分析。

(2)如果是转录组数据，要log之后的标准化数据。（因为这里只做生存分析，不做差异分析，所以没有必要找count数据）

(3)行名都转换成gene symbol

(4)只要tumor样本，不要normal

1.3 临床信息的要求

(1)要有每个样本对应的生存时间和结局事件，并用代码调整顺序，与表达矩阵的样本顺序相同，严格一一对应。

(2)为了后续代码统一，把生存时间和结局事件列名统一起来，都叫“time”和“event”。

(3)至于时间的单位，反正是分别计算的，统不统一无所谓。

1.4 关于数据来源的说明

数据来源于哪个数据库，不是主要矛盾。重点是分清楚是芯片数据还是转录组数据，是芯片的话要能找到探针注释。如果是转录组数据，则需要分清楚是否标准化，进行了哪种标准化。反正是分别分析，又不需要合并，所以直接使用即可。

2.TCGA的RNA-seq数据

Table S1显示样本数量691，TCGA的GBM数据173样本，LGG有529样本。

2.1 TCGA表达矩阵

exp_gbm = read.table("TCGA-GBM.htseq_fpkm.tsv.gz",
                     header = T,row.names = 1,check.names = F)
exp_lgg = read.table("TCGA-LGG.htseq_fpkm.tsv.gz",
                     header = T,row.names = 1,check.names = F)
exp1 = as.matrix(cbind(exp_gbm,exp_lgg))
#行名转换为gene symbol
library(stringr)
library(AnnoProbe)
rownames(exp1) = str_split(rownames(exp1),"\\.",simplify = T)[,1];head(rownames(exp1))

## [1] "ENSG00000242268" "ENSG00000270112" "ENSG00000167578" "ENSG00000273842"
## [5] "ENSG00000078237" "ENSG00000146083"

re = annoGene(rownames(exp1),ID_type = "ENSEMBL");head(re)

##        SYMBOL                           biotypes         ENSEMBL  chr start
## 1     DDX11L1 transcribed_unprocessed_pseudogene ENSG00000223972 chr1 11869
## 2      WASH7P             unprocessed_pseudogene ENSG00000227232 chr1 14404
## 3   MIR6859-1                              miRNA ENSG00000278267 chr1 17369
## 4 MIR1302-2HG                             lncRNA ENSG00000243485 chr1 29554
## 6     FAM138A                             lncRNA ENSG00000237613 chr1 34554
## 7      OR4G4P             unprocessed_pseudogene ENSG00000268020 chr1 52473
##     end
## 1 14409
## 2 29570
## 3 17436
## 4 31109
## 6 36081
## 7 53312

library(tinyarray)
exp1 = trans_array(exp1,ids = re,from = "ENSEMBL",to = "SYMBOL")
exp1[1:4,1:4]

##             TCGA-06-0878-01A TCGA-26-5135-01A TCGA-06-5859-01A TCGA-06-2563-01A
## DDX11L1            0.0000000        0.0221755       0.00000000       0.01797799
## WASH7P             0.6673943        1.7429940       0.65334641       1.57566444
## MIR6859-1          1.2006708        2.1880585       1.51229174       2.17610561
## MIR1302-2HG        0.0000000        0.0000000       0.02520857       0.00000000

dim(exp1)
## [1] 56514   702
#删除正常样本
Group = make_tcga_group(exp1)
table(Group)
## Group
## normal  tumor 
##      5    697
exp1 = exp1[,Group == "tumor"]
#过滤表达量0值太多的基因
exp1 = exp1[apply(exp1, 1, function(x){sum(x>0)>200}),]

2.2 TCGA生存信息

surv_gbm = readr::read_tsv("TCGA-GBM.survival.tsv")
surv_gbm$TYPE = "GBM"
surv_lgg = readr::read_tsv("TCGA-LGG.survival.tsv")
surv_lgg$TYPE = "LGG"
surv1 = rbind(surv_gbm,surv_lgg)
head(surv1)
## # A tibble: 6 x 5
##   sample              OS `_PATIENT`   OS.time TYPE 
##   <chr>            <dbl> <chr>          <dbl> <chr>
## 1 TCGA-12-0657-01A     1 TCGA-12-0657       3 GBM  
## 2 TCGA-32-1977-01A     0 TCGA-32-1977       3 GBM  
## 3 TCGA-19-1791-01A     0 TCGA-19-1791       4 GBM  
## 4 TCGA-28-1757-01A     0 TCGA-28-1757       4 GBM  
## 5 TCGA-19-2624-01A     1 TCGA-19-2624       5 GBM  
## 6 TCGA-41-4097-01A     1 TCGA-41-4097       6 GBM
table(colnames(exp1) %in% surv1$sample)
## 
## FALSE  TRUE 
##     6   691
s = intersect(colnames(exp1),surv1$sample)
exp1 = exp1[,s]
surv1 = surv1[match(s,surv1$sample),]
colnames(surv1)[c(2,4)] = c("event","time")
exp1[1:4,1:4]
##            TCGA-06-0878-01A TCGA-26-5135-01A TCGA-06-5859-01A TCGA-06-2563-01A
## WASH7P           0.66739425       1.74299400       0.65334641       1.57566444
## MIR6859-1        1.20067078       2.18805848       1.51229174       2.17610561
## AL627309.1       0.00000000       0.02066275       0.01386984       0.00000000
## CICP27           0.09691981       0.01013527       0.00000000       0.02449211
head(surv1)
## # A tibble: 6 x 5
##   sample           event `_PATIENT`    time TYPE 
##   <chr>            <dbl> <chr>        <dbl> <chr>
## 1 TCGA-06-0878-01A     0 TCGA-06-0878   218 GBM  
## 2 TCGA-26-5135-01A     1 TCGA-26-5135   270 GBM  
## 3 TCGA-06-5859-01A     0 TCGA-06-5859   139 GBM  
## 4 TCGA-06-2563-01A     0 TCGA-06-2563   932 GBM  
## 5 TCGA-41-2571-01A     1 TCGA-41-2571    26 GBM  
## 6 TCGA-28-5207-01A     1 TCGA-28-5207   343 GBM
identical(colnames(exp1),surv1$sample)
## [1] TRUE

共691个有生存信息的tumor样本。

3.CGGA芯片数据整理

exp2 = read.table("CGGA/CGGA.mRNA_array_301_gene_level.20200506.txt",header = T,row.names = 1)
surv2 = read.table("CGGA/CGGA.mRNA_array_301_clinical.20200506.txt",sep = "\t",header = T,check.names = F)
head(surv2)
##    CGGA_ID TCGA_subtypes PRS_type Histology  Grade Gender Age   OS
## 1  CGGA_11     Classical  Primary       GBM WHO IV Female  57  155
## 2 CGGA_124   Mesenchymal  Primary       GBM WHO IV   Male  53  414
## 3 CGGA_126   Mesenchymal  Primary       GBM WHO IV Female  50 1177
## 4 CGGA_168   Mesenchymal  Primary       GBM WHO IV   Male  17 3086
## 5 CGGA_172   Mesenchymal  Primary       GBM WHO IV Female  57  462
## 6 CGGA_195   Mesenchymal  Primary       GBM WHO IV   Male  48  486
##   Censor (alive=0; dead=1) Radio_status (treated=1;un-treated=0)
## 1                        1                                     1
## 2                        1                                     1
## 3                        1                                     1
## 4                        0                                     1
## 5                        1                                     1
## 6                        1                                     1
##   Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
## 1                                         0            Wildtype
## 2                                         0            Wildtype
## 3                                         1            Wildtype
## 4                                         1            Wildtype
## 5                                         0            Wildtype
## 6                                         1            Wildtype
##   1p19q_Codeletion_status MGMTp_methylation_status
## 1                    <NA>            un-methylated
## 2                    <NA>            un-methylated
## 3                    <NA>            un-methylated
## 4                    <NA>            un-methylated
## 5                    <NA>            un-methylated
## 6                    <NA>            un-methylated
s = intersect(surv2$CGGA_ID,colnames(exp2))
exp2 = as.matrix(exp2[,s])
surv2 = surv2[match(s,surv2$CGGA_ID),]
colnames(surv2)[c(9,8)] = c("event","time")
exp2[1:4,1:4]
##          CGGA_11  CGGA_124  CGGA_126   CGGA_168
## A1BG  -0.3603635 0.5649519 0.3047719  0.1749144
## A1CF   2.3413600 1.1707935 2.2081513  0.9652286
## A2BP1  0.0345194 1.0964074 0.3024883  0.0949111
## A2LD1 -0.9130564 0.5108800 0.4253669 -0.1949377
head(surv2)
##    CGGA_ID TCGA_subtypes PRS_type Histology  Grade Gender Age time event
## 1  CGGA_11     Classical  Primary       GBM WHO IV Female  57  155     1
## 2 CGGA_124   Mesenchymal  Primary       GBM WHO IV   Male  53  414     1
## 3 CGGA_126   Mesenchymal  Primary       GBM WHO IV Female  50 1177     1
## 4 CGGA_168   Mesenchymal  Primary       GBM WHO IV   Male  17 3086     0
## 5 CGGA_172   Mesenchymal  Primary       GBM WHO IV Female  57  462     1
## 6 CGGA_195   Mesenchymal  Primary       GBM WHO IV   Male  48  486     1
##   Radio_status (treated=1;un-treated=0)
## 1                                     1
## 2                                     1
## 3                                     1
## 4                                     1
## 5                                     1
## 6                                     1
##   Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
## 1                                         0            Wildtype
## 2                                         0            Wildtype
## 3                                         1            Wildtype
## 4                                         1            Wildtype
## 5                                         0            Wildtype
## 6                                         1            Wildtype
##   1p19q_Codeletion_status MGMTp_methylation_status
## 1                    <NA>            un-methylated
## 2                    <NA>            un-methylated
## 3                    <NA>            un-methylated
## 4                    <NA>            un-methylated
## 5                    <NA>            un-methylated
## 6                    <NA>            un-methylated
identical(surv2$CGGA_ID,colnames(exp2))
## [1] TRUE

4.CGGA转录组数据整理

exp3 = read.table("CGGA/CGGA.mRNAseq_325.RSEM-genes.20200506.txt",header = T,row.names = 1)
exp3 = as.matrix(log2(exp3+1))
surv3 = read.table("CGGA/CGGA.mRNAseq_325_clinical.20200506.txt",sep = "\t",header = T,check.names = F)
head(surv3)
##     CGGA_ID  PRS_type Histology   Grade Gender Age   OS
## 1 CGGA_1001   Primary       GBM  WHO IV   Male  11 3817
## 2 CGGA_1006   Primary        AA WHO III   Male  42  254
## 3 CGGA_1007   Primary       GBM  WHO IV Female  57  345
## 4 CGGA_1011   Primary       GBM  WHO IV Female  46  109
## 5 CGGA_1015   Primary       GBM  WHO IV   Male  62  164
## 6 CGGA_1019 Recurrent      rGBM  WHO IV   Male  60  212
##   Censor (alive=0; dead=1) Radio_status (treated=1;un-treated=0)
## 1                        0                                     0
## 2                        1                                     1
## 3                        1                                     1
## 4                        1                                     1
## 5                        1                                     1
## 6                        1                                     0
##   Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
## 1                                         1            Wildtype
## 2                                         1            Wildtype
## 3                                         1            Wildtype
## 4                                         0            Wildtype
## 5                                         0            Wildtype
## 6                                         1            Wildtype
##   1p19q_codeletion_status MGMTp_methylation_status
## 1               Non-codel            un-methylated
## 2               Non-codel            un-methylated
## 3               Non-codel            un-methylated
## 4               Non-codel            un-methylated
## 5               Non-codel            un-methylated
## 6               Non-codel               methylated
colnames(surv3)[c(8,7)] = c("event","time")

s = intersect(surv3$CGGA_ID,colnames(exp3))
exp3 = exp3[,s]
surv3 = surv3[match(s,surv3$CGGA_ID),]
exp3[1:4,1:4]
##          CGGA_1001 CGGA_1006 CGGA_1007 CGGA_1011
## A1BG      3.769772 3.0054000  4.958379  2.933573
## A1BG-AS1  1.641546 1.0908534  2.933573  2.411426
## A2M       8.826294 6.7487296  7.698357  9.467952
## A2M-AS1   2.104337 0.1763228  0.704872  1.384050
head(surv3)
##     CGGA_ID  PRS_type Histology   Grade Gender Age time event
## 1 CGGA_1001   Primary       GBM  WHO IV   Male  11 3817     0
## 2 CGGA_1006   Primary        AA WHO III   Male  42  254     1
## 3 CGGA_1007   Primary       GBM  WHO IV Female  57  345     1
## 4 CGGA_1011   Primary       GBM  WHO IV Female  46  109     1
## 5 CGGA_1015   Primary       GBM  WHO IV   Male  62  164     1
## 6 CGGA_1019 Recurrent      rGBM  WHO IV   Male  60  212     1
##   Radio_status (treated=1;un-treated=0)
## 1                                     0
## 2                                     1
## 3                                     1
## 4                                     1
## 5                                     1
## 6                                     0
##   Chemo_status (TMZ treated=1;un-treated=0) IDH_mutation_status
## 1                                         1            Wildtype
## 2                                         1            Wildtype
## 3                                         1            Wildtype
## 4                                         0            Wildtype
## 5                                         0            Wildtype
## 6                                         1            Wildtype
##   1p19q_codeletion_status MGMTp_methylation_status
## 1               Non-codel            un-methylated
## 2               Non-codel            un-methylated
## 3               Non-codel            un-methylated
## 4               Non-codel            un-methylated
## 5               Non-codel            un-methylated
## 6               Non-codel               methylated
identical(surv3$CGGA_ID,colnames(exp3))
## [1] TRUE

我没仔细看这个表达矩阵具体是哪种标准化，反正这个数据只是作为例子，我们就不深究到底是啥了，直接拿来用。注意如果是自己要用来发文章的数据，是要看清楚的，没标准化需要自行标准化，cpm，tmp，fpkm都行。

5.GSE16011的整理

library(tinyarray)
geo = geo_download("GSE16011")
library(stringr)
#只要tumor样本
k = str_detect(geo$pd$title,"glioma");table(k)
## k
## FALSE  TRUE 
##     8   276
geo$exp = geo$exp[,k]
geo$pd = geo$pd[k,]
# 把行名转换为gene symbol
gpl = GEOquery::getGEO(filename = "GPL8542_family.soft.gz",destdir = ".")

ids  = gpl@dataTable@table[,1:2]
library(clusterProfiler)
library(org.Hs.eg.db)
e2s = bitr(ids$ORF,fromType = "ENTREZID",toType = "SYMBOL",OrgDb = "org.Hs.eg.db")
ids = merge(ids,e2s,by.x = "ORF",by.y = "ENTREZID")
ids = ids[,2:3]
colnames(ids) = c("probe_id","symbol")
exp4 = trans_array(geo$exp,ids)
surv4 = geo$pd[,c(1,4,7,9)]
exp4[1:4,1:4]
##      GSM405201 GSM405202 GSM405203 GSM405204
## A1BG  7.219401  6.031261  6.707681  6.330666
## A2M  12.206626 13.260587 12.352693 13.044375
## NAT1  5.182989  7.213272  6.570733  6.343490
## NAT2  5.239842  5.260373  5.697435  4.859682
head(surv4)
##               title age at diagnosis:ch1  histology:ch1 gender:ch1
## GSM405201  glioma 8                44.57 OD (grade III)     Female
## GSM405202  glioma 9                28.69 OD (grade III)       Male
## GSM405203 glioma 11                38.58 OD (grade III)       Male
## GSM405204 glioma 13                33.89 OD (grade III)       Male
## GSM405205 glioma 20                48.03 OD (grade III)       Male
## GSM405206 glioma 21                31.53 OD (grade III)       Male
identical(rownames(surv4),colnames(exp4))
## [1] TRUE

临床信息表格里没有生存信息，所以从文章附件里找

从文章附件里提取临床信息表格

https://aacrjournals.org/cancerres/article/69/23/9065/553005/Intrinsic-Gene-Expression-Profiles-of-Gliomas-Are

包非常难装，需要配置java，但这个需求不常用，直接跳过这个过程。

if(!file.exists("exp_surv4.Rdata")){
  library(tabulizer)
  f <- "00085472can092307-sup-stabs_1-6.pdf"
  re <- extract_tables(f,pages = 1:10) #提取前10页的表格。
  str(re)
  re = do.call(rbind,re)
  re[1:4,1:4]
  colnames(re) = re[1,]
  re = re[-1,]
  re = data.frame(re)
  re[re==""]=NA

  library(readr)
  re$Survival..years. = parse_double(re$Survival..years.,locale = locale(decimal_mark = ","))

  re$Age.at.diagnosis = parse_double(re$Age.at.diagnosis,locale = locale(decimal_mark = ","))

  dim(exp4)
  k = re$Reviewed.histological.diagnosis!="control";table(k)
  re = re[k,]
  re$Database.number = paste("glioma",re$Database.number)
  surv4$ID = rownames(surv4)

  surv4 = merge(surv4,re,by.x = "title",by.y = "Database.number")
  colnames(surv4)[13:14] = c("event","time")
  table(surv4$event)
  surv4$time = as.integer(surv4$time*365) #天为单位
  surv4$event[surv4$event=="Lost to\rfollow up"]=NA
  table(surv4$event)

  surv4$event=ifelse(surv4$event=="Alive",0,1)
  head(surv4)

  s = intersect(surv4$ID,colnames(exp4))
  exp4 = exp4[,s]
  surv4 = surv4[match(s,surv4$ID),]
  save(exp4,surv4,file = "exp_surv4.Rdata")
}
load("exp_surv4.Rdata")
exp4[1:4,1:4]
##      GSM405234 GSM405235 GSM405236 GSM405237
## A1BG  7.613150  6.995666  7.337393  6.287498
## A2M  13.166457 13.003225 13.814942 12.991301
## NAT1  5.581672  7.470894  7.845210  6.088651
## NAT2  5.206464  4.618245  4.881652  5.011908
head(surv4)
##        title age at diagnosis:ch1  histology:ch1 gender:ch1        ID Gender
## 1 glioma 101                57.68 GBM (grade IV)     Female GSM405234 Female
## 2 glioma 104                71.09 GBM (grade IV)       Male GSM405235   Male
## 3 glioma 105                52.20 GBM (grade IV)       Male GSM405236   Male
## 4 glioma 107                51.17 GBM (grade IV)       Male GSM405237   Male
## 5  glioma 11                38.58 OD (grade III)       Male GSM405203   Male
## 6 glioma 111                37.25 GBM (grade IV)       Male GSM405238   Male
##   Reviewed.histological.diagnosis Age.at.diagnosis  KPS     Type.of.surgery
## 1                  GBM (grade IV)            57.68   70   Partial resection
## 2                  GBM (grade IV)            71.09   80   Partial resection
## 3                  GBM (grade IV)            52.20   80  Complete resection
## 4                  GBM (grade IV)            51.17   70 Stereotactic biopsy
## 5                  OD (grade III)            38.58 <NA>  Complete resection
## 6                  GBM (grade IV)            37.25   80 Stereotactic biopsy
##   Radiotherapy Chemotherapy event time
## 1          yes           no     1  226
## 2          yes           no     1   76
## 3          yes           no     1  375
## 4          yes           no     1  149
## 5          yes           no     1 3255
## 6          yes           no     1  343
identical(surv4$ID,colnames(exp4))
## [1] TRUE

检查和保存数据

par(mfrow = c(2,2))
boxplot(exp1[,1:10])
boxplot(exp2[,1:10])
boxplot(exp3[,1:10])
boxplot(exp4[,1:10])

image.png

save(exp1,surv1,file = "exp_surv1.Rdata")
save(exp2,surv2,file = "exp_surv2.Rdata")
save(exp3,surv3,file = "exp_surv3.Rdata")
save(exp4,surv4,file = "exp_surv4.Rdata")