Homologous recombination deficiency (HRD) in TCGA
0. Background
I have recently been reading a new paper O(∩_∩)O:
Genomic, epigenomic, and transcriptomic signatures for telomerase complex components: a pan‐cancer analysis
It states:
Genomic instability included aneuploidy, somatic total mutation burden (TMB), somatic copy number alterations (SCNA), loss of heterozygosity (LOH), and homologous recombination deficiency (HRD).
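Homologous recombination repair (HRR) is the preferred repair pathway for DNA double-strand breaks (DSBs).
Homologous recombination deficiency (HRD) generally refers to impaired HRR function at the cellular level. It can be caused by germline or somatic mutations in HRR-related genes, by epigenetic inactivation, and by many other factors. It occurs in many malignancies and is especially prominent in ovarian cancer, breast cancer, pancreatic ductal carcinoma, and prostate cancer.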
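HRD is commonly quantified with three genomic scar measures:
Loss of heterozygosity (LOH): LOH regions longer than 15 Mb but shorter than the whole chromosome;
Telomeric allelic imbalance (TAI): chromosomal segments with allelic imbalance that extend to one of the subtelomeres, do not cross the centromere, and are longer than 11 Mb;
Large-scale state transitions (LST): chromosomal breakpoints between two adjacent regions (each at least 10 Mb long and less than 3 Mb apart). The total number of such breakpoints in a tumor genome can be used to describe genomic instability.
Genetic Tumor Marker Collaboration Group of the Tumor Marker Committee of the China Anti-Cancer Association; Molecular Pathology Group of the Chinese Society of Pathology, Chinese Medical Association. Expert consensus on clinical testing and application of homologous recombination deficiency (2021 edition) [J]. Chinese Journal of Oncology Prevention and Treatment, 2021, 13(4): 329-338.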
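To make the size rule in the LOH definition concrete, here is a minimal sketch that counts qualifying LOH segments in a toy segment table. The table, the rounded chromosome lengths, and the helper name `count_hrd_loh` are all hypothetical and made up for illustration; real HRD callers work on allele-specific copy number profiles and handle more edge cases than this.

```r
# Toy segment table; every value here is invented for illustration.
segs <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr3"),
  start = c(1e6, 50e6, 0, 10e6),
  end   = c(20e6, 60e6, 242e6, 26e6),
  loh   = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical, rounded chromosome lengths in base pairs.
chrom_len <- c(chr1 = 248e6, chr2 = 242e6, chr3 = 198e6)

# Count LOH segments longer than min_len that do not span a whole chromosome.
count_hrd_loh <- function(segs, chrom_len, min_len = 15e6) {
  seg_len <- segs$end - segs$start
  whole_chrom <- seg_len >= unname(chrom_len[as.character(segs$chrom)])
  sum(segs$loh & seg_len > min_len & !whole_chrom)
}

count_hrd_loh(segs, chrom_len)
```

In this toy table only the first chr1 segment qualifies: the second is too short, the chr2 segment covers the whole chromosome, and the chr3 segment is not LOH.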
After some searching, I found precomputed HRD results at:
https://gdc.cancer.gov/about-data/publications/PanCan-DDR-2018
DNA damage repair (DDR)
The page provides a set of TSV files and one large Excel file. Together they cover the DNA damage repair data and are very comprehensive.
dir("TCGA_DDR_Data_Resources/")
## [1] "DDRscores.tsv" "GeneAlterations.tsv" "GeneDeletions.tsv"
## [4] "GeneMutations.tsv" "Genes.tsv" "GeneSilencing.tsv"
## [7] "PathwayAlterations.tsv" "PathwayDeletions.tsv" "PathwayMembership.tsv"
## [10] "PathwayMutations.tsv" "Pathways.tsv" "PathwaySilencing.tsv"
## [13] "Samples.tsv" "Scores.tsv"
These files are described in a PDF that can be downloaded from the page above.
DDRscores.tsv
Data table listing the scores of the 43 DDR footprints and the RPPA-based DDR score across all samples (n=9,125).
The order of the genes, pathways, samples and footprint scores in these TSV files are as given in Genes.tsv, Pathways.tsv, Samples.tsv and Scores.tsv
1. Build the DDRscores table
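# The score matrix has no header; its row and column names are stored in separate files
sc = read.delim("TCGA_DDR_Data_Resources/DDRscores.tsv",header = FALSE)
s = read.delim("TCGA_DDR_Data_Resources/Samples.tsv",header = FALSE) # sample IDs and cancer types
cl = read.delim("TCGA_DDR_Data_Resources/Scores.tsv",header = FALSE) # names of the score columns
rownames(sc) = s$V1
colnames(sc) = cl$V1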
colnames(sc)
## [1] "mutLoad_silent" "mutLoad_nonsilent" "mutSig1"
## [4] "mutSig2" "mutSig3" "mutSig4"
## [7] "mutSig5" "mutSig6" "mutSig7"
## [10] "mutSig8" "mutSig9" "mutSig10"
## [13] "mutSig11" "mutSig12" "mutSig13"
## [16] "mutSig14" "mutSig15" "mutSig16"
## [19] "mutSig17" "mutSig18" "mutSig19"
## [22] "mutSig20" "mutSig21" "CNA_n_segs"
## [25] "CNA_frac_altered " "CNA_n_focal_amp_del" "aneuploidy_score"
## [28] "aneuploidy_score_prime" "LOH_n_seg" "LOH_frac_altered "
## [31] "purity" "ploidy" "genome_doublings"
## [34] "subclonal_frac" "HRD_TAI" "HRD_LST"
## [37] "HRD_LOH" "HRD_Score" "eCARD"
## [40] "PARPi7" "PARPi7_bin" "RPS"
## [43] "tp53_score" "rppa_ddr_score"
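Since the row and column names come from separate files, it can help to check dimensions and uniqueness explicitly before attaching them. The sketch below uses a hypothetical helper, `label_matrix`, which is not part of the original workflow. It also trims stray whitespace, because a name such as "CNA_frac_altered " in the output above carries a trailing space. Trimming is optional and changes the names slightly.

```r
# Hypothetical helper: attach row and column names after basic sanity checks.
label_matrix <- function(mat, row_ids, col_ids) {
  # Strip leading and trailing whitespace from the IDs.
  row_ids <- trimws(as.character(row_ids))
  col_ids <- trimws(as.character(col_ids))
  stopifnot(
    nrow(mat) == length(row_ids),
    ncol(mat) == length(col_ids),
    !anyDuplicated(row_ids),
    !anyDuplicated(col_ids)
  )
  rownames(mat) <- row_ids
  colnames(mat) <- col_ids
  mat
}

# Toy example with invented values.
toy <- data.frame(V1 = c(3, 4), V2 = c(0.1, 0.2))
toy <- label_matrix(toy, c("S1", "S2"), c("HRD_TAI", "frac_altered "))
colnames(toy)
```

With the real files, the call would presumably look like `label_matrix(sc, s$V1, cl$V1)`.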
# HRD scores (columns 35 to 38: HRD_TAI, HRD_LST, HRD_LOH, HRD_Score)
scores = sc[,35:38]
head(scores)
## HRD_TAI HRD_LST HRD_LOH HRD_Score
## TCGA-OR-A5J1-01 3 2 2 7
## TCGA-OR-A5J2-01 4 2 3 9
## TCGA-OR-A5J3-01 0 0 0 0
## TCGA-OR-A5J5-01 2 2 4 8
## TCGA-OR-A5J6-01 3 1 1 5
## TCGA-OR-A5J7-01 10 8 3 21
nrow(scores)
## [1] 9125
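The dataset annotation describes the HRD score as the sum of the three component scores (TAI + LST + HRD LOH). A quick sanity check of that relationship could look like the sketch below. It selects columns by name instead of by position, and the helper name `check_hrd_sum` is hypothetical. The toy rows reuse the first two samples shown above plus one invented all-NA row.

```r
# Toy table: two rows copied from head(scores), one invented row with missing values.
toy_scores <- data.frame(
  HRD_TAI   = c(3, 4, NA),
  HRD_LST   = c(2, 2, NA),
  HRD_LOH   = c(2, 3, NA),
  HRD_Score = c(7, 9, NA),
  row.names = c("TCGA-OR-A5J1-01", "TCGA-OR-A5J2-01", "toy-missing")
)

# Returns TRUE if HRD_Score equals TAI + LST + LOH for every complete row.
check_hrd_sum <- function(df) {
  parts <- rowSums(df[, c("HRD_TAI", "HRD_LST", "HRD_LOH")])
  ok <- !is.na(parts) & !is.na(df$HRD_Score)
  all(parts[ok] == df$HRD_Score[ok])
}

check_hrd_sum(toy_scores)
```

On the real table the same check would be `check_hrd_sum(scores)`.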
2. Add the cancer type and plot
head(s)
## V1 V2
## 1 TCGA-OR-A5J1-01 ACC
## 2 TCGA-OR-A5J2-01 ACC
## 3 TCGA-OR-A5J3-01 ACC
## 4 TCGA-OR-A5J5-01 ACC
## 5 TCGA-OR-A5J6-01 ACC
## 6 TCGA-OR-A5J7-01 ACC
identical(s$V1,rownames(scores))
## [1] TRUE
scores = cbind(s,scores)
colnames(scores)[1:2] = c("Id","Project")
library(tidyverse)
# Drop samples without an HRD score, then order the projects by median score
dat = drop_na(scores,HRD_Score)
su = group_by(dat,Project) %>%
  summarise(a = median(HRD_Score)) %>%
  arrange(desc(a))
dat$Project = factor(dat$Project,levels = su$Project)
library(ggplot2)
library(RColorBrewer)
# Interpolate the 8 Set1 colors so that every project gets its own color
mypalette <- colorRampPalette(brewer.pal(8,"Set1"))
ggplot(dat,aes(x = Project,y = HRD_Score,fill = Project))+
  geom_boxplot()+
  theme_bw()+
  theme(axis.text.x = element_text(vjust = 1,hjust = 1,angle = 45),legend.position = "bottom")+
  scale_fill_manual(values = mypalette(nlevels(dat$Project)))+
  # nrow and byrow arrange the legend in 3 rows, filled row by row
  guides(fill = guide_legend(nrow = 3,byrow = TRUE))
This dataset is already pan-cancer, so no further merging is needed. The plot shows that ovarian cancer has the highest HRD scores.
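To read the ranking as numbers rather than from the boxplot, a per-project summary table can be built. The sketch below uses base R on a toy table with invented scores. The helper name `summarise_hrd` is hypothetical, and the input is assumed to have `Project` and `HRD_Score` columns like `dat` above.

```r
# Toy data with invented scores, only to show the shape of the summary.
toy <- data.frame(
  Project   = c("OV", "OV", "OV", "ACC", "ACC"),
  HRD_Score = c(40, 50, 60, 7, 9),
  stringsAsFactors = FALSE
)

# Sample count and median HRD score per project, highest median first.
summarise_hrd <- function(df) {
  n   <- tapply(df$HRD_Score, df$Project, length)
  med <- tapply(df$HRD_Score, df$Project, median)
  out <- data.frame(
    Project = names(med),
    n       = as.vector(n),
    median  = as.vector(med),
    stringsAsFactors = FALSE
  )
  out <- out[order(-out$median), ]
  rownames(out) <- NULL
  out
}

summ <- summarise_hrd(toy)
summ
```

Calling `summarise_hrd(dat)` on the real data should reproduce the ordering used for the x axis of the plot.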
3. Annotation of the table columns
xs = rio::import_list("TCGA_DDR_Data_Resources.xlsx")
names(xs)
## [1] "DDR genes and pathways" "DDR gene alterations ONCOPRINT"
## [3] "DDR gene mutations" "DDR deep deletions"
## [5] "DDR epigenetic silencing" "DDR gene alterations"
## [7] "DDR footprint summary" "DDR footprints"
## [9] "DDR pathway alterations ONCOPR " "DDR pathway mutations "
## [11] "DDR pathway deletions" "DDR pathway silencing"
## [13] "DDR pathway alterations" "ME CO analysis core pathways"
## [15] "ME CO analysis inclusive pathw " "DDR Survival Univariate"
## [17] "DDR Survival Multivariate" "DDR gene fusions"
## [19] "TP53 predictor"
xs[[7]][14:17,1]
## [1] "TAI" "LST" "HRD LOH" "HRD Score"
xs[[7]][14:17,2]
## [1] "number of subchromosomal regions with allelic imbalance extending to the telomere"
## [2] "number of chromosomal breaks between adjacent regions of at least 10Mb"
## [3] "the number of LOH regions of intermediate size (> 15MB but < whole chromosome in length)"
## [4] "Homologous recombination deficiency score calculated from three scores (TAI + LST + HRD LOH)"