用R语言对vcf文件进行数据挖掘.10 利用vcf信息判断物种染
目录
vcf数据包含了所有的等位对立基因的信息,这样就可以帮助我们判断染色体的倍数。比方说有一个位点的碱基是A/T,测序覆盖率为20, 如果这个物种是二倍体,那么A,T的出现概率就是(50%),会各自出现10次,如果是3倍体,那么A会出现13次,T会出现7次,当然也有可能相反。当把所有的点位集合在一起的时候,我们就可以判断这个物种的倍数体了。
数据导入
用包里的自带数据,有疑问的小盆友可以查阅之前的文章,这里就不做赘述了。
# Load libraries
library(vcfR)
library(pinfsc50)
# Determine file locations
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz",
package = "pinfsc50")
# Read data into memory
vcf <- read.vcfR(vcf_file, verbose = FALSE)
vcf
## ***** Object of Class vcfR *****
## 18 samples
## 1 CHROMs
## 22,031 variants
## Object size: 20.9 Mb
## 7.929 percent missing data
## ***** ***** *****
等位对立基因平衡
高通量数据测序可以保证每一个位点都经过很多次的读取,这样就相当于每一个等位基因都被测序过了差不多相等的次数。假设我们对一个二倍杂合体进行了覆盖率为30的测序,那么每一条染色体都被测了15次。当然真实情况不可能正好是这个数字,毕竟测序的时候会发生一定概率的错误。
假设我们用覆盖率为30给一个三倍杂合体进行测序,某基因位点为A/A/T,那么,A和T出现的期待值将是20和10。当某个基因位点的组合是A/G/C时,那么A,G,C就会各自出现10次。
knitr::kable(vcf@gt[c(1:2,11,30),1:4])
FORMAT BL2009P4_us23 DDR7602 IN2009T1_us22
GT:AD:DP:GQ:PL 1|1:0,7:7:21:283,21,0 1|1:0,6:6:18:243,18,0 1|1:0,8:8:24:324,24,0
GT:AD:DP:GQ:PL 0|0:12,0:12:36:0,36,427 0|0:20,0:20:60:0,60,819 0|0:16,0:16:48:0,48,650
GT:AD:DP:GQ:PL 0|1:17,12:29:99:453,0,667 0|0:32,0:32:96:0,96,1433 0|0:40,0:40:99:0,120,1765
GT:AD:DP:GQ:PL 1|0:7,4:11:99:130,0,260 0|0:16,0:16:48:0,48,677 0|0:26,0:26:78:0,78,1073
FORAMT里的AD表示对立基因的各自出现的次数。所以我们可以提取AD数据。
ad <- extract.gt(vcf, element = 'AD')
一般的SNP Caller都会默认双倍体检验,也就是出现两种对立基因型。所以可以计算每种基因的出现概率。
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
ad1 <- allele1 / (allele1 + allele2)
ad2 <- allele2 / (allele1 + allele2)
然后用直方图可视化一下。
hist(ad2[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#1f78b4", xaxt="n")
hist(ad1[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#a6cee3", add = TRUE)
axis(side=1, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,"1/4","1/3","1/2","1/3","3/4",1))
可以发现,大多数都是纯合,所以需要去掉纯合的部分。
gt <- extract.gt(vcf, element = 'GT')
hets <- is_het(gt)
is.na( ad[ !hets ] ) <- TRUE
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
ad1 <- allele1 / (allele1 + allele2)
ad2 <- allele2 / (allele1 + allele2)
hist(ad2[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#1f78b4", xaxt="n")
hist(ad1[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#a6cee3", add = TRUE)
axis(side=1, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,"1/4","1/3","1/2","1/3","3/4",1))
我们发现峰值出现在了1/2,说明这个物种时二倍体,和预期的一样。
然而这里有一个小小的问题,Fequency几乎从0到1横跨整个横坐标,这个明显不合理,需要进行改善。
根据等位对立测序深度进行改善
我们可以通过等位对立深度(AD)的信息来改善刚才提到的问题。
ad <- extract.gt(vcf, element = 'AD')
#ad[1:3,1:4]
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
# Subset to a vector for manipulation.
tmp <- allele1[,"P17777us22"]
#sum(tmp == 0, na.rm = TRUE)
#tmp <- tmp[tmp > 0]
tmp <- tmp[tmp <= 100]
hist(tmp, breaks=seq(0,100,by=1), col="#808080", main = "P17777us22")
sums <- apply(allele1, MARGIN=2, quantile, probs=c(0.15, 0.95), na.rm=TRUE)
sums[,"P17777us22"]
## 15% 95%
## 19 75
abline(v=sums[,"P17777us22"], col=2, lwd=2)
我们可以看到80%的数据分布在了19和75之间。然后再靠近40和60点的地方出现了两个峰,这分别代表杂合峰和纯合峰。然后整个数据还拖着一个尾巴,最长的地方超过了100,这表示部分区域包含了着非常高的拷贝数(CNVs)。此处的目的是为了可视化倍数体,所以选择100以下15%~95%的数据。
tmp <- allele2[,"P17777us22"]
tmp <- tmp[tmp>0]
tmp <- tmp[tmp<=100]
hist(tmp, breaks=seq(1,100,by=1), col="#808080", main="P17777us22")
sums[,"P17777us22"]
## 15% 95%
## 19 75
abline(v=sums[,"P17777us22"], col=2, lwd=2)
回想一下之前文章里介绍过的用箱图做可视化的内容,我们也可以通过同样的方法来确认过滤数据的效果。
#vcf <- extract.indels(vcf)
#gq <- extract.gt(vcf, element = 'GQ', as.numeric = TRUE)
#vcf@gt[,-1][ gq < 99 ] <- NA
ad <- extract.gt(vcf, element = 'AD')
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
boxplot(allele1, las=3)
#hist(allele1[,"P17777us22"], ylim=c(0,2000),
# breaks = seq(0,max(allele1[,"P17777us22"], na.rm = TRUE),by=5),
# xlim=c(0,100))
# Subset to a vector for manipulation.
#tmp <- allele1[,"P17777us22"]
#tmp <- tmp[tmp <= 100]
#hist(tmp, breaks=seq(0,100,by=1), col="#808080")
sums <- apply(allele1, MARGIN=2, quantile, probs=c(0.15, 0.95), na.rm=TRUE)
# Allele 1
dp2 <- sweep(allele1, MARGIN=2, FUN = "-", sums[1,])
#allele1[dp2 < 0] <- NA
vcf@gt[,-1][ dp2 < 0 & !is.na(vcf@gt[,-1]) ] <- NA
dp2 <- sweep(allele1, MARGIN=2, FUN = "-", sums[2,])
#allele1[dp2 > 0] <- NA
vcf@gt[,-1][dp2 > 0] <- NA
# Allele 2
dp2 <- sweep(allele2, MARGIN=2, FUN = "-", sums[1,])
vcf@gt[,-1][ dp2 < 0 & !is.na(vcf@gt[,-1]) ] <- NA
dp2 <- sweep(allele2, MARGIN=2, FUN = "-", sums[2,])
vcf@gt[,-1][dp2 > 0] <- NA
# Hard filter
#dp[dp < 4] <- NA
#vcf@gt[,-1][allele1 < 8] <- NA
#
#adp <- adp[adp<=100]
#adp <- adp[adp>=sums[,"P17777us22"][1]]
#adp <- adp[adp<=sums[,"P17777us22"][2]]
#hist(adp, breaks=seq(0, max(adp, na.rm = TRUE )+1, by=1), col="#C0C0C0", xlim = c(0,100))
#axis(side=1,at=1:4*10)
#abline(v=sums[,"P17777us22"], col=2, lwd = 2)
#adp <- allele2[,"P17777us22"]
#adp <- adp[adp>0]
#adp <- adp[adp<=100]
#hist(adp, breaks=seq(0, max(adp, na.rm = TRUE), by=1), col="#C0C0C0")
#par(mfrow=c(1,1))
#abline(v=sums[,"P17777us22"], col=2, lwd = 2)
看一下过滤后的结果。
ad <- extract.gt(vcf, element = 'AD')
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
boxplot(allele1, las=3)
果然好看很多。
最后再回到一开始,看倍数体的可视化效果。
gt <- extract.gt(vcf, element = 'GT')
hets <- is_het(gt)
is.na( ad[ !hets ] ) <- TRUE
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
ad1 <- allele1 / (allele1 + allele2)
ad2 <- allele2 / (allele1 + allele2)
hist(ad2[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#1f78b4", xaxt="n", main="P17777us22")
hist(ad1[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#a6cee3", add = TRUE)
axis(side=1, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,"1/4","1/3","1/2","2/3","3/4",1))
结果明显干净易懂好多。
有同学会问,那么不是二倍体的话会出现什么样的结果呢。数据包的样本里正好有一个三倍体。
可以看到两个峰出现在了1/3,2/3处。结果和实际完美匹配。