D 统计量检测遗传渗入

2021-11-16 本文已影响0人 DumplingLucky

ABBA-BABA test

ABBA-BABA 统计（也称为 D 统计）为严格二叉进化历史的偏差提供了一种简单而强大的测试。因此，它们经常用于使用基因组规模的 SNP 数据测试基因渗入。

D 统计的示意图，它需要来自三个群体（1、2 和 3）和一个外群 (O) 的数据。外群用于推断遗传变异的祖先状态 (A) 。遗传变异的衍生状态 (B) 可能在种群 1、2 和 3 之间分离，从而导致图中的三种分布。
在三种分布中，分布 1 (BBAA) 预计最常见，因为种群 1 和 2 是共享最近共同祖先的姐妹种群。在没有基因渗入的情况下，由于存在不完全谱系分选，分布 2 (ABBA) 和分布 3 (BABA) 占比应该相似。种群 1 和 3 之间或种群 1 和 2 之间的基因流动将使分布 2 (ABBA) 与分布 3 (BABA) 的频率比例偏离1，从而提供基因渗入的证据。

D statistic.png
分析案例

Nature. 2012 Jul 5; 487(7405): 94–98.
本节我们使Simon martin开发的脚本来计算D统计量。

1. 软件下载：

https://github.com/simonhmartin/genomics_general

下载完之后解压即可。

2. 准备VCF文件：

#CHROM      POS      ind1      ind2      ind3
scaffold1  1        A/A       A/G       G|A
scaffold1  1        N/N       T/T       T|C

缺失数据用 N 表示，定相和非定相基因型通常用 | 和 / 表示。
VCF_processing 目录中的脚本 parseVCF.py 会将 vcf 转换为这种格式。另外可以基于读取深度、基因型质量或 vcf 的 FORMAT 列中的任何其他标志进行过滤。

3. ABBA-BABA 统计

脚本 ABBABABAwindows.py 可以在整个基因组的窗口中计算 D 统计量和估计量。

python ABBABABAwindows.py -g /zoo/disk1/shm45/vcf/set62/set62.chr21.DP5GQ30.AN100MAC1.diplo.gz -f phased -o output.csv -w 100000 -m 100 -s 100000 -P1 A -P2 B -P3 C -O D -T 10 --minData 0.5 --popsFile pops.txt --writeFailedWindows --polarize &

ABBABABAwindows.py 程序基本参数

-g 基因型geno文件。
-f 基因型文件的类型phased,pairs,haplo,diplo
-P1, P2, P3, P4种群的名称
--popFile 样本-种群的说明文件
-w window大小
-s 步移的大小
-m 窗口最少位点数
-o 输出文件

Notes:

如果 D 为负，fd 给出无意义的值（<0 或 >1）。如果 P2 和 P3 之间没有过量的共享衍生等位基因（由正 D 表示），则无法量化。因此，负 D 窗口的 fd 值应该被丢弃或转换为零。
如果每个窗口使用少量 SNP，即使 D 为正，随机误差也会导致 fd 具有无意义的值。因此，每个窗口至少要有 100 个双等位基因 SNP 。

Output

Column Header	Description
`scaffold`	The scaffold the window is on (all windows are on a single scaffold)
`start`	window start position (inclusive)
`end`	window end position (NOTE, this can exceed the length of the scaffold)
`sites`	Number of genotypes sites in the input file in each window
`sitesUsed`	number of sites used to compute statistics (biallelic SNPs)
`ABBA`	Pseudo count of ABBA sites (including polymorphic sites)
`BABA`	Pseudo count of BABA sites (including polymorphic sites)
`D`	D statistic
`fd`	f_d admixture estimation
`fdM`	Malinsky's modified statistic, f_dM to accomodate admixture between either P1 and P3 or P2 and P3

输出结果包括计算的D值、fd值、fdM值

结果.png

绘图显示D值、fd值

AB_files <- c("hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w25m250.csv.gz",
              "hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ama_txn_num.w25m250.csv.gz")
AB_tables = lapply(AB_files, read.csv)

head(AB_tables[[1]])
head(AB_tables[[2]])
#注意，如果D值为负，则fd值是无意义的。故设置所有fd值为0，当D值为负数时。
for (x in 1:length(AB_tables)){
  AB_tables[[x]]$fd = ifelse(AB_tables[[x]]$D < 0, 0, AB_tables[[x]]$fd)
}
par(mfrow=c(length(AB_tables), 1), mar = c(4,4,1,1))
for (x in 1:length(AB_tables)){
  plot(AB_tables[[x]]$mid, AB_tables[[x]]$fd,
       type = "l", xlim=c(0,17e6),ylim=c(0,1),ylab="Admixture Proportion",xlab=str_c("Position",x))
  rect(1000000,0,1250000,1, col = rgb(0.5,0,0,0.2), border=NA)
}

作者还提供了计算其他群体遗传学统计量的工具。

4. 通过滑动窗口分析群体多样性和分化

脚本 popgenWindows.py 可以根据滑动窗口，计算一些标准的群体基因组统计量：pi、FST 和 Dxy。

python popgenWindows.py -w 50000 -m 5000 -g input.geno.gz -o output.csv.gz -f phased -T 5 -p popA A1,A2,A3,A4 -p popB B1,B2,B3,B4,B6,B6 -p popC -p popD --popsFile pops.txt 
python popgenWindows.py -h 
#将打印完整的参数列表。

Notes

输入是一个 .geno 文件，可以被压缩 (.geno.gz)。输出是 .csv，如果添加 .gz 将被压缩。
基因型可以通过各种方式进行编码，如 -f 标志。

-f phased: 意味着基因型包括了相位信息（不一定需要实际定相）：

scaffold1 1 A/A G/G G|A

-f pairs 表示没有相位指示器：

scaffold1 1 AA GG AG

-f haplo 表示基因型是单倍体碱基：

scaffold1 1 A G G

-f diplo 表示基因型是表示二倍体基因型的单碱基：

scaffold1 1 A G R

--popsFile

C1  popC
C2  popC
C3  popC
C4  popC
D1  popD
D2  popD
D3  popD
D4  popD

5. 距离矩阵

脚本 distMat.py 计算所有个体对之间的距离矩阵，可以针对整个输入文件或在 windows 中计算。对于倍性 > 1，成对距离将是两个个体所有单倍型之间的平均距离。

python distMat.py -g input.geno.gz -f phased --windType cat -o output.dist
python distMat.py -h 
#打印命令参数列表.

6. 滑动窗口画树

phylo/ 目录中的两个脚本将在滑动窗口中生成树：phymlWindows.py 和 raxmlWindows.py。分别使用了 Phyml 和 RAxML。

python phyml_sliding_windows.py -T 10 -g input.phased.geno.gz --prefix output.phyml_bionj.w50 -w 50 --windType sites --model GTR

参考：
https://github.com/simonhmartin/tutorials/blob/master/ABBA_BABA_windows/
基因渐渗分析-introgression