ASMC计算位点溯祖时间

2022-04-10 本文已影响0人 DumplingLucky

ASMC(ascertained sequentiallyMarkovian coalescent)是2018年Alkes L. Price实验室开发的算法，相关文章发表在2018年的Nature Genetics上。

该方法利用 ASMC，分析SNP芯片和 WGS数据集，目的是快速计算基因组位点的溯祖时间，从而可以检测正选择和背景选择的特征。

Enrichment for recent coalescence events at the LCT locus

1. Installation

首先需要配置一下 C++ 库，基本上conda都可以装。

C++ compiler (C++17 or later)
CMake (3.15 or later)
Boost (1.62 or later)
Eigen (3.3.4 or later)
{fmt}
range-v3
OpenMP
zlib

2. 编译ASMC、FastSMC和二进制转换可执行文件

# Get the source
git clone --recurse-submodules https://github.com/PalamaraLab/ASMC
cd ASMC

# Create a build directory
mkdir build && cd build

# Configure and build
# On first run, CMake will build the required dependencies
cmake -DASMC_NO_PYTHON=TRUE ..
cmake --build . --parallel 4
#Tool to compute decoding quantities.
pip install asmc-preparedecoding

3. 输入输出文件格式

（1）输入文件

Genetic map (.map/map.gz)
可以使用plink来将vcf转换成map格式

map.format

plink --vcf input.vcf --recode --out output --double-id  --autosome-num 42

Demographic history (.demo)
该文件包含了群体过去的有效规模的分段大小。包含两列：

TimeStart   PopulationSize

其中 TimeStart是第一代人口的大小为 PopulationSize。第一行代表第 0 代，可以使用PSMC/MSMC/SMC++获得该文件数据。
Time discretization (.disc)
在 ASMC 的输入中提供的时间间隔列表。每行包含一个数字，表示以连续代测量的时间，并从 0.0 代开始。

0.0
30.0
60.0
90.0
120.0
150.0
180.0
... <lines omitted>
79855.6
96263.0
124311.7

此文件中定义的间隔为：{0.0-30.0, 30.0-60.0, ..., 96263.0-124311.7, 124311.7-Infinity}。

（2）输出文件

Decodingquantities(.decodingQuantities.gz)
这个文件是ASMCprepareDecoding的输出文件，是ASMC的输入文件，它用于执行溯祖时间的有效推断，不需要了解该文件的内容。
Time discretization intervals (.intervalsInfo)
*.intervalsInfo文件是 ASMCprepareDecoding的输出文件，作为ASMC的输入文件。它包含分析中使用的离散时间间隔数相对应的行数。以下是文件格式：

IntervalStart   ExpectedCoalescenceTime IntervalEnd

IntervalStart 和 IntervalEnd 表示每个离散时间间隔的开始/结束，ExpectedCoalescenceTime 是已推断在此时间间隔内溯祖的一对个体的预期溯祖时间，并且取决于人口统计模型。
成对后验合并概率之和.sumOverPairs.gz
ASMC的输出写入 *.{00,01,11}.sumOverPairs.gz文件。每个文件都包含一个大小为 SxD 的矩阵，其中 S 是数据中的位点数，D 是分析中使用的离散时间间隔数。每个矩阵的第 [i,j] 个条目包含 SNP i 和离散合并时间 j 处所有样本的后验溯祖概率之和。输出使用 {00,01,11} 后缀分解每个位点携带不同等位基因的样本的合并事件。具体来说：

*.00.sumOverPairs.gz 中矩阵的第i行包含在位点i处为纯合0的所有样本对的后验概率之和。
*.01.sumOverPairs.gz 中矩阵的第 i 行包含在位点 i 杂合的所有样本对的后验概率之和。
*.11.sumOverPairs.gz 中矩阵的第 i 行包含在位点 i 处为纯合 1 的所有样本对的后验概率之和。

4. 运行

（1）Creating decoding quantities

首先需要使用已知的群体历史（*.demo文件，离散Ne值），设定的离散时间间隔（.disc文件)，以及单群体的等位基因频率文件（ *frq文件，可以用vcftools计算得到），计算解码文件。
在python里面计算：

import pathlib

from asmc.preparedecoding import *
files_dir = (pathlib.Path('..') / 'test' / 'regression').resolve()

demo_file = str(files_dir / 'input_CEU.demo')
disc_file = str(files_dir / 'input_30-100-2000.disc')
freq_file = str(files_dir / 'input_UKBB.frq')

# Calculate with a discretization file
dq = prepare_decoding(
    demography=demo_file,
    discretization=disc_file,
    frequencies=freq_file,
    samples=50, # Use a larger number (300 is suggested) for real analysis
)

# Or calculate with pre-defined quantiles, and a user-defined number of additional quantiles
dq = prepare_decoding(
    demography=demo_file,
    discretization=[[30.0, 15], [100.0, 15], 39],
    frequencies=freq_file,
    samples=50, # Use a larger number (300 is suggested) for real analysis
)

# Or use with built-in demographies
dq = prepare_decoding(
    demography='CEU',
    discretization=[[30.0, 15], [100.0, 15], 39],
    frequencies=freq_file,
    samples=50, # Use a larger number (300 is suggested) for real analysis
)

# Or use with built-in frequency information from UKBB
dq = prepare_decoding(
    demography='CEU',
    discretization=[[30.0, 15], [100.0, 15], 39],
    frequencies='UKBB',
    samples=50, # Use a larger number (300 is suggested) for real analysis
)
#写出
# This will create a file `files_dir/output.decodingQuantities.gz`
dq.save_decoding_quantities(str(files_dir / 'output'))

# This will create a file `files_dir/output.csfs`
dq.save_csfs(str(files_dir / 'output'))

# You may also save other files, which may be of use:
dq.save_intervals(str(files_dir / 'output'))
dq.save_discretization(str(files_dir / 'output'))
save_demography(str(files_dir), 'CEU')

（2）运行ASMC

./build/ASMC_exe \
        --decodingQuantFile ASMC_data/decoding_quantities/30-100-2000_CEU.decodingQuantities.gz \
        --inFileRoot ASMC_data/examples/asmc/exampleFile.n300 \
        --majorMinorPosteriorSums \
        --mode sequence

参数解读：

必须参数：
--inFileRoot arg haps|haps|hap.gz|haps.gz 和 sample|samples 文件的前缀
--decodingQuantFile arg 解码文件

以下选其一：
--posteriorSums 后验分布总和的输出文件。
--majorMinorPosteriorSums 后验分布总和的输出文件，分为主要/次要等位基因。

可选参数：
--outFileRoot arg 后验分布总和的输出文件前缀（默认值：--hapsFileRoot 参数）
--jobs int (=0) 并行完成的任务数
--jobInd int (=0) 任务索引 (1..jobs)
--mode string (=array) 解码模式。可选 {sequence, array} 。
--compress (=false) 压缩为二进制（无 CSFS）
--useAncestral (=false) 假设祖先等位基因编码为 1 （否则将假设 1 = 次要等位）
--skipCSFSdistance float (=0.0) 两个 CSFS 排放之间的遗传距离（以 cM 为单位）

5. 结果可视化

gunzip -c Alineage.1-1.00.sumOverPairs.gz  | \
    awk '{for (i=5; i<=NF; i++) {c[i]+=$i; t+=$i;}} END{for (i=5; i<=NF; i++) print c[i]/t}' \
    > norm.txt
factor=1.5
cat /ASMC/input/gene_regions.txt | \
    tr '-' '\t' | \
    awk -v factor=$factor '{
        l=$3-$2;
        chr=$1;
        from=$2-factor*l;
        to=$3+factor*l;
        file="region"NR".chr"chr"."from"-"to;
        print "python3 plotPosteriorHeatMap.py -i /ASMC/decoding_quantities/Alineage.intervalsInfo -t 1 18 -n norm.txt -o " file " -f "$1".txt.gz" " -p " ($2-2*l) " " ($3+2*l)
    }' | \
    while read l; do 
        $l;
    done

其中，gene_regions.txt文件表示想要展示的基因组区域，第一列是染色体ID。如下格式：

1   1.42207-11.4235
1   33.7288-43.7335
1   37.104-47.1191

plotPosteriorHeatMap.py是ASMC自带的脚本。

6. Density of recent coalescence (DRC) statistic

如何通过计算ASMC来确定最近受选择区域呢？作者没有给出具体的代码，只给了分析思路。
可以通过将 sumOverPairs 矩阵中与所需时间间隔相对应的条目相加，并在一个窗口内对 SNP 进行平均来获得DRC 统计数据。
关于分析 DRC 统计的几点说明：使用 Gamma 分布作为 DRC 统计的零模型是一种近似值，需要进一步的工作来推导一个精确的零模型。DRC 不能大于 1，因此在 Gamma 模型下为非常大的 DRC 值获得的近似 p 值将非常小（假阳性）。
作者发现 Gamma 近似在某些条件下是合理的，例如最近的有效群体很大，这使得 DRC 在中性地区变小，但这种近似在所有人群（例如孤立人群）或时间间隔比较大时都不能很好地工作。
如果您决定使用这种方法来检测候选区域以进行选择，作者建议检查 Gamma 近似值是否合理。由于特别大的 DRC 值可能会导致人为地减小 p 值。或者可以因此仅使用 Gamma 近似来确定近似的显着性阈值，然后输出大于此阈值的原始 DRC 值，而不是输出近似的 p 值。
最后，作者建议检查以这种方式检测到的候选区域是否与基因组的问题区域不重叠，例如重组率非常高/低或 LD 的区域，以及可能影响的大结构变异，例如倒位重组。

参考：

P. Palamara, J. Terhorst, Y. Song, A. Price. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nature Genetics, 2018.
J. Nait Saada, G. Kalantzis, D. Shyr, F. Cooper, M. Robinson, A. Gusev, P. F. Palamara. Identity-by-descent detection across 487,409 British samples reveals fine-scale evolutionary history and trait associations. Nature Communications, 2020.
Spence, J.P. and Song, Y.S. Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations. Science Advances, Vol. 5, No. 10, eaaw9206 (2019)
ASMC github