构建SNPs系统发育树

2019-04-03 本文已影响45人小光amateur

如果根据SNP或者Indel构建其系统进化树，可以展示群体中不同个体的相互关系，基因变异相似的往往会在同一个树的cluster中，一颗好的树可以给你一个群体大概的分类(你这个群体中有多少个cluster，一般同一个亚种或者有亲缘关系的个体会形成一个cluster），这是群体遗传中重要的一部分。其构建的核心原理就是把每个位点SNPs的信息提取，然后计算每个变异位点的差异得到算法中的“距离”。

在实战中，我们群体中样本的数量往往会是成百的，所以一般call出来的SNP变异的位点，或者说文件大小会很大，如果我们直接将没有过滤掉的文件拿来直接构建系统发育树，这不但会产生很大的误差（低质量的位点会影响距离的计算），而且一般的构树软件也不难接受如此大的文件，一来很消耗内存，二来运算量很大，你可能需要几个星期或几个月去完成你的建树。
这个时候你需要首先对你raw SNP calling的结果进行初步的过滤。

首先采用经典的过滤方式，保留可信的变异位点

plink --vcf All_Gm_combine.vcf --maf 0.05 --geno 0.1 --recode  vcf-iid --out All_test_11 --allow-extra-chr

然后，进行中性LD筛选

plink --vcf All_test_11.vcf  --allow-extra-chr--indep-pairwise 50 10 0.2 --out test_12

plink --allow-extra-chr --extract test_12.prune.in --make-bed --out test_12.prune.in --recode vcf-iid --vcf All_edit_Gm_tab.vcf

将vcf 格式转成phylip格式

vcf2phylip.py

python vcf2phylip.py -i All_edit_Gm_tab.vcf

该程序生成的结果并非是phylip程序需要的最佳结果。

因为phylip需要的phy格式默认名字最长10个字符，如果不够10个字符，需要用空格补齐，如果多于10个字符，就会自动将名字截断甚至于报错；

另外，序列一般需要按照每10个字符用空格分开，以便观察。因此，该脚本生成的文件还需要进一步处理，才能进行接下来的分析。

#cat formph.py
import re,argparse

parser = argparse.ArgumentParser(description='This script was used to download genbank file')
parser.add_argument('-i','--input_name',help='Please input  out_put directory path;default cwd',required=True)
args = parser.parse_args()
with open(args.input_name) as f:
    line_lst=f.readlines()
    print(line_lst[0].strip("\n"))
    for line in line_lst[1:]:
        name,seq=[k for k in line.strip("\n").split(" ") if k!=""]
        name="_".join(name.split("_")[1:])
        seq=seq.strip("\n")
        if len(name)<10:
            seq_lst=re.findall('.{10}',seq)
            seq_lst.append(seq[-(len(seq)%10):])
            print("{}{}".format(name," "*(10-len(name)))," ".join(seq_lst))
        else:
            print("your name is too long!")

通过上面的脚本，将phy文件进行格式化处理，然后进行下一步。

使用phylip软件包

phylip中有许多程序，大部分的程序运行方法相同，把infile作为默认的输入文件，输出结果写在outfile中。因此，在进行下一步分析前，需要重命名想要保存的文件。

seqboot: 生成随机样本，用bootstrap和jack-knife方法。需要设置选项M

dnadist：DNA距离矩阵计算器。

neighbor：NJ法的使用

consense：用多重树构建一致树。

每个程序都需要设定参数，因此还需要新建par文件。

#cat seqboot.par
all.merge.snp.phy #设定输入文件的名称，否则输入默认的名为infile的文件
y #yes确认以上设定的参数
9 #设定随机参数，输入奇数值。

#cat dnadist.par
seqboot.out #本程序的输入文件
2 #将软件运行情况显示出来
y #确认以上设定的参数

#cat neighbor.par
dnadist.out #本程序的输入文件
9 #设定随机数，输入奇数值
y #确认以上设定的参数

# cat consense.par
nei.tree  #本程序的输入文件
y #确认以上设定的参数

再运行以下命令行即可

seqboot<seqboot.par &&mv outfile seqboot.out &&dnadist<dnadist.par &&  mv outfile dnadist.out && neighbor<neighbor.par && mv  outfile nei.out && mv outtree nei.tree  && consense<consense.par && mv outfile cons.out && mv outtree constree

一键化流程

# cat task.sh
#! /usr/bin/bash

name="-------"#这里修改为自己的样本名
path="-------"#修改为软件的路径
python2 $path/vcf2phylip/vcf2phylip.py -i $name.vcf
python3 formphy.py -i $name.min4.phy > $name.phy
$path/phylip-3.697/exe/seqboot<seqboot.par && mv outfile seqboot.out && $path/phylip-3.697/exe/dnadist<dnadist.par &&  mv outfile dnadist.out && $path/phylip-3.697/exe/neighbor<neighbor.par && mv outfile nei.out && mv outtree nei.tree  && $path/phylip-3.697/exe/consense<consense.par && mv outfile cons.out && mv outtree constree