系统进化分析全记录——以HDZIP3家族为例(附自动化phylo
2019年第一篇。忙碌的小半年。整理完课题,考完雅思英语终于又可以有时间去学习感兴趣的东西了。学习/记录/分享,这是写简书博客的主题词,2019年希望能每周至少一篇学习笔记分享啊。flag已立。第一篇就以之前做基因家族分析完整记录的分享作为开始。
一、phylogeny总体分析流程
系统进化分析已经是做生物学、生物信息的一项基础分析方法了。确定同源基因、进化历史、基因家族分析、分子进化等都需要这一技能。总体来说,系统进化分析包括以下几个步骤:多序列的获取(blast/hmmsearch获取同源序列)、多序列比对、保守信息位点的筛选、核酸替换模型的确定以及构建系统进化树。
具体的,对于构建某一基因家族的进化树(以HD-ZIP3家族)来说分析流程如下:
基因家族分析流程
二、软件的准备
- 具体软件包括有:
- 比对软件:blast、hmmer
- 多序列比对:mafft、pal2nal.pl
- 保守信息位点筛选:Gblocks
- 系统进化树(ML树)的构建:iqtree
- 万能fastq/fasta小工具:seqkit
- 软件安装可以利用conda进行自动化安装。安装教程可参考转录组学习一(软件安装)
三、基因家族的序列获取
找序列属于较花时间步骤,若研究的基因所属家族有人发过基因家族的纹章,则可以根据文献下载相关序列,若没有人研究,则需要对选取物种的基因组/CDS/PEP文件做blast/hmmsearch搜索以及后续繁琐的筛选。
- 这里我们选取的HD-ZIP III属于转录因子,则可以根据植物转录因子数据库(http://planttfdb.cbi.pku.edu.cn/)下载目标物种中的HDZIP家族的CDS及PEP序列。
### 根据目标,下载拟南芥物种中的pep序列
wget http://planttfdb.cbi.pku.edu.cn/download_seq.php?sp=Ath\&fam=HD-ZIP
##下载拟南芥CDS序列
wget http://planttfdb.cbi.pku.edu.cn/download/seq/Ath_cds.fas.gz
-
根据文献,HD-ZIP III家族包括以下蛋白结构域。其中,MEKHLA属于HD-ZIP3的识别结构域,根据数据库Pfam数据库下载此结构域的hmm模型文件
HDZIP3蛋白结构域信息
wget http://pfam.xfam.org/family/PF08670/hmm
mv hmm MEKHLA
- 筛选得到拟南芥中的HDZIP3家族序列
##根据MEKHLA进行hmmsearch筛选
hmmsearch --domtblout ath.out --cut_nc MEKHLA download_seq.php\?sp\=Ath\&fam\=HD-ZIP
## 对结果整理,并获得拟南芥中HDZIP3的cds序列
grep -v "#" ath.out |cut -d '|' -f 1 |seqkit grep -r -f - Ath_cds.fas.gz
> hdzip3_ath_cds.fas
- 其它物种同拟南芥的过程,最后再对序列的id名称进行修改。一个基因的多个转录本可以去掉。
四、多序列比对
一般情况多序列比对直接根据mafft比对即可。对于基因家族来说,更准确的比对需先对基因家族的PEP序列比对得到氨基酸比对矩阵,再根据pal2nal转为核苷酸比对矩阵。
## 氨基酸比对
mafft --auto hdzip3_pep.fas > hdzip3_pep_aln.fas
## 根据提供的cds序列进行pal2nal转换为氨基酸矩阵。
pal2nal.pl hdzip3_pep_aln.fas hdzip3_cds.fas -output fasta > hdzip3_cds_aln.fas
这里pal2nal会经常报错,大部分情况都是cds跟pep序列的名字不一致,或者cds序列跟pep序列不能完全对应
五、保守信息位点的筛选
得到核苷酸比对矩阵之后就要对其中不好的位点进行过滤删除。得到建树有用的信息位点。这里我们所用的程序是Gblocks。
## 筛选信息位点,这里根据个人经验-b5参数选择保留含有gap的全部信息位点
Gblocks hdzip3_cds_aln.fas -t=c -b4=5 -b5=a >Gblocks.log
### 结果文件hdzip3_cds_aln.fas-gb即为筛选后的文件,可重新命名。html文件可用浏览器打开查看挑选具体的位点。
六、核酸替换模型的选择及构建最大似然树(ML树)
传统步骤里,核酸替换模型是需要利用modelfinder/modelgenerator软件算出模型,再用phylip/mega/raxml跑树。去年特别火的软件iqtree完全整合了这两个步骤
并如其名——智能化建树。
## -bb 参数为快速bootstrap检验,对于序列较少的参数则-b 参数 常规bootstrap即可
iqtree -s hdzip3_cds_aln.fas-gb -bb 10000 -pre hdzip3 -nt AUTO >iqtree.log
结果hdzip3.treefile即为我们要的树文件,后续可利用ITOL/Figtree/ggtree等一系列的可视化的软件对树文件进行美化。
七、结果
整套流程走下来,并对树文件进行了些简单的可视化。结果如下图,可以看到各个支系都分的很清楚,各物种的位置较符合真实的系统位置,拓扑结构的支持率也普遍都很高。
hdzip3 phylogeny
八、自动化构建系统发育树程序。
其实从多序列比对一直到构建系统进化树,已经是一套较为标准的流程了。之前也对这套流程做了个整合,写了个小程序。Phylogeny系统进化树的一键化构建——Perl_pipeline。之后对这个程序做了个优化,添加了可选参数以及氨基酸比对矩阵转为核酸比对矩阵 ( 流程图中的 fasta2tree .pl)。做一个分享~
#!/usr/bin/env perl
use strict;
use File::Copy;
use Getopt::Long;
my $usage = <<USAGE;
Usage:
perl $0 -in MultiSequences.fasta -out OutPrefix -type d/c/p
For example:
perl $0 -in AP3.fasta -out ap3_tree -type c
This Pipeline depends on the following software that can be run directly in terminal:
1. mafft (v7.310)
2. pal2nal (v14)
3. Gblocks (0.91b)
4. iqtree (1.55)
by Wang Tianpeng
Version 1.2
2018/11/12
USAGE
if (@ARGV==0){die $usage}
my($fasta,$out_prefix,$type);
GetOptions(
"in=s" => \$fasta,
"out=s" => \$out_prefix,
"type=s"=> \$type,
);
my $pwd0=`pwd`;
chomp $pwd0;
# Preliminary Test: Detecting the softwares dependency
## mafft
print STDERR "\nDetecting the dependency softwares\n\n";
my $software=`mafft --help 2>&1`;
if($software=~/MAFFT/){
print STDERR "MAFFT:\tOK\n";
}else{
die "Mafft\tfailed\n";
}
## pal2nal
my $software=`pal2nal.pl 2>&1`;
if($software=~/pal2nal.pl/){
print STDERR "PAL2NAL:\tOK\n";
}else{
die "pal2nal\tfailed\n";
}
## Gblocks
my $sofware1=`Gblocks -a -b >111`;
open IN,"111" or die "$!";
<IN>;my $software_info=<IN>;
if($software_info=~/^\*/){
print STDERR "Gblocks:\tOK\n";
}else{
die "Gblocks\tfailed\n";
}
close IN;system("rm 111");
## iqtree
$software=`iqtree 2>&1`;
if ($software=~/IQ-TREE/){
print STDERR "IQtree:\tOK\n";
}else{
die "iqtree\tfailed\n";
}
# step1 create temporary directory;
print "\n======================================\n";
mkdir "$out_prefix.tmp" unless -e "$out_prefix.tmp";
chdir "$out_prefix.tmp";
my $pwd1 = `pwd`;chomp($pwd1); #print STDERR "PWD:$pwd";
open IN,"$pwd0\/$fasta" or die "$!";
open OUT,">$fasta" or die "$!";
open OUT2, ">$fasta.pep" or die "$!";
my($seq,$seq_name,@seq_name,%hash_seq);
while(<IN>){
s/\r?\n//;
if(/^>/){
$seq_name=$_;
push @seq_name,$seq_name;
}else{
$hash_seq{$seq_name}.=$_;
}
}
foreach (@seq_name){
print OUT "$_\n$hash_seq{$_}\n";
print OUT2 "$_\n",&cds2pep($hash_seq{$_}),"\n";
}
close IN;close OUT;
#open OUT,">","genome.fasta" or die "cant create file genome.fasta,$!";
# step2 MultiSequences Alignment with mafft and pal2nal;
print "\n=====================================================\n";
print "STEP1 MultiSequences Alignment with mafft\n\n";
mkdir "1.Mafft" unless -e "1.Mafft";
unless (-e "1.Mafft.ok"){
chdir "1.Mafft";
my $pwd=`pwd`;print STDERR "PWD:$pwd";
my $command="mafft --auto $pwd1\/$fasta.pep >$out_prefix.aln.pep.fas 2>mafft.log";
print STDERR (localtime).": $command\n";
system($command)==0 or die "can not execute :$command\n";
my $command="pal2nal.pl $out_prefix.aln.pep.fas $pwd1\/$fasta -output fasta > $out_prefix.aln.cds.fas";
system($command)==0 or die "can not execute: $command\n";
my $pwd_mafft=`pwd`;chomp($pwd_mafft);
chdir "..";
open OUT,">1.Mafft.ok" or die "$!";close OUT;
}else{
print STDERR "CMD(skipped): mafft \n";
}
my $pwd_mafft="$pwd1\/1.Mafft";#print "$pwd_mafft\n";
# step3 informative alignment points filter with Gblocks
#my $pwd =`pwd`;print "$pwd\n";
print "\n===================================================\n";
print "STEP2 selection of informative blocks with Gblocks\n\n";
mkdir "2.Gblocks" unless -e "2.Gblocks";
unless (-e "2.Gblocks.ok"){
chdir "2.Gblocks";
my $pwd=`pwd`;print STDERR "PWD: $pwd\n";
my $command="Gblocks $pwd_mafft\/$out_prefix.aln.cds.fas -t=$type -b4=5 -b5=a >Gblocks.log";
print STDERR (localtime).": $command\n";
system($command) or die "can not execute : $command\n";
my $gb_files="$pwd_mafft"."\/$out_prefix.aln.cds.fas-gb\*"; #my $command_move="mv $test .";print "$command_move\n";
system("mv $gb_files \.")==0 or die "cant move file\n";
chdir "..";
open OUT,">2.Gblocks.ok" or die "$!";close OUT;
}else{
print STDERR "CMD(skipped):Gblocks \n";
}
my $pwd_gblock="$pwd1\/2.Gblocks";
# step4 Construct phylogeny tree with IQtree
print "\n====================================================\n";
print "STEP3 Construction of phylogeny tree with iqtree\n\n";
mkdir "3.IQtree" unless -e "3.IQtree";
unless (-e "3.IQtree.ok"){
chdir "3.IQtree";
my $pwd_iqtree=`pwd`;print STDERR "PWD: $pwd_iqtree\n";
my $command="iqtree -s $pwd_gblock\/$out_prefix.aln.cds.fas-gb -bb 10000 -pre $out_prefix -nt AUTO >iqtree.log";
print STDERR (localtime).": $command\n";
system($command)==0 or die "can not execute: $command\n";
system("cp $out_prefix.treefile ../..")==0 or die "can not copy file\n";
}
print STDERR "ALL commands complete! :-)\n";
sub cds2pep {
my %cds2pep = (
"TTT" => "F",
"TTC" => "F",
"TTA" => "L",
"TTG" => "L",
"TCT" => "S",
"TCC" => "S",
"TCA" => "S",
"TCG" => "S",
"TAT" => "Y",
"TAC" => "Y",
"TAA" => "*",
"TAG" => "*",
"TGT" => "C",
"TGC" => "C",
"TGA" => "*",
"TGG" => "W",
"CTT" => "L",
"CTC" => "L",
"CTA" => "L",
"CTG" => "L",
"CCT" => "P",
"CCC" => "P",
"CCA" => "P",
"CCG" => "P",
"CAT" => "H",
"CAC" => "H",
"CAA" => "Q",
"CAG" => "Q",
"CGT" => "R",
"CGC" => "R",
"CGA" => "R",
"CGG" => "R",
"ATT" => "I",
"ATC" => "I",
"ATA" => "I",
"ATG" => "M",
"ACT" => "T",
"ACC" => "T",
"ACA" => "T",
"ACG" => "T",
"AAT" => "N",
"AAC" => "N",
"AAA" => "K",
"AAG" => "K",
"AGT" => "S",
"AGC" => "S",
"AGA" => "R",
"AGG" => "R",
"GTT" => "V",
"GTC" => "V",
"GTA" => "V",
"GTG" => "V",
"GCT" => "A",
"GCC" => "A",
"GCA" => "A",
"GCG" => "A",
"GAT" => "D",
"GAC" => "D",
"GAA" => "E",
"GAG" => "E",
"GGT" => "G",
"GGC" => "G",
"GGA" => "G",
"GGG" => "G",
);
my $seq = shift @_;
my $fram = shift @_;
my $gene = shift @_;
$seq =~ s/\w{$fram}//;
my $pep;
while ((length $seq) >= 3) {
$seq =~ s/(\w{3})//;
$pep .= $cds2pep{$1};
}
return $pep;
}