Phylogenomic analysis with anvi'o
Wikipedia describes phylogenomics as follows:
Phylogenomics is the intersection of the fields of evolution and genomics. The term has been used in multiple ways to refer to analysis that involves genome data and evolutionary reconstructions. It is a group of techniques within the larger fields of phylogenetics and genomics. Phylogenomics draws information by comparing entire genomes, or at least large portions of genomes. Phylogenetics compares and analyzes the sequences of single genes, or a small number of genes, as well as many other types of data.
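In other words, phylogenomics sits at the intersection of evolutionary biology and genomics: it is a set of methods that use genome-scale data to reconstruct evolutionary history.
1. Downloading the example data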
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216$ wget https://ndownloader.figshare.com/files/8628361 -O AnvioPhylogenomicsTutorialDataPack.tar.gz
AnvioPhylogenomicsT 100%[===================>] 30.99M 41.8KB/s in 14m 26s
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216$ ls
AnvioPhylogenomicsTutorialDataPack.tar.gz
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216$ tar -zxvf AnvioPhylogenomicsTutorialDataPack.tar.gz
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216$ ls
AnvioPhylogenomicsTutorialDataPack AnvioPhylogenomicsTutorialDataPack.tar.gz
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216$ cd AnvioPhylogenomicsTutorialDataPack
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack$ ls
closely-related distantly-related
2. Processing the FASTA files
2.1 Inspecting the data files
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack$ cd distantly-related/
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related$ ls
Bacteroides_fragilis_2334.fa Prevotella_dentalis_19591.fa
Bacteroides_fragilis_2346.fa Prevotella_denticola_19594.fa
Bacteroides_fragilis_2347.fa Prevotella_intermedia_19600.fa
Escherichia_albertii_6917.fa Salmonella_enterica_21806.fa
Escherichia_coli_6920.fa Salmonella_enterica_22047.fa
Escherichia_coli_9038.fa Salmonella_enterica_22289.fa
2.2 Converting the FASTA files into contigs databases (.db files)
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related$ for i in *.fa
> do
>     anvi-script-FASTA-to-contigs-db "$i"
> done
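If the anvi-script-FASTA-to-contigs-db wrapper is not available in your anvi'o installation, the same result can probably be reached in two steps: anvi-gen-contigs-database followed by anvi-run-hmms. The sketch below is a dry run that only prints the commands. The helper names db_name_for_fasta and print_contigs_db_commands are hypothetical, and the flags (-f, -o, -n, -c) are written from memory, so check each program's --help for your version before running anything.

```shell
# Hypothetical helper: derive the contigs database name from a FASTA file name,
# e.g. Escherichia_coli_6920.fa -> Escherichia_coli_6920.db
db_name_for_fasta() {
    local fasta="$1"
    printf '%s.db\n' "${fasta%.fa}"
}

# Hypothetical helper: print (do not run) the two commands for each FASTA file.
# Assumption: -n sets the project name and -c points at the contigs database.
print_contigs_db_commands() {
    local fasta db name
    for fasta in "$@"; do
        db="$(db_name_for_fasta "$fasta")"
        name="$(basename "${fasta%.fa}")"
        echo "anvi-gen-contigs-database -f $fasta -o $db -n $name"
        echo "anvi-run-hmms -c $db"
    done
}
```

Calling print_contigs_db_commands *.fa in the distantly-related directory lists two commands per genome; once they look right, the output can be piped into bash.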
2.3 Creating the list of contigs databases (the external genomes file)
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related$ wget https://goo.gl/XuezQF -O external-genomes.txt
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related$ cat external-genomes.txt
name contigs_db_path
Bacteroides_fragilis_2334 Bacteroides_fragilis_2334.db
Bacteroides_fragilis_2346 Bacteroides_fragilis_2346.db
Bacteroides_fragilis_2347 Bacteroides_fragilis_2347.db
Escherichia_albertii_6917 Escherichia_albertii_6917.db
Escherichia_coli_6920 Escherichia_coli_6920.db
Escherichia_coli_9038 Escherichia_coli_9038.db
Prevotella_dentalis_19591 Prevotella_dentalis_19591.db
Prevotella_denticola_19594 Prevotella_denticola_19594.db
Prevotella_intermedia_19600 Prevotella_intermedia_19600.db
Salmonella_enterica_21806 Salmonella_enterica_21806.db
Salmonella_enterica_22047 Salmonella_enterica_22047.db
Salmonella_enterica_22289 Salmonella_enterica_22289.db
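The file above is a two-column table (name, contigs_db_path), which anvi'o reads as tab-delimited. For your own genomes there is nothing to download, so a table of the same shape can be generated from the .db files themselves. This is a minimal sketch; make_external_genomes is a hypothetical helper name, and it assumes that every .db file in the directory is a contigs database.

```shell
# Hypothetical helper: write a tab-delimited external genomes table for every
# *.db file in a directory (default: current directory) to standard output.
make_external_genomes() {
    local dir="${1:-.}" db
    printf 'name\tcontigs_db_path\n'
    for db in "$dir"/*.db; do
        [ -e "$db" ] || continue      # no .db files: emit the header only
        db="$(basename "$db")"        # keep paths relative, as in the file above
        printf '%s\t%s\n' "${db%.db}" "$db"
    done
}
```

Run it as make_external_genomes . > external-genomes.txt from the directory that holds the databases. Any other .db file in that directory (for example a profile database created later) would be listed too, so generate the table before such files exist or delete the extra rows by hand.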
2.4 Listing the HMM sources (single-copy gene collections and ribosomal RNAs) available in the contigs databases
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related$ anvi-get-sequences-for-hmm-hits --external-genomes external-genomes.txt --list-hmm-sources
HMM SOURCES COMMON TO ALL 12 GENOMES
===============================================
* Rinke_et_al [type: singlecopy] [num genes: 1975]
* BUSCO_83_Protista [type: singlecopy] [num genes: 1741]
* Campbell_et_al [type: singlecopy] [num genes: 1731]
* Ribosomal_RNAs [type: Ribosomal_RNAs] [num genes: 244]
2.5 Extracting the amino acid sequences of single-copy genes from the contigs databases
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related$ anvi-get-sequences-for-hmm-hits --external-genomes external-genomes.txt \
> -o concatenated-proteins.fa \
> --hmm-source Campbell_et_al \
> --gene-names Ribosomal_L1,Ribosomal_L2,Ribosomal_L3,Ribosomal_L4,Ribosomal_L5,Ribosomal_L6 \
> --return-best-hit \
> --get-aa-sequences \
> --concatenate
CITATION
===============================================
Anvi'o will use 'muscle' by Edgar, doi:10.1093/nar/gkh340
(http://www.drive5.com/muscle) to align your sequences. If you publish your
findings, please do not forget to properly credit their work.
Hits .........................................: 1659 hits for 1 source(s)
Filtered hits ................................: 72 hits remain after filtering for 6 gene(s)
Filtered hits ................................: 72 hits remain after removing weak hits for multiple hits
Mode .........................................: AA seqeunces
Genes are concatenated .......................: True
Output .......................................: concatenated-proteins.fa
# Take a look at the contents of the output file concatenated-proteins.fa
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related$ less concatenated-proteins.fa
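Beyond paging through the file with less, a quick sanity check is to confirm that every genome has one record and that all records have the same length, as a concatenated alignment should. The sketch below uses only awk; fasta_lengths is a hypothetical helper name.

```shell
# Hypothetical helper: print "name<TAB>length" for every record in a FASTA file,
# summing sequence lines that are wrapped over several lines.
fasta_lengths() {
    awk '
        /^>/ { if (name != "") print name "\t" len   # flush the previous record
               name = substr($1, 2); len = 0; next }
             { len += length($0) }                   # gaps ("-") count as columns
        END  { if (name != "") print name "\t" len }
    ' "$1"
}
```

fasta_lengths concatenated-proteins.fa | cut -f 2 | sort -u should print a single number if the alignment is consistent, and fasta_lengths concatenated-proteins.fa | wc -l should report 12 records for this data set.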
2.6 Building a phylogenomic tree from the concatenated single-copy gene amino acid sequences
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related$ anvi-gen-phylogenomic-tree -f concatenated-proteins.fa -o phylogenomic-tree.txt
Input aligment file path .....................: /home/czh/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related/concatenated-proteins.fa
Output file path .............................: /home/czh/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related/phylogenomic-tree.txt
Alignment names ..............................: Salmonella_enterica_21806, Prevotella_denticola_19594, Escherichia_coli_9038, Prevotella_intermedia_19600, Salmonella_enterica_22289, Escherichia_albertii_6917, Prevotella_dentalis_19591, Bacteroides_fragilis_2347, Bacteroides_fragilis_2346, Salmonella_enterica_22047, Bacteroides_fragilis_2334, Escherichia_coli_6920
Alignment sequence length ....................: 1,320
Version ......................................: FastTree Version 2.1.10 Double precision (No SSE3)
Alignment ....................................: standard input
Info .........................................: Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search .......................................: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits ......................................: 1.00*sqrtN close=default refresh=0.80
ML Model .....................................: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Info .........................................: Ignored unknown character X (seen 120 times)
Refining topology ............................: 12 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
Info .........................................: Total branch-length 0.918 after 0.03 sec
ML-NNI round 1 ...............................: LogLk = -8591.504 NNIs 0 max delta 0.00 Time 0.17
Info .........................................: Switched to using 20 rate categories (CAT approximation)
Info .........................................: Rate categories were divided by 0.748 so that average rate = 1.0
Info .........................................: CAT-based log-likelihoods may not be comparable across runs
Info .........................................: Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2 ...............................: LogLk = -8230.710 NNIs 0 max delta 0.00 Time 0.25
Info .........................................: Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 3 ...............................: LogLk = -8230.636 NNIs 0 max delta 0.00 Time 0.36 (final)
Optimize all lengths .........................: LogLk = -8230.633 Time 0.39
FastTree output newick file ..................: /home/czh/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related/phylogenomic-tree.txt
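The resulting phylogenomic-tree.txt is a Newick file. Before visualizing it, one can check that all 12 genomes ended up as leaves. The sketch below extracts leaf names with standard text tools; newick_leaves is a hypothetical helper name, and the pattern assumes unquoted names that contain none of the Newick punctuation characters, which holds for the genome names used here.

```shell
# Hypothetical helper: list the leaf (tip) names of a Newick tree, one per line, sorted.
# A leaf name directly follows "(" or ","; support values follow ")" and are skipped.
newick_leaves() {
    tr -d '\n' < "$1" | grep -oE '[(,][^(),:;]+' | tr -d '(,' | sort
}
```

newick_leaves phylogenomic-tree.txt | wc -l should print 12 for this data set, and the names should match the first column of external-genomes.txt.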
2.7 Visualizing the phylogenomic tree
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related$ anvi-interactive -p phylogenomic-profile.db -t phylogenomic-tree.txt --title "Phylogenomics Tutorial Example #1" --manual
Interactive mode .............................: manual
* The server is now listening the port number "8080". When you are finished, press
CTRL+C to terminate the server.
[23294:23294:1216/200910.714795:ERROR:edid_parser.cc(313)] invalid EDID: human unreadable char in name
[23294:23294:1216/200910.714810:ERROR:edid_parser.cc(313)] invalid EDID: human unreadable char in name
[23294:23294:1216/200910.714815:ERROR:edid_parser.cc(313)] invalid EDID: human unreadable char in name
[23294:23294:1216/200910.714819:ERROR:edid_parser.cc(313)] invalid EDID: human unreadable char in name
Created new window in existing browser session.
2.8 Adding extra information to the tree display
(anvio) czh@czh-ubuntu:~/Desktop/add_disk/Anvio_work/test_1216/AnvioPhylogenomicsTutorialDataPack/distantly-related$ anvi-interactive -p phylogenomic-profile.db -d view.txt -t phylogenomic-tree.txt --title "Phylogenomics Tutorial Example #2" --manual
Interactive mode .............................: manual
* The server is now listening the port number "8080". When you are finished, press
CTRL+C to terminate the server.
[25071:25071:1216/203225.744180:ERROR:edid_parser.cc(313)] invalid EDID: human unreadable char in name
[25071:25071:1216/203225.744204:ERROR:edid_parser.cc(313)] invalid EDID: human unreadable char in name
[25071:25071:1216/203225.744211:ERROR:edid_parser.cc(313)] invalid EDID: human unreadable char in name
[25071:25071:1216/203225.744218:ERROR:edid_parser.cc(313)] invalid EDID: human unreadable char in name
Created new window in existing browser session.
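The transcript does not show where view.txt comes from. As far as I know, the file passed with -d in manual mode is a tab-delimited table whose first column holds the same item names as the tree leaves, with one additional column per data layer. The sketch below builds such a table from external-genomes.txt, taking the genus from the part of each name before the first underscore. The helper name make_view_table and the column headers genome and genus are hypothetical choices, not the contents of the original view.txt.

```shell
# Hypothetical helper: turn an external genomes table into a two-column
# "genome<TAB>genus" table, written to standard output.
make_view_table() {
    awk -F '\t' 'BEGIN { OFS = "\t" }
        NR == 1  { print "genome", "genus"; next }   # replace the original header
        $1 != "" { split($1, parts, "_"); print $1, parts[1] }
    ' "$1"
}
```

For example, make_view_table external-genomes.txt > my-view.txt produces a row such as Escherichia_coli_6920 followed by Escherichia, and my-view.txt can then be tried in place of view.txt in the command above.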