Assessing viral sequences with CheckV
Title: CheckV assesses the quality and completeness of metagenome-assembled viral genomes
Journal: Nature Biotechnology
Year: 2020
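CheckV estimates completeness by comparing sequences against a large database of complete viral genomes, including 76,262 identified from a systematic search of publicly available metagenomes, metatranscriptomes, and metaviromes. After validating on mock datasets and comparing with existing methods, the authors applied CheckV to large and diverse collections of metagenome-assembled viral sequences, including IMG/VR and the Global Ocean Virome. This revealed 44,652 high-quality viral genomes (that is, >90% complete).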
bitbucket: https://bitbucket.org/berkeleylab/checkv/src/master/
bioconda: https://anaconda.org/bioconda/checkv
Installing CheckV
conda install -c bioconda checkv
Getting the database
URL: https://portal.nersc.gov/CheckV/
wget -c https://portal.nersc.gov/CheckV/checkv-db-v1.0.tar.gz
# If that fails, download it on Windows instead
tar -zxvf checkv-db-v1.0.tar.gz
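Before pointing CheckV at the extracted directory, it can help to confirm that the extraction actually produced something. The following is a minimal sketch, not part of CheckV: `resolve_checkv_db` is a hypothetical helper name, and it only checks that the directory exists and is not empty, without validating the database contents.

```shell
# Hypothetical helper (not a CheckV command): print the absolute path of a
# database directory, or fail with a message on stderr.
# It only checks existence and non-emptiness, not the database contents.
resolve_checkv_db() {
    db_dir=$1
    if [ ! -d "$db_dir" ]; then
        echo "database directory not found: $db_dir" >&2
        return 1
    fi
    if [ -z "$(ls -A "$db_dir")" ]; then
        echo "database directory is empty: $db_dir" >&2
        return 1
    fi
    # Resolve to an absolute path so it stays valid from any working directory
    (cd "$db_dir" && pwd)
}
```

A possible use, assuming the archive was extracted in the current directory: `export CHECKVDB=$(resolve_checkv_db ./checkv-db-v1.0)`.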
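What the database contains
grep -c '^>' checkv_reps.fna
52,141 nucleotide sequences
grep -c '^>' checkv_reps.faa
2,885,048 protein sequences
Each nucleotide sequence (a viral genome) corresponds to multiple protein sequences, about 55 per genome on average
checkv_reps.tsv lists the ID, length, and type of each nucleotide sequence
awk -F'\t' 'NR > 1 {print $2}' checkv_reps.tsv | sort | uniq -c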
39070 circular
13071 genbank
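The counts above can be wrapped into two small reusable functions. This is a sketch only: `count_fasta_records` and `tally_column` are hypothetical names introduced here, and the column number passed to `tally_column` is an assumption that should be checked against the header of the actual table.

```shell
# Count FASTA records, i.e. header lines starting with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Tally the values of one tab-separated column, skipping the header row.
# Usage: tally_column FILE COLUMN_NUMBER
# Output: one "value<TAB>count" line per distinct value, sorted by value.
tally_column() {
    awk -F'\t' -v col="$2" '
        NR > 1 { n[$col]++ }
        END { for (k in n) print k "\t" n[k] }
    ' "$1" | sort
}
```

For example, `count_fasta_records checkv_reps.fna` and `tally_column checkv_reps.tsv 2` would reproduce the two kinds of summary shown above, assuming the type is in the second column as in the command above.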
One-step run
Use the FASTA file of VirSorter2 viral predictions as input
source /route/miniconda3/etc/profile.d/conda.sh
conda activate base
# Path to the database
export CHECKVDB=/route/databases/checkv/checkv-db-v1.0
# one step
checkv end_to_end ../virsort/test.out/final-viral-combined.fa \
./output -t 1
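A run that starts without a database or with an empty input wastes time, so a small pre-flight check may be worthwhile. The sketch below wraps the same `checkv end_to_end` call shown above; `run_checkv` and its return codes are hypothetical choices made for this example, not CheckV behavior.

```shell
# Hypothetical wrapper around the end_to_end call shown above.
# Usage: run_checkv INPUT_FASTA OUTPUT_DIR [THREADS]
# Returns 2 if CHECKVDB is unusable, 3 if the input is missing or empty.
run_checkv() {
    input_fasta=$1
    out_dir=$2
    threads=${3:-1}
    # CHECKVDB must point at the extracted database directory
    if [ -z "${CHECKVDB:-}" ] || [ ! -d "$CHECKVDB" ]; then
        echo "CHECKVDB is unset or not a directory" >&2
        return 2
    fi
    # -s is true only for files that exist and are not empty
    if [ ! -s "$input_fasta" ]; then
        echo "input FASTA missing or empty: $input_fasta" >&2
        return 3
    fi
    checkv end_to_end "$input_fasta" "$out_dir" -t "$threads"
}
```

With the paths from these notes, the call would be `run_checkv ../virsort/test.out/final-viral-combined.fa ./output 1`.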
Progress log
CheckV v0.8.1: contamination
[1/8] Reading database info...
[2/8] Reading genome info...
[3/8] Calling genes with Prodigal...
[4/8] Reading gene info...
[5/8] Running hmmsearch...
[6/8] Annotating genes...
[7/8] Identifying host regions...
[8/8] Writing results...
Run time: 128.65 seconds
Peak mem: 0.12 GB
CheckV v0.8.1: completeness
[1/8] Skipping gene calling...
[2/8] Initializing queries and database...
[3/8] Running DIAMOND blastp search...
[4/8] Computing AAI...
[5/8] Running AAI based completeness estimation...
[6/8] Running HMM based completeness estimation...
[7/8] Determining genome copy number...
[8/8] Writing results...
Run time: 28.43 seconds
Peak mem: 0.25 GB
CheckV v0.8.1: complete_genomes
[1/7] Reading input sequences...
[2/7] Finding complete proviruses...
[3/7] Finding direct/inverted terminal repeats...
[4/7] Filtering terminal repeats...
[5/7] Checking genome for completeness...
[6/7] Checking genome for large duplications...
[7/7] Writing results...
Run time: 0.13 seconds
Peak mem: 0.25 GB
CheckV v0.8.1: quality_summary
[1/6] Reading input sequences...
[2/6] Reading results from contamination module...
[3/6] Reading results from completeness module...
[4/6] Reading results from complete genomes module...
[5/6] Classifying contigs into quality tiers...
[6/6] Writing results...
Run time: 0.04 seconds
Peak mem: 0.25 GB
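The log shows end_to_end running four modules in order: contamination, completeness, complete_genomes, and quality_summary. As far as I know, each module is also available as its own subcommand, which helps when only one step needs rerunning. The sketch below assumes the subcommands take the same input and output arguments as end_to_end and that only the first two accept `-t`; `run_checkv_steps` is a hypothetical name, so check `checkv -h` for the installed version before relying on it.

```shell
# Hypothetical step-by-step equivalent of end_to_end.
# Assumes the four subcommands share the INPUT OUTPUT argument order.
# Usage: run_checkv_steps INPUT_FASTA OUTPUT_DIR [THREADS]
run_checkv_steps() {
    input_fasta=$1
    out_dir=$2
    threads=${3:-1}
    # Each step runs only if the previous one succeeded
    checkv contamination "$input_fasta" "$out_dir" -t "$threads" &&
    checkv completeness "$input_fasta" "$out_dir" -t "$threads" &&
    checkv complete_genomes "$input_fasta" "$out_dir" &&
    checkv quality_summary "$input_fasta" "$out_dir"
}
```

Chaining with `&&` stops at the first failing module, so a later step never reads incomplete results from an earlier one.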
Results
quality_summary.tsv
Provirus/virus status, quality tier, and completeness of each sequence
viruses.fna
These viral sequences may come from free viruses or may be part of a provirus
proviruses.fna
The provirus regions that CheckV cut out of the contigs (provirus region / contig sequence)
contamination.tsv
CheckV further identifies host contamination and its coordinates in all sequences, including the VirSorter2 partial sequences
tmp/proteins.faa
Viral protein sequences
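A common next step is to keep only the contigs in the better quality tiers. The sketch below reads quality_summary.tsv and prints the IDs of matching contigs. It assumes the table is tab-separated, that the contig ID is in the first column, and that the tier column is named `checkv_quality` with values such as `Complete` and `High-quality`; `select_by_quality` is a hypothetical helper, so confirm the header of your own file first.

```shell
# Hypothetical helper: print contig IDs whose quality tier is in a given list.
# Assumes a tab-separated table with IDs in column 1 and a header column
# named "checkv_quality".
# Usage: select_by_quality quality_summary.tsv "High-quality,Complete"
select_by_quality() {
    awk -F'\t' -v tiers="$2" '
        BEGIN {
            # Build a lookup of the requested tiers
            n = split(tiers, t, ",")
            for (i = 1; i <= n; i++) keep[t[i]] = 1
        }
        NR == 1 {
            # Locate the tier column by name instead of by position
            for (i = 1; i <= NF; i++) if ($i == "checkv_quality") col = i
            if (!col) { print "no checkv_quality column" > "/dev/stderr"; exit 1 }
            next
        }
        ($col in keep) { print $1 }
    ' "$1"
}
```

Looking the column up by name keeps the filter working if the column order differs between CheckV versions.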