Building an NCBI NR sub-database

2021-03-14, by 生信小书生

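Many databases store protein sequences, for example NCBI RefSeq, Protein, and Swiss-Prot, and the sequences are heavily redundant both across these databases and within each one. To make them easier to use, NCBI built the non-redundant NR database. This post walks through how to build an NR sub-database.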

The steps are as follows:

1. Download the nr database, taxdump, and accession2taxid, and install csvtk and TaxonKit

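# Download nr via Aspera (-l caps the transfer rate; adjust to your bandwidth)
ascp -P33001 -l 500m --mode recv -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -QTr anonftp@ftp-private.ncbi.nlm.nih.gov:/blast/db/FASTA/nr.gz ./
wget -t 0 -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
wget -t 0 -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
# Decompress so that the later steps can read the plain files
gunzip nr.gz prot.accession2taxid.gz
conda install -c bioconda csvtk -y
conda install taxonkit -c bioconda -y
# Unpack the taxonomy dump into ./taxdump
mkdir taxdump
mv taxdump.tar.gz taxdump
tar -zxvf taxdump/taxdump.tar.gz -C taxdump
rm taxdump/taxdump.tar.gz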
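
Because nr.gz is very large, it can be worth verifying the download before using it. The following is a minimal sketch, assuming a checksum file in `md5sum` format was downloaded next to the archive (NCBI typically publishes `.md5` files alongside its BLAST FASTA files). The helper name `verify_md5` and the file name `nr.gz.md5` are illustrative assumptions, not part of the original workflow.

```shell
# Hypothetical helper: compare FILE against a checksum file in md5sum format.
verify_md5() {
    file=$1
    sumfile=$2
    # The first field of the checksum file is the expected MD5 digest
    expected=$(awk '{print $1; exit}' "$sumfile")
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$expected" = "$actual" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Usage (assumes nr.gz.md5 sits next to nr.gz):
# verify_md5 nr.gz nr.gz.md5
```

A non-zero return code signals a mismatch, so the call can gate later steps, for example with `verify_md5 nr.gz nr.gz.md5 && echo "safe to continue"`.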

2. Build the BLAST database (nr)

nohup makeblastdb -parse_seqids -in nr -dbtype prot -out nr &

3. Use TaxonKit to extract all taxids under a given taxon (viruses as the example; the NCBI taxid for Viruses is 10239)

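# List every taxid under Viruses (10239), one per line
taxonkit list -j 2 --ids 10239 --indent "" --data-dir ./taxdump/ > Virus.list
# Keep the accession.version of every protein whose taxid is in Virus.list
cat prot.accession2taxid | csvtk -t grep -f taxid -P ../nr/Virus.list | csvtk -t cut -f accession.version > Virus.taxid.acc.txt
# Create an alias database restricted to those accessions
blastdb_aliastool -seqidlist Virus.taxid.acc.txt -db /public1/data/xxxx/data/parasite/nr/nr/nr -out nr_virues -title nr_virues
# Export the alias database as FASTA, then index it for DIAMOND
blastdbcmd -db /public1/data/xxxxx/data/parasite/nr/accession2taxid/nr_virues -entry all -dbtype prot -out nr_Virus.fa
diamond makedb --in nr_Virus.fa --db nr_Virus -p 10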
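
The filtering step above is essentially a join: keep the rows of prot.accession2taxid whose taxid appears in the taxid list, and print their accession.version. The sketch below shows that logic with plain awk on a tiny made-up table. The helper name `filter_acc_by_taxid` and all accessions are invented for illustration. The column layout (accession, accession.version, taxid, gi, tab-separated, with a header row) is assumed to match NCBI's prot.accession2taxid file.

```shell
# Hypothetical helper: print accession.version for rows whose taxid is listed.
# Assumes a tab-separated table with a header row and the columns
# accession, accession.version, taxid, gi.
filter_acc_by_taxid() {
    taxid_list=$1
    acc2taxid=$2
    awk -F '\t' '
        NR == FNR { keep[$1] = 1; next }   # first file: remember wanted taxids
        FNR == 1  { next }                 # second file: skip the header row
        ($3 in keep) { print $2 }          # emit accession.version on a match
    ' "$taxid_list" "$acc2taxid"
}

# Toy data (invented values, for illustration only)
demo=$(mktemp -d)
printf '10239\n11320\n' > "$demo/Virus.list"
printf 'accession\taccession.version\ttaxid\tgi\n' > "$demo/acc2taxid.tsv"
printf 'AAA00001\tAAA00001.1\t11320\t0\n' >> "$demo/acc2taxid.tsv"
printf 'BBB00002\tBBB00002.2\t9606\t0\n' >> "$demo/acc2taxid.tsv"
printf 'CCC00003\tCCC00003.1\t10239\t0\n' >> "$demo/acc2taxid.tsv"

filter_acc_by_taxid "$demo/Virus.list" "$demo/acc2taxid.tsv"
```

This sketch skips the header row, so its output contains only sequence identifiers. It also assumes the taxid list is non-empty, since the `NR == FNR` idiom relies on the first file having at least one line.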

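The nr_Virus sub-database is now complete.
Other sub-databases (plants, animals) can be built the same way as the virus one. Building takes a lot of time, but searches then run tens of times faster, especially as the NR database keeps growing!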
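
Once a sub-database FASTA file such as nr_Virus.fa exists, a quick sanity check is to count its records. The sketch below does this by counting header lines. The helper name `count_fasta_records` and the toy sequences are assumptions for illustration. The record count need not equal the number of accessions in the ID list, since NR merges identical sequences into a single record.

```shell
# Hypothetical helper: count FASTA records by counting header lines.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Toy FASTA (invented sequences, for illustration only)
fa=$(mktemp)
printf '>seq1 demo protein\nMKT\nAAL\n>seq2 demo protein\nMSE\n' > "$fa"
count_fasta_records "$fa"
```

On a real file the call would be `count_fasta_records nr_Virus.fa`.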

Note: using an NR sub-database is not recommended, because searches against it produce a very high false-positive rate!
