BLAST-The learning notes of the

2017-12-03 Hypdoctor

Basic Local Alignment Search Tool (BLAST)

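BLAST is an algorithm for comparing the primary structure of biological sequences, such as the amino-acid sequences of different proteins or the DNA sequences of different genes. Given a database containing many sequences, BLAST lets a researcher search it for sequences identical or similar to a sequence of interest. For example, when a previously unknown gene is discovered in a non-human animal, researchers will usually run a BLAST search against the human genome to check, by sequence similarity, whether humans carry a similar gene. The BLAST algorithm and the program that implements it were developed by Warren Gish, David J. Lipman, and Webb Miller at the US National Center for Biotechnology Information (NCBI). (from Wikipedia)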

A suite of tools

[Image: blast-table.png]

The key concepts of BLAST

-Searches may take place in nucleotide space, in protein space, or in translated space, where nucleotides are translated into proteins.
-Searches may implement search "strategies": optimizations for a particular task. Different search strategies will return different alignments.
-Searches use alignments that rely on scoring matrices.
-Searches may be customized with many additional parameters. BLAST has many subtle functions that most users never need.
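
The search spaces listed above map onto the standard BLAST+ programs. Here is a minimal sketch of that mapping, assuming the sequence types are written as "nucl" and "prot" (the same values makeblastdb accepts for -dbtype). The function name pick_blast and its optional third argument are made up for illustration and are not part of BLAST.

```shell
# Pick a BLAST program from the query type and the database type.
# pick_blast is a hypothetical helper; only the program names it prints are real.
# $1 = query type, $2 = database type, $3 = "yes" to translate both sides.
pick_blast() {
    query_type="$1"
    db_type="$2"
    translated="${3:-no}"
    case "$query_type:$db_type" in
        nucl:nucl)
            if [ "$translated" = "yes" ]; then
                echo tblastx    # translated nucleotide vs translated nucleotide
            else
                echo blastn     # nucleotide vs nucleotide
            fi ;;
        prot:prot) echo blastp ;;    # protein vs protein
        nucl:prot) echo blastx ;;    # translated nucleotide query vs protein database
        prot:nucl) echo tblastn ;;   # protein query vs translated nucleotide database
        *) echo "unknown combination: $query_type vs $db_type" >&2
           return 1 ;;
    esac
}

pick_blast nucl prot
```

Called as shown, the helper prints blastx, the program that translates a nucleotide query before searching a protein database.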

Basic steps for using BLAST

1.Build a BLAST database with makeblastdb
2.Choose the appropriate tool: blastn, blastp, blastx, etc.
3.Run the tool and, where needed, format the output
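
The three steps can be sketched as a dry run that only prints the commands, so it is safe to try before BLAST is installed. The function name blast_workflow, the file names, and the output name hits.tsv are placeholders chosen for this sketch. The flags themselves (-in, -dbtype, -out, -db, -query, -outfmt 6) are standard BLAST+ options, and -outfmt 6 selects tab-separated output.

```shell
# Print the commands for the three basic steps (dry run, nothing is executed).
# $1 = FASTA file to index, $2 = query FASTA file, $3 = database name.
# All three names are placeholders supplied by the caller.
blast_workflow() {
    db_fasta="$1"
    query="$2"
    db_name="$3"
    # Step 1: build a nucleotide database from the FASTA file
    echo "makeblastdb -in $db_fasta -dbtype nucl -out $db_name"
    # Steps 2 and 3: search with blastn and write tabular output (format 6)
    echo "blastn -db $db_name -query $query -outfmt 6 -out hits.tsv"
}

blast_workflow reference.fa query.fa reference_db
```

To actually run the printed commands once BLAST is installed and the input files exist, the output can be piped to a shell, for example `blast_workflow reference.fa query.fa reference_db | sh`.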

Build a blast database

# Create a directory for the database
mkdir -p ~/refs/ebola
# Fetch the Ebola virus nucleotide sequence
efetch -db nucleotide -id KM233118 -format fasta > ~/refs/ebola/KM233118.fa

Use the makeblastdb command to build a database from the Ebola nucleotide sequence
makeblastdb -help | more

USAGE
  makeblastdb [-h] [-help] [-in input_file] [-input_type type]
    -dbtype molecule_type [-title database_title] [-parse_seqids]
    [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
    [-mask_desc mask_algo_descriptions] [-gi_mask]
    [-gi_mask_name gi_based_mask_names] [-out database_name]
    [-max_file_sz number_of_bytes] [-logfile File_Name] [-taxid TaxID]
    [-taxid_map TaxIDMapFile] [-version]
DESCRIPTION
   Application to create BLAST databases, version 2.7.1+
REQUIRED ARGUMENTS
 -dbtype <String, `nucl', `prot'>
   Molecule type of target db
OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -version
   Print version number;  ignore other arguments
 *** Input options
 -in <File_In>
   Input file/database name
   Default = `-'
 -input_type <String, `asn1_bin', `asn1_txt', `blastdb', `fasta'>
   Type of the data specified in input_file
   Default = `fasta'
 *** Configuration options
 -title <String>
   Title for BLAST database
   Default = input file name provided to -in argument
 -parse_seqids
   Option to parse seqid for FASTA input if set, for all other input types
   seqids are parsed automatically
 -hash_index
   Create index of sequence hash values.
 *** Sequence masking options
 -mask_data <String>
   Comma-separated list of input files containing masking data as produced by
   NCBI masking applications (e.g. dustmasker, segmasker, windowmasker)
 -mask_id <String>
   Comma-separated list of strings to uniquely identify the masking algorithm
    * Requires:  mask_data
    * Incompatible with:  gi_mask
 -mask_desc <String>
   Comma-separated list of free form strings to describe the masking algorithm
   details
    * Requires:  mask_id
 -gi_mask
   Create GI indexed masking data.
    * Requires:  parse_seqids
    * Incompatible with:  mask_id
 -gi_mask_name <String>
   Comma-separated list of masking data output files.
    * Requires:  mask_data, gi_mask
 *** Output options
 -out <String>
   Name of BLAST database to be created
   Default = input file name provided to -in argument
   Required if multiple file(s)/database(s) are provided as input
 -max_file_sz <String>
   Maximum file size for BLAST database files
   Default = `1GB'
 -logfile <File_Out>
   File to which the program log should be redirected
 *** Taxonomy options
 -taxid <Integer, >=0>
   Taxonomy ID to assign to all sequences
    * Incompatible with:  taxid_map
 -taxid_map <File_In>
   Text file mapping sequence IDs to taxonomy IDs.
   Format:<SequenceId> <TaxonomyId><newline>
    * Requires:  parse_seqids
    * Incompatible with:  taxid

# Create the Ebola nucleotide sequence database
makeblastdb -in ~/refs/ebola/KM233118.fa -dbtype nucl -out ~/refs/ebola/KM233118
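
After makeblastdb finishes, a quick sanity check is to look for the files it writes next to the -out prefix. The sketch below assumes a small, single-volume nucleotide database, for which the core files end in .nhr, .nin and .nsq. Newer BLAST+ releases write additional files, and large databases are split into numbered volumes, so treat this as a rough check only. The function name nucl_db_present is made up for illustration.

```shell
# Succeed only if the three core files of a nucleotide BLAST database exist.
# $1 = database prefix, i.e. the value that was given to makeblastdb -out.
nucl_db_present() {
    prefix="$1"
    for ext in nhr nin nsq; do
        if [ ! -f "$prefix.$ext" ]; then
            echo "missing: $prefix.$ext" >&2
            return 1
        fi
    done
    return 0
}

if nucl_db_present "$HOME/refs/ebola/KM233118"; then
    echo "database found"
else
    echo "database not found, run makeblastdb first"
fi
```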

Build a protein sequence database for PRJNA257197

# Download all protein sequences of PRJNA257197 as a FASTA file
mkdir -p index
esearch -db protein -query PRJNA257197 | efetch -format fasta > index/all-proteins.fa
# Build the protein sequence database
makeblastdb -in index/all-proteins.fa -dbtype prot -out index/all -parse_seqids
# List the database entries, printing each accession (the "%a" format)
blastdbcmd -db index/all -entry 'all' -outfmt "%a" | head

Downloading BLAST databases

NCBI offers prebuilt databases for download, covering many species and nearly all known sequences.
website

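# Create a directory for the downloaded databases
mkdir -p ~/refs/refseq
cd ~/refs/refseq
# The BLAST package ships update_blastdb.pl for downloading NCBI's prebuilt databases
# List all available databases
update_blastdb.pl --showall | more
# Download the 16S microbial database
update_blastdb.pl 16SMicrobial --decompress
# Download the taxonomy database
update_blastdb.pl taxdb --decompress
# Add the database path to the BLASTDB environment variable; taxonomy lookups also need this (for Mac)
echo 'export BLASTDB="$BLASTDB:$HOME/refs/refseq/"' >> ~/.bash_profile
source ~/.bash_profile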
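
BLASTDB is a colon-separated list of directories, so appending to it blindly can leave a stray leading colon when the variable was empty, or add the same directory twice when the profile is sourced again. Here is a minimal sketch of a safer update. The function name add_blastdb_dir is made up for illustration, and the directory is the one used in these notes.

```shell
# Print the BLASTDB value that results from adding one directory.
# $1 = current value (may be empty), $2 = directory to add.
add_blastdb_dir() {
    current="$1"
    dir="$2"
    case ":$current:" in
        *":$dir:"*) echo "$current" ;;           # already listed, keep unchanged
        *) echo "${current:+$current:}$dir" ;;   # add a colon only if non-empty
    esac
}

BLASTDB=$(add_blastdb_dir "${BLASTDB:-}" "$HOME/refs/refseq")
export BLASTDB
echo "$BLASTDB"
```

Placing the function and the two assignment lines in ~/.bash_profile would apply the same logic at every login.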
(To be continued)