基因家族鉴定 hmmer+blast (含hmmer安装)

2020-11-24 本文已影响0人山竹山竹px

策略

hmmsearch + blast

单独使用或者组合使用

hmmsearch可以做两次，第一次使用pfam中的多序列比对结果构建模型进行搜索，筛选过结构域后，使用本物种的该基因家族的多序列比对结果再次构建模型，进行搜索

blast常用拟南芥、水稻

准备

conda 安装hmmer

 conda  create -n protein  python=2 #创建环境
 conda activate protein #激活环境
 conda install -y hmmer 

 hmmer-3.2.1          | 7.1 MB    | ########################################################################################## | 100% 
 Preparing transaction: done
 Verifying transaction: done
 Executing transaction: done 
 ##安装成功

准备数据

基因组文件：包括cds\pep\gff

拟南芥的某基因家族蛋白序列

从Pfam库下载hidden Markov model (HMM) profile

http://pfam.xfam.org/

主页输入PF号；跳转后的页面选择Alignments条目；format选择stockholm格式；点击generate，会下载一个txt文件

步骤

`hmmsearch`

激活环境，前面会显示环境名称

conda activate protein

构建模型

hmmbuild :: profile HMM construction from multiple sequence alignments

hmmbuild pf00067.hmm  PF00067_seed.txt

搜索

hmmsearch: Search a protein profile HMM against a protein sequence database.

hmmsearch pf00067.hmm protein.fasta > out1

输出文件的解读见《hmmer使用手册》

得到候选基因集1

`blastp`

query：拟南芥基因家族蛋白

DB：你的基因组蛋白文件

program：blastp

建库

makeblastdb -in input_file -dbtype molecule_type -parse_seqids -out database_name -logfile File_Name

-in query 序列文件
-dbtype 后接序列类型，nucl为核酸，prot为蛋白 -parse_seqids 推荐加上 -out 后接数据库名 -logfile 日志文件，如果没有默认输出到屏幕

比对

blastp -query seq.fasta -out seq.blast -db dbname -outfmt 6 -evalue 1e-5

-query：输入文件路径及文件名 -out：输出文件路径及文件名 -db：格式化了的数据库路径及数据库名（就是建库时候的数据库名） -outfmt：输出文件格式，总共有12种格式，6是tabular格式对应之前BLAST的m8格式，不写该参数，默认输出比对文件 -evalue：设置输出结果的e-value值 -num_alignments 显示比对数Default = 250 -num_descriptions：单行描述的最大数目 default=50 -num_threads：线程

合并去重

excel表可以做到

使用TBtools提取候选id的蛋白序列

使用TBtools提取序列.png

验证结构域

Pfam

http://pfam.xfam.org/

SMART

http://smart.embl-heidelberg.de/#

Single模式支持没找到结果就预测并返回。而Batch模式，则只支持数据库中已收录结果的返回

使用 sequence analysis 模块，批量搜索入口在问号里。TBtools 有个插件可以批量

NCBI Conserved Domains

Domain 预测不到，或者 Domain 不完整，不代表不存在，可能只是算法敏感度的问题

多序列比对，手动检查保守结构域

https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi 支持蛋白序列、核酸序列，单条

https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi 支持多条(a file)，只支持蛋白序列

结果解读： https://www.sohu.com/a/216315762_419916

specific hits meet or exceed a domain-specific e-value threshold (illustrated example) and represent a very high confidence that the query sequence belongs to the same protein family as the sequences use to create the domain model non-specific hits meet or exceed the RPS-BLAST threshold for statistical significance (default E-value cutoff of 0.01,or an E-value selected by user via the advanced search options)

去除无结构域的序列

人工矫正结构注释

对于结构域不完整的（partial），或者length明显过长的（可能是2-3个基因合到一起了），需要手动进行基因结构注释的矫正。

根据GFF文件，找到该基因所在基因组上的位置

根据位置提取基因组上下游序列（看序列长度，如+-3000bp）

将序列blastx NR数据库，根据比对结果，比着序列翻译的3种读码框，寻找GT...AG内含子

同时可以用基因结构预测网站辅助

ref：

什么是conda: https://www.jianshu.com/p/edaa744ea47d

HMMER 官方手册

SMART: https://www.jianshu.com/p/cef209c015e5