
snpEff: the "CDS check" error when building a database before annotating a VCF

2022-03-21  vicLeo

conda activate java12 # enter the conda environment that has a recent Java version
snpEff # start snpEff

Building the mouse database

First go into the snpeff-5.1-1 folder:
cd /home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1
mkdir data ## create a data folder
cd data
mkdir genomes #### create the genomes directory (we are already inside data)

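Symlink mm39.fa into the genomes folder under the name sequences.fa (snpEff only picks up the genome FASTA as data/genomes/mm39.fa or data/mm39/sequences.fa, so see note 3 below for where the link finally went)

ln -s /home/u20111230014/workspace/genome/mm10_to_mm39/mm39.fa genomes/sequences.fa
mkdir mm39 #### create the mm39 (mouse) directory inside data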

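Symlink mm39.gtf into the mm39 folder under the name genes.gtf

ln -s /home/u20111230014/workspace/genome/mm10_to_mm39/mm39.2020-10-27.ncbiRefSeq.gtf mm39/genes.gtf
Alternatively, download it from http://hgdownload.cse.ucsc.edu/goldenPath/mm39/bigZips/genes/mm39.ncbiRefSeq.gtf.gz
PS: 1. If the download is a GTF, keep it as GTF and simply use the -gtf22 option.
2. Whatever the downloaded GTF is called, the symlink must be named genes.gtf.
3. In the end, to be on the safe side (Linux software generally prefers all of its files in one place), I put both sequences.fa and genes.gtf in data/mm39.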
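
The final layout from note 3 (both inputs inside data/mm39) can be sketched as below. This is only an illustration: it runs in a scratch directory created with mktemp and uses empty placeholder files, so `demo_root`, `src` and the placeholder inputs are assumptions and not the real paths used in this post.

```shell
# Sketch only: reproduce the data/mm39 layout in a scratch directory.
# demo_root and the empty placeholder inputs are assumptions, not real paths.
demo_root=$(mktemp -d)
src="$demo_root/source"
mkdir -p "$src" "$demo_root/data/mm39"

# Stand-ins for the real genome FASTA and RefSeq GTF.
: > "$src/mm39.fa"
: > "$src/mm39.ncbiRefSeq.gtf"

# The links carry the fixed names sequences.fa and genes.gtf inside data/<genome>/.
ln -s "$src/mm39.fa" "$demo_root/data/mm39/sequences.fa"
ln -s "$src/mm39.ncbiRefSeq.gtf" "$demo_root/data/mm39/genes.gtf"

ls -l "$demo_root/data/mm39"
```

Using absolute link targets, as here, keeps the links valid no matter which directory the build is later started from.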

Go back to the snpEff directory
cd /home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1
and register the genome in the config file:
echo "mm39.genome:mm39" >> snpEff.config
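
Because `>>` appends every time it is run, repeating the step leaves duplicate entries in snpEff.config. A minimal sketch of an append-once variant follows. The helper name `add_genome_entry` is hypothetical, and `config` points at a scratch file here rather than the real snpEff.config.

```shell
# Sketch only: add the genome entry once, even if the step is repeated.
# config is a scratch file; in practice it would be snpEff.config.
config=$(mktemp)
genome=mm39

add_genome_entry() {
    # $1 = genome name, $2 = config file
    entry="$1.genome:$1"
    # -x matches the whole line, -F treats the entry as a literal string.
    if ! grep -qxF "$entry" "$2"; then
        printf '%s\n' "$entry" >> "$2"
    fi
}

add_genome_entry "$genome" "$config"
add_genome_entry "$genome" "$config"   # second call changes nothing
cat "$config"
```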

Make the jar executable before running

chmod +x snpEff.jar

Building the database with snpEff build

Example
java -jar snpEff.jar build -c snpEff.config -gtf22 -v mm39

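Parameters

java -jar: run the program with Java
-c: path to the snpEff.config configuration file
-gtf22: the input genome annotation is in GTF 2.2 format
-gff3: the input genome annotation is in GFF3 format
-v: verbose mode, print log messages while the program runs
mm39 (last argument): the genome version to build; it must match the entry added to the snpEff.config passed with -c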

A quick run of the example fails with: ERROR: CDS check file 'cds.fa' not found. Protein check file not found. Database check failed.

(java12) [u20111230014@cpu13 snpeff-5.1-1]$ tail -n 10 Out.snpEff_build 
        300000  ....................................................................................................
        400000  ................................................00:00:51 done.
00:00:51 [Optional] Rare amino acid annotations
WARNING_FILE_NOT_FOUND: Rare Amino Acid analysis: Cannot read protein sequence file '/home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1/./data/mm39/protein.fa', nothing done.
ERROR: CDS check file '/home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1/./data/mm39/cds.fa' not found.
ERROR: Protein check file '/home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1/./data/mm39/protein.fa' not found.
ERROR: Database check failed.
00:00:51 Logging
00:00:52 Checking for updates...
00:00:54 Done.
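
In a long build log the ERROR lines are easy to miss among the progress dots. A minimal sketch for pulling them out with grep is shown below. The three ERROR lines are copied from the log above into a scratch file (`build_log` is an assumption); in practice the file would be Out.snpEff_build.

```shell
# Sketch only: list the ERROR lines of a build log.
# build_log is a scratch file filled with lines copied from the log above.
build_log=$(mktemp)
cat > "$build_log" <<'EOF'
00:00:51 [Optional] Rare amino acid annotations
ERROR: CDS check file '/home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1/./data/mm39/cds.fa' not found.
ERROR: Protein check file '/home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1/./data/mm39/protein.fa' not found.
ERROR: Database check failed.
00:00:54 Done.
EOF

error_count=$(grep -c '^ERROR:' "$build_log")
if [ "$error_count" -gt 0 ]; then
    echo "build log has $error_count ERROR line(s):"
    grep '^ERROR:' "$build_log"
fi
```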

The official documentation explains this: https://pcingola.github.io/SnpEff/se_buildingdb/

# To skip the CDS check and get rid of that annoying message, just add -noCheckCds
Checking CDS sequences
When building a database, SnpEff will try to check CDS sequences for all transcripts in the database when

building via GTF/GFF/RefSeq: A CDS sequences FASTA file is available.
building via GenBank file: CDS sequences are available within the GenBank file

Info: You can disable this check using command line option -noCheckCds
## The protein.fa message can be turned off the same way with -noCheckProtein
Checking Protein sequences
This is very similar to the CDS checking in the previous sub-section. When building a database, SnpEff will also try to check Protein sequences for all transcripts when

building via GTF/GFF/RefSeq: A protein sequences FASTA file is available.
building via GenBank file: protein sequences are available within the GenBank file

Info: You can disable this check using command line option -noCheckProtein

Full example

snpeff=/home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1
java -jar $snpeff/snpEff.jar build -c $snpeff/snpEff.config -gtf22 -v mm39 -noCheckCds -noCheckProtein
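
If cds.fa or protein.fa is added later, the two -noCheck options are no longer wanted. The sketch below adds each option only when the corresponding file is absent. It is a dry run that prints the command instead of executing it, and `snpeff_dir` is a scratch directory standing in for the real install path (an assumption for illustration).

```shell
# Sketch only: add -noCheckCds / -noCheckProtein only when the check files are absent.
# snpeff_dir is a scratch directory standing in for the real snpEff folder.
snpeff_dir=$(mktemp -d)
genome=mm39
mkdir -p "$snpeff_dir/data/$genome"

# Collect the snpEff arguments in the positional parameters.
set -- build -c "$snpeff_dir/snpEff.config" -gtf22 -v "$genome"
[ -f "$snpeff_dir/data/$genome/cds.fa" ] || set -- "$@" -noCheckCds
[ -f "$snpeff_dir/data/$genome/protein.fa" ] || set -- "$@" -noCheckProtein

# Dry run: print the command instead of executing it.
build_cmd="java -jar $snpeff_dir/snpEff.jar $*"
echo "$build_cmd"
```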

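Alternatively, download a prebuilt database directly (if a version that suits you exists)

cd /home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1

List the databases that match a keyword

java -jar snpEff.jar databases | grep -i musculus

Download the one you need

java -jar snpEff.jar download -v GRCm38.99
PS: the musculus search does not list mm39.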

Example
(java12) [u20111230014@cpu20 snpeff-5.1-1]$ java -jar snpEff.jar databases mm39 |grep mm39
mm39                                                            Mouse                                                                                                             [https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_mm39.zip, https://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_mm39.zip]
URL:    [https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_mm39.zip, https://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_mm39.zip]
PS: If none of these can be downloaded, give up and build the database yourself!

Result of building the database myself

(java12) [u20111230014@cpu13 mm39]$ ll
total 528076
lrwxrwxrwx 1 u20111230014 u20111230014        78 Mar 21 21:18 genes.gtf -> /home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1/genes_ncbiref.gtf
-rw-rw-r-- 1 u20111230014 u20111230014  18358758 Mar 21 21:30 sequence.10.bin
-rw-rw-r-- 1 u20111230014 u20111230014  18814242 Mar 21 21:31 sequence.11.bin
-rw-rw-r-- 1 u20111230014 u20111230014  16244773 Mar 21 21:31 sequence.12.bin
-rw-rw-r-- 1 u20111230014 u20111230014  15552650 Mar 21 21:31 sequence.13.bin
-rw-rw-r-- 1 u20111230014 u20111230014  16799010 Mar 21 21:31 sequence.14.bin
-rw-rw-r-- 1 u20111230014 u20111230014  13908440 Mar 21 21:31 sequence.15.bin
-rw-rw-r-- 1 u20111230014 u20111230014  13143881 Mar 21 21:32 sequence.16.bin
-rw-rw-r-- 1 u20111230014 u20111230014  12986251 Mar 21 21:32 sequence.17.bin
-rw-rw-r-- 1 u20111230014 u20111230014  11873657 Mar 21 21:32 sequence.18.bin
-rw-rw-r-- 1 u20111230014 u20111230014   9351614 Mar 21 21:32 sequence.19.bin
-rw-rw-r-- 1 u20111230014 u20111230014  26283396 Mar 21 21:30 sequence.1.bin
-rw-rw-r-- 1 u20111230014 u20111230014  27458405 Mar 21 21:33 sequence.2.bin
-rw-rw-r-- 1 u20111230014 u20111230014  18794078 Mar 21 21:33 sequence.3.bin
-rw-rw-r-- 1 u20111230014 u20111230014  21625121 Mar 21 21:33 sequence.4.bin
-rw-rw-r-- 1 u20111230014 u20111230014  21946586 Mar 21 21:33 sequence.5.bin
-rw-rw-r-- 1 u20111230014 u20111230014  21864415 Mar 21 21:34 sequence.6.bin
-rw-rw-r-- 1 u20111230014 u20111230014  20342391 Mar 21 21:34 sequence.7.bin
-rw-rw-r-- 1 u20111230014 u20111230014  17718350 Mar 21 21:34 sequence.8.bin
-rw-rw-r-- 1 u20111230014 u20111230014  18536971 Mar 21 21:34 sequence.9.bin
-rw-rw-r-- 1 u20111230014 u20111230014    392955 Mar 21 21:35 sequence.bin
lrwxrwxrwx 1 u20111230014 u20111230014        48 Mar 21 21:09 sequences.fa -> /home/u20111230014/workspace/genome/mm39/mm39.fa
-rw-rw-r-- 1 u20111230014 u20111230014  14998439 Mar 21 21:35 sequence.X.bin
-rw-rw-r-- 1 u20111230014 u20111230014    831423 Mar 21 21:35 sequence.Y.bin
-rw-rw-r-- 1 u20111230014 u20111230014 182877620 Mar 21 21:30 snpEffectPredictor.bin

The generated .bin files mean the build succeeded!
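
The same check can be scripted. The sketch below treats a non-empty snpEffectPredictor.bin (the largest file in the listing above) as the sign of a finished build. That criterion and the helper name `check_db` are assumptions, and `db_dir` is a scratch directory with a placeholder file rather than the real data/mm39.

```shell
# Sketch only: treat the build as finished if snpEffectPredictor.bin exists and is non-empty.
# db_dir stands in for data/mm39; the placeholder file is not a real database.
db_dir=$(mktemp -d)
printf 'placeholder' > "$db_dir/snpEffectPredictor.bin"

check_db() {
    # $1 = database directory
    if [ -s "$1/snpEffectPredictor.bin" ]; then
        echo "database present in $1"
    else
        echo "snpEffectPredictor.bin missing or empty in $1" >&2
        return 1
    fi
}

check_db "$db_dir"
```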

Annotation

Working directory: /home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1
########## If annotation fails with an error like this:

(java12) [u20111230014@cpu19 snpeff-5.1-1]$ more Out.eff 
java.lang.RuntimeException: Property: 'mm39.DPP-0-All.filtered.vcf.genome' not found
        at org.snpeff.interval.Genome.<init>(Genome.java:104)
        at org.snpeff.snpEffect.Config.readGenomeConfig(Config.java:784)
        at org.snpeff.snpEffect.Config.readConfig(Config.java:751)
        at org.snpeff.snpEffect.Config.init(Config.java:529)
        at org.snpeff.snpEffect.Config.<init>(Config.java:116)
        at org.snpeff.SnpEff.loadConfig(SnpEff.java:429)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:889)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:875)
        at org.snpeff.SnpEff.run(SnpEff.java:1141)
        at org.snpeff.SnpEff.main(SnpEff.java:160)
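
The property named in the exception is the genome name followed by ".genome", so a quick pre-flight check is to look that key up in the config before starting a run. A minimal sketch follows. The helper `genome_in_config` is hypothetical, `cfg` is a scratch file holding only the entry added earlier in this post, and accepting both ":" and "=" as separators is an assumption based on the Java properties format.

```shell
# Sketch only: check that <name>.genome is defined in a config file before running eff.
# cfg is a scratch file; in practice it would be snpEff.config.
cfg=$(mktemp)
echo "mm39.genome:mm39" > "$cfg"

genome_in_config() {
    # $1 = genome name, $2 = config file
    # index() compares literally, so dots in the name are not treated as wildcards.
    awk -v key="$1.genome" '
        { line = $0; sub(/^[ \t]+/, "", line) }
        index(line, key) == 1 && substr(line, length(key) + 1) ~ /^[ \t]*[:=]/ { found = 1 }
        END { exit found ? 0 : 1 }
    ' "$2"
}

for name in mm39 mm39.DPP-0-All.filtered.vcf; do
    if genome_in_config "$name" "$cfg"; then
        echo "$name: defined"
    else
        echo "$name: not defined, snpEff would fail as above"
    fi
done
```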

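PS: 1. Remember to state the reference genome explicitly: -v mm39 (the error above shows the VCF file name being read as the genome name);
2. Symlink the VCF files to be analysed into the snpEff directory,
/home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1
3. and name each link so that it starts with the reference genome folder name, as follows:

# symlinks
ln -s /home/u20111230014/workspace/genome/mm10_to_mm39/DPP-0-mm39_All.filtered.vcf mm39.DPP-0-All.filtered.vcf
ln -s /home/u20111230014/workspace/genome/mm10_to_mm39/DPP-1-mm39_All.filtered.vcf mm39.DPP-1-All.filtered.vcf

## Run eff to annotate the VCF files (remember to state the reference genome: -v mm39 !!!)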
snpeff=/home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1

java -jar $snpeff/snpEff.jar eff -c $snpeff/snpEff.config -v mm39 mm39.DPP-0-All.filtered.vcf -csvStats mm39.DPP-0-All.filtered.positive.csv -stats mm39.DPP-0-All.filtered.positive.html > mm39.DPP-0-All.filtered.positive.snp.eff.vcf

java -jar $snpeff/snpEff.jar eff -c $snpeff/snpEff.config -v mm39 mm39.DPP-1-All.filtered.vcf -csvStats mm39.DPP-1-All.filtered.positive.csv -stats mm39.DPP-1-All.filtered.positive.html > mm39.DPP-1-All.filtered.positive.snp.eff.vcf

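A successful annotation run produces four files: positive.csv; positive.genes.txt, which summarises the number of variant sites per gene; positive.html, which summarises the counts per variant type; and positive.snp.eff.vcf.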
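
With more than a couple of samples, typing each command by hand gets error-prone. The sketch below generates one eff command per sample from the naming scheme used in this post. It is a dry run that only prints the commands, and the helper name `annot_cmd` and the sample list are assumptions for illustration.

```shell
# Sketch only: print one annotation command per sample (dry run, nothing is executed).
snpeff=/home/u20111230014/miniconda3/envs/java12/share/snpeff-5.1-1
genome=mm39

annot_cmd() {
    # $1 = sample label, e.g. DPP-0
    in_vcf="$genome.$1-All.filtered.vcf"
    prefix="$genome.$1-All.filtered.positive"
    echo "java -jar $snpeff/snpEff.jar eff -c $snpeff/snpEff.config -v $genome $in_vcf -csvStats $prefix.csv -stats $prefix.html > $prefix.snp.eff.vcf"
}

for sample in DPP-0 DPP-1; do
    annot_cmd "$sample"
done
```

The printed lines can be reviewed first and then pasted into a job script such as eff.slurm.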

(java12) [u20111230014@cpu19 snpeff-5.1-1]$ ll
total 1049560
drwxrwxr-x 4 u20111230014 u20111230014      4096 Mar 21 16:57 data
-rwxr-xr-x 1 u20111230014 u20111230014      1711 Mar 22 11:03 eff.slurm
-rw-rw-r-- 1 u20111230014 u20111230014     80680 Mar 22 11:08 mm39.DPP-0-All.filtered.positive.csv
-rw-rw-r-- 1 u20111230014 u20111230014   8186198 Mar 22 11:08 mm39.DPP-0-All.filtered.positive.genes.txt
-rw-rw-r-- 1 u20111230014 u20111230014    398810 Mar 22 11:08 mm39.DPP-0-All.filtered.positive.html
-rw-rw-r-- 1 u20111230014 u20111230014 168440543 Mar 22 11:08 mm39.DPP-0-All.filtered.positive.snp.eff.vcf
lrwxrwxrwx 1 u20111230014 u20111230014        76 Mar 22 10:48 mm39.DPP-0-All.filtered.vcf -> /home/u20111230014/workspace/genome/mm10_to_mm39/DPP-0-mm39_All.filtered.vcf
drwxrwxr-x 3 u20111230014 u20111230014      4096 Mar 17 21:52 scripts
-rwxrwxr-x 2 u20111230014 u20111230014      3258 Feb 25 23:01 snpEff
-rwxr-xr-x 1 u20111230014 u20111230014       414 Mar 21 21:25 snpeff_build.slurm
-rw-rw-r-- 2 u20111230014 u20111230014  10774552 Mar 21 21:11 snpEff.config
-rwxrwxr-x 2 u20111230014 u20111230014  28977645 Jan 21 19:24 snpEff.jar

If upstream/downstream, UTR, intron, or intergenic information is not needed, the following options simplify the output

-no-downstream
-no-upstream
-no-utr
-no-intergenic
-no-intron

Example
java -jar snpEff.jar ann -no-utr -no-downstream -no-upstream -no-intergenic XXX input.vcf.gz > snpeff.vcf

References: https://www.jianshu.com/p/0077beb890ed
https://www.jianshu.com/p/a6e46d0c07ee
https://blog.csdn.net/u012110870/article/details/105530364
