igblast使用
简介:
IgBlast是NCBI设计开发的一种专一的blast工具,特定用于比对抗体( immunoglobulin ,IG)或T细胞受体( T cell receptor,TR)序列。
为什么要专门开发一个比对软件呢?他跟blastn等有什么区别,这得从IG和TR的结构说起,
IG和TR的结构类似,都是由2条轻链和2条重链构成,每条链可以分为可变区(variable domain)和恒定区(constant domain)。可变区还可以进一步分为骨架区(FR)和互补作用区(CDR)。
IG或TR识别抗原的关键在于可变区的高度可变性,这种可变性是由「基因重排」机制产生。
1,下载igblast程序和其他必要文件:
下载地址:https://ftp.ncbi.nih.gov/blast/executables/igblast/release/LATEST,
从下面地址下载 internal_data 和 optional_file 整个目录,这两个目录也是必需的: https://ftp.ncbi.nih.gov/blast/executables/igblast/release/
2,下载IMGT数据库,或者TCR数据库
/database/IMGT_V-QUEST_reference_directory/Homo_sapiens/TR
3,igblast具体使用例子:
igblast数据比对的input文件必须是fasta格式。所以如果你的数据是fastq格式,需要先转一下格式(seqkit fq2fa)
igblastn -query test.fa -out test.out -outfmt 7 -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux -organism human -domain_system imgt -ig_seqtype TCR -num_threads 4 -min_D_match 5 -num_alignments_V 3 -num_alignments_D 3 -num_alignments_J 3
发现了一个新的软件:igblastwrapper
https://github.com/mikessh/migmap
Motivation
IgBlast is an excellent of V-(D)-J mapping tool able to correctly map even severely hypermutated antibody variants. While being a gold standard, the following limitations of IgBlast v1.4.0 have driven MIGMAP development:
-
It doesn't extract sequence of CDR3 region directly, neither provide coordinates for CDR3 region in reads. It reports reference Cys residue of Variable segment and Variable segment end in CDR3, but not Phe/Trp residue of J segment that marks the end of CDR3
-
Output is not straightforward to parse and summarize to a readable clonotype abundance table containing CDR3 sequences, segment assignments and list of somatic hypermutations
-
It doesn't account for sequence quality
-
It is somewhat hard to make it running with a custom segment reference and species other than human and mouse
# 查看帮助文档
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -h
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --use-kabat -q 70 ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --use-kabat -q 70 --illegal-access ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --by-read --data-dir ./ --use-kabat --allow-incomplete --details ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
都没有成功,先试一下测试文件吧
cd /software/biosoftware/migmap-1.0.3
./migmap -R TRA,TRB -S human ./test/sample.fasta.gz migmap.out.txt
成功
回到工作目录:
不再使用,忽略掉。
发现 igblast网页版和本地版不一致情况
igblastn -query test.igblast.fa -out test.fmt.out -outfmt 7 -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux -organism human -domain_system imgt -ig_seqtype TCR -min_D_match 5 -num_alignments_V 3 -num_alignments_D 3 -num_alignments_J 3
本地版参数:
igblastn -query test.fa -out test.out -outfmt 7 -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux -organism human -domain_system imgt -ig_seqtype TCR -num_threads 4 -min_D_match 5 -num_alignments_V 3 -num_alignments_D 3 -num_alignments_J 3
本地版结果:
# IGBLASTN
# Query: E00509:314:HNTL2CCXY:4:1101:8024:1924 1:N:0:CTCTCTAC
# Database: ./database/human_TR_V ./database/human_TR_D ./database/human_TR_J
# Domain classification requested: imgt
# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence.
# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand). Multiple equivalent top matches, if present, are separated by a comma.
TRAV26-2*01 TRAJ33*01,TRAJ53*01 VA No Out-of-frame No -
# V-(D)-J junction details based on top germline gene matches (V end, V-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated
GGCAA N/A AGCAA
# Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
FR3-IMGT 84 106 23 19 4 0 82.6
Total N/A N/A 23 19 4 0 82.6
# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, gaps, q. start, q. end, s. start, s. end, evalue, bit score
# 6 hits found
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAV26-2*01 82.609 23 4 0 0 84 106 168 190 0.20 25.2
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRBV23/OR9-2*01 80.000 20 4 0 0 128 147 35 16 5.2 20.5
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRBV23/OR9-2*02 80.000 20 4 0 0 128 147 35 16 5.2 20.5
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ33*01 100.000 8 0 0 0 107 114 6 13 8.1 16.1
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ53*01 100.000 8 0 0 0 107 114 15 22 8.1 16.1
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ18*01 100.000 7 0 0 0 141 147 14 20 31 14.1
网页版结果:使用网页上默认的参数。
# IGBLASTN
# Query: E00509:314:HNTL2CCXY:4:1101:8024:1924 1:N:0:CTCTCTAC
# Database: IG_DB/imgt.TR.Homo_sapiens.V.f.orf.p IG_DB/imgt.TR.Homo_sapiens.D.f.orf IG_DB/imgt.TR.Homo_sapiens.J.f.orf.p
# Domain classification requested: imgt
# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence.
# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand). Multiple equivalent top matches, if present, are separated by a comma.
TRAV26-2*01 TRAJ33*01,TRAJ53*01 VA No Out-of-frame No -
# V-(D)-J junction details based on top germline gene matches (V end, V-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated in parentheses (i.e., (TACT)) but are not included under the V, D, or J gene itself
GGCAA N/A AGCAA
# Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
FR3-IMGT 84 106 23 19 4 0 82.6
Total N/A N/A 23 19 4 0 82.6
# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, gaps, q. start, q. end, s. start, s. end, evalue, bit score
# 6 hits found
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAV26-2*01 82.609 23 4 0 0 84 106 168 190 0.22 25.2
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRBV7-9*04 92.308 13 1 0 0 138 150 109 121 16 19.0
V reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRBV15*01 100.000 11 0 0 0 135 145 13 23 16 19.0
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ33*01 100.000 8 0 0 0 107 114 6 13 9.0 16.1
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ53*01 100.000 8 0 0 0 107 114 15 22 9.0 16.1
J reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924 TRAJ18*01 100.000 7 0 0 0 141 147 14 20 34 14.1
不一致的地方在结果倒数的第四,五行,比对结果不一致。
开始怀疑是设置evalue值的问题,于是本地版设置了evalue 20
igblastn -query one.fa -out one.out -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux -organism human -domain_system imgt -ig_seqtype TCR -min_D_match 5 -evalue 20 -show_translation -num_clonotype 100 -D_penalty -2 -J_penalty -2 -min_V_length 9 -min_J_length 0 -num_alignments_V 5 -num_alignments_D 5 -num_alignments_J 5
结果一致了,
最终版参数:
ls RP01G9E1L3_R{1,2}_*|while read line;do echo "igblastn -query $line -out outfile.igblast.$line.fmt7.out -outfmt 7 -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux -organism human -domain_system imgt -ig_seqtype TCR -num_threads 70 -min_D_match 5 -evalue 20 -show_translation -num_clonotype 100 -D_penalty -2 -J_penalty -2 -min_V_length 9 -min_J_length 0 -num_alignments_V 3 -num_alignments_D 3 -num_alignments_J 3 ";done >run.sh
总结,网页版和本地版的主要不一样是evalue不一样,造成的原因
首先是参数,另一个是database导致的search space不一样.
目前igblast分析方法:
第一步比对:
igblastn -query b0005p1_S1_L001_R1_001.fa -out b0005p1_S1_L001_R1_001.fa.fmt7.out -outfmt 7 -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux -organism human -domain_system imgt -ig_seqtype TCR -num_threads 10 -min_D_match 5 -evalue 20 -show_translation -num_clonotype 100 -D_penalty -2 -J_penalty -2 -min_V_length 9 -min_J_length 0 -num_alignments_V 3 -num_alignments_D 3 -num_alignments_J 3
第二步,将比对结果过滤,提取VDJ信息:过滤evalue大于1e-10,
cat _S1_L001_R1.fmt7.out |grep -E "V|J|^D" |awk '$(NF-1)<1e-10' >S1.R1.igblast.fmt7.out
第三步,统计每条read上比对的VDJ个数,自己写的脚本。
python deal_igblast.py S1.R1.igblast.fmt7.out S1.R1.igblast.4to1.total.out
第四步, 根据上一步结果统计VDJ组合的结果,这一步根据需要。
python filter_igblast_result.py S1.R1.igblast.4to1.total.out S1.R1.out.all.result.txt S1.R1.out.4gene.result.txt S1.R1.out.3gene.result.txt S1.R1.out.2gene.result.txt S1.R1.out.1gene.result.txt
可以统计出各种组合的信息:
gene_in_reads_4: 0
gene_in_reads_3: 61612
AVAJBV AVAJBJ AVBVBJ AJBVBJ
61612 0 0 0
gene_in_reads_2: 282005
AVAJ AVBV AVBJ AJBV AJBJ BVBJ
99723 26009 0 0 0 156273
gene_in_reads_1: 2234397
AV AJ BV BJ
3605 150 2230039 603
total calculate reads: 2578014
如果是PE read,比对是需要分开做的,在统计的时候可以一起统计。