生信笔记免疫组库

igblast使用

2019-07-23  本文已影响0人  11的雾

简介:
IgBlast是NCBI设计开发的一种专一的blast工具,特定用于比对抗体( immunoglobulin ,IG)或T细胞受体( T cell receptor,TR)序列。
为什么要专门开发一个比对软件呢?他跟blastn等有什么区别,这得从IG和TR的结构说起,

IG和TR的结构类似,都是由2条轻链和2条重链构成,每条链可以分为可变区(variable domain)和恒定区(constant domain)。可变区还可以进一步分为骨架区(FR)和互补作用区(CDR)。
IG或TR识别抗原的关键在于可变区的高度可变性,这种可变性是由「基因重排」机制产生。

1,下载igblast程序和其他必要文件:
下载地址:https://ftp.ncbi.nih.gov/blast/executables/igblast/release/LATEST
从下面地址下载 internal_data 和 optional_file 整个目录,这两个目录也是必需的: https://ftp.ncbi.nih.gov/blast/executables/igblast/release/

2,下载IMGT数据库,或者TCR数据库
/database/IMGT_V-QUEST_reference_directory/Homo_sapiens/TR

3,igblast具体使用例子:
igblast数据比对的input文件必须是fasta格式。所以如果你的数据是fastq格式,需要先转一下格式(seqkit fq2fa)

igblastn -query test.fa -out test.out -outfmt 7 -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux  -organism human -domain_system imgt  -ig_seqtype TCR -num_threads 4 -min_D_match 5  -num_alignments_V 3 -num_alignments_D 3 -num_alignments_J 3

发现了一个新的软件:igblastwrapper
https://github.com/mikessh/migmap
Motivation
IgBlast is an excellent of V-(D)-J mapping tool able to correctly map even severely hypermutated antibody variants. While being a gold standard, the following limitations of IgBlast v1.4.0 have driven MIGMAP development:

# 查看帮助文档

java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -h
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --use-kabat -q 70 ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --use-kabat -q 70 --illegal-access ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
java -jar /software/biosoftware/migmap-1.0.3/migmap-1.0.3.jar -R TRA,TRB -S human --by-read --data-dir ./ --use-kabat --allow-incomplete --details  ../02.align/RP01G9E1L3_R1.fq.gz migmap.out.txt
都没有成功,先试一下测试文件吧
cd /software/biosoftware/migmap-1.0.3
./migmap -R TRA,TRB -S human ./test/sample.fasta.gz migmap.out.txt
成功
回到工作目录:
不再使用,忽略掉。

发现 igblast网页版和本地版不一致情况

igblastn -query test.igblast.fa -out test.fmt.out -outfmt 7 -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux -organism human -domain_system imgt -ig_seqtype TCR  -min_D_match 5  -num_alignments_V 3 -num_alignments_D 3 -num_alignments_J 3

本地版参数:

igblastn -query test.fa -out test.out -outfmt 7 -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux  -organism human -domain_system imgt  -ig_seqtype TCR -num_threads 4 -min_D_match 5  -num_alignments_V 3 -num_alignments_D 3 -num_alignments_J 3

本地版结果:

# IGBLASTN
# Query: E00509:314:HNTL2CCXY:4:1101:8024:1924 1:N:0:CTCTCTAC
# Database: ./database/human_TR_V ./database/human_TR_D ./database/human_TR_J
# Domain classification requested: imgt

# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence.

# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand).  Multiple equivalent top matches, if present, are separated by a comma.
TRAV26-2*01     TRAJ33*01,TRAJ53*01     VA      No      Out-of-frame    No      -

# V-(D)-J junction details based on top germline gene matches (V end, V-J junction, J start).  Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated
GGCAA   N/A     AGCAA

# Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
FR3-IMGT        84      106     23      19      4       0       82.6
Total   N/A     N/A     23      19      4       0       82.6

# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, gaps, q. start, q. end, s. start, s. end, evalue, bit score
# 6 hits found
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAV26-2*01     82.609  23      4       0       0       84      106     168     190     0.20    25.2
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRBV23/OR9-2*01 80.000  20      4       0       0       128     147     35      16      5.2     20.5
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRBV23/OR9-2*02 80.000  20      4       0       0       128     147     35      16      5.2     20.5
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ33*01       100.000 8       0       0       0       107     114     6       13      8.1     16.1
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ53*01       100.000 8       0       0       0       107     114     15      22      8.1     16.1
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ18*01       100.000 7       0       0       0       141     147     14      20      31      14.1

网页版结果:使用网页上默认的参数。

# IGBLASTN
# Query: E00509:314:HNTL2CCXY:4:1101:8024:1924 1:N:0:CTCTCTAC
# Database: IG_DB/imgt.TR.Homo_sapiens.V.f.orf.p IG_DB/imgt.TR.Homo_sapiens.D.f.orf IG_DB/imgt.TR.Homo_sapiens.J.f.orf.p
# Domain classification requested: imgt

# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence.

# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand).  Multiple equivalent top matches, if present, are separated by a comma.
TRAV26-2*01     TRAJ33*01,TRAJ53*01     VA      No      Out-of-frame    No      -

# V-(D)-J junction details based on top germline gene matches (V end, V-J junction, J start).  Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated in parentheses (i.e., (TACT)) but are not included under the V, D, or J gene itself
GGCAA   N/A     AGCAA   

# Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
FR3-IMGT        84      106     23      19      4       0       82.6
Total   N/A     N/A     23      19      4       0       82.6

# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, gaps, q. start, q. end, s. start, s. end, evalue, bit score
# 6 hits found
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAV26-2*01     82.609  23      4       0       0       84      106     168     190     0.22    25.2
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRBV7-9*04      92.308  13      1       0       0       138     150     109     121     16      19.0
V       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRBV15*01       100.000 11      0       0       0       135     145     13      23      16      19.0
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ33*01       100.000 8       0       0       0       107     114     6       13      9.0     16.1
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ53*01       100.000 8       0       0       0       107     114     15      22      9.0     16.1
J       reversed|E00509:314:HNTL2CCXY:4:1101:8024:1924  TRAJ18*01       100.000 7       0       0       0       141     147     14      20      34      14.1

不一致的地方在结果倒数的第四,五行,比对结果不一致。

开始怀疑是设置evalue值的问题,于是本地版设置了evalue 20

igblastn -query one.fa -out one.out -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux -organism human -domain_system imgt -ig_seqtype TCR -min_D_match 5 -evalue 20 -show_translation -num_clonotype 100 -D_penalty -2 -J_penalty -2 -min_V_length 9 -min_J_length 0 -num_alignments_V 5 -num_alignments_D 5 -num_alignments_J 5
结果一致了,

最终版参数:

ls RP01G9E1L3_R{1,2}_*|while read line;do echo "igblastn -query $line -out outfile.igblast.$line.fmt7.out -outfmt 7 -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux  -organism human -domain_system imgt  -ig_seqtype TCR -num_threads 70 -min_D_match 5 -evalue 20 -show_translation -num_clonotype 100 -D_penalty -2 -J_penalty -2 -min_V_length 9 -min_J_length 0  -num_alignments_V 3 -num_alignments_D 3 -num_alignments_J 3 ";done >run.sh

总结,网页版和本地版的主要不一样是evalue不一样,造成的原因
首先是参数,另一个是database导致的search space不一样.

目前igblast分析方法:

第一步比对:

igblastn -query b0005p1_S1_L001_R1_001.fa  -out b0005p1_S1_L001_R1_001.fa.fmt7.out -outfmt 7 -germline_db_V ./database/human_TR_V -germline_db_J ./database/human_TR_J -germline_db_D ./database/human_TR_D -auxiliary_data ./optional_file/human_gl.aux  -organism human -domain_system imgt -ig_seqtype TCR -num_threads 10 -min_D_match 5 -evalue 20 -show_translation -num_clonotype 100 -D_penalty -2 -J_penalty -2 -min_V_length 9 -min_J_length 0 -num_alignments_V 3 -num_alignments_D 3 -num_alignments_J 3

第二步,将比对结果过滤,提取VDJ信息:过滤evalue大于1e-10,
cat _S1_L001_R1.fmt7.out |grep -E "V|J|^D" |awk '$(NF-1)<1e-10' >S1.R1.igblast.fmt7.out

第三步,统计每条read上比对的VDJ个数,自己写的脚本。
python deal_igblast.py S1.R1.igblast.fmt7.out S1.R1.igblast.4to1.total.out

第四步, 根据上一步结果统计VDJ组合的结果,这一步根据需要。

python  filter_igblast_result.py S1.R1.igblast.4to1.total.out S1.R1.out.all.result.txt S1.R1.out.4gene.result.txt S1.R1.out.3gene.result.txt S1.R1.out.2gene.result.txt S1.R1.out.1gene.result.txt
可以统计出各种组合的信息:
gene_in_reads_4: 0
gene_in_reads_3: 61612
AVAJBV  AVAJBJ  AVBVBJ  AJBVBJ
61612   0       0       0
gene_in_reads_2: 282005
AVAJ    AVBV    AVBJ    AJBV    AJBJ    BVBJ
99723   26009   0       0       0       156273
gene_in_reads_1: 2234397
AV      AJ      BV      BJ
3605    150     2230039 603
total calculate reads: 2578014

如果是PE read,比对是需要分开做的,在统计的时候可以一起统计。

上一篇下一篇

猜你喜欢

热点阅读