基因融合 — step2 Delly 软件
一、概述
Delly is an integrated structural variant (SV) prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read massively parallel sequencing data. It uses paired-ends, split-reads and read-depth to sensitively and accurately delineate genomic rearrangements throughout the genome. Structural variants can be visualized using Delly-maze and Delly-suave.
官网或下载地址:
https://github.com/dellytools/delly(https://tobiasrausch.com/delly/)
发表的文章、影响因子及被引用量:
Tobias Rausch, Thomas Zichner, Andreas Schlattl, Adrian M. Stuetz, Vladimir Benes, Jan O. Korbel. Delly: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 2012 28: i333-i339.
Delly由C++编写,发表时间:2012
被引量:307
期刊及影响因子:Bioinformatics 4.531(2018)
作者及单位:Tobias Rausch 欧洲分子生物学实验室EMBL Contact: tobias.rausch@embl.de
Delly是预测SV软件,融合基因本质属于SV一种,所以学习delly还需要先了解一些SV基本知识及预测方法。下面有篇大神文章介绍了SV的四种检测方法,可以参考。其中delly运用了其中两种,RP和SR。所以灵敏度高,应用广泛。
一篇文章说清楚基因组结构性变异检测的方法
二、下载及安装
wget https://github.com/dellytools/delly/releases/download/v0.8.1/delly_v0.8.1_linux_x86_64bit
chmod a+x delly_v0.8.1_linux_x86_64bit
这个链接是编译好的二进制文件,如果下载源码自己编译也支持,详细见官网:https://github.com/dellytools/delly
三、运行
运行分两步,一步是delly call SV,一步是bcf2vcf转化。当然如果分不同类型call SV,还需要merge步骤,这里就不介绍了。
1)delly call SV
delly_v0.8.1_linux_x86_64bit call -t ALL -g ucsc.hg19.fasta -o sample_name.bcf -x human.hg19.excl.tsv sample_name.bam
参数选择:
Usage: delly call [OPTIONS] -g <ref.fa> <sample1.sort.bam> <sample2.sort.bam> ...
Generic options:
-? [ --help ] show help message
-t [ --svtype ] arg (=ALL) SV type to compute [DEL, INS, DUP, INV,
BND, ALL]
-g [ --genome ] arg genome fasta file
-x [ --exclude ] arg file with regions to exclude
-o [ --outfile ] arg (="sv.bcf") SV BCF output file
Discovery options:
-q [ --map-qual ] arg (=1) min. paired-end (PE) mapping quality
-r [ --qual-tra ] arg (=20) min. PE quality for translocation
-s [ --mad-cutoff ] arg (=9) insert size cutoff, median+s*MAD (deletions
only)
-c [ --minclip ] arg (=25) min. clipping length
-m [ --minrefsep ] arg (=50) min. reference separation
-n [ --maxreadsep ] arg (=10) max. read separation
Genotyping options:
-v [ --vcffile ] arg input VCF/BCF file for genotyping
-u [ --geno-qual ] arg (=5) min. mapping quality for genotyping
-d [ --dump ] arg gzipped output file for SV-reads (optional)
其中,-t参数如果没有选择ALL,可以用merge命令进行合并
-x是输入的bed格式文件,排除这个区域
-x的下载见delly git地址:https://github.com/dellytools/delly/tree/master/excludeTemplates
2)bcftools bcf2vcf
bcftools view sample_name.bcf > sample_name.vcf
bcf文件转为vcf时,发现里面一片空白,可能是bcftools版本过低,建议更新后再转换格式即可。
如果需要融合结果,再从vcf结果中提取出融合即可(提取的方法会在后面简单介绍 )。
四、文献解读
DELLY: structural variant discovery by integrated paired-end and split-read analysis
1.引言
介绍了SV检测的四种方法:paired-end mapping methods、read-depth analysis、split-read analysis、sequence assembly。每个方法都有自己的优缺点,单一方法已经无法满足现有检测限。
并介绍了这几种方法的优劣:
These changes in sequencing parameters and strategy significantly affect SV discovery. Split read analysis methods should benefit from relatively long reads whereas read counting methods are expected to suffer from a 3-fold reduction in the number of sequenced reads. Compared to read depth and paired-end analysis, split-read analysis has so far been limited to the detection of small SVs (Ye et al., 2009) as well as SVs in ?unique? genomic regions, devoid of repeats and segmental duplications (Wang et al., 2011). Most importantly, certain classes of SVs, including tandem duplications, inversions, translocations and particularly highly complex events, are not yet sensitively ascertained in split-read-based analyses.
SR检测方法,更适用于读长长的reads,对于reads过短,检测受限;并且对于重复区域和其他更复杂的SV区域 SR方法效果不佳。
Delly结合了PE和SR两种方法,能够检测deletions,tandem duplications, inversions and translocations等SV。delly能够适用多种测序长度的文库。
工作方法如下图:
Paired-end predicted structural variants are then refined using split-reads and reported at single-nucleotide breakpoint resolution
PE方法call SV,SR方法用来进一步确认,并寻找断点。

2.方法
2.1 Paired-end mapping analysis
DELLY computes the default read-pair orientation and the paired-end insert size distribution characterized by the median and standard deviation of the library.
以文库的中值和标准差为特征,计算RP的方向和插入片段的分布
Based on these parameters, DELLY then identifies all discordantly mapped read-pairs that either have an abnormal orientation or an insert size greater than the expected range
根据以上参数,delly检测不一致的RP,如方向比对异常或者插入片段长度远大于预期范围。
DELLY hereby focuses on uniquely mapping paired-ends and the default insert size cutoff is three standard deviations from the median insert size
delly的关注点在uniq mapping的PE和插入长度在三个标准差以上的。

1)Deletions
PE方向正确,但两个mate reads之前的距离(ref计算出来的)偏离插入片段长度分布(实际的长度)
2)Inversions
PE有一段方向非正常
3)Tandem duplications
在某一位置,PE的相对顺序变化,发生互调。但是相对方向依然是相反的。
4)Tandem duplications
PE比对到不同染色体。
找到这些All discordantly mapped paired-ends 后,再通过delly设计的算法,进行处理。
2.2 Split-read analysis
基于上面的PE分析,接下来的SR分析包含以下几个步骤:
The splitread analysis is a multi-step process, including (i) candidate split-read search, (ii) SV reference extraction, (iii) indexing and k-mer counting, (iv) detecting the best supported breakpoint, (v) split-read consensus computation and (vi) an alignment of the split-read consensus sequence to the SV reference region
这里提到一个概念:A single-anchored paired-end is a read pair where one read maps to the reference and the other read is unmapped
如下图。SR会根据PE的结果,对SV进行进一步的确认。

五、结果解读
结果是经过bcftools转换得到的vcf,格式如下:
vcf中header(##部分)对这些字段都有解释:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT samplename
chr2 29451721 DUP00000016 G <DUP> . PASS IMPRECISE;SVTYPE=DUP;SVMETHOD=EMBL.DELLYv0.8.1;CHR2=chr2;END=33141401;PE=41;MAPQ=44;CT=5to3;CIPOS=-50,50;CIEND=-50,50 GT:GL:GQ:FT:RCL:RC:RCR:CN:DR:DV:RR:RV 0/1:-146.184,0,-879.731:10000:PASS:61344:49436:52:2:157:53:0:0
chr2 29451782 DEL00000019 C <DEL> . PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.8.1;CHR2=chr2;END=33141287;PE=0;MAPQ=0;CT=3to5;CIPOS=-2,2;CIEND=-2,2;SRMAPQ=60;INSLEN=0;HOMLEN=1;SR=7;SRQ=0.968354;CONSENSUS=CAGGGAGGTGAGGAGCCTAGGACTAAGCAGGGAGGGAGTGACACCTTGAACACGAATCATCTTTACCTATATATCCTCCGCCTCCTCCACCTGAGGAGCGGCTCGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGG;CE=1.77251 GT:GL:GQ:FT:RCL:RC:RCR:CN:DR:DV:RR:RV 0/0:0,-99.1513,-978.105:10000:PASS:61612:49086:52:2:0:0:337:1
chr2 33141419 BND00000040 G ]chr1:156836525]G . LowQual PRECISE;SVTYPE=BND;SVMETHOD=EMBL.DELLYv0.8.1;CHR2=chr1;END=156836525;PE=0;MAPQ=0;CT=5to3;CIPOS=-1,1;CIEND=-1,1;SRMAPQ=3;INSLEN=0;HOMLEN=0;SR=5;SRQ=0.980392;CONSENSUS=TTGTTGTCCCAGGGAAAGGAGTGGGACTCAGAAAGTCCCCGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG;CE=0.984313 GT:GL:GQ:FT:RCL:RC:RCR:CN:DR:DV:RR:RV 0/1:-11.0961,0,-19.6233:111:PASS:83:224:2203:0:0:0:10:9
总共10列,分别来看:
第一列:染色体
第二列:突变起始位置
第三列:ID 包含了SV类型及编号
SV类型有五种:DEL(Deletion 大片段缺失),
INS(Insertion 大片段插入),
DUP(Duplication 串联重复),
INV(Inversion 染色体倒置),
BND(Translocation 染色体易位)
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=BND,Description="Translocation">
##ALT=<ID=INS,Description="Insertion">
第四列:突变起始位置的碱基(ref)
可能是A/T/C/G/N 五种情况
第五列:ALT 变异情况
这列信息,除了BND类型外,全部展示变异类型 <变异类型>
如果变异类型是BND,这一列展示变异结束位点的位置及碱基情况
第六列:QUAL 质量值
一般是"."
第七列:是否通过DELLY质控
通常是 LowQual (Poor quality and insufficient number of PEs and SRs )或者 PASS (通过)
##FILTER=<ID=LowQual,Description="Poor quality and insufficient number of PEs and SRs.">
第八列:INFO
具体可以查看Illumina的文档:https://github.com/Illumina/manta/blob/master/docs/userGuide/README.md
说明几个关键字段(除第一个字段外,字段是类型和具体信息用=分割):
1)第一个字段
IMPRECISE PRECISE 是否精确匹配 IMPRECISE:指示该SV是不准确的,无法获取准确的断点位置信息
2)SVTYPE
SV类型
3)SVMETHOD
Type of approach used to detect SV 这里会列出工具来源(EMBL 因为delly是EMBL发表的)及DELLY版本
4)CHR2
Chromosome for END coordinate in case of a translocation 之前介绍了vcf中第二列是突变的起始位置,这里就是突变的结束位置的染色体号
5)END
突变的结束位置
6)MAPQ
Median mapping quality of paired-ends
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="PE confidence interval around END">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="PE confidence interval around POS">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate in case of a translocation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=PE,Number=1,Type=Integer,Description="Paired-end support of the structural variant">
##INFO=<ID=MAPQ,Number=1,Type=Integer,Description="Median mapping quality of paired-ends">
##INFO=<ID=SRMAPQ,Number=1,Type=Integer,Description="Median mapping quality of split-reads">
##INFO=<ID=SR,Number=1,Type=Integer,Description="Split-read support">
##INFO=<ID=SRQ,Number=1,Type=Float,Description="Split-read consensus alignment quality">
##INFO=<ID=CONSENSUS,Number=1,Type=String,Description="Split-read consensus sequence">
##INFO=<ID=CE,Number=1,Type=Float,Description="Consensus sequence entropy">
##INFO=<ID=CT,Number=1,Type=String,Description="Paired-end signature induced connection type">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise structural variation">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVMETHOD,Number=1,Type=String,Description="Type of approach used to detect SV">
##INFO=<ID=INSLEN,Number=1,Type=Integer,Description="Predicted length of the insertion">
##INFO=<ID=HOMLEN,Number=1,Type=Integer,Description="Predicted microhomology length using a max. edit distance of 2">
第九列:FORMAT 标签 :分割
一般是:GT:GL:GQ:FT:RCL:RC:RCR:CN:DR:DV:RR:RV
下面说明几个重要的标签:
1)GT
Genotype
2)DR
high-quality reference reads
3)DV
high-quality variant reads
4)RR
high-quality reference junction reads 参考基因组断点支持数
5)RV
high-quality variant junction reads 突变断点支持数
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Log10-scaled genotype likelihoods for RR,RA,AA genotypes">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=FT,Number=1,Type=String,Description="Per-sample genotype filter">
##FORMAT=<ID=RC,Number=1,Type=Integer,Description="Raw high-quality read counts for the SV">
##FORMAT=<ID=RCL,Number=1,Type=Integer,Description="Raw high-quality read counts for the left control region">
##FORMAT=<ID=RCR,Number=1,Type=Integer,Description="Raw high-quality read counts for the right control region">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Read-depth based copy-number estimate for autosomal sites">
##FORMAT=<ID=DR,Number=1,Type=Integer,Description="# high-quality reference pairs">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality variant pairs">
##FORMAT=<ID=RR,Number=1,Type=Integer,Description="# high-quality reference junction reads">
##FORMAT=<ID=RV,Number=1,Type=Integer,Description="# high-quality variant junction reads">
第十列及以后:各个样本的情况 对应FORMAT列
:分割 对应FORMAT列 表示各个样本FORMAT具体信息
举例:
GT:GL:GQ:FT:RCL:RC:RCR:CN:DR:DV:RR:RV
0/1:-146.184,0,-879.731:10000:PASS:61344:49436:52:2:157:53:0:0
对应上面就是GT基因型是0/1;
DR ref reads 是157;
DV alt reads 是53;
RR ref junction reads 是0;
RV alt junction reads是0;
六、融合结果的提取
接下来从vcf中提取出融合基因结果即可(靠自己写个脚本啦)。
提取原则主要是,提取出起始位点终止位点并进行染色体区域注释,如果两个位点位于不同基因,则认为这个位点是一个融合基因。
其他筛选原则还可以选择,基因型是否为GT="0/0",FILTER列是否为PASS,是否精确匹配有明确断点PRECISE,突变频率是否为0,还可以用Bedtools筛选指定区域等等。这就看各位的需求是否严格啦~