Fusion Detection with FACTERA

2019-06-05 晓佥

FACTERA: https://factera.stanford.edu/download.php

FACTERA (Fusion And Chromosomal Translocation Enumeration and Recovery Algorithm) is a tool for detection of genomic fusions in paired-end targeted (or genome-wide) sequencing data.

Command

perl factera.pl [options] tumor.bam exons.bed hg19.2bit [optional: targets.bed]
The main program is a Perl script, so you can modify parts of it yourself to make it report more fusions.
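
As a minimal sketch, the invocation above can be assembled and launched from Python. The helper names build_factera_command and run_factera and the file paths are hypothetical; only the argument order and the option letters come from the usage line and the options table in this post.

```python
import subprocess

def build_factera_command(bam, exons_bed, twobit, targets_bed=None,
                          options=None, script="factera.pl"):
    """Return argv for: perl factera.pl [options] bam exons 2bit [targets]."""
    cmd = ["perl", script]
    # Options come before the positional arguments, as in the usage line.
    for flag, value in (options or {}).items():
        cmd.append(flag)
        if value is not None:      # switches such as -e take no value
            cmd.append(str(value))
    cmd += [bam, exons_bed, twobit]
    if targets_bed is not None:    # optional fourth positional argument
        cmd.append(targets_bed)
    return cmd

def run_factera(cmd):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)

# Only builds and prints the command; nothing is executed here.
print(" ".join(build_factera_command("tumor.bam", "exons.bed", "hg19.2bit",
                                     options={"-r": 5, "-p": 10})))
```

Keeping command construction separate from execution makes it easy to log or inspect the exact command before a long run.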

Input

tumor.bam should consist of paired-end reads aligned by a mapping algorithm capable of soft-clipping, such as BWA. The BAM file does not need to be realigned or deduped, but it should be position-sorted and have a corresponding index file (tumor.bam.bai, created using samtools index) in the same directory, so that FACTERA can estimate the total sequencing depth in the neighborhood of each detected fusion.
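
Before starting a run, it can help to confirm that the index sits next to the BAM file. The sketch below assumes the index is named by appending .bai to the BAM file name, which is what samtools index produces by default; find_bam_index is a hypothetical helper, not part of FACTERA.

```python
from pathlib import Path

def find_bam_index(bam_path):
    """Return the .bam.bai file next to the BAM, or None if it is missing."""
    bam = Path(bam_path)
    # Assumption: default samtools naming, e.g. tumor.bam -> tumor.bam.bai
    index = bam.with_name(bam.name + ".bai")
    return index if index.is_file() else None
```

If this returns None, running samtools index on the position-sorted BAM creates the missing file.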

exons.bed contains chromosomal coordinates (such as exon boundaries) in 4-column BED format (chr start end name). The fourth column contains gene names, exon names, or any arbitrary identifier, and is used to group corresponding coordinates. This allows the resolution of fusion detection to be restricted to inter-gene or inter-exon fusions, for example. Make sure to use coordinates from the same genome version as the 2bit reference sequence required in the third argument.
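
The grouping by the fourth column can be pictured with a short sketch. It is only an illustration of the idea, not FACTERA's own parsing code; the function name, region names, and coordinates are made up.

```python
from collections import defaultdict

def group_bed_by_name(lines):
    """Group (chrom, start, end) intervals by the identifier in column 4."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue               # skip blank and comment lines
        if len(fields) < 4:
            raise ValueError(f"expected at least 4 columns: {line!r}")
        chrom, start, end, name = fields[:4]
        groups[name].append((chrom, int(start), int(end)))
    return dict(groups)

# Made-up coordinates: two exons of GENE_A and one of GENE_B.
example = [
    "chr2\t100\t200\tGENE_A",
    "chr2\t300\t400\tGENE_A",
    "chr2\t900\t950\tGENE_B",
]
print(group_bed_by_name(example))
```

With gene symbols in column 4, both GENE_A exons fall into one group, which corresponds to inter-gene resolution. Giving each exon its own identifier would instead yield inter-exon resolution.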

Users can download an exons.bed file from the FACTERA site that combines hg19 RefSeq and Gencode v17 exon coordinates (downloaded from UCSC 02-23-14) with corresponding HUGO gene symbols in column 4. With this file, FACTERA will identify inter-gene fusions. To identify inter-exon fusions in hg19, use the separate exon-level exons.bed file provided there.

hg19.2bit is a 2bit-encoded human reference genome, used for fast genome subsequence retrieval. Of note, FACTERA is not restricted to human sequences, and any 2bit reference genome can be used as long as coordinates in exons.bed are consistent. To create a 2bit file for a genome of interest, download the FASTA-to-2bit conversion tool (faToTwoBit) from the appropriate system folder and follow the accompanying instructions.

targets.bed is optional and allows the user to restrict the FACTERA search to genomic regions of interest, such as those targeted by a sequencing capture library. Format is a standard 3-column BED (chr start end). The use of a targets.bed file can greatly improve running time when only a subset of sequenced regions is known to be relevant for fusion detection.
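
A malformed targets file is an easy mistake to make, so a quick check before the run can save time. The following is a minimal sketch of such a check for 3-column BED lines; read_targets is a hypothetical helper and the validation rules are an assumption about what counts as a sensible interval.

```python
def read_targets(lines):
    """Parse 3-column BED lines into (chrom, start, end) tuples, validating each."""
    intervals = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue               # ignore blank lines
        if len(fields) < 3:
            raise ValueError(f"line {number}: expected chr, start, end")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end <= start:
            raise ValueError(f"line {number}: invalid interval {start}-{end}")
        intervals.append((chrom, start, end))
    return intervals
```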

Output

Each FACTERA run produces 9 main output files, each of which is described below:

parameters.txt = all input files and parameter values.

discordantpair.depth.txt = ranked list of discordant read clusters.

discordantpair.details.txt = discordant read positions.

fusiontargets.bed = bed coordinates of candidate fusions – used to restrict search space for soft-clipped reads.

blastreads.fa = used to build blast database of soft-clipped, improperly paired, and unmapped reads.

blastquery.fa = file used to search individual candidate fusion sequences (query) for hits in blastreads.fa (target database).

fusionseqs.fa = all detected breakpoints with 500bp of additional flanking sequences.

fusions.bed = bed output for detected fusions. Useful for comparing runs or somatic vs germline (column 4 is fusion ID).

fusions.txt = all detected fusion events, including details, described below:

Field	Description
Est_Type	Estimated structural variant type: TRA = translocation; INV = inversion; DEL = deletion; '-' = not determined
Region1	Name of genomic region closest to breakpoint 1 (e.g., gene 1, exon 1, etc.)
Region2	Name of genomic region closest to breakpoint 2 (e.g., gene 2, exon 2, etc.)
Break1	Chromosomal breakpoint 1
Break2	Chromosomal breakpoint 2
Break_support1	Number of reads supporting breakpoint 1
Break_support2	Number of reads supporting breakpoint 2
Break_Offset	Breakpoint adjustment in bases (e.g., owing to microhomology)
Order1	Orientation of read clipping with respect to breakpoint 1: CN, clipped followed by not clipped; NC, vice versa
Order2	Same as Order1, but for breakpoint 2
Break_depth	Number of breakpoint-spanning reads
Proper_pair_support	Number of properly paired and previously soft-clipped reads that map to fusion
Unmapped_support	Number of previously unmapped reads that map to fusion
Improper_pair_support	Number of previously discordantly paired reads that map to fusion
Paired_end_depth	Total number of paired-end reads that flank breakpoint
Total_depth	Mean total depth for regions flanking both breakpoints (+/-500bp by default)
Fusion_seq	Estimated fusion sequence (50 bases flanking breakpoint by default)
Non-templated_seq	Non-templated (i.e., non-reference) sequence segment (if any) enclosed in brackets
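
For downstream filtering, the fusions table can be loaded with Python's csv module. This sketch assumes fusions.txt is tab-delimited with a header row that uses the field names listed above. The demo rows, region names, and the support threshold of 5 are made up for illustration.

```python
import csv
import io

def read_fusions(handle):
    """Read a tab-delimited fusions table into a list of dicts keyed by field name."""
    return list(csv.DictReader(handle, delimiter="\t"))

def min_break_support(row):
    """Smaller of the two per-breakpoint read counts for one fusion."""
    return min(int(row["Break_support1"]), int(row["Break_support2"]))

# Made-up rows containing only a few of the documented columns.
demo = io.StringIO(
    "Est_Type\tRegion1\tRegion2\tBreak_support1\tBreak_support2\n"
    "TRA\tGENE_A\tGENE_B\t12\t9\n"
    "INV\tGENE_C\tGENE_D\t2\t1\n"
)
fusions = read_fusions(demo)
well_supported = [f for f in fusions if min_break_support(f) >= 5]
print([(f["Region1"], f["Region2"]) for f in well_supported])
# prints [('GENE_A', 'GENE_B')]
```

For a real run, pass an open file handle for the fusions.txt output instead of the in-memory demo.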

Requirements

Unix operating system (Linux, Mac OS X, etc.)
Perl 5, with the following external dependency: Statistics::Descriptive.
To install Statistics::Descriptive from CPAN, issue the following command:
sudo cpan Statistics::Descriptive
Other Perl dependencies are included in the Perl 5 Core Modules and should already be installed: IPC::Open3, List::Util, File::Spec, Symbol, Getopt::Std, File::Basename.
twoBitToFa
Find and download the executable from the appropriate system folder, then copy/link/move it to a directory on PATH (e.g., /usr/bin).
hg19.2bit to run FACTERA on the hg19 human genome.
Note that hg38.2bit is now available. To use another reference genome, make sure that input BED coordinates are consistent (the exons.bed file provided here is currently hg19 only).
blast+
After downloading, find blastn and makeblastdb in ncbi-blast-version/bin and copy/link/move them to a directory on PATH (e.g., /usr/bin).
SAMtools
After downloading, find samtools and copy/link/move it to a directory on PATH (e.g., /usr/bin).
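
The requirements above can be checked up front with a short sketch. The tool names are the executables mentioned in this section; missing_tools and perl_module_check_command are hypothetical helpers. The Perl check relies on the standard behavior that perl -MModule -e 1 exits with a non-zero status when the module cannot be loaded.

```python
import shutil

REQUIRED_TOOLS = ["perl", "twoBitToFa", "blastn", "makeblastdb", "samtools"]

def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]

def perl_module_check_command(module="Statistics::Descriptive"):
    """argv that exits non-zero when the given Perl module cannot be loaded."""
    return ["perl", f"-M{module}", "-e", "1"]

print("missing from PATH:", missing_tools())
```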

Options (defaults):	Description
-o	Output directory (tumor.bam directory).
-r <int>	Minimum number of breakpoint-spanning reads needed for output (5).
-m <int>	Minimum number of discordant reads needed for a candidate fusion (2).
-x <int>	Maximum number of breakpoints to examine for any given pair of genomic regions (5).
-s <int>	Minimum number of reads with the same breakpoint (1).
-f <0-1>	Minimum fraction of read bases required for alignment to fusion template (0.9).
-S <0-1>	Minimum similarity required for alignment of read to fusion template (0.95).
-k <int>	k-mer size for fragment comparison (10 bases).
-c <int>	Minimum size of soft-clipped region to consider (16 bases).
-b <int>	Number of bases flanking breakpoint for fusion template (500).
-p <int>	Number of threads for blastn search (4; 10 or more recommended).
-a <int>	Number of bases flanking breakpoint to provide in output (50).
-e	Disable grouping of input coordinates by column 4 of exons.bed (off).
-v	Disable verbose output (off).
-t	Disable running time output (off).
-C	Disable addition of 'chr' prefix to chromosome names (off)***
-F	Force remake of BLAST database for a particular input (off).

*** The -C option is required if 'chr' is absent from all input files, including reference.2bit.
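
Whether -C is needed depends on how chromosomes are named in the inputs. As a partial sketch, the check below looks only at BED lines. The BAM header and the 2bit reference would have to be checked separately, since the note above applies to all input files. The function name is hypothetical.

```python
def chromosomes_lack_chr_prefix(bed_lines):
    """True if the BED lines name at least one chromosome and none starts with 'chr'."""
    names = {line.split()[0] for line in bed_lines if line.strip()}
    return bool(names) and not any(name.startswith("chr") for name in names)
```

A file that mixes both naming styles returns False here and is better fixed at the source than worked around with an option.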

FAQ

1. Which aligners are supported?

Answer: FACTERA was developed and optimized using targeted sequencing data aligned by bwa aln, and we currently recommend that users employ bwa aln for best performance. While FACTERA can be applied to data mapped by bwa mem, users should be aware of the following considerations when interpreting results. The most notable difference between bwa aln and mem with respect to fusion detection is the use of hard clipping in addition to soft clipping by bwa mem. Absent from bwa aln, hard clipping enables bwa mem to improve the mapping rate by realigning (rather than truncating) sufficiently long read segments in chimeric sequences. In contrast, bwa aln will truncate such reads without realignment (soft-clipping), and FACTERA leverages soft clipped, but not hard clipped, reads for breakpoint detection. Hard clipped reads will be supported in a future release of FACTERA, and we will notify registered users when this version is available.
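
The difference between soft and hard clipping is visible in the CIGAR string of each alignment, where S marks soft-clipped bases and H marks hard-clipped bases. The sketch below counts both. It is an illustration only, not FACTERA's logic, and the function names are hypothetical. The default of 16 bases mirrors the -c default listed in the options table.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def clipped_bases(cigar):
    """Return (soft_clipped, hard_clipped) base counts for a CIGAR string."""
    soft = hard = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op == "S":
            soft += int(length)
        elif op == "H":
            hard += int(length)
    return soft, hard

def has_long_soft_clip(cigar, min_clip=16):
    """True if any single soft-clipped segment has at least min_clip bases."""
    return any(op == "S" and int(length) >= min_clip
               for length, op in CIGAR_OP.findall(cigar))

print(clipped_bases("30S70M"))  # (30, 0)
print(clipped_bases("40H60M"))  # (0, 40)
```

Tallying these counts over a BAM file gives a rough idea of how many reads an aligner hard-clipped, and therefore how many are unavailable for breakpoint detection.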

2. According to the paper, FACTERA has high specificity. Why does FACTERA report some fusions that appear to be false positives?

Answer: False positive calls may arise from mapping artifacts (due to repeat sequences), PCR template switching, and other sequencing errors, and are increasingly difficult to avoid as the sequencing space grows in size and complexity. While we have implemented a variety of post-processing algorithms to reduce the false positive rate compared to previous methods (see the paper under Reference), the elimination of all fusions with repetitive content would risk discarding genuine events. We therefore recommend that users inspect the FACTERA output for possible false positives by using BLAT and the UCSC human genome browser. This is particularly important when using FACTERA to analyze exome or genome-scale datasets. In cases where paired normal datasets are available, we recommend leveraging this information to reduce the false positive rate. Finally, we would welcome suggestions from users on how to best discriminate real fusions in repeat regions from sequencing artifacts. Please send us your feedback/suggestions along with fusion results that you suspect are not real. This will help us to compile a blacklist of poorly behaving genomic regions that might be useful as a post-processing filter.
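
One simple way to use a paired normal sample is to compare the fusions.bed files from the two runs by the fusion ID in column 4. The sketch below assumes that the same event receives the same ID in both runs, which should be verified on real data before relying on it. The function names and example lines are made up.

```python
def fusion_ids(bed_lines):
    """Collect fusion IDs from column 4 of fusions.bed-style lines."""
    ids = set()
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 4:
            ids.add(fields[3])
    return ids

def tumor_only(tumor_lines, normal_lines):
    """Sorted fusion IDs seen in the tumor run but not in the paired normal run."""
    return sorted(fusion_ids(tumor_lines) - fusion_ids(normal_lines))

# Made-up lines: F2 appears in both runs and is therefore dropped.
tumor = ["chr2 100 200 F1", "chr5 300 400 F2"]
normal = ["chr5 300 400 F2"]
print(tumor_only(tumor, normal))  # ['F1']
```

Events that also appear in the normal sample are likely germline variants or recurrent artifacts, and the remaining candidates can then be inspected with BLAT as recommended above.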

Reference

Aaron M. Newman, Scott V. Bratman, Henning Stehr, Luke J. Lee, Chih Long Liu, Maximilian Diehn* and Ash A. Alizadeh* (2014) FACTERA: a practical method for the discovery of genomic rearrangements at breakpoint resolution, Bioinformatics DOI: 10.1093/bioinformatics/btu549.