2019 Dragon Star Program @PKU · Bioinformatics for Beginners · Research Informatics

2019-08-12  美式永不加糖

Dragon star Day 3 Pt.1

On structural variation, CNV calling, SNP genotyping, HMMs, and NGS-based SV detection

Dragonstar2019 by Kai Wang

  1. Detection of structural variants in human
  2. Annotation and phenotype-driven interpretation of genetic variants

Part Ⅰ Detection of structural variants in human

1 Human genetic variation

Pollex et al, Circulation. 2007

http://doc.goldenhelix.com/SVS/tutorials/cnv_univariate_analysis/overview.html

2 Mechanisms underlying structural variant formation

Ottaviani D, LeCain M, Sheer D. The role of microhomology in genomic structural variation[J]. Trends in Genetics, 2014, 30(3): 85-94.

2.1 Recurrent structural variants

2.2 Nonrecurrent rearrangements

3 Technologies for CNV Detection

3.1 Karyotyping and cytogenetic analysis

4 SNP genotyping arrays

A SNP genotyping array is a type of DNA microarray used to detect SNPs.

Schematic view of SNP array analysis by Affymetrix (right) and Illumina (left).

Iacobucci I, Lonetti A, Papayannidis C, et al. Use of single nucleotide polymorphism array technology to improve the identification of chromosomal lesions in leukemia[J]. Current cancer drug targets, 2013, 13(7): 791-810.

5 CNV Detection

There is a need to develop a high-resolution CNV detection algorithm using high-density SNP genotyping data:

5.1 Log R Ratio (LRR) and B Allele Frequency (BAF)

For both platforms, the computational algorithms convert the raw signals into Log R Ratio (LRR) and B Allele Frequency (BAF).

BAF = Y / (X + Y)
LRR = log_2( (X + Y)_{sample of interest} / (X + Y)_{baseline sample} )

https://www.biostars.org/p/199025/

LRR and BAF can be used together to determine copy number and to differentiate copy-neutral LOH regions from normal-copy regions.

Loss of heterozygosity (LOH) is a cross-chromosomal event that results in the loss of one parental copy of an entire gene and the surrounding chromosomal region.

https://en.wikipedia.org/wiki/Loss_of_heterozygosity
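To make the LRR/BAF combination concrete, the following is a minimal sketch of the idealized signal patterns for each total copy number. The function names are hypothetical, and the values are theoretical expectations under a simple "k B alleles out of n copies" assumption; real array data are noisier and observed LRR values are typically compressed toward zero.

```python
import math


def expected_baf_clusters(copy_number, loh=False):
    """Idealized BAF cluster positions for a region with the given copy number.

    With n copies, a SNP can carry 0..n B alleles, so clusters sit at k/n.
    In an LOH region only the homozygous clusters (0 and 1) remain.
    """
    if copy_number == 0:
        return []  # no DNA present, so BAF is essentially noise
    if loh:
        return [0.0, 1.0]
    return [k / copy_number for k in range(copy_number + 1)]


def ideal_lrr(copy_number):
    """Theoretical LRR relative to a two-copy baseline (assumption: no compression)."""
    if copy_number == 0:
        return float("-inf")
    return math.log2(copy_number / 2)


for cn in (1, 2, 3, 4):
    print(cn, ideal_lrr(cn), expected_baf_clusters(cn))
```

Under these assumptions, a single-copy deletion and a copy-neutral LOH region both show BAF clusters only at 0 and 1, but the deletion has a negative LRR while the copy-neutral LOH region has an LRR near 0. This is why neither signal alone is sufficient.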

5.2 Detection of CNVs from SNP arrays using PennCNV

5.2.1 PennCNV Flowchart
5.2.2 SNP Signal Intensities

R = X_{A} + X_{B}, θ = (2/π) · arctan(X_{B}/X_{A}), LRR = log_2(R_{subject}/R_{expected})

X_{A} and X_{B}: normalized signal intensities for alleles A and B

R_{expected}: calculated based on a reference dataset assuming copy number = 2

Infinium II is a two-channel assay and data consist of two intensity values (X, Y) for each SNP, with one intensity channel for each of the fluorescent dyes associated with the two alleles of the SNP.

Normalized allele intensities are transformed to a combined SNP intensity, R (R = X + Y), and an allelic intensity ratio, theta (θ = (2/π) · arctan(Y/X)).

Staaf J, Vallon-Christersson J, Lindgren D, et al. Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios[J]. BMC bioinformatics, 2008, 9(1): 409.
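The transformation described above can be sketched in a few lines. This is only an illustration of the formulas: the function names are hypothetical, X and Y are assumed to be already normalized and non-negative, and R_{expected} is passed in as a plain number rather than derived from reference clusters.

```python
import math


def r_theta(x, y):
    """Combined intensity R = X + Y and allelic ratio theta = (2/pi) * arctan(Y/X)."""
    r = x + y
    # atan2 equals arctan(y/x) for non-negative inputs and also handles x == 0
    theta = (2 / math.pi) * math.atan2(y, x)
    return r, theta


def log_r_ratio(r_subject, r_expected):
    """LRR = log2(R_subject / R_expected)."""
    return math.log2(r_subject / r_expected)


r, theta = r_theta(1.0, 1.0)   # equal A and B signal
print(r, theta)
print(log_r_ratio(1.0, 2.0))   # half the expected intensity
```

With equal signal in both channels, theta lands midway between the two homozygous extremes (0 for pure A signal, 1 for pure B signal), and halving the total intensity gives an LRR of -1.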

5.2.3 Visualization of CNVs
5.2.4 Hidden Markov Model in PennCNV

Transition probability matrix a_{ij}: a_{ij} = P(q_{t+1} = j | q_{t} = i)

Emission probabilities e_{i}(a): the probability that state i emits character a

http://www.cs.cmu.edu/~durand/03-711/2009/Lectures/hmm09-1.pdf

https://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/download/lectures/PCB_Lect06_HMM.pdf
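As a rough illustration of these two ingredients, the sketch below builds a transition matrix and a continuous emission density. All numbers are made up for illustration and are not PennCNV's fitted parameters; the uniform off-diagonal transitions and the Gaussian emission over a signal value are simplifying assumptions, and the function names are hypothetical.

```python
import math


def make_transition_matrix(n_states, stay_prob):
    """a[i][j] = P(q_{t+1} = j | q_t = i), with a high probability of staying put."""
    off_diagonal = (1.0 - stay_prob) / (n_states - 1)
    return [
        [stay_prob if i == j else off_diagonal for j in range(n_states)]
        for i in range(n_states)
    ]


def gaussian_emission(x, mean, sd):
    """Density of observing signal value x in a state with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


a = make_transition_matrix(n_states=6, stay_prob=0.99)
# Each row is a probability distribution over the next state
print([round(sum(row), 6) for row in a])
print(gaussian_emission(0.0, mean=0.0, sd=0.2))
```

A large self-transition probability encodes the expectation that neighboring markers usually share the same copy number state, so state changes are only called when the signal supports them strongly.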

5.2.5 Copy Number States

6 states:

Each state shows a distinct pattern in the plot.

5.2.6 Hidden states, copy numbers, CNV genotypes, and their descriptions

6 CNV Calling

6.1 Viterbi algorithm for calling
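As a rough illustration of how Viterbi decoding turns a series of signal values into state calls, here is a minimal log-space sketch. The two states, their mean LRR values, the standard deviation, and the transition probabilities are made-up values for illustration, not PennCNV's parameters, and only LRR is modeled.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for the observation sequence."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: log_start[s] + log_emit(s, obs[0]) for s in states}
    backpointers = []
    for x in obs[1:]:
        new_score, pointer = {}, {}
        for t in states:
            best_prev = max(states, key=lambda s: score[s] + log_trans[s][t])
            new_score[t] = score[best_prev] + log_trans[best_prev][t] + log_emit(t, x)
            pointer[t] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


def log_gaussian(x, mean, sd):
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2 * math.pi))


# Hypothetical two-state model over LRR values
states = ["normal", "deletion"]
lrr_mean = {"normal": 0.0, "deletion": -0.6}
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    s: {t: math.log(0.99 if s == t else 0.01) for t in states}
    for s in states
}

lrr = [0.0, 0.05, -0.6, -0.7, -0.55, 0.02, -0.03]
path = viterbi(lrr, states, log_start, log_trans,
               lambda s, x: log_gaussian(x, lrr_mean[s], 0.2))
print(path)
```

Working in log space avoids numerical underflow when the sequence contains many markers, which is the usual situation for whole-genome arrays.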

6.2 Other Types of Signal Data

PennCNV can be applied to data from other technical platforms:

6.3 PennCNV-Affy Pipeline

SNP → CNV → genotype call, but SNP arrays are relatively inefficient

6.4 Joint Modeling on Family Data

6.5 Joint modeling of the CNVs in a trio

6.6 Likelihood of Signal Intensities

By treating the trio as a unit, this calling algorithm avoids generating Mendelian-inconsistent calls while preserving the ability to detect de novo events.

Likelihood of an observation sequence given a state sequence, or likelihood of an observation sequence along a single path: given an observation sequence X = {x_{1}, x_{2}, ..., x_{T}} and a state sequence Q = {q_{1}, ..., q_{T}} (of the same length) determined from an HMM with parameters Θ, the likelihood of X along the path Q is equal to:

p(X|Q, Θ) = \prod_{t=1}^{T} p(x_{t}|q_{t}, Θ) = b_{q_{1}}(x_{1}) · b_{q_{2}}(x_{2}) · · · b_{q_{T}}(x_{T})

https://www.cs.ubc.ca/~murphyk/Software/HMM/labman2.pdf
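The product above is straightforward to compute once emission probabilities are available. The sketch below sums log-probabilities instead of multiplying raw values; the two states and the discrete emission table are hypothetical and serve only to check the arithmetic.

```python
import math


def path_log_likelihood(obs, path, emit):
    """log p(X | Q, Theta): sum over t of log b_{q_t}(x_t)."""
    if len(obs) != len(path):
        raise ValueError("observation and state sequences must have equal length")
    return sum(math.log(emit(q, x)) for x, q in zip(obs, path))


# Hypothetical discrete emission table b_q(x)
emission = {
    "A": {"h": 0.9, "t": 0.1},
    "B": {"h": 0.5, "t": 0.5},
}

log_lik = path_log_likelihood("hht", ["A", "A", "B"], lambda q, x: emission[q][x])
print(math.exp(log_lik))  # 0.9 * 0.9 * 0.5
```

Note that this quantity uses only emission terms, exactly as in the formula; transition probabilities enter when the joint probability of the observations and the path is computed.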

7 NGS-based SV detection

SV: Structural Variants

Escaramís G, et al. Briefings in Functional Genomics, 2015

Particularly complex SVs require de novo assembly, with the resulting contigs aligned to the reference.

(A) Read depth. Reads are aligned to the reference genome; compared with diploid regions, they show a reduced number of reads in a deleted region or a higher read depth in a duplicated region.

(B) Paired reads. Pairs of sequence reads are mapped to the reference genome (from left to right): (1) no SV, pairs are aligned in the correct order and orientation and span the distance expected from the library’s insert size; (2) deletion, the aligned pairs span farther apart than expected from the library insert size; (3) tandem duplication, read pairs are aligned in an unexpected order, where the expected order means that the leftmost read is aligned to the forward strand and the rightmost read to the reverse strand; (4) novel sequence insertion, the pairs are aligned closer together than expected from the library insert size; (5) inversion, read pairs are aligned in the wrong orientation, with both reads on either the forward or the reverse strand; and (6) read pairs mapped to different chromosomes.

(C) Split reads. Sequenced reads pointing to the same breakpoint are split at the nucleotide where the breakpoint occurs. The corresponding paired read is properly aligned to the reference genome.

(D) De novo assembly. Sample reads from novel sequence insertions are assembled without a reference genome.
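The paired-read signatures in (B) can be expressed as a small decision function. This is a simplified sketch under stated assumptions: the function name and arguments are hypothetical, the span is approximated by the distance between the two alignment start positions, and a single pair is classified in isolation, whereas real callers cluster many supporting pairs before reporting an SV.

```python
def classify_read_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                       insert_mean, insert_sd, n_sd=3.0):
    """Label one read pair with the SV signature it suggests."""
    if chrom1 != chrom2:
        return "interchromosomal"      # case (6)
    if strand1 == strand2:
        return "inversion"             # case (5): both forward or both reverse
    # Expected order: leftmost read on the forward strand, rightmost on the reverse
    left_strand = strand1 if pos1 <= pos2 else strand2
    if left_strand == "-":
        return "tandem_duplication"    # case (3): unexpected order
    span = abs(pos2 - pos1)
    if span > insert_mean + n_sd * insert_sd:
        return "deletion"              # case (2): pair spans too far
    if span < insert_mean - n_sd * insert_sd:
        return "insertion"             # case (4): pair is too close
    return "concordant"                # case (1)


print(classify_read_pair("chr1", 1000, "+", "chr1", 1400, "-", 400, 30))
print(classify_read_pair("chr1", 1000, "+", "chr1", 5000, "-", 400, 30))
```

The threshold of three standard deviations is an arbitrary illustrative choice; a wider insert-size distribution makes small deletions and insertions harder to separate from concordant pairs.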

  • The Read-Pair (RP) method estimates the likelihood that the insert size deviates
    from its expected value, a deviation associated with deletions and insertions.
  • Read-Depth (RD) based algorithms estimate the number of copies of a sequence in the genome.

Ye K, Hall G, Ning Z. Structural variation detection from next generation sequencing[J]. Next Generat Sequenc & Applic, 2016, 1(007).
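For the read-depth idea, a minimal sketch is to scale each window's depth by the depth expected for a normal diploid region. The function name, the window depths, and the baseline value are hypothetical; real tools also correct for GC content and mappability and segment adjacent windows before calling.

```python
def estimate_copy_number(window_depths, baseline_depth, ploidy=2):
    """Estimate integer copy number per window from depth relative to a diploid baseline."""
    if baseline_depth <= 0:
        raise ValueError("baseline_depth must be positive")
    return [round(ploidy * depth / baseline_depth) for depth in window_depths]


# Mean read depth in six consecutive windows, with a diploid baseline of 30x
depths = [30, 31, 15, 14, 29, 46]
print(estimate_copy_number(depths, baseline_depth=30))
```

Windows at roughly half the baseline depth suggest a heterozygous deletion, and windows at roughly 1.5 times the baseline suggest a single-copy gain. Rounding hides the uncertainty, so the result is an estimate rather than an exact count.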

7.1 Read count-based methods for SV detection

7.2 Detection of SVs from discordant read pairs

7.5 Detection of SVs using assembly-based methods

Most assembly-based short-read methods for SV detection use a reference-assisted approach. Reads with a missing mate, or reads left unmapped after reference alignment, are collected, and a local assembly is performed to generate a contig that represents the actual local structural variation.

https://www.1010genome.com/sv-detection/

7.6 SV detection from long-read sequencing

PacBio and Oxford Nanopore platforms offer a different view of structural variation in a genome with the help of their average read lengths of >10 kb. A low coverage of 10x for these long reads can help detect a high percentage of structural variations (>80%) in a complex genome.

https://www.1010genome.com/sv-detection/

7.7 Bionano optical mapping for SV detection

Optical mapping techniques such as Bionano Genomics further enhance the ability of NGS-based methods to detect large and complex SVs. Optical mapping generates images of megabase-size DNA molecules that in turn produce genome maps.

https://www.1010genome.com/sv-detection/

A nanochannel array that detects a characteristic 6- or 7-nucleotide sequence along very long genomic segments.

SV detection from Single-molecule optical mapping
