Dragon Star Day 1: Sequencing Technologies and NGS Data Formats

2019-07-31  美式永不加糖

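Dragon Star Day 1, 2019-07-29

On sequencing technologies, NGS data formats, and variant calling

An extra-long 3,000-word patch, originally in mixed Chinese and English: notes from a beginner frantically filling in knowledge gaps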


Dragonstar2019 by Kai Wang

  1. Genomic technologies in disease studies
  2. NGS data formats and variant calling

Part Ⅰ Genomic technologies in disease studies

1 Types of genetic variation

2 Revolution: single-molecule long-read sequencing

2.1 Oxford Nanopore Sequencing

Measures electrical current

The DNA/RNA is sequenced as it passes through a protein pore.

The nucleotides in the DNA/RNA block the ionic current and induce changes in the current, which can be measured.

2.2 PacBio Single-molecule real-time (SMRT) sequencing

Measures fluorescence

2.3 Linked-read Sequencing

Relies on barcodes

By adding a unique barcode to every short read generated from an individual molecule, the short reads are linked together.

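The linked-read technology from 10X Genomics essentially introduces barcode sequences into long DNA fragments. The long fragments are partitioned into separate oil droplets, and the GemCode platform amplifies them while adding barcode sequences together with sequencing adapters and primers. The products are then sheared into fragments of a size suitable for sequencing and sequenced. Tracking the barcode information identifies the multiple reads that come from each large DNA template molecule, which recovers the genetic information of the large fragment. Combining linked reads with scaffolds assembled from conventional second-generation sequencing makes it possible to build longer, more accurate scaffolds.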

http://toolsbiotech.blog.fc2.com/blog-entry-37.html

What are molecular barcodes?
The concept of molecular barcodes is that each original DNA fragment, within the same sample, is attached to a unique sequence barcode.

https://pdfs.semanticscholar.org/310b/3bac42989485c98406848217418ff22c22e7.pdf
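The barcode idea can be sketched in a few lines. This is only an illustration, assuming each short read already carries the barcode of its source molecule; the barcodes, sequences, and the `group_by_barcode` helper are made up here and are not the 10X Genomics pipeline.

```python
from collections import defaultdict

def group_by_barcode(reads):
    """Group (barcode, sequence) pairs so reads from one molecule stay together."""
    groups = defaultdict(list)
    for barcode, seq in reads:
        groups[barcode].append(seq)
    return dict(groups)

# Hypothetical short reads; the barcode identifies the long source molecule.
reads = [
    ("ACGT", "TTGACC"),
    ("GGCA", "CCATGA"),
    ("ACGT", "GACCAT"),
]

linked = group_by_barcode(reads)
for barcode, seqs in linked.items():
    print(barcode, seqs)
```

Reads that share a barcode end up in the same list, which is what lets short reads be "linked" back to one long molecule.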

2.4 Single-molecule optical mapping (Bionano Genomics)

Relies on restriction enzyme sites and measures the resulting spectrum

Optical mapping

By mapping the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of resulting DNA fragments collectively serves as a unique "fingerprint" or "barcode" for that sequence.

https://en.wikipedia.org/wiki/Optical_mapping
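The "fingerprint" idea can be illustrated in silico. The sketch below assumes a toy sequence and the EcoRI recognition site GAATTC, and it simplifies by treating the start of each site as the cut position; the helper names are hypothetical and this is not how the Bionano instrument processes data.

```python
def site_positions(seq, site):
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions

def fragment_lengths(seq, site):
    """Lengths of the pieces obtained by splitting `seq` at each site start."""
    cuts = [0] + site_positions(seq, site) + [len(seq)]
    return [b - a for a, b in zip(cuts, cuts[1:])]

toy_seq = "AAGAATTCGGGGGAATTCTT"   # made-up sequence with two GAATTC sites
positions = site_positions(toy_seq, "GAATTC")
lengths = fragment_lengths(toy_seq, "GAATTC")
print(positions, lengths)
```

The ordered list of fragment lengths is the pattern that can be compared between an unknown molecule and a reference.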

Part Ⅱ NGS data formats and variant calling

1 Basic Concepts in NGS

2 Data formats

2.1 The Rawest of Raw Data

Typically: images, in which different bases emit fluorescence of different colors

Base calling: call the nucleotide at each position of each read.

2.2 NGS data formats

FASTQ: The raw sequence data format

Millions of short reads from unknown genetic locations.
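As a rough illustration of the format, a FASTQ record takes four lines: a header starting with `@`, the sequence, a `+` separator, and a quality string. The sketch below assumes one record per four lines and Phred+33 quality encoding; the `read_fastq` helper and the example record are made up.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        # Assumed Phred+33 encoding: quality score = ASCII code - 33
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores

example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

Each quality score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so a score of 20 means about a 1% chance that the base is wrong.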

2.3 Formats use different coordinate systems, which adds confusion

Different file formats define the starting position of chromosome coordinates differently, which causes some confusion

BED: 0-based, half-open
GFF: 1-based, fully closed
SAM: 1-based, fully closed
BAM: 0-based, half-open
VCF: 1-based, fully closed
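Converting between the two conventions is a one-line shift, but it is easy to get wrong. A minimal sketch, with hypothetical helper names:

```python
def bed_to_one_based(start, end):
    """0-based half-open [start, end) -> 1-based fully closed [start, end]."""
    return start + 1, end

def one_based_to_bed(start, end):
    """1-based fully closed [start, end] -> 0-based half-open [start, end)."""
    return start - 1, end

# The first 100 bases of a chromosome in each convention.
bed_interval = (0, 100)
closed_interval = bed_to_one_based(*bed_interval)
print(closed_interval)

# Interval length: end - start in BED, end - start + 1 in 1-based closed.
bed_length = bed_interval[1] - bed_interval[0]
closed_length = closed_interval[1] - closed_interval[0] + 1
print(bed_length, closed_length)
```

Only the start coordinate changes; the end coordinate is numerically the same in both systems.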

3 Visualization of genomic data: IGV

The powerful IGV

IGV: a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.

4 Coverage

4.1 What is coverage?

4.2 Coverage: how many reads we need to cover the genome?
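A back-of-the-envelope version of this question uses mean coverage = (number of reads × read length) / genome size. The sketch below assumes a round 3 Gb genome and 150 bp reads (both illustrative values), and uses the idealized Poisson model, in which the fraction of bases with no read at all is exp(-coverage).

```python
import math

def mean_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

def fraction_uncovered(coverage):
    """Expected fraction of bases with zero reads under a Poisson model."""
    return math.exp(-coverage)

genome_size = 3_000_000_000   # assumed round figure for a human genome
read_length = 150             # assumed read length in bases

n_reads = reads_needed(30, read_length, genome_size)
print(n_reads)
print(mean_coverage(n_reads, read_length, genome_size))
print(fraction_uncovered(5), fraction_uncovered(30))
```

Real data are more variable than the Poisson model predicts, so actual experiments leave more gaps than this estimate suggests.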

4.3 Overdispersion

4.4 Question on coverage

Why do we need average 30-50x in a typical WGS experiment, and 100x in WES?

WGS is less biased compared to WES. We do not need as much depth to call a variant confidently. Check the following publications for more detailed information.

"Exome-seq achieves 95% SNP detection sensitivity at a mean on-target depth of 40 reads, whereas WGS only requires a mean of 14 reads."

Article Variant detection sensitivity and biases in whole genome and...

Shu-Hong Lin

https://www.researchgate.net/post/Why_is_NGS_coverage_of_seemingly_less_complex_exome_sequences_higher_than_that_of_whole_genome_sequencing

The reason "WGS is less biased compared to WES": WES samples first have to go through PCR

Why 30X WGS beats 100X WES for variant coverage

https://www.variantyx.com/variantyx-posts/why-30x-wgs-beats-100x-wes-for-variant-coverage/

5 General strategy for variant calling

5.1 Possible Genotypes

  1. With no reads yet (only the reference genome), the likelihood of every genotype is 1

    P(reads|A/A, read mapped) = 1.0

    P(reads|A/C, read mapped) = 1.0

    P(reads|C/C, read mapped) = 1.0

  2. Assume an error rate of 0.01

  3. The 1st read shows C at this site

    P(reads|A/A, read mapped) = P(C observed|A/A, read mapped) = 0.01

    P(reads|A/C, read mapped) = P(C observed|A/C, read mapped) = 0.5

    P(reads|C/C, read mapped) = P(C observed|C/C, read mapped) = 1 - 0.01 = 0.99

  4. The 2nd read shows C at this site

    P(reads|A/A, read mapped) = 0.01^2 = 0.0001

    P(reads|A/C, read mapped) = 0.5^2 = 0.25

    P(reads|C/C, read mapped) = 0.99^2 = 0.9801

  5. The 3rd read shows C at this site

    P(reads|A/A, read mapped) = 0.01^3 = 0.000001

    P(reads|A/C, read mapped) = 0.5^3 = 0.125

    P(reads|C/C, read mapped) = 0.99^3 = 0.970299

  6. The 4th read shows A at this site

    P(reads|A/A, read mapped) = 0.01^3 * 0.99 = 0.00000099

    P(reads|A/C, read mapped) = 0.5^4 = 0.0625

    P(reads|C/C, read mapped) = 0.99^3 * 0.01 = 0.00970299

  7. The 5th read shows A at this site

    P(reads|A/A, read mapped) = 0.01^3 * 0.99^2 ≈ 0.00000098

    P(reads|A/C, read mapped) = 0.5^5 = 0.03125

    P(reads|C/C, read mapped) = 0.99^3 * 0.01^2 = 0.0000970299 ≈ 0.000097
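The step-by-step products above can be reproduced with a short sketch. It assumes, as the notes do, independent reads, a single error rate of 0.01, and a probability of 0.5 per read under the heterozygous genotype; the function name is made up and only the bases A and C are handled.

```python
def genotype_likelihoods(observed_bases, error_rate=0.01):
    """P(reads | genotype) for A/A, A/C and C/C, treating reads as independent."""
    likelihoods = {"A/A": 1.0, "A/C": 1.0, "C/C": 1.0}
    for base in observed_bases:
        # Homozygote: the read matches the genotype unless it is an error.
        likelihoods["A/A"] *= (1 - error_rate) if base == "A" else error_rate
        likelihoods["C/C"] *= (1 - error_rate) if base == "C" else error_rate
        # Heterozygote: either allele is sampled with probability 0.5.
        likelihoods["A/C"] *= 0.5
    return likelihoods

# The five reads from the example: C, C, C, A, A.
result = genotype_likelihoods("CCCAA")
for genotype, value in result.items():
    print(genotype, value)
```

Calling the function on prefixes such as `"C"`, `"CC"`, or `"CCCA"` gives the intermediate values in steps 3 to 6.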

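This is summarized by Bayes' rule:

    P(genotype|reads) = P(reads|genotype) * Prior(genotype) / Σ_G [P(reads|G) * Prior(G)]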

Combine these likelihoods with a prior incorporating information from other individuals and flanking sites to assign a genotype.

5.2 From Sequence to Genotype: Individual Based Prior

Individual Based Prior: Every site has a 1/1000 probability of varying.

Ingredients That Go Into Prior

  • Most sites don’t vary

    P(non-reference base) ~ 0.001

  • When a site does vary, it is usually heterozygous
    P(non-reference heterozygote) ~ 0.001 * 2/3
    P(non-reference homozygote) ~ 0.001 * 1/3

  • Mutation model
    Transitions account for most variants (C↔T or A↔G)
    Transversions account for minority of variants

https://pdfs.semanticscholar.org/0156/04c3ba76cf247a8010f74ec1386e58ceb530.pdf

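The prior probability of each genotype is therefore (C/C being the reference homozygote here):

Prior(A/A) = 0.001 * 1/3 ≈ 0.00033 (or 0.00034?)

Prior(A/C) = 0.001 * 2/3 ≈ 0.00067 (or 0.00066?)

Prior(C/C) = 1 - 0.001 = 0.999

I puzzled over this for two hours and went through ten pages of Google results, and I still don't understand how the posterior probabilities here are calculated. If any statistics expert reads this, please enlighten me 🙏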
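One plausible reading, offered as a sketch rather than as the lecturer's exact calculation: the posterior for each genotype is its likelihood times its prior, divided by the sum of that product over all three genotypes. The values below are the rounded likelihoods after five reads and the individual-based priors from these notes; the `posterior` helper is hypothetical.

```python
def posterior(likelihoods, priors):
    """Bayes' rule: normalize likelihood * prior over all genotypes."""
    unnormalized = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}

# Likelihoods after reads C, C, C, A, A (rounded as in the notes).
likelihoods = {"A/A": 0.00000098, "A/C": 0.03125, "C/C": 0.000097}

# Individual-based prior: 1/1000 sites vary, 2/3 of those are heterozygous.
priors = {"A/A": 0.001 / 3, "A/C": 0.001 * 2 / 3, "C/C": 0.999}

post_individual = posterior(likelihoods, priors)
for genotype, value in post_individual.items():
    print(genotype, round(value, 5))
```

Under these assumptions the reference homozygote C/C still comes out as the most probable genotype (roughly 0.82 against 0.18 for A/C), even though the heterozygote has the largest likelihood, because the prior so strongly favors sites that do not vary.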

5.3 From Sequence to Genotype: Population Based Prior

Population Based Prior: Use frequency information from examining others at the same site. In the example above, we estimated P(A) = 0.20

The prior probability of each genotype is therefore:

Prior(A/A) = 0.2 * 0.2 = 0.04

Prior(A/C) = 1 - 0.04 - 0.64 = 0.32

Prior(C/C) = 0.8 * 0.8 = 0.64

The same confusion applies to the posterior probabilities here.
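The population-based priors match Hardy-Weinberg proportions (p², 2pq, q²) for an allele frequency P(A) = 0.20. As a hedged sketch, assuming the same Bayes normalization and the rounded five-read likelihoods from the example, the posterior can be computed as follows; the helper names are made up.

```python
def hardy_weinberg_priors(freq_a):
    """Genotype priors from the population frequency of allele A."""
    freq_c = 1 - freq_a
    return {
        "A/A": freq_a * freq_a,
        "A/C": 2 * freq_a * freq_c,
        "C/C": freq_c * freq_c,
    }

def posterior(likelihoods, priors):
    """Bayes' rule: normalize likelihood * prior over all genotypes."""
    unnormalized = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}

pop_priors = hardy_weinberg_priors(0.20)
likelihoods = {"A/A": 0.00000098, "A/C": 0.03125, "C/C": 0.000097}

post_population = posterior(likelihoods, pop_priors)
for genotype, value in post_population.items():
    print(genotype, round(value, 5))
```

With this prior the heterozygote A/C becomes by far the most probable genotype for the same five reads, which shows how much the choice of prior matters at low depth.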

5.4 Sequence Based Genotype Calls

  • Individual Based Prior
    • Assumes all sites have an equal probability of showing polymorphism
    • Specifically, assumption is that about 1/1000 bases differ from reference
    • If reads were error-free and sampling were Poisson …
    • … 14x coverage would allow for 99.8% genotype accuracy
    • … 30x coverage of the genome needed to allow for errors and clustering
  • Population Based Prior
    • Uses frequency information obtained from examining other individuals
    • Calling very rare polymorphisms still requires 20-30x coverage of the genome
    • Calling common polymorphisms requires much less data

https://pdfs.semanticscholar.org/0156/04c3ba76cf247a8010f74ec1386e58ceb530.pdf
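One way to see where a figure like 99.8% at 14x could come from is sketched below. The assumptions are mine, not taken from the slides: reads are error-free, depth at a site is Poisson with mean equal to the coverage, each read samples one of the two alleles with probability 1/2, and the only possible mistake is failing to see both alleles at a heterozygous site.

```python
import math

def het_detection_prob(mean_depth, max_depth=200):
    """P(both alleles of a heterozygous site are sampled) under Poisson depth."""
    p_depth = math.exp(-mean_depth)   # P(depth = 0)
    p_miss = p_depth                  # with zero reads, nothing is seen
    for n in range(1, max_depth + 1):
        p_depth *= mean_depth / n     # Poisson recurrence: P(n) = P(n-1) * mean / n
        # All n reads come from the same allele with probability 2 * 0.5^n.
        p_miss += p_depth * 2 * 0.5 ** n
    return 1 - p_miss

for depth in (5, 14, 30):
    print(depth, round(het_detection_prob(depth), 6))
```

At a mean depth of 14 this gives about 0.998, in line with the figure quoted in the list. Sequencing errors and uneven (clustered) coverage are what push the practical requirement toward 30x.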


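Finally, I heartily recommend a series of practical resources from 生信技能树:

  1. 生信技能树 worldwide charity lecture tour: https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. A 74-hour collection of free bioinformatics engineer teaching videos on Bilibili: https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. Apprentice recruitment: https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw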