生物信息学与算法2019龙星计划@PKU生物信息学从零开始学

Dragon star Day 2 Pt.2 关于基因组组装、

2019-08-11  本文已影响17人  美式永不加糖

Dragon star Day 2 Pt.2

关于基因组组装、人类基因组计划、七桥问题、图论

Dragonstar2019 by Kai Wang

  1. Alignment of short/long-read sequencing data
  2. Genome assembly by short/long-read sequencing

Part Ⅱ Genome assembly by short/long-read sequencing

1 genome assembly

1.1 What is genome assembly

1.2 Why genome assembly

1.3 What types of genome assembly?

1.3.1 de novo assembly

The entire assembled sequence is resolved from raw sequence data without comparison to a reference genome sequence.

Assemblers need copious sequencing data and informatic exertion to put the genome back together. Image: Kelly Howe, Lawrence Berkeley Laboratory

Involves sequencing a novel genome from scratch, without the aid of external data.

https://era7bioinformatics.com/en/page.cfm?id=1500

1.3.2 Comparative assembly

Each read is aligned to the reference genome. In this approach, the new completely different sequences are lost.

1.3.3 De Novo Assembly paradigms

2 Graph theory

2.1 Directed graph

A directed graph is an ordered pair G = (V, E) comprising:

  • V a set of vertices (also called nodes or points);
  • E ⊆ {(x, y) | (x, y) ∈ V2 ∧ xy} a set of edges

https://en.wikipedia.org/wiki/Graph_theory#Directed_graph

2.2 Overlap-based approach (OLC)

3 Comparison of two approaches for large genomes such as human genomes

3.1 U.S. Human Genome Project

3.2 Draft human genome

两股力量:

双方发文互怼,竞争之下,2001年分别在Nature (2001.02.15), Science (2001.02.16)发布成果,时间仅相差一天,比原定计划提前了两年。

3.3 How Perl Saved the Human Genome Project

Random web quotes: “Perl and the human genome are almost perfectly matched; both are almost incomprehensible, with no central design, accreted haphazardly over a long time.”

"Perl一夜之间火起来,人类基因组完成后又在一夜之间被抛弃。" 🌚

3.4 Back to today

4 Lander-Waterman statistics in genome assembly

4.1 Coverage at a position can be modelled by Poisson distribution

P(k, λ) = e^{-λ} *λ^k/k!

4.2 What’s the fraction of genome that are covered by reads?

4.3 How many contigs are generated?

4.4 Assembling large genomes today

Since 2013, de novo assembly of large genomes has shifted from short-read sequencing to synthetic or true long-read sequencing.

Jung et al, Trends in Plant Science, 2019

4.5 Canu assembler for long-read assembly

基于Celara开发的软件

Koren et al, Genome Research, 2017

4.6 Wtdbg2 assembler for long-read assembly

Wtdbg2 groups 256 base pairs into a bin. (并不是直接用ATCG)

Ruan et al, BioRxiv, 2019

5 Seven Bridges of Königsberg Problem

柯尼斯堡七桥问题(Seven Bridges of Königsberg)是图论中的著名问题。这个问题是基于一个现实生活中的事例:当时东普鲁士柯尼斯堡(今日俄罗斯加里宁格勒)市区跨普列戈利亚河两岸,河中心有两个小岛。小岛与河的两岸有七条桥连接。在所有桥都只能走一遍的前提下,如何才能把这个地方所有的桥都走遍?

https://zh.wikipedia.org/wiki/柯尼斯堡七桥问题

5.1 Euler’s Theorem on directed graphs

5.2 Euler’s Theorem on undirected graphs

5.3 The Königsberg graph is not Eulerian

For the existence of Eulerian trails

所以欧拉认为七桥问题无解

5.4 Hamiltonian vs Eulerian path

6 Constructing the de Bruijn Graph

Create the de Bruijn graph for the following string, using k=3

TTATGCCATGGGATGAA

6.1 Error correction in de Bruijn graph

6.2 A fun hypothetical case study: Frederick Sanger’s insulin sequencing study

"Sanger是个很有趣的人,没有学校要他,但是他家里很有钱,自己进实验室自己掏钱买试剂,得了诺贝尔奖后这辈子也就这么回事了,也不发文章也不看文献...."

*穷鬼爆笑

用 de Bruijn graph 的方法,也可以将二肽及三肽片段组装出胰岛素contig.

上一篇下一篇

猜你喜欢

热点阅读