生物信息学从零开始学2019龙星计划@PKU生物信息学与算法

Dragon star Day 2 Pt.1 关于序列比对、计

2019-08-10  本文已影响8人  美式永不加糖

Dragon star Day 2 20190730

关于序列比对、计分方法、BLAST原理及判断值、BWT原理

补丁笔记之终于上手画了
-为什么这么难?
-因为你是指南针呀
-。


Dragonstar2019 by Kai Wang

  1. Alignment of short/long-read sequencing data
  2. Genome assembly by short/long-read sequencing

Part Ⅰ Alignment of short/long-read sequencing data

1 Sequence similarity

Sequence similarity is a measure of an empirical relationship between sequences.

A similarity score is therefore aimed to approximate the evolutionary distance between a pair of nucleotide or protein sequences.

Heringa, Jaap(Jul 2008) Sequence Similarity. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005317.pub2]

2 Sequence alignment

2.1 Pairwise alignment

Sequence alignment between two sequences

3 Dynamic programming

动态规划(英语:Dynamic programming,简称DP)是一种在数学管理科学计算机科学经济学生物信息学中使用的,通过把原问题分解为相对简单的子问题的方式求解复杂问题的方法。

动态规划常常适用于有重叠子问题和最优子结构性质的问题,动态规划方法所耗时间往往远少于朴素解法。

动态规划背后的基本思想非常简单。大致上,若要解一个给定问题,我们需要解其不同部分(即子问题),再根据子问题的解以得出原问题的解。

https://zh.wikipedia.org/wiki/动态规划

3.1 Global alignment: Needleman-Wunsch

3.1.1 Working of Needleman -Wunsch Algorithm

The dynamic programming matrix is defined with three different steps.

  1. Initialization of the matrix with the scores possible.
  2. Matrix filling with maximum scores.
  3. Trace back the residues for appropriate alignment.

vlab.amrita.edu,. (2012). Global alignment of two sequences - Needleman-Wunsch Algorithm. Retrieved 9 August 2019, from vlab.amrita.edu/?sub=3&brch=274&sim=1431&cnt=1

3.2 Local alignment: Smith-Waterman

史密斯-沃特曼算法(Smith-Waterman algorithm)是一种进行局部序列比对(相对于全局比对)的算法,用于找出两个核苷酸序列或蛋白质序列之间的相似区域。该算法的目的不是进行全序列的比对,而是找出两个序列中具有高相似度的片段。

该算法由坦普尔·史密斯(Temple F. Smith)和迈克尔·沃特曼(Michael S. Waterman)于1981年提出。史密斯-沃特曼算法是尼德曼-翁施算法的一个变体,二者都是动态规划算法。这一算法的优势在于可以在给定的打分方法下找出两个序列的最优的局部比对(打分方法使用了置换矩阵和空位罚分)。该算法和尼德曼-翁施算法的主要区别在于该算法不存在负分(负分被替换为零),因此局部比对成为可能。回溯从分数最高的矩阵元素开始,直到遇到分数为零的元素停止。分数最高的局部比对结果在此过程中产生。

https://zh.wikipedia.org/史密斯-沃特曼算法

3.2.1 Working of Smith-Waterman Algorithm

The basic steps for the algorithm are similar to that of Needleman-Wunsch algorithm. The steps are:

  1. Initialization of a matrix.
  2. Matrix Filling with the appropriate scores.
  3. Trace back the sequences for a suitable alignment.

vlab.amrita.edu,. (2012). Smith-Waterman Algorithm - Local Alignment of Sequences. Retrieved 9 August 2019, from vlab.amrita.edu/?sub=3&brch=274&sim=1433&cnt=1

可见步骤都是一致的,用同样的序列和 score function 进行比对

4 BLAST: The seed-index-map-extend-merge strategy

The BLAST = Basic Local Alignment Search Tool algorithm is a heuristic for computing optimal "local alignments" between a query sequence and a database containing one or more subject sequences.

https://ab.inf.uni-tuebingen.de/teaching/ws2012/bioinf1/03-BLAST-BLAT.pdf

4.1 BLAST Basics

4.2 Process: The seed-index-map-extend-merge strategy

Most popular algorithms use a seed-and-extend approach that operates in two steps:

  1. Find a set of short exact matches called seeds.
  2. Try to extend each seed match to obtain a long inexact match

These matching words are called seeds and BLAST then attempts to extend each seed to a HSP (High-Scoring Segment Pair).

https://ab.inf.uni-tuebingen.de/teaching/ws2012/bioinf1/03-BLAST-BLAT.pdf>

4.2.1 The Process
4.2.2 Statistical significance of alignment

5 BWT

Technically, it is a lexicographical reversible permutation of the characters of a string.

The most important application of BWT is found in biological sciences where genomes(long strings written in A, C, T, G alphabets) don’t have many runs but they do have many repeats.

https://www.geeksforgeeks.org/burrows-wheeler-data-transform-algorithm/

5.1 Bowtie/BWA

Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).

http://bowtie-bio.sourceforge.net/index.shtml

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.

http://bio-bwa.sourceforge.net/

BWA and Bowtie are the most famous implementations of BWT in sequence alignment.

5.2 BWT procedures

5.3 BWT reverse transformation

With a suffix array, it is easy to recover the original string S


最后,向大家隆重推荐生信技能树的一系列干货!

  1. 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
上一篇 下一篇

猜你喜欢

热点阅读