Introduction to RNA-Seq: Parsing FASTQ Files

2017-11-20  jlyq617

The FASTQ format stores short-read sequences and their Phred quality scores from an NGS platform in a single file.
Every four lines represent one short read.

图片 1.png

Four lines per FASTQ record

1. @ followed by the sequence ID; this is the description line (in the example above it is longer than the sequence itself)
eg2.png

In general, an instrument run count between 200 and 9999 is considered suitable.

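2. The sequence content of the read: the called bases A/G/T/C/N, where N marks a base that could not be determined
3. + optionally followed by a repeat of the sequence ID (often left empty)
4. The quality string, with one character scoring each base call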
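
The four-line layout above can be sketched as a small reader. This is only a minimal illustration, assuming every record occupies exactly four lines (no wrapped sequences); `FastqRecord`, `read_fastq` and the two example reads are hypothetical names and made-up data, not part of any library or of the original article.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str    # line 1, without the leading '@'
    sequence: str  # line 2, the called bases
    plus: str      # line 3, without the leading '+' (often empty)
    quality: str   # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (hypothetical helper)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, plus[1:], quality)


# Made-up example data: two reads, the second repeats its ID on the '+' line.
example = "@read1\nGATTNACA\n+\nIIII!III\n@read2\nACGT\n+read2\n!!II\n"
records = list(read_fastq(io.StringIO(example)))
for rec in records:
    print(rec.seq_id, rec.sequence, rec.quality)
```

For real data, an established parser is usually a safer choice than a hand-written loop like this one.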

A quality score is a number.
One character encodes one number using the ASCII table.
A quality score represents an error probability.
Quality scores are used to represent base-calling accuracy, alignment accuracy and other probabilities.
If quality were written as plain numbers, any score with two or more digits could no longer line up one-to-one with a single base. The score is therefore converted to a single character, and FASTQ uses the ASCII table for that conversion.
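
To make the one-character-per-score idea concrete, here is a minimal sketch that turns a list of numeric scores into a quality string. It assumes the offset of 33 used in this article; `encode_qualities` and the sample scores are hypothetical and only for illustration.

```python
OFFSET = 33  # assumed offset: the first printable ASCII character is '!' (33)


def encode_qualities(scores):
    """Map each numeric score to a single ASCII character (hypothetical helper)."""
    for score in scores:
        # 33 + 93 = 126 ('~'), the last printable ASCII character
        if not 0 <= score <= 93:
            raise ValueError(f"score {score} cannot be encoded as one character")
    return "".join(chr(score + OFFSET) for score in scores)


scores = [0, 9, 10, 40]  # made-up scores; 10 and 40 need two digits as numbers
encoded = encode_qualities(scores)
print(encoded)                     # one character per score
print(len(encoded) == len(scores))
```

However many digits a score has, the encoded string always has the same length as the list of scores, which is what keeps it aligned with the bases.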

ascll.png
The number can be converted to a probability with the following formula:
P=10^[-(Q-33)/10]
Here Q is the ASCII code of the quality character. The scale starts at character 33, so 33 is subtracted from Q; the difference Q-33 is the Phred score.
The ASCII code Q ranges from 33 to 126.
The characters range from ‘!’ to ‘~’.
Currently, most NGS platforms only produce Q values from 33 to 73 (from ‘!’ to ‘I’).
The corresponding P values run from 10^0 to 10^-4 (from 1 to 0.0001).
For example:
Suppose the quality string contains a ‘!’.
In the ASCII table, ‘!’ corresponds to the value 33. Substituting it into the formula for P gives P=10^[-(33-33)/10] =10^0=1
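
The same calculation can be sketched in a few lines of code. This is a minimal illustration of the formula P=10^[-(Q-33)/10], where Q is the ASCII code of the character; `error_probability` is a hypothetical helper name and the sample characters are chosen only to span the range from ‘!’ to ‘I’.

```python
def error_probability(char: str) -> float:
    """Apply P = 10^[-(Q-33)/10], with Q the ASCII code of the character."""
    q = ord(char)
    if not 33 <= q <= 126:
        raise ValueError(f"{char!r} is outside the range '!' to '~'")
    return 10 ** (-(q - 33) / 10)


# '!' is the worst possible quality, 'I' the best that most platforms produce.
for ch in "!+5?I":
    print(ch, ord(ch), error_probability(ch))
```

Each step of 10 in the ASCII code lowers the error probability by a factor of ten.
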
Various formats for NGS data:

Input data (raw data): .fasta, .fastq (.SRA)
Annotation data: .gff, .gtf, .bed
Alignment result: .sam, .bam, .wig, .bed
Variant call result: .vcf
