RNA-Seq Basics: Parsing FASTQ Files
2017-11-20
jlyq617
The FASTQ format stores short-read sequences and their Phred quality scores from an NGS platform in a single file.
Every four lines represent one short read.
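Four lines per FASTQ record
1. A line starting with @ that gives the sequence ID (it can be longer than the sequence itself). This is the description line.
Typically, an instrument run count between 200 and 9999 is considered suitable.
2. The sequence content of the read: the called bases A/G/T/C/N, where N marks a base that could not be determined.
3. A line starting with +, which optionally repeats the sequence ID (often left empty).
4. The quality string, which scores the quality of each base.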
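The four-line layout above can be read with a few lines of Python. This is only a minimal sketch: the names `FastqRecord` and `read_fastq` and the example read are made up for illustration, and it assumes an uncompressed file in which every record occupies exactly four lines (no wrapped sequences).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical helper type, not from any library)."""
    read_id: str    # line 1 without the leading '@'
    sequence: str   # line 2: called bases (A/G/T/C/N)
    comment: str    # line 3 without the leading '+', often empty
    quality: str    # line 4: one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a text handle, assuming four lines per record."""
    while True:
        header = handle.readline()
        if header == "":              # end of file
            return
        header = header.rstrip("\r\n")
        if header == "":              # tolerate stray blank lines
            continue
        sequence = handle.readline().rstrip("\r\n")
        plus_line = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of header: {header!r}")
        if not plus_line.startswith("+"):
            raise ValueError(f"expected '+' on third line: {plus_line!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality string differ in length")
        yield FastqRecord(header[1:], sequence, plus_line[1:], quality)


# Made-up record used only to exercise the parser.
example = "@read1 demo\nGATTN\n+\nII5!#\n"
records = list(read_fastq(io.StringIO(example)))
for rec in records:
    print(rec.read_id, rec.sequence, rec.quality)
```

For real data, an established parser is usually a safer choice than a hand-written loop; the sketch is only meant to make the record structure concrete.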
A quality score is a number.
Each character encodes one number via the ASCII table.
A quality score represents an error probability.
Quality scores are used to represent base calling accuracy, alignment accuracy and other probabilities.
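If quality were written directly as numbers, any value with two or more digits would break the one-to-one correspondence between a position in the quality string and a base. The number therefore has to be converted into a single character, and the FASTQ quality string uses the ASCII table to do this.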
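The character-to-number mapping can be tried out with Python's built-in `ord` and `chr`. This is a small sketch; the quality string `"!5I~"` and the variable names are chosen only for illustration.

```python
# Each character in a quality string stands for exactly one number,
# so the string always has the same length as the read.
quality = "!5I~"                       # illustrative quality string

codes = [ord(ch) for ch in quality]    # character -> ASCII code
print(codes)                           # [33, 53, 73, 126]

# Going the other way: ASCII code -> single character.
restored = "".join(chr(code) for code in codes)
print(restored == quality)             # True

# Writing the numbers as digits instead would not keep one symbol per base:
as_digits = "".join(str(code) for code in codes)
print(len(quality), len(as_digits))    # 4 9
```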
The number can be converted to an error probability with the following formula, where Q is the ASCII code of the quality character:
P=10^[-(Q-33)/10]
The scale starts at ASCII character 33, so 33 is subtracted from Q.
The encoded value Q ranges from 33 to 126.
The corresponding characters range from ‘!’ to ‘~’.
Currently, most NGS platforms only produce encoded values (Q) in the range 33 to 73 (from ‘!’ to ‘I’).
The corresponding P values range from 10^0 to 10^-4 (from 1 to 0.0001).
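For example:
Suppose the quality string contains the character ‘!’:
Looking it up in the ASCII table, ‘!’ corresponds to the value 33. Substituting this into the formula for P gives P=10^[-(33-33)/10] =10^0=1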
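The same calculation can be done for any quality character. The sketch below applies the formula P=10^[-(Q-33)/10] with Q taken as the ASCII code; the function names and the sample string are hypothetical, and the offset of 33 is the one assumed throughout this article.

```python
from typing import List


def error_probability(ch: str, offset: int = 33) -> float:
    """Error probability for one quality character: P = 10^(-(Q - offset)/10)."""
    q = ord(ch)  # ASCII code of the character, e.g. '!' -> 33
    if not 33 <= q <= 126:
        raise ValueError(f"character outside the '!'..'~' range: {ch!r}")
    return 10 ** (-(q - offset) / 10)


def error_probabilities(quality: str) -> List[float]:
    """Per-base error probabilities for a whole quality string."""
    return [error_probability(ch) for ch in quality]


print(error_probability("!"))       # 1.0, matching the worked example
print(error_probability("I"))       # about 0.0001 (ASCII code 73)
print(error_probabilities("!5I"))   # roughly [1, 0.01, 0.0001]
```

A character of ‘!’ therefore means the base call carries no confidence at all, while ‘I’ corresponds to roughly one expected error in 10,000 calls.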
Various formats for NGS data:
Input data (raw data): .fasta, .fastq (.SRA)
Annotation data: .gff, .gtf, .bed
Alignment result: .sam, .bam, .wig, .bed
Variant call result: .vcf