生信基础知识一些需要知道的概念

双端测序fastq及sam文件理解

2020-08-06  本文已影响0人  genomics1990

最近在做一些bam文件质控的时候,发现没有深入理解一些文件格式,只是参考别人代码跑数据,感觉心里是很虚的,仔细查了些资料,才有了初步的了解,主要解决了以下几个问题。

1、fastq双端测序read1和read2对应情况

2、sam文件read1和read2关系和插入片段计算


sample1_R1.fq.gz
sample1_R2.fq.gz

gzcat sample1_R1.fq.gz |head

@E00487:566:HCK35CCX2:7:1101:21095:1819 1:N:0:CGAGGCTG+TATGCAGT
GCAGAGGGGCAGATTAACCCTCTGCCTCACTTGACAAG
+
AAFFFJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJFJ

@E00487:566:HCK35CCX2:7:1101:22252:1819 1:N:0:CGAGGCTG+TATGCAGT
ATTTATGTCGCTCTTTCGGCCCTCCAGTCTCCCCATGGCGGGTTCTAGCAGGGAATGTCTCTTTTCTCTTTCTTCTGTGTTTGTTGTAGCTGCTGAGACGAGGTCAGCTTTATGCTTGGGGCTTTTTTTCTTTTTTCTTACTGATAGCTG
+
AAFFFJJJJJJJJJJJJJJJAFJJFFJJFJJ7J<FJJFJJFJFJJ<FJFFFFF-FFAF-AJFJAJJJAFAFFJFFF<7<777FJJJJJJFJ<FJ<-AFFJ<FJJJFJJ<JJJJJFJFJJJJFJ7FFJJJ7FJJJJJ<AF77FJFJAFJFA

gzcat sample1_R2.fq.gz |head

@E00487:566:HCK35CCX2:7:1101:21095:1819 2:N:0:CGAGGCTG+TATGCAGT
NTTGTCAAGTGAGGCAGAGGGTTAATCTGCCCCTCTGC
+
#AAFAJJAJJJJFJJFJJJJJJJJJJJJJFFFJJJFJJ
@E00487:566:HCK35CCX2:7:1101:22252:1819 2:N:0:CGAGGCTG+TATGCAGT
NGGCTGGCCAAGGTCACAGCTTACAGCTATCAGTAAGAAAAAAGAAAA
+
#-<A-A777-7F<--FJJJA7FFF<FA7JFAJJA-JAFAF<AJ<FFJJ

第一行E00487:566:HCK35CCX2:7:1101:21095:1819 在R1和R2文件中是一一对应的


知道了read1和read2的标记是一样的,所以选择插入片段长一些的提取出来

E00487:566:HCK35CCX2:7:2111:14752:22194 99 chr1 540614 30 150M = 540747 283 GGAGCGCAAAGCGAAGGGGCGGCCTGTGGCGTTTTCCTGTAAAGTTGGGCACACGCTTCCCACATGACTCAGCAATTGCACTTCTGGGTATGTACCCGAGAGAAACAAAAGCTTATGTTCACACAAAAACCTACAACGCAAATGCACAAA AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ`

JJJJJJJJJJJJJJJJ AS:i:0 XS:i:-6 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:0 YT:Z:CP

E00487:566:HCK35CCX2:7:2111:14752:22194 147 chr1 540747 30 150M = 540614 -283 CAACGCAAATGCACAAACAGCTCTATCCAACAACCCTGGAAGCAACCCAAACACGCTTCAGCGGCACAGGCGCCTCCACGCGGAACCCCACGCGGCGCTCAGCACGGACGAGGAGGGAGCCGCGCACGCGCGGTCGGCTCGGCGAGGAGC AF<JJJJFFFJJFJJFAJFJJJJAF<JJF7JFAJJJFFJJJJJFJJJJJFJF<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJAJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA AS:i:0 XS:i:-29 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:0 YT:Z:CP

查看在对应fastq文件中read信息

gzcat sample1_R1.fq.gz | grep "E00487:566:HCK35CCX2:7:2111:14752:22194" -A 3 -n
26576029:@E00487:566:HCK35CCX2:7:2111:14752:22194 1:N:0:CGAGGCTG+TATGCAGT

26576030-GGAGCGCAAAGCGAAGGGGCGGCCTGTGGCGTTTTCCTGTAAAGTTGGGCACACGCTTCCCACATGACTCAGCAATTGCACTTCTGGGTATGTACCCGAGAGAAACAAAAGCTTATGTTCACACAAAAACCTACAACGCAAATGCACAAA

26576031-+

26576032-AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
gzcat sample1_R2.fq.gz | grep "E00487:566:HCK35CCX2:7:2111:14752:22194" -A 3 -n
26576029:@E00487:566:HCK35CCX2:7:2111:14752:22194 2:N:0:CGAGGCTG+TATGCAGT

26576030-GCTCCTCGCCGAGCCGACCGCGCGTGCGCGGCTCCCTCCTCGTCCGTGCTGAGCGCCGCGTGGGGTTCCGCGTGGAGGCGCCTGTGCCGCTGAAGCGTGTTTGGGTTGCTTCCAGGGTTGTTGGATAGAGCTGTTTGTGCATTTGCGTTG

26576031-+

26576032-AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<FJFJJJJJFJJJJJFFJJJAFJ7FJJ<FAJJJJFJAFJJFJJFFFJJJJ<FA

read1 FLAG值为 99,read2 FLAG值为147,意思是read1和read2是合理配对,且99为read1,与其配对的147为read2,匹配在反义链上
https://www.samformat.info/sam-format-flag

截屏2020-08-06下午9.46.29.jpg

将fastq文件read1和read2序列在UCSC中比对

fastq : read1

截屏2020-08-06下午9.43.37.jpg

fastq : read2

截屏2020-08-06下午9.42.57.jpg

提取两条序列信息,导进IGV查看

samtools view sample1.bam |grep "E00487:566:HCK35CCX2:7:2111:14752:22194" > test.sam

IGV


截屏2020-08-06下午9.25.12.jpg

插入片段长度计算

read1 起始点 540614

read2 起始点 540747

read2 起始点是5’端,所以加上read2长度150,就是read2末端540897

插入片段长度为540897-540614=283bp

上一篇下一篇

猜你喜欢

热点阅读