双端测序fastq及sam文件理解
2020-08-06 本文已影响0人
genomics1990
最近在做一些bam文件质控的时候,发现没有深入理解一些文件格式,只是参考别人代码跑数据,感觉心里是很虚的,仔细查了些资料,才有了初步的了解,主要解决了以下几个问题。
1、fastq双端测序read1和read2对应情况
2、sam文件read1和read2关系和插入片段计算
- fastq双端测序read1和read2对应情况
sample1_R1.fq.gz
sample1_R2.fq.gz
gzcat sample1_R1.fq.gz |head
@E00487:566:HCK35CCX2:7:1101:21095:1819 1:N:0:CGAGGCTG+TATGCAGT
GCAGAGGGGCAGATTAACCCTCTGCCTCACTTGACAAG
+
AAFFFJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJFJ
@E00487:566:HCK35CCX2:7:1101:22252:1819 1:N:0:CGAGGCTG+TATGCAGT
ATTTATGTCGCTCTTTCGGCCCTCCAGTCTCCCCATGGCGGGTTCTAGCAGGGAATGTCTCTTTTCTCTTTCTTCTGTGTTTGTTGTAGCTGCTGAGACGAGGTCAGCTTTATGCTTGGGGCTTTTTTTCTTTTTTCTTACTGATAGCTG
+
AAFFFJJJJJJJJJJJJJJJAFJJFFJJFJJ7J<FJJFJJFJFJJ<FJFFFFF-FFAF-AJFJAJJJAFAFFJFFF<7<777FJJJJJJFJ<FJ<-AFFJ<FJJJFJJ<JJJJJFJFJJJJFJ7FFJJJ7FJJJJJ<AF77FJFJAFJFA
gzcat sample1_R2.fq.gz |head
@E00487:566:HCK35CCX2:7:1101:21095:1819 2:N:0:CGAGGCTG+TATGCAGT
NTTGTCAAGTGAGGCAGAGGGTTAATCTGCCCCTCTGC
+
#AAFAJJAJJJJFJJFJJJJJJJJJJJJJFFFJJJFJJ
@E00487:566:HCK35CCX2:7:1101:22252:1819 2:N:0:CGAGGCTG+TATGCAGT
NGGCTGGCCAAGGTCACAGCTTACAGCTATCAGTAAGAAAAAAGAAAA
+
#-<A-A777-7F<--FJJJA7FFF<FA7JFAJJA-JAFAF<AJ<FFJJ
第一行E00487:566:HCK35CCX2:7:1101:21095:1819 在R1和R2文件中是一一对应的
- Sam文件中read1和read2对应情况
知道了read1和read2的标记是一样的,所以选择插入片段长一些的提取出来
E00487:566:HCK35CCX2:7:2111:14752:22194 99 chr1 540614 30 150M = 540747 283 GGAGCGCAAAGCGAAGGGGCGGCCTGTGGCGTTTTCCTGTAAAGTTGGGCACACGCTTCCCACATGACTCAGCAATTGCACTTCTGGGTATGTACCCGAGAGAAACAAAAGCTTATGTTCACACAAAAACCTACAACGCAAATGCACAAA AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ`
JJJJJJJJJJJJJJJJ AS:i:0 XS:i:-6 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:0 YT:Z:CP
E00487:566:HCK35CCX2:7:2111:14752:22194 147 chr1 540747 30 150M = 540614 -283 CAACGCAAATGCACAAACAGCTCTATCCAACAACCCTGGAAGCAACCCAAACACGCTTCAGCGGCACAGGCGCCTCCACGCGGAACCCCACGCGGCGCTCAGCACGGACGAGGAGGGAGCCGCGCACGCGCGGTCGGCTCGGCGAGGAGC AF<JJJJFFFJJFJJFAJFJJJJAF<JJF7JFAJJJFFJJJJJFJJJJJFJF<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJAJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA AS:i:0 XS:i:-29 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:0 YT:Z:CP
查看在对应fastq文件中read信息
gzcat sample1_R1.fq.gz | grep "E00487:566:HCK35CCX2:7:2111:14752:22194" -A 3 -n
26576029:@E00487:566:HCK35CCX2:7:2111:14752:22194 1:N:0:CGAGGCTG+TATGCAGT
26576030-GGAGCGCAAAGCGAAGGGGCGGCCTGTGGCGTTTTCCTGTAAAGTTGGGCACACGCTTCCCACATGACTCAGCAATTGCACTTCTGGGTATGTACCCGAGAGAAACAAAAGCTTATGTTCACACAAAAACCTACAACGCAAATGCACAAA
26576031-+
26576032-AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
gzcat sample1_R2.fq.gz | grep "E00487:566:HCK35CCX2:7:2111:14752:22194" -A 3 -n
26576029:@E00487:566:HCK35CCX2:7:2111:14752:22194 2:N:0:CGAGGCTG+TATGCAGT
26576030-GCTCCTCGCCGAGCCGACCGCGCGTGCGCGGCTCCCTCCTCGTCCGTGCTGAGCGCCGCGTGGGGTTCCGCGTGGAGGCGCCTGTGCCGCTGAAGCGTGTTTGGGTTGCTTCCAGGGTTGTTGGATAGAGCTGTTTGTGCATTTGCGTTG
26576031-+
26576032-AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<FJFJJJJJFJJJJJFFJJJAFJ7FJJ<FAJJJJFJAFJJFJJFFFJJJJ<FA
read1 FLAG值为 99,read2 FLAG值为147,意思是read1和read2是合理配对,且99为read1,与其配对的147为read2,匹配在反义链上
https://www.samformat.info/sam-format-flag
将fastq文件read1和read2序列在UCSC中比对
fastq : read1
截屏2020-08-06下午9.43.37.jpgfastq : read2
截屏2020-08-06下午9.42.57.jpg提取两条序列信息,导进IGV查看
samtools view sample1.bam |grep "E00487:566:HCK35CCX2:7:2111:14752:22194" > test.sam
IGV
截屏2020-08-06下午9.25.12.jpg
插入片段长度计算
read1 起始点 540614
read2 起始点 540747
read2 起始点是5’端,所以加上read2长度150,就是read2末端540897
插入片段长度为540897-540614=283bp