如何在Rawdata的fq.gz中找到一条read序列
最近有个任务:在原始数据fq.gz文件中找到特定的read序列!!太难了,555.。。。。
事情是这样的:
在IGV中查看我的bam文件时,找到了一对有趣的reads:
E00582:592:HHF5YCCX2:3:2101:5863:59604这个bam文件用的是clean的reads,但我想看看Rawdata中,这个reads的情况,现在我只知道这个read叫“E00582:592:HHF5YCCX2:3:2101:5863:59604
”怎么办呢?
方法一:直接grep
grep -A3 '@E00582:592:HHF5YCCX2:3:2101:5863:59604' My_R1.fq.gz
然后,你也猜到了吧,反应了好久好久好久好久。。。
我实在受不了了,直接杀掉了!害。。。。
有耐心的勇士宝子可以试试看,欢迎揭晓用时!
只能找其他方法了;于是就有了:
方法二:seqkit grep
zcat My_R1.fq.gz | seqkit grep -r -p ^E00582:592:HHF5YCCX2:3:2101:5863:59604 #不到三分钟就好了
GGTAACAATTTCATAATTTTTTCTTCCGTAGTAACAGAACAAAGACTGTCTCTTATACACATCTCCGAGCCCACGAGACCGTACTAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAGAGAGGGGCAGGAAGGCGCCAGGCACCGGGGC
+
<AAFFJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJFFJJJFJJJJ<AAJJJJJJJJJJJJJFJJJFAFJJFJ7--7-<----7--7-7-7---777----7---7
zcat My_R2.fq.gz | seqkit grep -r -p ^E00582:592:HHF5YCCX2:3:2101:5863:59604 # 如法炮制read2
TCTTTGTTCTGTTACTACGGAAGAAAAAATTATGAAATTGTTACCCTGTCTCTTATACACATCTGACGCTGCCGACGAATAGAGAGGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAGAAAAAGGGGGGAGAGGGGGCGGGGGGAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJFJJFJJJJJJJJJ7FJJJJJJJJJJJJFJJJJJJFJFJJJAJJJFJ7-----77-77<-7-7--7--A<----))
seqkit是shenwei爪哥开发的处理Fasta/Fastq文件的万能工具。
处理fq/fa文件时花时间写的一些脚本,在seqkit里直接能一行命令就解决。实在是提升效率,整合流程中十分好的工具。
Seqkit官方(https://bioinf.shenwei.me/seqkit/usage/),有兴趣的同学可以自己学习学习。
我后边也会出一期学习经验分享给大家_
————————————————
---------------------------------------------------------------------------------------------------------------------------------------------------I`m a line ! Thanks !-------------------------------------------------------------------------------------------------------------------------------------------------------------------------