
Data analysis 2: splitting and filtering .sra data

2021-01-19  王铄_d468

1. Splitting

fastq-dump --gzip --split-3 SRR6449842
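With --split-3, a paired-end run is written to <accession>_1.fastq.gz and <accession>_2.fastq.gz, while a single-end run (or reads without a mate) goes to <accession>.fastq.gz. The sketch below checks which of these files exist after the dump. The function name sra_layout and its optional directory argument are my own additions for illustration, not part of the SRA Toolkit.

```shell
# Hypothetical helper: report the layout of a run after
# `fastq-dump --gzip --split-3 <accession>` has finished.
# $1 = accession, $2 = directory holding the output (default: current directory)
sra_layout() {
    acc=$1
    dir=${2:-.}
    if [ -f "$dir/${acc}_1.fastq.gz" ] && [ -f "$dir/${acc}_2.fastq.gz" ]; then
        echo paired
    elif [ -f "$dir/${acc}.fastq.gz" ]; then
        echo single
    else
        echo missing
    fi
}

# Example (run in the directory where fastq-dump wrote its output):
#   sra_layout SRR6449842
```

The answer decides which of the fastp commands in section 3 applies.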

2. Quality check with fastqc

Reference:
https://zhuanlan.zhihu.com/p/20731723

fastqc -o [output directory] -t [number of threads] SRR2050895.fastq.gz
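FastQC does not create the output directory for you, so it has to exist before the command runs. A minimal sketch that takes care of this is shown here. The function name run_fastqc, the directory name fastqc_raw, the thread count, and the FASTQC override variable are all assumptions made for this example.

```shell
# Sketch: create the -o directory first, then call fastqc.
# $1 = output directory, $2 = threads, remaining arguments = FASTQ files.
# FASTQC is a hypothetical override for the executable (defaults to fastqc).
run_fastqc() {
    outdir=$1
    threads=$2
    shift 2
    mkdir -p "$outdir" && ${FASTQC:-fastqc} -o "$outdir" -t "$threads" "$@"
}

# Example:
#   run_fastqc fastqc_raw 4 SRR2050895.fastq.gz
```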

3. Filtering

3.1 Paired-end sequencing, adapter sequence unknown

fastp --detect_adapter_for_pe -w 8 -i SRR6449842_1.fastq.gz -I SRR6449842_2.fastq.gz -o clean_SRR6449842_1.fastq.gz -O clean_SRR6449842_2.fastq.gz -j SRR6449842_report.json -h SRR6449842_report.html

3.2 Single-end sequencing, adapter sequence unknown

fastp  -w 16 -i SRR2050895.fastq.gz -o clean_SRR2050895.fastq.gz  -j SRR2050895_report.json -h SRR2050895_report.html
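The two cases differ only in the input and output files and in the --detect_adapter_for_pe flag: for single-end data fastp detects adapters by default, while for paired-end data the detection has to be switched on explicitly. When several accessions need cleaning, the choice can be automated roughly as follows. The function name fastp_cmd and the default of 8 threads are assumptions for this sketch; it only prints the command so it can be reviewed before running.

```shell
# Hypothetical helper: print the fastp command for one accession, using the
# same options as sections 3.1 and 3.2.
# $1 = accession, $2 = threads (default 8). Looks for files in the current directory.
fastp_cmd() {
    acc=$1
    threads=${2:-8}
    report="-j ${acc}_report.json -h ${acc}_report.html"
    if [ -f "${acc}_1.fastq.gz" ] && [ -f "${acc}_2.fastq.gz" ]; then
        # Paired-end: both mate files exist
        echo "fastp --detect_adapter_for_pe -w $threads" \
             "-i ${acc}_1.fastq.gz -I ${acc}_2.fastq.gz" \
             "-o clean_${acc}_1.fastq.gz -O clean_${acc}_2.fastq.gz $report"
    else
        # Single-end: one file per accession
        echo "fastp -w $threads -i ${acc}.fastq.gz -o clean_${acc}.fastq.gz $report"
    fi
}

# Example: print the commands, check them, then pipe to sh to run them.
#   for acc in SRR6449842 SRR2050895; do fastp_cmd "$acc"; done
```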

The output is:

Detecting adapter sequence for read1...
GTGTAAGCATCTGGGTAGTCTGAGTAGCGTCGTGGTATTCCTGAAAGGCCCAGGAAATGT

Read1 before filtering:
total reads: 45230766
total bases: 2236673891
Q20 bases: 2219017453(99.2106%)
Q30 bases: 2172754983(97.1422%)

Read1 after filtering:
total reads: 45099144
total bases: 2229459599
Q20 bases: 2211862439(99.2107%)
Q30 bases: 2165728977(97.1414%)

Filtering result:
reads passed filter: 45099144
reads failed due to low quality: 352
reads failed due to too many N: 64
reads failed due to too short: 131206
reads with adapter trimmed: 206061
bases trimmed due to adapters: 7694138

Duplication rate (may be overestimated since this is SE data): 57.3777%

JSON report: SRR2050895_report.json
HTML report: SRR2050895_report.html

fastp -w 16 -i SRR2050895.fastq.gz -o clean_SRR2050895.fastq.gz -j SRR2050895_report.json -h SRR2050895_report.html
fastp v0.20.0, time used: 226 seconds
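As a quick sanity check, the numbers in this log can be cross-checked with shell arithmetic: the reads lost between "before" and "after" should equal the sum of the three failure counts. The variable names below are my own; the values are copied from the log.

```shell
# Values copied from the fastp log above.
before=45230766
after=45099144
low_quality=352
too_many_n=64
too_short=131206

# Reads lost overall vs. reads counted under the three failure reasons
removed=$((before - after))
failed=$((low_quality + too_many_n + too_short))
echo "removed: $removed, failed filters: $failed"

# Share of reads that passed the filter
awk -v a="$after" -v b="$before" 'BEGIN { printf "retained: %.2f%%\n", 100 * a / b }'
```

Both counts come to 131622, so about 99.71% of the reads passed, and almost all of the discarded reads were dropped for being too short.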

Running fastqc again on the cleaned file for quality control shows a problem with the GC content of the first 11 bases.
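To see which modules FastQC flagged without opening the HTML report, the summary.txt inside the FastQC zip archive can be filtered. It is a tab-separated file with the status (PASS, WARN or FAIL), the module name and the input file name. This is only a sketch: the function name fastqc_flagged is made up, and the zip file name in the example assumes FastQC was run on clean_SRR2050895.fastq.gz with output in the current directory.

```shell
# Sketch: list the FastQC modules that did not PASS.
# Reads the tab-separated summary.txt (status, module, file name) on stdin.
fastqc_flagged() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Example (assumed file names, see above):
#   unzip -p clean_SRR2050895_fastqc.zip clean_SRR2050895_fastqc/summary.txt | fastqc_flagged
```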

I will leave that for now. Today, 2021-01-20, the first thing is to download the remaining .fq files.
