Data analysis 2: splitting and filtering .sra data
1. Splitting
fastq-dump --gzip --split-3 SRR6449842
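With --split-3, a paired-end run yields SRR6449842_1.fastq.gz and SRR6449842_2.fastq.gz (plus SRR6449842.fastq.gz for any unpaired reads), while a single-end run yields one file such as SRR2050895.fastq.gz. When several accessions need splitting, the command can be wrapped in a loop. The following is only a sketch: the function name split_sra and the DRY_RUN variable are made up for illustration, and by default it prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# split_sra is a hypothetical helper: one fastq-dump call per accession.
split_sra() {
    local acc
    for acc in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "fastq-dump --gzip --split-3 $acc"
        else
            fastq-dump --gzip --split-3 "$acc"
        fi
    done
}

# The two accessions that appear in this note.
split_sra SRR6449842 SRR2050895
```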
2. Quality check with fastqc
Reference:
https://zhuanlan.zhihu.com/p/20731723
fastqc -o [output directory] -t [number of threads] SRR2050895.fastq.gz
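fastqc does not create the output directory itself, so the directory passed to -o has to exist before the call. A small wrapper can take care of that. This is a sketch only: run_fastqc, the directory name fastqc_raw and the thread count 4 are placeholders, not values from the original run.

```shell
#!/usr/bin/env bash
# run_fastqc OUTDIR THREADS FILE...
# Hypothetical helper: creates OUTDIR first, then calls fastqc on the given files.
run_fastqc() {
    local outdir="$1" threads="$2"
    shift 2
    mkdir -p "$outdir" || return 1
    if ! command -v fastqc >/dev/null 2>&1; then
        echo "fastqc not found on PATH" >&2
        return 127
    fi
    fastqc -o "$outdir" -t "$threads" "$@"
}

# Example call (placeholder directory and thread count):
# run_fastqc fastqc_raw 4 SRR2050895.fastq.gz
```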
3. Filtering
3.1 Paired-end sequencing, adapter sequence unknown
fastp --detect_adapter_for_pe -w 8 -i SRR6449842_1.fastq.gz -I SRR6449842_2.fastq.gz -o clean_SRR6449842_1.fastq.gz -O clean_SRR6449842_2.fastq.gz -j SRR6449842_report.json -h SRR6449842_report.html
3.2 Single-end sequencing, adapter sequence unknown
fastp -w 16 -i SRR2050895.fastq.gz -o clean_SRR2050895.fastq.gz -j SRR2050895_report.json -h SRR2050895_report.html
Output:
Detecting adapter sequence for read1...
GTGTAAGCATCTGGGTAGTCTGAGTAGCGTCGTGGTATTCCTGAAAGGCCCAGGAAATGT
Read1 before filtering:
total reads: 45230766
total bases: 2236673891
Q20 bases: 2219017453(99.2106%)
Q30 bases: 2172754983(97.1422%)
Read1 after filtering:
total reads: 45099144
total bases: 2229459599
Q20 bases: 2211862439(99.2107%)
Q30 bases: 2165728977(97.1414%)
Filtering result:
reads passed filter: 45099144
reads failed due to low quality: 352
reads failed due to too many N: 64
reads failed due to too short: 131206
reads with adapter trimmed: 206061
bases trimmed due to adapters: 7694138
Duplication rate (may be overestimated since this is SE data): 57.3777%
JSON report: SRR2050895_report.json
HTML report: SRR2050895_report.html
fastp -w 16 -i SRR2050895.fastq.gz -o clean_SRR2050895.fastq.gz -j SRR2050895_report.json -h SRR2050895_report.html
fastp v0.20.0, time used: 226 seconds
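The summary can be sanity-checked with a bit of arithmetic: the reads that passed plus the three failure categories should equal the input total, and the share of reads kept follows from the same numbers. A quick check, with all values copied from the output above (the variable names are just for this example):

```shell
#!/usr/bin/env bash
# Numbers copied from the fastp summary above.
total=45230766
passed=45099144
low_quality=352
too_many_n=64
too_short=131206

# Passed and failed reads should add up to the input total.
sum=$((passed + low_quality + too_many_n + too_short))
echo "sum of categories: $sum (input: $total)"

# Percentage of reads kept, rounded to two decimals.
kept=$(awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.2f", 100 * p / t }')
echo "reads kept: ${kept}%"
```

The categories add up to 45230766, matching the input, and about 99.71% of the reads are kept; almost all of the removed reads fall into the "too short" category.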
Running fastqc again for quality control showed a problem with the GC content of the first 11 bases.
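This was not done here, but if the first 11 bases were to be removed, fastp has a -f (--trim_front1) option that trims a fixed number of bases from the start of read1. A sketch that only prints such a command; trim_front_cmd and the trim11_ output names are invented for the example, and whether trimming is appropriate at all depends on the library.

```shell
#!/usr/bin/env bash
# trim_front_cmd ACCESSION N
# Hypothetical helper: prints a single-end fastp command that additionally
# trims the first N bases of every read via -f (--trim_front1).
trim_front_cmd() {
    local acc="$1" n="$2"
    echo "fastp -w 16 -f $n -i ${acc}.fastq.gz -o trim${n}_${acc}.fastq.gz -j trim${n}_${acc}_report.json -h trim${n}_${acc}_report.html"
}

# Print the command for the single-end sample in this note.
trim_front_cmd SRR2050895 11
```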
Leaving that aside for now; today, 20 January 2021, the first job is to download the remaining .fq files.
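Once the remaining files are downloaded and split, the two fastp commands from section 3 could be chosen automatically per accession, depending on whether fastq-dump produced _1/_2 files or a single file. A sketch under that assumption; fastp_cmd is a made-up name, it looks only in the current directory, and it prints the command instead of running it.

```shell
#!/usr/bin/env bash
# fastp_cmd ACCESSION
# Hypothetical helper: prints the paired-end command from 3.1 if both mate
# files exist in the current directory, otherwise the single-end command from 3.2.
fastp_cmd() {
    local acc="$1"
    if [ -f "${acc}_1.fastq.gz" ] && [ -f "${acc}_2.fastq.gz" ]; then
        echo "fastp --detect_adapter_for_pe -w 8 -i ${acc}_1.fastq.gz -I ${acc}_2.fastq.gz -o clean_${acc}_1.fastq.gz -O clean_${acc}_2.fastq.gz -j ${acc}_report.json -h ${acc}_report.html"
    elif [ -f "${acc}.fastq.gz" ]; then
        echo "fastp -w 16 -i ${acc}.fastq.gz -o clean_${acc}.fastq.gz -j ${acc}_report.json -h ${acc}_report.html"
    else
        echo "no FASTQ files found for $acc" >&2
        return 1
    fi
}

# Example: print the commands for the accessions in this note.
# for acc in SRR6449842 SRR2050895; do fastp_cmd "$acc"; done
```

Piping the printed lines to sh would run them; printing first makes it easy to check the file names before starting a long job.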