
Fastp: filtering next-generation sequencing data

2021-11-18  胡童远

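Overview

Fastp detects and removes adapters, corrects bases in the overlap region of paired-end (PE) reads, trims read heads and tails with a sliding window, trims polyG/polyX tails, and preprocesses UMIs. It combines these functions in one tool, runs fast, gives good results, and produces readable reports. For these tasks fastp can stand in for Trimmomatic, FastQC, Cutadapt, AfterQC, and SOAPnuke.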

The fastp paper

Title: fastp: an ultra-fast all-in-one FASTQ preprocessor
Journal: Bioinformatics
Citations: 2414 (Google Scholar, 2021-11-18)

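Workflow

Fast

More accurate and efficient adapter removal

Fewest mismatched bases, clipped reads, and single-read mappings when aligned to the hg19 human reference genome

Efficient UMI preprocessing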

Where to get fastp

GitHub: https://github.com/OpenGene/fastp

Installing fastp

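conda create -n readqc
# activate the new environment first, otherwise fastp is installed into the current one
conda activate readqc
# fastp is distributed through the bioconda channel
conda install -c bioconda -c conda-forge fastp
fastp --version
# fastp 0.23.1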

Running fastp

conda activate readqc
time fastp \
--in1 ./input/E100032181_L01_29_1.fq.gz \
--in2 ./input/E100032181_L01_29_2.fq.gz \
--out1 ./fastp/E100032181_L01_29_1.fq.gz \
--out2 ./fastp/E100032181_L01_29_2.fq.gz \
--json ./fastp/fastp.json \
--html ./fastp/fastp.html \
--trim_poly_g --poly_g_min_len 10 \
--trim_poly_x --poly_x_min_len 10 \
--cut_front --cut_tail --cut_window_size 4 \
--qualified_quality_phred 15 \
--low_complexity_filter \
--complexity_threshold 30 \
--length_required 30 \
--thread 4
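
When several samples follow the same naming scheme as the one above, the command can be generated per sample instead of edited by hand. The following is a minimal sketch, assuming inputs are named <sample>_1.fq.gz and <sample>_2.fq.gz; the helper name fastp_cmd, its arguments, and the per-sample report names are hypothetical and not part of fastp. It only prints the command so it can be checked before running, and the trimming and filtering options from the command above would still need to be appended.

```shell
# Print (do not run) a fastp command for one paired-end sample.
# Assumption: inputs are named <sample>_1.fq.gz and <sample>_2.fq.gz.
# fastp_cmd and its arguments are illustrative names, not part of fastp.
# The body runs in a subshell "( ... )" so its variables stay local.
fastp_cmd() (
    sample="$1"
    in_dir="${2:-./input}"
    out_dir="${3:-./fastp}"
    printf 'fastp --in1 %s --in2 %s --out1 %s --out2 %s --json %s --html %s --thread 4\n' \
        "${in_dir}/${sample}_1.fq.gz" "${in_dir}/${sample}_2.fq.gz" \
        "${out_dir}/${sample}_1.fq.gz" "${out_dir}/${sample}_2.fq.gz" \
        "${out_dir}/${sample}.json" "${out_dir}/${sample}.html"
)

# Dry run for the sample used in this post.
fastp_cmd E100032181_L01_29
```

Writing one JSON and one HTML report per sample avoids each run overwriting the previous fastp.json and fastp.html.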

Parameters

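--trim_poly_g  trim polyG tails
--poly_g_min_len 10  minimum polyG length to detect: 10 bp
--trim_poly_x  trim polyX tails
--poly_x_min_len 10  minimum polyX length to detect: 10 bp
--cut_front  slide a window from the 5' end toward the 3' end, dropping windows whose mean quality is below the threshold
--cut_tail  slide a window from the 3' end toward the 5' end, dropping windows whose mean quality is below the threshold
--cut_window_size 4  window size: 4 bp
--cut_mean_quality 20  minimum mean base quality within a window: 20 (the fastp default; not set explicitly in the command above)
--qualified_quality_phred 15  a base counts as qualified at phred quality >= 15
--low_complexity_filter  enable the low-complexity filter
--complexity_threshold 30  required complexity: 30%
--length_required 30  discard reads shorter than 30 bp after trimming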
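
To make the 30% complexity threshold concrete: the fastp documentation defines complexity as the percentage of bases that differ from the next base. The sketch below computes that percentage for a single sequence with awk. The helper name complexity_pct is hypothetical, and dividing by the number of adjacent base pairs (length minus 1) is an assumption about the exact denominator, so treat it as an illustration rather than fastp's implementation.

```shell
# Percentage of positions whose base differs from the next base.
# complexity_pct is an illustrative helper, not a fastp command.
# Assumption: the denominator is the number of adjacent pairs (length - 1).
complexity_pct() {
    printf '%s\n' "$1" | awk '{
        n = length($0)
        diff = 0
        for (i = 1; i < n; i++)
            if (substr($0, i, 1) != substr($0, i + 1, 1)) diff++
        if (n > 1) printf "%.1f\n", 100 * diff / (n - 1)
        else print "0.0"
    }'
}

complexity_pct AAAAAAAAAA   # homopolymer: no base differs from its neighbour
complexity_pct AAAAACAAAA   # one C in a run of A: 2 of 9 adjacent pairs differ
complexity_pct ACGTACGTAC   # every base differs from its neighbour
```

Under this definition the first two sequences fall below 30% and would be dropped by --low_complexity_filter with --complexity_threshold 30, while the third passes.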

Run log

Read1 before filtering:
total reads: 68871423
total bases: 6887142300
Q20 bases: 6788565208(98.5687%)
Q30 bases: 6516393608(94.6168%)

Read2 before filtering:
total reads: 68871423
total bases: 6887142300
Q20 bases: 6752497708(98.045%)
Q30 bases: 6459072061(93.7845%)

Read1 after filtering:
total reads: 68870151
total bases: 6579451130
Q20 bases: 6490255475(98.6443%)
Q30 bases: 6233038928(94.7349%)

Read2 after filtering:
total reads: 68870151
total bases: 6570653779
Q20 bases: 6449906989(98.1623%)
Q30 bases: 6173217216(93.9513%)

Filtering result:
reads passed filter: 137740302
reads failed due to low quality: 32
reads failed due to too many N: 936
reads failed due to too short: 1480
reads failed due to low complexity: 96
reads with adapter trimmed: 24272074
bases trimmed due to adapters: 604687721
reads with polyX in 3' end: 698520
bases trimmed in polyX tail: 6954246

Duplication rate: 69.3962%

Insert size peak (evaluated by paired-end reads): 141

JSON report: ./fastp/fastp.json
HTML report: ./fastp/fastp.html

fastp --in1 ./input/E100032181_L01_29_1.fq.gz --in2 ./input/E100032181_L01_29_2.fq.gz --out1 ./fastp/E100032181_L01_29_1.fq.gz --out2 ./fastp/E100032181_L01_29_2.fq.gz --json ./fastp/fastp.json --html ./fastp/fastp.html --trim_poly_x --poly_x_min_len 10 --cut_front --cut_tail --cut_window_size 4 --qualified_quality_phred 15 --low_complexity_filter --complexity_threshold 30 --length_required 30 --thread 4
fastp v0.23.1, time used: 567 seconds

real    9m28.522s
user    39m31.517s
sys     0m37.690s
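
The numbers in this log can be cross-checked with a little arithmetic. The sketch below copies the values from the log above, checks that the reads removed equal the sum of the four failure categories, and computes the fraction of reads kept and the ratio of CPU time to wall time. The variable names are illustrative only.

```shell
# Numbers copied from the fastp log above.
reads_in=$((68871423 * 2))                # Read1 + Read2 before filtering
reads_passed=137740302                    # "reads passed filter"
reads_failed=$((32 + 936 + 1480 + 96))    # low quality + too many N + too short + low complexity

echo "reads in:      $reads_in"
echo "reads dropped: $((reads_in - reads_passed)) (sum of failure lines: $reads_failed)"
awk -v p="$reads_passed" -v t="$reads_in" \
    'BEGIN { printf "reads kept:    %.4f%%\n", 100 * p / t }'

# user (CPU) time divided by real (wall) time approximates
# how many threads were kept busy during the run.
awk 'BEGIN { printf "user/real:     %.2f\n", (39 * 60 + 31.517) / (9 * 60 + 28.522) }'
```

Both counts come to 2544 reads, so almost all reads survive filtering; most of the lost bases come from adapter and polyX trimming rather than from discarded reads. The user/real ratio is close to 4, consistent with --thread 4.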

Fastp results

Example HTML report: http://opengene.org/fastp/fastp.html
Example JSON report: http://opengene.org/fastp/fastp.json
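
The JSON report is convenient for pulling headline numbers into a table across samples. The following is a minimal sketch that assumes jq is installed and that the report contains the keys summary.before_filtering.total_reads, summary.after_filtering.total_reads, summary.after_filtering.q30_rate, and duplication.rate; check these key names against your own fastp.json before relying on them. The helper name fastp_summary is hypothetical.

```shell
# Print a few headline numbers from a fastp JSON report as one tab-separated line.
# Assumptions: jq is installed, and the report uses the key names below.
# fastp_summary is an illustrative helper, not part of fastp.
fastp_summary() {
    jq -r '[
        .summary.before_filtering.total_reads,
        .summary.after_filtering.total_reads,
        .summary.after_filtering.q30_rate,
        .duplication.rate
    ] | @tsv' "$1"
}

# Usage, with the report path from the command in this post:
# fastp_summary ./fastp/fastp.json
```

The output columns are reads before filtering, reads after filtering, Q30 rate after filtering, and duplication rate.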

