
Still using FastQC in 2022? A tutorial on an ultra-fast FASTQ preprocessing tool

2022-04-08  Jason数据分析生信教室

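Once raw NGS reads come off the sequencer, the usual routine is FastQC + Cutadapt + Trimmomatic: check the quality, strip the adapter sequences, and put the raw data through a heavy round of filtering. The catch is that this workflow reads and writes the data several times, which is slow. So here is a smarter tool that rolls the functions of all three into one: fastp.
fastp can detect adapter sequences in FASTQ data automatically, handles both single-end and paired-end input, and supports long as well as short reads. Adapters from the common sequencing platforms are recognized without being specified by hand. It also identifies low-quality reads and filters them out automatically. According to the fastp paper, it runs 2 to 5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt.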

Main features

To quote the original:

  1. filter out bad reads (too low quality, too short, or too many N...)
  2. cut low quality bases for per read in its 5' and 3' by evaluating the mean quality from a sliding window (like Trimmomatic but faster).
  3. trim all reads in front and tail
  4. cut adapters. Adapter sequences can be automatically detected, which means you don't have to input the adapter sequences to trim them.
  5. correct mismatched base pairs in overlapped regions of paired end reads, if one base is with high quality while the other is with ultra low quality
  6. trim polyG in 3' ends, which is commonly seen in NovaSeq/NextSeq data. Trim polyX in 3' ends to remove unwanted polyX tailing (i.e. polyA tailing for mRNA-Seq data)
  7. preprocess unique molecular identifier (UMI) enabled data, shift UMI to sequence name.
  8. report JSON format result for further interpreting.
  9. visualize quality control and filtering results on a single HTML page (like FASTQC but faster and more informative).
  10. split the output to multiple files (0001.R1.gz, 0002.R1.gz...) to support parallel processing. Two modes can be used, limiting the total split file number, or limiting the lines of each split file.
  11. support long reads (data from PacBio / Nanopore devices).
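Several of the features quoted above map directly onto command-line flags. Here is a minimal sketch for a paired-end NovaSeq-style run; the function name `preprocess_pe`, the file names, and the window settings are illustrative assumptions, not something from the original post.

```shell
# Sketch only: placeholder names, flags chosen for illustration.
# $1 = read 1 FASTQ, $2 = read 2 FASTQ, $3 = output prefix
#
# --cut_tail / --cut_window_size / --cut_mean_quality : sliding-window trimming from the 3' end (feature 2)
# --correction   : fix mismatched bases in the overlap of paired-end reads (feature 5)
# --trim_poly_g  : trim polyG tails (feature 6)
# --trim_poly_x  : trim other polyX tails (feature 6)
# --json/--html  : write both report formats (features 8 and 9)
preprocess_pe() {
    fastp \
        -i "$1" -I "$2" \
        -o "${3}.R1.clean.fq.gz" -O "${3}.R2.clean.fq.gz" \
        --cut_tail --cut_window_size 4 --cut_mean_quality 20 \
        --correction \
        --trim_poly_g \
        --trim_poly_x \
        --json "${3}.fastp.json" --html "${3}.fastp.html"
}
```

You would call it as `preprocess_pe sample_R1.fq.gz sample_R2.fq.gz sample`, assuming those input files exist.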

Installation

You can build it from the git repository or install it with conda.

# git
git clone https://github.com/OpenGene/fastp.git
cd fastp 
make 
sudo make install

#bioconda
conda install -c bioconda -y fastp
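
After either route it is worth confirming that the binary is actually on your PATH. A small sketch; the helper name `check_fastp` is made up for this example.

```shell
# Sketch: confirm fastp is reachable and print its version.
check_fastp() {
    if command -v fastp >/dev/null 2>&1; then
        fastp --version
    else
        echo "fastp not found on PATH" >&2
        return 1
    fi
}
```

Run `check_fastp` in the same shell session (with the conda environment activated, if you installed it that way).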

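Running the software

By default, fastp applies quality filtering, length filtering, and adapter trimming. A low complexity filter is also available, but it is off by default and has to be enabled with -y.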
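
Each of these filters can be switched on or off individually. A rough sketch for a single-end file follows; the function names, the file arguments, and the 30% threshold are illustrative choices rather than recommendations from the original post.

```shell
# Sketch: toggle fastp's filters for one single-end file.
# $1 = input FASTQ, $2 = output FASTQ

# Turn on the low complexity filter (off by default) with a 30% threshold.
filter_low_complexity() {
    fastp -i "$1" -o "$2" --low_complexity_filter --complexity_threshold 30
}

# Adapter trimming only: switch off quality and length filtering.
trim_adapters_only() {
    fastp -i "$1" -o "$2" --disable_quality_filtering --disable_length_filtering
}
```

For example, `trim_adapters_only single.fq trimmed.fq.gz` would keep every read and only remove adapters.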

fastp -i single.fq -o cleaned.fq.gz -w 3 -q 15 -n 10

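Paired-end data works too. The command below also writes HTML and JSON reports, trims one base from the tail of each read (-t 1 -T 1), applies sliding-window quality trimming from the 3' end (-3), discards reads shorter than 20 bp (-l 20), and uses 16 CPU threads (-w 16).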

fastp -i pair1.fq -I pair2.fq -3 \
 -o out_pair1.fq.gz -O out_pair2.fq.gz \
 -h report.html -j report.json -q 15 -n 10 -t 1 -T 1 -l 20 -w 16
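
With more than a handful of samples you will probably want a loop. The sketch below assumes, hypothetically, that the files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`; adjust the suffixes to match your own data. The function name `run_fastp_batch` is also made up for this example.

```shell
# Sketch: run paired-end fastp over every sample in a directory.
# $1 = input directory, $2 = output directory
run_fastp_batch() {
    in_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    for r1 in "$in_dir"/*_R1.fastq.gz; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$r1" ] || continue
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: mate file not found" >&2
            continue
        fi
        sample="$(basename "$r1" _R1.fastq.gz)"
        # One HTML and one JSON report per sample.
        fastp -i "$r1" -I "$r2" \
            -o "$out_dir/${sample}_R1.clean.fq.gz" \
            -O "$out_dir/${sample}_R2.clean.fq.gz" \
            -h "$out_dir/${sample}.html" -j "$out_dir/${sample}.json" \
            -q 15 -n 10 -l 20 -w 16
    done
}
```

Calling `run_fastp_batch raw_data clean_data` processes each complete pair in turn and prints a warning for any R1 file without a mate.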

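Once the run finishes, you can open the resulting report.

[Figure: read quality distribution]

[Figure: base content]

[Figure: k-mer overrepresentation analysis]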
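
The JSON report is handy when you want the numbers without opening a browser. A small sketch, assuming `jq` is installed; the field names reflect my understanding of the report layout, so check them against your own report.json. The helper name `summarize_report` is hypothetical.

```shell
# Sketch: print read counts before and after filtering from a fastp JSON report.
# Assumes jq is available and that the report has a "summary" section
# with "before_filtering" and "after_filtering" entries.
# $1 = path to the JSON report
summarize_report() {
    jq -r '"\(.summary.before_filtering.total_reads) reads before, \(.summary.after_filtering.total_reads) reads after filtering"' "$1"
}
```

For the paired-end command shown earlier, that would be `summarize_report report.json`.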

It really is that simple. No more worrying about how to preprocess FASTQ files.

References

fastp: an ultra-fast all-in-one FASTQ preprocessor
Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu
Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890.

fastp: an ultra-fast all-in-one FASTQ preprocessor
Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu
bioRxiv preprint, first posted online Mar. 1, 2018.
doi: http://dx.doi.org/10.1101/274100

PDF
https://www.biorxiv.org/content/biorxiv/early/2018/03/01/274100.full.pdf
