Trimmomatic: a flexible read trimming and quality-control tool for Illumina NGS data
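Website: http://www.usadellab.org/cms/index.php?page=trimmomatic
Manual: http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
In my view, the quickest way to learn the tool is to read the PDF above, run a command yourself, work out what each parameter means, and compare the FastQC reports before and after trimming. First, though, make sure you understand what quality control actually does: what an adapter is, what an N base is, and so on.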
Installation
Installing with conda
$ conda create -n rna3 python=3
$ conda activate rna3
$ conda install -c bioconda trimmomatic
$ trimmomatic --help
Usage:
PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
or:
SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
or:
-version
$ trimmomatic -version
0.39
Installing from the binary release
$ wget -c http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
$ unzip Trimmomatic-0.39.zip
$ cd Trimmomatic-0.39/
$ java -jar trimmomatic-0.39.jar
Usage:
PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
or:
SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
or:
-version
$ alias Trimmomatic="java -jar /home/qmcui/Trimmomatic-0.39/trimmomatic-0.39.jar"
# Add this line to ~/.bashrc to make the alias permanent
$ Trimmomatic --help
Paired-end example 1
$ Trimmomatic PE \
/teach/project/1.rna/2.raw_fq/SRR1039510_1.fastq.gz \
/teach/project/1.rna/2.raw_fq/SRR1039510_2.fastq.gz \
./output_forward_paired.fq.gz ./output_forward_unpaired.fq.gz \
./output_reverse_paired.fq.gz ./output_reverse_unpaired.fq.gz \
ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10:2:keepBothReads \
LEADING:3 \
TRAILING:3 \
MINLEN:36
# Check on the running process
$ ps -ef | grep qmcui
Output
TrimmomaticPE: Started with arguments:
/teach/project/1.rna/2.raw_fq/SRR1039510_1.fastq.gz /teach/project/1.rna/2.raw_fq/SRR1039510_2.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
Multiple cores found: Using 4 threads
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 22852619 Both Surviving: 22393983 (97.99%) Forward Only Surviving: 285634 (1.25%) Reverse Only Surviving: 138619 (0.61%) Dropped: 34383 (0.15%)
TrimmomaticPE: Completed successfully
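# Run time was about 8 min. The compressed read 1 and read 2 input files are 1.3G, each with 91410476 lines, i.e. 22852619 reads, which matches the Input Read Pairs count above.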
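The summary line in the log above can be checked mechanically. The following Python sketch pulls the counts out of that line and confirms that the four outcome categories add up to the number of input pairs. The function name `parse_pe_summary` is made up for this example and is not part of Trimmomatic.

```python
import re

def parse_pe_summary(line):
    """Extract the read-pair counts from a TrimmomaticPE summary line."""
    labels = ["Input Read Pairs", "Both Surviving", "Forward Only Surviving",
              "Reverse Only Surviving", "Dropped"]
    counts = {}
    for label in labels:
        match = re.search(re.escape(label) + r":\s*(\d+)", line)
        if match is None:
            raise ValueError(f"missing field: {label}")
        counts[label] = int(match.group(1))
    return counts

# Summary line copied from the run above.
summary = ("Input Read Pairs: 22852619 Both Surviving: 22393983 (97.99%) "
           "Forward Only Surviving: 285634 (1.25%) "
           "Reverse Only Surviving: 138619 (0.61%) Dropped: 34383 (0.15%)")

counts = parse_pe_summary(summary)
total = counts["Input Read Pairs"]
accounted = sum(v for k, v in counts.items() if k != "Input Read Pairs")
both_pct = round(100 * counts["Both Surviving"] / total, 2)
print(total, accounted, both_pct)
```

A check like this is handy when collecting statistics from many samples into one table.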
Paired-end example 2
# Trimmomatic is the alias for java -jar trimmomatic-0.39.jar defined above
$ Trimmomatic PE -phred33 \
input_forward.fq.gz input_reverse.fq.gz \
output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:36
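When many samples need the same treatment, it can help to assemble the command programmatically. This is a minimal Python sketch that builds the argument list in the order shown by the usage text (inputs, then the four outputs, then the trimming steps). The helper `build_pe_command`, the output naming scheme, and the jar location are assumptions made for illustration.

```python
def build_pe_command(jar, forward, reverse, out_prefix, steps, phred="-phred33"):
    """Assemble a Trimmomatic PE argument list: inputs, 4 outputs, then steps."""
    outputs = [
        f"{out_prefix}_forward_paired.fq.gz",
        f"{out_prefix}_forward_unpaired.fq.gz",
        f"{out_prefix}_reverse_paired.fq.gz",
        f"{out_prefix}_reverse_unpaired.fq.gz",
    ]
    return ["java", "-jar", jar, "PE", phred, forward, reverse, *outputs, *steps]

cmd = build_pe_command(
    "trimmomatic-0.39.jar",  # assumed to sit in the current directory
    "input_forward.fq.gz", "input_reverse.fq.gz", "output",
    ["ILLUMINACLIP:TruSeq3-PE.fa:2:30:10", "LEADING:3", "TRAILING:3",
     "SLIDINGWINDOW:4:15", "MINLEN:36"],
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to launch the job; keeping it as a list avoids shell quoting problems with unusual file names.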
Single-end example
$ java -jar trimmomatic-0.39.jar SE -phred33 input.fq.gz output.fq.gz ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
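Parameter explanation
ILLUMINACLIP removes adapters and takes up to six colon-separated fields, e.g. adapters/TruSeq3-PE.fa:2:30:10:2:keepBothReads (the adapter FASTA file; the number of seed mismatches allowed, 2; the palindrome clip threshold, 30; the simple clip threshold, 10; the minimum adapter length to clip, 2 (default 8); and keepBothReads, which is set to true or false. It is best enabled for paired-end data, and is required when a downstream tool such as bowtie expects both mates)
LEADING:3 removes leading low-quality or N bases (quality below 3)
TRAILING:3 removes trailing low-quality or N bases (quality below 3)
SLIDINGWINDOW:4:15 scans the read with a 4-base-wide sliding window and cuts once the average quality per base in the window drops below 15
MINLEN:36 drops reads shorter than 36 nt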
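To make LEADING:3 and TRAILING:3 concrete, here is a rough Python sketch that decodes a Phred+33 quality string and strips bases below the threshold from both ends of a read. The function names are invented for this example, and the code only illustrates the idea; it is not how Trimmomatic is implemented internally.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]

def trim_leading_trailing(seq, qual, leading=3, trailing=3):
    """Strip bases below the given quality from the start and the end."""
    scores = phred33_scores(qual)
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1
    return seq[start:end], qual[start:end]

# '#' encodes quality 2 (typical for N calls), 'I' encodes quality 40.
seq, qual = trim_leading_trailing("NNACGTACGTN", "##IIIIIIII#")
print(seq, qual)
```

N bases are usually reported with a very low quality, which is why a threshold of 3 is enough to remove them from the read ends.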
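All parameters
If sliding windows are unfamiliar, a quick web search will cover the basics.
# Options shown by --help
[-threads <threads>] number of threads; chosen automatically if not given
[-phred33 | -phred64] quality encoding; a legacy of older Illumina pipelines, and data produced today are almost always phred33
# Short scripts for checking whether your data are phred33 or phred64 are easy to find online
# Trimmomatic also reports the encoding it detected when a job runs, for example:
# Quality encoding detected as phred33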
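For a quick manual check of the encoding, the range of ASCII codes in the quality lines is usually enough. The Python sketch below is a simple heuristic written for this note; it is not Trimmomatic's own detection logic, and the cut-off values (59 and 74) are rule-of-thumb assumptions.

```python
def guess_phred_offset(quality_strings):
    """Guess the quality offset (33 or 64) from the ASCII codes seen.

    Heuristic only: codes below 59 cannot occur in +64 encoded data,
    while codes above 74 are unusual for +33 encoded data.
    Returns None when the sample is ambiguous.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters given")
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None

print(guess_phred_offset(["IIII#5II"]))   # contains '#', so +33
print(guess_phred_offset(["hhhhfffe"]))   # only high codes, so +64
```

In practice one would feed it the quality line (every fourth line) of the first few thousand FASTQ records rather than a single read.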
[-trimlog <trimLogFile>] writes one log line per read; usually left off because the log becomes very large
[-basein <inputBase> | <inputFile1> <inputFile2>] input files
[-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] output files; mind the order
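## Trimming steps
ILLUMINACLIP: cut adapter and other Illumina-specific sequences from the read.
SLIDINGWINDOW: perform sliding-window trimming, cutting once the average quality within the window falls below a threshold.
LEADING: cut bases off the start of a read if they are below a threshold quality.
TRAILING: cut bases off the end of a read if they are below a threshold quality.
CROP: cut the read to a specified length.
HEADCROP: cut the specified number of bases from the start of the read.
MINLEN: drop the read if it is below a specified length.
TOPHRED33: convert quality scores to Phred-33.
TOPHRED64: convert quality scores to Phred-64.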
-trimlog
Specifying a trimlog file creates a log of all read trimmings, indicating the following details:
- the read name
- the surviving sequence length
- the location of the first surviving base, aka. the amount trimmed from the start
- the location of the last surviving base in the original read
- the amount trimmed from the end
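The five fields listed above can be read back with a few lines of Python. This sketch assumes that each log line holds the fields in that order separated by single spaces, and that only the read name may itself contain spaces; the sample line is hypothetical, not taken from a real run.

```python
from typing import NamedTuple

class TrimLogEntry(NamedTuple):
    read_name: str
    surviving_length: int
    first_surviving: int
    last_surviving: int
    trimmed_from_end: int

def parse_trimlog_line(line):
    """Parse one trimlog line, assuming space-separated fields."""
    # Split from the right so that spaces inside the read name are preserved.
    name, *numbers = line.rstrip("\n").rsplit(" ", 4)
    return TrimLogEntry(name, *map(int, numbers))

# Hypothetical line: 3 bases trimmed from each end of a 63-base read.
entry = parse_trimlog_line("SRR0000000.1 example read 57 3 60 3\n")
print(entry)
```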
ILLUMINACLIP:
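Adapter files are provided for TruSeq2 (as used on GAII machines) and TruSeq3 (as used on HiSeq and MiSeq machines), for both single-end and paired-end mode. Which one to use depends on the instrument, or more precisely on the adapter kit used during library preparation.
ILLUMINACLIP:<fastaWithAdaptersEtc>:<seedMismatches>:<palindromeClipThreshold>:<simpleClipThreshold>
- fastaWithAdaptersEtc: specifies the path to a fasta file containing all the adapters, PCR sequences etc.
- seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed, e.g. 2
- palindromeClipThreshold: specifies how accurate the match between the two "adapter ligated" reads must be for PE palindrome read alignment.
- simpleClipThreshold: specifies how accurate the match between any adapter etc. sequence must be against a read.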
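Since the step is just a colon-separated string, splitting it into the named fields above is a quick way to sanity-check a command before launching it. The Python sketch below uses a made-up helper, `parse_illuminaclip`; the two optional trailing fields seen in paired-end example 1 are kept as raw strings under a hypothetical `extra` key.

```python
def parse_illuminaclip(step):
    """Split an ILLUMINACLIP step string into its named fields."""
    name, *fields = step.split(":")
    if name != "ILLUMINACLIP" or len(fields) < 4:
        raise ValueError(f"not a complete ILLUMINACLIP step: {step!r}")
    return {
        "fastaWithAdaptersEtc": fields[0],
        "seedMismatches": int(fields[1]),
        "palindromeClipThreshold": int(fields[2]),
        "simpleClipThreshold": int(fields[3]),
        # Optional trailing fields, left unparsed in this sketch.
        "extra": fields[4:],
    }

basic = parse_illuminaclip("ILLUMINACLIP:TruSeq3-PE.fa:2:30:10")
full = parse_illuminaclip(
    "ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10:2:keepBothReads")
print(basic)
print(full["extra"])
```

Note that this simple split would break on paths that contain a colon themselves.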
SLIDINGWINDOW:<windowSize>:<requiredQuality>
windowSize: specifies the number of bases to average across
requiredQuality: specifies the average quality required.
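The sliding-window idea can be sketched in a few lines of Python. This is a simplified version written for illustration: it scans from the 5' end and cuts at the start of the first window whose mean quality is below the threshold. Trimmomatic's own implementation may place the cut at a slightly different position within the failing window, so treat the exact cut point here as an assumption.

```python
def sliding_window_cut(scores, window_size=4, required_quality=15):
    """Return the number of leading bases to keep.

    Scans windows from the 5' end and cuts at the start of the first
    window whose mean quality is below required_quality. Reads shorter
    than the window are returned unchanged in this simplified version.
    """
    for start in range(len(scores) - window_size + 1):
        window = scores[start:start + window_size]
        if sum(window) / window_size < required_quality:
            return start
    return len(scores)

scores = [38, 38, 37, 36, 30, 20, 12, 10, 8, 5]
keep = sliding_window_cut(scores, window_size=4, required_quality=15)
print(keep, scores[:keep])
```

In the example, the first window averaging below 15 is the one covering the qualities 20, 12, 10, 8, so everything from that window onward is removed.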
LEADING:<quality>
quality: specifies the minimum quality required to keep a base.
TRAILING:<quality>
quality: specifies the minimum quality required to keep a base.
CROP:<length>
length: the number of bases to keep, from the start of the read.
HEADCROP:<length>
length: the number of bases to remove from the start of the read.
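CROP and HEADCROP are easy to confuse, so a tiny Python sketch may help: CROP keeps a prefix of the read, HEADCROP removes one. The function names simply mirror the step names and are written for this example only.

```python
def crop(seq, qual, length):
    """Keep only the first `length` bases (CROP)."""
    return seq[:length], qual[:length]

def headcrop(seq, qual, length):
    """Remove the first `length` bases (HEADCROP)."""
    return seq[length:], qual[length:]

seq, qual = "ACGTACGTAC", "IIIIIIIIII"
cropped = crop(seq, qual, 8)          # 8 bases remain
headcropped = headcrop(seq, qual, 3)  # 7 bases remain
print(cropped)
print(headcropped)
```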
MINLEN:<length>
length: specifies the minimum length of reads to be kept.
Input file requirements
For input and output files adding .gz/.bz2 to an extension tells Trimmomatic that the file is
provided in gzip/bzip2 format or that Trimmomatic should gzip/bzip2 the file, respectively.
Paired-end output files
Compare with the four output files listed in the --help usage to understand the outputs.
For paired-end data, two input files, and 4 output files are specified, 2 for the 'paired' output
where both reads survived the processing, and 2 for corresponding 'unpaired' output where a
read survived, but the partner read did not.
- mySampleFiltered_1P.fq.gz - for paired forward reads
- mySampleFiltered_1U.fq.gz - for unpaired forward reads
- mySampleFiltered_2P.fq.gz - for paired reverse reads
- mySampleFiltered_2U.fq.gz - for unpaired reverse reads
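The paired/unpaired logic described above boils down to a small decision table. The Python sketch below shows which of the four outputs each mate would go to, plus the naming pattern of the files listed above; both helpers are hypothetical and exist only to illustrate the rule.

```python
def route_pair(forward_survived, reverse_survived):
    """Return the output suffix for each surviving mate of a read pair."""
    if forward_survived and reverse_survived:
        return {"forward": "1P", "reverse": "2P"}
    if forward_survived:
        return {"forward": "1U"}
    if reverse_survived:
        return {"reverse": "2U"}
    return {}  # both mates dropped, nothing is written

def baseout_names(base, extension):
    """Build the four output names following the pattern listed above."""
    return [f"{base}_{suffix}{extension}" for suffix in ("1P", "1U", "2P", "2U")]

print(route_pair(True, False))
print(baseout_names("mySampleFiltered", ".fq.gz"))
```

Most downstream aligners take only the two paired files, since they require the mates to appear in the same order in both files.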
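Quality encoding: detected automatically
Earlier versions defaulted to phred64; current versions determine the encoding from the data.
-phred33 or -phred64 specifies the base quality encoding. If no quality encoding is specified, it will be determined automatically (since version 0.32). The prior default was -phred64.
Thread count: chosen automatically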
-threads indicates the number of threads to use, which improves performance on multi-core
computers. If not specified, it will be chosen automatically.
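Closing notes
Trimming occurs in the order in which the steps are specified on the command line. In most cases it is recommended that adapter clipping be done as early as possible.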
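Why the order matters is easiest to see with a toy pipeline. In the Python sketch below, the same two steps give different outcomes depending on which comes first; the step functions and `run_steps` are simplified stand-ins written for this note and operate on the sequence only.

```python
def headcrop(read, n):
    """Remove the first n bases."""
    return read[n:]

def minlen(read, n):
    """Drop the read (return None) if it is shorter than n bases."""
    return read if len(read) >= n else None

def run_steps(read, steps):
    """Apply steps in the given order; stop as soon as the read is dropped."""
    for step, arg in steps:
        read = step(read, arg)
        if read is None:
            return None
    return read

read = "A" * 40
trim_then_filter = run_steps(read, [(headcrop, 10), (minlen, 36)])
filter_then_trim = run_steps(read, [(minlen, 36), (headcrop, 10)])
print(trim_then_filter)
print(filter_then_trim)
```

With MINLEN last, the 30-base remainder is dropped; with MINLEN first, the read passes the length check at 40 bases and a 30-base read ends up in the output. This is why MINLEN normally goes at the end of the step list.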