医学学习笔记生信分析流程生物信息学与算法

RNA-Seq数据分析—cutadapt质量控制

2022-02-10  本文已影响0人  生信助手

可对fastq数据进行质量控制的软件有很多,今天我们先学习cutadapt。

安装

## cutadapt是一款基于python的软件
#使用pip安装
$ pip install --user --upgrade cutadapt
#使用conda安装
$ conda install -c bioconda cutadapt

官方帮助文档

cutadapt -h
cutadapt version 3.4

Copyright (C) 2010-2021 Marcel Martin <marcel.martin@scilifelab.se>

cutadapt removes adapter sequences from high-throughput sequencing reads.

Usage:
    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

For paired-end reads:
    cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. All reads from input.fastq will be written to
output.fastq with the adapter sequence removed. Adapter matching is
error-tolerant. Multiple adapter sequences can be given (use further -a
options), but only the best-matching adapter will be removed.

Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.

Citation:

Marcel Martin. Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
http://dx.doi.org/10.14806/ej.17.1.200

Run "cutadapt --help" to see all command-line options.
See https://cutadapt.readthedocs.io/ for full documentation.

Options:
  -h, --help            Show this help message and exit
  --version             Show version number and exit
  --debug               Print debug log. Use twice to also print DP matrices
  -j CORES, --cores CORES
                        Number of CPU cores to use. Use 0 to auto-detect. Default: 1

Finding adapters:
  Parameters -a, -g, -b specify adapters to be removed from each read (or from the first read in a
  pair if data is paired). If specified multiple times, only the best matching adapter is trimmed (but
  see the --times option). When the special notation 'file:FILE' is used, adapter sequences are read
  from the given FASTA file.

  -a ADAPTER, --adapter ADAPTER
                        Sequence of an adapter ligated to the 3' end (paired data: of the first read).
                        The adapter and subsequent bases are trimmed. If a '/pre> character is appended
                        ('anchoring'), the adapter is only found if it is a suffix of the read.
  -g ADAPTER, --front ADAPTER
                        Sequence of an adapter ligated to the 5' end (paired data: of the first read).
                        The adapter and any preceding bases are trimmed. Partial matches at the 5' end
                        are allowed. If a '^' character is prepended ('anchoring'), the adapter is only
                        found if it is a prefix of the read.
  -b ADAPTER, --anywhere ADAPTER
                        Sequence of an adapter that may be ligated to the 5' or 3' end (paired data: of
                        the first read). Both types of matches as described under -a and -g are allowed.
                        If the first base of the read is part of the match, the behavior is as with -g,
                        otherwise as with -a. This option is mostly for rescuing failed library
                        preparations - do not use if you know which end your adapter was ligated to!
  -e E, --error-rate E, --errors E
                        Maximum allowed error rate (if 0 <= E < 1), or absolute number of errors for
                        full-length adapter match (if E is an integer >= 1). Error rate = no. of errors
                        divided by length of matching region. Default: 0.1 (10%)
  --no-indels           Allow only mismatches in alignments. Default: allow both mismatches and indels
  -n COUNT, --times COUNT
                        Remove up to COUNT adapters from each read. Default: 1
  -O MINLENGTH, --overlap MINLENGTH
                        Require MINLENGTH overlap between read and adapter for an adapter to be found.
                        Default: 3
  --match-read-wildcards
                        Interpret IUPAC wildcards in reads. Default: False
  -N, --no-match-adapter-wildcards
                        Do not interpret IUPAC wildcards in adapters.
  --action {trim,retain,mask,lowercase,none}
                        What to do if a match was found. trim: trim adapter and up- or downstream
                        sequence; retain: trim, but retain adapter; mask: replace with 'N' characters;
                        lowercase: convert to lowercase; none: leave unchanged. Default: trim
  --rc, --revcomp       Check both the read and its reverse complement for adapter matches. If match is
                        on reverse-complemented version, output that one. Default: check only read

Additional read modifications:
  -u LENGTH, --cut LENGTH
                        Remove bases from each read (first read only if paired). If LENGTH is positive,
                        remove bases from the beginning. If LENGTH is negative, remove bases from the
                        end. Can be used twice if LENGTHs have different signs. This is applied *before*
                        adapter trimming.
  --nextseq-trim 3'CUTOFF
                        NextSeq-specific quality trimming (each read). Trims also dark cycles appearing
                        as high-quality G bases.
  -q [5'CUTOFF,]3'CUTOFF, --quality-cutoff [5'CUTOFF,]3'CUTOFF
                        Trim low-quality bases from 5' and/or 3' ends of each read before adapter
                        removal. Applied to both reads if data is paired. If one value is given, only
                        the 3' end is trimmed. If two comma-separated cutoffs are given, the 5' end is
                        trimmed with the first cutoff, the 3' end with the second.
  --quality-base N      Assume that quality values in FASTQ are encoded as ascii(quality + N). This
                        needs to be set to 64 for some old Illumina FASTQ files. Default: 33
  --length LENGTH, -l LENGTH
                        Shorten reads to LENGTH. Positive values remove bases at the end while negative
                        ones remove bases at the beginning. This and the following modifications are
                        applied after adapter trimming.
  --trim-n              Trim N's on ends of reads.
  --length-tag TAG      Search for TAG followed by a decimal number in the description field of the
                        read. Replace the decimal number with the correct length of the trimmed read.
                        For example, use --length-tag 'length=' to correct fields like 'length=123'.
  --strip-suffix STRIP_SUFFIX
                        Remove this suffix from read names if present. Can be given multiple times.
  -x PREFIX, --prefix PREFIX
                        Add this prefix to read names. Use {name} to insert the name of the matching
                        adapter.
  -y SUFFIX, --suffix SUFFIX
                        Add this suffix to read names; can also include {name}
  --rename TEMPLATE     Rename reads using TEMPLATE containing variables such as {id}, {adapter_name}
                        etc. (see documentation)
  --zero-cap, -z        Change negative quality values to zero.

Filtering of processed reads:
  Filters are applied after above read modifications. Paired-end reads are always discarded pairwise
  (see also --pair-filter).

  -m LEN[:LEN2], --minimum-length LEN[:LEN2]
                        Discard reads shorter than LEN. Default: 0
  -M LEN[:LEN2], --maximum-length LEN[:LEN2]
                        Discard reads longer than LEN. Default: no limit
  --max-n COUNT         Discard reads with more than COUNT 'N' bases. If COUNT is a number between 0 and
                        1, it is interpreted as a fraction of the read length.
  --max-expected-errors ERRORS, --max-ee ERRORS
                        Discard reads whose expected number of errors (computed from quality values)
                        exceeds ERRORS.
  --discard-trimmed, --discard
                        Discard reads that contain an adapter. Use also -O to avoid discarding too many
                        randomly matching reads.
  --discard-untrimmed, --trimmed-only
                        Discard reads that do not contain an adapter.
  --discard-casava      Discard reads that did not pass CASAVA filtering (header has :Y:).

Output:
  --quiet               Print only error messages.
  --report {full,minimal}
                        Which type of report to print: 'full' or 'minimal'. Default: full
  -o FILE, --output FILE
                        Write trimmed reads to FILE. FASTQ or FASTA format is chosen depending on input.
                        Summary report is sent to standard output. Use '{name}' for demultiplexing (see
                        docs). Default: write to standard output
  --fasta               Output FASTA to standard output even on FASTQ input.
  -Z                    Use compression level 1 for gzipped output files (faster, but uses more space)
  --info-file FILE      Write information about each read and its adapter matches into FILE. See the
                        documentation for the file format.
  -r FILE, --rest-file FILE
                        When the adapter matches in the middle of a read, write the rest (after the
                        adapter) to FILE.
  --wildcard-file FILE  When the adapter has N wildcard bases, write adapter bases matching wildcard
                        positions to FILE. (Inaccurate with indels.)
  --too-short-output FILE
                        Write reads that are too short (according to length specified by -m) to FILE.
                        Default: discard reads
  --too-long-output FILE
                        Write reads that are too long (according to length specified by -M) to FILE.
                        Default: discard reads
  --untrimmed-output FILE
                        Write reads that do not contain any adapter to FILE. Default: output to same
                        file as trimmed reads

Paired-end options:
  The -A/-G/-B/-U options work like their -a/-b/-g/-u counterparts, but are applied to the second read
  in each pair.

  -A ADAPTER            3' adapter to be removed from second read in a pair.
  -G ADAPTER            5' adapter to be removed from second read in a pair.
  -B ADAPTER            5'/3 adapter to be removed from second read in a pair.
  -U LENGTH             Remove LENGTH bases from second read in a pair.
  -p FILE, --paired-output FILE
                        Write second read in a pair to FILE.
  --pair-adapters       Treat adapters given with -a/-A etc. as pairs. Either both or none are removed
                        from each read pair.
  --pair-filter (any|both|first)
                        Which of the reads in a paired-end read have to match the filtering criterion in
                        order for the pair to be filtered. Default: any
  --interleaved         Read and/or write interleaved paired-end reads.
  --untrimmed-paired-output FILE
                        Write second read in a pair to this FILE when no adapter was found. Use with
                        --untrimmed-output. Default: output to same file as trimmed reads
  --too-short-paired-output FILE
                        Write second read in a pair to this file if pair is too short.
  --too-long-paired-output FILE
                        Write second read in a pair to this file if pair is too long.

常用参数

-g: #剪切reads 5'端adapter(双端测序第一条read),加$表示adapter锚定在reads 5'端
-a: #剪切reads 3'端adapter(双端测序第一条read),加$表示adapter锚定在reads3'端
-O MINLENGTH, --overlap=MINLENGTH #adapter与reads最小overlap,才算成功识别; Default: 3
-m LENGTH, --minimum-length=LENGTH: 根据最短长度筛选reads;Default: 0
--discard-untrimmed, --trimmed-only #丢掉不包含adapter的reads
 -e ERROR_RATE, --error-rate=ERROR_RATE  #adapter匹配允许的最大错配率(错配/匹配片段长度);Default: 0.1
--no-trim: 不修剪adapter,直接输出满足跳进啊的reads
-u LENGTH, --cut=LENGTH:  #修剪reads 5'/3'端碱基,正数:从开始除移除碱基;负数:从末尾处移除碱基;
-q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF: #修剪低质量碱基
-l LENGTH, --length=LENGTH: #将reads修剪的最终长度
--trim-n: #修剪reads末端的'N'
-o FILE, --output=FILE: #输出文件
--info-file=FILE:每条reads和与reads匹配的adapter的信息
--too-short-output=FILE: #为reads长度最小值设定阈值筛选reads后,要丢弃的部分输出到文件;长度依据m值设定;   
--too-long-output=FILE:#为reads长度最大值设定阈值筛选reads后,要丢弃的部分输出到文件;长度依据M值设定; 
--untrimmed-output=FILE: #将没有adapter未做修剪的reads输出到一个文件;默认输出到trimmed reads结果文件
--max-n=COUNT:#reads中N的数量,设定整数或小数(N的占比)

双端测序参数
-A ADAPTER:  #第二条reads 3'adapter
-G ADAPTER:#第二条reads 5'adapter
-U LENGTH: #从第二条reads上修剪的长度
-p FILE, --paired-output=FILE: #第二条reads的输出结果
--untrimmed-paired-output=FILE:#第一条reads没有adapter时,将第二条reads输出到文件;默认输出到trimmed reads结果文件 

我处理的数据

我的数据(GSE176393)是应用Illumina NovaSeq 6000测得的一组双端150bp测序数据。进行质控前,我对其进行了fastqc查看了数据质量,该数据质量非常不错(赚着了)。

Fig.1

SRA的数据有些数据上传时就是已经质控过的可能不带有接头,有些则可能带有接头或者其他问题,因此FastQC结果对明确后续分析所需要的操作及使用参数尤为重要。

依据FastQC结果,确定所需的质控命令(虽然我感觉都不用质控了)

cutadapt -O 1 -a AGATCGGAAGAG -A AGATCGGAAGAG -e 0.1 -q 20,20 -m 75 --max-n=0.1 -o *_clean_R1.fq -p *_clean_R2.fq *_1.fq.gz *_2.fq.gz

批量运行命令如下:

cat SRR_Acc_List.txt | while read line
do 
cutadapt -O 1 -a AGATCGGAAGAG -A AGATCGGAAGAG -e 0.1 -q 20,20 -m 75 --max-n=0.1 -o 1-cleandata/${line}_clean_1.fastq.gz -p 1-cleandata/${line}_clean_2.fastq.gz 0-rawdata/${line}_1.fastq.gz 0-rawdata/${line}_2.fastq.gz
done

获得以下结果

Fig.2

PS

关于illumina测序仪使用的接头信息,要依据试剂盒确定,详情可前往Illumina Adapter Sequences查看。

其中这个表比较有用,截图下来以备后用。

Fig.3

对数据进行质量控制后就要开始比对啦!
坚持日更,加油!!!

上一篇 下一篇

猜你喜欢

热点阅读