文章复现-全外显子数据分析学习4 去接头与质控

2022-04-21  本文已影响0人  jiarf

教程在:肿瘤外显子数据处理系列教程(二)质控与去接头 (qq.com)

去接头

1.安装TrimGalore

curl https://codeload.github.com/FelixKrueger/TrimGalore/zip/0.6.1 TrimGalore.zip
unzip TrimGalore.zip
ln -s ~/tools/TrimGalore-0.6.1/trim_galore ~/tools/bin/trim_galore

在查找这个软件的时候发现了一个博主介绍的很好,此处复制一点自己觉得有用的

Trim Galore是对FastQCCutadapt的包装。适用于所有高通量测序,包括RRBS(Reduced Representation Bisulfite-Seq ), Illumina、Nextera 和smallRNA测序平台的双端和单端数据。主要功能包括两步:
第一步首先去除低质量碱基,然后去除3' 末端的adapter, 如果没有指定具体的adapter,程序会自动检测前1million的序列,然后对比前12-13bp的序列是否符合以下类型的adapter:

但是这个作者并没有用conda安装成功

(wes) 16:45:33 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
conda install trim-galore
Collecting package metadata (current_repodata.json): -
 fontconfig         conda-forge/linux-64::fontconfig-2.14.0-h8e229c2_0
  freetype           conda-forge/linux-64::freetype-2.10.4-h0708190_1
  libnsl             conda-forge/linux-64::libnsl-2.0.0-h7f98852_0
  libpng             conda-forge/linux-64::libpng-1.6.37-h21135ba_2
  libuuid            conda-forge/linux-64::libuuid-2.32.1-h7f98852_1000
  openjdk            conda-forge/linux-64::openjdk-8.0.312-h7f98852_0
  perl               conda-forge/linux-64::perl-5.32.1-2_h7f98852_perl5
  pigz               conda-forge/linux-64::pigz-2.6-h27826a3_0
  trim-galore        bioconda/noarch::trim-galore-0.6.7-hdfd78af_0
  xopen              bioconda/noarch::xopen-0.7.3-py_0


Proceed ([y]/n)? y
Downloading and Extracting Packages
pigz-2.6             | 87 KB     | ##################################### | 100%
fontconfig-2.14.0    | 305 KB    | ##################################### | 100%
xopen-0.7.3          | 11 KB     | ##################################### | 100%
fastqc-0.11.9        | 9.7 MB    | ##################################### | 100%
perl-5.32.1          | 14.4 MB   | ##################################### | 100%
freetype-2.10.4      | 890 KB    | ##################################### | 100%
libnsl-2.0.0         | 31 KB     | ##################################### | 100%
libpng-1.6.37        | 306 KB    | ##################################### | 100%
cutadapt-1.18        | 199 KB    | ##################################### | 100%
expat-2.4.8          | 187 KB    | ##################################### | 100%
openjdk-8.0.312      | 97.6 MB   | ##################################### | 100%
bz2file-0.98         | 9 KB      | ##################################### | 100%
libuuid-2.32.1       | 28 KB     | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

可以看到依赖的环境很多,要一个一个下载

(wes) 17:26:08 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
trim_galore --help

···
 within the other read).
                        NOTE: If you are planning to use Bowtie2, BWA etc. you don't need to specify this option.

--retain_unpaired       If only one of the two paired-end reads became too short, the longer
                        read will be written to either '.unpaired_1.fq' or '.unpaired_2.fq'
                        output files. The length cutoff for unpaired single-end reads is
                        governed by the parameters -r1/--length_1 and -r2/--length_2. Default: OFF.

-r1/--length_1 <INT>    Unpaired single-end read length cutoff needed for read 1 to be written to
                        '.unpaired_1.fq' output file. These reads may be mapped in single-end mode.
                        Default: 35 bp.

-r2/--length_2 <INT>    Unpaired single-end read length cutoff needed for read 2 to be written to
                        '.unpaired_2.fq' output file. These reads may be mapped in single-end mode.
                        Default: 35 bp.

Last modified on 07 October 2020.

安装成功

(wes) 17:26:08 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cutadapt --help
cutadapt version 1.18

Copyright (C) 2010-2018 Marcel Martin <marcel.martin@scilifelab.se>

cutadapt removes adapter sequences from high-throughput sequencing reads.

Usage:
    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

For paired-end reads:
    cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. The reverse complement is *not* automatically
searched. All reads from input.fastq will be written to output.fastq with the
adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
sequences can be given (use further -a options), but only the best-matching
adapter will be removed.

Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.

Citation:

Marcel Martin. Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
http://dx.doi.org/10.14806/ej.17.1.200

Run "cutadapt --help" to see all command-line options.
See https://cutadapt.readthedocs.io/ for full documentation.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --debug               Print debugging information.
  ···

但是这个版本不得行

(wes) 17:28:33 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cutadapt --version
1.18

作者那时候2019年都2.6了,现在会更新
更新一下

(wes) 17:29:28 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
pip3 install --user --upgrade cutadapt
Looking in indexes: http://pypi.douban.com/simple

更新失败了,算了算了,我的pip总感觉这个环境的有点问题

term_galore运行的参数


image.png

2.去接头

这是教程的代码:

cd 1.raw_fq
touch ../3.clean/trimgalore.log

## trim_galore.sh
cat ../tmp | while read id; do
    fq1=${id}_1.fastq.gz
    fq2=${id}_2.fastq.gz
    trim_galore  --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../3.clean  $fq1  $fq2 >> trimgalore.log 2>&1
done

nohup bash trim_galore.sh &

multiqc ./ -n trim_galore_report -p -i " Trim REPORT OF SRP070662" -o multiqc

nohup fastqc --outdir ../2.qc/post --threads 16 *.fq.gz > ../2.qc/post/fastqc.log 2>&1

multiqc ./ -n post_qc_report -p -i " QC REPORT OF SRP070662" -o multiqc

这是教程的,自己改脚本

cat trim_galore.sh
cd /data1/jiarongf/learning/cancer-WES/0.sra/raw_data

## trim_golore.sh
cat ../../data/case | while read id; do
    fq1=${id}_1.fastq.gz
    fq2=${id}_2.fastq.gz
    trim_galore  --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../../3.clean  $fq1  $fq2 > ../../log/trimgalore.log 2>&1
done

这里报错了,试一下跑一个,发现试python的问题,之前创建的wes环境是2的版本,所以跑完之后会报错

Writing final adapter and quality trimmed output to case3_techrep_2_WES_1_trimmed.fq.gz


  >>> Now performing quality (cutoff '-q 28') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file case3_techrep_2_WES_1.fastq.gz <<<
ERROR: Running in parallel is not supported on Python 2


Cutadapt terminated with exit signal: '256'.
Terminating Trim Galore run, please check error message(s) to get an idea what went wrong...

现在更改已经创建的环境的python版本


image.png
conda install python==版本号

但是有一个奇怪的就是再这个环境里面的conda show config是python是3.8.8的,为什么默认的python的版本就是2的,哎可能是创建环境的问题吧
网有点不好,慢慢等待


image.png
Proceed ([y]/n)? y


Downloading and Extracting Packages
cutadapt-4.0         | 208 KB    | ##################################### | 100%
pbzip2-1.1.13        | 114 KB    | ##################################### | 100%
certifi-2021.10.8    | 145 KB    | ##################################### | 100%
dnaio-0.8.1          | 76 KB     | ##################################### | 100%
python-isal-0.11.1   | 141 KB    | ##################################### | 100%
setuptools-62.1.0    | 1.2 MB    | ##################################### | 100%
xopen-1.5.0          | 25 KB     | ##################################### | 100%
isa-l-2.30.0         | 192 KB    | ##################################### | 100%
python-3.8.8         | 26.1 MB   | ##################################### | 100%
pip-22.0.4           | 1.5 MB    | ##################################### | 100%
python_abi-3.8       | 4 KB      | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

下载成功
并且也更新了cutadapt

(wes) 14:46:05 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
cutadapt -v
This is cutadapt 4.0 with Python 3.8.8
Command line parameters: -v
Run "cutadapt --help" to see command-line options.
See https://cutadapt.readthedocs.io/ for full documentation.

cutadapt: error: unrecognized arguments: -v
(wes) 14:46:11 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
python --version
Python 3.8.8

再试着运行一下

(wes) 14:47:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
trim_galore  --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../../3.clean  case3_techrep_2_WES_1.fastq.gz  case3_techrep_2_WES_2.fastq.gz

Number of cores used for trimming: 8
Quality Phred score cutoff: 28
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 3 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 30 bp
Output file(s) will be GZIP compressed

Cutadapt seems to be fairly up-to-date (version 4.0). Setting -j 8
Writing final adapter and quality trimmed output to case3_techrep_2_WES_1_trimmed.fq.gz


  >>> Now performing quality (cutoff '-q 28') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file case3_techrep_2_WES_1.fastq.gz <<<
10000000 sequences processed
^C
(wes) 14:47:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data

全部拿去跑试试
这里提醒,每次拿去跑的时候都要删掉之前的,不然不知道新生成的是哪个

(wes) 14:50:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
nohup sh trim_galore.sh > trim_galore.sh.log 2>&1 &
(wes) 14:50:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
top

image.png

好了再后台跑起来了
去看看log有没有问题,这里我生成了两个log,一个是运行sh的log一个是运行trim_galore的log
先去检查sh的log


image.png

嗯呢果然很简洁,没啥
在检查另一个log


image.png
成功的标志啊,哈哈哈哈哈,下面有再检查每一个adapter了

也要跑很久,趁着跑的这段时间,去学习一下term_galore 发现这里不同的测序平台会有不同的adaper,那么如何知道这批数据是什么平台呢,生成fq文件之后不是做了一次fastqc质控嘛,随意打开一个


image.png

就可以看到这个样本的encoding了,

学习trim_galore

教程:使用trim_galore对数据进行质量控制-过滤 - 简书 (jianshu.com)

image.png image.png
参数说明

--quality/-q<INT>:设定Phred quality score阈值,默认为Phred 20 切除质量得分低于设定值的序列
--phred33/64:使用ASCII+33/64质量得分作为Phred得分选择-phred33或者-phred64,表示-测序平台使用的Phred quality score。
(需要确认:anger/Illumina 1.9+ encoding为33; Illumina 1.5 encoding为64)
--adapter/-a :输入adapter序列。也可以不输入,Trim Galore!会自动寻找可能性最高的平台对应的adapter。自动搜选的平台三个,也直接显式输入这三种平台,即--illumina、--nextera和--small_rna。
-s/--stringency<INT>:设定可以忍受的前后adapter重叠的碱基数,默认为1(非常苛刻)。可以适度放宽,因为后一个adapter几乎不可能被测序仪读到。
--length<INT>:设定输出reads长度阈值小于设定值会被抛弃,默认值20bp;即小于20bp的被去除。注意,在pe150下,不要设置太高,可以50或36(默认20)
--max_length : 设置长度大于此值被丢弃
-e <ERROR RATE> :默认0.1
--paired:对于双端测序结果,一对reads中,如果有一个被剔除,那么另一个会被同样抛弃,而不管是否达到标准。
--retain_unpaired:对于双端测序结果,一对reads中,如果一个read达到标准,但是对应的另一个要被抛弃,达到标准的read会被单独保存为一个文件。
--gzip和--dont_gzip:清洗后的数据zip打包或者不打包。
-o/--output_dir:输入目录 [需要提前建立目录,否则运行会报错]。
-- trim-n : 移除read一端的reads
-j :使用线程数, 注意假设已使用Python 3并安装了Pigz,那么内核设置为4,实际使用内核是15,因此,最高设为4.
--fastqc #当分析结束后,使用默认选项对结果文件进行fastqc分析

fasqc的结果

image.png

Per base sequence quality : 每个read各位置碱基的测序质量。
横轴碱基的位置,纵轴是质量分数,
Quality score= -10log10p
(p代表错误率),所以当质量分数为40的时候,p就是0.0001,质量算高了。
[红色线]代表中位数,
[蓝色线]代表平均数
[黄色]是25%-75%区间,
[权]是10%-90%区间。
若任一位置的下 [四分位数] 低于10或者 [中位数] 低于25---->出现 “警告”
若任一位置的下 [四分位数] 低于5或者 [中位数] 低于20,出现“失败,Fail”

跑完啦

昨天跑了一下去接头和质控,质控的脚本:

(wes) 08:38:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat post_fastqc.sh
nohup fastqc --outdir ../2.qc/post --threads 16 ../3.clean/*.fq.gz > ../log/post.fastqc.log 2>&1 &
image.png

multi一下

(wes) 08:41:12 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat post_qc_multiqc.sh
multiqc /data1/jiarongf/learning/cancer-WES/2.qc/post  -n post_qc_report -p -i " QC REPORT OF SRP070662" -o /data1/jiarongf/learning/cancer-WES/2.qc/post/
(wes) 08:41:21 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh post_qc_multiqc.sh

  /// MultiQC 🔍 | v1.13.dev0

|           multiqc | Report title:  QC REPORT OF SRP070662
|           multiqc | Search path : /data1/jiarongf/learning/cancer-WES/2.qc/post
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 120/120
|            fastqc | Found 60 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : ../2.qc/post/post_qc_report.html
|           multiqc | Data        : ../2.qc/post/post_qc_report_data
|           multiqc | Plots       : ../2.qc/post/post_qc_report_plots
|           multiqc | MultiQC complete
(wes) 08:42:07 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
image.png

结果分析:

去接头前qc
去接头后qc

有看到dup的比例确实有减少
M seqs也有相应的降低




看到质量有明显的提升











上一篇下一篇

猜你喜欢

热点阅读