mRNA-seq转录组二代测序从raw reads到表达矩阵：上

2020-05-07 本文已影响0人 ZZZZZZ_XX

mRNA-seq上中游分析需要用到Linux系统的terminal！

尽管TCGA数据库中已经提供了大量的处理好的表达矩阵数据，但是无论是自身实验需要，还是从NCBI的GEO数据库（储存了许多发表文章样本的测序原始数据）下载文件，都会经常有处理转录组测序下机原始数据的需求。因此我尝试总结出一个具有一定普适性的mRNA-seq上中游pipeline，可以供有需求的朋友进行参考、学习和交流。第一次写shell脚本，望包涵！

转录组应该大家都很熟悉了，就不多介绍了，通常我们可以使用针对RNA的NGS（next generation sequencing）来高通量检测不同条件下细胞组织内的基因（/non coding RNA）的表达情况，此外更高深度的RNA测序还可以用来检测融合基因、可变剪接等。本教程着眼于适用面最广的表达量检测，尝试对GEO数据库中的一批二代RNA-seq数据进行下载和自己重新处理。

从原始文件到表达矩阵！

mRNA-seq的上中游数据处理流程大致可以分为如下步骤：
原始数据质检→低质量碱基去除+低质量reads过滤→（过滤数据的再次质检）→reads比对到参考基因组→格式转换→去除多重比对的低质量比对结果和去重复reads→counts计算（或RPKM/FPKM/TPM计算）

至于这些流程为何被称为上中游分析，是相对以表达矩阵为起点的差异分析、通路富集分析等后续分析流程命名的~

软件准备

由于生信的迅速发展，现在每一步流程都有不少已经开发好的软件可供选择，我这里都是使用了较多人使用的经典软件。

（1）原始数据质检：FastQC

FastQC在reads质检界的地位已经是无可撼动的了。关于FastQC的具体参数和报告解读，字数原因不多赘述，推荐简书教程：利用fastqc检测原始序列的质量

安装

Anaconda依赖python，其中的bioconda子库包含了很多常见的生信软件，并且通过它下载可以直接添加到环境变量（调用时不需要输入绝对路径），非常方便！那么我们需要下载anaconda，添加bioconda查找路径，然后就可以愉快地下载这些软件了。

下载并配置anaconda3.

#download and install the software.
wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh -P ~/Downloads
bash ~/Downloads/Anaconda3-2020.02-Linux-x86_64.sh

#Then we have to go through a long agreement, and enter yes and ENTER to complete the installation.
#Then anaconda3 has been installed at /home/xuzihan/anaconda3

#We need to add the command to the environment.
vi ~/.bashrc
#Go to the bottom of this file, press "i" to start text insertion. And add this command:
export PATH="$PATH:/home/xuzihan/anaconda3/bin"
#Press esc to exit the texting mode, then press ":", and enter "wq!" to save the modulation and exit the vim texter.

#reload the environment setting.
source ~/.bashrc

#add some databases to anaconda3 for bioinfo software searching and downloading.
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

anaconda安装过程的截图

fastqc下载

conda install fastqc
#then press "y" to confirm the installation.

然后就安装好并已经加入环境变量了！是不是很方便！

使用

普通用法很简单，输入fastq文件，指定输出目录即可。我们可以得到每个样本的html网页式的QC报告，里面对这一样本的测序数据各项参数进行了解读。

fastqc raw.reads.fastq -o ./fastqc.output.directory

（2）低质量碱基去除+低质量reads过滤：trimmomatic

Trimmomatic是基于java的使用非常广泛的去除接头和质量过滤的软件，通过它的reads去除了部分reads中也测到的测序建库时连接的外源接头，和可能影响reads序列准确性的低质量碱基，使数据质量更好。关于它的工作原理和模式可参考简述教程：NGS 数据过滤之 Trimmomatic 详细说明

安装

从开发者网站上下载压缩包解压后就可以直接得到jar文件了。

wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip -P /home/xuzihan/
unzip Trimmomatic-0.39.zip
mv ~/Trimmomatic-0.39 ~/trimmomatic
#Then we will get a folder containing adapters and trimmomatic java package.

使用

本教程使用的测序数据是单端测序（Single End），所以使用SE模式就好。

java -jar /home/xuzihan/trimmomatic/trimmomatic-0.39.jar SE -threads 8 -phred33 input_reads.fastq output_reads.fastq ILLUMINACLIP:/home/xuzihan/trimmomatic/adapters/TruSeq3-SE.fa:4:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

参数含义：用java执行jar文件，开启8线程，使用单端模式，测序质量分数体系为phred33（现在的illumina测序都是），输入input_reads.fastq，输出output_reads.fastq。过滤参数包括：使用trimmomatic自带的TruSeq3-SE.fa作为测序接头序列进行剪切；对reads头尾均以碱基质量分数为3作为阈值进行剪切；进行滑窗质检，通过标准为4个碱基构成的滑窗内平均质量分数大于15；剪切后可接受的read最小长度为50.

（3）参考基因组比对：bowtie2

对于有参转录组测序，我们需要将测到的reads回贴到参考基因组上，从而鉴定其所属的基因组位置。这一步的选择其实挺多的，比如适合短序列的bowtie，相对更适合50bp以上长序列的bowtie2，还有BWA，HISAT2，Tophat等。Bowtie2在运行时间上比较有优势，还是很受欢迎的。关于bowtie的详细解释和使用参数，请移步bowtie2 manual：bowtie2 manual

安装

基于Anaconda3就可以直接安装了，一步到位。

conda install bowtie2
#Then press "y" to confirm the installation.

使用

第一次进行参考基因组比对前，需要对各数据库（ENSEMBL/NCBI/UCSC等）发布的标准人类基因组数据进行index。完成index后我们会获得6个以bt2为后缀的index文件，一劳永逸，以后就可以直接使用了。

我个人比较喜欢ENSEMBL数据库的版本，会进行定期更新，此外版本号比较规整，每个版本的参考基因组（fasta文件）和基因注释（gtf文件）会同步发布，不会怕版本间出现错误。

本教程会使用到的参考基因组是刚刚（4.29.20）发布的release100中的Homo_sapiens.GRCh38.dna.primary_assembly.fa，以及对应的release100中的Homo_sapiens.GRCh38.100.gtf注释文件（在最后数counts时才会用到）。（ftp://ftp.ensembl.org/pub/release-100/）

bowtie2-build /The_directory_containig_downloaded_reference_genome_fasta/Homo_sapiens.GRCh38.dna.primary_assembly.fa The_unified_index_prefix
#Then we can get 6 index files: The_unified_index_prefix.1.bt2 ~.2.bt2 ~3.bt2 ~4.bt2 The_unified_index_prefix.rev.1.bt2 ~.rev.2.bt2

然后就是用测序得到的reads进行比对了。由于要下载的是单端测序数据，所以使用-U参数。-p为线程数，-x定义参考基因组index，-S定义输出sam文件名。其他更复杂和相对不常用的参数使用参见manual。

bowtie2 -p 8 -x /directory/The_unified_index_prefix -U input_reads.fastq -S output.sam

（4）格式转化：samtools

比对后输出的格式为sam文件，我们选择将其压缩到二进制的bam文件中。我们还需要对比对上的bam格式的reads按照基因组的位置进行排序，方便进行基因组注释。samtools绝对是这一过程的不二选择。此外，在有其他需求的时候，它也含有其他功能。详细参数解读可右转至：samtools常用命令详解

安装

conda install samtools
#Then press "y" to confirm the installation.

使用

-S参数指输入格式为.sam，-b参数指输出.bam。-o参数指输出文件。

#format transformation.
samtools view -bS input.sam > output.bam
#sort the reads by genomic sites.
samtools sort input.bam -o sorted_output.bam

（5）过滤低质量比对结果：sambamba

相比picard，好像sambamba这个软件小众很多了。不过它的语法挺有意思的，在基于关键词的基础上支持一点自然语言。sambamba的处理速度确实是挺快的！这也是它的一个亮点。具体参数解读见：Sambamba: process your BAM data faster!

安装

conda install sambamba
#Then press "y" to confirm the installation.

使用

-t参数指线程；-h指保留bam文件的header；-f指输入文件类型；-F配合后面的语法表示过滤的条件。这里是表示过滤多重比对/未比对上/鉴定为PCR复制的reads。为了减少假阳性，本教程就将条件设置地严苛一些，仅仅只保留比对到参考基因组上一次的非PCR重复reads。（sam文件中，XS参数是指reads比对到多个参考基因组时，除了最佳比对位置外的第二好得分，它存在也就意味着read存在多次比对）

sambamba view -h -t 8 -f bam -F "[XS] == null and not unmapped and not duplicate" ordered.bam > ordered_filtered.bam

（6）从比对后bam文件得到各个样本的表达矩阵：htseq

TCGA数据库可供下载的开放数据包括htseq-counts，就可以说明这一软件的使用广泛程度了。过滤得到的bam通过htseq处理后得到的counts矩阵就是可以进行差异分析的标准形式了！

安装

conda install htseq
#Then press "y" to confirm the installation.

使用

最常用的就是通过htseq获得各基因组注释下的count数了。-s规定是否为链特异性建库（默认为yes，不懂为啥要用这个默认...）；-f指输入文件形式；输入已排序和过滤的bam文件，以及能和参考基因组对应的基因组注释（这里就是ENSEMBL-release100的Homo_sapiens.GRCh38.100.gtf）；最后输出该样本的counts文件。

htseq-count -f bam -s no ordered_filtered.bam /directory/Homo_sapiens.GRCh38.100.gtf > count.txt

（7）从GEO数据库下载：sra-toolkit

通过sra-toolkit中的prefetch指令，我们只需要知道GEO数据库中想下载样本的SRR序列号，就可以直接下载数据了。GEO数据库的结构和使用不是本教程的重点，所以大家可以自行补充学习！

虽然sra-toolkit可以通过apt-get install命令下载，也可以通过conda install下载，但它在安装后常需要configure（比如重新设置下载文件的所属路径），因此我们需要明确sra-toolkit的安装路径。然而不巧的是好像这俩命令的安装路径不是很好找，所以就还是使用麻烦一点但是更透明的办法：自行下载安装包-设置环境变量。

安装

#download the software.
mkdir ~/sratoolkit && cd ~/sratoolkit
wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.5/sratoolkit.2.10.5-ubuntu64.tar.gz -P ~/sratoolkit
tar -xzvf sratoolkit.2.10.5-ubuntu64.tar.gz

#modulate the environment.
vi ~/.bashrc
#add the directory containing commands of sratoolkit into the environment.
export PATH="$PATH:/home/xuzihan/sratoolkit/sratoolkit.2.10.5-ubuntu64/bin"
#restart
source ~/.bashrc

#Configure the software: First change the directory to 'bin' of sratoolkit, then use the command to open an interactive window.
cd /home/xuzihan/sratoolkit/sratoolkit.2.10.5-ubuntu64/bin
vdb-config --interactive

弹出的交互式窗口可以使用键盘和鼠标结合操作。

使用

sratoolkit的默认下载路径是~/ncbi/public。下载的原始数据格式为.sra格式。

此外，我使用了美国的服务器，所以使用起来没有网速问题。但是据说国内的朋友直接使用prefetch下载数据的速度是非常慢的，解决办法就是使用aspera协助下载（本处不提及，网上解决办法超多的）。

#download a single SRR file.
prefetch SRRxxxxxx

#download multiple files.
prefetch --option-file the_text_containing_one_series_no_per_row.txt

默认路径下载的sra文件放在了~/ncbi/public/sra中。

测序raw数据通常为.fastq，所以我们继续用sratoolkit中的fastq-dump对下载得到的.sra文件进行格式转换。由于下载的是单端测序数据，所以直接用如下命令就可以在同文件夹中获得相应fastq文件，作为一系列分析的起点。双端测序的格式转换命令是不同的！那再说吧...

fastq-dump /path/SRRxxxx.sra

开始分析

我把以上的命令集成了写成了一个shell脚本，通过输入GEO SRR_ID，GEO SRR_ID_phenotype，phenotype的字符数，和使用的接头序列路径，就可以进行一步完整分析了。

单端测序比双端要简单一些，那就先拿单端练练手吧。教程使用的原文件是2013年在ncbi上发表的卵母细胞（n=3）和合子（n=3）的转录组单端测序数据。

首先我们需要查询得到想从GEO下载的原始数据编号，然后使用nano文本编辑器编写成每行含有一个SRR ID的文件（id.txt），以及对应的分组表型文件：我这里自定义了一下，格式为SRR ID_表型分组信息，每行一个（group.txt）。

注：关于表型分组信息需要特别说明一下：由于脚本编写时文件的内部重命名依赖了分组表型的字符数，所以运行这个脚本需要分组名称的字符数相等。如本例中的分组：oocyte_1/2/3和zygote_1/2/3，都分别是8个字符。所以将8作为第三个参数输入。

mkdir ~/rna.seq.tutorial && cd ~/rna.seq.tutorial
nano id.txt
#SRR445718 SRR445719 SRR445720 SRR445721 SRR445722 SRR445723
nano group.txt
#SRR445718_oocyte_1 SRR445719_oocyte_2 SRR445720_oocyte_3 SRR445721_zygote_1 SRR445722_zygote_2 SRR445723_zygote_3

两个输入文档的展示

接下来编写脚本。

这个脚本不需要超级用户root。它依赖的系统目录构架如下：

系统中的目录构架

所以我们只需要配置好软件；把sratoolkit的下载路径设置为~/ncbi/public或是默认；把rna.seq.tutorial目录建立，并在其中写入脚本，id.txt和group.txt，就可以开始愉快运行脚本了。

脚本内容如下，还是建议通读一下的。

#!/bin/bash
#nano mrna.seq.pipeline.sh
#we can use it by entering: bash mrna_seq_pipeline.sh ~/rna.seq.tutorial/id.txt ~/rna.seq.tutorial/group.txt length_of_the_group_name absolute_directory_of_adapters_for_trimmomatic

#download sra files 
prefetch --option-file $1
cd ../ncbi/public/sra

#get the fastq raw data.
for files in `cat $1`
do 
echo ${files}' is being processed'
fastq-dump ${files}'.sra'
seriesngroup=`grep ${files} $2`
total_length=${#seriesngroup}
group_this_file=${seriesngroup:10:$3} #I postulate the SRR series are 10-digit series numbers.
mv ${files}'.fastq' ${group_this_file}'.fastq'
done
group_names=`ls *.fastq`
echo "fastq raw data of all samples have been all fetched."

#quality control by fastqc.
if [ -d './fastqc.output' ] 
then
    echo 'fastqc.output directory has existed. Good!'
else
    echo 'fastqc.output directory has not been established. Now we will make it!'
    mkdir ./fastqc.output
fi
fastqc *.fastq -o ./fastqc.output
echo "QC on raw reads has been done."

#use trimmomatic to filter the reads with low quality and trim low-quality bases.
if [ -d './trimmed_fastq' ] 
then
    echo 'trimmed_fastq directory has existed. Good!'
else
    echo 'trimmed_fastq directory has not been established. Now we will make it!'
    mkdir ./trimmed_fastq
fi
for files in ${group_names}
do 
echo ${files}' is being filtered and trimmed'
java -jar ~/trimmomatic/trimmomatic-0.39.jar SE -threads 8 -phred33 "./${files}" "./trimmed_fastq/clean_${files}" ILLUMINACLIP:$4:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
done
echo "Reads trimming has been done."

#perform quality control again on these clean data.
cd ./trimmed_fastq
if [ -d './clean.fastqc.output' ] 
then
    echo 'clean.fastqc.output directory has existed. Good!'
else
    echo 'clean.fastqc.output directory has not been established. Now we will make it!'
    mkdir ./clean.fastqc.output
fi
fastqc *.fastq -o ./clean.fastqc.output
echo "QC on trimmed reads has been done."

#use bowtie2 to align the reads to the reference.
if [ -d '~/hg.38.ref' ] 
then
    cd ~/hg.38.ref
    if [ -e 'Homo_sapiens.GRCh38.dna.primary_assembly.fa' ]
    then
    echo 'the reference genome has existed. Good!'
    else
    echo 'Now we will download the reference genome.'
        wget ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
        gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    fi
else
    echo 'Now we will make the directory and download the reference genome.'
    mkdir ~/hg.38.ref
    wget ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
fi
if [ -d '~/hg.38.ref/hg38.index' ] 
then
    cd ~/hg.38.ref/hg38.index
    if [ -e 'hg.38.1.bt2' ] && [ -e 'hg.38.2.bt2' ] && [ -e 'hg.38.3.bt2' ] && [ -e 'hg.38.4.bt2' ] && [ -e 'hg.38.rev.1.bt2' ] && [ -e 'hg.38.rev.2.bt2' ]
    then
        echo 'The reference genome index has existed. Good!'
    else
        echo 'Bowtie2 is working on reference index establishment.'
        bowtie2-build ~/hg.38.ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa hg.38
    fi
else
    echo 'Bowtie2 is working on reference index establishment.'
    mkdir ~/hg.38.ref/hg38.index && cd ~/hg.38.ref/hg38.index
    bowtie2-build ~/hg.38.ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa hg.38
fi
cd ~/ncbi/public/sra/trimmed_fastq
for trimmed_files in `ls *.fastq`
do
    echo ${trimmed_files}' is being aligned with hg38 reference.'
    bowtie2 -p 8 -x ~/hg.38.ref/hg38.index/hg.38 -U ${trimmed_files} -S "${trimmed_files}.sam" --no-unal
done
echo "The bowtie2 alignment on samples has been done."

#use samtools to transform and order the aligned results.
if [ -d '../bam' ]
then
    echo 'bam directory has existed.'
else
    mkdir ../bam
fi
for sams in `ls *.sam`
do
    cd ../bam
    charlength=$((6+$3))
    bam_file=${sams:0:$charlength}
    samtools view -bS "../trimmed_fastq/$sams" > "${bam_file}.bam"
    samtools sort "${bam_file}.bam" -o "${bam_file}_ordered.bam"
    rm "${bam_file}.bam"
    sambamba view -h -t 8 -f bam -F "[XS] == null and not unmapped and not duplicate" "${bam_file}_ordered.bam" > "${bam_file}_ordered_filtered.bam"
    rm "${bam_file}_ordered.bam"
done
echo "alignment reordering and filtering has been done."

#get the annotation for the genome and transform .gff file to .gtf file for quantification.
if [ -e '~/hg.38.ref/Homo_sapiens.GRCh38.100.gtf' ]
then
    echo "GTF file has all set!"
else
    echo "Now download the GTF annotation file."
    wget ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz -P ~/hg.38.ref
    gunzip Homo_sapiens.GRCh38.100.gtf.gz
fi

#Use HTSeq to assemble the reads and complete the annotation.
if [ -d '../count_matrix' ]
then
    echo 'count_matrix directory has existed.'
else
    mkdir ../count_matrix
fi
cd ~/ncbi/public/sra/bam
for bams in `ls * | grep "filtered.bam"`
do
    cd ../count_matrix
    count_file_name=${bams:6:$3}
    htseq-count -f bam -s no "../bam/$bams" ~/hg.38.ref/Homo_sapiens.GRCh38.100.gtf > "${count_file_name}_count.txt"
done
echo "gene annotation and counts extraction have been done."

最终开始运行：共4个参数，第一个为id.txt，第二个group.txt，第三个为表型分组的字符数，第四个为测序用到的接头序列fasta。

bash ~/rna.seq.tutorial/mrna.seq.pipeline.sh ~/rna.seq.tutorial/id.txt ~/rna.seq.tutorial/group.txt 8 ~/trimmomatic/adapters/TruSeq3-SE.fa

运行过程中terminal的一些显式进程大致如下，我截了一些图：

脚本运行过程中的一些进程显示

80%的修剪后保留率，80%的比对率都提示数据的总体质量还不错。

我们看看数据的两次fastqc结果：

两次fastqc结果的比较（左-原始数据、右-trimmomatic修剪后数据）

左边是原始数据的QC report，右边是trimmomatic处理后的QC report。

（1）碱基质量分数：可见原始数据中，随着reads的延伸，碱基的测序质量快速降低，总体已经达到不能接受的程度。在过滤修剪后，每个碱基位置的总体质量分数分布均在绿色范围，说明过滤很好地提升了测序reads的质量。

（2）reads长度：可见原始数据中，测到的reads都是100bp长度。在修剪后，reads长度出现差异，分布在50-100bp间，说明部分reads中有碱基修剪；这也与我们设置可接受最小reads长度为50bp相符合。

（3）接头内容：可见原始数据中，reads的3’端检测到少量的接头内容，可能是由于mRNA片段本身过短所导致的接头被测序。通过修剪后，右边的report显示已经基本检测不到接头序列的存在，说明修剪成功。

最后我们可以得到如下文件，我在terminal中查看了相关文件夹中的文件：

Screen Shot 2020-05-11 at 11.10.30 AM.png

count_matrix中的6个count.txt就是每个样本的表达矩阵了！如下图所示，展示了其中一个样本的表达矩阵：

一个样本的表达矩阵概览

通过htseq获得的每个样本的表达矩阵有2列，第一列为基因的ENSEMBL ID，第二列为每个注释的raw counts。此外，最后五行描述了htseq计算的结果，从__not_aligned和__alignment_not_unique均为0即可知bowtie2和sambamba的过滤是有效的（在脚本中已经设置Bowtie2在比对时不输出未必对的reads，以及sambamba过滤多次比对的reads）。

最后我们将所有样本的表达矩阵合并成一个总矩阵后，就可以进行下游一系列分析了。

有兴趣的朋友可以学习，欢迎大家相互交流！觉得有帮助就点个赞吧！