（全）WES（全外显子测序）分析流程

2019-01-24 本文已影响155人黄晶_id

1.安装软件配置环境

先安装conda，相当于先给我们的电脑安装一个软件管家
安装conda的方法

# 先去清华镜像站搜索你想安装版本的下载链接
mkdir src && cd src
wget -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda2-4.5.12-Linux-x86_64.sh

回车后系统会自动下载，下载完成后会有一个 Miniconda2-4.5.12-Linux-x86_64.sh bash文件

bash Miniconda2-4.5.12-Linux-x86_64.sh
#这样就相当于激活了Miniconda这个软件，软件安装成功。

更改镜像源配置如下：

#直接复制粘贴运行这些代码，镜像就配置好了
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
conda config --set show_channel_urls yes

然后就可以用conda来安装一系列软件了
注意：安装软件前一定先搜索一下，conda有没有这个软件或者这个软件是不是你以为的名字

conda  create -n wes  python=2 bwa #创建wes环境
conda info --envs  #查看当前环境命令，看看是否添加上了wes环境
source activate wes #启动当前环境

**必须保证所有的软件都是安装在 wes 环境下面**
conda install sra-tools
conda install samtools
conda install -y bcftools vcftools  snpeff
conda install -y multiqc qualimap
conda install -y gatk4

2.raw fastq

从NCBI上下载SRA数据转成fastq，此时就得到了原始的fastq数据。（下载sra数据特别慢）

## 找到SRA数据，下载SRR list
nohup cat SRR_Acc_List.txt|while read id; do (prefetch ${id});done &
## sra转为fastq
ls SRR5660416.sra |fastq-dump -gzip --split-3 -O ./ SRR5660416.sra #单个测试
cat >sra.sh #写脚本
ls SRR* |while read id; do (fastq-dump -gzip --split-3 -O ./ ${id}) 1>./${id}.sra.log 2>&1;done #多个循环
nohup bash sra.sh & #挂后台运行

得到原始的fastq数据

3.质控得到dean fastq

软件：
• fastqc
• cutadapt
• Trim Galore

fastqc质量报告

mkdir 1.fastq_qc && cd 1.fastq_qc
fastqc /home/qmcui/7E5240_L1_A001.L1_1.fastq.gz /home/qmcui/7E5240_L1_A001.L1_2.fastq.gz -o ./tmp

fastqc得到的文件

去接头

mkdir 1.fastq_qc && cd 1.fastq_qc
#单个测试
trim_galore --phred33 -q 25 -e 0.1 --length 36 --stringency 3 --paired -o ~/ ../sra/tmp/7E5241.L1_1.fastq.gz ../sra/tmp/7E5241.L1_2.fastq.gz

我们来看下这些数字的含义：
• 去除接头
• 去除低质量碱基(-q 25)
• 最大允许错误率(默认-e 0.1)
• 去除<36的reads(--length 36)
• 切除双端的overlap>3的碱基
• 去除reads以对为单位(--paired)

#多个循环 trim_galore
ls /home/qmcui/*1.fastq.gz>./1;cat 1
ls /home/qmcui/*2.fastq.gz>./2;cat 2
paste 1 2 >config
cat >trim.sh
cat config|while read id;do arr=(${id}); fq1=${arr[0]}; fq2=${arr[1]}; echo $fq1 $fq2 trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o ../clean  $fq1 $fq2 >./${id}.log 2>&1;done
nohup bash mvadaptor.sh &

下面跑流程的时候我们用少量sample序列做分析，所以我们每一步会给它先赋值

4.比对mapping

比对目的：
• 将打断测序的reads比回参考基因组
• 得到比对结果sam文本，用于后续分析
软件：
• bwa(Burrows-Wheeler Aligner对准器 )
bwa 是一款将序列比对到参考基因组上的软件
比对策略：
• index先建立索引
• mem算法(maximal exact matches)

bwa软件的作用是将序列比对到参考基因组上，在比对之前，首先需要对参考基因组建立索引。

#1. 建立project路径
mkdir 0.config 2.mapping 
#2.建立索引
ln -s /public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta /home/jhuang/project/wes/0.config/hg38.fa
bwa index /home/jhuang/project/wes/0.config/hg38.fa
## 人类参考基因组 hg38.fa

索引建立好之后，会生成5个文件，后缀分别为 .bwt / .pac / .ann / .amb / .sa

索引建好之后我们开始用baw比对，比对结果是要生成排好序的bam，有两个策略：
• bwa生成sam转bam,然后再排序。
• bwa比对时直接利用管道符‘|’直接生成排序bam（推荐）
这里这两种方法都介绍，但跑的时候建议用第二种

bwa首先生成sam文件：

## 先赋值
sample=“7E5241”
INDEX=“/home/qmcui/database/reference/index/bwa/hg38” （比对基因组所放的位置）
fq1=“/home/qmcui/7E5239.L1_1_val_1.fq.gz”
fq2=“/home/qmcui/7E5239.L1_2_val_2.fq.gz”
（这里的的fq1、fq2是经过过滤得到的fastq文件）

**单个样品比对
bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX $fq1 $fq2 1>7E5241.sam 2>7E5241.log
##`1>7E5241.sam 2>7E5241.log`条命令是说正确的运行结果放入命名为7E5241.sam的文档，错误的放入7E5241.log文档。

再将上步生成的sam文件转成bam
bam文件是二进制文件，占内存小，一般存bam删sam

## sam转bam
samtool view -bS -h 7E5241.sam > 7E5241.bam
或者
samtool view -bS -h 7E5241.sam -o 7E5241.bam

bam查看方法

samtool view -h smaple.bam |less -S
或者
samtool view smaple.bam |less -S

bam文件排序（sort）

sort目的：
• 用于后续软件分析
• 提取bam序列
软件：samtools(sort参数)

## 建立路径
mkdir 3.sort_bam && cd 3.sort_bam
## 排序bam
samtools sort -@ 5 ../2.mapping/7E5241.bam -o 7E5241.sort.bam

下面介绍一步到位生成sort.bam

## 先赋值
sample=“7E5241”
INDEX=“/home/qmcui/database/reference/index/bwa/hg38” （比对基因组所放的位置）
fq1=“/home/qmcui/7E5239.L1_1_val_1.fq.gz”
fq2=“/home/qmcui/7E5239.L1_2_val_2.fq.gz”
##
bwa mem -t 5 -R
"@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX
$fq1 $fq2 | samtools sort -@ 5 -o $sample.sort.bam -
### bam统计
*单个测试
samtools flagstat /home/qmcui/7E5241.chr1_2.sort.bam
*多个
ls *.bam | while read id ;do (samtools flagstat $id > $(basename$id ".bam").stat);done

5.Mark PCR重复

此步的目的：
• 标记/删除PCR重复的reads
• 为后续call变异位点增加可信度，去掉假阳性
软件：GATK4(MarkDuplicates)

mkdir 4.gatk_markdup && cd 4.gatk_markdup

sample="7E5241"
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates -I /home/qmcui/7E5241.chr1_2.sort.bam -O ${sample}_marked.bam -M ${sample}.metrics 1>${sample}_log.mark 2>&1

6.找变异

目的：
• 得到vcf前矫正的bam步骤
参数：
• sample赋值
• -I 输入文件
• -O 输出文件
• 1> 重定向标准输出
• 2> 重定向错误输出到1的文
件内(目的：记录程序运行日志)
软件：
• GATK 子命令FixMateInformation
• samtools 子命令index 建索引

6.1标记flagstat

### 建立路径
mkdir 6.gatk_bam && cd 6.gatk_bam
### 找变异
sample="7E5241"
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" FixMateInformation -I ${sample}_marked.bam -O ${sample}_marked_fixed.bam -SO coordinate 1>${sample}_log.fix 2>&1
samtools index ${sample}_marked_fixed.bam

6.2找变异

sample="7E5241" 
ref="/home/qmcui/database/reference/index/hisat/hg38/hg38.fa"
indel="/home/qmcui/tmp/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
snp="/home/qmcui/dbsnp_146.hg38.vcf.gz"

gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" BaseRecalibrator -R $ref -I ${sample}_marked_fixed.bam --known-sites $snp --known-sites $indel -O ${sample}_recal.table

6.3矫正bam

sample="7E5241" 
ref="/home/qmcui/database/reference/index/hisat/hg38/hg38.fa"
indel="/home/qmcui/tmp/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"

gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" ApplyBQSR -R $ref -I ${sample}_marked_fixed.bam -bqsr ${sample}_recal.table -O ${sample}_bqsr.bam 1>${sample}_log.ApplyBQSR 2>&1

6.4calling variation得到vcf

mkdir gatk_single_vcf && cd gatk_single_vcf
ref="/home/qmcui/database/reference/index/hisat/hg38/hg38.fa"
"snp="/home/qmcui/dbsnp_146.hg38.vcf.gz
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller -R $ref -I ${sample}_bqsr.bam --dbsnp $snp -O ${sample}_raw.vcf 1>${sample}_log.HC 2>&1

得到_raw.vcf文件找变异这步就算成功

6.5 得到gvcf文件

3个样本循环（一个样本1.5h的耗时）所以把它挂后台运行吧（nohup）。

参数：
• ERC 指定类型
• -R 参考基因组
• -I 输入的bam(bqsr bam)
• dbsnp数据库
• -O 输出文件
软件：GATK

## sample是不同的样本，可以把样本写入一个txt文档中
cat bq_bam.txt
7E5239
7E5240
7E5241
^C
cat >bq_bam.sh
ref="/home/qmcui/database/reference/index/hisat/hg38/hg38.fa"
snp="/home/qmcui/dbsnp_146.hg38.vcf.gz"
cat bq_bam.txt| while read sample
do gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller -ERC GVCF -R $ref -I ${sample}.bqsr.bam --dbsnp $snp -O ${sample}_raw.gvcf 1>${sample}.gvcf.log 2>&1
done
nohup bash bq_bam.sh &
samtools index ${sample}.bqsr.bam # 核查bam是否建索引

6.6合并上步生成的gvcf，按染色体来找变异

参数：
• -L 区域(可多个)
• -R 参考基因组
• -V 一个样本的gvcf
• -V 再一个，可追加
• --genomicsdb先生成db文件
软件：
• GATK
• GenomicsDBImport合并gvcf
• GenotypeGVCFs工具对gvcf文件进行 joint genotyping

#先将染色体按条数分开
seq 22|while read id;do echo chr${id};done >bed.txt
cat >>bed.txt
chrX
chrY
chrM
#合并样本的gvcf
ref="/home/qmcui/database/reference/index/hisat/hg38/hg38.fa"
snp="/home/qmcui/dbsnp_146.hg38.vcf.gz"
cat bed.txt|while read bed;do gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" GenomicsDBImport -L $bed -R $ref -V 7E5239_raw.gvcf -V 7E5240_raw.gvcf -V 7E5241_raw.gvcf --genomicsdb-workspace-path gvcfs_${bed}.db
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" GenotypeGVCFs \ -R $ref -V gendb://gvcfs_${bed}.db --dbsnp $snp -O final_${bed}.vcf

#合并gvcf
gatk MergeVcfs -I final_chr2.vcf -I final_chr1.vcf -O raw.combine.vcf.gz

7.过滤

由raw vcf经过滤得到HQ vcf

过滤流程

7.1 vcf snp过滤

目的：
• 过滤snp的vcf
参数：
• -select-type 筛选类型
• -V 输入文件
• --filter-expression过滤参数
• -I 输入文件
• -O 输出
• exclude-filtered true：过滤
非pass
软件：
SelectVariants 还可以设置---min-indel-size 2、--max-indel-size 5 -R $ref

mkdir 7.vcf_filter && cd 7.vcf_filter
mv ../raw.combine.vcf.gz ./

##意思就是把上步生成的合并的gvcf文件`raw.combine.vcf.gz `放入到当前目录下。

##提取snp为单独文件，
gatk SelectVariants -select-type SNP -V raw.combine.vcf.gz -O raw.snp.vcf.gz
#设置过滤参数进行过滤
# -filter-expression "QUAL < 30.0 || QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"`过滤参数

gatk VariantFiltration -V raw.snp.vcf.gz --filter-expression "QUAL < 30.0 || QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "Filter" -O filter.snp.vcf.gz

标记过滤参数下要被过滤及合格的序列

#过滤
#`--exclude-filtered true`过滤掉非pass的
gatk SelectVariants --exclude-filtered true -V filter.snp.vcf.gz -O filtered.snp.vcf.gz

zless -S filter.indel.vcf.gz|grep -v '^#'|cut -f 7 |sort|uniq -c
zless -S filtered.snp.vcf.gz|grep -v '^#'|cut -f 7 |sort|uniq -c
#查看过滤前和过滤后文件的内容，第七列标记了pass或是要被过滤的信息

过滤前后序列数量比较

7.2 vcf indel过滤

步骤与snp过滤相同，过滤参数不同

gatk SelectVariants -select-type INDEL -V raw.combine.vcf.gz -O raw.indel.vcf.gz
gatk VariantFiltration -V raw.indel.vcf.gz --filter-expression "QUAL < 30.0 | QD < 2.0 || FS > 200.0 || SOR > 10.0 || Inbreeding Coeff < -0.8 || ReadPosRankSum < -20.0"
--filter-name "Filter" -O filter.indel.vcf.gz
gatk SelectVariants --exclude-filtered true -V filter.indel.vcf.gz \
-O filtered.indel.vcf.gz
#过滤后的snp与indel合并（可以不合并）
gatk MergeVcfs -I filtered.snp.vcf.gz -I filtered.indel.vcf.gz -O combine.filtered.vcf.gz

8.ANNOVAR注释

目的：
• 注释vcf
参数：
• convert2annovar perl脚本生成注释需要文件
• annotate_variation 注释
软件：
• annovar 的perl脚本

8.1输入文件格式转换

# convert2annovar.pl 可将多种格式转为ANNOVAR使用.avinput格式;
# -format vcf4  表明输入文件`filtered.indel.vcf.gz`是vcf格式的内容
# -allsample  将输入filtered.indel.vcf.gz中3个样本按每个样本输出到单一的.aviput文件
dir=/home/qmcui/software/ANNOVAR/annovar/
$dir/convert2annovar.pl -format vcf4 -allsample filtered.indel.vcf.gz -outfile anno

8.2 ANNOVAR注释功能


#根据基因注释
cat >id.txt
anno.7E5239
anno.7E5240
anno.7E5241

# -geneanno 基于基因的注释
# -buildver hg38 默认使用hg18
#`$id.avinput`输入文件
dir=/home/qmcui/software/ANNOVAR/annovar/
db=$dir/humandb/
ls $db
cat id.txt|while read id; do ($dir/annotate_variation.pl -geneanno -buildver hg38 $id.avinput $db);done