RNA-seq Analysis Pipeline: From Raw Sequencing Data to Count Output
2019-07-03
dming1024
Building on my earlier study of snakemake, I have reworked and optimized the pipeline I wrote before: you only need to enter the samples and the reference genome path in config.yaml to get the final count data from the alignments.
Reference: the official snakemake documentation
snakemake + R/Python scripts
The pipeline consists of the following stages:
1. Alignment, producing SAM files: saved under bwa/
2. SAM-to-BAM conversion: saved under bam/
3. Sorting the BAM files: saved under sort/
4. Counting reads per gene from the BAM files: saved under count/
Two additional directories are required:
samples/: holds the raw sequencing data to be analyzed
reference/: holds the reference genome
Pipeline flowchart (image not reproduced here).
Configuration file config.yaml:
samples:
    control_R1: samples/control_R1.fq.gz
    control_R2: samples/control_R2.fq.gz
    treated_R1: samples/treated_R1.fq.gz
    treated_R2: samples/treated_R2.fq.gz
reference:
    bwaRef: reference/chr19
annotation:
    gtf: annotation/mouse.chr.gtf
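To make the wildcard resolution concrete, here is a plain-Python sketch (my stand-in for the parsed YAML above) of how a target pattern like "count/{sample}.count" expands against the samples mapping; snakemake's expand() iterates over the dict's keys, which is why R1 and R2 end up treated as four separate samples:

```python
# Plain-Python stand-in for the parsed config.yaml above.
config = {
    "samples": {
        "control_R1": "samples/control_R1.fq.gz",
        "control_R2": "samples/control_R2.fq.gz",
        "treated_R1": "samples/treated_R1.fq.gz",
        "treated_R2": "samples/treated_R2.fq.gz",
    }
}

# expand("count/{sample}.count", sample=config["samples"]) substitutes
# each dict key into the pattern, one target file per sample name.
targets = ["count/{sample}.count".format(sample=s) for s in config["samples"]]
print(targets)
# ['count/control_R1.count', 'count/control_R2.count',
#  'count/treated_R1.count', 'count/treated_R2.count']
```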
Input data and reference genome:
tree samples/ reference/
samples/
├── control_R1.fq.gz
├── control_R2.fq.gz
├── nohup.out
├── treated_R1.fq.gz
├── treated_R2.fq.gz
├── wget-log
└── wget-log.1
reference/
├── chr19.1.bt2
├── chr19.2.bt2
├── chr19.3.bt2
├── chr19.4.bt2
├── chr19.fa
├── chr19.fa_bowtie2-build.log
├── chr19.fa.fai
├── chr19.rev.1.bt2
├── chr19.rev.2.bt2
├── GRCm38.83.chr19.gtf
└── nohup.out
1. Quality control with FastQC
rule quality_check:
    input:
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "qc/"
    shell:
        "fastqc -o {output} {input}"
This step failed in this run (it also failed when run on its own), possibly due to my server, so although it is part of the pipeline it was skipped for this analysis. (Two likely culprits: FastQC does not create the directory passed to -o, so qc/ must exist beforehand; and the rule's output "qc/" contains no {sample} wildcard, so snakemake cannot resolve wildcards.sample in the input function.)
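A minimal sketch of one possible fix, assuming the failure is FastQC refusing to run because the -o directory does not exist (the command list below only constructs the invocation; it does not require FastQC to be installed):

```python
import os
import tempfile

# Pre-create the qc/ output directory before calling fastqc, since
# fastqc errors out when the -o directory is missing (my assumption
# about the failure above). tempfile keeps this sketch side-effect free.
outdir = os.path.join(tempfile.mkdtemp(), "qc")
os.makedirs(outdir, exist_ok=True)

cmd = ["fastqc", "-o", outdir, "samples/control_R1.fq.gz"]
# subprocess.run(cmd, check=True)  # would invoke FastQC if installed
print(os.path.isdir(outdir))  # True
```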
2. Alignment with Bowtie2
rule bwa:   # note: the rule is named "bwa", but it actually runs bowtie2
    input:
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "bwa/{sample}.sam"
    params:
        INDEX=config["reference"]["bwaRef"]   # reference genome index prefix
    shell:
        "bowtie2 -x {params.INDEX} -q -U {input} -S {output}"
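Note that this rule aligns R1 and R2 as two independent single-end runs. A hypothetical paired-end variant would pass the mates together via bowtie2's -1/-2 options instead; a sketch of that command (file paths taken from this pipeline, output name "control.sam" is my invention):

```python
# Build a paired-end bowtie2 command: -1/-2 take the two mate files,
# instead of running each read file through the aligner separately.
index = "reference/chr19"
r1 = "samples/control_R1.fq.gz"
r2 = "samples/control_R2.fq.gz"
cmd = f"bowtie2 -x {index} -1 {r1} -2 {r2} -S bwa/control.sam"
print(cmd)
```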
3. Converting SAM to BAM with samtools
rule bam:
    input:
        "bwa/{sample}.sam"
    output:
        "bam/{sample}.bam"
    shell:
        "samtools view -Sb {input} > {output}"
4. Sorting the BAM files
rule sort:
    input:
        "bam/{sample}.bam"
    output:
        "sort/{sample}_sort.bam"
    shell:
        "samtools sort {input} -o {output}"
5. Building an index for the BAM files
rule index:
    input:
        "sort/{sample}_sort.bam"
    output:
        "sort/{sample}_sort.bam.bai"   # samtools writes BAM indexes as .bai
    shell:
        "samtools index {input} {output}"
6. Counting reads per gene from the BAM files with htseq-count
(The sort rule above produces coordinate-sorted BAM files, so the order flag should be -r pos rather than -r name.)
rule count:
    input:
        sample="sort/{sample}_sort.bam",
        annotation=config["annotation"]["gtf"]
    output:
        "count/{sample}.count"
    shell:
        "htseq-count -f bam -r pos -s no {input.sample} {input.annotation} > {output}"
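For reference, each count/<sample>.count file is a two-column tab-separated table (gene id, read count), followed by htseq-count's summary rows prefixed with "__" (e.g. __no_feature, __ambiguous). A sketch of reading one, with made-up gene ids and counts:

```python
from io import StringIO

# Fabricated example of htseq-count output for illustration only.
example = StringIO(
    "ENSMUSG00000000001\t42\n"
    "ENSMUSG00000000028\t7\n"
    "__no_feature\t1234\n"
    "__ambiguous\t56\n"
)

counts = {}
for line in example:
    gene, n = line.rstrip("\n").split("\t")
    if not gene.startswith("__"):   # skip the summary rows at the bottom
        counts[gene] = int(n)
print(counts)
# {'ENSMUSG00000000001': 42, 'ENSMUSG00000000028': 7}
```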
7. Merging the per-sample count results with Python; this rule, however, sits at the very top of the snakefile
rule all:
    input:
        expand("count/{sample}.count", sample=config["samples"]) # enumerate every sample's count file
    output:
        "count/merge.count" # the merged output file
    run:
        import os
        import pandas as pd
        dm = pd.DataFrame() # start from an empty DataFrame
        for filename in input: # iterate over the full paths of the count files
            df = pd.read_csv(filename, sep="\t", names=["gene", os.path.basename(filename)])
            if not dm.empty: # merge on the gene column
                dm = pd.merge(dm, df, on="gene")
            else:
                dm = df
        dm.to_csv(output[0], sep='\t')
        # write the merged table to the output path; note that output[0] (an index into output) is required
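A dependency-free sketch of what this merge does: combine per-sample {gene: count} tables into one row per gene with one column per sample (the rule itself does this with pandas.merge on the "gene" column; the gene names and counts below are made up for illustration):

```python
# Two fabricated per-sample count tables, keyed by gene id.
samples = {
    "control_R1.count": {"geneA": 10, "geneB": 3},
    "treated_R1.count": {"geneA": 7, "geneB": 0},
}

# Union of all gene ids, then one row per gene, one column per sample.
genes = sorted(set().union(*samples.values()))
header = ["gene"] + list(samples)
rows = [[g] + [samples[s][g] for s in samples] for g in genes]
print(header)   # ['gene', 'control_R1.count', 'treated_R1.count']
print(rows)     # [['geneA', 10, 7], ['geneB', 3, 0]]
```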
8. The complete snakefile is shown below
cat snakefile1
configfile: "config.yaml"

rule all:
    input:
        expand("count/{sample}.count", sample=config["samples"])
    output:
        "count/merge.count"
    run:
        import os
        import pandas as pd
        dm = pd.DataFrame()
        for filename in input:
            df = pd.read_csv(filename, sep="\t", names=["gene", os.path.basename(filename)])
            if not dm.empty:
                dm = pd.merge(dm, df, on="gene")
            else:
                dm = df
        dm.to_csv(output[0], sep='\t')

#rule all:
#    input:
#        "count/merge.count"
#        expand("count/{sample}.count", sample=config["samples"])

rule quality_check:
    input:
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "qc/"
    shell:
        "fastqc -o {output} {input}"

rule bwa:
    input:
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "bwa/{sample}.sam"
    params:
        INDEX=config["reference"]["bwaRef"]
    shell:
        "bowtie2 -x {params.INDEX} -q -U {input} -S {output}"

rule bam:
    input:
        "bwa/{sample}.sam"
    output:
        "bam/{sample}.bam"
    shell:
        "samtools view -Sb {input} > {output}"

rule sort:
    input:
        "bam/{sample}.bam"
    output:
        "sort/{sample}_sort.bam"
    shell:
        "samtools sort {input} -o {output}"

rule index:
    input:
        "sort/{sample}_sort.bam"
    output:
        "sort/{sample}_sort.bam.bai"
    shell:
        "samtools index {input} {output}"

rule count:
    input:
        sample="sort/{sample}_sort.bam",
        annotation=config["annotation"]["gtf"]
    output:
        "count/{sample}.count"
    shell:
        "htseq-count -f bam -r pos -s no {input.sample} {input.annotation} > {output}"
The generated results look like this:
bam/:
total 181M
-rw-r--r-- 1 root root 49M Jul 2 11:20 control_R1.bam
-rw-r--r-- 1 root root 49M Jul 2 11:21 control_R2.bam
-rw-r--r-- 1 root root 42M Jul 2 11:21 treated_R1.bam
-rw-r--r-- 1 root root 43M Jul 2 11:21 treated_R2.bam
bwa/:
total 640M
-rw-r--r-- 1 root root 173M Jul 2 11:16 control_R1.sam
-rw-r--r-- 1 root root 172M Jul 2 11:15 control_R2.sam
-rw-r--r-- 1 root root 148M Jul 2 11:18 treated_R1.sam
-rw-r--r-- 1 root root 148M Jul 2 11:17 treated_R2.sam
count/:
total 5.3M
-rw-r--r-- 1 root root 965K Jul 2 11:30 control_R1.count
-rw-r--r-- 1 root root 965K Jul 2 11:28 control_R2.count
-rw-r--r-- 1 root root 1.5M Jul 3 20:27 merge.count
-rw-r--r-- 1 root root 965K Jul 2 11:36 treated_R1.count
-rw-r--r-- 1 root root 965K Jul 2 11:33 treated_R2.count
sort/:
total 131M
-rw-r--r-- 1 root root 35M Jul 2 11:25 control_R1_sort.bam
-rw-r--r-- 1 root root 35M Jul 2 11:25 control_R2_sort.bam
-rw-r--r-- 1 root root 31M Jul 2 11:25 treated_R1_sort.bam
-rw-r--r-- 1 root root 31M Jul 2 11:25 treated_R2_sort.bam
All in all, I'm fairly satisfied with this; I will keep refining the pipeline.