
RNA-seq analysis pipeline: from raw sequencing data to count output

2019-07-03  dming1024

Building on my earlier study of snakemake, I have reworked the pipeline I wrote before: now you only need to list the samples and the reference-genome path in config.yaml to get the final count data from the alignments.
Official documentation I referred to:
snakemake + R/Python script

The analysis has the following main stages:

1. Alignment, producing SAM files: saved under bwa/
2. SAM converted to BAM: saved under bam/
3. Sorting the BAM files: saved under sort/
4. Counting reads in the BAM files: saved under count/

Two more directories are required:

samples/: holds the raw sequencing data to be analysed
reference/: holds the reference genome
Pipeline flowchart of the analysis (image not reproduced here)
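The four stages above can be sketched as a chain of filename transformations; each rule's output path is the next rule's input. A minimal sketch of the naming convention (not part of the snakefile itself):

```python
def stage_paths(sample):
    """Map a sample name to the file each pipeline stage produces."""
    return {
        "sam":   f"bwa/{sample}.sam",        # 1. alignment
        "bam":   f"bam/{sample}.bam",        # 2. SAM -> BAM
        "sort":  f"sort/{sample}_sort.bam",  # 3. sorted BAM
        "count": f"count/{sample}.count",    # 4. read counts
    }

print(stage_paths("control_R1")["count"])  # count/control_R1.count
```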
Configuration file config.yaml
samples:
 control_R1: samples/control_R1.fq.gz
 control_R2: samples/control_R2.fq.gz
 treated_R1: samples/treated_R1.fq.gz
 treated_R2: samples/treated_R2.fq.gz
reference:
 bwaRef: reference/chr19
annotation:
 gtf: annotation/mouse.chr.gtf
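For reference, here is a pure-Python sketch of what `configfile: "config.yaml"` gives the rules: the parsed YAML becomes a plain nested dict, and the `lambda wildcards: ...` input functions used below simply index into it (the dict literal mirrors the config above):

```python
# What config.yaml parses to: a plain nested dict.
config = {
    "samples": {
        "control_R1": "samples/control_R1.fq.gz",
        "control_R2": "samples/control_R2.fq.gz",
        "treated_R1": "samples/treated_R1.fq.gz",
        "treated_R2": "samples/treated_R2.fq.gz",
    },
    "reference": {"bwaRef": "reference/chr19"},
    "annotation": {"gtf": "annotation/mouse.chr.gtf"},
}

# The input function a rule uses: resolve a sample wildcard to its FASTQ path.
def sample_fastq(sample):
    return config["samples"][sample]

print(sample_fastq("control_R1"))  # samples/control_R1.fq.gz
```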
Data to be analysed and the reference genome:
tree samples/ reference/
samples/
├── control_R1.fq.gz
├── control_R2.fq.gz
├── nohup.out
├── treated_R1.fq.gz
├── treated_R2.fq.gz
├── wget-log
└── wget-log.1
reference/
├── chr19.1.bt2
├── chr19.2.bt2
├── chr19.3.bt2
├── chr19.4.bt2
├── chr19.fa
├── chr19.fa_bowtie2-build.log
├── chr19.fa.fai
├── chr19.rev.1.bt2
├── chr19.rev.2.bt2
├── GRCm38.83.chr19.gtf
└── nohup.out
1. fastqc quality control
rule quality_check:
 input:
  lambda wildcards: config["samples"][wildcards.sample]
 output:
  "qc/"
 shell:
  "fastqc -o {output} {input}"

This step did not succeed in this run, and it also failed when I ran it separately; it may be an issue with my server. So although the step is part of the pipeline, it was skipped during this analysis. (One fragile point: the rule declares the bare folder "qc/" as its output; newer Snakemake versions expect directory outputs to be marked with directory().)
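When a tool fails silently inside a pipeline, one quick sanity check is whether the executable is actually on PATH. A small sketch (not part of the original pipeline) using only the standard library:

```python
import shutil

def check_tools(tools):
    """Return the subset of tools that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Check the tools this pipeline shells out to.
missing = check_tools(["fastqc", "bowtie2", "samtools", "htseq-count"])
if missing:
    print("Missing executables:", ", ".join(missing))
```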

2. bowtie2 alignment (the rule and the bwa/ folder keep an older bwa name, but the command run is bowtie2)
rule bwa:
 input:
  lambda wildcards: config["samples"][wildcards.sample]
 output:
  "bwa/{sample}.sam"
 params:
  INDEX= config["reference"]["bwaRef"] #path to the reference genome index
 shell:
  "bowtie2 -x {params.INDEX} -q {input} -S {output}"
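Snakemake fills {params.INDEX}, {input} and {output} into the shell string much like Python's str.format. A rough illustration of how the command is rendered for one sample (not Snakemake's actual implementation):

```python
template = "bowtie2 -x {index} -q {fastq} -S {sam}"

cmd = template.format(
    index="reference/chr19",           # params.INDEX from config.yaml
    fastq="samples/control_R1.fq.gz",  # resolved input
    sam="bwa/control_R1.sam",          # resolved output
)
print(cmd)  # bowtie2 -x reference/chr19 -q samples/control_R1.fq.gz -S bwa/control_R1.sam
```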
3. samtools converts the SAM file to BAM
rule bam:
 input:
  "bwa/{sample}.sam"
 output:
  "bam/{sample}.bam"
 shell:
  "samtools view -Sb {input} > {output}"
4. sort the BAM files
rule sort:
 input:
  "bam/{sample}.bam"
 output:
  "sort/{sample}_sort.bam"
 shell:
  "samtools sort {input} -o {output}"
5. build an index for the BAM files
rule index:
 input:
  "sort/{sample}_sort.bam"
 output:
  "sort/{sample}_sort.bam.bai"
 shell:
  "samtools index {input} {output}"
6. htseq-count counts reads per gene in the BAM files
rule count:
 input:
  sample="sort/{sample}_sort.bam",
  annotation=config["annotation"]["gtf"]
 output:
  "count/{sample}.count"
 shell:
  "htseq-count -f bam -r name -s no {input.sample}  {input.annotation} > {output}"
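htseq-count writes two tab-separated columns (gene id, count) and appends a few `__`-prefixed summary rows (`__no_feature`, `__ambiguous`, ...). A sketch for reading one count file while dropping those summary rows (the file contents below are made up for illustration):

```python
import io

def read_counts(handle):
    """Parse htseq-count output (gene\\tcount), skipping __ summary rows."""
    counts = {}
    for line in handle:
        gene, value = line.rstrip("\n").split("\t")
        if not gene.startswith("__"):
            counts[gene] = int(value)
    return counts

fake = io.StringIO("ENSMUSG01\t10\nENSMUSG02\t0\n__no_feature\t500\n")
print(read_counts(fake))  # {'ENSMUSG01': 10, 'ENSMUSG02': 0}
```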
7. Merge each sample's count results with Python. Note, however, that this rule sits at the very beginning of the whole snakemake file (as rule all):
rule all:
 input:
  expand("count/{sample}.count",sample=config["samples"]) #expand over all samples: every per-sample count file
 output:
  "count/merge.count" #the merged output file
 run:
  import os
  import pandas as pd
  dm=pd.DataFrame() #start with an empty DataFrame
  for filename in input: #iterate over the full path of each count file
        df=pd.read_csv(filename,sep="\t",names=["gene",os.path.basename(filename)])
        if not dm.empty: #merge on the shared gene column
                dm=pd.merge(dm,df,on="gene")
        else:
                dm=df
  dm.to_csv(output[0],sep='\t')
#write the merged result to the output path; use output[0] (or access a named output by name)
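If pandas is unavailable, the same merge can be sketched with plain dictionaries; like pd.merge's default inner join, only genes present in every table are kept (sample data below is illustrative):

```python
def merge_counts(tables):
    """Merge a list of {gene: count} dicts on shared gene ids (inner join)."""
    genes = set(tables[0])
    for t in tables[1:]:
        genes &= set(t)  # keep only genes present in every table
    return {g: [t[g] for t in tables] for g in sorted(genes)}

a = {"g1": 5, "g2": 3}
b = {"g1": 7, "g3": 1}
print(merge_counts([a, b]))  # {'g1': [5, 7]}
```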
8. The final complete snakefile:
cat snakefile1
configfile: "config.yaml"

rule all:
 input:
  expand("count/{sample}.count",sample=config["samples"])
 output:
  "count/merge.count"
 run:
  import os
  import pandas as pd
  dm=pd.DataFrame()
  for filename in input:
        df=pd.read_csv(filename,sep="\t",names=["gene",os.path.basename(filename)])
        if not dm.empty:
                dm=pd.merge(dm,df,on="gene")
        else:
                dm=df
  dm.to_csv(output[0],sep='\t')
#rule all:
# input:
#  "count/merge.count"
#  expand("count/{sample}.count",sample=config["samples"])

rule quality_check:
 input:
  lambda wildcards: config["samples"][wildcards.sample]
 output:
  "qc/"
 shell:
  "fastqc -o {output} {input}"

rule bwa:
 input:
  lambda wildcards: config["samples"][wildcards.sample]
 output:
  "bwa/{sample}.sam"
 params:
  INDEX= config["reference"]["bwaRef"]
 shell:
  "bowtie2 -x {params.INDEX} -q {input} -S {output}"

rule bam:
 input:
  "bwa/{sample}.sam"
 output:
  "bam/{sample}.bam"
 shell:
  "samtools view -Sb {input} > {output}"

rule sort:
 input:
  "bam/{sample}.bam"
 output:
  "sort/{sample}_sort.bam"
 shell:
  "samtools sort {input} -o {output}"

rule index:
 input:
  "sort/{sample}_sort.bam"
 output:
  "sort/{sample}_sort.bam.bai"
 shell:
  "samtools index {input} {output}"

rule count:
 input:
  sample="sort/{sample}_sort.bam",
  annotation=config["annotation"]["gtf"]
 output:
  "count/{sample}.count"
 shell:
  "htseq-count -f bam -r name -s no {input.sample}  {input.annotation} > {output}"
The generated results look like this:
bam/:
total 181M
-rw-r--r-- 1 root root 49M Jul  2 11:20 control_R1.bam
-rw-r--r-- 1 root root 49M Jul  2 11:21 control_R2.bam
-rw-r--r-- 1 root root 42M Jul  2 11:21 treated_R1.bam
-rw-r--r-- 1 root root 43M Jul  2 11:21 treated_R2.bam

bwa/:
total 640M
-rw-r--r-- 1 root root 173M Jul  2 11:16 control_R1.sam
-rw-r--r-- 1 root root 172M Jul  2 11:15 control_R2.sam
-rw-r--r-- 1 root root 148M Jul  2 11:18 treated_R1.sam
-rw-r--r-- 1 root root 148M Jul  2 11:17 treated_R2.sam

count/:
total 5.3M
-rw-r--r-- 1 root root 965K Jul  2 11:30 control_R1.count
-rw-r--r-- 1 root root 965K Jul  2 11:28 control_R2.count
-rw-r--r-- 1 root root 1.5M Jul  3 20:27 merge.count
-rw-r--r-- 1 root root 965K Jul  2 11:36 treated_R1.count
-rw-r--r-- 1 root root 965K Jul  2 11:33 treated_R2.count

sort/:
total 131M
-rw-r--r-- 1 root root 35M Jul  2 11:25 control_R1_sort.bam
-rw-r--r-- 1 root root 35M Jul  2 11:25 control_R2_sort.bam
-rw-r--r-- 1 root root 31M Jul  2 11:25 treated_R1_sort.bam
-rw-r--r-- 1 root root 31M Jul  2 11:25 treated_R2_sort.bam

Overall I'm fairly satisfied with this; I'll continue to improve the pipeline.
