Building a bioinformatics analysis pipeline with Snakemake

2019-05-12 dming1024

Reference articles:
https://blog.csdn.net/u012110870/article/details/85330457
https://www.jianshu.com/p/14b9eccc0c0e

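To make full use of my cloud server, I decided to try building my own bioinformatics analysis pipeline. The first step is to get familiar with the Snakemake workflow by following its tutorial.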

wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
tar -xf v3.11.0.tar.bz2 --strip 1
conda env create --name snakemake-tutorial --file environment.yaml
source activate snakemake-tutorial
# Leave the current environment (when you are done)
source deactivate

Take a look at the example data (listed from inside the data/ directory):

ls -lh
total 652K
-rw-rw-r-- 1 root root 229K Mar  8  2017 genome.fa
-rw-rw-r-- 1 root root 2.6K Mar  8  2017 genome.fa.amb
-rw-rw-r-- 1 root root   83 Mar  8  2017 genome.fa.ann
-rw-rw-r-- 1 root root 225K Mar  8  2017 genome.fa.bwt
-rw-rw-r-- 1 root root   18 Mar  8  2017 genome.fa.fai
-rw-rw-r-- 1 root root  57K Mar  8  2017 genome.fa.pac
-rw-rw-r-- 1 root root 113K Mar  8  2017 genome.fa.sa
drwxrwxr-x 2 root root 4.0K Mar  8  2017 samples

1. Mapping reads with bwa

vim Snakefile
# Add the following content
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        """
        bwa mem {input} | samtools view -Sb - > {output}
        """

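Try a dry run (the output below was captured with the final version of the rule shown later, which adds the log file and the -R read group):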

 snakemake -np mapped_reads/{A,B,C}.bam

rule bwa_map:
    input: data/genome.fa, data/samples/C.fastq
    output: mapped_reads/C.bam
    log: logs/bwa_mem/C.log
    jobid: 0
    wildcards: sample=C


        (bwa mem -R '@RG    ID:C    SM:C' data/genome.fa data/samples/C.fastq|samtools view -Sb - > mapped_reads/C.bam) 2> logs/bwa_mem/C.log
        

rule bwa_map:
    input: data/genome.fa, data/samples/A.fastq
    output: mapped_reads/A.bam
    log: logs/bwa_mem/A.log
    jobid: 1
    wildcards: sample=A


        (bwa mem -R '@RG    ID:A    SM:A' data/genome.fa data/samples/A.fastq|samtools view -Sb - > mapped_reads/A.bam) 2> logs/bwa_mem/A.log
        

rule bwa_map:
    input: data/genome.fa, data/samples/B.fastq
    output: mapped_reads/B.bam
    log: logs/bwa_mem/B.log
    jobid: 2
    wildcards: sample=B


        (bwa mem -R '@RG    ID:B    SM:B' data/genome.fa data/samples/B.fastq|samtools view -Sb - > mapped_reads/B.bam) 2> logs/bwa_mem/B.log
        
Job counts:
    count   jobs
    3   bwa_map
    3

No problems! On to the next step of the pipeline.

2. Sorting the alignments

vim Snakefile
rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "  # inside a shell command, a wildcard is accessed as {wildcards.sample}
        "-O bam {input} > {output}"

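The output file of the previous mapping step serves as the input file of this step; after sorting, the result is written to a different directory. Here wildcards.sample is used to obtain the value matched by the wildcard.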

3. Indexing the alignments

vim Snakefile
rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

The workflow can be visualized and saved as dag.svg:

snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg

4. Calling genomic variants

vim Snakefile
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bamA="sorted_reads/A.bam",
        bamB="sorted_reads/B.bam",
        baiA="sorted_reads/A.bam.bai",
        baiB="sorted_reads/B.bam.bai"
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bamA} {input.bamB} | "
        "bcftools call -mv - > {output}"

Writing out every sample path like this is tedious; the input can be simplified further with expand:

SAMPLES=["A","B"]
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"
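
To see what expand returns, here is a minimal plain-Python sketch that mimics it for a single wildcard. The helper name expand_one is hypothetical and is not part of Snakemake; the real expand also handles several wildcards at once by combining their values.

```python
# Hypothetical stand-in for Snakemake's expand(), limited to one wildcard.
def expand_one(pattern, **wildcards):
    # Exactly one keyword argument is expected, e.g. sample=["A", "B"].
    (name, values), = wildcards.items()
    # Fill the {name} placeholder once per value, keeping the order.
    return [pattern.format(**{name: value}) for value in values]


SAMPLES = ["A", "B"]
bams = expand_one("sorted_reads/{sample}.bam", sample=SAMPLES)
print(bams)  # ['sorted_reads/A.bam', 'sorted_reads/B.bam']
```

When such a list is referenced in a shell command as {input.bam}, Snakemake joins its elements with spaces, so both BAM files end up on the samtools command line.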

5. (Optional) Writing a report with Python

vim Snakefile
rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])
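
The counting logic inside the run block can be tried on its own. The sketch below applies the same expression to a tiny made-up VCF fragment; the two records are invented purely for illustration and are not results from this pipeline, and the function name count_calls is an assumption of this example.

```python
import io

# Made-up VCF fragment for illustration only (not real results).
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "I\t100\t.\tA\tG\t50\t.\t.\n"
    "I\t250\t.\tT\tC\t60\t.\t.\n"
)


def count_calls(handle):
    """Count variant records: every line that is not a '#' header line."""
    return sum(1 for line in handle if not line.startswith("#"))


n_calls = count_calls(io.StringIO(EXAMPLE_VCF))
print(n_calls)  # 2
```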

6. (Optional) Adding a target rule (I don't fully understand this yet, so I'm pasting it here for now)

rule all:
    input:
        "report.html"
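
Some background on why this rule helps: when no target is given on the command line, Snakemake uses the first rule in the Snakefile as the target, and then works backwards from the requested files to find the jobs that produce them. The following plain-Python sketch imitates that backward resolution for samples A and B. It is a simplified model and not Snakemake's actual implementation; the RULES table and the plan function are hypothetical names made up for this example.

```python
SAMPLES = ["A", "B"]

# Simplified model: each concrete output file maps to the rule that creates it
# and the files that rule needs. Wildcards are already filled in.
RULES = {
    "report.html": ("report", ["calls/all.vcf"]),
    "calls/all.vcf": (
        "bcftools_call",
        ["data/genome.fa"]
        + [f"sorted_reads/{s}.bam" for s in SAMPLES]
        + [f"sorted_reads/{s}.bam.bai" for s in SAMPLES],
    ),
}
for s in SAMPLES:
    RULES[f"mapped_reads/{s}.bam"] = ("bwa_map", ["data/genome.fa", f"data/samples/{s}.fastq"])
    RULES[f"sorted_reads/{s}.bam"] = ("samtools_sort", [f"mapped_reads/{s}.bam"])
    RULES[f"sorted_reads/{s}.bam.bai"] = ("samtools_index", [f"sorted_reads/{s}.bam"])


def plan(target, done=None):
    """Return (rule, output) jobs in an order that satisfies all dependencies."""
    if done is None:
        done = []
    if target not in RULES:  # a source file such as data/genome.fa
        return done
    rule, inputs = RULES[target]
    for dep in inputs:  # schedule everything this file depends on first
        plan(dep, done)
    if (rule, target) not in done:
        done.append((rule, target))
    return done


jobs = plan("report.html")
for rule, output in jobs:
    print(rule, "->", output)
```

Requesting report.html alone is enough to schedule mapping, sorting, indexing and variant calling, which is why a rule all that lists only the final outputs can drive the whole pipeline.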

The final, optimized pipeline looks like this:

configfile: "config.yaml"


rule all:
    input:
        "report.html"


rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")  # temporary file, deleted automatically once the jobs that use it have finished
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"  # read group passed to bwa mem -R; the raw string keeps \t escaped so that bwa expands it itself
    log:
        "logs/bwa_mem/{sample}.log"
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"


rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"


rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"


rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"


rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])

config.yaml is a file that maps each sample name to the path of its FASTQ file:

cat config.yaml
samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq
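
The lambda in rule bwa_map looks up the FASTQ path of the current sample in this config. Below is a plain-Python sketch of that lookup. The dict mirrors config.yaml as it would look once loaded, SimpleNamespace is only a stand-in for Snakemake's wildcards object, and get_fastq is a name chosen for this example.

```python
from types import SimpleNamespace

# Same content as config.yaml, written as the dict it is loaded into.
config = {
    "samples": {
        "A": "data/samples/A.fastq",
        "B": "data/samples/B.fastq",
    }
}


def get_fastq(wildcards):
    """Input function: return the FASTQ path for the sample being processed."""
    return config["samples"][wildcards.sample]


wildcards = SimpleNamespace(sample="A")  # stand-in for the real wildcards object
print(get_fastq(wildcards))  # data/samples/A.fastq

# Iterating over a dict yields its keys, i.e. the sample names.
sample_names = list(config["samples"])
print(sample_names)  # ['A', 'B']
```

The last two lines hint at why expand(..., sample=config["samples"]) works in bcftools_call: iterating over the samples mapping yields the sample names.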

Then simply run snakemake -s Snakefile (the -s option can be omitted when the file is named Snakefile):

snakemake
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   all
    1   bcftools_call
    3   bwa_map
    1   report
    3   samtools_index
    3   samtools_sort
    12

rule bwa_map:
    input: data/genome.fa, data/samples/B.fastq
    output: mapped_reads/B.bam
    log: logs/bwa_mem/B.log
    jobid: 11
    wildcards: sample=B

Finished job 11.
1 of 12 steps (8%) done

rule samtools_sort:
    input: mapped_reads/B.bam
    output: sorted_reads/B.bam
    jobid: 7
    wildcards: sample=B
....# the output is too long to paste in full
Finished job 1.
11 of 12 steps (92%) done

localrule all:
    input: report.html
    jobid: 0

Finished job 0.
12 of 12 steps (100%) done

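Once the run finishes, there will be a calls directory containing a .vcf file, which holds the SNP calling results.
The workflow graph for part of the pipeline can be viewed with: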

snakemake --dag -s snakefile1 sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg

The graph of the entire pipeline:

 snakemake --dag| dot -Tsvg > dag.svg

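Key points:
1. wildcards. Used to access the part of a path matched by a wildcard. For example, if the pattern "{dataset}/file.{group}.txt" matches the file 101/file.A.txt, then {wildcards.dataset} is 101 and {wildcards.group} is A.
2. expand. expand("sorted_reads/{sample}.bam", sample=SAMPLES) substitutes each value in SAMPLES into the {sample} placeholder in turn and returns the resulting list of paths.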
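
The wildcard example from point 1 can be reproduced with a regular expression. This is a minimal sketch, not Snakemake's own code: match_wildcards is a hypothetical helper, it assumes each wildcard name appears only once in the pattern, and it lets every wildcard match ".+", which is also Snakemake's default.

```python
import re


def match_wildcards(pattern, path):
    """Return a dict of wildcard values if path fits pattern, else None."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Odd positions hold wildcard names, even positions hold literal text.
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    found = re.fullmatch(regex, path)
    return found.groupdict() if found else None


print(match_wildcards("{dataset}/file.{group}.txt", "101/file.A.txt"))
# {'dataset': '101', 'group': 'A'}
```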
