Building a bioinformatics analysis pipeline with Snakemake
2019-05-12
dming1024
References:
https://blog.csdn.net/u012110870/article/details/85330457
https://www.jianshu.com/p/14b9eccc0c0e
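To make full use of my cloud server, I decided to try building my own bioinformatics analysis pipeline. The first step is to get familiar with how a workflow is put together in Snakemake, using the tutorial data.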
wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
tar -xf v3.11.0.tar.bz2 --strip 1
conda env create --name snakemake-tutorial --file environment.yaml
source activate snakemake-tutorial
# leave the current environment
source deactivate
Take a look at the example data:
ls -lh
total 652K
-rw-rw-r-- 1 root root 229K Mar 8 2017 genome.fa
-rw-rw-r-- 1 root root 2.6K Mar 8 2017 genome.fa.amb
-rw-rw-r-- 1 root root 83 Mar 8 2017 genome.fa.ann
-rw-rw-r-- 1 root root 225K Mar 8 2017 genome.fa.bwt
-rw-rw-r-- 1 root root 18 Mar 8 2017 genome.fa.fai
-rw-rw-r-- 1 root root 57K Mar 8 2017 genome.fa.pac
-rw-rw-r-- 1 root root 113K Mar 8 2017 genome.fa.sa
drwxrwxr-x 2 root root 4.0K Mar 8 2017 samples
1. Mapping reads with bwa
vim Snakefile
# add the following content
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        """
        bwa mem {input} | samtools view -Sb - > {output}
        """
Try a dry run (-n does not execute anything, -p prints the shell commands):
snakemake -np mapped_reads/{A,B,C}.bam
rule bwa_map:
input: data/genome.fa, data/samples/C.fastq
output: mapped_reads/C.bam
log: logs/bwa_mem/C.log
jobid: 0
wildcards: sample=C
(bwa mem -R '@RG ID:C SM:C' data/genome.fa data/samples/C.fastq|samtools view -Sb - > mapped_reads/C.bam) 2> logs/bwa_mem/C.log
rule bwa_map:
input: data/genome.fa, data/samples/A.fastq
output: mapped_reads/A.bam
log: logs/bwa_mem/A.log
jobid: 1
wildcards: sample=A
(bwa mem -R '@RG ID:A SM:A' data/genome.fa data/samples/A.fastq|samtools view -Sb - > mapped_reads/A.bam) 2> logs/bwa_mem/A.log
rule bwa_map:
input: data/genome.fa, data/samples/B.fastq
output: mapped_reads/B.bam
log: logs/bwa_mem/B.log
jobid: 2
wildcards: sample=B
(bwa mem -R '@RG ID:B SM:B' data/genome.fa data/samples/B.fastq|samtools view -Sb - > mapped_reads/B.bam) 2> logs/bwa_mem/B.log
Job counts:
count jobs
3 bwa_map
3
No problems! On to the next step.
2. Sorting the alignments
vim Snakefile
rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        # the wildcard value is available in the command as {wildcards.sample}
        "samtools sort -T sorted_reads/{wildcards.sample}"
        " -O bam {input} > {output}"
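The output of the previous mapping step is the input of this step, and the sorted file is written to a separate directory. wildcards.sample gives the value that the {sample} wildcard matched.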
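The shell command in samtools_sort is written as two adjacent string literals. Python joins them with nothing in between, so the separating space has to sit inside one of the pieces. A minimal plain-Python sketch of this, using str.format and a stand-in object instead of Snakemake itself (the sample name A and the file paths are only examples):

```python
from types import SimpleNamespace

# Adjacent string literals are concatenated with no separator,
# so the second piece starts with a space to keep "-O" apart from the prefix.
command_template = (
    "samtools sort -T sorted_reads/{wildcards.sample}"
    " -O bam {input} > {output}"
)

# Stand-in for the values Snakemake would fill in for sample "A" (illustrative only).
wildcards = SimpleNamespace(sample="A")
command = command_template.format(
    wildcards=wildcards,
    input="mapped_reads/A.bam",
    output="sorted_reads/A.bam",
)
print(command)
```

If the leading space were dropped, the prefix and "-O" would run together and samtools would receive a broken argument.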
3. Indexing the sorted alignments
vim Snakefile
rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"
The workflow can be visualized by writing its DAG to dag.svg:
snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg
4. Calling genomic variants
vim Snakefile
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bamA="sorted_reads/A.bam",
        bamB="sorted_reads/B.bam",
        baiA="sorted_reads/A.bam.bai",
        baiB="sorted_reads/B.bam.bai"
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bamA} {input.bamB} | "
        "bcftools call -mv - > {output}"
Writing out every sample path like this is tedious. The input section can be simplified with expand:
SAMPLES = ["A", "B"]

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"
5. (Optional) Writing a report with Python
vim Snakefile
rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])
6. (Optional) Adding a target rule (I don't fully understand this yet, so I'm just pasting it for now)
rule all:
    input:
        "report.html"
The final, optimized workflow looks like this:
configfile: "config.yaml"

rule all:
    input:
        "report.html"

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")  # temporary file, deleted automatically once the jobs that need it are done
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"  # read group passed to bwa mem -R (raw string keeps \t literal)
    log:
        "logs/bwa_mem/{sample}.log"
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])
config.yaml maps each sample name to the path of its FASTQ file:
cat config.yaml
samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq
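The input function in rule bwa_map, lambda wildcards: config["samples"][wildcards.sample], is just a dictionary lookup in the loaded config. A small sketch of that lookup outside Snakemake, with the config written out by hand as a dict and a stand-in wildcards object (fastq_for is a name made up for this example):

```python
from types import SimpleNamespace

# The content of config.yaml, written out by hand as a Python dict.
config = {
    "samples": {
        "A": "data/samples/A.fastq",
        "B": "data/samples/B.fastq",
    }
}

def fastq_for(wildcards):
    """Same lookup as the lambda in rule bwa_map."""
    return config["samples"][wildcards.sample]

# Stand-in for the wildcards object of the job that builds mapped_reads/B.bam.
wildcards = SimpleNamespace(sample="B")
print(fastq_for(wildcards))

# Iterating over the dict yields the sample names, which is what
# expand(..., sample=config["samples"]) relies on in rule bcftools_call.
print(list(config["samples"]))
```

A sample name that is missing from config.yaml would raise a KeyError in this lookup, so every requested sample has to be listed there.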
Then just run snakemake (the same as snakemake -s Snakefile, since Snakefile is the default file name):
snakemake
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 bcftools_call
3 bwa_map
1 report
3 samtools_index
3 samtools_sort
12
rule bwa_map:
input: data/genome.fa, data/samples/B.fastq
output: mapped_reads/B.bam
log: logs/bwa_mem/B.log
jobid: 11
wildcards: sample=B
Finished job 11.
1 of 12 steps (8%) done
rule samtools_sort:
input: mapped_reads/B.bam
output: sorted_reads/B.bam
jobid: 7
wildcards: sample=B
....# too much output to paste in full
Finished job 1.
11 of 12 steps (92%) done
localrule all:
input: report.html
jobid: 0
Finished job 0.
12 of 12 steps (100%) done
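When the run finishes there is a calls directory containing the .vcf file (calls/all.vcf), which holds the SNP calling results.
The DAG for part of the workflow can be drawn like this: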
snakemake --dag -s snakefile1 sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg
And the DAG of the whole workflow:
snakemake --dag| dot -Tsvg > dag.svg
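Key points:
1. wildcards. Gives access to the parts of a file name matched by the wildcards. For example, if the pattern "{dataset}/file.{group}.txt" matches the file 101/file.A.txt, then {wildcards.dataset} is 101 and {wildcards.group} is A.
2. expand. expand("sorted_reads/{sample}.bam", sample=SAMPLES) substitutes each value in SAMPLES into {sample} in turn and returns the resulting list of paths.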
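The wildcard example in point 1 can be reproduced with a regular expression. This is only a sketch of the idea, assuming each wildcard matches any non-empty text and appears once in the pattern. match_wildcards is a name invented here, not a Snakemake function.

```python
import re

def match_wildcards(pattern, path):
    """Return {wildcard: value} if path fits the pattern, else None."""
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text before the wildcard, then a named group for the wildcard.
        regex += re.escape(pattern[pos:m.start()])
        regex += "(?P<{}>.+)".format(m.group(1))
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, path)
    return found.groupdict() if found else None

print(match_wildcards("{dataset}/file.{group}.txt", "101/file.A.txt"))
print(match_wildcards("mapped_reads/{sample}.bam", "mapped_reads/C.bam"))
```

The first call gives dataset = 101 and group = A, matching the example in point 1.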