2018-04-26 snakemake

2018-04-26 aldlhy

Snakemake is a workflow management system written in Python. It uses Python-style code to define rules that describe how to create output files from input files.

Features

* Similar to GNU Make, you specify targets in terms of a pseudo-rule at the top.

* For each target and intermediate file, you create rules that define how they are created from input files.

* Snakemake determines the rule dependencies by matching file names.

* Input and output files can contain multiple named wildcards.

* Rules can either use shell commands, plain Python code or external Python or R scripts to create output files from input files.

* Snakemake workflows can be executed on **workstations**, **clusters**, **the grid**, and **in the cloud** without modification. Job scheduling can be constrained by arbitrary resources such as available CPU cores, memory, or GPUs.

* Snakemake can automatically deploy required software dependencies of a workflow using [Conda](https://conda.io/) or [Singularity](http://singularity.lbl.gov/).

* Snakemake can read and write input and output files on Amazon S3, Google Storage, Dropbox, FTP, WebDAV, SFTP, and iRODS, and can additionally read input files over HTTP and HTTPS.

Workflow

In Snakemake, workflows are specified as Snakefiles. Inspired by GNU Make, a Snakefile contains rules that denote how to create output files from input files. Dependencies between rules are handled implicitly by matching the file names of input files against output files, and wildcards make it possible to write general rules.
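The file-name matching described above can be illustrated in plain Python. This is only a minimal sketch of the idea, not Snakemake's actual implementation: the helper names (`pattern_to_regex`, `match_output`, `find_producer`) and the `RULES` dictionary are hypothetical, and each wildcard is assumed to match any non-empty string.

```python
import re

def pattern_to_regex(pattern):
    """Turn a pattern such as '{sample}.txt' into a regex with named groups."""
    # re.split with one capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex, seen = "", set()
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text
        elif part in seen:
            regex += f"(?P={part})"         # repeated wildcard must match the same value
        else:
            seen.add(part)
            regex += f"(?P<{part}>.+)"      # assumption: a wildcard matches any non-empty string
    return re.compile(regex + "$")

def match_output(pattern, filename):
    """Return the wildcard values if filename fits the pattern, else None."""
    m = pattern_to_regex(pattern).match(filename)
    return m.groupdict() if m else None

def find_producer(rules, target):
    """Return (rule_name, wildcards) for the first rule whose output matches target."""
    for name, rule in rules.items():
        wildcards = match_output(rule["output"], target)
        if wildcards is not None:
            return name, wildcards
    return None

# Hypothetical in-memory description of a single rule.
RULES = {
    "quantify_genes": {
        "input": ["genome.fa",
                  "fastq/{sample}.R1.fastq.gz",
                  "fastq/{sample}.R2.fastq.gz"],
        "output": "{sample}.txt",
    },
}

name, wildcards = find_producer(RULES, "Sample1.txt")
# Fill the wildcard values into the input patterns to get concrete input files.
inputs = [path.format(**wildcards) for path in RULES[name]["input"]]
print(name, wildcards, inputs)
```

Requesting `Sample1.txt` selects the rule whose output pattern fits that name, and the wildcard value recovered from the match determines which concrete input files the job needs.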

Components

input files --- the files a rule reads

output files --- the files a rule creates

rules --- describe how to create output files from input files

Rules

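rule all --- the pseudo-rule at the top of the Snakefile that lists the final target files

rule my_rule --- an ordinary rule (the name is arbitrary) that creates output files from input files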

Example

    SAMPLES = ['Sample1', 'Sample2']

    rule all:
        input:
            expand('{sample}.txt', sample=SAMPLES)

    rule quantify_genes:
        input:
            genome = 'genome.fa',
            r1 = 'fastq/{sample}.R1.fastq.gz',
            r2 = 'fastq/{sample}.R2.fastq.gz'
        output:
            '{sample}.txt'
        shell:
            'echo {input.genome} {input.r1} {input.r2} > {output}'
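To see what this example asks Snakemake to build, the following plain-Python sketch mimics what `expand` returns for `rule all` and prints the command each `quantify_genes` job would end up running. `expand_sketch` and `shell_command` are hypothetical helpers written for illustration only; they are not part of Snakemake.

```python
from itertools import product

def expand_sketch(pattern, **wildcards):
    """Fill the pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    # Treat a single string as a one-element list so it is not split into characters.
    values = [[v] if isinstance(v, str) else list(v) for v in wildcards.values()]
    return [pattern.format(**dict(zip(names, combo))) for combo in product(*values)]

def shell_command(sample):
    """Command of the quantify_genes rule with the wildcard filled in."""
    return (f"echo genome.fa fastq/{sample}.R1.fastq.gz "
            f"fastq/{sample}.R2.fastq.gz > {sample}.txt")

SAMPLES = ['Sample1', 'Sample2']

# Targets requested by rule all.
targets = expand_sketch('{sample}.txt', sample=SAMPLES)
print(targets)  # ['Sample1.txt', 'Sample2.txt']

# One quantify_genes job per sample.
for sample in SAMPLES:
    print(shell_command(sample))
```

Each target requested by `rule all` matches the output pattern of `quantify_genes`, so the workflow consists of one job per entry in `SAMPLES`.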

Reference

Snakemake — Snakemake 4.8.1+0.g7f3006d.dirty documentation

Snakemake—a scalable bioinformatics workflow engine