2018-04-26 snakemake
Snakemake is a workflow management system written in Python. It uses Python-style code to define rules that describe how to create output files from input files.
Features
* Similar to GNU Make, you specify targets in terms of a pseudo-rule at the top.
* For each target and intermediate file, you create rules that define how they are created from input files.
* Snakemake determines the rule dependencies by matching file names.
* Input and output files can contain multiple named wildcards.
* Rules can either use shell commands, plain Python code or external Python or R scripts to create output files from input files.
* Snakemake workflows can be executed on **workstations**, **clusters**, **the grid**, and **in the cloud** without modification. Job scheduling can be constrained by arbitrary resources such as available CPU cores, memory, or GPUs.
* Snakemake can automatically deploy required software dependencies of a workflow using [Conda](https://conda.io/) or [Singularity](http://singularity.lbl.gov/).
* Snakemake can read and write input and output files on Amazon S3, Google Storage, Dropbox, FTP, WebDAV, SFTP and iRODS, and can additionally read input files over HTTP and HTTPS.
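To make the wildcard idea from the list above concrete, here is a minimal plain-Python sketch of how a file name could be matched against a pattern containing named wildcards. It is an illustration only, not Snakemake's own code: the helper names `pattern_to_regex` and `match_wildcards` are hypothetical, and the sketch assumes each wildcard matches one or more arbitrary characters.

```python
import re


def pattern_to_regex(pattern):
    """Compile 'fastq/{sample}.R1.fastq.gz' into a regex with one named group per wildcard."""
    # Splitting on a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    seen = set()
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)       # literal text must match exactly
        elif part in seen:
            regex += f"(?P={part})"        # a repeated wildcard must match the same text
        else:
            seen.add(part)
            regex += f"(?P<{part}>.+)"     # assumption: a wildcard is one or more characters
    return re.compile(regex + "$")


def match_wildcards(pattern, filename):
    """Return the wildcard values as a dict, or None if the file name does not fit."""
    m = pattern_to_regex(pattern).match(filename)
    return m.groupdict() if m else None


print(match_wildcards("fastq/{sample}.R1.fastq.gz", "fastq/Sample1.R1.fastq.gz"))
# → {'sample': 'Sample1'}
```

A file name that does not fit the pattern, such as `Sample1.bam` against `{sample}.txt`, yields `None`, which is how a rule would be ruled out as a producer of that file in this sketch.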
Workflow
In Snakemake, workflows are specified as Snakefiles. Inspired by GNU Make, a Snakefile contains rules that denote how to create output files from input files. Dependencies between rules are handled implicitly, by matching the file names of input files against output files. Wildcards can be used in these file names to write general rules.
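The implicit dependency handling described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not how Snakemake is implemented: the rule table `RULES`, its rule names and file patterns, and the helpers `to_regex` and `plan` are all hypothetical, and each wildcard is assumed to appear only once per output pattern.

```python
import re

# Hypothetical rule table: rule name -> (output pattern, list of input patterns).
RULES = {
    "sort": ("sorted/{sample}.txt", ["raw/{sample}.txt"]),
    "count": ("counts/{sample}.tsv", ["sorted/{sample}.txt"]),
}


def to_regex(pattern):
    """Turn an output pattern into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = [
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    ]
    return re.compile("".join(pieces) + "$")


def plan(target, existing, rules=RULES):
    """Return the (rule, output) steps, in execution order, needed to build target."""
    if target in existing:
        return []  # the file is already there, nothing to run
    for name, (output, inputs) in rules.items():
        m = to_regex(output).match(target)
        if m:
            steps = []
            for inp in inputs:
                # Fill the wildcard values into each input, then resolve it recursively.
                steps += plan(inp.format(**m.groupdict()), existing, rules)
            return steps + [(name, target)]
    raise ValueError(f"no rule produces {target!r}")


print(plan("counts/A.tsv", existing={"raw/A.txt"}))
# → [('sort', 'sorted/A.txt'), ('count', 'counts/A.tsv')]
```

Nothing in the rule table says that `count` depends on `sort`; the link is found only because the input pattern of one rule matches the output pattern of the other.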
Components
input files --- the files a rule reads
output files --- the files a rule creates
rules --- describe how to create output files from input files
Rules
rule all
rule my_rule
Example
SAMPLES = ['Sample1', 'Sample2']

rule all:
    input:
        expand('{sample}.txt', sample=SAMPLES)

rule quantify_genes:
    input:
        genome = 'genome.fa',
        r1 = 'fastq/{sample}.R1.fastq.gz',
        r2 = 'fastq/{sample}.R2.fastq.gz'
    output:
        '{sample}.txt'
    shell:
        'echo {input.genome} {input.r1} {input.r2} > {output}'
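The `expand` call in `rule all` turns the pattern into a concrete list of target files. A rough plain-Python equivalent, assuming the default behaviour of combining all wildcard values as a Cartesian product, might look like the sketch below. The function name `expand_sketch` is hypothetical and stands in for Snakemake's `expand`; it is not the library's code.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill a pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


SAMPLES = ['Sample1', 'Sample2']

print(expand_sketch('{sample}.txt', sample=SAMPLES))
# → ['Sample1.txt', 'Sample2.txt']
```

With this list as its input, `rule all` asks for `Sample1.txt` and `Sample2.txt`, and each of them matches the `'{sample}.txt'` output of `quantify_genes`, so that rule runs once per sample.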