Notes on commonly used Snakemake features
2021-08-29
无话_
Using wildcards
rule complex_conversion:
    input:
        "{dataset}/inputfile"
    output:
        "{dataset}/file.{group}.txt"
    shell:
        "somecommand --group {wildcards.group} < {input} > {output}"
# Inside shell commands, refer to a wildcard as {wildcards.xxx}
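Advanced usage
output: "{dataset,\d+}.{group}.txt"
# Constrain a wildcard with a regular expression (here, dataset must consist of digits only)
wildcard_constraints:
    dataset="\d+"
rule a:
    ...
rule b:
    ...
# Global constraint: a top-level wildcard_constraints applies to every rule in the Snakefile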
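To see why a constraint matters, the following is a rough sketch in plain Python of how a pattern such as "{dataset,\d+}.{group}.txt" could be turned into a regular expression. The helper pattern_to_regex is hypothetical and only approximates Snakemake's matching; it assumes that an unconstrained wildcard behaves like the regex ".+".

```python
import re

# Hypothetical helper (not part of Snakemake): translate a Snakemake-style
# output pattern into a compiled regex with one named group per wildcard.
def pattern_to_regex(pattern):
    parts = []
    pos = 0
    for m in re.finditer(r"\{(\w+)(?:,([^{}]+))?\}", pattern):
        # Literal text between wildcards is matched verbatim.
        parts.append(re.escape(pattern[pos:m.start()]))
        name = m.group(1)
        # Assumption: a wildcard without a constraint matches ".+".
        constraint = m.group(2) or ".+"
        parts.append(f"(?P<{name}>{constraint})")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts) + "$")

free = pattern_to_regex("{dataset}.{group}.txt")
strict = pattern_to_regex(r"{dataset,\d+}.{group}.txt")

# The same file name is split differently with and without the constraint.
print(free.match("101.a.b.txt").groupdict())
print(strict.match("101.a.b.txt").groupdict())
```

Without the constraint the greedy match assigns "101.a" to dataset; with it, dataset can only be "101", which removes the ambiguity.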
expand (custom lists)
rule aggregate:
    input:
        expand("{dataset}/a.{ext}", dataset=DATASETS, ext=FORMATS)
    output:
        "aggregated.txt"
    shell:
        ...
Advanced usage
expand("{{dataset}}/a.{ext}", ext=FORMATS)
# Doubled braces escape {dataset}, so it remains a wildcard after expansion
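The effect of expand can be imitated in plain Python, which makes the result easier to inspect. This is only a sketch: expand_sketch is a hypothetical stand-in built on itertools.product and str.format, not Snakemake's own implementation, and the values of DATASETS and FORMATS are made up.

```python
from itertools import product

# Hypothetical stand-in for Snakemake's expand(): fill the pattern with
# every combination of the given wildcard values.
def expand_sketch(pattern, **wildcards):
    keys = list(wildcards)
    combos = product(*(wildcards[k] for k in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]

# Example values, assumed for illustration only.
DATASETS = ["ds1", "ds2"]
FORMATS = ["txt", "csv"]

all_files = expand_sketch("{dataset}/a.{ext}", dataset=DATASETS, ext=FORMATS)
# Doubled braces survive formatting as single braces, so {dataset}
# is still a wildcard in the result.
partial = expand_sketch("{{dataset}}/a.{ext}", ext=FORMATS)

print(all_files)
print(partial)
```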
# multiext: a simplified expand
rule plot:
    input:
        ...
    output:
        multiext("some/plot", ".pdf", ".svg", ".png")
    shell:
        ...
# Same as expand("some/plot{ext}", ext=[".pdf", ".svg", ".png"])
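Threads and Resources
# attempt and --restart-times
# Using the attempt argument of a resource function together with the --restart-times option lets a memory-hungry job be resubmitted automatically with more memory on each retry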
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources
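A minimal sketch of such a resource function follows. The name get_mem_mb and the 1000 MB step are assumptions chosen for illustration; the function signature (wildcards, attempt) follows the pattern described in the linked documentation.

```python
# Hypothetical resource function: the memory request grows linearly
# with the attempt number (1 on the first try, 2 on the first retry, ...).
def get_mem_mb(wildcards, attempt):
    return attempt * 1000

# In a Snakefile this would be wired up roughly as:
#     resources:
#         mem_mb=get_mem_mb
# and the workflow started with, for example: snakemake --restart-times 3

# What the scheduler would request on three successive attempts.
requests = [get_mem_mb(None, attempt) for attempt in (1, 2, 3)]
print(requests)
```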
Messages
rule NAME:
    input: "path/to/inputfile", "path/to/other/inputfile"
    output: "path/to/outputfile", "path/to/another/outputfile"
    threads: 8
    message: "Executing somecommand with {threads} threads on the following files {input}."
    shell: "somecommand --threads {threads} {input} {output}"
Priorities
# A larger number means a higher priority; I have rarely found this useful
rule:
    input: ...
    output: ...
    priority: 50
    shell: ...
Log-Files
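rule abc:
    input: "input.txt"
    output: "output.txt"
    log: "logs/abc.log"
    shell: "somecommand --log {log} {input} {output}"
# This creates a log file, but only if the command itself supports writing one
# If the command has no --log option, try redirecting its standard output to the log file, e.g. "somecommand {input} {output} > {log} 2>&1" (I have not tested this)
# Use wildcards to write one log per job (the log path must contain the same wildcards as the output)
log: "logs/abc.{dataset}.log"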
parameters
rule:
    input:
        ...
    params:
        prefix="somedir/{sample}"
    output:
        "somedir/{sample}.csv"
    shell:
        "somecommand -o {params.prefix}"
# Some tools do not take the output file itself, only part of its full name, or even just a directory
# params is also the place to set other command-line options
rule:
    input:
        ...
    params:
        prefix=lambda wildcards, output: output[0][:-4]
    output:
        "somedir/{sample}.csv"
    shell:
        "somecommand -o {params.prefix}"
# A function can be used as an input or as a parameter
# wildcards has to be the first argument
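The lambda above can be tried outside Snakemake to see what it produces. The sketch below uses a plain list in place of Snakemake's output object and a made-up file name; prefix_by_splitext is an alternative I am suggesting, not something from the original rule. Slicing off four characters only works for extensions such as ".csv", whereas os.path.splitext removes the final extension whatever its length.

```python
import os

# Same logic as the lambda in the rule: drop the last four characters.
def prefix_by_slice(wildcards, output):
    return output[0][:-4]

# Suggested alternative: strip the final extension regardless of its length.
def prefix_by_splitext(wildcards, output):
    return os.path.splitext(output[0])[0]

# A plain list stands in for Snakemake's output object (assumed file names).
csv_output = ["somedir/sampleA.csv"]
fastq_output = ["somedir/sampleA.fastq"]

print(prefix_by_slice(None, csv_output))
print(prefix_by_splitext(None, csv_output))
print(prefix_by_slice(None, fastq_output))
print(prefix_by_splitext(None, fastq_output))
```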
python
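# Call an external Python script, which reads its parameters from the snakemake object. The script path is resolved relative to the Snakefile, so keeping the .py file next to the Snakefile is the simplest layout (I have not tested other layouts)
# This is not the same as running Python code inline with run
# Example external Python script
def do_something(data_path, out_path, threads, myparam):
    ...  # function body omitted in these notes
do_something(snakemake.input[0], snakemake.output[0], snakemake.threads, snakemake.config["myparam"])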
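When a rule runs such a script through the script directive, Snakemake injects the snakemake object. To try the script logic without running a workflow, one option is a stand-in object, as in the sketch below. The SimpleNamespace stand-in, the placeholder function body, and all paths and values are assumptions for illustration only.

```python
from types import SimpleNamespace

def do_something(data_path, out_path, threads, myparam):
    # Placeholder body: only report what would be processed.
    return f"{data_path} -> {out_path} (threads={threads}, myparam={myparam})"

# Stand-in for the object Snakemake would inject; every value is made up.
snakemake = SimpleNamespace(
    input=["data/in.txt"],
    output=["results/out.txt"],
    threads=4,
    config={"myparam": 0.5},
)

summary = do_something(
    snakemake.input[0],
    snakemake.output[0],
    snakemake.threads,
    snakemake.config["myparam"],
)
print(summary)
```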
R
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#r-and-r-markdown
temp/protected
# The file is deleted once every job that uses it has finished
rule NAME:
    input:
        "path/to/inputfile"
    output:
        temp("path/to/outputfile")
    shell:
        "somecommand {input} {output}"
# The file is write-protected once it has been created
rule NAME:
    input:
        "path/to/inputfile"
    output:
        protected("path/to/outputfile")
    shell:
        "somecommand {input} {output}"
directory
# A directory as output; avoid this whenever you can
rule NAME:
    input:
        "path/to/inputfile"
    output:
        directory("path/to/outputdir")
    shell:
        "somecommand {input} {output}"
flag file
rule all:
    input: "mytask.done"
rule mytask:
    output: touch("mytask.done")
    shell: "mycommand ..."
Job Properties
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#job-properties
Functions as Input Files
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#functions-as-input-files
config
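The config file uses the YAML format
https://www.runoob.com/w3cnote/yaml-intro.html
# snakemake --config key=value overrides values that are already set in the config file
# Do not read arguments through sys.argv in the Snakefile: it is split on whitespace and also contains snakemake's own options (such as -s, -np, --config)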
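The override behaviour can be pictured with a small sketch. Here apply_overrides is a hypothetical helper, not Snakemake code, and the keys and values are made up. For simplicity it keeps every value as a string and handles only flat keys, whereas Snakemake itself also interprets value types and nested entries.

```python
# Hypothetical helper: merge key=value pairs (as given after --config)
# over the values loaded from the config file.
def apply_overrides(file_config, cli_pairs):
    merged = dict(file_config)
    for pair in cli_pairs:
        key, _, value = pair.partition("=")
        merged[key] = value  # command-line value wins
    return merged

# Values as they might come from a config.yaml (assumed for illustration).
file_config = {"myparam": "10", "samples": "samples.tsv"}

# Equivalent of: snakemake --config myparam=20
config = apply_overrides(file_config, ["myparam=20"])
print(config)
```

Inside the Snakefile, reading config["myparam"] then gives the overridden value without touching sys.argv.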
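Other
# A rule that produces a new file while also modifying its input file in place
# causes problems when snakemake resumes an interrupted run
# Do not fall into the trap of defining variables in rule all and relying on them in a second run
# http://www.xknote.com/ask/60f336aaf2eb8.html
# Read the input file size to set parameters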
https://stackoverflow.com/questions/50891407/snakemake-how-to-dynamically-set-memory-resource-based-on-input-file-size
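As a rough sketch of the idea in that question, memory can be derived from the total size of the input files and combined with the retry counter. The helper mem_mb_from_inputs, its factor and floor_mb defaults, and the temporary 3 MB file are all assumptions for illustration; in a Snakefile the same logic would sit in a resource function that receives input and attempt.

```python
import os
import tempfile

# Hypothetical helper: request `factor` times the total input size in MB,
# never less than `floor_mb`, and scale the result by the attempt number.
def mem_mb_from_inputs(paths, attempt=1, factor=2.0, floor_mb=100):
    total_mb = sum(os.path.getsize(p) for p in paths) / 1_000_000
    return max(floor_mb, int(total_mb * factor)) * attempt

# Create a 3 MB temporary file to stand in for a real input file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"\0" * 3_000_000)
    demo_path = handle.name

first_try = mem_mb_from_inputs([demo_path], attempt=1, floor_mb=1)
second_try = mem_mb_from_inputs([demo_path], attempt=2, floor_mb=1)
with_floor = mem_mb_from_inputs([demo_path])
os.remove(demo_path)

print(first_try, second_try, with_floor)
```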