Writing Data Processing Pipelines in WDL

2021-05-06 Peter_iris

The Workflow Description Language (WDL) is a way to specify data processing workflows with a human-readable and -writeable syntax. WDL makes it straightforward to define analysis tasks, chain them together in workflows, and parallelize their execution. The language makes common patterns simple to express, while also admitting uncommon or complicated behavior; and strives to achieve portability not only across execution platforms, but also different types of users. Whether one is an analyst, a programmer, an operator of a production system, or any other sort of user, WDL should be accessible and understandable.
Official WDL documentation: Getting Started With WDL

Most GATK4 pipelines today are written in WDL. WDL has little complicated logic or syntax, so it is easy to get started. First, a hello world example:

workflow myWorkflow {
   call myTask
}
task myTask {
   command {
       echo "hello world"
   }
   output {
       String out = read_string(stdout())
   }
}

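A WDL script has the following six core components:

workflow: the workflow definition
task: the definition of a task contained in the workflow
call: invokes, or triggers execution of, a task in the workflow
command: the command line the task runs on the compute node
output: the output definition of a task or workflow
runtime: the task's runtime parameters on the compute node, including CPU, memory, Docker image, etc.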
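
The hello world example uses every component in the list except runtime. As a minimal sketch, assuming the same draft-2 syntax as the rest of this article, a runtime block could be added as follows. The task name sayHello, the image tag, and the resource values are illustrative assumptions, and which runtime keys are honored depends on the execution engine and backend.

```wdl
# Hypothetical variant of the hello world example with a runtime block.
workflow runtimeDemo {
   call sayHello
}
task sayHello {
   command {
       echo "hello world"
   }
   output {
       # Capture whatever the command printed to stdout.
       String out = read_string(stdout())
   }
   runtime {
       # Assumed values; adjust to what the backend supports.
       docker: "ubuntu:20.04"
       cpu: 1
       memory: "1 GB"
   }
}
```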

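Each script contains one workflow, and a workflow is made up of one or more tasks. Inside the workflow, call invokes the corresponding task. Each task is defined separately, outside the workflow block.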

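A task represents a unit of work: it reads input files, runs a command, and produces output. command holds the command to execute, for example a specific gatk command, and output declares the task's output values. A task can be thought of as a function in a programming language: a function takes arguments, runs code, and returns; command corresponds to the code that runs, and output to the return value.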

Parameters can also be passed in WDL. The way they are written differs between a task and a workflow.

Using variables

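To turn a processing step into a task, we have to wrap the tool's command line. So how are command-line arguments passed in, and how is the output file name specified? WDL handles both with variables. In the Hello world example, String out is a String-typed variable; it is declared in the output block and holds the text the command wrote to stdout. Variables can be declared in a workflow or in a task. Inside command, and inside strings in output, a variable is referenced as ${name}.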
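
To make this concrete, here is a minimal sketch in the same draft-2 style, in which one String variable is used both as a command-line argument and to name the output file. The names greet, writeGreeting, and greeting are hypothetical and chosen only for illustration.

```wdl
workflow greet {
   # Declared at workflow level; its value is supplied when the workflow is run.
   String name
   call writeGreeting { input: name=name }
}
task writeGreeting {
   String name
   command {
       # ${name} is substituted before the command runs.
       echo "hello ${name}" > ${name}.txt
   }
   output {
       # The same variable names the file that the task returns.
       File greeting = "${name}.txt"
   }
}
```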

The main variable types are:

String
Int
Float
File
Boolean
Array[T]
Map[K, V]
Pair[X, Y]
Object
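
As a rough sketch of how most of these types look in declarations, assuming draft-2 literal syntax; all names and values are made up for illustration, and Object is left out:

```wdl
workflow typeDemo {
   String sampleId = "sample1"
   Int threads = 4
   Float minQuality = 20.5
   Boolean verbose = true
   # No default value, so this File must be supplied as an input.
   File reference
   Array[String] chromosomes = ["chr1", "chr2"]
   Map[String, Int] readCounts = {"sample1": 100, "sample2": 200}
   Pair[String, Int] labelled = ("sample1", 100)
}
```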

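Parameters in a workflow

In the diagram below, the workflow has three parameters: my_ref and my_input of type File, and name of type String. To pass these three parameters to a task, simply pass the variable names.

[Figure: a workflow declaring my_ref, my_input, and name and passing them to a task call]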

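Parameters in a task

In the diagram below, the task has three input parameters: ref and in of type File, and id of type String. Inside command, a variable's value is accessed in the form ${ref}.

[Figure: a task declaring ref, in, and id and using them in its command]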
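
Put together, the workflow-side and task-side descriptions above might look like the following sketch. The variable names come from the text; the workflow name passParameters, the task name analyze, and the echo command are assumptions standing in for a real tool.

```wdl
workflow passParameters {
   File my_ref
   File my_input
   String name
   # Left of "=" is the task's parameter, right of "=" is the workflow's variable.
   call analyze { input: ref=my_ref, in=my_input, id=name }
}
task analyze {
   File ref
   File in
   String id
   command {
       # Placeholder command; a real task would invoke an analysis tool here.
       echo "${id}: ${in} against ${ref}"
   }
   output {
       String out = read_string(stdout())
   }
}
```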

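How do tasks make up a workflow?

As a workflow management language, WDL has to manage multiple tasks together. Tasks can be related to one another in several ways: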

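1. Linear chaining:

[Figure: stepA -> stepB -> stepC]

The first and most common scenario is a simple linear chain: several tasks run one after another, each step's output is the next step's input, and the last task's output is the output of the whole workflow.
Example: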

workflow LinearChain {
 File firstInput
 call stepA { input: in=firstInput }
 call stepB { input: in=stepA.out }
 call stepC { input: in=stepB.out }
}
task stepA {
 File in
 command { programA I=${in} O=outputA.ext }
 output { File out = "outputA.ext" }
}
task stepB {
 File in
 command { programB I=${in} O=outputB.ext }
 output { File out = "outputB.ext" }
}
task stepC {
 File in
 command { programC I=${in} O=outputC.ext }
 output { File out = "outputC.ext" }
}
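
The chain above does not say explicitly which result the workflow itself exposes. A minimal sketch, assuming the draft-2 workflow output syntax as accepted by engines such as Cromwell, that names the last task's output as a workflow output; the names LinearChainWithOutput, copyA, copyB, and finalResult are hypothetical, and cp stands in for a real program.

```wdl
workflow LinearChainWithOutput {
 File firstInput
 call copyA { input: in=firstInput }
 call copyB { input: in=copyA.out }
 output {
   # Expose the last step's file as the workflow's result.
   File finalResult = copyB.out
 }
}
task copyA {
 File in
 command { cp ${in} outputA.ext }
 output { File out = "outputA.ext" }
}
task copyB {
 File in
 command { cp ${in} outputB.ext }
 output { File out = "outputB.ext" }
}
```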

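2. Many-to-many dependencies

The outputs of one task serve as the inputs of several tasks, or the outputs of several tasks serve as the inputs of one task.
Case 1:

[Figure: stepA -> stepB; both outputs of stepB -> stepC]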

workflow MultiOutMultiIn {
 File firstInput
 call stepA { input: in=firstInput }
 call stepB { input: in=stepA.out }
 call stepC { input: in1=stepB.out1, in2=stepB.out2 }
}
task stepA {
 File in
 command { programA I=${in} O=outputA.ext }
 output { File out = "outputA.ext" }
}
task stepB {
 File in
 command { programB I=${in} O1=outputB1.ext O2=outputB2.ext }
 output {
   File out1 = "outputB1.ext"
   File out2 = "outputB2.ext" }
}
task stepC {
 File in1
 File in2
 command { programC I1=${in1} I2=${in2} O=outputC.ext }
 output { File out = "outputC.ext" }
}

Case 2:

[Figure: stepA -> stepB and stepC; stepB and stepC -> stepD]

Example:

workflow BranchAndMerge {
 File firstInput
 call stepA { input: in=firstInput }
 call stepB { input: in=stepA.out }
 call stepC { input: in=stepA.out }
 call stepD { input: in1=stepC.out, in2=stepB.out }
}
task stepA {
 File in
 command { programA I=${in} O=outputA.ext }
 output { File out = "outputA.ext" }
}
task stepB {
 File in
 command { programB I=${in} O=outputB.ext }
 output { File out = "outputB.ext" }
}
task stepC {
 File in
 command { programC I=${in} O=outputC.ext }
 output { File out = "outputC.ext" }
}
task stepD {
 File in1
 File in2
 command { programD I1=${in1} I2=${in2} O=outputD.ext }
 output { File out = "outputD.ext" }
}

3. Parallel execution (scatter-gather)

[Figure: inputFiles scattered across parallel stepA calls, then gathered by stepB]

workflow ScatterGather {
 Array[File] inputFiles
 scatter (oneFile in inputFiles) {
   call stepA { input: in=oneFile }
 }
 call stepB { input: files=stepA.out }
}
task stepA {
 File in
 command { programA I=${in} O=outputA.ext }
 output { File out = "outputA.ext" }
}
task stepB {
 Array[File] files
 command { programB I=${sep=" I=" files} O=outputB.ext }
 output { File out = "outputB.ext" }
}

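A task still differs from a function in one respect: a function can be called any number of times in code, but calling a task more than once needs care. In the diagram below, stepA runs twice, once to feed stepB and once to feed stepC. If both calls were simply named stepA, the reference stepA.out would be ambiguous, and stepB and stepC could not receive the right input.

[Figure: stepA called twice, once feeding stepB and once feeding stepC]
WDL provides a solution called a task alias: give each call of the task its own name. For example: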

workflow taskAlias {
 File firstInput
 File secondInput
 call stepA as firstSample { input: in=firstInput }
 call stepA as secondSample { input: in=secondInput }
 call stepB { input: in=firstSample.out }
 call stepC { input: in=secondSample.out }
}
task stepA {
 File in
 command { programA I=${in} O=outputA.ext }
 output { File out = "outputA.ext" }
}
task stepB {
 File in
 command { programB I=${in} O=outputB.ext }
 output { File out = "outputB.ext" }
}
task stepC {
 File in
 command { programC I=${in} O=outputC.ext }
 output { File out = "outputC.ext" }
}

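In a WDL script, each call name must be unique within a workflow, so a task can be called only once under its own name; to call it more than once, use a task alias.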