How to build WDLs
Summary
In this tutorial, we will create a WDL script for a common bioinformatics pipeline.
Pre-requisites
This tutorial assumes that you have an understanding of the basic structure of a WDL script.
Some useful links:
Our workflow
The BBMap processing consists of two steps:
Alignment of sequence files to a reference genome using BBMap, followed by
SAM to BAM format conversion using samtools.
The basic commands for the two steps are:
# align reads to reference contigs
bbmap.sh in=reads.fq ref=reference.fasta out=test.sam
# create a bam file from alignment
samtools view -b -F0x4 test.sam | samtools sort - > test.sorted.bam
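The -F0x4 option in the samtools command above excludes reads whose SAM FLAG has the 0x4 bit set, i.e. unmapped reads. The following is a minimal sketch of that bit test in plain shell; the function name is_unmapped and the sample FLAG values are illustrative assumptions, not part of the tutorial repository.

```shell
# Succeed (exit status 0) if the given SAM FLAG has the 0x4 bit set,
# which marks the read as unmapped.
# is_unmapped is a hypothetical helper used only for this illustration.
is_unmapped() {
  [ $(( $1 & 0x4 )) -ne 0 ]
}

# Sample FLAG values: 0 = mapped (forward strand), 16 = mapped (reverse strand),
# 4 = unmapped, 77 = paired, first in pair, read and mate both unmapped.
for flag in 0 16 4 77; do
  if is_unmapped "$flag"; then
    echo "FLAG $flag: unmapped, dropped by -F0x4"
  else
    echo "FLAG $flag: mapped, kept by -F0x4"
  fi
done
```

Only the 0x4 bit matters for this filter, so a read with several other bits set is still kept as long as it is mapped.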
Set Up Your Working Environment
Download the example data repository:
git clone https://code.jgi.doe.gov/official-jgi-workflows/wdl-specific-repositories/jaws-tutorial-examples.git
cd jaws-tutorial-examples/data
In this folder, you will find a test data set:
Sample single-end FASTQ file
Reference fasta and index files
Converting Each Task to a WDL
If we have a workflow represented as a script (or a sequence of commands), we can parse it into WDL tasks.
Note
Each script you create should execute in and write output to the current working directory.
BBMap
This task aligns the sample single-end FASTQ file to the reference genome using the BBMap algorithm. Here is the task skeleton definition:
task alignment {
    input {...}
    command {...}
    output {...}
    runtime {...}
}
Now, we need to define the input variables, the alignment command line that will be executed, and the expected output files:
task alignment {
    input {
        File fastq
        File fasta
    }

    command <<<
        bbmap.sh in=~{fastq} ref=~{fasta} out=test.sam
    >>>

    output {
        File sam = "test.sam"
    }
}
We are passing the fastq file for our sample and the reference fasta as inputs to the task.
Note
Notice how variables are referenced in the command, using ~{variable_name}. Older versions of the WDL specification use ${variable_name}; to avoid confusion with bash variables, ~{variable_name} is recommended.
Hint
The command section is enclosed in either curly braces { … } or triple angle brackets <<< … >>>. The placeholder styles that are allowed depend on the command section style:
| Command Body Style | Placeholder Style |
|---|---|
| command { … } | ~{} (preferred) or ${} |
| command <<< … >>> | ~{} only |
Next, we need to define the runtime attributes, i.e., the number of CPUs, memory, and time required for the task, as well as the Docker container used for the execution.
task alignment {
    ...
    runtime {
        docker: "jfroula/aligner-bbmap@sha256:8a849019294cea0636d474d07f18e5f84e2b2b58cf50b104c04348db91cdabb4"
        cpu: 1
        memory: "5G"
        runtime_minutes: 10
    }
}
Our task is complete and should look like this:
task alignment {
    input {
        File fastq
        File fasta
    }

    command <<<
        bbmap.sh in=~{fastq} ref=~{fasta} out=test.sam
    >>>

    output {
        File sam = "test.sam"
    }

    runtime {
        docker: "jfroula/aligner-bbmap@sha256:8a849019294cea0636d474d07f18e5f84e2b2b58cf50b104c04348db91cdabb4"
        cpu: 1
        memory: "5G"
        runtime_minutes: 10
    }
}
Samtools
This task takes the output of the alignment step in SAM format, converts it to BAM, and sorts it by coordinate using Samtools.
The task skeleton is the same as the one used above. The complete Samtools task definition should look like this:
task samtools {
    input {
        File sam
    }

    command <<<
        set -eo pipefail
        samtools view -b -F0x4 ~{sam} | samtools sort - > test.sorted.bam
    >>>

    output {
        File bam = "test.sorted.bam"
    }

    runtime {
        docker: "jfroula/aligner-bbmap@sha256:8a849019294cea0636d474d07f18e5f84e2b2b58cf50b104c04348db91cdabb4"
        cpu: 1
        memory: "5G"
        runtime_minutes: 10
    }
}
Hint: set -eo pipefail
This command is useful at the beginning of the command section of your WDL task. It makes the script stop at the point where an error occurs, including a failure in any stage of a pipeline, instead of continuing to run the commands that follow, which would make debugging more difficult.
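The effect of set -eo pipefail can be seen outside of WDL with a small experiment. The sketch below assumes bash is available on the PATH and uses false | cat as a stand-in for a pipeline whose first stage fails; the variable names are illustrative.

```shell
# Default behaviour: a pipeline's exit status is that of its LAST command,
# so the failure of `false` is hidden behind the success of `cat`.
status_default=$(bash -c 'false | cat; echo $?')

# With pipefail: the pipeline fails if ANY stage fails.
status_pipefail=$(bash -c 'set -o pipefail; false | cat; echo $?')

# With -e and pipefail together: the script stops at the failing pipeline,
# so the echo after it never runs.
output_strict=$(bash -c 'set -eo pipefail; false | cat; echo "still running"') \
  && status_strict=0 || status_strict=$?

echo "pipeline status without pipefail: $status_default"
echo "pipeline status with pipefail:    $status_pipefail"
echo "script status with -eo pipefail:  $status_strict (output: '$output_strict')"
```

In the samtools task, this means a failure in samtools view is reported as a task failure rather than being masked by a successful samtools sort.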
Workflow Definition
Let’s explore the workflow skeleton:
version 1.0

workflow bbtools {
    input { }

    call alignment { input: }

    call samtools { input: }
}
At the top level, we define a workflow named bbtools, within which we make calls to a set of tasks, here alignment and samtools.
Cromwell (the execution engine) determines the order of execution from the dependencies between tasks: a call that uses the output of another call runs after it. Calls that have no dependency between them may be run in parallel.
Note
The very first line states the version of the WDL specification being used. This example uses version 1.0, which is the version JAWS currently uses.
Now, we need to define the input variables for the tasks and, most importantly, tell Cromwell how to link the tasks together:
version 1.0

workflow bbtools {
    input {
        File reads
        File ref
    }

    call alignment {
        input: fastq=reads,
               fasta=ref
    }

    call samtools {
        input: sam=alignment.sam
    }
}
The workflow calls two tasks. The second task, samtools, uses the output of the first task, alignment.
How do you pass the output of one task as input to another?
In this example, each of the two tasks has an output section that defines the name of its output. The name of the output of the alignment task is "sam" (i.e. File sam = "test.sam"). The second task, samtools, can access this output by referring to it as "alignment.sam" (<task><dot><output variable>). See the line input: sam=alignment.sam.
Finally, combining all the top-level components (the workflow and the tasks) in the same file, we expect to have:
version 1.0

workflow bbtools {
    input {
        File reads
        File ref
    }

    call alignment {
        input: fastq=reads,
               fasta=ref
    }

    call samtools {
        input: sam=alignment.sam
    }
}

task alignment {
    input {
        File fastq
        File fasta
    }

    command <<<
        bbmap.sh in=~{fastq} ref=~{fasta} out=test.sam
    >>>

    output {
        File sam = "test.sam"
    }

    runtime {
        docker: "jfroula/aligner-bbmap@sha256:8a849019294cea0636d474d07f18e5f84e2b2b58cf50b104c04348db91cdabb4"
        cpu: 1
        memory: "5G"
        runtime_minutes: 10
    }
}

task samtools {
    input {
        File sam
    }

    command <<<
        set -eo pipefail
        samtools view -b -F0x4 ~{sam} | samtools sort - > test.sorted.bam
    >>>

    output {
        File bam = "test.sorted.bam"
    }

    runtime {
        docker: "jfroula/aligner-bbmap@sha256:8a849019294cea0636d474d07f18e5f84e2b2b58cf50b104c04348db91cdabb4"
        cpu: 1
        memory: "5G"
        runtime_minutes: 10
    }
}
Now, you can save this file as align.wdl.
Note
Note that the tasks are defined outside of the workflow block while the call statements are placed inside of it.
Note
Note that the commands in each task's command section are run inside the Docker container given in that task's runtime section.
Validate
Validate using JAWS
Next, we will validate our script to make sure there are no syntax errors, using the jaws validate command:
## Log in to Dori
## Activate the environment
module load jaws
jaws validate align.wdl
> Workflow is OK
Validate locally
jaws validate uses miniwdl.
Inputs
Create your input file
You can create an inputs file from scratch, following this skeleton:
{
"<workflow name>.<variable name>": "<value>"
}
For the example in this tutorial, the inputs file (inputs.json) will contain:
{
"bbtools.reads": "data/sample.fastq.bz2",
"bbtools.ref": "data/sample.fasta"
}
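The inputs file can also be written straight from the shell. The sketch below writes the same content with a here-document and then runs an optional well-formedness check; it assumes the file is named inputs.json (the name used when running Cromwell later in this tutorial) and that python3 may or may not be installed.

```shell
# Name of the inputs file; inputs.json matches the name used with `-i` later on.
inputs_file="inputs.json"

# Write the inputs with a quoted here-document so nothing is expanded by the shell.
cat > "$inputs_file" <<'EOF'
{
  "bbtools.reads": "data/sample.fastq.bz2",
  "bbtools.ref": "data/sample.fasta"
}
EOF

# Optional sanity check that the file is well-formed JSON.
# Assumption: python3 is on the PATH; the check is skipped otherwise.
if command -v python3 >/dev/null 2>&1; then
  python3 -m json.tool "$inputs_file" >/dev/null \
    && echo "$inputs_file is well-formed JSON"
fi
```

Each key follows the <workflow name>.<variable name> pattern, so bbtools.reads and bbtools.ref correspond to the reads and ref inputs of the bbtools workflow.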
Create your input file using JAWS
As an alternative, you can build a skeleton template based on the WDL using the following command:
jaws inputs align.wdl
This command should output a template for the inputs.json file. You can then fill in the values of each key.
{
"bbtools.reads": "File",
"bbtools.ref": "File"
}
Execute Locally
Running with your own Cromwell version.
Make sure that BBTools and samtools are installed in your environment. Alternatively, you can use a conda environment, as demonstrated here.
# run with your installed version
cromwell run align.wdl -i inputs.json
## OR
java -jar /path/to/cromwell/cromwell.jar run align.wdl -i inputs.json
Outputs
The outputs of the workflow are written to the <workflow_root>/call-<call_name>/execution/ folder.
Each task of your workflow runs inside its own execution directory, so this is where you will find its output files, as well as the stderr, stdout, and script files.
Explore the directory structure to find the relevant files.
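One way to explore that layout is to list everything under the call-<call_name>/execution/ directories with find. The sketch below defines a hypothetical helper, list_execution_files, and demonstrates it on a mock directory tree created in a temporary folder; the mock tree only mimics the layout described above, and a real one is created by Cromwell when the workflow runs.

```shell
# List every file located in a call-*/execution/ directory below the given root.
# list_execution_files is an illustrative helper, not part of Cromwell or JAWS.
list_execution_files() {
  find "$1" -type f -path '*/call-*/execution/*' | sort
}

# Build a mock workflow root that mimics the layout described above.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/call-alignment/execution" "$demo_root/call-samtools/execution"
touch "$demo_root/call-alignment/execution/test.sam" \
      "$demo_root/call-alignment/execution/stdout" \
      "$demo_root/call-alignment/execution/stderr" \
      "$demo_root/call-alignment/execution/script" \
      "$demo_root/call-samtools/execution/test.sorted.bam"

# Show the files of both calls.
list_execution_files "$demo_root"
```

For a real run, pass the workflow root directory of that run instead of the mock folder.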
Visualize your Workflow
Create the Directed Acyclic Graph (DAG) of the WDL file using WOMtool:
java -jar womtool-87.jar graph align.wdl > align.dot
dot -Tpng align.dot -o align.png # You need to install graphviz
Install dependencies:
wget https://github.com/broadinstitute/cromwell/releases/download/87/womtool-87.jar
brew install graphviz # mac
sudo apt install graphviz # linux