How to build WDLs

Summary

In this tutorial, we will create a WDL script for a common bioinformatics pipeline.

Pre-requisites

This tutorial assumes that you have an understanding of the basic structure of a WDL script.

Some useful links:

Start with the official WDL site

Real world examples

Re-usable subworkflow tasks: WDL-tasks

Our workflow

The processing with BBMap contains two steps:

Alignment of sequence files to reference genome using BBMap, followed by

SAM to BAM format conversion using samtools.

The basic commands for the two steps are:

# align reads to reference contigs
bbmap.sh in=reads.fq ref=reference.fasta out=test.sam

# create a bam file from alignment
samtools view -b -F0x4 test.sam | samtools sort - > test.sorted.bam
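As a rough sketch, the two commands can be collected into one shell script before they are split into WDL tasks. The run helper and the RUN variable are illustrative assumptions, not part of the tutorial: by default the script only prints the commands, so it can be read on a machine where BBMap and samtools are not installed.

```shell
# File names are the placeholders used in the commands above.
reads="reads.fq"
ref="reference.fasta"
sam="test.sam"
bam="test.sorted.bam"

# Hypothetical helper: print the command unless RUN=1 is set,
# in which case the command is executed with bash.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    bash -c "$1"
  else
    echo "$1"
  fi
}

# Step 1: align reads to reference contigs
run "bbmap.sh in=${reads} ref=${ref} out=${sam}"

# Step 2: create a sorted bam file from the alignment
run "set -o pipefail; samtools view -b -F0x4 ${sam} | samtools sort - > ${bam}"
```

Each run line corresponds to one of the two WDL tasks built in the rest of this tutorial.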

Set up your Working Environment

Download the example data repository:

git clone https://code.jgi.doe.gov/official-jgi-workflows/wdl-specific-repositories/jaws-tutorial-examples.git
cd jaws-tutorial-examples/data

In this folder, you will find a test data set:

Sample single-end FASTQ file

Reference fasta and index files

Converting Each Task to a WDL

If we have a workflow represented as a script (or a sequence of commands), we can parse it into WDL tasks.

Note

Each script you create should execute in and write output to the current working directory.

BBMap

This task aligns the sample single-end FASTQ file to the reference genome using BBMap. Here is the task skeleton:

task alignment {
  input {...}
  command {...}
  output {...}
  runtime {...}
}

Now we need to define the input variables, the alignment command line that will be executed, and the expected output files:

task alignment {
  input {
    File fastq
    File fasta
  }

  command <<<
    bbmap.sh in=~{fastq} ref=~{fasta} out=test.sam
  >>>

  output {
    File sam = "test.sam"
  }
}

We are passing the fastq file for our sample and the reference fasta as inputs to the task.

Note

Notice how variables are referenced in the command using ~{variable_name}. Older versions of the WDL specification use ${variable_name}; to avoid confusion with bash variables, ~{variable_name} is recommended.

Hint

The command section is enclosed in either curly braces { … } or triple angle brackets <<< … >>>. Expression placeholders differ depending on the command section style:

Command Body Style	Placeholder Style
command { … }	~{} (preferred) or ${}
command <<< >>>	~{} only
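To make the difference concrete, the snippet below writes a small WDL file in which a WDL placeholder and a bash variable appear side by side inside a <<< >>> command section. The file name placeholder_demo.wdl and the task count_lines are hypothetical and are not part of the tutorial workflow; treat this as a sketch rather than a recommended task.

```shell
# The quoted 'EOF' keeps the shell from expanding anything inside the here-document,
# so the WDL text is written to the file exactly as shown.
cat > placeholder_demo.wdl <<'EOF'
version 1.0

# Hypothetical task, not part of the tutorial workflow.
task count_lines {
  input {
    File fastq
  }

  # ~{fastq} is replaced by the WDL engine before the script runs;
  # ${n} is an ordinary bash variable and is passed through unchanged.
  command <<<
    n=$(wc -l < ~{fastq})
    echo "${n}"
  >>>

  output {
    String line_count = read_string(stdout())
  }
}
EOF
```

With the command { … } style, ${n} would be read as a WDL placeholder, which is why the <<< >>> style is the safer choice when the command uses bash variables.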

Next, we need to define the runtime attributes, i.e., the number of CPUs, memory, and time required for the task, as well as the Docker container used for the execution.

task alignment {
  ...

  runtime {
    docker: "jfroula/aligner-bbmap@sha256:8a849019294cea0636d474d07f18e5f84e2b2b58cf50b104c04348db91cdabb4"
    cpu: 1
    memory: "5G"
    runtime_minutes: 10
  }
}

Our task is complete and should look like this:

task alignment {
  input {
    File fastq
    File fasta
  }

  command <<<
    bbmap.sh in=~{fastq} ref=~{fasta} out=test.sam
  >>>

  output {
    File sam = "test.sam"
  }

  runtime {
    docker: "jfroula/aligner-bbmap@sha256:8a849019294cea0636d474d07f18e5f84e2b2b58cf50b104c04348db91cdabb4"
    cpu: 1
    memory: "5G"
    runtime_minutes: 10
  }
}

Samtools

This task takes the SAM output from the alignment step, converts it to BAM, and sorts it by coordinate using Samtools.

The task skeleton is the same as the one used above. The complete Samtools task definition should look like this:

task samtools {
  input {
    File sam
  }

  command <<<
    set -eo pipefail
    samtools view -b -F0x4 ~{sam} | samtools sort - > test.sorted.bam
  >>>

  output {
    File bam = "test.sorted.bam"
  }

  runtime {
    docker: "jfroula/aligner-bbmap@sha256:8a849019294cea0636d474d07f18e5f84e2b2b58cf50b104c04348db91cdabb4"
    cpu: 1
    memory: "5G"
    runtime_minutes: 10
  }
}
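The set -eo pipefail line matters here because the command is a pipeline: without pipefail, bash reports only the exit status of the last command in a pipe, so a failure in samtools view could go unnoticed. A minimal illustration, using false as a stand-in for a failing first step (this assumes bash is available):

```shell
# Without pipefail, the pipeline's status is that of the last command (cat), which succeeds.
bash -c 'false | cat; echo "without pipefail: exit status $?"'

# With pipefail, the pipeline fails if any command in it fails.
bash -c 'set -o pipefail; false | cat; echo "with pipefail: exit status $?"'
```

The first line reports exit status 0 and the second reports exit status 1. Combined with -e, a non-zero pipeline status stops the task script instead of letting it continue with an incomplete BAM file.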

Workflow Definition

Let’s explore the workflow skeleton:

version 1.0

workflow bbtools {
   input { }

   call alignment { input: }

   call samtools { input: }
}

At the top level, we define a workflow named bbtools, within which we make calls to a set of tasks, here alignment and samtools.

The order of execution is determined by the dependencies between tasks: a task that consumes another task's output runs after it, regardless of the order in which the calls are written. If there are no dependencies between tasks, Cromwell (the execution engine) will run them in parallel.

Note

The very first line states the version of the WDL specification being used, here version 1.0. JAWS currently uses version 1.0.

Now we need to define the input variables for the tasks and, most importantly, tell Cromwell how to link the tasks together:

version 1.0

workflow bbtools {
  input {
    File reads
    File ref
  }

  call alignment {
    input: fastq=reads,
           fasta=ref
  }

  call samtools {
    input: sam=alignment.sam
  }
}

The workflow calls two tasks. The second task, samtools, uses the output of the first task, alignment.

How to pass the output of one task as input to another?

In this example, each of the two tasks has an output section that defines the name of the output. The name of the output for the alignment task is “sam” (e.g. File sam = "test.sam"). The second task, samtools, can access this output by referring to it as “alignment.sam” (<task><dot><output variable>). See the line input: sam=alignment.sam.

Finally, combining the top-level components (workflow and tasks) in a single file, we expect to have:

version 1.0

workflow bbtools {
  input {
    File reads
    File ref
  }

  call alignment {
    input: fastq=reads,
           fasta=ref
  }

  call samtools {
    input: sam=alignment.sam
  }
}

task alignment {
  input {
    File fastq
    File fasta
  }

  command <<<
    bbmap.sh in=~{fastq} ref=~{fasta} out=test.sam
  >>>

  output {
    File sam = "test.sam"
  }

  runtime {
    docker: "jfroula/aligner-bbmap@sha256:8a849019294cea0636d474d07f18e5f84e2b2b58cf50b104c04348db91cdabb4"
    cpu: 1
    memory: "5G"
    runtime_minutes: 10
  }
}

task samtools {
  input {
    File sam
  }

  command <<<
    set -eo pipefail
    samtools view -b -F0x4 ~{sam} | samtools sort - > test.sorted.bam
  >>>

  output {
    File bam = "test.sorted.bam"
  }

  runtime {
    docker: "jfroula/aligner-bbmap@sha256:8a849019294cea0636d474d07f18e5f84e2b2b58cf50b104c04348db91cdabb4"
    cpu: 1
    memory: "5G"
    runtime_minutes: 10
  }
}

Now, you can save this file as align.wdl.

Note

The tasks are defined outside of the workflow block, while the call statements are placed inside it.

Note

Each task's command section runs inside the Docker container named in that task's runtime section.

Validate

Validate using JAWS

Next, we will validate our script to make sure there are no syntax errors, using the jaws validate command:

  ## Login to Dori
  ## Activate the environment
  module load jaws

  jaws validate align.wdl
  > Workflow is OK

Validate locally

jaws validate uses miniwdl.
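Since jaws validate is built on miniwdl, a similar check can be run locally with miniwdl's own check subcommand, assuming miniwdl is installed (for example with pip install miniwdl). The validate_wdl function name is illustrative, and the result may not match jaws validate exactly if the installed miniwdl version differs from the one JAWS uses.

```shell
# Hypothetical wrapper around `miniwdl check` that fails with a clear message
# when the WDL file or miniwdl itself is missing.
validate_wdl() {
  wdl="$1"
  if [ ! -f "$wdl" ]; then
    echo "no such file: $wdl" >&2
    return 1
  fi
  if ! command -v miniwdl >/dev/null 2>&1; then
    echo "miniwdl not found; install it first (for example: pip install miniwdl)" >&2
    return 1
  fi
  miniwdl check "$wdl"
}

validate_wdl align.wdl || echo "Validation did not pass (or could not run)."
```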

Inputs

Create your input file

You can create an inputs file from scratch, following this skeleton:

{
   "<workflow name>.<variable name>": "<value>"
}

For the example in this tutorial, the inputs file will contain:

{
   "bbtools.reads": "data/sample.fastq.bz2",
   "bbtools.ref": "data/sample.fasta"
}
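One way to create this file from the command line, and to catch mistyped paths before submitting, is sketched below. The missing_inputs helper is a hypothetical convenience, not a JAWS command, and its sed pattern only handles flat "key": "value" pairs like the ones in this example.

```shell
# Write the inputs file for the bbtools workflow (values taken from the example above).
cat > inputs.json <<'EOF'
{
   "bbtools.reads": "data/sample.fastq.bz2",
   "bbtools.ref": "data/sample.fasta"
}
EOF

# Print every value in the inputs file that does not exist as a path
# relative to the current directory.
missing_inputs() {
  sed -n 's/^ *"[^"]*": *"\([^"]*\)".*$/\1/p' "$1" | while read -r input_path; do
    [ -e "$input_path" ] || echo "$input_path"
  done
}

missing_inputs inputs.json
```

If the command prints nothing, every path in inputs.json was found relative to the directory you ran it from.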

Create your input file using JAWS

As an alternative, you can build a skeleton template based on the WDL using the following command:

jaws inputs align.wdl

This command should output a template for the inputs.json file. You can then fill in the values of each key.

{
   "bbtools.reads": "File",
   "bbtools.ref": "File"
}

Execute Locally

Run with your own Cromwell installation.

Make sure BBTools and samtools are installed in your environment. Alternatively, you can use a conda environment, as demonstrated here.

# run with your installed version
cromwell run align.wdl -i inputs.json
## OR
java -jar /path/to/cromwell/cromwell.jar run align.wdl -i inputs.json
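Before launching Cromwell locally, it can help to confirm that the tools the tasks call are on your PATH. The check_tools function below is an illustrative helper, not part of JAWS or Cromwell; the tool names are the ones used in this tutorial, plus java for the java -jar form of the command.

```shell
# Report which of the given commands can be found on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
  rc=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      rc=1
    fi
  done
  return "$rc"
}

check_tools bbmap.sh samtools java || echo "Install the missing tools before running Cromwell."
```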

Outputs

The outputs of the workflow will be written to the <workflow_root>/call-<call_name>/execution/ folder.

Each task of your workflow runs inside its own execution directory, so this is where you will find its output files, including the stderr, stdout, and script files.

Please explore the directory structure for relevant files!
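A quick way to explore is to list the interesting files under every execution directory. The sketch below assumes Cromwell's default local layout, where runs are written under a cromwell-executions directory in the folder you launched from; the list_call_files name is illustrative, and the root can be changed if your configuration writes elsewhere.

```shell
# List alignment outputs and the stdout, stderr and script files
# found in any call-<call_name>/execution/ directory under the given root.
list_call_files() {
  root="${1:-cromwell-executions}"
  if [ ! -d "$root" ]; then
    echo "no such directory: $root" >&2
    return 1
  fi
  find "$root" -type f -path '*/call-*/execution/*' \
    \( -name '*.sam' -o -name '*.bam' -o -name stdout -o -name stderr -o -name script \) | sort
}

list_call_files cromwell-executions || echo "Run the workflow first."
```

The stderr file is usually the first place to look when a task fails, and script shows the exact command that was run after the placeholders were filled in.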