Creating Docker Containers

Summary

This tutorial describes one way Docker images can be created and used in your WDL. If you are unfamiliar with Docker, please see the Docker tutorial or one of the many YouTube tutorials.

Prerequisites

This tutorial page relies on completing the previous tutorial, Lesson 1: Development Environment.

Note

As a prerequisite, you will need a computer with Docker installed (Docker Engine - Community). Installation instructions can be found at docs.docker.com/install, or, if you have conda installed, you can run conda install -c conda-forge docker-py.

Here are the steps we’re going to take for this tutorial:
  1. make a Docker image from the same commands you used for the conda environment (Lesson 1: Development Environment);

  2. run a WDL that uses your Docker container.

Clone the Example Repository

For this tutorial, I will be using the example code from jaws-tutorial-examples. To follow along, do:

git clone https://code.jgi.doe.gov/official-jgi-workflows/wdl-specific-repositories/jaws-tutorial-examples.git
cd jaws-tutorial-examples/5min_example

Create docker image

Next we’ll describe how to create a Dockerfile and register the resulting image with hub.docker.com. But first, create an account and click on “Create a Repository”. In the space provided, enter a name for your container, such as aligner-bbmap; the image doesn’t have to exist yet. You will push a docker image to this name after you build it in the next steps.

To make the Dockerfile, you can use the same commands you used for the conda environment. Notice that it is good practice to pin software versions when installing, as I have done in the example Dockerfile. Of course, you can drop the versions altogether to get the latest releases, but then the Dockerfile may not work out-of-the-box in the future due to version conflicts.

Note

When creating the Dockerfile, it is helpful to test each command (e.g. apt-get, wget, conda install, etc.) manually, inside an empty docker container. Once everything is working, you can copy the commands into a Dockerfile.

This docker command will create an interactive container with an ubuntu base image. You can start installing stuff as root.

docker run -it ubuntu:latest /bin/bash
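
Inside that container, you can walk through the same steps that appear in the example Dockerfile below and confirm each one works before copying it into the Dockerfile, for example:

# these commands mirror the example Dockerfile shown below
apt-get update && apt-get install -y wget bzip2
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh
bash ./Miniconda3-py39_4.9.2-Linux-x86_64.sh -b -p /usr/local/bin/miniconda3
export PATH=/usr/local/bin/miniconda3/bin:$PATH
conda install -c bioconda bbmap==38.84 samtools==1.11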

Here is an example Dockerfile (provided in 5min_example). We will create a container from it.

FROM ubuntu:22.04

# Install stuff with apt-get
RUN apt-get update && apt-get install -y wget bzip2 \
    && rm -rf /var/lib/apt/lists/*

# Point to all the future conda installations you are going to do
ENV CONDAPATH=/usr/local/bin/miniconda3
ENV PATH=$CONDAPATH/bin:$PATH

# Install miniconda
# There is a good reason to install miniconda in a path other than its default.
# The default installation directory is /root/miniconda3 but this path will not be
# accessible by shifter or singularity so we'll install under /usr/local/bin/miniconda3.
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh \
    && bash ./Miniconda3*.sh -b -p $CONDAPATH \
    && rm Miniconda3*.sh

# Install software with conda
RUN conda install -c bioconda bbmap==38.84 samtools==1.11 \
    && conda clean -afy

# This will give us a workingdir within the container (e.g. a place we can mount data to)
WORKDIR /bbmap

# Move script into container.
# Note that it is copied to a location in your $PATH
COPY script.sh /usr/local/bin/script.sh

Build the image and upload to hub.docker.com

You need to use your docker hub user name to tag the image when you are building it (see below).

# Create a "build" directory and build the docker image from there so it stays small. It is good practice to always
# build an image from a directory containing only the required files; otherwise everything else in the directory
# becomes part of the build context and can end up bloating the image.
mkdir build
cp script.sh Dockerfile build/
cd build
docker build --tag <your_docker_hub_user_name>/aligner-bbmap:1.0.0 .
cd ../

Test that the example script runs in the docker container

# use your image name
docker run <your_docker_hub_user_name>/aligner-bbmap:1.0.0 script.sh

# if you are in the root of the 5min_example directory, then try re-running the script with data.
docker run --volume="$(pwd)/../data:/bbmap" <your_docker_hub_user_name>/aligner-bbmap:1.0.0 script.sh sample.fastq.bz2 sample.fasta

# Notice script.sh is found because it was copied to a location in PATH in the Dockerfile, and
# the two inputs are found because the data directory is mounted to /bbmap (the WORKDIR inside the container) where the script runs.

When you are convinced the docker image is good, you can push it to hub.docker.com (remember to make an account first). When you run a WDL in JAWS, the docker images will be pulled from hub.docker.com.

docker login
docker push <your_docker_hub_user_name>/aligner-bbmap:1.0.0

Now your image is available on any site (e.g. dori, jgi, tahoma, perlmutter, aws, etc.). Although you can manually pull your image (see below), JAWS will do this for you; you only need to pull images manually if you are testing Cromwell locally.
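
For reference, a manual pull looks like this:

# on a docker-machine
docker pull <your_docker_hub_user_name>/aligner-bbmap:1.0.0

# on a shifter-machine (e.g. Perlmutter)
shifterimg pull <your_docker_hub_user_name>/aligner-bbmap:1.0.0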

Test your image on Perlmutter

Besides your docker machine, it is useful to test your image on Perlmutter since you will likely be running your WDL there at some point. Certain aspects of the docker container will work on your docker machine but won’t on another site, like dori, because shifter and singularity behave differently than docker.

To test the docker container on perlmutter-p1.nersc.gov, you’ll need to use the shifter command instead of docker to run your workflow, but the image is the same. More about shifter at NERSC.

Example:

# pull image from hub.docker.com
shifterimg pull <your_docker_hub_user_name>/aligner-bbmap:1.0.0

# clone the repo on Perlmutter
git clone https://code.jgi.doe.gov/official-jgi-workflows/wdl-specific-repositories/jaws-tutorial-examples.git
cd jaws-tutorial-examples/5min_example

# run your wrapper script through shifter. Notice we invoke script.sh, which is identical to the copy saved inside the image
shifter --image=<your_docker_hub_user_name>/aligner-bbmap:1.0.0 ./script.sh ../data/sample.fastq.bz2 ../data/sample.fasta

The WDL

The script.sh that is supplied with the repo has two essential commands:

# align reads to reference contigs
bbmap.sh Xmx12g in=$READS ref=$REF out=test.sam

# create a bam file from alignment
samtools view -b -F0x4 test.sam | samtools sort - > test.sorted.bam

It would make sense to have both commands inside one task of the WDL because they logically should be run together. However, as an exercise, we will split the two commands into two tasks. The output from the first command is used in the second command, so in our WDL example, we can see how tasks pass information.

See an example of the finished WDL, align_final.wdl, and its inputs.json file below.

align_final.wdl
version 1.0

workflow bbtools {
    input {
        File reads
        File ref
    }

    call alignment {
       input: fastq=reads,
              fasta=ref
    }
    call samtools {
       input: sam=alignment.sam
    }
}

task alignment {
    input {
        File fastq
        File fasta
    }

    command {
        bbmap.sh Xmx12g in=~{fastq} ref=~{fasta} out=test.sam
    }

    runtime {
        docker: "jfroula/aligner-bbmap:2.0.2"
        runtime_minutes: 10
        memory: "5G"
        cpu: 1
    }

    output {
       File sam = "test.sam"
    }
}

task samtools {
    input {
        File sam
    }

    command {
       samtools view -b -F0x4 ~{sam} | samtools sort - > test.sorted.bam
    }

    runtime {
        docker: "jfroula/aligner-bbmap:2.0.2"
        runtime_minutes: 10
        memory: "5G"
        cpu: 1
    }

    output {
       File bam = "test.sorted.bam"
    }
}
inputs.json
{
    "bbtools.reads": "../data/sample.fastq.bz2",
    "bbtools.ref": "../data/sample.fasta"
}

Note

Singularity, docker, or shifter can be prepended to each command for testing (see align_with_shifter.sh); however, this wouldn’t be appropriate for a finished “JAWSified” WDL because you lose portability. The final WDL should have the docker image name inside the runtime {} section.

This may be helpful when testing & debugging so I’ve included an example where shifter is prepended to each command.

align_with_shifter.wdl
version 1.0

workflow bbtools {
    input {
        File reads
        File ref
    }

    call alignment {
       input: fastq=reads,
              fasta=ref
    }
    call samtools {
       input: sam=alignment.sam
    }
}

task alignment {
    input {
        File fastq
        File fasta
    }

    command {
        shifter --image=jfroula/aligner-bbmap:2.0.2 bbmap.sh Xmx12g in=~{fastq} ref=~{fasta} out=test.sam
    }

    output {
       File sam = "test.sam"
    }
}

task samtools {
    input {
        File sam
    }

    command {
       shifter --image=jfroula/aligner-bbmap:2.0.2 samtools view -b -F0x4 ~{sam} | shifter --image=jfroula/aligner-bbmap:2.0.2 samtools sort - > test.sorted.bam
    }

    output {
       File bam = "test.sorted.bam"
    }
}

You would run this WDL on Perlmutter with the following command.

java -jar /global/cfs/cdirs/jaws/jaws-install/perlmutter-prod/lib/cromwell-84.jar run align_with_shifter.wdl -i inputs.json

The Docker Image Should be in the runtime{} Section

Everything in the command{} section of the WDL will run inside a docker container if you’ve added docker to the runtime{} section. Your WDL then has the potential to run on a machine with shifter, singularity, or docker: JAWS will take your docker image and run it appropriately with singularity, docker, or shifter. If you run the WDL with the cromwell command on a shifter or singularity machine, you need to supply a cromwell.conf file, explained shortly.

See align_final.wdl:

runtime {
    docker: "jfroula/aligner-bbmap:2.0.2"
}

Run the Final WDL with Cromwell

On a Docker machine

You can now run the final WDL:

conda activate bbtools  # you need this for the cromwell command only
cromwell run align_final.wdl -i inputs.json

On Perlmutter

You’ll have to include a cromwell.conf file in the command because the config file determines whether the image supplied in the runtime{} section is run with docker, singularity, or shifter. You didn’t need to supply a cromwell.conf file in the earlier cromwell command because docker is the default.

The cromwell.conf file is used to:

  1. override cromwell’s default settings

  2. tell cromwell how to interpret the WDL (i.e. whether to use shifter, singularity, etc.)

  3. specify the backend to use (i.e. local, slurm, aws, condor, etc.)

Note

JAWS takes care of the cromwell.conf for you.

Here you can find the config files: jaws-tutorials-examples/config_files.
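
To give a sense of what these files contain, here is a rough sketch (not the actual JAWS config) of a Local backend whose submit-docker block wraps each task’s script in shifter whenever the task declares a docker image. The key names follow Cromwell’s documented configuration format; the real files in config_files will differ in detail, so use those when actually running on Perlmutter or dori.

include required(classpath("application"))

backend {
  default = Local
  providers {
    Local {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # where cromwell places each task's working directory
        dockerRoot = /path/to/cromwell-executions
        run-in-background = true
        # runtime attributes a task is allowed to set
        runtime-attributes = """
          String? docker
        """
        # how to run a task with no container
        submit = "/bin/bash ${script}"
        # how to run a task that declares runtime { docker: ... } -- here via shifter
        submit-docker = """
          shifter --image=${docker} /bin/bash ${script}
        """
      }
    }
  }
}

With a config file in hand, you invoke cromwell like this: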

java -Dconfig.file=<repository-root>/config_files/<cromwell_*.conf> \
     -Dbackend.providers.Local.config.dockerRoot=$(pwd)/cromwell-executions \
     -Dbackend.default=Local \
     -jar <path/to/cromwell.jar> run <wdl> -i <inputs.json>

where

-Dconfig.file
points to a cromwell conf file that is used to overwrite the default configurations. There are versions for perlmutter, dori, etc.

-Dbackend.providers.Local.config.dockerRoot
this overwrites a variable ‘dockerRoot’ that is in cromwell_perlmutter.conf so that cromwell will use your own current working directory to place its output.

-Dbackend.default=[Local|Slurm]
this will allow you to choose between the Local and Slurm backends. With Slurm, each task will have its own sbatch command (and thus wait in the queue).

cromwell.jar can be what you installed or you can use these paths:
dori: /clusterfs/jgi/groups/dsi/homes/svc-jaws/jaws-install/dori-prod/lib/cromwell-84.jar
perlmutter: /global/cfs/cdirs/jaws/jaws-install/perlmutter-prod/lib/cromwell-84.jar
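
Putting it all together, a concrete invocation on Perlmutter with the Local backend would look something like this (the config file name follows the one mentioned above; adjust the repository path to match your checkout):

java -Dconfig.file=<repository-root>/config_files/cromwell_perlmutter.conf \
     -Dbackend.providers.Local.config.dockerRoot=$(pwd)/cromwell-executions \
     -Dbackend.default=Local \
     -jar /global/cfs/cdirs/jaws/jaws-install/perlmutter-prod/lib/cromwell-84.jar \
     run align_final.wdl -i inputs.json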

Understanding the Cromwell Output

Cromwell output is:

  1. files created by the workflow

  2. the stdout/stderr printed to screen

1. Where to find the output files

Cromwell saves the results under a directory called cromwell-executions. Under this directory, each WDL run gets its own uniquely named folder.

[Figure: layout of the cromwell-executions directory]

Each task of your workflow runs inside an execution directory, so it is here that you can find any output files, including the stderr, stdout, and script files.
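
As an illustration, the layout for the bbtools workflow looks roughly like this (the workflow and task names come from the example WDL; the run id is a placeholder for the unique uuid cromwell generates):

cromwell-executions/
└── bbtools/                          # workflow name
    └── <workflow-run-uuid>/          # one WDL run
        ├── call-alignment/
        │   └── execution/            # script, stderr, stdout, rc, test.sam
        └── call-samtools/
            └── execution/            # script, stderr, stdout, rc, test.sorted.bam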

Explanation of cromwell generated files

stderr

The stderr from any of the commands/scripts in your task should be in this file.

stdout

The stdout from all the commands/scripts in your task should be in this file. Not all scripts send errors to stderr as they should, so you may find errors here instead.

script

The script file is run by the script.submit file. It contains all the commands that you supplied in the command{} section of the WDL, as well as cromwell-generated code that creates the stderr, stdout, and rc files.

script.submit

This file contains the actual command that cromwell ran. If the file was created by JAWS, there is one more step before “script” gets run.

script.submit -> dockerScript -> script

rc

This file contains the return code for the command{} section of the WDL. One thing to remember is that the return code written to the rc file comes from the last command run. So if an earlier command fails but the last command succeeds, the return code will be 0, unless you used set -e, which forces an exit upon the first error.
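
To illustrate with a hypothetical command{} block (bad_command stands in for any failing command, not something from this tutorial):

# without set -e: rc gets the exit code of the last command, which is 0
bad_command          # fails with a non-zero exit code
echo "finished"      # succeeds, so rc will contain 0

# with set -e: the script stops at the first failure, so rc is non-zero
set -e
bad_command          # script exits here
echo "finished"      # never runs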

These files are only seen in JAWS

stdout.submit

This file is created by script.submit (not by the script file); its content is not useful for debugging your task.

stderr.submit

This file is created by script.submit (not by the script file), which means it may contain useful error messages. If there was a problem upstream that prevented the task from even starting, the error should be in this file.

dockerScript

This file is created by script.submit and runs the script file.

script.submit -> dockerScript -> script

2. Cromwell’s stdout

When you ran align_with_shifter.wdl with cromwell above, look for the following in the output:

  1. the bbmap.sh and samtools commands that were run

  2. paths to the output files from the workflow

  3. a WorkflowSucceededState message

  4. a path to one of the execution directories; copy it and list its contents to see the cromwell generated files alongside your .sam or .bam output

  5. Call-to-Backend lines showing that we are running on the Local backend (the default)

Note

You won’t have access to this same cromwell standard output when you run through JAWS. The same information can be found in different ways.

Limitations when using docker

  1. One docker image per task - this is a general constraint that Cromwell has.

  2. The docker image must be registered with docker hub - this is how we have set up the docker backend configuration.

  3. A sha256 tag must be used instead of a custom tag (e.g. v1.0.1) for call-caching to work.

    To find the sha256 tag, you can use:

    # on a docker-machine
    docker images --digests | grep <your_docker_hub_user_name>
    
    # on a shifter-machine
    shifterimg lookup ubuntu:16.04
    

    The version tag (16.04) can be replaced by the sha256 tag.

    runtime {
        docker: "ubuntu@sha256:20858ebbc96215d6c3c574f781133ebffdc7c18d98af4f294cc4c04871a6fe61"
    }
    

    You can interactively go into a container from shifter with either of:

    shifter --image=id:20858ebbc96215d6c3c574f781133ebffdc7c18d98af4f294cc4c04871a6fe61
    shifter --image=ubuntu:16.04