Using Reference Data in Your WDLs
JAWS offers a place to store large, reusable files (e.g. the NR database) so that they are not copied every time a WDL is submitted.
On Perlmutter, you can create a folder under /global/dna/shared/databases/jaws/refdata/<your-group-name>.
When you add files there, they will automatically be synced to all the JAWS sites. The files will also be accessible within your WDL by using the /refdata/<your-group-name> path.
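The mapping between the two locations can be sketched in a few lines of Python. This is only an illustration of the path relationship described above; the function name to_container_path is hypothetical and not part of JAWS.

```python
from pathlib import PurePosixPath

# Source location on Perlmutter and the root seen inside a WDL task,
# both taken from the text above.
REFDATA_SOURCE = PurePosixPath("/global/dna/shared/databases/jaws/refdata")
REFDATA_MOUNT = PurePosixPath("/refdata")


def to_container_path(source_path: str) -> str:
    """Translate a Perlmutter refdata path into the path to use in a WDL."""
    path = PurePosixPath(source_path)
    # relative_to raises ValueError if the path is not under the refdata root.
    relative = path.relative_to(REFDATA_SOURCE)
    return str(REFDATA_MOUNT / relative)


print(to_container_path(
    "/global/dna/shared/databases/jaws/refdata/ekirton/tiny.fastq"
))
# prints /refdata/ekirton/tiny.fastq
```

A path outside the refdata root (for example one under /pscratch) raises ValueError, which is a convenient early warning that the file will not be visible to your WDL.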
Adding Data to the refdata Directory
You can create your own folders under /global/dna/shared/databases/jaws/refdata, but you must log in to a DTN node to do so (e.g. ssh dtn04), because the directory is read-only on the other nodes.
Your new folders and files must be readable by JAWS, so give them global read permissions (e.g. drwxrwsr-x+ for a folder).
You can copy data from these Perlmutter filesystems: 1) global home, 2) global common, and 3) the Community File System (CFS), but not from /pscratch (unless you use Globus).
Globus is the fastest way to copy data and can also read from /pscratch. Use the NERSC Perlmutter => NERSC DTN endpoints.
Do not use symlinks (e.g. latest -> v10.4). Symlinks are not preserved when the data files are synced between sites.
Besides adding your files to your group folder, you need to copy and paste their full paths into a “manifest file”. You create your own manifest file, and it should be named <USER>_changes.txt. Modifying this file triggers Globus to copy your files to all the other JAWS sites. For example:
If I added /global/dna/shared/databases/jaws/refdata/ekirton/tiny.fastq to jfroula_changes.txt, Globus would initiate a copy to all sites within 20 minutes.
You can also list folders, e.g. adding /global/dna/shared/databases/jaws/refdata/ekirton would copy everything in it.
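Before editing the manifest, it can help to check a folder against the rules above. The following is a minimal sketch, not a JAWS tool: the function names are hypothetical, the manifest is assumed to hold one full path per line, and the manifest location is passed in by the caller because the text does not say where it lives. The permission check follows the "global read" rule above.

```python
import os
import stat
from pathlib import Path


def find_problems(root: Path) -> list[str]:
    """List symlinks and entries without global read permission under root."""
    entries = [root]
    # os.walk does not descend into symlinked directories by default,
    # but it still reports them, so they are caught below.
    for dirpath, dirnames, filenames in os.walk(root):
        entries.extend(Path(dirpath) / name for name in dirnames + filenames)

    problems = []
    for entry in entries:
        mode = entry.lstat().st_mode  # lstat inspects the link itself
        if stat.S_ISLNK(mode):
            problems.append(f"symlink: {entry}")
        elif not mode & stat.S_IROTH:
            problems.append(f"not world-readable: {entry}")
        elif stat.S_ISDIR(mode) and not mode & stat.S_IXOTH:
            problems.append(f"directory not world-searchable: {entry}")
    return problems


def add_to_manifest(manifest: Path, paths: list[str]) -> None:
    """Append full paths to the manifest, skipping ones already listed."""
    for path in paths:
        if not os.path.isabs(path):
            raise ValueError(f"not a full path: {path}")
    text = manifest.read_text() if manifest.exists() else ""
    existing = set(text.splitlines())
    with manifest.open("a") as handle:
        if text and not text.endswith("\n"):
            handle.write("\n")  # keep one path per line
        for path in paths:
            if path not in existing:
                handle.write(path + "\n")
                existing.add(path)


if __name__ == "__main__":
    # Example group folder from the text above; adjust to your own.
    group_dir = Path("/global/dna/shared/databases/jaws/refdata/ekirton")
    if group_dir.exists():
        for problem in find_problems(group_dir):
            print(problem)
```

Running find_problems first and calling add_to_manifest only when it returns an empty list avoids triggering a transfer for files that would fail validation.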
How it Works
A daemon running in the background checks every 20 minutes for modifications to any <USER>_changes.txt file. If a new file is created, or the contents of an existing one change, the daemon validates that each listed path:
is a full path
exists
is readable by the jaws user (jaws is part of the genome group, so 440 would be the minimum permissions).
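The three checks listed above can be approximated locally before you touch the manifest. This is a sketch under stated assumptions, not the daemon's actual code: validate_manifest is a hypothetical name, the manifest is assumed to hold one path per line, and os.access tests readability for the user running the script rather than for the jaws user.

```python
import os
from pathlib import Path


def validate_manifest(manifest: Path) -> dict[str, list[str]]:
    """Map each problematic manifest entry to the checks it fails."""
    failures = {}
    for line in manifest.read_text().splitlines():
        path = line.strip()
        if not path:
            continue  # ignore blank lines
        reasons = []
        if not os.path.isabs(path):
            reasons.append("not a full path")
        if not os.path.exists(path):
            reasons.append("does not exist")
        elif not os.access(path, os.R_OK):
            # Assumption: checks the current user, not the jaws user.
            reasons.append("not readable")
        if reasons:
            failures[path] = reasons
    return failures
```

An empty dictionary means every entry passed these local checks. Because readability is tested for your own account, a clean result does not guarantee that the jaws user can read the files.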
The daemon gathers all the paths from every <USER>_changes.txt file that has been modified and sends them to the Globus API for transfer. Another daemon monitors the transfer via the Globus API and periodically writes to a log. This log is watched by our monitoring system (the ElasticSearch Stack) so that you’ll be able to see the status on the JAWS dashboard (not available yet). Finally, the JAWS team is alerted if a transfer fails.
Removing Files
When you want to remove files, just delete them from your group folder. You don’t need to do anything else: once a month, Globus copies all folders and, in doing so, deletes anything on the destination site that is not on the source site.
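The effect of that monthly copy can be pictured as a set difference between the two directory trees. The sketch below only illustrates the idea on local folders; it is not how Globus performs the sync, and both function names are hypothetical.

```python
import os
from pathlib import Path


def relative_files(root: Path) -> set[str]:
    """Collect every file under root as a path relative to root."""
    return {
        str(Path(dirpath, name).relative_to(root))
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    }


def extras_on_destination(source: Path, destination: Path) -> set[str]:
    """Files present on the destination but no longer on the source."""
    return relative_files(destination) - relative_files(source)
```

Anything returned by extras_on_destination corresponds to a file you deleted from your group folder, which is what the monthly copy would remove from the other sites.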
How to use refdata in your WDLs
Use /refdata in your WDLs as the root. For example, if you wanted to run a blast command in your WDL, you would point to the database like: blastn -db /refdata/nt_test/nt where nt_test is where you saved all the blast index files.
Hint
In your WDL, the input type for refdata files should be specified as String and not File. Variables specified with File are copied into Cromwell’s working directory, and since /refdata doesn’t exist outside the container, JAWS will fail to validate the path and you’ll get an error.
Example
WDL Example
version 1.0

workflow refdata_wf {
    call task1 { }
}

task task1 {
    command <<<
        # How to access reference data. The command is being run in a
        # docker container and the path to refdata outside the
        # container is mounted as "/refdata" inside the container. The
        # mounting of which happens in the cromwell config file.
        ls /refdata/nt_test
    >>>

    runtime {
        docker: "ubuntu:latest"
        cpu: 1
        memory: "1G"
    }

    output {
        String outfile = stdout()
    }
}