HTCondor Backend Configuration Options When Creating WDLs
Use the following table to help figure out how to configure your runtime{} section.
How to Allocate Resources in your Runtime Section
HTCondor is the backend to Cromwell and is responsible for requesting an appropriately sized resource from Slurm for each WDL task. HTCondor determines what resources your task needs from only the memory and cpu values set in the runtime{} section. In fact, memory and cpu have defaults of "5G" and 2 (threads), respectively, so you don't have to include them, but doing so is advised for transparency's sake.
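For example, a runtime{} section that states those defaults explicitly looks like this:

runtime {
    memory: "5G"  # default if omitted
    cpu: 2        # default if omitted; interpreted as threads by HTCondor
}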
Note
Inside the runtime{} section of your WDL, the value you give cpu should be the number of threads, not CPUs; despite the name, HTCondor interprets that value as threads.
Table of available resources
| Site | Type | #Nodes | Mem (GB)* | Hrs | #Threads |
|---|---|---|---|---|---|
| Perlmutter (NERSC) | Large | 3072 | 492 | 24 | 128 |
| | Xlarge*** | 4 | 980 | 24 | 128 |
| JGI (Lab-IT) | Small | 316 | 46 | 72 | 32 |
| | Medium | 72 | 236 | 72 | 32 |
| | Large | 8 | 492 | 72 | 32 |
| Dori (Lab-IT) | Large | 100 | 492 | 72 | 64 |
| | Xlarge*** | 18 | 1500 | 72 | 36 |
| Tahoma (EMSL) | Medium | 160 | 364 | 48 | 36 |
| | Xlarge | 24 | 1480 | 48 | 36 |
| AWS | – | 100 | 236 | – | 64** |
* This is the number of gigabytes you can actually use, accounting for overhead. For example, a Dori "large" node is advertised as 512G, but since there is overhead, we reserve 10-20G and instead ask for 492G in our WDL.
** AWS is a valid site for JAWS. However, since it uses its own scheduling system, simply specify the memory and cpu requirements for each task in the runtime section.
*** Xlarge compute nodes are not yet available for user jobs.
Links to documentation about each cluster
Runtime Examples
Note
Remember that in your runtime{} section, the number you give cpu: is interpreted by HTCondor as threads, not CPUs.
What would the runtime{} section look like if your task required 8 threads and 5G of RAM?
runtime {
    memory: "5G"
    cpu: 8
}
On a Dori node (492G usable RAM, 64 threads), you could run 8 such tasks in parallel (64 threads / 8 threads per task = 8 tasks), and each task would have more than the required 5G of RAM (492G / 8 = 61.5G).
What happens if I request 64 threads but only 2G of the possible 492G of RAM?
A common Dori node has 492G of RAM and 64 threads. So what happens if you request:
runtime {
    memory: "2G"
    cpu: 64
}
Will you be restricted to 2G, or will you have access to 492G? Since we don't have any memory limits in place in HTCondor, you would be allowed to use all 492G on the node. The reverse is true as well: if you ask for 492G but only 2 threads (cpu: 2), you will still have access to all 64 threads.
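For context, a complete task using that runtime might look like the following sketch (the task name, command, and output are illustrative, not part of JAWS):

version 1.0

task count_lines {
    input {
        File infile
    }
    command <<<
        # hypothetical command; any tool would work here
        wc -l ~{infile}
    >>>
    output {
        String n_lines = read_string(stdout())
    }
    runtime {
        memory: "2G"  # no hard memory cap is enforced by HTCondor
        cpu: 64       # interpreted as threads
    }
}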
How should I set the runtime{} section when I want to run many scattered tasks across multiple nodes?
For example, if you request ~500 tasks, each with this runtime:
runtime {
    memory: "8G"
    cpu: 4
}
HTCondor will put them in the queue, and the pool manager will start acquiring new nodes. If we had access to a maximum of 30 nodes, it would grab all 30 and start running as many parts of the scatter as it can. If nothing else is running, you could have 480 tasks running at the same time, since each 64-thread node fits 64 / 4 = 16 tasks, and 30 nodes × 16 tasks = 480.
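As a sketch, a scatter of that shape could be written like this (the workflow, task, and input names are illustrative):

version 1.0

workflow many_samples {
    input {
        Array[File] samples  # ~500 input files, one task per file
    }
    scatter (sample in samples) {
        call process_sample { input: infile = sample }
    }
}

task process_sample {
    input {
        File infile
    }
    command <<<
        # hypothetical per-sample work
        md5sum ~{infile}
    >>>
    output {
        String result = read_string(stdout())
    }
    runtime {
        memory: "8G"  # matches the example above
        cpu: 4        # interpreted as threads by HTCondor
    }
}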