JAWS Troubleshooting Roadmap

When you receive a notification that your JAWS run has failed, here's a step-by-step guide on what to do next.

Inspect the jaws log

The first step in debugging a failed JAWS run is to inspect the output of the jaws log command, which provides key information about the stage at which the run encountered an issue.

Run the jaws log command

Run the following command to inspect the log output for a specific run:

 jaws log <RUN_ID>
#STATUS_FROM           STATUS_TO           TIMESTAMP            COMMENT
created                upload queued       2024-04-08 11:56:44
upload queued          upload complete     2024-04-08 11:56:55
upload complete        ready               2024-04-08 11:57:06
ready                  submitted           2024-04-09 14:09:47
submitted              queued              2024-04-09 14:09:58
queued                 running             2024-04-09 14:10:42
running                succeeded           2024-04-16 19:02:43
succeeded              complete            2024-04-16 19:03:00
complete               finished            2024-04-16 19:03:10
finished               download succeeded  2024-04-16 19:03:22
download succeeded     fix perms queued    2024-04-16 19:03:33
fix perms queued       fix perms complete  2024-04-16 19:03:43
fix perms complete     sync complete       2024-04-16 19:03:53
sync complete          slack succeeded     2024-04-16 19:04:04
slack succeeded        done                2024-04-16 19:04:14

The log displays a sequence of transitions between different stages of the run, along with timestamps and comments to help you understand where the issue occurred.
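
If a run has many transitions, you can spot the failing step mechanically: a failure shows up as a line whose STATUS_TO column ends in "failed" immediately before the timestamp. A minimal sketch over a saved copy of the log (run.log and its contents below are a fabricated stand-in; replace them with real jaws log <RUN_ID> output):

```shell
# Sketch: find the failing transition in saved `jaws log` output.
# run.log stands in for the real output of `jaws log <RUN_ID>`.
cat > run.log <<'EOF'
#STATUS_FROM     STATUS_TO        TIMESTAMP            COMMENT
created          upload failed    2024-08-13 16:08:15  No transfer method known for Transfer 3258
upload failed    slack succeeded  2024-08-13 16:08:25
slack succeeded  done             2024-08-13 16:08:35
EOF

# Match "failed" only where it ends the STATUS_TO column,
# i.e. where it is immediately followed by the timestamp.
grep -E 'failed +[0-9]{4}-' run.log
```

The printed line's COMMENT column then tells you which scenario below applies.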

Understanding the JAWS Run Stages

Below is a breakdown of the typical stages of a successful JAWS run:

  1. created → upload queued: The run is created and input data is queued for transfer to the compute site using Globus.

  2. upload queued → upload complete: Data transfer is completed successfully.

  3. upload complete → ready: The compute site has received the input data and is preparing to start the run.

  4. ready → submitted: The run has been successfully submitted to Cromwell for task execution.

  5. submitted → queued: Tasks are submitted to the compute cluster (HTCondor) and queued for execution.

  6. queued → running: Tasks are actively being processed on the compute site.

  7. running → succeeded: The tasks have completed successfully in Cromwell.

  8. succeeded → complete: The run has finished executing.

  9. complete → finished: The run is fully complete, and JAWS is updating the jaws tasks information.

  10. finished → download queued: Output data is queued for transfer to the JAWS Teams Directory.

  11. download queued → download complete: Output data transfer is complete.

  12. download complete → fix perms complete: File permissions are verified and adjusted if necessary.

  13. fix perms complete → sync complete: The output sync has completed.

  14. sync complete → slack succeeded: A Slack notification is sent.

  15. slack succeeded → done: The run is fully complete.

Common Non-Cromwell Failure Scenarios

Scenario 1: Failed at Globus Transfer

After your run is created, JAWS will transfer the input data from the input site to the compute site using Globus. If an error occurs in this stage, you might see something like the following (note that other issues with Globus can also occur, leading to different error messages):

jaws log <RUN_ID>
#STATUS_FROM     STATUS_TO        TIMESTAMP            COMMENT
created          upload failed    2024-08-13 16:08:15  No transfer method known for Transfer 3258
upload failed    slack succeeded  2024-08-13 16:08:25
slack succeeded  done             2024-08-13 16:08:35

Explanation: This error message indicates that the job failed during the transfer of input data from the input site to the compute site.

Relevant Error Files: In this case, no cromwell-executions folder was created, so the only error information is available through the jaws log command.

Action: This could be caused by issues with the Globus endpoint or network instability. Contact the JAWS team on Slack (#jaws channel) with the log details and the RUN_ID, and the team will help you resolve it.

Scenario 2: Failed at Submitting to Cromwell

Once the input data is transferred, JAWS Site will submit the run to Cromwell. If an error occurs, you might see something like:

jaws log <RUN_ID>
#STATUS_FROM      STATUS_TO          TIMESTAMP           COMMENT
created           upload queued      2024-07-01 15:23:10
upload queued     upload complete    2024-07-01 15:23:10
upload complete   submission failed  2024-07-01 15:23:43 Server timeout: The service is unable to respond at this time; please try again later.
submission failed slack succeeded    2024-07-01 15:24:03
slack succeeded   done               2024-07-01 15:24:19

Explanation: This log shows that the job failed during the submission to Cromwell, which indicates a temporary issue with the server, possibly due to high load or network problems.

Relevant Error Files: In this case, no cromwell-executions folder was created, so the only error information is available through the jaws log command.

Action: Please contact the JAWS Team for further assistance. You can reach out via Slack in the #jaws channel.

Scenario 3: Failed at Transferring the Outputs

Another error can occur while JAWS is transferring the output data to the JAWS Teams Directory. JAWS also uses Globus for this transfer, and you might find a message like the following:

jaws log <RUN_ID>
#STATUS_FROM        STATUS_TO           TIMESTAMP            COMMENT
created             upload queued       2024-04-08 11:56:44
upload queued       upload complete     2024-04-08 11:56:55
upload complete     ready               2024-04-08 11:57:06
ready               submitted           2024-04-09 14:09:47
submitted           queued              2024-04-09 14:09:58
queued              running             2024-04-09 14:10:42
running             succeeded           2024-04-16 19:02:43
succeeded           complete            2024-04-16 19:03:00
complete            finished            2024-04-16 19:03:10
finished            download failed     2024-04-16 19:03:22  ('POST', 'https://transfer.api.globus.org/v0.10/transfer', 'Bearer', 502, 'ExternalError', "Error validating login to endpoint 'NERSC Perlmutter jaws Collab (5b869795-a6f8-4a87-9272-7c1851c25033)', Error (connect)\nEndpoint: NERSC Perlmutter jaws Collab (5b869795-a6f8-4a87-9272-7c1851c25033)\nServer: 128.55.64.33:443\nMessage: The operation timed out\n", 'vmKIUrP6K')
download failed     fix perms queued    2024-04-16 19:03:33
fix perms queued    fix perms complete  2024-04-16 19:03:43
fix perms complete  sync complete       2024-04-16 19:03:53
sync complete       slack succeeded     2024-04-16 19:04:04
slack succeeded     done                2024-04-16 19:04:14

Explanation: The run succeeded, but the output data transfer to the JAWS Teams Directory failed.

Relevant Error Files: In this case, a cromwell-executions folder was created. However, this error occurred after the Cromwell execution finished, so the only error information is available through the jaws log command.

Action: This is likely a network or Globus issue. Run jaws download to attempt the transfer again. If the problem persists, contact the JAWS team.

Common Cromwell Failure Scenarios

Scenario 4: Failed at Cromwell Execution

Explanation: The Cromwell execution failed, which means one or more tasks did not complete successfully. Example:

jaws log <RUN_ID>
#STATUS_FROM      STATUS_TO         TIMESTAMP            COMMENT
created           upload queued     2024-08-13 15:56:48
upload queued     upload complete   2024-08-13 15:58:30
upload complete   ready             2024-08-13 15:58:43
ready             submitted         2024-08-13 15:58:46
submitted         queued            2024-08-13 15:59:02
queued            running           2024-08-13 15:59:47
running           failed            2024-08-13 16:02:15  Cromwell execution failed
failed            complete          2024-08-13 16:02:47
complete          finished          2024-08-13 16:03:00
finished          download skipped  2024-08-13 16:03:05
download skipped  slack succeeded   2024-08-13 16:03:15
slack succeeded   done              2024-08-13 16:03:25

Action: This case requires further investigation. Let's explore the other JAWS commands.

Cromwell Execution Failed: What to do next?

If the jaws log indicates that the Cromwell execution failed, the next step is to investigate the specific tasks that failed. This can be done using the jaws tasks command, which provides detailed information about the status of each task within the workflow.

Inspect the jaws tasks

The jaws tasks command provides detailed information about which tasks were executed, including their names, their statuses, and their return codes.

Run the jaws tasks command
jaws tasks <RUN_ID>
#TASK_DIR   JOB_ID  STATUS  QUEUE_START          RUN_START            RUN_END              QUEUE_MIN  RUN_MIN  CACHED  TASK_NAME               REQ_CPU  REQ_GB  REQ_MIN  CPU_HRS  RETURN_CODE
call-task1  532741  failed  2024-09-13 12:03:43  2024-09-13 12:10:43  2024-09-13 12:10:46  7          0        False   runblastplus_sub.task1  1        1       20       0.0      2
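
To pick out the failed tasks from a long listing, note that STATUS is the third column and RETURN_CODE is the last. A minimal sketch, assuming the jaws tasks output has been saved to a file (tasks.txt below is a stand-in that reuses the example above):

```shell
# Sketch: list failed tasks and their return codes from saved `jaws tasks`
# output. tasks.txt stands in for the real output of `jaws tasks <RUN_ID>`.
cat > tasks.txt <<'EOF'
#TASK_DIR   JOB_ID  STATUS  QUEUE_START          RUN_START            RUN_END              QUEUE_MIN  RUN_MIN  CACHED  TASK_NAME               REQ_CPU  REQ_GB  REQ_MIN  CPU_HRS  RETURN_CODE
call-task1  532741  failed  2024-09-13 12:03:43  2024-09-13 12:10:43  2024-09-13 12:10:46  7          0        False   runblastplus_sub.task1  1        1       20       0.0      2
EOF

# STATUS is column 3; RETURN_CODE is the last column ($NF).
awk '$3 == "failed" {print $1, "rc=" $NF}' tasks.txt
```

The TASK_DIR printed here is also the name of the call directory to look for under cromwell-executions.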

Key scenarios when troubleshooting with jaws tasks:

Scenario 1: All tasks succeeded, but the run status reports failed

Explanation: In some cases, jaws tasks may report that all tasks succeeded, but the overall run still failed.

Possible Cause: This can happen due to external issues unrelated to the task execution itself; for example, Cromwell was unable to find an expected output file.

Relevant Error Files: Inspect the error.json file for more details.

How to find the error.json file?
  • Run the jaws status <RUN_ID> command: First, find the output directory for your run:

jaws status 86698
{
   "compute_site_id": "perlmutter",
   "cpu_hours": 0.0,
   "cromwell_run_id": "e2a3b977-0d73-4478-8613-56e601d166ce",
   "id": 86698,
   "input_site_id": "perlmutter",
   "json_file": "/global/u1/d/dcassol/JAWS/jaws-tutorial-examples/5min_example/inputs.json",
   "output_dir": "/pscratch/sd/j/jaws/perlmutter-prod/dsi-aa/dcassol/86698/e2a3b977-0d73-4478-8613-56e601d166ce",
   "result": "failed",
   "status": "done",
   "status_detail": "The run is complete.",
   "submitted": "2024-09-19 14:54:58",
   "tag": null,
   "team_id": "dsi-aa",
   "updated": "2024-09-19 15:06:13",
   "user_id": "dcassol",
   "wdl_file": "/global/u1/d/dcassol/JAWS/jaws-tutorial-examples/5min_example/align_final.wdl",
   "workflow_name": "bbtools",
   "workflow_root": "/pscratch/sd/j/jaws/perlmutter-prod/cromwell-executions/bbtools/e2a3b977-0d73-4478-8613-56e601d166ce"
}
  • Look for the output_dir field in the command output. This field will provide the path to the directory where the output and error files are stored.

  • Navigate to the output_dir directory to locate the error.json file. This file contains detailed information about the errors that occurred during execution extracted from the Cromwell Metadata logs.
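
The steps above can be scripted: save the jaws status output, pull out output_dir, and append error.json. A minimal sketch (status.json is a fabricated stand-in for real output; the sed expression is a stand-in for a JSON parser, and `jq -r '.output_dir'` also works if jq is available):

```shell
# Sketch: pull output_dir out of saved `jaws status` JSON and build the
# path to error.json. status.json stands in for `jaws status <RUN_ID>` output.
cat > status.json <<'EOF'
{
   "id": 86698,
   "output_dir": "/pscratch/sd/j/jaws/perlmutter-prod/dsi-aa/dcassol/86698/e2a3b977-0d73-4478-8613-56e601d166ce",
   "result": "failed"
}
EOF

# Extract the output_dir value and print the error.json path.
out_dir=$(sed -n 's/.*"output_dir": "\([^"]*\)".*/\1/p' status.json)
echo "$out_dir/error.json"
```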

Action: If the error.json file shows a message about missing output files, there are two common reasons:

  1. Filesystem Instability: The task generated the output file, but Cromwell couldn't find it due to a temporary filesystem issue.

    Solution: Use the jaws resubmit command. This will leverage task caching, meaning the previous outputs will be reused, and the resubmission should succeed. If the issue persists, contact the JAWS team for further assistance.

  2. Output File Not Generated: The task didn’t create the output file, even though the task itself returned a success code (return code 0).

    Solution: Inspect the stderr file to understand why the file wasn't created despite the successful execution. This may indicate that the command stanza is not catching an exception, or an issue with the input data.

Scenario 2: jaws tasks shows a failed task

Let's investigate which return code the task produced. See the next section to understand the return codes.

How to Get to the cromwell-executions Folder for the Failed Task?

If the input site is the same as the compute site, you can access the workflow_root folder directly. If the input site is different from the compute site, you need to download the failed cromwell-executions folders to your input site using the jaws download command.

jaws download 86698
{
   "download_id": 40351,
   "id": 86698,
   "status": "download queued"
}

Once the download completes, you can locate the cromwell-executions folder inside the JAWS Teams directory (output_dir). Within the cromwell-executions folder, you will find important logs like stderr, stdout, and rc (return code) files. These logs provide detailed information about task failures.
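
Once the folder is available, a quick way to home in on the failing task is to look for rc files whose content is non-zero; the stderr and stdout for that task sit in the same execution directory. A minimal sketch over a fabricated directory layout (the workflow, run ID, and task names below are illustrative only):

```shell
# Sketch: build a fabricated cromwell-executions tree to illustrate the scan.
mkdir -p cromwell-executions/bbtools/e2a3b977/call-alignment/execution
echo 2 > cromwell-executions/bbtools/e2a3b977/call-alignment/execution/rc
mkdir -p cromwell-executions/bbtools/e2a3b977/call-samtools/execution
echo 0 > cromwell-executions/bbtools/e2a3b977/call-samtools/execution/rc

# Print every rc file whose content is not 0; the task's stderr and stdout
# live in the same execution/ directory as each rc file found here.
find cromwell-executions -name rc | while read -r f; do
  if [ "$(cat "$f")" != "0" ]; then echo "$f: $(cat "$f")"; fi
done
```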

Inspect Cromwell Return Codes

For each task or shard executed in Cromwell, a return code is assigned that indicates whether the task succeeded or failed. These return codes help us understand the status of individual tasks within a workflow and provide insight into potential issues that require troubleshooting.

How to find the return code?

You can inspect return codes using the jaws tasks command, which provides a RETURN_CODE column for each task.

Additionally, each Cromwell execution folder contains a rc file that stores the return code of the corresponding task.

This guide focuses on understanding these return codes, with particular emphasis on return code 79, which is commonly used by Cromwell.

Common Return Codes and Their Meanings

#     Description
0     The task completed successfully.
1     A generic error occurred during task execution. This is often caused by issues in the task script. Check the stderr for the root cause.
2     The task encountered a specific error related to incorrect input or improper usage of the task script. Check the stderr for the root cause.
255   Typically signifies an abnormal termination, indicating that the process did not exit as expected.
137   The task exceeded the allocated memory and was terminated. Adjust the memory request in the WDL or task configuration to resolve this issue.
79    Cromwell sets return code 79 when the task is terminated by Cromwell. This can happen due to several reasons, which are detailed below.
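
As a quick reference, the table above can be folded into a small script that interprets a task's rc file. A sketch with a fabricated rc file (real ones live under the task's execution directory):

```shell
# Sketch: interpret a task's rc file using the table above.
# The rc file here is fabricated; a real one lives under
# cromwell-executions/<Workflow_Name>/<ID>/call-<task>/execution/rc.
echo 137 > rc

case "$(cat rc)" in
  0)   echo "success" ;;
  79)  echo "terminated by Cromwell itself" ;;
  137) echo "out of memory: raise the memory request in the WDL" ;;
  *)   echo "task error: check the stderr file" ;;
esac
```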

Understanding Return Code 79

Return code 79 is a special code used by Cromwell to indicate that the task was terminated by Cromwell itself.

According to Cromwell’s code comments:

If SIGTERM, SIGKILL or SIGINT codes are used, cromwell will assume the job has been aborted by cromwell. Since it is arbitrary which code is chosen from that range, and it has to relate with the unpleasant business of 'killing'. 79 was chosen.

This can happen for several reasons, including file system instability, container issues, or the failure to generate expected output files.

Container Image Not Found (*.sif file not found):
  • Symptom: The task fails with return code 79 because the container image file (e.g., .sif file) was not found during execution.

  • Cause: File system instability while pulling the image. The task will not start running, and Cromwell will set return code 79.

  • Action: Check the stderr.submit file for error messages. Use jaws resubmit to retry the run. If the issue persists, contact the JAWS team for further investigation.

File System Instability and Retry Logic in JAWS:
  • Symptom: The task is terminated with return code 79 due to file system instability during the script execution.

  • Action: JAWS automatically triggers a retry mechanism when the first return code is 79.

  • Example:

cat cromwell-executions/<Workflow_Name>/79272074-ae28-40cd-8715-cb109a73b6e2/call-helloWDL/execution/stderr.submit
2024-09-15 20:48:55: ERROR: task execution failed with the return code 79 that may be caused by a system issue. Retrial: 1

cat cromwell-executions/<Workflow_Name>/79272074-ae28-40cd-8715-cb109a73b6e2/call-helloWDL/execution/rc.1
79

cat cromwell-executions/<Workflow_Name>/79272074-ae28-40cd-8715-cb109a73b6e2/call-helloWDL/execution/stdout.submit
Submitting job(s). 1 job(s) submitted to cluster 1711957.
2024-09-15 20:49:56: INFO: task execution successful after 2 attempt(s).  Return code = 0
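
The retry history of a task can be reconstructed from these files: the rc.<N> files (rc.1 in the example above) record earlier attempts, while the rc file holds the final code. A sketch over a fabricated execution directory that mirrors the example:

```shell
# Sketch: print each attempt's return code for a retried task.
# The directory and file contents below are fabricated for illustration.
exec_dir=cromwell-executions/hello/79272074/call-helloWDL/execution
mkdir -p "$exec_dir"
echo 79 > "$exec_dir/rc.1"   # first attempt: system issue, retried by JAWS
echo 0  > "$exec_dir/rc"     # final attempt: success

for f in "$exec_dir"/rc.1 "$exec_dir"/rc; do
  echo "$(basename "$f"): $(cat "$f")"
done
```
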
Expected Output File Not Found:
  • Symptom: Cromwell cannot find the expected output file after the task execution, and the return code 79 is set in the rc file. You will find the error message in the error.json file.

  • This can happen in two scenarios:

    • Expected files were generated:

      The file system instability prevented Cromwell from confirming the output files, resulting in return code 79. Usually, the return code from the script was 0, and because Cromwell was not able to confirm the output files, it set the return code to 79.

      Action: In this case, a jaws resubmit <RUN_ID> will usually resolve the issue. Please confirm that the expected output files were generated before resubmitting the run.

    • Expected files were not generated:

      The script failed to create the output files. Cromwell may have set return code 79 because it couldn't find the output, even though the script returned 0.

      Action: Investigate the stderr and the script's logic. It may involve an uncaught exception. This might require editing the WDL or input data, followed by a jaws submit to rerun the workflow.