
How to Debug a Failed Job

Systematically diagnose why a job failed.

Step 1: Identify the Failed Job

torc jobs list <workflow_id> --status failed

Note the job ID and name.

Step 2: Check the Exit Code

torc results get <workflow_id> --job-id <job_id>

Common exit codes:

Code   Meaning
1      General error
2      Misuse of shell command
126    Permission denied
127    Command not found
137    Killed (SIGKILL) — often OOM
139    Segmentation fault
143    Terminated (SIGTERM)
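
Codes above 128 follow the POSIX convention of 128 plus the signal number: 137 = 128 + 9 (SIGKILL) and 143 = 128 + 15 (SIGTERM). A quick sketch to see this in any shell:

# Kill a process with SIGKILL and inspect the reported exit code
sleep 60 &
kill -9 $!
wait $!
echo $?   # prints 137 (128 + 9)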

Step 3: Read the Logs

# Get log paths
torc reports results <workflow_id> --job-id <job_id>

# View stderr (usually contains error messages)
cat output/job_stdio/job_wf43_j15_r1_a1.e

# View stdout
cat output/job_stdio/job_wf43_j15_r1_a1.o
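
When a job produces a lot of output, grepping the stderr files for common failure signatures is faster than reading them end to end. A minimal sketch, assuming the job_stdio naming pattern shown above:

# Scan stderr files for common failure signatures
grep -inE 'error|exception|traceback|killed|denied' output/job_stdio/*.e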

Step 4: Check Resource Usage

Did the job exceed its resource limits?

torc reports check-resource-utilization <workflow_id>

Look for:

  • Memory exceeded — Job was likely OOM-killed (exit code 137); see the kernel-log check below
  • Runtime exceeded — Job was terminated for running too long
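
If you suspect an OOM kill and have shell access to the compute node, the kernel log records it. A sketch using standard Linux tools (needs appropriate permissions; on Slurm clusters the message may instead land in the Slurm job output):

# Look for OOM-killer entries in the kernel ring buffer
dmesg -T | grep -i 'out of memory'

# Or, on systemd-based nodes
journalctl -k | grep -i 'oom'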

Step 5: Reproduce Locally

Get the exact command that was run:

torc jobs get <job_id>

Try running it manually to see the error:

# Copy the command from the output and run it
python process.py --input data.csv
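
To capture the same failure the scheduler saw, record the exit code, and if you suspect a memory problem, rerun under a tightened limit. A sketch, reusing the example command above:

# Run the command and record its exit code
python process.py --input data.csv
echo "exit code: $?"

# Optionally reproduce an OOM-style failure by capping virtual memory (KiB);
# the subshell keeps the limit from affecting your session
( ulimit -v 2097152; python process.py --input data.csv )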

Common Fixes

Problem             Solution
OOM killed          Increase memory in resource requirements
File not found      Verify input files exist, check dependencies
Permission denied   Check file permissions, execution bits
Timeout             Increase runtime in resource requirements
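
For the file and permission rows, a quick pre-flight check from the job's working directory catches most cases. A minimal sketch, reusing the data.csv input and process.py script from the earlier example:

# Verify the input is readable and the script is executable
ls -l data.csv process.py
test -r data.csv || echo "data.csv missing or unreadable"
test -x process.py || chmod +x process.py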

Step 6: Fix and Retry

After fixing the issue:

# Reinitialize to reset failed jobs
torc workflows reset-status --failed --reinitialize <workflow_id>

# Run again locally...
torc workflows run <workflow_id>

# ...or resubmit to Slurm
torc submit-slurm <workflow_id>
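
If you iterate on a fix, the reset and rerun can be chained so the rerun only starts when the reset succeeds:

# Reset failed jobs and rerun in one step (&& stops on reset failure)
torc workflows reset-status --failed --reinitialize <workflow_id> && torc workflows run <workflow_id>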
