Slurm Exit Codes

Torc sets per-step walltimes via srun --time, which produces deterministic exit codes that you can inspect with torc results list and torc slurm sacct.

Exit Code Reference

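╭──────────────────────────┬───────────┬───────────────┬─────────────┬───────────────────────────────────────╮
│ Scenario                 │ Exit Code │ Slurm State   │ Torc Status │ Description                           │
├──────────────────────────┼───────────┼───────────────┼─────────────┼───────────────────────────────────────┤
│ Out of memory            │ 137       │ OUT_OF_MEMORY │ failed      │ Exceeded --mem cgroup limit (SIGKILL) │
│ Timeout, SIGTERM handled │ 0         │ COMPLETED     │ completed   │ Caught SIGTERM, saved state, exited   │
│ Timeout, SIGKILL         │ 152       │ TIMEOUT       │ terminated  │ Did not exit before --time limit      │
╰──────────────────────────┴───────────┴───────────────┴─────────────┴───────────────────────────────────────╯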
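
The nonzero codes in this reference follow the common convention of 128 plus a signal number. As a minimal sketch (not part of Torc; the helper name signal_from_return_code is hypothetical), the following Python snippet decodes a return code into a signal name using the standard signal module. Signal numbering is platform-dependent, so it should be run on the same kind of system as the compute nodes.

```python
import signal
from typing import Optional


def signal_from_return_code(return_code: int) -> Optional[str]:
    """Return the signal name encoded in a 128 + N return code, or None."""
    if return_code <= 128:
        return None  # ordinary exit status, not death by signal
    try:
        return signal.Signals(return_code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


# Decode the three return codes listed in the reference.
for code in (0, 137, 152):
    print(code, signal_from_return_code(code))
```

A return code of 0 decodes to None because the process exited on its own rather than being killed by a signal.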

Out of Memory (exit code 137)

The job exceeded its --mem cgroup limit, so the kernel OOM killer acting on the job's cgroup sent SIGKILL (signal 9). The return code is 128 plus the signal number: 137 = 128 + 9.

$ torc results list $WORKFLOW_ID
╭────┬──────────┬─────────┬─────────────╮
│ ID │ Job Name │ Status  │ Return Code │
├────┼──────────┼─────────┼─────────────┤
│ 1  │ train    │ failed  │ 137         │
╰────┴──────────┴─────────┴─────────────╯

$ torc slurm sacct $WORKFLOW_ID
╭──────────────────────┬───────────────┬──────────╮
│ Step Name            │ State         │ MaxRSS   │
├──────────────────────┼───────────────┼──────────┤
│ wf1_j1_r1_a1         │ OUT_OF_MEMORY │ 4096000K │
╰──────────────────────┴───────────────┴──────────╯

Fix: Increase memory in resource requirements, or use torc reports check-resource-utilization --correct to auto-adjust based on peak usage.

Timeout with Graceful Shutdown (exit code 0)

The job received SIGTERM via srun --signal, saved a checkpoint, and called sys.exit(0). From Slurm's perspective, the job completed normally.

$ torc results list $WORKFLOW_ID
╭────┬──────────┬───────────┬─────────────╮
│ ID │ Job Name │ Status    │ Return Code │
├────┼──────────┼───────────┼─────────────┤
│ 1  │ simulate │ completed │ 0           │
╰────┴──────────┴───────────┴─────────────╯

This is the expected outcome when using srun_termination_signal. The job handled the signal correctly but did not finish all its work. Reinitialize and re-submit to continue from the checkpoint:

torc workflows reinitialize $WORKFLOW_ID
torc workflows submit $WORKFLOW_ID

See the Graceful Job Termination tutorial for a complete example with a Python signal handler.
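
For orientation, the behavior described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tutorial's code: ShutdownFlag, run_steps, and the JSON checkpoint layout are hypothetical names chosen for the example. The handler only records the request; the work loop polls the flag, writes a checkpoint, and the script then exits with code 0.

```python
import json
import signal
import sys
from pathlib import Path
from typing import Callable


class ShutdownFlag:
    """Callable signal handler that only records that shutdown was requested."""

    def __init__(self) -> None:
        self.requested = False

    def __call__(self, signum, frame) -> None:
        self.requested = True  # keep the handler trivial; the loop does the rest


def run_steps(flag: ShutdownFlag, checkpoint: Path, total_steps: int,
              work: Callable[[int], None] = lambda step: None) -> int:
    """Resume from the checkpoint, run until done or asked to stop, save progress."""
    step = 0
    if checkpoint.exists():
        step = json.loads(checkpoint.read_text())["next_step"]
    while step < total_steps and not flag.requested:
        work(step)
        step += 1
    # Hypothetical checkpoint format: the index of the next step to run.
    checkpoint.write_text(json.dumps({"next_step": step}))
    return step


def main() -> None:
    flag = ShutdownFlag()
    signal.signal(signal.SIGTERM, flag)  # register before starting work
    run_steps(flag, Path("checkpoint.json"), total_steps=1000)
    sys.exit(0)  # exit 0 even when interrupted, so Slurm records COMPLETED
```

A real job script would call main() as its entry point. Because the next run reads the checkpoint first, re-submitting the workflow continues from the saved step instead of starting over.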

Timeout without Handler (exit code 152)

The job did not exit before the step's --time limit. Slurm sent SIGTERM, waited KillWait seconds (typically 30s, configured in slurm.conf), then sent SIGKILL. The reported return code is 152 = 128 + 24, where signal 24 is SIGXCPU on Linux.

$ torc results list $WORKFLOW_ID
╭────┬──────────┬────────────┬─────────────╮
│ ID │ Job Name │ Status     │ Return Code │
├────┼──────────┼────────────┼─────────────┤
│ 1  │ train    │ terminated │ 152         │
╰────┴──────────┴────────────┴─────────────╯

$ torc slurm sacct $WORKFLOW_ID
╭──────────────────────┬─────────┬──────────╮
│ Step Name            │ State   │ MaxRSS   │
├──────────────────────┼─────────┼──────────┤
│ wf1_j1_r1_a1         │ TIMEOUT │ 2048000K │
╰──────────────────────┴─────────┴──────────╯

Fix:

  • Add a SIGTERM handler using the shutdown-flag pattern
  • Set srun_termination_signal to give more lead time (e.g., "TERM@300" for 5 minutes)
  • Increase the allocation walltime

Why Torc Sets --time

Without srun --time, steps inherit the allocation's walltime. When the allocation expires, Slurm cancels all steps with State=CANCELLED, which is ambiguous: it could mean that the user canceled the job, that an administrator preempted it, or that time ran out.

By setting --time to the remaining allocation time (rounded down to whole minutes), Torc ensures the step times out before the allocation expires. This produces the unambiguous State=TIMEOUT with exit code 152, which Torc can distinguish from user-initiated cancellation.
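
The rounding described above can be illustrated with a short Python sketch. It is an assumption about the arithmetic only, not Torc's actual implementation; the function name srun_time_arg and its parameters are hypothetical. It relies on srun accepting a plain number of minutes for --time.

```python
from datetime import datetime, timedelta


def srun_time_arg(allocation_end: datetime, now: datetime) -> str:
    """Build an srun --time argument from the remaining allocation time."""
    remaining = allocation_end - now
    minutes = int(remaining.total_seconds() // 60)  # round down to whole minutes
    if minutes < 1:
        # Slurm treats a zero time limit as "no limit", so refuse to emit it.
        raise ValueError("less than one whole minute left in the allocation")
    return f"--time={minutes}"


now = datetime(2024, 1, 1, 12, 0, 0)
print(srun_time_arg(now + timedelta(minutes=59, seconds=59), now))  # --time=59
```

Rounding down rather than to the nearest minute means the step limit can never exceed the time left in the allocation, which is what lets the step reach TIMEOUT before the allocation itself expires.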