Slurm Exit Codes
Torc sets per-step walltimes via srun --time, which produces deterministic exit codes that you can
inspect with torc results list and torc slurm sacct.
Exit Code Reference
| Scenario | Exit Code | Slurm State | Torc Status | Description |
|---|---|---|---|---|
| Out of memory | 137 | OUT_OF_MEMORY | failed | Exceeded --mem cgroup limit (SIGKILL) |
| Timeout, SIGTERM handled | 0 | COMPLETED | completed | Caught SIGTERM, saved state, exited |
| Timeout, SIGKILL | 152 | TIMEOUT | terminated | Did not exit before --time limit |
Out of Memory (exit code 137)
The job exceeded its --mem cgroup limit. Slurm's OOM killer sent SIGKILL (signal 9).
137 = 128 + 9, following the shell convention of reporting death by signal as 128 plus the signal number.
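To decode other return codes the same way, the sketch below uses Python's standard signal module. It is an illustration only and not part of Torc: describe_return_code is a hypothetical helper, and the signal names it reports are those of the platform it runs on.

```python
import signal


def describe_return_code(code: int) -> str:
    """Describe a return code using the 128 + signal-number convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        # Assumption: codes above 128 denote death by signal. A program can
        # also return such a value on its own, so treat this as a hint.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"exit code {code} (no matching signal)"
        return f"killed by {name} (128 + {signum})"
    return f"exited with error code {code}"


for code in (0, 137, 152):
    print(code, "->", describe_return_code(code))
```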
$ torc results list $WORKFLOW_ID
╭────┬──────────┬─────────┬─────────────╮
│ ID │ Job Name │ Status │ Return Code │
├────┼──────────┼─────────┼─────────────┤
│ 1 │ train │ failed │ 137 │
╰────┴──────────┴─────────┴─────────────╯
$ torc slurm sacct $WORKFLOW_ID
╭──────────────────────┬───────────────┬──────────╮
│ Step Name │ State │ MaxRSS │
├──────────────────────┼───────────────┼──────────┤
│ wf1_j1_r1_a1 │ OUT_OF_MEMORY │ 4096000K │
╰──────────────────────┴───────────────┴──────────╯
Fix: Increase memory in resource requirements, or use
torc reports check-resource-utilization --correct to auto-adjust based on peak usage.
Timeout with Graceful Shutdown (exit code 0)
The job received SIGTERM via srun --signal, saved a checkpoint, and called sys.exit(0). From
Slurm's perspective, the job completed normally.
$ torc results list $WORKFLOW_ID
╭────┬──────────┬───────────┬─────────────╮
│ ID │ Job Name │ Status │ Return Code │
├────┼──────────┼───────────┼─────────────┤
│ 1 │ simulate │ completed │ 0 │
╰────┴──────────┴───────────┴─────────────╯
This is the expected outcome when using srun_termination_signal. The job handled the signal
correctly but did not finish all its work. Reinitialize and re-submit to continue from the
checkpoint:
torc workflows reinitialize $WORKFLOW_ID
torc workflows submit $WORKFLOW_ID
See the Graceful Job Termination tutorial for a complete example with a Python signal handler.
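For orientation, the shutdown-flag approach to handling SIGTERM can be sketched as follows. This is an illustrative outline, not the tutorial's code: run_steps, do_step, and save_checkpoint are hypothetical names, and the actual work and checkpoint logic are left to the caller.

```python
import signal
import sys

# Set by the handler; the work loop polls it between steps.
shutdown_requested = False


def handle_sigterm(signum, frame):
    """Record the request only; the main loop does the actual cleanup."""
    global shutdown_requested
    shutdown_requested = True


def run_steps(total_steps, do_step, save_checkpoint, start=0):
    """Run steps from `start`; stop early and checkpoint if SIGTERM arrived.

    Returns the index of the next step to run (total_steps when finished).
    """
    for step in range(start, total_steps):
        if shutdown_requested:
            save_checkpoint(step)
            return step
        do_step(step)
    return total_steps


def main():
    signal.signal(signal.SIGTERM, handle_sigterm)
    # Placeholder work and checkpoint functions, for illustration only.
    run_steps(
        total_steps=5,
        do_step=lambda i: print(f"step {i}"),
        save_checkpoint=lambda i: print(f"checkpoint at step {i}"),
    )
    sys.exit(0)  # exit 0 whether the work finished or was checkpointed
```

A script would call main() from its entry point. Keeping the handler down to setting a flag means the checkpoint is written from the main loop, between steps, rather than from inside the signal handler.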
Timeout without Handler (exit code 152)
The job did not exit before the step's --time limit. Slurm sent SIGTERM, waited KillWait seconds
(30 seconds by default, configured in slurm.conf), then sent SIGKILL. The reported return code is
152 = 128 + 24 (SIGXCPU).
$ torc results list $WORKFLOW_ID
╭────┬──────────┬────────────┬─────────────╮
│ ID │ Job Name │ Status │ Return Code │
├────┼──────────┼────────────┼─────────────┤
│ 1 │ train │ terminated │ 152 │
╰────┴──────────┴────────────┴─────────────╯
$ torc slurm sacct $WORKFLOW_ID
╭──────────────────────┬─────────┬──────────╮
│ Step Name │ State │ MaxRSS │
├──────────────────────┼─────────┼──────────┤
│ wf1_j1_r1_a1 │ TIMEOUT │ 2048000K │
╰──────────────────────┴─────────┴──────────╯
Fix:
- Add a SIGTERM handler using the shutdown-flag pattern
- Set srun_termination_signal to give more lead time (e.g., "TERM@300" for 5 minutes)
- Increase the allocation walltime
Why Torc Sets --time
Without srun --time, steps inherit the allocation's walltime. When the allocation expires, Slurm
cancels all steps with State=CANCELLED, which is ambiguous — it could mean the user canceled the
job, the admin preempted it, or time ran out.
By setting --time to the remaining allocation time (rounded down to whole minutes), Torc ensures
the step times out before the allocation expires. This produces the unambiguous State=TIMEOUT
with exit code 152, which Torc can distinguish from user-initiated cancellation.
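The rounding described above can be illustrated with a short calculation. This is a sketch under stated assumptions, not Torc's implementation: step_time_limit_minutes is a hypothetical helper, and the allocation end time is assumed to be known to the caller.

```python
from datetime import datetime, timedelta


def step_time_limit_minutes(allocation_end: datetime, now: datetime) -> int:
    """Whole minutes left in the allocation, rounded down, never negative."""
    remaining = allocation_end - now
    return max(0, int(remaining.total_seconds() // 60))


now = datetime(2024, 1, 1, 12, 0, 0)
end = now + timedelta(minutes=45, seconds=50)
minutes = step_time_limit_minutes(end, now)

# A caller should skip launching a step when this is 0, because srun
# treats a zero time limit as "no limit".
print(f"srun --time={minutes} ...")  # prints: srun --time=45 ...
```

Rounding down rather than to the nearest minute is what keeps the step's limit inside the allocation: with 45 minutes and 50 seconds left, the step gets 45 minutes and times out 50 seconds before the allocation ends.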