Slurm Job Step Monitoring
Torc automatically wraps workflow jobs with srun when running inside a Slurm allocation, creating
per-job steps with cgroup enforcement. It then monitors live resource usage via sstat and collects
final accounting data via sacct. This page describes the architecture and data flow.
Overview
When a Torc job runner detects SLURM_JOB_ID in the environment, it activates three subsystems:
- srun wrapping -- creates a Slurm job step per workflow job
- sstat monitoring -- polls live CPU and memory usage during execution
- sacct collection -- retrieves final accounting data after completion
These work together to provide accurate resource metrics for recovery heuristics, resource correction, and reporting.
flowchart TD
classDef runner fill:#2563eb,color:#fff,stroke:#1d4ed8
classDef slurm fill:#7c3aed,color:#fff,stroke:#6d28d9
classDef data fill:#059669,color:#fff,stroke:#047857
classDef server fill:#dc2626,color:#fff,stroke:#b91c1c
JR[Job Runner]:::runner
SRUN[srun wrapper]:::slurm
SSTAT[sstat monitor thread]:::slurm
SACCT[sacct collector]:::slurm
RES[ResultModel]:::data
SS[SlurmStatsModel]:::data
API[Torc Server API]:::server
JR -->|wraps command| SRUN
JR -->|spawns| SSTAT
SRUN -->|creates step| STEP((Slurm Step)):::slurm
SSTAT -->|polls step| STEP
SSTAT -->|cpu %, memory| RES
SRUN -->|exit code| JR
JR -->|calls after exit| SACCT
SACCT -->|MaxRSS, AveCPU| SS
SS -->|backfill peaks| RES
RES -->|POST /results| API
SS -->|POST /slurm_stats| API
srun Wrapping
Module: src/client/async_cli_command.rs
When SLURM_JOB_ID is set, AsyncCliCommand::start() wraps the user's command with srun instead
of running it directly via the shell. This gives each workflow job its own Slurm step with cgroup
isolation.
Step Naming Convention
Each step is named with a deterministic pattern that enables unambiguous sacct lookup:
wf{workflow_id}_j{job_id}_r{run_id}_a{attempt_id}
Example: wf42_j7_r1_a1 -- workflow 42, job 7, run 1, attempt 1.
srun Arguments
flowchart LR
classDef flag fill:#f59e0b,color:#000,stroke:#d97706
classDef note fill:#e5e7eb,color:#000,stroke:#9ca3af
SRUN[srun]:::flag
JOBID["--jobid=SLURM_JOB_ID"]:::flag
NT["--ntasks=1"]:::flag
EX["--exact"]:::flag
CB["--cpu-bind=none"]:::flag
JN["--job-name=step_name"]:::flag
ND["--nodes=num_nodes"]:::flag
CPU["--cpus-per-task=N"]:::flag
MEM["--mem=NM"]:::flag
TIME["--time=remaining_min"]:::flag
SIG["--signal=TERM@N"]:::flag
SRUN --> JOBID
SRUN --> NT
SRUN --> EX
SRUN --> CB
SRUN --> JN
SRUN --> ND
EX -.- N0["Use exact resources,<br/>don't claim entire node"]:::note
ND -.- N1["Default: 1 node<br/>Override via num_nodes"]:::note
CB -.- N2["Default on; disable with<br/>enable_cpu_bind: true"]:::note
TIME -.- N4["Remaining allocation time<br/>rounded down to minutes"]:::note
SIG -.- N5["Early warning signal<br/>e.g. TERM@120"]:::note
SRUN --> CPU
SRUN --> MEM
SRUN --> TIME
SRUN --> SIG
| Argument | Purpose |
|---|---|
--jobid | Binds step to the parent Slurm allocation |
--ntasks=1 | One task per workflow job |
--exact | Use exactly the requested resources; don't claim the entire node exclusively |
--cpu-bind=none | Disables CPU affinity binding (default; omitted when enable_cpu_bind: true) |
--job-name | Sets the step name for sacct/sstat lookup |
--nodes | Number of nodes for this step (from num_nodes, default 1) |
--cpus-per-task | CPU cgroup limit (from job's resource requirements) |
--mem | Memory cgroup limit in MB (from job's resource requirements) |
--time | Per-step walltime from remaining allocation time (see below) |
--signal | Early warning signal before step timeout (see below) |
Multi-Node Steps
The num_nodes field on resource requirements controls how many nodes an srun step spans
(srun --nodes). It defaults to 1. The allocation size (sbatch --nodes) is set via the Slurm
scheduler configuration.
- Single-node jobs (default):
num_nodes=1-- each job runs on one node - Multi-node jobs (MPI, Julia Distributed.jl):
num_nodes=N-- step spans N nodes
When num_nodes > 1, Torc treats the job as consuming whole nodes exclusively for the lifetime of
the step. This avoids ambiguous partial-node accounting for MPI and other true multi-node jobs.
Resource Enforcement
In Slurm mode, srun always applies --cpus-per-task and --mem flags from the job's resource
requirements. This creates cgroup-enforced limits: jobs exceeding their memory allocation are killed
by the OOM killer (exit code 137), and CPU usage is throttled. These flags are also required for
--exact to work correctly, allowing multiple job steps to run concurrently on shared nodes.
Note: The
limit_resourcessetting only applies to direct mode (OOM enforcement). It is not supported in Slurm mode — usemode: directif you need to disable resource enforcement.
Per-Step Walltime (--time)
Torc sets srun --time=<remaining_minutes> on every step, where remaining_minutes is the time
left in the Slurm allocation (rounded down to whole minutes, minimum 1). This ensures that when
a step exceeds its time, Slurm produces a clean State=TIMEOUT with return code 152, rather than
State=CANCELLED which is what happens when the allocation itself expires and kills the step.
The rounding-down is intentional: it guarantees the step's timeout fires before the allocation expires, so Torc can detect the timeout via sacct and report it accurately.
Early Termination Signal (--signal)
When the workflow has srun_termination_signal set (e.g., "TERM@120"), Torc passes
srun --signal=TERM@120 to every step. This tells Slurm to send SIGTERM to the job process 120
seconds before the step's --time limit. The job can catch this signal, save a checkpoint, and
exit cleanly before the hard kill.
This is a workflow-level setting stored in the database and applied uniformly to all srun invocations. See the Graceful Job Termination tutorial for a complete example with a Python signal handler.
Memory vs Runtime Enforcement
Memory and runtime are handled differently because they carry different risks:
-
Memory (
--mem) enforces a per-job cgroup limit based on the job's own resource requirements. A job that exceeds its declared memory is killed by the OOM killer (exit code 137). This strict enforcement is necessary because a runaway memory consumer can destabilize the entire node by starving other concurrent jobs of memory. -
Runtime (
--time) is set to the remaining allocation time, not the job's declaredruntimefrom resource requirements. This ensures Slurm produces a cleanState=TIMEOUT(return code 152) instead ofState=CANCELLEDwhen time runs out. The per-step--timeis not derived from the job's resource requirements because, unlike excess memory usage, a long-running job does not endanger other jobs — it simply runs longer than expected. Killing a job at 95% completion because it slightly exceeded its estimated runtime would waste all the compute invested.
sstat Monitoring
Module: src/client/resource_monitor.rs
The resource monitor spawns a background thread that periodically polls sstat for live resource
metrics. This provides time-series data for CPU and memory usage during job execution.
Step ID Discovery
sstat requires numeric step IDs (e.g., 12345.0, 12345.1), not step names. After srun starts, the
monitor discovers the numeric ID:
sequenceDiagram
participant JR as Job Runner
participant RM as Resource Monitor
participant SQ as squeue --steps
participant SS as sstat
JR->>RM: register_srun_step(step_name)
loop Up to 5 attempts (200ms apart)
RM->>SQ: squeue --steps -j $SLURM_JOB_ID -o "%i|%j"
SQ-->>RM: 12345.2|wf42_j7_r1_a1
end
Note over RM: Maps step_name → numeric ID "2"
loop Every sample_interval
RM->>SS: sstat -j 12345.2 --format=AveCPU,MaxRSS
SS-->>RM: 00:01:30|2048576K
Note over RM: Compute CPU%, track peak memory
end
The squeue --steps command is used instead of scontrol show step because it produces one compact
line per step, which is critical for allocations with thousands of concurrent steps.
CPU Percent Calculation
sstat reports AveCPU as cumulative CPU time since step start, not instantaneous usage. The monitor
converts this to a utilization percentage:
cpu_percent = (new_ave_cpu_s - prev_ave_cpu_s) / elapsed_s * 100.0
First-sample guard: The first sstat sample is always skipped because there is no previous
baseline to compute a delta from. Without this guard, the first sample would compute
cumulative_cpu / total_elapsed which gives an inaccurate average, not instantaneous usage.
sstat Fields Collected
| sstat Field | Parsed As | Stored In |
|---|---|---|
AveCPU | Seconds (HH:MM:SS) | CPU percent (derived) |
MaxRSS | Bytes (with K/M/G suffix) | peak_memory_bytes |
Noise Reduction
sstat calls for completed steps return non-zero exit codes. These are logged at debug level (not
warn) to avoid flooding logs when steps finish between polling intervals.
sacct Collection
Module: src/client/async_cli_command.rs
After a job step exits, collect_sacct_stats() retrieves the final Slurm accounting record. This is
a blocking call that runs on the job runner thread.
Retry Logic
The Slurm accounting daemon (slurmdbd) often has a delay between step completion and record
availability. The collector retries up to 6 times with 5-second delays:
flowchart TD
classDef action fill:#2563eb,color:#fff,stroke:#1d4ed8
classDef decision fill:#f59e0b,color:#000,stroke:#d97706
classDef result fill:#059669,color:#fff,stroke:#047857
classDef fail fill:#dc2626,color:#fff,stroke:#b91c1c
START[Step exits]:::action
CALL["sacct -j $JOB_ID -P -n<br/>--format JobName,MaxRSS,..."]:::action
FOUND{Step row<br/>found?}:::decision
DATA{Has MaxRSS<br/>or AveCPU?}:::decision
RETRY{Attempt<br/>< 6?}:::decision
WAIT[Sleep 5s]:::action
PARSE[Parse fields]:::action
RETURN[Return SacctStats]:::result
NONE[Return None]:::fail
START --> CALL
CALL --> FOUND
FOUND -->|Yes| DATA
FOUND -->|No| RETRY
DATA -->|Yes| PARSE --> RETURN
DATA -->|No, fields empty| RETRY
RETRY -->|Yes| WAIT --> CALL
RETRY -->|No, give up| NONE
sacct Output Format
The command queries sacct with pipe-separated output:
sacct -j <slurm_job_id> --format JobName,MaxRSS,MaxVMSize,MaxDiskRead,MaxDiskWrite,AveCPU,NodeList -P -n
The step row is matched by JobName == step_name in code rather than using sacct's --name flag,
which on some Slurm versions matches the allocation name instead of the step name.
sacct Fields Collected
| sacct Field | Model Field | Description |
|---|---|---|
MaxRSS | max_rss_bytes | Peak resident set size |
MaxVMSize | max_vm_size_bytes | Peak virtual memory size |
MaxDiskRead | max_disk_read_bytes | Total bytes read from disk |
MaxDiskWrite | max_disk_write_bytes | Total bytes written to disk |
AveCPU | ave_cpu_seconds | Average CPU time (HH:MM:SS parsed to seconds) |
NodeList | node_list | Nodes where step ran |
Data Assembly
Module: src/client/job_runner.rs
The job runner assembles the final ResultModel from three sources:
flowchart LR
classDef sstat fill:#7c3aed,color:#fff,stroke:#6d28d9
classDef sacct fill:#2563eb,color:#fff,stroke:#1d4ed8
classDef runner fill:#f59e0b,color:#000,stroke:#d97706
classDef result fill:#059669,color:#fff,stroke:#047857
subgraph "sstat (live monitoring)"
S1[peak_memory_bytes]:::sstat
S2[peak_cpu_percent]:::sstat
S3[avg_cpu_percent]:::sstat
S4[avg_memory_bytes]:::sstat
end
subgraph "sacct (post-mortem)"
A1[max_rss_bytes]:::sacct
A2[ave_cpu_seconds]:::sacct
end
subgraph "Job Runner"
R1[exec_time_minutes]:::runner
R2[return_code]:::runner
end
RESULT[ResultModel]:::result
S1 -->|"max(sstat, sacct)"| RESULT
S2 --> RESULT
S3 --> RESULT
S4 --> RESULT
A1 -->|"max(sstat, sacct)"| RESULT
A2 -->|"avg_pct = ave_cpu_s / exec_s * 100"| RESULT
R1 --> RESULT
R2 --> RESULT
Backfill Strategy
The backfill_sacct_into_result() function merges sacct data into the result, following these
rules:
| Field | Strategy | Rationale |
|---|---|---|
peak_memory_bytes | max(sstat_peak, sacct_MaxRSS) | sstat may miss spikes between samples; sacct may miss brief spikes between accounting flushes |
avg_cpu_percent | sacct ave_cpu_s / exec_s * 100 | Replaces sstat-derived average with sacct's authoritative lifetime average |
peak_cpu_percent | sacct average (only if sstat gave 0%) | sacct has no instantaneous peak; using the average is better than displaying 0% |
avg_memory_bytes | sstat only (no backfill) | sacct provides no average RSS; only the sstat time-series has this |
Zero guards: sacct values of 0 are skipped during backfill. A zero usually means accounting was never flushed (very short or failed steps), not that the job used no resources.
Storage
SlurmStatsModel
Sacct data is stored in the slurm_stats table with a UNIQUE constraint on
(workflow_id, job_id, run_id, attempt_id):
erDiagram
workflow ||--o{ slurm_stats : "has"
job ||--o{ slurm_stats : "has"
slurm_stats {
int id PK
int workflow_id FK
int job_id FK
int run_id
int attempt_id
text slurm_job_id
int max_rss_bytes
int max_vm_size_bytes
int max_disk_read_bytes
int max_disk_write_bytes
real ave_cpu_seconds
text node_list
}
style slurm_stats fill:#7c3aed,color:#fff,stroke:#6d28d9
The API provides:
POST /slurm_stats-- store accounting data after step completionGET /slurm_stats?workflow_id=N-- list stats with optionaljob_id,run_id,attempt_idfilters
ResultModel Integration
The ResultModel (stored in the result table) holds the merged resource metrics from both sstat
and sacct. These are the fields used by recovery heuristics and resource correction:
peak_memory_bytes-- compared against configured memory limitpeak_cpu_percent-- compared against configured CPU countexec_time_minutes-- compared against configured runtimeavg_cpu_percent-- displayed in reports and dashboards
Parsing Utilities
Module: src/client/slurm_utils.rs
Two parsing functions handle Slurm's non-standard formats:
parse_slurm_memory(s)-- parses suffixed memory strings (1024K,2.5G,512M) into bytesparse_slurm_cpu_time(s)-- parsesHH:MM:SSorD-HH:MM:SStimestamps into seconds
These are used by both the sstat monitor and the sacct collector to normalize Slurm output into numeric values.
Failure Modes and Recovery
| Failure | Impact | Handling |
|---|---|---|
| srun not available | Falls back to direct shell execution | Detected by SLURM_JOB_ID absence |
| sstat unavailable | No live metrics; sacct still works | sstat errors logged at debug level |
| Step ID not discoverable | sstat monitoring skipped for that job | Retries 5x at 200ms intervals |
| sacct returns no data | SlurmStatsModel has NULL fields | Retries 6x at 5s intervals |
| sacct reports zeros | Backfill skips zero values | Preserves sstat-derived metrics |
| OOM kill (exit 137) | Job fails, memory limit detected | Recovery heuristics increase memory |
| Timeout (exit 152) | Job fails, runtime limit detected | Recovery heuristics increase runtime |
Configuration
| Setting | Location | Description |
|---|---|---|
limit_resources | Execution config | Enable OOM enforcement in direct mode (not supported in slurm mode) |
num_nodes | Resource requirements | Number of nodes per job (default: 1) |
sample_interval_seconds | Resource monitor config | sstat polling interval |
TORC_FAKE_SRUN | Environment variable | Override srun binary path (testing) |
TORC_FAKE_SACCT | Environment variable | Override sacct binary path (testing) |