Advanced Slurm Configuration

This guide covers advanced Slurm configuration for users who need fine-grained control over their HPC workflows.

For most users: See Slurm Overview for the recommended approach using torc submit-slurm. You don't need to configure schedulers or actions manually; Torc handles this automatically.

When to Use Manual Configuration

Manual Slurm configuration is useful when you need:

  • Custom Slurm directives (e.g., --constraint, --exclusive)
  • Multi-node jobs with specific topology requirements
  • Shared allocations across multiple jobs for efficiency
  • Non-standard partition configurations
  • Fine-tuned control over allocation timing

Torc Server Requirements

The Torc server must be accessible from compute nodes:

  • External server (Recommended): A team member allocates a shared server in the HPC environment. Use this option if your operations team provides this capability.
  • Login node: Suitable for small workflows. The server runs single-threaded by default. If you have many thousands of short jobs, check with your operations team about resource limits.

Manual Scheduler Configuration

Defining Slurm Schedulers

Define schedulers in your workflow specification:

slurm_schedulers:
  - name: standard
    account: my_project
    nodes: 1
    walltime: "12:00:00"
    partition: compute
    mem: 64G

  - name: gpu_nodes
    account: my_project
    nodes: 1
    walltime: "08:00:00"
    partition: gpu
    gres: "gpu:4"
    mem: 256G

Scheduler Fields

Field            Description                            Required
name             Scheduler identifier                   Yes
account          Slurm account/allocation               Yes
nodes            Number of nodes                        Yes
walltime         Time limit (HH:MM:SS or D-HH:MM:SS)    Yes
partition        Slurm partition                        No
mem              Memory per node                        No
gres             Generic resources (e.g., GPUs)         No
qos              Quality of Service                     No
ntasks_per_node  Tasks per node                         No
tmp              Temporary disk space                   No
extra            Additional sbatch arguments            No
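
The walltime field accepts two layouts, HH:MM:SS and D-HH:MM:SS. The following is a minimal sketch of a checker you could run on a specification before submitting it. The function name walltime_to_seconds is hypothetical and not part of Torc; the sketch only assumes the two formats listed in the table.

```python
import re

# Matches HH:MM:SS with an optional leading "D-" day count.
_WALLTIME_RE = re.compile(r"^(?:(\d+)-)?(\d+):([0-5]\d):([0-5]\d)$")


def walltime_to_seconds(walltime: str) -> int:
    """Convert 'HH:MM:SS' or 'D-HH:MM:SS' to a number of seconds.

    Hypothetical helper for illustration; raises ValueError on other layouts.
    """
    match = _WALLTIME_RE.match(walltime)
    if match is None:
        raise ValueError(f"unsupported walltime format: {walltime!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

For example, walltime_to_seconds("12:00:00") gives 43200 and walltime_to_seconds("1-02:30:00") gives 95400.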

Defining Workflow Actions

Actions trigger scheduler allocations:

actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: standard
    scheduler_type: slurm
    num_allocations: 1

  - trigger_type: on_jobs_ready
    action_type: schedule_nodes
    jobs: [train_model]
    scheduler: gpu_nodes
    scheduler_type: slurm
    num_allocations: 2

Action Trigger Types

Trigger               Description
on_workflow_start     Fires when workflow is submitted
on_jobs_ready         Fires when specified jobs become ready
on_jobs_complete      Fires when specified jobs complete
on_workflow_complete  Fires when all jobs complete

Assigning Jobs to Schedulers

Reference schedulers in job definitions:

jobs:
  - name: preprocess
    command: ./preprocess.sh
    scheduler: standard

  - name: train
    command: python train.py
    scheduler: gpu_nodes
    depends_on: [preprocess]

Scheduling Strategies

Strategy 1: Many Single-Node Allocations

Submit multiple Slurm jobs, each with its own Torc worker:

slurm_schedulers:
  - name: work_scheduler
    account: my_account
    nodes: 1
    walltime: "04:00:00"

actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: work_scheduler
    scheduler_type: slurm
    num_allocations: 10

When to use:

  • Jobs have diverse resource requirements
  • You want an independent time limit for each allocation
  • Cluster has low queue wait times

Benefits:

  • Maximum scheduling flexibility
  • Independent time limits per allocation
  • Fault isolation

Drawbacks:

  • More Slurm queue overhead
  • Multiple jobs to schedule

Strategy 2: Multi-Node Allocation

A single Torc worker manages all nodes in the allocation. The worker reports the total resources across all nodes (CPUs × nodes, memory × nodes, etc.) and launches each job via srun --exact, which lets Slurm place it on whichever node has capacity:

slurm_schedulers:
  - name: work_scheduler
    account: my_account
    nodes: 10
    walltime: "04:00:00"

actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: work_scheduler
    scheduler_type: slurm
    num_allocations: 1
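
To make the "CPUs × nodes, memory × nodes" arithmetic concrete, here is a small sketch. It assumes identical nodes, and the 36-CPU / 192 GB node size is a made-up example. The names Resources, allocation_total, and max_concurrent_jobs are hypothetical and are not Torc APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: int
    memory_gb: int


def allocation_total(per_node: Resources, num_nodes: int) -> Resources:
    """Total resources a single worker would report for the allocation."""
    return Resources(per_node.cpus * num_nodes, per_node.memory_gb * num_nodes)


def max_concurrent_jobs(per_node: Resources, num_nodes: int, job: Resources) -> int:
    """Upper bound on concurrent single-node jobs.

    A single-node job must fit on one node, so the bound is computed
    per node and then multiplied by the node count.
    """
    per_node_fit = min(per_node.cpus // job.cpus, per_node.memory_gb // job.memory_gb)
    return per_node_fit * num_nodes


node = Resources(cpus=36, memory_gb=192)            # assumed node size
total = allocation_total(node, 10)                  # Resources(cpus=360, memory_gb=1920)
limit = max_concurrent_jobs(node, 10, Resources(4, 16))  # 90
```

With these assumed numbers, CPUs are the limiting resource: each node fits nine 4-CPU jobs, so the 10-node allocation fits at most 90 at a time.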

When to use:

  • Many single-node jobs with similar requirements
  • You want faster queue scheduling (larger jobs are often prioritized)
  • MPI or other jobs that span multiple nodes

Benefits:

  • Single queue wait
  • Full per-step sacct accounting and cgroup enforcement
  • Slurm handles node placement automatically via srun --exact

Drawbacks:

  • Shared time limit for all jobs in the allocation

Staged Allocations

For pipelines with distinct phases, stage allocations to avoid wasted resources:

slurm_schedulers:
  - name: preprocess_sched
    account: my_project
    nodes: 2
    walltime: "01:00:00"

  - name: compute_sched
    account: my_project
    nodes: 20
    walltime: "08:00:00"

  - name: postprocess_sched
    account: my_project
    nodes: 1
    walltime: "00:30:00"

actions:
  # Preprocessing starts immediately
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: preprocess_sched
    scheduler_type: slurm
    num_allocations: 1

  # Compute nodes allocated when compute jobs are ready
  - trigger_type: on_jobs_ready
    action_type: schedule_nodes
    jobs: [compute_step]
    scheduler: compute_sched
    scheduler_type: slurm
    num_allocations: 1

  # Postprocessing nodes allocated when postprocess jobs are ready
  - trigger_type: on_jobs_ready
    action_type: schedule_nodes
    jobs: [postprocess]
    scheduler: postprocess_sched
    scheduler_type: slurm
    num_allocations: 1

Note: The torc submit-slurm command handles this automatically by analyzing job dependencies.
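
As a rough illustration of the savings, the sketch below compares the node-hours requested by the three staged allocations above with one allocation sized for the largest phase and held for the whole pipeline. It treats each walltime as the time actually used, which is an upper bound. The function names are hypothetical.

```python
def staged_node_hours(stages):
    """Sum of nodes * hours over independently sized allocations."""
    return sum(nodes * hours for nodes, hours in stages)


def single_allocation_node_hours(stages):
    """One allocation sized for the peak phase and held for every phase."""
    peak_nodes = max(nodes for nodes, _ in stages)
    total_hours = sum(hours for _, hours in stages)
    return peak_nodes * total_hours


# (nodes, walltime in hours) for preprocess, compute, postprocess
STAGES = [(2, 1.0), (20, 8.0), (1, 0.5)]

staged = staged_node_hours(STAGES)             # 162.5 node-hours
single = single_allocation_node_hours(STAGES)  # 190.0 node-hours
```

Under these assumptions, staging requests 27.5 fewer node-hours than holding 20 nodes for all 9.5 hours.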

Custom Slurm Directives

Use the extra field for additional sbatch arguments:

slurm_schedulers:
  - name: exclusive_nodes
    account: my_project
    nodes: 4
    walltime: "04:00:00"
    extra: "--exclusive --constraint=skylake"
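
The sketch below shows one plausible way the scheduler fields and the extra string could map onto an sbatch command line. The mapping table and the sbatch_args function are assumptions for illustration, not Torc's actual implementation. The sbatch options themselves (--account, --nodes, --time, and so on) are standard Slurm flags.

```python
import shlex

# Assumed mapping from scheduler fields to sbatch long options.
_FIELD_TO_FLAG = {
    "account": "--account",
    "nodes": "--nodes",
    "walltime": "--time",
    "partition": "--partition",
    "mem": "--mem",
    "gres": "--gres",
    "qos": "--qos",
    "ntasks_per_node": "--ntasks-per-node",
    "tmp": "--tmp",
}


def sbatch_args(scheduler: dict) -> list:
    """Build an sbatch argument list from a scheduler definition (illustrative)."""
    args = ["sbatch"]
    for field, flag in _FIELD_TO_FLAG.items():
        if field in scheduler:
            args.append(f"{flag}={scheduler[field]}")
    # shlex.split honors shell quoting, so quoted values in "extra" stay intact.
    args.extend(shlex.split(scheduler.get("extra", "")))
    return args


exclusive = {
    "name": "exclusive_nodes",
    "account": "my_project",
    "nodes": 4,
    "walltime": "04:00:00",
    "extra": "--exclusive --constraint=skylake",
}
args = sbatch_args(exclusive)
```

Splitting extra with shlex rather than str.split keeps an argument such as --comment="two words" as a single item.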

Submitting Workflows

With Manual Configuration

# Submit workflow with pre-defined schedulers and actions
torc submit workflow.yaml

Scheduling Additional Nodes

Add more allocations to a running workflow:

torc slurm schedule-nodes -n 5 $WORKFLOW_ID

Debugging

Check Slurm Job Status

squeue --me

View Torc Worker Logs

Workers log to the Slurm output file. Check:

cat slurm-<jobid>.out

Verify Server Connectivity

From a compute node:

curl $TORC_API_URL/health
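
If curl is not available on the compute node, the same check can be sketched with the Python standard library. The helper names are hypothetical. The sketch assumes only what the curl command above shows: TORC_API_URL holds the base URL and the server exposes a /health path. It also assumes any 2xx response means the server is healthy.

```python
import os
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Append /health to the base URL without doubling the slash."""
    return base_url.rstrip("/") + "/health"


def server_reachable(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers connection failures, HTTP errors, timeouts, and malformed URLs.
        return False


if __name__ == "__main__":
    url = os.environ.get("TORC_API_URL", "")
    print("reachable" if server_reachable(url) else "unreachable")
```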

srun Job Step Wrapping

When Torc detects that it is running inside a Slurm allocation (SLURM_JOB_ID is set in the environment), it automatically wraps each individual job with srun. This creates a dedicated Slurm job step for every Torc job, which provides:

  • Cgroup enforcement — Slurm enforces CPU and memory limits from the job's resource requirements. Jobs that exceed their stated requirements are immediately killed.
  • sstat visibility — HPC administrators and users can inspect per-step metrics (CPU, memory, wall-time) with sstat -j <SLURM_JOB_ID>.
  • Scheduler awareness — Every running Torc job appears as a named step in squeue, giving the HPC team and users full visibility into what is actually executing.
  • Accounting data — After each step exits, Torc calls sacct to collect Slurm accounting statistics and stores them with the job result (see Slurm Accounting Stats below).

Step Naming

Each srun step is named wf<workflow_id>_j<job_id>_r<run_id>_a<attempt_id>, for example wf10_j42_r1_a1. This name appears in squeue --me and sacct output, and the same component string is embedded in the log file prefix job_wf<workflow_id>_j<job_id>_r<run_id>_a<attempt_id> (for example, job_wf10_j42_r1_a1.o), so all Slurm and Torc records for a job can be easily correlated.
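
The naming scheme is regular enough to build and parse mechanically, which helps when matching squeue or sacct rows to Torc log files. A minimal sketch follows. The StepId type and the helper functions are hypothetical, and only the name pattern comes from the text above.

```python
import re
from typing import NamedTuple


class StepId(NamedTuple):
    workflow_id: int
    job_id: int
    run_id: int
    attempt_id: int


_STEP_RE = re.compile(r"^wf(\d+)_j(\d+)_r(\d+)_a(\d+)$")


def step_name(ids: StepId) -> str:
    """Format the step name, e.g. wf10_j42_r1_a1."""
    return f"wf{ids.workflow_id}_j{ids.job_id}_r{ids.run_id}_a{ids.attempt_id}"


def parse_step_name(name: str) -> StepId:
    """Recover the four IDs from a step name; raises ValueError if it does not match."""
    match = _STEP_RE.match(name)
    if match is None:
        raise ValueError(f"not a Torc step name: {name!r}")
    return StepId(*(int(group) for group in match.groups()))


def log_prefix(ids: StepId) -> str:
    """Log file prefix that embeds the same component string."""
    return "job_" + step_name(ids)
```

For example, parse_step_name("wf10_j42_r1_a1") returns StepId(workflow_id=10, job_id=42, run_id=1, attempt_id=1), and log_prefix of that value is "job_wf10_j42_r1_a1".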

Multi-Node Jobs

For a comprehensive guide to multi-node patterns, see Multi-Node Jobs.

The num_nodes resource requirement field controls how many nodes each job step spans (srun --nodes). It defaults to 1. The Slurm allocation size (sbatch --nodes) is set separately via the Slurm scheduler configuration.

Single-node jobs (default) — no extra configuration needed:

resource_requirements:
  - name: standard
    num_cpus: 4
    memory: 16g
    runtime: PT2H
    # num_nodes defaults to 1

True multi-node jobs (MPI, Julia Distributed.jl, etc.) — the job spans multiple nodes in the allocation:

resource_requirements:
  - name: mpi_job
    num_cpus: 32
    memory: 128g
    runtime: PT8H
    num_nodes: 4      # srun spans all 4 nodes; allocation size set via scheduler

In this pattern, the step spans 4 nodes exclusively, and Torc passes srun --nodes=4 when launching the job. The job command receives SLURM_JOB_NODELIST, SLURM_NTASKS, and the rest of the standard Slurm step environment, so MPI launchers (mpirun, mpiexec) and Julia Distributed.jl will automatically use all allocated nodes.

Multi-Node Allocation Rule

Inside a multi-node Slurm allocation, Torc uses two scheduling modes:

  • Single-node jobs (num_nodes=1) may share nodes based on CPU, memory, and GPU availability.
  • Multi-node jobs (num_nodes>1) reserve whole nodes exclusively.

This keeps job claiming and local resource accounting aligned with Slurm allocations.
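
The two modes can be pictured with a toy model. This is a simplified sketch that tracks CPUs only. It is not Torc's scheduling algorithm, and the Node type and claim function are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    total_cpus: int
    free_cpus: int


def claim(nodes: List[Node], num_cpus: int, num_nodes: int = 1) -> List[str]:
    """Return the names of the nodes claimed, or [] if the job cannot start yet."""
    if num_nodes == 1:
        # Single-node job: share the first node with enough free CPUs.
        for node in nodes:
            if node.free_cpus >= num_cpus:
                node.free_cpus -= num_cpus
                return [node.name]
        return []
    # Multi-node job: only completely idle nodes qualify, and each is taken whole.
    idle = [node for node in nodes if node.free_cpus == node.total_cpus]
    if len(idle) < num_nodes:
        return []
    chosen = idle[:num_nodes]
    for node in chosen:
        node.free_cpus = 0
    return [node.name for node in chosen]
```

In this model, a 4-CPU single-node job on three idle 8-CPU nodes lands on the first node. A following 2-node job then takes the other two nodes whole, leaving only the remaining 4 CPUs on the first node for further single-node jobs.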

Resource Limit Enforcement

In Slurm mode, Torc always passes --cpus-per-task and --mem to srun so Slurm enforces the cgroup limits defined in each job's resource requirements. These flags work together with --exact to allow multiple job steps to run concurrently on shared nodes.
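
For orientation, a step launch along these lines might look like the sketch below. The exact command Torc builds is not shown in this guide, so the argument order, the use of --job-name for the step name, and the srun_command function itself are assumptions. The flags are standard srun options.

```python
def srun_command(step_name, num_cpus, memory, command, num_nodes=1):
    """Assemble an illustrative srun invocation for one job step."""
    return [
        "srun",
        "--exact",                        # take only the requested resources
        f"--job-name={step_name}",
        f"--nodes={num_nodes}",
        f"--cpus-per-task={num_cpus}",    # CPU limit for the step
        f"--mem={memory.upper()}",        # memory per node, e.g. 16G
        *command,
    ]


cmd = srun_command("wf10_j42_r1_a1", 4, "16g", ["./preprocess.sh"])
```

Because each step requests only its own CPUs and memory, several such steps can run side by side on the same node.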

Note: limit_resources: false is not supported in Slurm mode. If you need to run jobs without resource enforcement inside a Slurm allocation, use mode: direct instead:

execution_config:
  mode: direct
  limit_resources: false

In direct mode, jobs run as plain processes without srun wrapping. This means you lose per-step sacct accounting and cgroup isolation, but jobs can use any available resources without restriction.

Disabling srun Wrapping

To disable srun wrapping entirely and run jobs via direct shell execution inside a Slurm allocation, set mode: direct in your execution config:

execution_config:
  mode: direct

In direct mode, Slurm accounting (sacct) and live monitoring (sstat) are unavailable since jobs do not run as Slurm steps. However, Torc's own resource monitor can still track memory and CPU usage if enabled.

Note: Direct mode inside a Slurm allocation is useful when srun has compatibility issues, or when you want to run jobs without resource limits (limit_resources: false). For most workflows, the default auto mode (which selects Slurm mode inside allocations) is recommended.

Slurm Accounting Stats

After each job step exits, Torc calls sacct once to collect the following Slurm-native accounting fields and stores them in the slurm_stats table:

Field                 sacct source  Description
max_rss_bytes         MaxRSS        Peak resident-set size (from cgroups)
max_vm_size_bytes     MaxVMSize     Peak virtual memory size
max_disk_read_bytes   MaxDiskRead   Peak disk read bytes
max_disk_write_bytes  MaxDiskWrite  Peak disk write bytes
ave_cpu_seconds       AveCPU        Average CPU time in seconds
node_list             NodeList      Nodes used by the job step

These fields complement the existing sysinfo-based metrics (peak_memory_bytes, peak_cpu_percent, etc.) and are available via torc slurm stats <workflow_id>.

sacct data is collected on a best-effort basis. Fields are null when:

  • The job ran locally (no SLURM_JOB_ID)
  • sacct is not available on the node
  • The step was not found in the Slurm accounting database at collection time
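
sacct prints sizes with unit suffixes (for example 204800K) and CPU times as clock strings, so they need converting before they fit fields such as max_rss_bytes and ave_cpu_seconds. The sketch below shows one way to do that and maps empty values to None, mirroring the null cases above. The helper names are hypothetical, the suffixes are assumed to be 1024-based, and this is not Torc's parser.

```python
from typing import Optional

# Assumption: sacct size suffixes are powers of 1024.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def size_to_bytes(value: str) -> Optional[int]:
    """Convert a size such as '204800K' or '1.5M' to bytes; empty input gives None."""
    value = value.strip()
    if not value:
        return None
    suffix = value[-1].upper()
    if suffix in _UNITS:
        return int(float(value[:-1]) * _UNITS[suffix])
    return int(float(value))


def cpu_time_to_seconds(value: str) -> Optional[float]:
    """Convert '[D-][HH:]MM:SS[.mmm]' to seconds; empty input gives None."""
    value = value.strip()
    if not value:
        return None
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    parts = [float(part) for part in value.split(":")]
    while len(parts) < 3:          # pad missing hours (and minutes) with zero
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

For example, size_to_bytes("2048K") gives 2097152 and cpu_time_to_seconds("00:01:23") gives 83.0.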

Local Execution

When running locally (no SLURM_JOB_ID environment variable), Torc uses its standard shell wrapper and the srun behavior is never triggered. No configuration is needed for local runs.
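
The detection rule amounts to a single environment lookup. A minimal sketch follows. The function name and the "slurm" and "local" labels are hypothetical; "auto" and "direct" are the mode values this guide mentions.

```python
import os
from typing import Mapping


def resolve_mode(environ: Mapping[str, str], configured: str = "auto") -> str:
    """Pick an execution mode (illustrative only).

    An explicit setting such as "direct" wins; "auto" chooses srun wrapping
    only when SLURM_JOB_ID is present and non-empty.
    """
    if configured != "auto":
        return configured
    return "slurm" if environ.get("SLURM_JOB_ID") else "local"


if __name__ == "__main__":
    print(resolve_mode(os.environ))
```

Passing the environment as an argument, rather than reading os.environ inside the function, keeps the rule easy to test.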
