Multi-Node Jobs
This guide explains how to run jobs across multiple nodes in a Slurm allocation. There are two distinct patterns, and choosing the right one depends on whether your individual jobs need more than one node.
Key Concept: Per-Node Resource Values
All resource values in Torc are per-node. When you write num_cpus: 32 and memory: 128g, you
are describing what each node provides — not the total across all nodes. This keeps the mental model
simple: resource requirements always describe what a single node looks like, regardless of how many
nodes are in the allocation.
Two Patterns
Pattern 1: Many Single-Node Jobs in a Multi-Node Allocation
Use when: You have many independent jobs that each fit on one node, and you want them to run in parallel across multiple nodes for throughput.
How it works: Torc requests a multi-node Slurm allocation (e.g., 4 nodes). The behavior depends on the execution mode:
- Slurm mode (default): A single worker manages the allocation and places each single-node job
onto a node via
srun --nodes=1. Slurm handles resource isolation and node placement. - Direct mode: Jobs are executed directly without
srunwrapping. To distribute work across nodes, setstart_one_worker_per_node: trueon theschedule_nodesaction. This launches one worker per node viasrun --ntasks-per-node=1, and each worker executes jobs directly on its node.
Single-node jobs may share a node as long as CPU, memory, and GPU limits allow. With N nodes, Torc can spread work across the allocation for throughput.
Example (Slurm mode): 100 independent analysis jobs, each needing 8 CPUs and 32 GB, across a 4-node allocation:
name: parallel_analysis
description: Run 100 analysis tasks across 4 nodes
resource_requirements:
- name: analysis
num_cpus: 8 # per node
memory: 32g # per node
runtime: PT1H
jobs:
- name: analyze_{i}
command: python analyze.py --chunk {i}
resource_requirements: analysis
scheduler: multi_node
parameters:
i: "1:100"
slurm_schedulers:
- name: multi_node
account: myproject
nodes: 4
walltime: "08:00:00"
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: multi_node
scheduler_type: slurm
num_allocations: 1
Example (Direct mode): The same workload using direct execution with one worker per node:
name: parallel_analysis_direct
description: Run 20 analysis tasks across 2 nodes via direct execution
execution_config:
mode: direct
resource_requirements:
- name: analysis
num_cpus: 5
num_nodes: 1
memory: 2g
runtime: PT3M
jobs:
- name: analyze_{i}
command: python analyze.py --chunk {i}
resource_requirements: analysis
scheduler: multi_node
parameters:
i: "1:20"
slurm_schedulers:
- name: multi_node
account: myproject
nodes: 2
walltime: "00:10:00"
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: multi_node
scheduler_type: slurm
start_one_worker_per_node: true
num_allocations: 1
Each node has 8 CPUs and 32 GB available per job. If a node has 64 CPUs total, it can run up to 8 jobs concurrently (64 / 8 = 8). Across 4 nodes, that means up to 32 jobs running at once.
Note:
start_one_worker_per_nodeis only supported withexecution_config.mode: direct. It is not compatible with slurm execution mode, where Torc uses a single worker withsrun-based node placement.
Pattern 2: True Multi-Node Jobs (MPI, Distributed Training)
Use when: A single job needs to span multiple nodes — for example, MPI applications, distributed
deep learning, or Julia Distributed.jl.
How it works: You set num_nodes to the number of nodes the job needs. Torc treats that job as
an exclusive whole-node reservation: if a job needs 4 nodes, those 4 nodes are reserved for that
step and are not shared with other jobs until the step completes. Torc passes
srun --nodes=<num_nodes> when launching the job, so the process spans multiple nodes within the
allocation. The job receives the standard Slurm step environment (SLURM_JOB_NODELIST,
SLURM_NTASKS, etc.), so MPI launchers and distributed frameworks work automatically.
Important: In slurm execution mode (the default), Torc wraps each job with srun. Do not use
srun or mpirun in your command — this would create nested process managers that conflict over
node and task placement. Just write the application command directly and let Torc handle the launch.
If you need explicit control over the MPI launcher (e.g., mpirun), use direct execution mode
instead (see example below).
Example (Slurm mode): A distributed training job that spans all 4 nodes in the allocation:
name: distributed_training
description: Distributed training across 4 nodes
resource_requirements:
- name: mpi_training
num_cpus: 32 # per node (128 total across 4 nodes)
memory: 128g # per node (512 GB total across 4 nodes)
num_nodes: 4 # allocates and spans 4 nodes
runtime: PT8H
jobs:
- name: train
command: python -m torch.distributed.run train.py
resource_requirements: mpi_training
scheduler: training_nodes
slurm_schedulers:
- name: training_nodes
account: myproject
nodes: 4
walltime: "12:00:00"
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: training_nodes
scheduler_type: slurm
num_allocations: 1
Here num_cpus: 32 and memory: 128g describe each of the 4 nodes. The total resources available
to the job are 128 CPUs and 512 GB. Torc launches this as
srun --nodes=4 python -m torch.distributed.run train.py.
Example (Direct mode): If you need to use mpirun or another MPI launcher explicitly, use
direct execution mode so Torc does not wrap the command with srun:
name: distributed_training_mpi
description: MPI training across 4 nodes with explicit mpirun
execution_config:
mode: direct
resource_requirements:
- name: mpi_training
num_cpus: 32
memory: 128g
num_nodes: 4
runtime: PT8H
jobs:
- name: train
command: mpirun python train.py
resource_requirements: mpi_training
scheduler: training_nodes
slurm_schedulers:
- name: training_nodes
account: myproject
nodes: 4
walltime: "12:00:00"
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: training_nodes
scheduler_type: slurm
num_allocations: 1
Choosing Between the Patterns
| Question | Pattern 1 (single-node jobs) | Pattern 2 (multi-node jobs) |
|---|---|---|
| Does each job fit on one node? | Yes | No |
| Does the job use MPI or distributed training? | No | Yes |
| Goal is throughput (many jobs in parallel)? | Yes | No |
| Goal is scaling one job across nodes? | No | Yes |
num_nodes setting | 1 (default) | Number of nodes needed |
Whole-Node Reservation Rule
Torc uses this scheduling rule inside a multi-node Slurm allocation:
- Single-node jobs (
num_nodes=1) may share a node. - Multi-node jobs (
num_nodes>1) reserve whole nodes exclusively.
This is intentionally conservative. Torc does not try to pack other work onto nodes that are part of an active multi-node step.
Allocation Strategy: One Large vs. Many Small
When running many single-node jobs (Pattern 1), you also need to decide how to request the underlying Slurm allocations. There are two approaches, each with trade-offs.
One multi-node allocation
Request all nodes in a single sbatch job (e.g., nodes: 4). In slurm mode, a single worker
distributes jobs across nodes via srun. In direct mode with start_one_worker_per_node, Torc runs
one worker per node and each worker executes jobs locally.
Advantages:
- Slurm schedulers typically give priority to larger allocations. On busy clusters, a 4-node job may start sooner than four separate 1-node jobs.
- Works well when all jobs take roughly the same amount of time, because you pay for all nodes until the last job finishes.
Disadvantages:
- Wasted node-hours when runtimes vary. If one job takes 4 hours and the rest take 30 minutes, three nodes sit idle for 3.5 hours waiting for the slow job to finish. You are billed for all four nodes for the full 4 hours.
- On heavily loaded clusters, large allocations can take much longer to schedule because Slurm must find all nodes free at the same time.
Many single-node allocations
Request separate 1-node allocations (e.g., num_allocations: 4 with nodes: 1). Each allocation
runs its own worker and pulls jobs independently.
Advantages:
- Tolerant of variable runtimes. Each allocation finishes as soon as its last job completes. Fast jobs release their nodes immediately instead of waiting for the slowest one.
- Single-node jobs are easier for Slurm to schedule because they can fill gaps in the queue. You may start running sooner and finish sooner overall.
- If one allocation fails or is preempted, the others keep running. The failed allocation's incomplete jobs return to the ready queue and are picked up by another worker.
Disadvantages:
- On clusters that prioritize large jobs, many small allocations may sit in the queue longer.
- More Slurm jobs to manage (though Torc handles this automatically).
Which to choose
| Scenario | Recommended strategy |
|---|---|
| Jobs have similar runtimes, cluster is busy | One multi-node allocation |
| Jobs have variable runtimes | Many single-node allocs |
| Cluster has long queue wait for large jobs | Many single-node allocs |
| Cluster prioritizes large jobs, queue is short | One multi-node allocation |
You can also mix strategies: use a multi-node allocation for a batch of similar jobs, then switch to single-node allocations for a stage with variable runtimes. Torc's scheduler actions make this easy to express per workflow stage.
Mixing Both Patterns
A workflow can combine both patterns, but the cleanest approach is to use separate stages or separate allocations. Once a true multi-node step starts, Torc reserves whole nodes for it exclusively.
For example, single-node preprocessing jobs followed by a multi-node training step:
name: preprocess_then_train
resource_requirements:
- name: preprocess
num_cpus: 4
memory: 16g
runtime: PT30M
- name: distributed_training
num_cpus: 32
memory: 128g
num_nodes: 4
runtime: PT4H
jobs:
- name: prep_{i}
command: python preprocess.py --shard {i}
resource_requirements: preprocess
scheduler: prep_nodes
parameters:
i: "1:8"
- name: train
command: python -m torch.distributed.run train.py
resource_requirements: distributed_training
scheduler: training_nodes
depends_on: [prep_1, prep_2, prep_3, prep_4, prep_5, prep_6, prep_7, prep_8]
slurm_schedulers:
- name: prep_nodes
account: myproject
nodes: 2
walltime: "01:00:00"
- name: training_nodes
account: myproject
nodes: 4
walltime: "06:00:00"
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: prep_nodes
scheduler_type: slurm
num_allocations: 1
- trigger_type: on_jobs_ready
action_type: schedule_nodes
jobs: [train]
scheduler: training_nodes
scheduler_type: slurm
num_allocations: 1
The preprocessing jobs run across 2 nodes (Pattern 1). When they complete, a 4-node allocation is
requested for the training job (Pattern 2). Torc wraps the training command with srun --nodes=4
automatically.
num_nodes
The num_nodes field controls how many nodes each job step spans (srun --nodes). It defaults to
1. The Slurm allocation size (sbatch --nodes) is set separately via the Slurm scheduler
configuration.
- Pattern 1:
num_nodes=1(default) -- each job runs on a single node; allocation size is set on the scheduler - Pattern 2:
num_nodes=N-- each job spans N nodes (MPI, distributed training)
Common Mistakes
Specifying total resources instead of per-node
# WRONG: 512g is the total across 4 nodes
resource_requirements:
- name: mpi_job
num_cpus: 128 # total across nodes
memory: 512g # total across nodes
num_nodes: 4
# CORRECT: 128g is what each of the 4 nodes provides
resource_requirements:
- name: mpi_job
num_cpus: 32 # per node (128 total)
memory: 128g # per node (512g total)
num_nodes: 4
Using srun or mpirun in job commands with slurm execution mode
In slurm mode (the default), Torc wraps each job with srun. Adding srun or mpirun to your
command creates nested process managers that conflict over node and task placement.
# WRONG: nested srun
command: srun --mpi=pmix python train.py
# WRONG: mpirun under Torc's srun
command: mpirun python train.py
# CORRECT: let Torc handle srun wrapping
command: python train.py
If you need explicit control over mpirun, use execution_config.mode: direct so Torc does not
wrap the command with srun.
Using num_nodes > 1 for independent jobs
If your jobs don't need inter-node communication, keep num_nodes=1 (the default) and let Torc
schedule them independently across nodes for maximum throughput.
See Also
- Slurm Overview — Auto-generated Slurm configuration with
submit-slurm - Advanced Slurm Configuration — Manual scheduler and srun wrapping details
- Resource Requirements Reference — Complete field reference