Workflow Actions
Workflow actions enable automatic execution of commands and resource allocation in response to workflow lifecycle events. Actions provide hooks for setup, teardown, monitoring, and dynamic resource management throughout workflow execution.
Overview
Actions consist of three components:
- Trigger - The condition that activates the action
- Action Type - The operation to perform
- Configuration - Parameters specific to the action
actions:
  - trigger_type: "on_workflow_start"
    action_type: "run_commands"
    commands:
      - "mkdir -p output logs"
      - "echo 'Workflow started' > logs/status.txt"
Trigger Types
Workflow Lifecycle Triggers
on_workflow_start
Executes once, at the workflow's first initialization.
When it fires: During the first initialize_jobs, after jobs are transitioned from
uninitialized to ready/blocked states. It does not fire again on reinitialize — a reinit is not
a new start (see Workflow Reinitialization).
Typical use cases:
- Scheduling Slurm allocations
- Creating directory structures
- Copying initial data
- trigger_type: "on_workflow_start"
  action_type: "run_commands"
  commands:
    - "mkdir -p output checkpoints temp"
    # Double quotes inside the command let the shell expand $(date)
    - 'echo "Workflow started at $(date)" > workflow.log'
on_workflow_complete
Executes once when all jobs reach terminal states (completed, failed, or canceled).
When it fires: After the last job completes, as detected by the job runner.
Typical use cases:
- Archiving final results
- Uploading to remote storage
- Cleanup of temporary files
- Generating summary reports
- trigger_type: "on_workflow_complete"
  action_type: "run_commands"
  commands:
    - "tar -czf results.tar.gz output/"
    - "aws s3 cp results.tar.gz s3://bucket/results/"
    - "rm -rf temp/"
Job-Based Triggers
on_jobs_ready
Executes when all specified jobs transition to the "ready" state.
When it fires: When the last specified job becomes ready to execute (all dependencies satisfied).
Typical use cases:
- Scheduling Slurm allocations
- Starting phase-specific monitoring
- Pre-computation setup
- Notifications before expensive operations
- trigger_type: "on_jobs_ready"
  action_type: "schedule_nodes"
  jobs: ["train_model_001", "train_model_002", "train_model_003"]
  scheduler: "gpu_cluster"
  scheduler_type: "slurm"
  num_allocations: 2
Important: The action triggers only when all matching jobs are ready, not individually as each job becomes ready.
on_jobs_complete
Executes when all specified jobs reach terminal states (completed, failed, or canceled).
When it fires: When the last specified job finishes execution.
Typical use cases:
- Scheduling Slurm allocations
- Cleaning up intermediate files
- Archiving phase results
- Freeing storage space
- Phase-specific reporting
- trigger_type: "on_jobs_complete"
  action_type: "run_commands"
  jobs: ["preprocess_1", "preprocess_2", "preprocess_3"]
  commands:
    - "echo 'Preprocessing phase complete' >> workflow.log"
    - "rm -rf raw_data/"
Worker Lifecycle Triggers
Worker lifecycle triggers are persistent by default, meaning they execute once per worker (job runner), not once per workflow.
on_worker_start
Executes when each worker (job runner) starts.
When it fires: After a job runner starts and checks for workflow actions, before claiming any jobs.
Typical use cases:
- Worker-specific initialization
- Setting up worker-local logging
- Copying data to compute node local storage
- Initializing worker-specific resources
- Recording worker startup metrics
- trigger_type: "on_worker_start"
  action_type: "run_commands"
  persistent: true # Each worker executes this
  commands:
    # Double quotes inside the command let the shell expand $(hostname) and $(date)
    - 'echo "Worker started on $(hostname) at $(date)" >> worker.log'
    - "mkdir -p worker_temp"
on_worker_complete
Executes when each worker completes (exits the main loop).
When it fires: After a worker finishes processing jobs and before it shuts down.
Typical use cases:
- Worker-specific cleanup
- Uploading worker-specific logs
- Recording worker completion metrics
- Cleaning up worker-local resources
- trigger_type: "on_worker_complete"
  action_type: "run_commands"
  persistent: true # Each worker executes this
  commands:
    # Double quotes inside the command let the shell expand $(hostname) and $(date)
    - 'echo "Worker completed on $(hostname) at $(date)" >> worker.log'
    - "rm -rf worker_temp"
Job Selection
For on_jobs_ready and on_jobs_complete triggers, specify which jobs to monitor.
Exact Job Names
- trigger_type: "on_jobs_complete"
  action_type: "run_commands"
  jobs: ["job1", "job2", "job3"]
  commands:
    - "echo 'Specific jobs complete'"
Regular Expressions
- trigger_type: "on_jobs_ready"
  action_type: "schedule_nodes"
  job_name_regexes: ["train_model_[0-9]+", "eval_.*"]
  scheduler: "gpu_cluster"
  scheduler_type: "slurm"
  num_allocations: 2
Common regex patterns:
- "train_.*" - All jobs starting with "train_"
- "model_[0-9]+" - Jobs like "model_1", "model_2"
- ".*_stage1" - All jobs ending with "_stage1"
- "job_(a|b|c)" - Jobs "job_a", "job_b", or "job_c"
Combining Selection Methods
You can use both together - the action triggers when all matching jobs meet the condition:
jobs: ["critical_job"]
job_name_regexes: ["batch_.*"]
# Triggers when "critical_job" AND all "batch_*" jobs are ready/complete
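To preview which jobs a selector would gate on, the union of exact names and regex matches can be tried in a few lines of Python. This is a minimal sketch, not Torc's implementation: `gating_jobs` and `condition_met` are hypothetical helpers, and the sketch assumes a pattern must match the entire job name (the server's regex engine and anchoring rules may differ).

```python
import re


def gating_jobs(all_jobs, jobs=(), job_name_regexes=()):
    """Return the set of job names selected by exact name or by regex.

    Assumption: a regex must match the whole job name (re.fullmatch).
    """
    exact = set(jobs)
    patterns = [re.compile(p) for p in job_name_regexes]
    return {
        name
        for name in all_jobs
        if name in exact or any(p.fullmatch(name) for p in patterns)
    }


def condition_met(job_states, gates, wanted_states):
    """True only when every gating job is in one of the wanted states."""
    return all(job_states[name] in wanted_states for name in gates)


job_states = {
    "critical_job": "ready",
    "batch_1": "ready",
    "batch_2": "blocked",
    "unrelated": "blocked",
}
gates = gating_jobs(job_states, jobs=["critical_job"], job_name_regexes=["batch_.*"])
print(sorted(gates))                                # ['batch_1', 'batch_2', 'critical_job']
print(condition_met(job_states, gates, {"ready"}))  # False: batch_2 is still blocked

job_states["batch_2"] = "ready"
print(condition_met(job_states, gates, {"ready"}))  # True: every gating job is ready
```

The action stays idle while even one selected job lags behind, which matches the "all matching jobs" rule described above.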
Action Types
run_commands
Execute shell commands sequentially on a compute node.
Configuration:
- trigger_type: "on_workflow_complete"
  action_type: "run_commands"
  commands:
    - "tar -czf results.tar.gz output/"
    - "aws s3 cp results.tar.gz s3://bucket/"
Execution details:
- Commands run in the workflow's output directory
- Commands execute sequentially (one after another)
- If a command fails, the action fails (but the workflow continues)
- Commands run on compute nodes, not the submission node
- Uses the shell environment of the job runner process
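The execution details above can be approximated with a short Python sketch for testing a command list locally. It is an illustration under stated assumptions, not the job runner's code: `run_commands` here is a hypothetical helper, and stopping at the first failing command is an assumption (the text only says that the action fails).

```python
import subprocess


def run_commands(commands, cwd="."):
    """Run shell commands one after another in cwd; return True if all succeed.

    Assumption: execution stops at the first command that exits non-zero.
    The child shell inherits the environment of the calling process.
    """
    for command in commands:
        result = subprocess.run(command, shell=True, cwd=cwd)
        if result.returncode != 0:
            print(f"action failed: {command!r} exited with {result.returncode}")
            return False
    return True


if __name__ == "__main__":
    ok = run_commands(["echo 'step 1'", "false", "echo 'never reached'"])
    print("action succeeded" if ok else "action failed, workflow continues")
```

Running a list through a helper like this before adding it to a workflow spec is a cheap way to catch quoting mistakes and missing tools.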
schedule_nodes
Dynamically allocate compute resources from a Slurm scheduler.
Configuration:
- trigger_type: "on_jobs_ready"
  action_type: "schedule_nodes"
  jobs: ["train_model_1", "train_model_2"]
  scheduler: "gpu_cluster"
  scheduler_type: "slurm"
  num_allocations: 2
Parameters:
- scheduler (required) - Name of Slurm scheduler configuration (must exist in slurm_schedulers)
- scheduler_type (required) - Must be "slurm"
- num_allocations (required) - Number of Slurm allocation requests to submit
- start_one_worker_per_node (optional, default: false) - Launch one worker per allocated node via srun --ntasks-per-node=1. Use this for direct-mode workflows with single-node jobs sharing a multi-node allocation. Not compatible with execution_config.mode: slurm.
Use cases:
- Just-in-time resource allocation
- Cost optimization (allocate only when needed)
- Separating workflow phases with different resource requirements
Complete Examples
Refer to this example
Execution Model
Action Claiming and Execution
- Atomic Claiming: Actions are claimed atomically by workers to prevent duplicate execution
- Non-Persistent Actions: Execute once per workflow (first worker to claim executes)
- Persistent Actions: Can be claimed and executed by multiple workers
- Trigger Counting: Job-based triggers increment a counter as jobs transition; the action becomes pending when the count reaches its threshold
- Immediate Availability: Worker lifecycle actions are immediately available after workflow initialization
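The claiming rules in this list can be modeled with a small in-memory class. This is a toy sketch under stated assumptions, not the server's schema: the `Action` class is hypothetical, a `threading.Lock` stands in for whatever atomic update the server performs, and the threshold is assumed to be the number of gating jobs.

```python
import threading


class Action:
    """Toy model of one workflow action (hypothetical, not Torc's data model)."""

    def __init__(self, threshold, persistent=False):
        self.threshold = threshold   # assumed: number of gating jobs
        self.persistent = persistent
        self.trigger_count = 0
        self.executed_by = set()     # worker ids that have claimed the action
        self._lock = threading.Lock()

    def record_transition(self):
        """Count one gating job reaching the state the trigger waits for."""
        with self._lock:
            self.trigger_count += 1

    def try_claim(self, worker_id):
        """Atomically claim the action; True means this worker should run it."""
        with self._lock:
            if self.trigger_count < self.threshold:
                return False         # not pending yet
            if worker_id in self.executed_by:
                return False         # this worker already ran it
            if not self.persistent and self.executed_by:
                return False         # another worker claimed it first
            self.executed_by.add(worker_id)
            return True


# Non-persistent, job-based action gated on three jobs.
archive = Action(threshold=3)
archive.record_transition()
archive.record_transition()
print(archive.try_claim("worker-1"))  # False: only 2 of 3 jobs have transitioned
archive.record_transition()
print(archive.try_claim("worker-1"))  # True: first claimer executes
print(archive.try_claim("worker-2"))  # False: already executed once

# Persistent worker lifecycle action: available immediately, once per worker.
startup = Action(threshold=0, persistent=True)
print(startup.try_claim("worker-1"))  # True
print(startup.try_claim("worker-2"))  # True
print(startup.try_claim("worker-1"))  # False: same worker does not repeat it
```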
Action Lifecycle
[Workflow Created]
↓
[initialize_jobs called]
↓
├─→ on_workflow_start actions become pending
├─→ on_worker_start actions become pending (persistent)
├─→ on_worker_complete actions become pending (persistent)
└─→ on_jobs_ready actions wait for job transitions
↓
[Worker Claims and Executes Actions]
↓
[Jobs Execute]
↓
[Jobs Complete]
↓
├─→ on_jobs_complete actions become pending when all specified jobs complete
└─→ on_workflow_complete actions become pending when all jobs complete
↓
[Workers Exit]
↓
[on_worker_complete actions execute per worker]
Important Characteristics
- No Rollback: Failed actions don't affect workflow execution
- Compute Node Execution: Actions run on compute nodes via job runners
- One-Time Triggers: Non-persistent actions trigger once when conditions are first met
- No Inter-Action Dependencies: Actions don't depend on other actions
- Concurrent Workers: Multiple workers can execute different actions simultaneously
Workflow Reinitialization
Every reinitialize entry point — torc workflows reinit, torc jobs reset-status --reinit,
torc recover, and the Slurm regenerate/watch paths — funnels through the single server routine
reset_actions_for_reinitialize, which runs as part of initialize_jobs. For each action it
recomputes trigger_count from the current job states and then decides whether to re-arm the
action (clear executed/executed_by so it can fire again on the new run) or keep it executed
(leave it suppressed).
The decision follows one rule, independent of action_type:
Re-arm an action iff its triggering condition will genuinely re-occur in the new run.
A user who reinitializes after resetting a subset of jobs expects only those jobs to re-run.
Actions tied to events that are not happening again must not re-fire — a re-fired action re-submits
its Slurm allocation (schedule_nodes) or re-runs its commands (run_commands), and that duplicate
side effect is the bug. This is why the decision is not specific to schedule_nodes: the hazard
is "duplicate side effect," not "duplicate allocation."
Re-arm decision
flowchart TD
start([For each non-recovery action<br/>during reinitialize]) --> recompute[Recompute trigger_count<br/>from current job states]
recompute --> trig{Trigger type?}
trig -- on_workflow_start --> keep
trig -- "on_workflow_complete<br/>on_worker_start / on_worker_complete" --> rearm
trig -- "on_jobs_ready /<br/>on_jobs_complete" --> terminal{All gating jobs still<br/>in a terminal state?<br/>completed / failed /<br/>canceled / terminated}
terminal -- "Yes — subset re-run<br/>(gates untouched,<br/>event already happened)" --> keep
terminal -- "No — full re-run<br/>(a gate was reset,<br/>now ready/blocked)" --> rearm
rearm[Re-arm:<br/>executed = 0, executed_by = NULL<br/>action can fire again]
keep[Keep executed:<br/>action stays suppressed<br/>no duplicate side effect]
rearm --> done([Write trigger_count])
keep --> done
style keep fill:#d4edda,stroke:#28a745,color:#155724
style rearm fill:#fff3cd,stroke:#ffc107,color:#856404
style terminal fill:#cce5ff,stroke:#004085,color:#004085
Per trigger:
- on_workflow_start → keep. The workflow starts exactly once in its lifetime; a reinitialize is not a new start. This is what stops plain reinit / reset-status --reinit from re-running start-time setup or re-submitting the original node count. (A never-fired action keeps executed = 0 and still fires, so a first real initialize is unaffected.)
- on_jobs_ready / on_jobs_complete → keep iff the gating jobs are still all terminal. That is the subset re-run (the gates were untouched, so the event already happened and is not recurring). Re-arm when any gate was reset: that is the full re-run, where that job will run again and its action should fire again with it.
- on_workflow_complete, on_worker_start, on_worker_complete → re-arm. These recur every run (the workflow will complete again; new workers start), so they should fire again.
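The per-trigger rule can be written as a single pure function. The sketch below is illustrative and is not the body of reset_actions_for_reinitialize: `should_rearm` is a hypothetical name, job states are modeled as lowercase strings, and recovery actions (which the flowchart excludes) are not handled.

```python
TERMINAL_STATES = {"completed", "failed", "canceled", "terminated"}
ALWAYS_REARM = {"on_workflow_complete", "on_worker_start", "on_worker_complete"}
JOB_GATED = {"on_jobs_ready", "on_jobs_complete"}


def should_rearm(trigger_type, gating_job_states=()):
    """Return True to re-arm the action, False to keep it executed."""
    if trigger_type == "on_workflow_start":
        return False  # a reinitialize is not a new start
    if trigger_type in ALWAYS_REARM:
        return True   # these events recur on every run
    if trigger_type in JOB_GATED:
        # Keep only if every gating job is still terminal (subset re-run);
        # re-arm as soon as any gate was reset (full re-run).
        return not all(state in TERMINAL_STATES for state in gating_job_states)
    raise ValueError(f"unknown trigger type: {trigger_type}")


print(should_rearm("on_workflow_start"))                          # False
print(should_rearm("on_worker_start"))                            # True
print(should_rearm("on_jobs_complete", ["completed", "failed"]))  # False
print(should_rearm("on_jobs_complete", ["completed", "ready"]))   # True
```

Note that the function never looks at action_type, which mirrors the statement that the decision is independent of whether the action schedules nodes or runs commands.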
Why "terminal state" and not trigger_count. The job-gated test measures the gating jobs with
the on_jobs_complete notion (terminal jobs only), not the action's own trigger_count. For an
on_jobs_ready action a reset gate returns to Ready, and Ready already satisfies
on_jobs_ready, so trigger_count cannot distinguish a freshly-reset gate from one that already
completed. Using it would wrongly suppress the action on a full re-run and stall the workflow. The
terminal-state count drops as soon as a gate is reset, which is exactly the signal that the action
must fire again.
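A small numeric illustration of the two signals may help. The sketch assumes that a job counts toward an on_jobs_ready trigger_count once it is ready or in any later state; the exact set of counted states, and the lowercase state names, are assumptions made for the example.

```python
TERMINAL = {"completed", "failed", "canceled", "terminated"}
# Assumption: these states all satisfy the on_jobs_ready condition.
READY_OR_LATER = TERMINAL | {"ready", "running"}


def ready_trigger_count(gate_states):
    """How many gating jobs satisfy on_jobs_ready."""
    return sum(state in READY_OR_LATER for state in gate_states)


def terminal_count(gate_states):
    """How many gating jobs are in a terminal state."""
    return sum(state in TERMINAL for state in gate_states)


untouched = ["completed", "completed"]  # subset re-run: gates were not reset
reset_one = ["ready", "completed"]      # full re-run: the first gate was reset

for label, states in [("untouched", untouched), ("reset_one", reset_one)]:
    print(label, ready_trigger_count(states), terminal_count(states))
# untouched 2 2
# reset_one 2 1
```

The first column is identical in both scenarios, so it cannot drive the decision; only the terminal count changes when a gate is reset.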
Example — subset re-run vs. full re-run:
gate_job runs first; an on_jobs_complete action (a schedule_nodes allocation or a run_commands archive step) fires when it completes. A later work_job depends on gate_job.
Subset re-run (the common reschedule-the-failure case):
- gate_job completes (action fires); work_job later fails.
- User resets only the failed job and reinitializes. gate_job stays Completed (terminal).
- The action is kept executed: it is not re-armed, so no duplicate allocation/command runs when a worker starts.
Full re-run (the gate itself is re-run):
- Same first run as above.
- User resets gate_job itself and reinitializes. gate_job returns to a non-terminal state.
- The action is re-armed, so when gate_job completes again in the new run it fires again.
With multiple gating jobs, the action is only kept executed when all of them remain terminal; resetting any one of them re-arms it.
The client-side recover/regenerate guards (mark_satisfied_schedule_actions_executed) remain as
defense in depth, but the server is now the single point that prevents reinit from resurrecting an
already-satisfied action, so all entry points are covered uniformly.
Limitations
- No Action Dependencies: Actions cannot depend on other actions completing
- No Conditional Execution: Actions cannot have conditional logic (use multiple actions with different job selections instead)
- No Action Retries: Failed actions are not automatically retried
- Single Action Type: Each action has one action_type (cannot combine run_commands and schedule_nodes)
- Job Selectors Are Spec-Time Only: An action's job names/patterns are fixed when the workflow
is created and don't update if new jobs are added later (e.g. by
spawn_jobs)
For complex workflows requiring these features, consider:
- Using job dependencies to order operations
- Creating separate jobs for conditional logic
- Implementing retry logic within command scripts
- Creating multiple actions for different scenarios
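Because actions are not retried automatically, retry logic has to live in the command itself. One possible shape is a small wrapper script, sketched here in Python; `run_with_retries`, its parameters, and the idea of saving it as a standalone script invoked from a commands entry are all illustrative assumptions, not part of Torc.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, delay_seconds=1.0):
    """Run a shell command, retrying on non-zero exit; return the last exit code."""
    returncode = 1
    for attempt in range(1, attempts + 1):
        returncode = subprocess.run(command, shell=True).returncode
        if returncode == 0:
            break
        if attempt < attempts:
            time.sleep(delay_seconds)  # fixed pause between attempts
    return returncode


if __name__ == "__main__":
    # Example: retry a flaky step up to three times before giving up.
    code = run_with_retries("test -d output", attempts=3, delay_seconds=0.5)
    print("exit code:", code)
```

A fixed delay keeps the sketch short; for remote uploads, an increasing delay between attempts is often a better fit.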