# Workflow Recovery
Torc provides mechanisms for recovering workflows when Slurm allocations are preempted or fail
before completing all jobs. The `torc slurm regenerate` command creates new schedulers and
allocations for pending jobs.
## The Recovery Problem
When running workflows on Slurm, allocations can fail or be preempted before all jobs complete. This leaves workflows in a partial state with:
- **Ready/uninitialized jobs**: Jobs that were waiting to run but never got scheduled
- **Blocked jobs**: Jobs whose dependencies haven't completed yet
- **Orphaned running jobs**: Jobs still marked as "running" in the database even though their Slurm allocation has terminated
Simply creating new Slurm schedulers and submitting allocations isn't enough because:
- **Orphaned jobs block recovery**: Jobs stuck in "running" status prevent the workflow from being considered complete, blocking recovery precondition checks
- **Duplicate allocations**: If the workflow had `on_workflow_start` actions to schedule nodes, those actions would fire again when the workflow is reinitialized, creating duplicate allocations
- **Missing allocations for blocked jobs**: Blocked jobs will eventually become ready, but there's no mechanism to schedule new allocations for them
## Orphan Detection
Before recovery can proceed, orphaned jobs must be detected and their status corrected. This is
handled by the orphan detection module (`src/client/commands/orphan_detection.rs`).
### How It Works
The orphan detection logic checks for three types of orphaned resources:
- **Active allocations with terminated Slurm jobs**: `ScheduledComputeNode`s marked as "active" in the database, but whose Slurm job is no longer running (verified via `squeue`)
- **Pending allocations that disappeared**: `ScheduledComputeNode`s marked as "pending" whose Slurm job no longer exists (cancelled or failed before starting)
- **Running jobs with no active compute nodes**: Jobs marked as "running" but with no active compute nodes to process them (a fallback for non-Slurm cases)
```mermaid
flowchart TD
    A[Start Orphan Detection] --> B[List active ScheduledComputeNodes]
    B --> C{For each Slurm allocation}
    C --> D[Check squeue for job status]
    D --> E{Job still running?}
    E -->|Yes| C
    E -->|No| F[Find jobs on this allocation]
    F --> G[Mark jobs as failed]
    G --> H[Update ScheduledComputeNode to complete]
    H --> C
    C --> I[List pending ScheduledComputeNodes]
    I --> J{For each pending allocation}
    J --> K[Check squeue for job status]
    K --> L{Job exists?}
    L -->|Yes| J
    L -->|No| M[Update ScheduledComputeNode to complete]
    M --> J
    J --> N[Check for running jobs with no active nodes]
    N --> O[Mark orphaned jobs as failed]
    O --> P[Done]

    style A fill:#4a9eff,color:#fff
    style B fill:#4a9eff,color:#fff
    style C fill:#6c757d,color:#fff
    style D fill:#4a9eff,color:#fff
    style E fill:#6c757d,color:#fff
    style F fill:#4a9eff,color:#fff
    style G fill:#dc3545,color:#fff
    style H fill:#4a9eff,color:#fff
    style I fill:#4a9eff,color:#fff
    style J fill:#6c757d,color:#fff
    style K fill:#4a9eff,color:#fff
    style L fill:#6c757d,color:#fff
    style M fill:#4a9eff,color:#fff
    style N fill:#4a9eff,color:#fff
    style O fill:#dc3545,color:#fff
    style P fill:#28a745,color:#fff
```
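At the core of each check is a per-allocation `squeue` lookup. Below is a minimal sketch of that liveness test; the helper name and error handling are illustrative assumptions, not torc's actual API:

```rust
use std::process::Command;

/// Returns true if the given Slurm job id is still known to `squeue`.
/// Hypothetical helper for illustration; torc's real check lives in
/// src/client/commands/orphan_detection.rs.
fn slurm_job_is_alive(slurm_job_id: &str) -> std::io::Result<bool> {
    // `squeue -h -j <id>` prints one line per matching job and prints
    // nothing once the job has completed, failed, or been cancelled.
    let output = Command::new("squeue")
        .args(["-h", "-j", slurm_job_id])
        .output()?;
    Ok(!output.stdout.is_empty())
}
```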
### Integration Points
Orphan detection is integrated into two commands:
- `torc recover`: Runs orphan detection automatically as the first step before checking preconditions. This ensures that orphaned jobs don't block recovery.
- `torc workflows sync-status`: A standalone command that runs orphan detection without triggering a full recovery. Useful for debugging or for cleaning up orphaned jobs without submitting new allocations.
### The `torc watch` Command
The `torc watch` command also performs orphan detection during its polling loop. When it detects
that no valid Slurm allocations exist (via a quick `squeue` check), it runs the full orphan
detection logic to clean up any orphaned jobs before checking whether the workflow can make progress.
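A rough sketch of that polling step, reusing the hypothetical `slurm_job_is_alive` helper from above (the control flow here is an assumption for illustration):

```rust
// Sketch only: reports whether any tracked allocation is still alive,
// falling through to the full orphan-detection pass when none are.
fn any_allocation_alive(active_slurm_job_ids: &[String]) -> bool {
    let any_alive = active_slurm_job_ids
        .iter()
        .any(|id| slurm_job_is_alive(id).unwrap_or(false));
    if !any_alive {
        // This is where torc runs the full orphan detection logic,
        // marking orphaned jobs as failed before progress is evaluated.
    }
    any_alive
}
```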
## Recovery Actions
The recovery system uses ephemeral recovery actions to solve these problems.
### How It Works
When `torc slurm regenerate` runs:
```mermaid
flowchart TD
    A[torc slurm regenerate] --> B[Fetch pending jobs]
    B --> C{Has pending jobs?}
    C -->|No| D[Exit - nothing to do]
    C -->|Yes| E[Build WorkflowGraph from pending jobs]
    E --> F[Mark existing schedule_nodes actions as executed]
    F --> G[Group jobs using scheduler_groups]
    G --> H[Create schedulers for each group]
    H --> I[Update jobs with scheduler assignments]
    I --> J[Create on_jobs_ready recovery actions for deferred groups]
    J --> K{Submit allocations?}
    K -->|Yes| L[Submit Slurm allocations]
    K -->|No| M[Done]
    L --> M

    style A fill:#4a9eff,color:#fff
    style B fill:#4a9eff,color:#fff
    style C fill:#6c757d,color:#fff
    style D fill:#6c757d,color:#fff
    style E fill:#4a9eff,color:#fff
    style F fill:#4a9eff,color:#fff
    style G fill:#4a9eff,color:#fff
    style H fill:#4a9eff,color:#fff
    style I fill:#4a9eff,color:#fff
    style J fill:#ffc107,color:#000
    style K fill:#6c757d,color:#fff
    style L fill:#ffc107,color:#000
    style M fill:#28a745,color:#fff
```
### Step 1: Mark Existing Actions as Executed
All existing `schedule_nodes` actions are marked as executed using the `claim_action` API. This
prevents them from firing again and creating duplicate allocations:
```mermaid
sequenceDiagram
    participant R as regenerate
    participant S as Server
    participant A as workflow_action table
    R->>S: get_workflow_actions(workflow_id)
    S-->>R: [action1, action2, ...]
    loop For each schedule_nodes action
        R->>S: claim_action(action_id)
        S->>A: UPDATE executed=1, executed_at=NOW()
    end
```
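In code, this step amounts to claiming every unexecuted `schedule_nodes` action. A self-contained sketch, with simplified stand-in types rather than torc's actual client API:

```rust
// Illustrative types; the real ones live in the torc client crate.
struct WorkflowAction {
    id: i64,
    action: String, // e.g. "schedule_nodes"
    executed: bool,
}

trait Server {
    fn get_workflow_actions(&self, workflow_id: i64) -> Vec<WorkflowAction>;
    /// Sets executed=1 and executed_at=NOW() in the workflow_action table.
    fn claim_action(&self, action_id: i64);
}

fn mark_schedule_actions_executed(server: &impl Server, workflow_id: i64) {
    for action in server.get_workflow_actions(workflow_id) {
        // Only schedule_nodes actions can create duplicate allocations,
        // so only those are claimed.
        if action.action == "schedule_nodes" && !action.executed {
            server.claim_action(action.id);
        }
    }
}
```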
### Step 2: Group Jobs Using `WorkflowGraph`
The system builds a `WorkflowGraph` from pending jobs and uses `scheduler_groups()` to group them by
`(resource_requirements, has_dependencies)`. This aligns with the behavior of
`torc workflows create-slurm`:
- **Jobs without dependencies**: Can be scheduled immediately with `on_workflow_start`
- **Jobs with dependencies (deferred)**: Need `on_jobs_ready` recovery actions to schedule them when they become ready
```mermaid
flowchart TD
    subgraph pending["Pending Jobs"]
        A[Job A: no deps, rr=default]
        B[Job B: no deps, rr=default]
        C[Job C: depends on A, rr=default]
        D[Job D: no deps, rr=gpu]
    end
    subgraph groups["Scheduler Groups"]
        G1[Group 1: default, no deps<br/>Jobs: A, B]
        G2[Group 2: default, has deps<br/>Jobs: C]
        G3[Group 3: gpu, no deps<br/>Jobs: D]
    end
    A --> G1
    B --> G1
    C --> G2
    D --> G3

    style A fill:#4a9eff,color:#fff
    style B fill:#4a9eff,color:#fff
    style C fill:#ffc107,color:#000
    style D fill:#17a2b8,color:#fff
    style G1 fill:#28a745,color:#fff
    style G2 fill:#28a745,color:#fff
    style G3 fill:#28a745,color:#fff
```
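The grouping itself is a keyed partition over the pending jobs. A minimal sketch of the idea, using simplified job records (the real logic is `WorkflowGraph::scheduler_groups()` in `src/client/workflow_graph.rs`):

```rust
use std::collections::HashMap;

// Simplified stand-in for torc's job record.
struct PendingJob {
    id: i64,
    resource_requirements: String, // e.g. "default", "gpu"
    depends_on_job_ids: Vec<i64>,
}

/// Partitions jobs by (resource_requirements, has_dependencies),
/// mirroring the create-slurm grouping described above.
fn scheduler_groups(jobs: &[PendingJob]) -> HashMap<(String, bool), Vec<i64>> {
    let mut groups: HashMap<(String, bool), Vec<i64>> = HashMap::new();
    for job in jobs {
        let key = (
            job.resource_requirements.clone(),
            !job.depends_on_job_ids.is_empty(),
        );
        groups.entry(key).or_default().push(job.id);
    }
    groups
}
```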
### Step 3: Create Recovery Actions for Deferred Groups
For groups with `has_dependencies = true`, the system creates `on_jobs_ready` recovery actions.
These actions:
- Have `is_recovery = true` to mark them as ephemeral
- Use a `_deferred` suffix in the scheduler name
- Trigger when the blocked jobs become ready
- Schedule additional Slurm allocations for those jobs
```mermaid
flowchart LR
    subgraph workflow["Original Workflow"]
        A[Job A: blocked] --> C[Job C: blocked]
        B[Job B: blocked] --> C
    end
    subgraph actions["Recovery Actions"]
        RA["on_jobs_ready: schedule_nodes<br/>job_ids: (A, B)<br/>is_recovery: true"]
        RC["on_jobs_ready: schedule_nodes<br/>job_ids: (C)<br/>is_recovery: true"]
    end

    style A fill:#6c757d,color:#fff
    style B fill:#6c757d,color:#fff
    style C fill:#6c757d,color:#fff
    style RA fill:#ffc107,color:#000
    style RC fill:#ffc107,color:#000
```
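Conceptually, each deferred group yields one such action. A sketch with assumed field names, modeled on the properties listed above rather than torc's actual structs:

```rust
// Assumed shape of a recovery action; field names are illustrative.
struct RecoveryAction {
    trigger: &'static str,  // fires when the listed jobs become ready
    action: &'static str,   // schedules additional Slurm allocations
    job_ids: Vec<i64>,
    scheduler_name: String, // carries the "_deferred" suffix
    is_recovery: bool,      // marks the action as ephemeral
}

fn deferred_recovery_action(job_ids: Vec<i64>, base_name: &str) -> RecoveryAction {
    RecoveryAction {
        trigger: "on_jobs_ready",
        action: "schedule_nodes",
        job_ids,
        scheduler_name: format!("{base_name}_deferred"),
        is_recovery: true,
    }
}
```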
### Recovery Action Lifecycle
Recovery actions are ephemeral - they exist only during the recovery period:
```mermaid
stateDiagram-v2
    [*] --> Created: regenerate creates action
    Created --> Executed: Jobs become ready, action triggers
    Executed --> Deleted: Workflow reinitialized
    Created --> Deleted: Workflow reinitialized

    classDef created fill:#ffc107,color:#000
    classDef executed fill:#28a745,color:#fff
    classDef deleted fill:#6c757d,color:#fff

    class Created created
    class Executed executed
    class Deleted deleted
```
When a workflow is reinitialized (e.g., after resetting jobs), all recovery actions are deleted and original actions are reset to their initial state. This ensures a clean slate for the next run.
## Database Schema
Recovery actions are tracked using the `is_recovery` column in the `workflow_action` table:

| Column | Type | Description |
|---|---|---|
| `is_recovery` | `INTEGER` | 0 = normal action, 1 = recovery action |
### Behavior Differences
| Operation | Normal Actions | Recovery Actions |
|---|---|---|
| On `reset_actions_for_reinitialize` | `executed` reset to 0 | Deleted entirely |
| Created by | Workflow spec | `torc slurm regenerate` |
| Purpose | Configured behavior | Temporary recovery |
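The first row of this table can be read as the following in-memory sketch; the real `reset_actions_for_reinitialize` in `src/server/api/workflow_actions.rs` does this with SQL rather than structs:

```rust
struct WorkflowAction {
    executed: bool,
    is_recovery: bool,
}

fn reset_actions_for_reinitialize(actions: &mut Vec<WorkflowAction>) {
    // Recovery actions are ephemeral: drop them entirely.
    actions.retain(|a| !a.is_recovery);
    // Normal actions are reset to their initial, unexecuted state.
    for action in actions.iter_mut() {
        action.executed = false;
    }
}
```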
## Usage
```bash
# Regenerate schedulers for pending jobs
torc slurm regenerate <workflow_id> --account <account>

# With automatic submission
torc slurm regenerate <workflow_id> --account <account> --submit

# Using a specific HPC profile
torc slurm regenerate <workflow_id> --account <account> --profile kestrel
```
## Implementation Details
The recovery logic is implemented in:
- `src/client/commands/orphan_detection.rs`: Shared orphan detection logic used by `recover`, `watch`, and `workflows sync-status`
- `src/client/commands/recover.rs`: Main recovery command implementation
- `src/client/commands/slurm.rs`: `handle_regenerate` function for Slurm scheduler regeneration
- `src/client/workflow_graph.rs`: `WorkflowGraph::from_jobs()` and `scheduler_groups()` methods
- `src/server/api/workflow_actions.rs`: `reset_actions_for_reinitialize` function
- `migrations/20251225000000_add_is_recovery_to_workflow_action.up.sql`: Schema migration
Key implementation notes:
- **WorkflowGraph construction**: A `WorkflowGraph` is built from pending jobs using `from_jobs()`, which reconstructs the dependency structure from `depends_on_job_ids`
- **Scheduler grouping**: Jobs are grouped using `scheduler_groups()` by `(resource_requirements, has_dependencies)`, matching `create-slurm` behavior
- **Deferred schedulers**: Groups with dependencies get a `_deferred` suffix in the scheduler name
- **Allocation calculation**: The number of allocations is based on job count and resources per node (see the sketch after this list)
- **Recovery actions**: Only deferred groups (jobs with dependencies) get `on_jobs_ready` recovery actions
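For the allocation calculation, the natural arithmetic is ceiling division of a group's job count by the number of jobs that fit on one node. A hedged sketch, since the exact formula in torc may account for more factors:

```rust
/// Number of allocations needed to cover `job_count` jobs when each
/// node can run `jobs_per_node` jobs concurrently (illustrative only).
fn num_allocations(job_count: u64, jobs_per_node: u64) -> u64 {
    // Ceiling division; guard against a zero divisor.
    job_count.div_ceil(jobs_per_node.max(1))
}
```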