Workflow Recovery

Torc provides mechanisms for recovering workflows when Slurm allocations are preempted or fail before completing all jobs. The torc slurm regenerate command creates new schedulers and allocations for pending jobs.

The Recovery Problem

When running workflows on Slurm, allocations can fail or be preempted before all jobs complete. This leaves workflows in a partial state with:

  1. Ready/uninitialized jobs - Jobs that were waiting to run but never got scheduled
  2. Blocked jobs - Jobs whose dependencies haven't completed yet
  3. Orphaned running jobs - Jobs still marked as "running" in the database even though their Slurm allocation has terminated

Simply creating new Slurm schedulers and submitting allocations isn't enough because:

  1. Orphaned jobs block recovery: Jobs stuck in "running" status prevent the workflow from being considered complete, blocking recovery precondition checks
  2. Duplicate allocations: If the workflow had on_workflow_start actions to schedule nodes, those actions would fire again when the workflow is reinitialized, creating duplicate allocations
  3. Missing allocations for blocked jobs: Blocked jobs will eventually become ready, but there's no mechanism to schedule new allocations for them

Orphan Detection

Before recovery can proceed, orphaned jobs must be detected and their status corrected. This is handled by the orphan detection module (src/client/commands/orphan_detection.rs).

How It Works

The orphan detection logic checks for three types of orphaned resources:

  1. Active allocations with terminated Slurm jobs: ScheduledComputeNodes marked as "active" in the database, but whose Slurm job is no longer running (verified via squeue)

  2. Pending allocations that disappeared: ScheduledComputeNodes marked as "pending" whose Slurm job no longer exists (cancelled or failed before starting)

  3. Running jobs with no active compute nodes: Jobs marked as "running" but with no active compute nodes to process them (fallback for non-Slurm cases)
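
As a rough sketch of cases 1 and 2, the liveness test can shell out to squeue. The ScheduledComputeNode shape below is simplified for illustration, and a fuller check would also parse the reported job state (for example with squeue -h -j <id> -o %T):

```rust
use std::process::Command;

/// Simplified stand-in for the database record.
struct ScheduledComputeNode {
    id: u64,
    slurm_job_id: String,
}

/// Returns true if Slurm still knows about the job.
/// `squeue -h -j <id>` prints a line per matching job and nothing
/// (plus a nonzero exit) when the job is gone.
fn slurm_job_exists(slurm_job_id: &str) -> std::io::Result<bool> {
    let output = Command::new("squeue")
        .args(["-h", "-j", slurm_job_id])
        .output()?;
    Ok(output.status.success() && !output.stdout.is_empty())
}

/// Collect "active" allocations whose Slurm job has terminated.
fn find_orphaned(active_nodes: &[ScheduledComputeNode]) -> Vec<u64> {
    active_nodes
        .iter()
        .filter(|n| matches!(slurm_job_exists(&n.slurm_job_id), Ok(false)))
        .map(|n| n.id)
        .collect()
}
```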

```mermaid
flowchart TD
    A[Start Orphan Detection] --> B[List active ScheduledComputeNodes]
    B --> C{For each Slurm allocation}
    C --> D[Check squeue for job status]
    D --> E{Job still running?}
    E -->|Yes| C
    E -->|No| F[Find jobs on this allocation]
    F --> G[Mark jobs as failed]
    G --> H[Update ScheduledComputeNode to complete]
    H --> C
    C --> I[List pending ScheduledComputeNodes]
    I --> J{For each pending allocation}
    J --> K[Check squeue for job status]
    K --> L{Job exists?}
    L -->|Yes| J
    L -->|No| M[Update ScheduledComputeNode to complete]
    M --> J
    J --> N[Check for running jobs with no active nodes]
    N --> O[Mark orphaned jobs as failed]
    O --> P[Done]

    style A fill:#4a9eff,color:#fff
    style B fill:#4a9eff,color:#fff
    style C fill:#6c757d,color:#fff
    style D fill:#4a9eff,color:#fff
    style E fill:#6c757d,color:#fff
    style F fill:#4a9eff,color:#fff
    style G fill:#dc3545,color:#fff
    style H fill:#4a9eff,color:#fff
    style I fill:#4a9eff,color:#fff
    style J fill:#6c757d,color:#fff
    style K fill:#4a9eff,color:#fff
    style L fill:#6c757d,color:#fff
    style M fill:#4a9eff,color:#fff
    style N fill:#4a9eff,color:#fff
    style O fill:#dc3545,color:#fff
    style P fill:#28a745,color:#fff
```

Integration Points

Orphan detection is integrated into two commands:

  1. torc recover: Runs orphan detection automatically as the first step before checking preconditions. This ensures that orphaned jobs don't block recovery.

  2. torc workflows sync-status: Standalone command to run orphan detection without triggering a full recovery. Useful for debugging or when you want to clean up orphaned jobs without submitting new allocations.
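
For example (the argument shape here is an assumption, mirroring the torc slurm regenerate examples later in this page):

```bash
# Correct the status of orphaned jobs without submitting new allocations
torc workflows sync-status <workflow_id>
```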

The torc watch Command

The torc watch command also performs orphan detection during its polling loop. When it detects that no valid Slurm allocations exist (via a quick squeue check), it runs the full orphan detection logic to clean up any orphaned jobs before checking if the workflow can make progress.
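
A compressed sketch of that loop; all three helper functions are placeholders for the real client logic:

```rust
use std::{thread, time::Duration};

// Placeholders for the real client calls.
fn any_allocation_alive() -> bool { true }       // quick squeue check
fn run_orphan_detection() {}                     // full orphan_detection.rs logic
fn workflow_can_make_progress() -> bool { true }

fn watch_loop(poll_interval: Duration) {
    loop {
        if !any_allocation_alive() {
            // No live allocations: correct orphaned "running" jobs first,
            // so the progress check sees the workflow's true state.
            run_orphan_detection();
        }
        if !workflow_can_make_progress() {
            break;
        }
        thread::sleep(poll_interval);
    }
}
```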

Recovery Actions

The recovery system uses ephemeral recovery actions to solve the remaining problems: preventing duplicate allocations and scheduling allocations for blocked jobs that only become ready later.

How It Works

When torc slurm regenerate runs:

```mermaid
flowchart TD
    A[torc slurm regenerate] --> B[Fetch pending jobs]
    B --> C{Has pending jobs?}
    C -->|No| D[Exit - nothing to do]
    C -->|Yes| E[Build WorkflowGraph from pending jobs]
    E --> F[Mark existing schedule_nodes actions as executed]
    F --> G[Group jobs using scheduler_groups]
    G --> H[Create schedulers for each group]
    H --> I[Update jobs with scheduler assignments]
    I --> J[Create on_jobs_ready recovery actions for deferred groups]
    J --> K{Submit allocations?}
    K -->|Yes| L[Submit Slurm allocations]
    K -->|No| M[Done]
    L --> M

    style A fill:#4a9eff,color:#fff
    style B fill:#4a9eff,color:#fff
    style C fill:#6c757d,color:#fff
    style D fill:#6c757d,color:#fff
    style E fill:#4a9eff,color:#fff
    style F fill:#4a9eff,color:#fff
    style G fill:#4a9eff,color:#fff
    style H fill:#4a9eff,color:#fff
    style I fill:#4a9eff,color:#fff
    style J fill:#ffc107,color:#000
    style K fill:#6c757d,color:#fff
    style L fill:#ffc107,color:#000
    style M fill:#28a745,color:#fff
```

Step 1: Mark Existing Actions as Executed

All existing schedule_nodes actions are marked as executed using the claim_action API. This prevents them from firing again and creating duplicate allocations:

```mermaid
sequenceDiagram
    participant R as regenerate
    participant S as Server
    participant A as workflow_action table

    R->>S: get_workflow_actions(workflow_id)
    S-->>R: [action1, action2, ...]

    loop For each schedule_nodes action
        R->>S: claim_action(action_id)
        S->>A: UPDATE executed=1, executed_at=NOW()
    end
```
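
Client-side, that loop is conceptually the following. The Client and Action shapes are hypothetical stand-ins, but get_workflow_actions and claim_action are the calls named in the diagram above:

```rust
/// Illustrative shapes; the real client and action types differ.
struct Action {
    id: u64,
    action_type: String, // e.g. "schedule_nodes"
}

struct Client;

impl Client {
    fn get_workflow_actions(&self, _workflow_id: u64) -> Vec<Action> {
        Vec::new() // placeholder for the server call
    }
    fn claim_action(&self, _action_id: u64) {
        // server side: UPDATE executed=1, executed_at=NOW()
    }
}

/// Mark every existing schedule_nodes action as executed so it cannot
/// fire again and create duplicate allocations.
fn suppress_schedule_actions(client: &Client, workflow_id: u64) {
    for action in client.get_workflow_actions(workflow_id) {
        if action.action_type == "schedule_nodes" {
            client.claim_action(action.id);
        }
    }
}
```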

Step 2: Group Jobs Using WorkflowGraph

The system builds a WorkflowGraph from pending jobs and uses scheduler_groups() to group them by (resource_requirements, has_dependencies). This aligns with the behavior of torc workflows create-slurm:

  • Jobs without dependencies: Can be scheduled immediately with on_workflow_start
  • Jobs with dependencies (deferred): Need on_jobs_ready recovery actions to schedule when they become ready

```mermaid
flowchart TD
    subgraph pending["Pending Jobs"]
        A[Job A: no deps, rr=default]
        B[Job B: no deps, rr=default]
        C[Job C: depends on A, rr=default]
        D[Job D: no deps, rr=gpu]
    end

    subgraph groups["Scheduler Groups"]
        G1[Group 1: default, no deps<br/>Jobs: A, B]
        G2[Group 2: default, has deps<br/>Jobs: C]
        G3[Group 3: gpu, no deps<br/>Jobs: D]
    end

    A --> G1
    B --> G1
    C --> G2
    D --> G3

    style A fill:#4a9eff,color:#fff
    style B fill:#4a9eff,color:#fff
    style C fill:#ffc107,color:#000
    style D fill:#17a2b8,color:#fff
    style G1 fill:#28a745,color:#fff
    style G2 fill:#28a745,color:#fff
    style G3 fill:#28a745,color:#fff
```
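
A minimal sketch of the grouping, assuming a simplified Job shape (the real method is scheduler_groups() on WorkflowGraph):

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a pending job.
struct Job {
    id: u64,
    resource_requirements: String,
    depends_on_job_ids: Vec<u64>,
}

/// Group jobs by (resource_requirements, has_dependencies); for the
/// example above this yields (default, false) = [A, B],
/// (default, true) = [C], and (gpu, false) = [D].
fn scheduler_groups(jobs: &[Job]) -> BTreeMap<(String, bool), Vec<u64>> {
    let mut groups: BTreeMap<(String, bool), Vec<u64>> = BTreeMap::new();
    for job in jobs {
        let has_deps = !job.depends_on_job_ids.is_empty();
        groups
            .entry((job.resource_requirements.clone(), has_deps))
            .or_default()
            .push(job.id);
    }
    groups
}
```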

Step 3: Create Recovery Actions for Deferred Groups

For groups with has_dependencies = true, the system creates on_jobs_ready recovery actions. These actions:

  • Have is_recovery = true to mark them as ephemeral
  • Use a _deferred suffix in the scheduler name
  • Trigger when the blocked jobs become ready
  • Schedule additional Slurm allocations for those jobs

```mermaid
flowchart LR
    subgraph workflow["Original Workflow"]
        A[Job A: blocked] --> C[Job C: blocked]
        B[Job B: blocked] --> C
    end

    subgraph actions["Recovery Actions"]
        RA["on_jobs_ready: schedule_nodes<br/>job_ids: (A, B)<br/>is_recovery: true"]
        RC["on_jobs_ready: schedule_nodes<br/>job_ids: (C)<br/>is_recovery: true"]
    end

    style A fill:#6c757d,color:#fff
    style B fill:#6c757d,color:#fff
    style C fill:#6c757d,color:#fff
    style RA fill:#ffc107,color:#000
    style RC fill:#ffc107,color:#000
```
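
Constructing such an action looks roughly like this. The field names are assumptions rather than the exact Torc schema, but the trigger, the _deferred suffix, and is_recovery = true come straight from the description above:

```rust
/// Illustrative shape of a workflow action record.
struct WorkflowAction {
    trigger: String,        // "on_jobs_ready"
    action: String,         // "schedule_nodes"
    scheduler_name: String, // carries the "_deferred" suffix
    job_ids: Vec<u64>,
    is_recovery: bool,      // true => ephemeral, deleted on reinitialize
}

fn recovery_action(group_name: &str, job_ids: Vec<u64>) -> WorkflowAction {
    WorkflowAction {
        trigger: "on_jobs_ready".to_string(),
        action: "schedule_nodes".to_string(),
        scheduler_name: format!("{group_name}_deferred"),
        job_ids,
        is_recovery: true,
    }
}
```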

Recovery Action Lifecycle

Recovery actions are ephemeral - they exist only during the recovery period:

```mermaid
stateDiagram-v2
    [*] --> Created: regenerate creates action
    Created --> Executed: Jobs become ready, action triggers
    Executed --> Deleted: Workflow reinitialized
    Created --> Deleted: Workflow reinitialized

    classDef created fill:#ffc107,color:#000
    classDef executed fill:#28a745,color:#fff
    classDef deleted fill:#6c757d,color:#fff

    class Created created
    class Executed executed
    class Deleted deleted
```

When a workflow is reinitialized (e.g., after resetting jobs), all recovery actions are deleted and original actions are reset to their initial state. This ensures a clean slate for the next run.
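
Sketched over an in-memory collection (the real function is reset_actions_for_reinitialize on the server; only the two behaviors above are modeled):

```rust
/// Minimal action shape for illustration.
struct WorkflowAction {
    executed: bool,
    is_recovery: bool,
}

/// On reinitialization: recovery actions are deleted outright, and the
/// original spec-defined actions are re-armed for the next run.
fn reset_actions_for_reinitialize(actions: &mut Vec<WorkflowAction>) {
    actions.retain(|a| !a.is_recovery);
    for action in actions.iter_mut() {
        action.executed = false;
    }
}
```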

Database Schema

Recovery actions are tracked using the is_recovery column in the workflow_action table:

| Column | Type | Description |
|---|---|---|
| is_recovery | INTEGER | 0 = normal action, 1 = recovery action |

Behavior Differences

| Operation | Normal Actions | Recovery Actions |
|---|---|---|
| On reset_actions_for_reinitialize | Reset executed to 0 | Deleted entirely |
| Created by | Workflow spec | torc slurm regenerate |
| Purpose | Configured behavior | Temporary recovery |

Usage

```bash
# Regenerate schedulers for pending jobs
torc slurm regenerate <workflow_id> --account <account>

# With automatic submission
torc slurm regenerate <workflow_id> --account <account> --submit

# Using a specific HPC profile
torc slurm regenerate <workflow_id> --account <account> --profile kestrel
```

Implementation Details

The recovery logic is implemented in:

  • src/client/commands/orphan_detection.rs: Shared orphan detection logic used by recover, watch, and workflows sync-status
  • src/client/commands/recover.rs: Main recovery command implementation
  • src/client/commands/slurm.rs: handle_regenerate function for Slurm scheduler regeneration
  • src/client/workflow_graph.rs: WorkflowGraph::from_jobs() and scheduler_groups() methods
  • src/server/api/workflow_actions.rs: reset_actions_for_reinitialize function
  • migrations/20251225000000_add_is_recovery_to_workflow_action.up.sql: Schema migration

Key implementation notes:

  1. WorkflowGraph construction: A WorkflowGraph is built from pending jobs using from_jobs(), which reconstructs the dependency structure from depends_on_job_ids
  2. Scheduler grouping: Jobs are grouped using scheduler_groups() by (resource_requirements, has_dependencies), matching create-slurm behavior
  3. Deferred schedulers: Groups with dependencies get a _deferred suffix in the scheduler name
  4. Allocation calculation: Number of allocations is based on job count and resources per node
  5. Recovery actions: Only deferred groups (jobs with dependencies) get on_jobs_ready recovery actions
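
For note 4, a plausible reading is a ceiling division of job count by jobs per node; the exact formula in handle_regenerate may differ:

```rust
/// Enough allocations to cover all jobs, given how many jobs fit on a
/// single node's worth of resources.
fn num_allocations(job_count: usize, jobs_per_node: usize) -> usize {
    assert!(jobs_per_node > 0, "jobs_per_node must be positive");
    job_count.div_ceil(jobs_per_node)
}
```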