Workflow Recovery

Torc provides mechanisms for recovering workflows when Slurm allocations are preempted or fail before completing all jobs. The torc slurm regenerate command creates new schedulers and allocations for pending jobs.

The Recovery Problem

When running workflows on Slurm, allocations can fail or be preempted before all jobs complete. This leaves workflows in a partial state with:

  1. Ready/uninitialized jobs - Jobs that were waiting to run but never got scheduled
  2. Blocked jobs - Jobs whose dependencies haven't completed yet
  3. Orphaned running jobs - Jobs still marked as "running" in the database even though their Slurm allocation has terminated

Simply creating new Slurm schedulers and submitting allocations isn't enough because:

  1. Orphaned jobs block recovery: Jobs stuck in "running" status prevent the workflow from being considered complete, blocking recovery precondition checks
  2. Duplicate allocations: If the workflow had on_workflow_start actions to schedule nodes, those actions would fire again when the workflow is reinitialized, creating duplicate allocations
  3. Missing allocations for blocked jobs: Blocked jobs will eventually become ready, but there's no mechanism to schedule new allocations for them

Orphan Detection

Before recovery can proceed, orphaned jobs must be detected and their status corrected. This is handled by the orphan detection module (src/client/commands/orphan_detection.rs).

How It Works

The orphan detection logic checks for three types of orphaned resources:

  1. Active allocations with terminated Slurm jobs: ScheduledComputeNodes marked as "active" in the database, but whose Slurm job is no longer running (verified via squeue)

  2. Pending allocations that disappeared: ScheduledComputeNodes marked as "pending" whose Slurm job no longer exists (cancelled or failed before starting)

  3. Running jobs with no active compute nodes: Jobs marked as "running" but with no active compute nodes to process them (fallback for non-Slurm cases)
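
As a rough sketch of cases 1 and 2, the liveness test can shell out to squeue. The ScheduledComputeNode shape below is simplified for illustration, and a fuller check would also parse the reported job state (for example with squeue -h -j <id> -o %T):

```rust
use std::process::Command;

/// Simplified stand-in for the database record.
struct ScheduledComputeNode {
    id: u64,
    slurm_job_id: String,
}

/// Returns true if Slurm still knows about the job.
/// `squeue -h -j <id>` prints a line per matching job and nothing
/// (plus a nonzero exit) when the job is gone.
fn slurm_job_exists(slurm_job_id: &str) -> std::io::Result<bool> {
    let output = Command::new("squeue")
        .args(["-h", "-j", slurm_job_id])
        .output()?;
    Ok(output.status.success() && !output.stdout.is_empty())
}

/// Collect "active" allocations whose Slurm job has terminated.
fn find_orphaned(active_nodes: &[ScheduledComputeNode]) -> Vec<u64> {
    active_nodes
        .iter()
        .filter(|n| matches!(slurm_job_exists(&n.slurm_job_id), Ok(false)))
        .map(|n| n.id)
        .collect()
}
```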

```mermaid
flowchart TD
    A[Start Orphan Detection] --> B[List active ScheduledComputeNodes]
    B --> C{For each Slurm allocation}
    C --> D[Check squeue for job status]
    D --> E{Job still running?}
    E -->|Yes| C
    E -->|No| F[Find jobs on this allocation]
    F --> G[Mark jobs as failed]
    G --> H[Update ScheduledComputeNode to complete]
    H --> C
    C --> I[List pending ScheduledComputeNodes]
    I --> J{For each pending allocation}
    J --> K[Check squeue for job status]
    K --> L{Job exists?}
    L -->|Yes| J
    L -->|No| M[Update ScheduledComputeNode to complete]
    M --> J
    J --> N[Check for running jobs with no active nodes]
    N --> O[Mark orphaned jobs as failed]
    O --> P[Done]

    style A fill:#4a9eff,color:#fff
    style B fill:#4a9eff,color:#fff
    style C fill:#6c757d,color:#fff
    style D fill:#4a9eff,color:#fff
    style E fill:#6c757d,color:#fff
    style F fill:#4a9eff,color:#fff
    style G fill:#dc3545,color:#fff
    style H fill:#4a9eff,color:#fff
    style I fill:#4a9eff,color:#fff
    style J fill:#6c757d,color:#fff
    style K fill:#4a9eff,color:#fff
    style L fill:#6c757d,color:#fff
    style M fill:#4a9eff,color:#fff
    style N fill:#4a9eff,color:#fff
    style O fill:#dc3545,color:#fff
    style P fill:#28a745,color:#fff
```

Integration Points

Orphan detection is integrated into two commands:

  1. torc recover: Runs orphan detection automatically as the first step before checking preconditions. This ensures that orphaned jobs don't block recovery.

  2. torc workflows sync-status: Standalone command to run orphan detection without triggering a full recovery. Useful for debugging or when you want to clean up orphaned jobs without submitting new allocations.
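
For example (the argument shape here is an assumption, mirroring the torc slurm regenerate examples later in this page):

```bash
# Correct the status of orphaned jobs without submitting new allocations
torc workflows sync-status <workflow_id>
```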

The torc watch Command

The torc watch command also performs orphan detection during its polling loop. When it detects that no valid Slurm allocations exist (via a quick squeue check), it runs the full orphan detection logic to clean up any orphaned jobs before checking if the workflow can make progress.
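
A compressed sketch of that loop; all three helper functions are placeholders for the real client logic:

```rust
use std::{thread, time::Duration};

// Placeholders for the real client calls.
fn any_allocation_alive() -> bool { true }       // quick squeue check
fn run_orphan_detection() {}                     // full orphan_detection.rs logic
fn workflow_can_make_progress() -> bool { true }

fn watch_loop(poll_interval: Duration) {
    loop {
        if !any_allocation_alive() {
            // No live allocations: correct orphaned "running" jobs first,
            // so the progress check sees the workflow's true state.
            run_orphan_detection();
        }
        if !workflow_can_make_progress() {
            break;
        }
        thread::sleep(poll_interval);
    }
}
```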

Recovery Actions

The recovery system uses ephemeral recovery actions to solve the remaining problems: preventing duplicate allocations and scheduling allocations for blocked jobs that only become ready later.

How It Works

When torc slurm regenerate runs:

```mermaid
flowchart TD
    A[torc slurm regenerate] --> B[Fetch pending jobs]
    B --> C{Has pending jobs?}
    C -->|No| D[Exit - nothing to do]
    C -->|Yes| E[Build WorkflowGraph from pending jobs]
    E --> F[Mark existing schedule_nodes actions as executed]
    F --> G[Group jobs using scheduler_groups]
    G --> H[Create schedulers for each group]
    H --> I[Update jobs with scheduler assignments]
    I --> J[Create on_jobs_ready recovery actions for deferred groups]
    J --> K{Submit allocations?}
    K -->|Yes| L[Submit Slurm allocations]
    K -->|No| M[Done]
    L --> M

    style A fill:#4a9eff,color:#fff
    style B fill:#4a9eff,color:#fff
    style C fill:#6c757d,color:#fff
    style D fill:#6c757d,color:#fff
    style E fill:#4a9eff,color:#fff
    style F fill:#4a9eff,color:#fff
    style G fill:#4a9eff,color:#fff
    style H fill:#4a9eff,color:#fff
    style I fill:#4a9eff,color:#fff
    style J fill:#ffc107,color:#000
    style K fill:#6c757d,color:#fff
    style L fill:#ffc107,color:#000
    style M fill:#28a745,color:#fff
```

Step 1: Mark Existing Actions as Executed

All existing schedule_nodes actions are marked as executed using the claim_action API. This prevents them from firing again and creating duplicate allocations:

```mermaid
sequenceDiagram
    participant R as regenerate
    participant S as Server
    participant A as workflow_action table

    R->>S: get_workflow_actions(workflow_id)
    S-->>R: [action1, action2, ...]

    loop For each schedule_nodes action
        R->>S: claim_action(action_id)
        S->>A: UPDATE executed=1, executed_at=NOW()
    end
```
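
Client-side, that loop is conceptually the following. The Client and Action shapes are hypothetical stand-ins, but get_workflow_actions and claim_action are the calls named in the diagram above:

```rust
/// Illustrative shapes; the real client and action types differ.
struct Action {
    id: u64,
    action_type: String, // e.g. "schedule_nodes"
}

struct Client;

impl Client {
    fn get_workflow_actions(&self, _workflow_id: u64) -> Vec<Action> {
        Vec::new() // placeholder for the server call
    }
    fn claim_action(&self, _action_id: u64) {
        // server side: UPDATE executed=1, executed_at=NOW()
    }
}

/// Mark every existing schedule_nodes action as executed so it cannot
/// fire again and create duplicate allocations.
fn suppress_schedule_actions(client: &Client, workflow_id: u64) {
    for action in client.get_workflow_actions(workflow_id) {
        if action.action_type == "schedule_nodes" {
            client.claim_action(action.id);
        }
    }
}
```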

Step 2: Group Jobs Using WorkflowGraph

The system builds a WorkflowGraph from pending jobs and uses scheduler_groups() to group them by (resource_requirements, has_dependencies). This aligns with the behavior of torc workflows create-slurm:

  • Jobs without dependencies: Can be scheduled immediately with on_workflow_start
  • Jobs with dependencies (deferred): Need on_jobs_ready recovery actions to schedule when they become ready

```mermaid
flowchart TD
    subgraph pending["Pending Jobs"]
        A[Job A: no deps, rr=default]
        B[Job B: no deps, rr=default]
        C[Job C: depends on A, rr=default]
        D[Job D: no deps, rr=gpu]
    end

    subgraph groups["Scheduler Groups"]
        G1[Group 1: default, no deps<br/>Jobs: A, B]
        G2[Group 2: default, has deps<br/>Jobs: C]
        G3[Group 3: gpu, no deps<br/>Jobs: D]
    end

    A --> G1
    B --> G1
    C --> G2
    D --> G3

    style A fill:#4a9eff,color:#fff
    style B fill:#4a9eff,color:#fff
    style C fill:#ffc107,color:#000
    style D fill:#17a2b8,color:#fff
    style G1 fill:#28a745,color:#fff
    style G2 fill:#28a745,color:#fff
    style G3 fill:#28a745,color:#fff
```
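
A minimal sketch of the grouping, assuming a simplified Job shape (the real method is scheduler_groups() on WorkflowGraph):

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a pending job.
struct Job {
    id: u64,
    resource_requirements: String,
    depends_on_job_ids: Vec<u64>,
}

/// Group jobs by (resource_requirements, has_dependencies); for the
/// example above this yields (default, false) = [A, B],
/// (default, true) = [C], and (gpu, false) = [D].
fn scheduler_groups(jobs: &[Job]) -> BTreeMap<(String, bool), Vec<u64>> {
    let mut groups: BTreeMap<(String, bool), Vec<u64>> = BTreeMap::new();
    for job in jobs {
        let has_deps = !job.depends_on_job_ids.is_empty();
        groups
            .entry((job.resource_requirements.clone(), has_deps))
            .or_default()
            .push(job.id);
    }
    groups
}
```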

Step 3: Create Recovery Actions for Deferred Groups

For groups with has_dependencies = true, the system creates on_jobs_ready recovery actions. These actions:

  • Have is_recovery = true to mark them as ephemeral
  • Use a _deferred suffix in the scheduler name
  • Trigger when the blocked jobs become ready
  • Schedule additional Slurm allocations for those jobs

```mermaid
flowchart LR
    subgraph workflow["Original Workflow"]
        A[Job A: blocked] --> C[Job C: blocked]
        B[Job B: blocked] --> C
    end

    subgraph actions["Recovery Actions"]
        RA["on_jobs_ready: schedule_nodes<br/>job_ids: (A, B)<br/>is_recovery: true"]
        RC["on_jobs_ready: schedule_nodes<br/>job_ids: (C)<br/>is_recovery: true"]
    end

    style A fill:#6c757d,color:#fff
    style B fill:#6c757d,color:#fff
    style C fill:#6c757d,color:#fff
    style RA fill:#ffc107,color:#000
    style RC fill:#ffc107,color:#000
```
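
Constructing such an action looks roughly like this. The field names are assumptions rather than the exact Torc schema, but the trigger, the _deferred suffix, and is_recovery = true come straight from the description above:

```rust
/// Illustrative shape of a workflow action record.
struct WorkflowAction {
    trigger: String,        // "on_jobs_ready"
    action: String,         // "schedule_nodes"
    scheduler_name: String, // carries the "_deferred" suffix
    job_ids: Vec<u64>,
    is_recovery: bool,      // true => ephemeral, deleted on reinitialize
}

fn recovery_action(group_name: &str, job_ids: Vec<u64>) -> WorkflowAction {
    WorkflowAction {
        trigger: "on_jobs_ready".to_string(),
        action: "schedule_nodes".to_string(),
        scheduler_name: format!("{group_name}_deferred"),
        job_ids,
        is_recovery: true,
    }
}
```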

Recovery Action Lifecycle

Recovery actions are ephemeral - they exist only during the recovery period:

```mermaid
stateDiagram-v2
    [*] --> Created: regenerate creates action
    Created --> Executed: Jobs become ready, action triggers
    Executed --> Deleted: Workflow reinitialized
    Created --> Deleted: Workflow reinitialized

    classDef created fill:#ffc107,color:#000
    classDef executed fill:#28a745,color:#fff
    classDef deleted fill:#6c757d,color:#fff

    class Created created
    class Executed executed
    class Deleted deleted
```

When a workflow is reinitialized (e.g., after resetting jobs), all recovery actions are deleted and original actions are reset to their initial state. This ensures a clean slate for the next run.
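
Sketched over an in-memory collection (the real function is reset_actions_for_reinitialize on the server; only the two behaviors above are modeled):

```rust
/// Minimal action shape for illustration.
struct WorkflowAction {
    executed: bool,
    is_recovery: bool,
}

/// On reinitialization: recovery actions are deleted outright, and the
/// original spec-defined actions are re-armed for the next run.
fn reset_actions_for_reinitialize(actions: &mut Vec<WorkflowAction>) {
    actions.retain(|a| !a.is_recovery);
    for action in actions.iter_mut() {
        action.executed = false;
    }
}
```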

Database Schema

Recovery actions are tracked using the is_recovery column in the workflow_action table:

| Column | Type | Description |
|---|---|---|
| is_recovery | INTEGER | 0 = normal action, 1 = recovery action |

Behavior Differences

| Operation | Normal Actions | Recovery Actions |
|---|---|---|
| On reset_actions_for_reinitialize | Reset executed to 0 | Deleted entirely |
| Created by | Workflow spec | torc slurm regenerate |
| Purpose | Configured behavior | Temporary recovery |

Usage

```bash
# Regenerate schedulers for pending jobs
torc slurm regenerate <workflow_id> --account <account>

# With automatic submission
torc slurm regenerate <workflow_id> --account <account> --submit

# Using a specific HPC profile
torc slurm regenerate <workflow_id> --account <account> --profile kestrel
```

Implementation Details

The recovery logic is implemented in:

  • src/client/commands/orphan_detection.rs: Shared orphan detection logic used by recover, watch, and workflows sync-status
  • src/client/commands/recover.rs: Main recovery command implementation
  • src/client/commands/slurm.rs: handle_regenerate function for Slurm scheduler regeneration
  • src/client/workflow_graph.rs: WorkflowGraph::from_jobs() and scheduler_groups() methods
  • src/server/api/workflow_actions.rs: reset_actions_for_reinitialize function
  • migrations/20251225000000_add_is_recovery_to_workflow_action.up.sql: Schema migration

Key implementation notes:

  1. WorkflowGraph construction: A WorkflowGraph is built from pending jobs using from_jobs(), which reconstructs the dependency structure from depends_on_job_ids
  2. Scheduler grouping: Jobs are grouped using scheduler_groups() by (resource_requirements, has_dependencies), matching create-slurm behavior
  3. Deferred schedulers: Groups with dependencies get a _deferred suffix in the scheduler name
  4. Allocation calculation: Number of allocations is based on job count and resources per node
  5. Recovery actions: Only deferred groups (jobs with dependencies) get on_jobs_ready recovery actions
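
For note 4, a plausible reading is a ceiling division of job count by jobs per node; the exact formula in handle_regenerate may differ:

```rust
/// Enough allocations to cover all jobs, given how many jobs fit on a
/// single node's worth of resources.
fn num_allocations(job_count: usize, jobs_per_node: usize) -> usize {
    assert!(jobs_per_node > 0, "jobs_per_node must be positive");
    job_count.div_ceil(jobs_per_node)
}
```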