Failure Handler Design
This document describes the internal architecture and implementation of failure handlers in Torc. For a user-focused tutorial, see Configurable Failure Handlers.
Overview
Failure handlers provide per-job automatic retry logic based on exit codes. They allow workflows to recover from transient failures without manual intervention or workflow-level recovery heuristics.
flowchart LR
subgraph workflow["Workflow Specification"]
FH["failure_handlers:<br/>- name: handler1<br/> rules: [...]"]
JOB["jobs:<br/>- name: my_job<br/> failure_handler: handler1"]
end
subgraph server["Server"]
DB[(Database)]
API["REST API"]
end
subgraph client["Job Runner"]
RUNNER["JobRunner"]
RECOVERY["Recovery Logic"]
end
FH --> DB
JOB --> DB
RUNNER --> API
API --> DB
RUNNER --> RECOVERY
style FH fill:#4a9eff,color:#fff
style JOB fill:#4a9eff,color:#fff
style DB fill:#ffc107,color:#000
style API fill:#28a745,color:#fff
style RUNNER fill:#17a2b8,color:#fff
style RECOVERY fill:#dc3545,color:#fff
Problem Statement
When jobs fail, workflows traditionally have two recovery options:
- Manual intervention: User investigates and restarts failed jobs
- Workflow-level recovery: `torc watch --recover` applies heuristics based on detected failure patterns (OOM, timeout, etc.)
Neither approach handles application-specific failures where:
- The job itself knows why it failed (via exit code)
- A specific recovery action can fix the issue
- Immediate retry is appropriate
Failure handlers solve this by allowing jobs to define exit-code-specific retry behavior with optional recovery scripts.
Architecture
Component Interaction
sequenceDiagram
participant WS as Workflow Spec
participant API as Server API
participant DB as Database
participant JR as JobRunner
participant RS as Recovery Script
participant JOB as Job Process
Note over WS,DB: Workflow Creation
WS->>API: Create workflow with failure_handlers
API->>DB: INSERT failure_handler
API->>DB: INSERT job with failure_handler_id
Note over JR,JOB: Job Execution
JR->>API: Claim job
JR->>JOB: Execute command
JOB-->>JR: Exit code (e.g., 10)
Note over JR,API: Failure Recovery
JR->>API: GET failure_handler
API->>DB: SELECT rules
DB-->>API: Rules JSON
API-->>JR: FailureHandlerModel
JR->>JR: Match exit code to rule
JR->>API: POST retry_job (reserves retry)
alt Recovery Script Defined
JR->>RS: Execute with env vars
RS-->>JR: Exit code
end
JR->>JR: Job returns to Ready queue
Data Model
erDiagram
WORKFLOW ||--o{ JOB : contains
WORKFLOW ||--o{ FAILURE_HANDLER : contains
FAILURE_HANDLER ||--o{ JOB : "referenced by"
WORKFLOW {
int id PK
string name
int status_id FK
}
FAILURE_HANDLER {
int id PK
int workflow_id FK
string name
string rules "JSON array"
}
JOB {
int id PK
int workflow_id FK
string name
string command
int status
int failure_handler_id FK "nullable"
int attempt_id "starts at 1"
}
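The sequence diagram above refers to a `FailureHandlerModel` returned by the API. As a rough sketch (not Torc's actual definitions), the rows in this diagram map onto Rust models like the following; field names follow the diagram:

```rust
// Illustrative models mirroring the ER diagram; the real structs in Torc may differ.
pub struct FailureHandlerModel {
    pub id: i64,
    pub workflow_id: i64,
    pub name: String,
    /// JSON-encoded array of FailureHandlerRule (see "Rule Structure" below).
    pub rules: String,
}

pub struct JobModel {
    pub id: i64,
    pub workflow_id: i64,
    pub name: String,
    pub command: String,
    pub status: i32,
    /// Nullable: jobs without a handler fall back to pending_failed/failed.
    pub failure_handler_id: Option<i64>,
    /// Starts at 1 and increments on each retry.
    pub attempt_id: i32,
}
```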
Rule Matching
Failure handler rules are stored as a JSON array. When a job fails, the rules are evaluated in priority order to find a match: rules that list specific exit codes are checked before catch-all rules.
Rule Structure
```rust
pub struct FailureHandlerRule {
    pub exit_codes: Vec<i32>,            // Specific codes to match
    pub match_all_exit_codes: bool,      // Catch-all flag
    pub recovery_script: Option<String>,
    pub max_retries: i32,                // Default: 3
}
```
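Since `rules` is stored as a JSON string (see the `failure_handler` table below), the runner must deserialize it before matching. A minimal sketch of that step, assuming serde/serde_json with defaults for omitted fields; the derive attributes and helper names here are assumptions, not Torc's actual code:

```rust
use serde::Deserialize;

// Same fields as above, annotated for deserialization (attributes are assumptions).
#[derive(Debug, Deserialize)]
pub struct FailureHandlerRule {
    #[serde(default)]
    pub exit_codes: Vec<i32>,
    #[serde(default)]
    pub match_all_exit_codes: bool,
    #[serde(default)]
    pub recovery_script: Option<String>,
    #[serde(default = "default_max_retries")]
    pub max_retries: i32,
}

fn default_max_retries() -> i32 {
    3
}

/// Parse the JSON stored in failure_handler.rules, e.g.
/// [{"exit_codes":[10,11],"max_retries":3}]
fn parse_rules(rules_json: &str) -> serde_json::Result<Vec<FailureHandlerRule>> {
    serde_json::from_str(rules_json)
}
```

The `#[serde(default)]` attributes matter because stored rules may omit fields, as in the API response example later in this document.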
Matching Priority
Rules are evaluated with specific matches taking priority over catch-all rules:
flowchart TD
START["Job fails with exit code X"]
SPECIFIC{"Find rule where<br/>exit_codes contains X?"}
CATCHALL{"Find rule where<br/>match_all_exit_codes = true?"}
FOUND["Rule matched"]
NONE["No matching rule<br/>Job marked Failed"]
START --> SPECIFIC
SPECIFIC -->|Found| FOUND
SPECIFIC -->|Not found| CATCHALL
CATCHALL -->|Found| FOUND
CATCHALL -->|Not found| NONE
style START fill:#dc3545,color:#fff
style SPECIFIC fill:#4a9eff,color:#fff
style CATCHALL fill:#ffc107,color:#000
style FOUND fill:#28a745,color:#fff
style NONE fill:#6c757d,color:#fff
This ensures that specific exit code handlers always take precedence, regardless of rule order in the JSON array.
Implementation (`job_runner.rs`):

```rust
let matching_rule = rules
    .iter()
    .find(|rule| rule.exit_codes.contains(&(exit_code as i32)))
    .or_else(|| rules.iter().find(|rule| rule.match_all_exit_codes));
```
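A small test-style sketch of the precedence this gives: even when the catch-all rule appears first in the array, the specific rule wins. The `rule` helper below exists only for the example.

```rust
#[cfg(test)]
mod tests {
    use super::*;

    fn rule(exit_codes: Vec<i32>, match_all: bool) -> FailureHandlerRule {
        FailureHandlerRule {
            exit_codes,
            match_all_exit_codes: match_all,
            recovery_script: None,
            max_retries: 3,
        }
    }

    #[test]
    fn specific_rule_wins_over_catch_all() {
        // Catch-all listed first; the specific rule for exit code 10 must still win.
        let rules = vec![rule(vec![], true), rule(vec![10, 11], false)];
        let exit_code = 10u32;

        let matched = rules
            .iter()
            .find(|r| r.exit_codes.contains(&(exit_code as i32)))
            .or_else(|| rules.iter().find(|r| r.match_all_exit_codes))
            .expect("a rule should match");

        assert!(!matched.match_all_exit_codes);
        assert_eq!(matched.exit_codes, vec![10, 11]);
    }
}
```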
Recovery Flow
The recovery process is designed to be atomic and safe:
flowchart TD
subgraph JobRunner["JobRunner (Client)"]
FAIL["Job fails"]
FETCH["Fetch failure handler"]
MATCH["Match rule to exit code"]
CHECK{"attempt_id<br/>< max_retries?"}
RESERVE["POST /jobs/{id}/retry/{run_id}<br/>Reserves retry slot"]
SCRIPT{"Recovery<br/>script defined?"}
RUN["Execute recovery script"]
DONE["Job queued for retry"]
FAILED["Mark job as Failed"]
end
subgraph Server["Server (API)"]
VALIDATE["Validate run_id matches"]
STATUS["Check job status"]
MAX["Validate max_retries"]
UPDATE["UPDATE job<br/>status=Ready<br/>attempt_id += 1"]
EVENT["INSERT event record"]
COMMIT["COMMIT transaction"]
end
FAIL --> FETCH
FETCH --> MATCH
MATCH --> CHECK
CHECK -->|Yes| RESERVE
CHECK -->|No| FAILED
RESERVE --> VALIDATE
VALIDATE --> STATUS
STATUS --> MAX
MAX --> UPDATE
UPDATE --> EVENT
EVENT --> COMMIT
COMMIT --> SCRIPT
SCRIPT -->|Yes| RUN
SCRIPT -->|No| DONE
RUN -->|Success or Failure| DONE
style FAIL fill:#dc3545,color:#fff
style RESERVE fill:#4a9eff,color:#fff
style RUN fill:#ffc107,color:#000
style DONE fill:#28a745,color:#fff
style FAILED fill:#6c757d,color:#fff
style UPDATE fill:#17a2b8,color:#fff
style COMMIT fill:#17a2b8,color:#fff
Key Design Decisions
- Retry reservation before recovery script: The `retry_job` API is called before the recovery script runs (see the sketch after this list). This ensures:
  - The retry slot is reserved atomically
  - Recovery scripts don't run for retries that won't happen
  - External resources modified by recovery scripts are consistent
- Recovery script failure is non-fatal: If the recovery script fails, the job is still retried. This prevents recovery script bugs from blocking legitimate retries.
- Transaction isolation: The `retry_job` API uses `BEGIN IMMEDIATE` to prevent race conditions where multiple processes might try to retry the same job.
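A condensed, self-contained sketch of the first two decisions; the helper functions here are stand-ins for illustration, not Torc's actual APIs (the real logic lives in `try_recover_job()`):

```rust
// Illustrative ordering only; the helpers are stand-ins, not Torc APIs.
fn reserve_retry(job_id: i64, run_id: i64, max_retries: i32) -> Result<(), String> {
    // Stand-in for POST /jobs/{id}/retry/{run_id}?max_retries=N.
    println!("reserving retry for job {job_id} (run {run_id}, max {max_retries})");
    Ok(())
}

fn run_recovery_script(script: &str) -> Result<(), String> {
    // Stand-in for executing the script via the platform shell.
    println!("running recovery script: {script}");
    Ok(())
}

fn retry_with_rule(job_id: i64, run_id: i64, max_retries: i32, recovery_script: Option<&str>) {
    // 1. Reserve the retry slot first. If this fails (e.g. max_retries
    //    exceeded), the recovery script never runs.
    if let Err(err) = reserve_retry(job_id, run_id, max_retries) {
        eprintln!("retry not reserved: {err}");
        return;
    }

    // 2. Run the recovery script, if any. Failure here is logged but
    //    non-fatal: the retry slot is already reserved.
    if let Some(script) = recovery_script {
        if let Err(err) = run_recovery_script(script) {
            eprintln!("recovery script failed (job will still retry): {err}");
        }
    }

    // 3. The job is now back in the Ready queue on the server side.
}
```

Reserving first means a buggy or slow recovery script can never consume a retry that the server would have rejected.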
API Endpoints
GET /failure_handlers/{id}
Fetches a failure handler by ID.
Response:
{
"id": 1,
"workflow_id": 42,
"name": "simulation_recovery",
"rules": "[{\"exit_codes\":[10,11],\"max_retries\":3}]"
}
POST /jobs/{id}/retry/{run_id}?max_retries=N
Retries a failed job by resetting its status to Ready.
Query Parameters:
- `max_retries` (required): Maximum retries allowed by the matching rule
Validations:
- Job must exist
- `run_id` must match the workflow's current run
- Job status must be Running, Failed, or Terminated
- `attempt_id` must be less than `max_retries`
Transaction Safety:
BEGIN IMMEDIATE; -- Acquire write lock
SELECT j.*, ws.run_id as workflow_run_id
FROM job j
JOIN workflow w ON j.workflow_id = w.id
JOIN workflow_status ws ON w.status_id = ws.id
WHERE j.id = ?;
-- Validate conditions...
UPDATE job SET status = 2, attempt_id = ? WHERE id = ?;
INSERT INTO event (workflow_id, timestamp, data) VALUES (?, ?, ?);
COMMIT;
Response:
{
"id": 123,
"workflow_id": 42,
"name": "my_job",
"status": "ready",
"attempt_id": 2
}
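A condensed sketch of the server-side flow, assuming a rusqlite-style SQLite connection; function and constant names here are illustrative, and the real implementation is `retry_job()` in `src/server/api/jobs.rs`:

```rust
use rusqlite::{params, Connection, TransactionBehavior};

// Illustrative status value; the SQL above uses status = 2 for Ready.
const STATUS_READY: i32 = 2;

fn retry_job(
    conn: &mut Connection,
    job_id: i64,
    run_id: i64,
    max_retries: i32,
) -> Result<(), Box<dyn std::error::Error>> {
    // BEGIN IMMEDIATE: take the write lock up front to avoid racing retries.
    let tx = conn.transaction_with_behavior(TransactionBehavior::Immediate)?;

    let (attempt_id, workflow_run_id): (i32, i64) = tx.query_row(
        "SELECT j.attempt_id, ws.run_id
         FROM job j
         JOIN workflow w ON j.workflow_id = w.id
         JOIN workflow_status ws ON w.status_id = ws.id
         WHERE j.id = ?1",
        params![job_id],
        |row| Ok((row.get(0)?, row.get(1)?)),
    )?;

    // Validations mirror the endpoint description above.
    // (The job-status check for Running/Failed/Terminated is omitted for brevity.)
    if workflow_run_id != run_id {
        return Err("run_id does not match the workflow's current run".into());
    }
    if attempt_id >= max_retries {
        return Err("max_retries exceeded".into());
    }

    tx.execute(
        "UPDATE job SET status = ?1, attempt_id = ?2 WHERE id = ?3",
        params![STATUS_READY, attempt_id + 1, job_id],
    )?;
    // An event record would also be inserted here (omitted for brevity).

    tx.commit()?;
    Ok(())
}
```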
Recovery Script Execution
Recovery scripts run in a subprocess with environment variables providing context:
flowchart LR
subgraph env["Environment Variables"]
WID["TORC_WORKFLOW_ID"]
JID["TORC_JOB_ID"]
JN["TORC_JOB_NAME"]
URL["TORC_API_URL"]
OUT["TORC_OUTPUT_DIR"]
AID["TORC_ATTEMPT_ID"]
RC["TORC_RETURN_CODE"]
end
subgraph script["Recovery Script"]
SHELL["bash -c<br/>(or cmd /C on Windows)"]
CODE["User script code"]
end
env --> SHELL
SHELL --> CODE
style WID fill:#4a9eff,color:#fff
style JID fill:#4a9eff,color:#fff
style JN fill:#4a9eff,color:#fff
style URL fill:#4a9eff,color:#fff
style OUT fill:#4a9eff,color:#fff
style AID fill:#ffc107,color:#000
style RC fill:#dc3545,color:#fff
style SHELL fill:#6c757d,color:#fff
style CODE fill:#28a745,color:#fff
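A minimal sketch of launching a recovery script with these variables via `std::process::Command`; the real cross-platform invocation lives in `shell_command()` in `src/client/utils.rs` and may differ:

```rust
use std::process::Command;

/// Launch a recovery script through the platform shell, passing job context
/// through TORC_* environment variables. Returns the script's exit status.
fn run_recovery_script(
    script: &str,
    workflow_id: i64,
    job_id: i64,
    job_name: &str,
    api_url: &str,
    output_dir: &str,
    attempt_id: i32,
    return_code: i32,
) -> std::io::Result<std::process::ExitStatus> {
    // bash -c on Unix, cmd /C on Windows (matching the diagram above).
    let mut cmd = if cfg!(windows) {
        let mut c = Command::new("cmd");
        c.arg("/C").arg(script);
        c
    } else {
        let mut c = Command::new("bash");
        c.arg("-c").arg(script);
        c
    };

    cmd.env("TORC_WORKFLOW_ID", workflow_id.to_string())
        .env("TORC_JOB_ID", job_id.to_string())
        .env("TORC_JOB_NAME", job_name)
        .env("TORC_API_URL", api_url)
        .env("TORC_OUTPUT_DIR", output_dir)
        .env("TORC_ATTEMPT_ID", attempt_id.to_string())
        .env("TORC_RETURN_CODE", return_code.to_string());

    cmd.status()
}
```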
Log File Naming
Each job attempt produces separate log files to preserve history:
output/job_stdio/
├── job_wf{W}_j{J}_r{R}_a1.o # Attempt 1 stdout
├── job_wf{W}_j{J}_r{R}_a1.e # Attempt 1 stderr
├── job_wf{W}_j{J}_r{R}_a2.o # Attempt 2 stdout
├── job_wf{W}_j{J}_r{R}_a2.e # Attempt 2 stderr
└── ...
Where:
- `W` = workflow_id
- `J` = job_id
- `R` = run_id
- `aN` = attempt number
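For reference, a one-line sketch of the naming scheme (`o` for stdout, `e` for stderr); the real helper in Torc may differ:

```rust
/// Build a log filename such as "job_wf3_j7_r1_a2.o".
fn job_log_filename(workflow_id: i64, job_id: i64, run_id: i64, attempt: i32, stream: &str) -> String {
    format!("job_wf{workflow_id}_j{job_id}_r{run_id}_a{attempt}.{stream}")
}
```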
Database Schema
failure_handler Table
CREATE TABLE failure_handler (
id INTEGER PRIMARY KEY AUTOINCREMENT,
workflow_id INTEGER NOT NULL REFERENCES workflow(id) ON DELETE CASCADE,
name TEXT NOT NULL,
rules TEXT NOT NULL, -- JSON array of FailureHandlerRule
UNIQUE(workflow_id, name)
);
job Table (relevant columns)
ALTER TABLE job ADD COLUMN failure_handler_id INTEGER
REFERENCES failure_handler(id) ON DELETE SET NULL;
ALTER TABLE job ADD COLUMN attempt_id INTEGER NOT NULL DEFAULT 1;
Slurm Integration
When a job is retried, it returns to the Ready queue and will be picked up by any available compute node. For Slurm workflows, this may require additional allocations if existing nodes have terminated.
flowchart TD
RETRY["Job retried<br/>(status = Ready)"]
CHECK{"Compute nodes<br/>available?"}
RUN["Job runs on<br/>existing allocation"]
SCHEDULE["Auto-schedule triggers<br/>new Slurm allocation"]
WAIT["Job waits for<br/>allocation to start"]
EXEC["Job executes"]
RETRY --> CHECK
CHECK -->|Yes| RUN
CHECK -->|No| SCHEDULE
SCHEDULE --> WAIT
WAIT --> EXEC
RUN --> EXEC
style RETRY fill:#28a745,color:#fff
style CHECK fill:#6c757d,color:#fff
style RUN fill:#4a9eff,color:#fff
style SCHEDULE fill:#ffc107,color:#000
style WAIT fill:#17a2b8,color:#fff
style EXEC fill:#28a745,color:#fff
If `auto_schedule_on_ready_jobs` actions are configured, new Slurm allocations are created automatically when retried jobs become ready. See Workflow Actions for details.
Implementation Files
| File | Purpose |
|---|---|
| `src/client/job_runner.rs` | `try_recover_job()`, rule matching |
| `src/client/utils.rs` | `shell_command()` cross-platform shell |
| `src/server/api/jobs.rs` | `retry_job()` API endpoint |
| `src/server/api/failure_handlers.rs` | CRUD operations for failure handlers |
| `src/client/workflow_spec.rs` | Parsing failure handlers from specs |
| `migrations/20260110*.sql` | Database schema for failure handlers |
Comparison with Workflow Recovery
| Aspect | Failure Handlers | Workflow Recovery (torc watch) |
|---|---|---|
| Scope | Per-job | Workflow-wide |
| Trigger | Specific exit codes | OOM detection, timeout patterns |
| Timing | Immediate (during job run) | After job completion |
| Recovery Action | Custom scripts | Resource adjustment, resubmission |
| Configuration | In workflow spec | Command-line flags |
| State | Preserved (same workflow run) | May start new run |
| Slurm | Reuses or auto-schedules nodes | Creates new schedulers |
Recommendation: Use both mechanisms together:
- Failure handlers for immediate, exit-code-specific recovery
- `torc watch --recover` for workflow-level resource adjustments and allocation recovery
Recovery Outcome and pending_failed Status
When `try_recover_job` is called, it returns a `RecoveryOutcome` enum that determines the final job status:
```rust
pub enum RecoveryOutcome {
    /// Job was successfully scheduled for retry
    Retried,
    /// No failure handler defined - use PendingFailed status
    NoHandler,
    /// Failure handler exists but no rule matched - use PendingFailed status
    NoMatchingRule,
    /// Max retries exceeded - use Failed status
    MaxRetriesExceeded,
    /// API call or other error - use Failed status
    Error(String),
}
```
Status Assignment Flow
flowchart TD
FAIL["Job fails"]
TRY["try_recover_job()"]
RETRIED{"Outcome?"}
READY["Status: ready<br/>attempt_id += 1"]
PENDING["Status: pending_failed"]
FAILED["Status: failed"]
FAIL --> TRY
TRY --> RETRIED
RETRIED -->|Retried| READY
RETRIED -->|NoHandler / NoMatchingRule| PENDING
RETRIED -->|MaxRetriesExceeded / Error| FAILED
style FAIL fill:#dc3545,color:#fff
style READY fill:#28a745,color:#fff
style PENDING fill:#ffc107,color:#000
style FAILED fill:#6c757d,color:#fff
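The same mapping written as a `match`; the `JobStatus` variants follow the statuses named in this document and are illustrative:

```rust
// Illustrative status enum; variant names follow the flowchart above.
enum JobStatus {
    Ready,
    PendingFailed,
    Failed,
}

fn status_for_outcome(outcome: &RecoveryOutcome) -> JobStatus {
    match outcome {
        // Retry reserved: the server has already put the job back in the Ready queue.
        RecoveryOutcome::Retried => JobStatus::Ready,
        // No handler or no matching rule: await classification.
        RecoveryOutcome::NoHandler | RecoveryOutcome::NoMatchingRule => JobStatus::PendingFailed,
        // Out of retries or an API error: terminal failure.
        RecoveryOutcome::MaxRetriesExceeded | RecoveryOutcome::Error(_) => JobStatus::Failed,
    }
}
```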
pending_failed Status (value 10)
The pending_failed status is a new job state that indicates:
- The job failed with a non-zero exit code
- No failure handler rule matched the exit code
- The job is awaiting classification (retry or fail)
Key properties:
- Not terminal: Workflow is not considered complete while jobs are `pending_failed`
- Downstream blocked: Dependent jobs remain in `blocked` status (not canceled)
- Resettable: `reset-status --failed-only` includes `pending_failed` jobs
Integration with AI-Assisted Recovery
Jobs in `pending_failed` status can be classified by an AI agent using MCP tools:
sequenceDiagram
participant JR as JobRunner
participant API as Torc API
participant MCP as torc-mcp-server
participant AI as AI Agent
JR->>API: complete_job(status=pending_failed)
Note over JR,API: Job awaiting classification
AI->>MCP: list_pending_failed_jobs(workflow_id)
MCP->>API: GET /jobs?status=pending_failed
API-->>MCP: Jobs with stderr content
MCP-->>AI: Pending jobs + stderr
AI->>AI: Analyze error patterns
AI->>MCP: classify_and_resolve_failures(classifications)
alt action = retry
MCP->>API: PUT /jobs/{id} status=ready
Note over API: Triggers re-execution
else action = fail
MCP->>API: PUT /jobs/{id} status=failed
Note over API: Triggers downstream cancellation
end
See AI-Assisted Recovery Design for full details.