Rerun Failed Jobs

When jobs in a workflow fail, you have several options for retrying them depending on your execution environment and how much automation you want.

Looking to rerun jobs after editing an input file? That is a different operation; see Intelligent Restart.

Slurm Workflows: torc recover

For Slurm workflows, torc recover is the comprehensive option. It diagnoses each failure (OOM, timeout, unknown), adjusts resource requirements, resets the failed jobs, reinitializes the workflow, and resubmits Slurm allocations:

# Preview what recovery would do
torc recover <workflow_id> --dry-run

# Interactive recovery wizard (default)
torc recover <workflow_id>

# Non-interactive recovery (for scripts/CI)
torc recover <workflow_id> --no-prompts
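
For scripted use, the preview and the non-interactive run can be chained so that recovery is only applied when the preview succeeds. The following is a minimal sketch, not part of torc itself: `recover_workflow` and the `TORC` override variable are hypothetical names introduced here, and the script assumes torc exits non-zero when a command fails, which this guide does not confirm.

```shell
#!/bin/sh
# Hypothetical override so the function can be exercised without torc installed.
TORC="${TORC:-torc}"

# recover_workflow <workflow_id>
# Previews recovery, then applies it without prompts.
# Assumption: torc returns a non-zero exit status on failure.
recover_workflow() {
    workflow_id="$1"
    if [ -z "$workflow_id" ]; then
        echo "usage: recover_workflow <workflow_id>" >&2
        return 2
    fi
    # Preview what recovery would do; stop if the preview itself fails.
    "$TORC" recover "$workflow_id" --dry-run || return 1
    # Apply recovery non-interactively.
    "$TORC" recover "$workflow_id" --no-prompts
}
```

Calling `recover_workflow <workflow_id>` from a CI step then runs both commands in order. Setting `TORC=echo` prints the commands instead of executing them, which is a cheap way to check the wiring.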

For continuous monitoring with auto-recovery, use torc watch --recover instead. It polls until the workflow completes and reruns recovery after each round of failures.

See Automatic Failure Recovery for the full guide.

Local Workflows: torc workflows reset-status

For local (non-Slurm) workflows, or when you just want to retry without resource adjustment:

# Reset only failed jobs to ready and reinitialize the workflow
torc workflows reset-status <workflow_id> --failed-only --reinitialize

# Or reset failed jobs without reinitializing (e.g. transient infrastructure issue)
torc workflows reset-status <workflow_id> --failed-only

Then resume execution with torc run <workflow_id> (local) or torc submit <workflow_id> (Slurm).
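
The reset and resume steps can be combined into one helper. This is a sketch under stated assumptions rather than a torc feature: `retry_failed`, its `local`/`slurm` mode argument, and the `TORC` override variable are hypothetical, and the script assumes torc exits non-zero on failure.

```shell
#!/bin/sh
# Hypothetical override so the function can be exercised without torc installed.
TORC="${TORC:-torc}"

# retry_failed <workflow_id> [local|slurm]
# Resets failed jobs, reinitializes the workflow, then resumes execution.
retry_failed() {
    workflow_id="$1"
    mode="${2:-local}"
    case "$mode" in
        local) resume=run ;;      # torc run <workflow_id>
        slurm) resume=submit ;;   # torc submit <workflow_id>
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    # Assumption: a non-zero exit status means the reset did not succeed.
    "$TORC" workflows reset-status "$workflow_id" --failed-only --reinitialize || return 1
    "$TORC" "$resume" "$workflow_id"
}
```

Drop `--reinitialize` from the helper if the failure was a transient infrastructure issue and reinitialization is not needed.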

Choosing the Right Tool

| Scenario | Use |
| --- | --- |
| Slurm workflow with OOM/timeout failures | torc recover |
| Slurm workflow, want continuous self-healing | torc watch --recover |
| Local workflow with failures | torc workflows reset-status --failed-only |
| Want to retry without changing resource allocations | torc workflows reset-status --failed-only |
| Workflow ran fine but inputs changed | Intelligent Restart |
| Need AI-driven classification of unfamiliar failure modes | AI-Assisted Recovery |
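
For wrapper scripts that need to pick a tool programmatically, the scenarios above can be encoded as a simple lookup. This is only an illustrative sketch: `suggest_tool` and the scenario keys are hypothetical names invented here, and the function prints the suggested tool or guide without running anything.

```shell
#!/bin/sh
# suggest_tool <scenario>
# Prints the tool or guide suggested for a scenario.
# The scenario keys are hypothetical labels, not torc terminology.
suggest_tool() {
    case "$1" in
        slurm-resource-failure)
            echo "torc recover" ;;
        slurm-self-healing)
            echo "torc watch --recover" ;;
        local-failure|retry-same-resources)
            echo "torc workflows reset-status --failed-only" ;;
        inputs-changed)
            echo "Intelligent Restart" ;;
        unfamiliar-failure)
            echo "AI-Assisted Recovery" ;;
        *)
            echo "unknown scenario: $1" >&2
            return 2 ;;
    esac
}
```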

See Also