Rerun Failed Jobs

When jobs in a workflow fail, you have several options for retrying them depending on your execution environment and how much automation you want.

Looking to rerun jobs after editing an input file? That's a different operation — see Intelligent Restart.

Slurm Workflows: torc recover

For Slurm workflows, torc recover is the comprehensive option. It diagnoses each failure (OOM, timeout, unknown), adjusts resource requirements, resets the failed jobs, reinitializes the workflow, and resubmits Slurm allocations:

# Preview what recovery would do
torc recover <workflow_id> --dry-run

# Interactive recovery wizard (default)
torc recover <workflow_id>

# Non-interactive recovery (for scripts/CI)
torc recover <workflow_id> --no-prompts

For continuous monitoring with auto-recovery, use torc watch --recover instead — it polls until the workflow completes and re-runs recovery on each round of failures.

See Automatic Failure Recovery for the full guide.
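
For scripted use, the preview and non-interactive commands shown earlier in this section can be chained. The following is a minimal sketch, not part of torc: the function name recover_workflow is hypothetical, and it assumes that torc recover exits non-zero when the dry run hits a problem, which this guide does not state.

```shell
# Hypothetical helper: preview recovery, then apply it without prompts.
# Assumption: a non-zero exit status from the dry run signals a problem.
recover_workflow() {
    workflow_id="$1"
    if ! torc recover "$workflow_id" --dry-run; then
        echo "dry run failed for workflow $workflow_id; not recovering" >&2
        return 1
    fi
    torc recover "$workflow_id" --no-prompts
}
```

A CI step could then call recover_workflow <workflow_id> and fail the step on a non-zero return.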

Local Workflows: torc workflows reset-status

For local (non-Slurm) workflows, or when you just want to retry without resource adjustment:

# Reset only failed jobs to ready and rerun
torc workflows reset-status <workflow_id> --failed-only --reinitialize

# Or reset failed jobs without reinitializing (e.g. transient infrastructure issue)
torc workflows reset-status <workflow_id> --failed-only

Then resume execution with torc run <workflow_id> (local) or torc submit <workflow_id> (Slurm).
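
The reset-then-resume sequence for a local workflow can be wrapped in one function. This is a sketch under assumptions: retry_failed_local is a hypothetical name, and the function assumes reset-status returns a non-zero exit status on failure and needs no interactive confirmation.

```shell
# Hypothetical helper: reset failed jobs, then resume local execution.
retry_failed_local() {
    workflow_id="$1"
    # Stop here if the reset fails, so the workflow is not rerun in a
    # half-reset state. (Assumes a non-zero exit status on failure.)
    torc workflows reset-status "$workflow_id" --failed-only --reinitialize || return 1
    torc run "$workflow_id"
}
```

For a Slurm workflow, the last command would be torc submit instead of torc run.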

Selective Job Reset: torc jobs reset-status

When you need to rerun only specific jobs (not all failed ones), use torc jobs reset-status. This is useful when you know exactly which jobs need to be rerun without resetting the whole workflow.

Jobs are selected in one of three mutually exclusive ways:

  • By job ID: list explicit IDs (they must all belong to the same workflow).
  • By status: --status resets every job currently in one of the given statuses. The value is repeatable or comma-separated (e.g. --status terminated,canceled,failed).
  • By return code: --return-code resets every job whose latest result exited with the given code.

The --status and --return-code modes operate on a whole workflow, selected with --workflow-id (you are prompted to choose one if it is omitted). If a filter matches no jobs, the command exits non-zero with an error — this catches a mistaken assumption (e.g. expecting failed jobs when there are none) in scripts and CI.
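
Because an empty filter produces a non-zero exit, a script can surface that case instead of silently continuing. The sketch below is illustrative only: reset_failed_jobs is a hypothetical name, and combining --no-prompts with --status is an assumption, since the examples in this guide show --no-prompts only with explicit job IDs. The exit status alone does not distinguish "no jobs matched" from other errors.

```shell
# Hypothetical CI helper: reset failed jobs and report a failing filter.
# Assumption: --no-prompts can be combined with --status.
reset_failed_jobs() {
    workflow_id="$1"
    if torc jobs reset-status --status failed --workflow-id "$workflow_id" --no-prompts; then
        echo "reset failed jobs in workflow $workflow_id"
    else
        rc=$?
        echo "reset-status exited with $rc (no matching jobs, or another error)" >&2
        return "$rc"
    fi
}
```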

Unlike torc workflows reset-status, this command:

  • Resets only the selected jobs. Downstream dependents are not reset by this command; it lists them for you, and they are reset transitively when you run torc workflows reinit (a rerun job produces new outputs, so its consumers must run again).
  • Does not bump the workflow run_id or reset workflow state. You follow up with a single torc workflows reinit, which bumps the run_id exactly once.

# Preview what would be reset (no changes applied)
torc jobs reset-status 101 102 --dry-run

# Reset every terminated, canceled, or failed job in a workflow
torc jobs reset-status --status terminated,canceled,failed --workflow-id <workflow_id>

# Reset all jobs whose latest result exited with return code 42
torc jobs reset-status --return-code 42 --workflow-id <workflow_id>

# Reset and reinitialize in one step, then run
torc jobs reset-status 101 102 --reinit
torc run <workflow_id>      # local execution
# or: torc submit <workflow_id>   # Slurm

# Reset and rerun (manual two-step flow)
torc jobs reset-status 101 102 --no-prompts
torc workflows reinit <workflow_id>
torc run <workflow_id>

# Override safety checks (e.g. workers still active)
torc jobs reset-status 101 --force

# JSON output for scripting
torc -f json jobs reset-status 101 102 --no-prompts

The workflow does not need to be complete — the command can be run repeatedly (e.g. to reset additional jobs after an earlier reset) as long as no workers are active. If a selected job completed successfully, the command warns you before resetting it, since resetting discards its results and reruns it.

The --force flag bypasses two checks: (1) the no-active-workers check (compute nodes and Slurm allocations), and (2) the active-status guard (jobs in Running or Pending are normally rejected).
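
Since --force skips both safety checks, a script may want to require an explicit opt-in before passing it. A minimal sketch, assuming a hypothetical ALLOW_FORCE environment variable and a hypothetical reset_job wrapper (neither is part of torc):

```shell
# Hypothetical wrapper: pass --force only when the caller opts in.
# ALLOW_FORCE is an illustrative variable, not a torc setting.
reset_job() {
    job_id="$1"
    if [ "${ALLOW_FORCE:-0}" = "1" ]; then
        torc jobs reset-status "$job_id" --force
    else
        torc jobs reset-status "$job_id"
    fi
}
```

With this wrapper, reset_job 101 keeps the safety checks, while ALLOW_FORCE=1 reset_job 101 bypasses them.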

Choosing the Right Tool

| Scenario | Use |
|---|---|
| Slurm workflow with OOM/timeout failures | torc recover |
| Slurm workflow, want continuous self-healing | torc watch --recover |
| Local workflow with failures | torc workflows reset-status --failed-only |
| Want to retry without changing resource allocations | torc workflows reset-status --failed-only |
| Rerun only specific known jobs | torc jobs reset-status <id>... |
| Rerun every job in a given status (or return code) | torc jobs reset-status --status <s> / --return-code <n> |
| Workflow ran fine but inputs changed | Intelligent Restart |
| Need AI-driven classification of unfamiliar failure modes | AI-Assisted Recovery |

See Also