Workflow Definition

A workflow is a collection of jobs with dependencies. You define workflows in YAML, JSON5, or JSON files.

Minimal Example

name: hello_world
jobs:
  - name: greet
    command: echo "Hello, World!"

That's it. One job, no dependencies.
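Since workflows may also be defined in JSON5 or JSON files, the same minimal workflow can be written in JSON5. A sketch, assuming the field names map over directly from the YAML form:

{
  name: 'hello_world',
  jobs: [
    { name: 'greet', command: 'echo "Hello, World!"' },
  ],
}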

Jobs with Dependencies

name: two_stage
jobs:
  - name: prepare
    command: ./prepare.sh

  - name: process
    command: ./process.sh
    depends_on: [prepare]

The process job waits for prepare to complete.
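A job can also depend on more than one upstream job. The sketch below assumes depends_on accepts several job names, in which case report waits for both extract steps (the job and script names are illustrative):

name: fan_in
jobs:
  - name: extract_a
    command: ./extract_a.sh

  - name: extract_b
    command: ./extract_b.sh

  - name: report
    command: ./report.sh
    depends_on: [extract_a, extract_b]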

Job Parameterization

Create multiple jobs from a single definition using parameters:

name: parameter_sweep
jobs:
  - name: task_{i}
    command: ./run.sh --index {i}
    parameters:
      i: "1:10"

This expands to 10 jobs: task_1, task_2, ..., task_10.

Parameter Formats

Format           Example          Expands To
Range            "1:5"            1, 2, 3, 4, 5
Range with step  "0:10:2"         0, 2, 4, 6, 8, 10
List             "[a,b,c]"        a, b, c
Float range      "0.0:1.0:0.25"   0.0, 0.25, 0.5, 0.75, 1.0
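As a sketch of the list and float-range formats in context (the job names, scripts, and parameter names fmt and frac are illustrative):

name: format_examples
jobs:
  - name: convert_{fmt}
    command: ./convert.sh --format {fmt}
    parameters:
      fmt: "[csv,json,parquet]"

  - name: sample_{frac}
    command: ./sample.sh --fraction {frac}
    parameters:
      frac: "0.0:1.0:0.25"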

Format Specifiers

Control how values appear in names:

- name: job_{i:03d}      # job_001, job_002, ...
  parameters:
    i: "1:100"

- name: lr_{lr:.4f}      # lr_0.0010, lr_0.0100, ...
  parameters:
    lr: "[0.001,0.01,0.1]"
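A format specifier can be combined with the plain command interpolation shown earlier; in this sketch only the job name is zero-padded, and the script name and flag are illustrative:

- name: shard_{i:04d}     # shard_0001, shard_0002, ..., shard_0256
  command: ./shard.sh --index {i}
  parameters:
    i: "1:256"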

Resource Requirements

Specify what resources each job needs:

name: gpu_workflow

resource_requirements:
  - name: gpu_job
    num_cpus: 8
    num_gpus: 1
    memory: 16g
    runtime: PT2H

jobs:
  - name: train
    command: python train.py
    resource_requirements: gpu_job

Resource requirements are used for:

  • Local execution: ensuring jobs don't exceed available resources
  • HPC/Slurm: requesting appropriate allocations
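Multiple requirement profiles can be defined once and referenced by name from different jobs. A sketch, with the profile names cpu_light and gpu_heavy chosen for illustration:

name: mixed_resources

resource_requirements:
  - name: cpu_light
    num_cpus: 2
    memory: 4g
    runtime: PT30M

  - name: gpu_heavy
    num_cpus: 16
    num_gpus: 4
    memory: 64g
    runtime: PT8H

jobs:
  - name: preprocess
    command: python preprocess.py
    resource_requirements: cpu_light

  - name: train
    command: python train.py
    resource_requirements: gpu_heavy
    depends_on: [preprocess]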

Complete Example

name: data_pipeline
description: Process data in parallel, then aggregate

resource_requirements:
  - name: worker
    num_cpus: 4
    memory: 8g
    runtime: PT1H

jobs:
  - name: process_{i}
    command: python process.py --chunk {i} --output results/chunk_{i}.json
    resource_requirements: worker
    parameters:
      i: "1:10"

  - name: aggregate
    command: python aggregate.py --input results/ --output final.json
    resource_requirements: worker
    depends_on:
      - process_{i}
    parameters:
      i: "1:10"

This creates:

  • 10 parallel process_* jobs
  • 1 aggregate job that waits for all 10 to complete

Failure Recovery Options

Control how Torc handles job failures:

Default Behavior

By default, jobs that fail without a matching failure handler receive the Failed status:

name: my_workflow
jobs:
  - name: task
    command: ./run.sh  # If this fails, status = Failed

AI-Assisted Recovery (Opt-in)

Enable intelligent classification of ambiguous failures:

name: ml_training
use_pending_failed: true  # Enable AI-assisted recovery

jobs:
  - name: train_model
    command: python train.py

With use_pending_failed: true:

  • Jobs without matching failure handlers get PendingFailed status
  • AI agent can analyze stderr and decide whether to retry or fail
  • See AI-Assisted Recovery for details

See Also