Introduction
Torc is a distributed workflow orchestration system for managing computational pipelines, from simple sets of independent jobs that need to run in parallel to complex workflows with job dependencies, mixed resource requirements, and multiple stages.
Key Features
- Declarative Workflow Definitions — Define workflows in YAML, JSON, JSON5, or KDL
- Automatic Dependency Resolution — Dependencies inferred from file and data relationships
- Distributed Execution — Run jobs across local machines, HPC clusters, and networked compute nodes
- Resource Management — Track CPU and memory usage across all jobs
- Automatic Failure Recovery — Detect OOM/timeout failures and retry with adjusted resources
- Fault Tolerance — Resume workflows after failures without losing progress
- AI-Assisted Configuration and Management — Use AI tools like Claude Code and GitHub Copilot to configure workflows and diagnose problems
Job Parameterization
Create parameter sweeps with simple syntax:
jobs:
- name: job_{index}
command: bash work.sh {index}
parameters:
index: "1:100"
This expands to 100 jobs.
Implicit Dependencies
Dependencies between jobs are automatically inferred from file relationships. Consider this diamond workflow where one job fans out to parallel jobs, which then converge:
name: diamond_workflow
jobs:
- name: preprocess
command: "preprocess.sh -i ${files.input.f1} -o ${files.output.f2} -o ${files.output.f3}"
- name: work1
command: "work.sh -i ${files.input.f2} -o ${files.output.f4}"
- name: work2
command: "work.sh -i ${files.input.f3} -o ${files.output.f5}"
- name: postprocess
command: "postprocess.sh -i ${files.input.f4} -i ${files.input.f5} -o ${files.output.f6}"
files:
- name: f1
path: input.json
- name: f2
path: intermediate_a.json
- name: f3
path: intermediate_b.json
- name: f4
path: result_a.json
- name: f5
path: result_b.json
- name: f6
path: final_output.json
Torc analyzes which jobs produce and consume each file, automatically building the dependency graph:
flowchart TD
f1([input.json])
preprocess[preprocess]
f2([intermediate_a.json])
f3([intermediate_b.json])
work1[work1]
work2[work2]
f4([result_a.json])
f5([result_b.json])
postprocess[postprocess]
f6([final_output.json])
f1 --> preprocess
preprocess --> f2 & f3
f2 --> work1
f3 --> work2
work1 --> f4
work2 --> f5
f4 & f5 --> postprocess
postprocess --> f6
style f1 fill:#d4edda,stroke:#28a745,color:#155724
style f2 fill:#d4edda,stroke:#28a745,color:#155724
style f3 fill:#d4edda,stroke:#28a745,color:#155724
style f4 fill:#d4edda,stroke:#28a745,color:#155724
style f5 fill:#d4edda,stroke:#28a745,color:#155724
style f6 fill:#d4edda,stroke:#28a745,color:#155724
style preprocess fill:#4a9eff,color:#fff
style work1 fill:#4a9eff,color:#fff
style work2 fill:#4a9eff,color:#fff
style postprocess fill:#4a9eff,color:#fff
No explicit depends_on declarations needed — Torc infers that work1 and work2 depend on
preprocess, and postprocess waits for both to complete.
Who Should Use Torc?
Torc is designed for:
- HPC Users who need to parallelize jobs across cluster resources
- Computational Scientists running parameter sweeps and simulations
- Data Engineers building complex data processing pipelines
- ML/AI Researchers managing training workflows and hyperparameter searches
- Anyone who needs reliable, resumable workflow orchestration
Documentation Structure
This documentation is organized to help you find what you need quickly:
Core Documentation (for all users):
- Getting Started — Installation and local quick start
- Core Concepts — Architecture, job states, and dependencies
- Working with Workflows — Creating and managing workflows
- Tutorials — Step-by-step workflow patterns
- Monitoring & Debugging — TUI, reports, and troubleshooting
- Reference — CLI commands, formats, and configuration
Specialized Topics (below the separator in the sidebar):
- HPC & Slurm — Running on compute clusters
- Fault Tolerance & Recovery — Failure handlers, checkpointing, automatic recovery
- Administration & Security — Server deployment, authentication, access control
- Design & Architecture — Internal design for contributors
Next Steps
- New to Torc? Start with Quick Start (Local)
- Want to understand how it works? Read the Architecture Overview
- Ready to create workflows? Jump to Creating Workflows
- On an HPC cluster? See Quick Start (HPC)
Getting Started
Torc is a distributed workflow orchestration system for managing complex computational pipelines with job dependencies, resource requirements, and distributed execution.
Torc uses a client-server architecture where a central server manages workflow state and coordination, while clients create workflows and job runners execute tasks on compute resources.
How You Interact with Torc
Choose the interface that fits your workflow:
- CLI — Primary interface that provides access to all Torc functionality
- Dashboard — Web UI (torc-dash) for visual configuration and monitoring
- TUI — Terminal User Interface (torc tui) for visual monitoring in a terminal
- AI Assistants — Use Claude Code or GitHub Copilot to manage workflows through natural language: "Create a workflow with 10 parallel jobs" or "Why did job 5 fail?"
- Spec Files — YAML, KDL, or JSON5 files for version-controlled workflow definitions
- Python/Julia APIs — Libraries for programmatic workflow generation
All interfaces work with the same server — mix and match as needed.
Architecture
flowchart LR
subgraph you["You"]
CLI["torc CLI"]
end
subgraph server["Server"]
API["torc-server"]
end
subgraph workers["Workers"]
W1["Job Runner"]
W2["Job Runner"]
end
CLI -->|"create workflow"| API
CLI -->|"start workers"| W1
CLI -->|"start workers"| W2
W1 -->|"claim & report"| API
W2 -->|"claim & report"| API
style CLI fill:#4a9eff,color:#fff
style API fill:#28a745,color:#fff
style W1 fill:#ffc107,color:#000
style W2 fill:#ffc107,color:#000
- You create workflows and start job runners via the CLI
- Server tracks workflow state and job dependencies
- Workers poll the server, claim ready jobs, execute them, and report results
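To make the roles concrete, here is an illustrative command sequence, assuming a personal server with a local database and a spec file named workflow.yaml (the commands are introduced in later sections):
# Server: start a personal instance backed by a local SQLite database
torc-server run --database torc.db --completion-check-interval-secs 5
# You (in a new terminal): create the workflow on the server and note the workflow ID it returns
torc workflows create workflow.yaml
# Workers: start a local job runner that claims ready jobs and reports results
torc run <workflow-id>
# Check progress and results
torc jobs list <workflow-id>
torc results list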
Example Files
The repository includes ready-to-run workflow specifications in YAML, JSON5, and KDL formats:
| Example | Description | Tutorial |
|---|---|---|
| diamond_workflow.yaml | Fan-out/fan-in pattern | Diamond Workflow |
| hundred_jobs_parameterized.yaml | 100 parallel jobs via parameterization | Many Jobs |
| hyperparameter_sweep.yaml | ML grid search (3×3×2 = 18 jobs) | Advanced Params |
| multi_stage_barrier_pattern.yaml | Efficient multi-stage workflow | Barriers |
| resource_monitoring_demo.yaml | CPU/memory tracking | — |
| workflow_actions_simple_slurm.yaml | Automated Slurm scheduling | — |
Browse all examples:
See the examples README for the complete list.
Choose Your Execution Mode
Torc supports three fundamentally different execution environments. Choose the one that matches your use case:
Local Execution
Best for: Development, testing, small-scale workflows on your workstation or a single server
- Jobs run directly on the machine where you start the job runner
- No scheduler needed — simple setup with torc run
- Resource management via local CPU/memory/GPU tracking
- → Quick Start (Local)
HPC/Slurm
Best for: Large-scale computations on institutional HPC clusters
- Jobs submitted to Slurm scheduler for compute node allocation
- Automatic resource matching to partitions/QOS
- Built-in profiles for common HPC systems
- → Quick Start (HPC/Slurm)
Remote Workers
Best for: Distributed execution across multiple machines you control via SSH
- Jobs distributed to remote workers over SSH
- No HPC scheduler required — you manage the machines
- Flexible heterogeneous resources (mix of CPU/GPU machines)
- → Quick Start (Remote Workers)
All three modes:
- Share the same workflow specification format
- Use the same server API for coordination
- Support the same monitoring tools (CLI, TUI, Dashboard)
- Can be used together (e.g., develop locally, deploy to HPC)
Continue to the Quick Start guide to run your first workflow.
Installation
Precompiled Binaries (Recommended)
1. Download the appropriate archive for your platform from the releases page:
   - Linux: torc-<version>-x86_64-unknown-linux-gnu.tar.gz
   - macOS (Intel): torc-<version>-x86_64-apple-darwin.tar.gz
   - macOS (Apple Silicon): torc-<version>-aarch64-apple-darwin.tar.gz
2. Extract the archive:
   # For .tar.gz files
   tar -xzf torc-<version>-<platform>.tar.gz
   # For .zip files
   unzip torc-<version>-<platform>.zip
3. Add the binaries to a directory in your system PATH:
   # Option 1: Copy to an existing PATH directory
   cp torc* ~/.local/bin/
   # Option 2: Add the extracted directory to your PATH
   export PATH="/path/to/extracted/torc:$PATH"
   To make the PATH change permanent, add the export line to your shell configuration file (~/.bashrc, ~/.zshrc, etc.).
macOS users: The precompiled binaries are not signed with an Apple Developer certificate. macOS Gatekeeper will block them by default. To allow the binaries to run, remove the quarantine attribute after downloading:
xattr -cr /path/to/torc*
Alternatively, you can right-click each binary and select "Open" to add a security exception.
Site-Specific Installations
Some HPC facilities maintain pre-installed Torc binaries and shared servers. Check if your site is listed below.
NREL Kestrel
Pre-installed binaries are available at:
/scratch/dthom/torc/
├── 0.8.0/
├── ...
└── latest -> 0.8.0 (symlink to current version)
Recommended: Use the latest directory. Torc maintains backwards compatibility, so you'll automatically receive updates and bug fixes without changing your configuration.
Add to your PATH:
export PATH="/scratch/dthom/torc/latest:$PATH"
Or add to your ~/.bashrc for persistence:
echo 'export PATH="/scratch/dthom/torc/latest:$PATH"' >> ~/.bashrc
Shared server: A torc-server instance runs on a dedicated VM within the Kestrel environment.
Contact Daniel Thom for access credentials and the server URL. Once you have access:
export TORC_API_URL="http://<server-address>/torc-service/v1"
export TORC_PASSWORD="<your-password>"
Building from Source
Prerequisites
- Rust 1.70 or later
- SQLite 3.35 or later (usually included with Rust via sqlx)
Clone the Repository
git clone https://github.com/NREL/torc.git
cd torc
Building All Components
Note that the file .env sets the database URL to ./db/sqlite/dev.db. Change it as desired or set the DATABASE_URL environment variable.
Initialize the database
# Install sqlx-cli if needed
cargo install sqlx-cli --no-default-features --features sqlite
sqlx database setup
Build everything (server, client, dashboard, job runners):
# Development build
cargo build --workspace
# Release build (optimized, recommended)
cargo build --workspace --release
Build individual components:
# Server
cargo build --release -p torc-server
# Client CLI
cargo build --release -p torc
# Web Dashboard
cargo build --release -p torc-dash
# Slurm job runner
cargo build --release -p torc-slurm-job-runner
Binaries will be in target/release/.
Required: Add this directory to your system PATH or copy the binaries to a directory already on your PATH (e.g., ~/.local/bin/).
Python Client
The Python client provides programmatic workflow management for Python users.
Prerequisites
- Python 3.11 or later
Installation
pip install torc-client
The pytorc command will be available after installation.
Julia Client
The Julia client provides programmatic workflow management for Julia users.
Prerequisites
- Julia 1.10 or later
Installation
Since the package is not yet registered in the Julia General registry, install it directly from GitHub:
using Pkg
Pkg.add(url="https://github.com/NREL/torc.git", subdir="julia_client/Torc")
Then use it in your code:
using Torc
For Developers
Running Tests
# Run all tests
cargo test -- --test-threads=1
# Run specific test
cargo test --test test_workflow_manager test_initialize_files_with_updated_files
# Run with debug logging
RUST_LOG=debug cargo test -- --nocapture
Setting Up the Server
Start the server:
# Development mode
cargo run -p torc-server -- run
# Production mode (release build)
./target/release/torc-server run
# Custom port
./target/release/torc-server run --port 8080
Server will start on http://localhost:8080.
When running small workflows for testing and demonstration purposes, we recommend setting --completion-check-interval-secs so that the server detects job completions faster than the default of 30 seconds:
./target/release/torc-server run --completion-check-interval-secs 5
Quick Start (Local)
This guide walks you through creating and running your first Torc workflow with local execution. Jobs run directly on the current machine, making this ideal for testing, development, or non-HPC environments.
For running workflows on HPC clusters with Slurm, see Quick Start (HPC).
Start the Server
Start a Torc server with a local database. Setting --completion-check-interval-secs ensures job
completions are processed quickly (use this for personal servers, not shared deployments).
torc-server run --database torc.db --completion-check-interval-secs 5
Test the Connection
In a new terminal, verify the client can connect:
torc workflows list
Create a Workflow
Save this as workflow.yaml:
name: hello_world
description: Simple hello world workflow
jobs:
- name: job 1
command: echo "Hello from torc!"
- name: job 2
command: echo "Hello again from torc!"
Note: Torc also accepts .json, .json5, and .kdl workflow specifications. See Workflow Specification Formats for details.
Run the Workflow
Run jobs locally with a short poll interval for demo purposes:
torc run workflow.yaml --poll-interval 1
This creates the workflow, initializes it, and runs all jobs on the current machine.
View Results
torc results list
Or use the TUI for an interactive view:
torc tui
Example: Diamond Workflow
A workflow with fan-out and fan-in dependencies:
name: diamond_workflow
description: Example workflow with implicit dependencies
jobs:
- name: preprocess
command: "bash tests/scripts/preprocess.sh -i ${files.input.f1} -o ${files.output.f2} -o ${files.output.f3}"
- name: work1
command: "bash tests/scripts/work.sh -i ${files.input.f2} -o ${files.output.f4}"
- name: work2
command: "bash tests/scripts/work.sh -i ${files.input.f3} -o ${files.output.f5}"
- name: postprocess
command: "bash tests/scripts/postprocess.sh -i ${files.input.f4} -i ${files.input.f5} -o ${files.output.f6}"
files:
- name: f1
path: f1.json
- name: f2
path: f2.json
- name: f3
path: f3.json
- name: f4
path: f4.json
- name: f5
path: f5.json
- name: f6
path: f6.json
Dependencies are automatically inferred from file inputs/outputs:
- work1 and work2 wait for preprocess (they depend on its output files)
- postprocess waits for both work1 and work2 to complete
More Examples
The examples directory contains many more workflow examples in YAML, JSON5, and KDL formats.
Next Steps
- Quick Start (HPC) - Run workflows on Slurm clusters
- Creating Workflows - Detailed workflow creation guide
- Terminal UI - Interactive workflow monitoring
Quick Start (HPC)
This guide walks you through running your first Torc workflow on an HPC cluster with Slurm. Jobs are submitted to Slurm and run on compute nodes.
For local execution (testing, development, or non-HPC environments), see Quick Start (Local).
Prerequisites
- Access to an HPC cluster with Slurm
- A Slurm account/allocation for submitting jobs
- Torc installed (see Installation)
Start the Server
On the login node, start a Torc server with a local database:
torc-server run --database torc.db --completion-check-interval-secs 5
Note: For larger deployments, your team may provide a shared Torc server. In that case, skip this step and set
TORC_API_URL to the shared server address.
Check Your HPC Profile
Torc includes built-in profiles for common HPC systems. Check if your system is detected:
torc hpc detect
If detected, you'll see your HPC system name. To see available partitions:
torc hpc partitions <profile-name>
Note: If your HPC system isn't detected, see Custom HPC Profile or request built-in support.
Create a Workflow with Resource Requirements
Save this as workflow.yaml:
name: hpc_hello_world
description: Simple HPC workflow
resource_requirements:
- name: small
num_cpus: 4
memory: 8g
runtime: PT30M
jobs:
- name: job1
command: echo "Hello from compute node!" && hostname
resource_requirements: small
- name: job2
command: echo "Hello again!" && hostname
resource_requirements: small
depends_on: [job1]
Key differences from local workflows:
- resource_requirements: Define CPU, memory, and runtime needs
- Jobs reference these requirements by name
- Torc matches requirements to appropriate Slurm partitions
Submit the Workflow
Submit with your Slurm account:
torc submit-slurm --account <your-account> workflow.yaml
Torc will:
- Detect your HPC system
- Match job requirements to appropriate partitions
- Generate Slurm scheduler configurations
- Create and submit the workflow
Monitor Progress
Check workflow status:
torc workflows list
torc jobs list <workflow-id>
Or use the interactive TUI:
torc tui
Check Slurm queue:
squeue --me
View Results
Once jobs complete:
torc results list <workflow-id>
Job output is stored in the output/ directory by default.
Example: Multi-Stage Pipeline
A more realistic workflow with different resource requirements per stage:
name: analysis_pipeline
description: Data processing pipeline
resource_requirements:
- name: light
num_cpus: 4
memory: 8g
runtime: PT30M
- name: compute
num_cpus: 32
memory: 64g
runtime: PT2H
- name: gpu
num_cpus: 8
num_gpus: 1
memory: 32g
runtime: PT1H
jobs:
- name: preprocess
command: python preprocess.py
resource_requirements: light
- name: train
command: python train.py
resource_requirements: gpu
depends_on: [preprocess]
- name: evaluate
command: python evaluate.py
resource_requirements: compute
depends_on: [train]
Torc stages resource allocation based on dependencies:
- preprocess resources are allocated at workflow start
- train resources are allocated when preprocess completes
- evaluate resources are allocated when train completes
This prevents wasting allocation time on resources that aren't needed yet.
Preview Before Submitting
For production workflows, preview the generated Slurm configuration first:
torc slurm generate --account <your-account> workflow.yaml
This shows what schedulers and actions Torc will create without submitting anything.
Next Steps
- Slurm Workflows — How Torc manages Slurm
- Resource Requirements — All resource options
- HPC Profiles — Managing HPC configurations
- Working with Slurm — Advanced Slurm configuration
- Debugging Slurm Workflows — Troubleshooting
Quick Start (Remote Workers)
This guide walks you through running a Torc workflow on multiple remote machines via SSH. Jobs are distributed across workers without requiring an HPC scheduler like Slurm.
For local execution, see Quick Start (Local). For HPC/Slurm execution, see Quick Start (HPC).
Prerequisites
- SSH key-based authentication to all remote machines (no password prompts)
- Torc installed on all machines with matching versions
- Torc server accessible from all machines
Start the Server
Start a Torc server that's accessible from the remote machines. This typically means binding to a network interface (not just localhost):
torc-server run --database torc.db --host 0.0.0.0 --port 8080
Create a Worker File
Create a file listing the remote machines. Each line contains one machine in the format
[user@]hostname[:port]:
# workers.txt
worker1.example.com
alice@worker2.example.com
admin@192.168.1.10:2222
Lines starting with # are comments. Empty lines are ignored.
Create a Workflow
Save this as workflow.yaml:
name: distributed_hello
description: Distributed hello world workflow
jobs:
- name: job 1
command: echo "Hello from $(hostname)!"
- name: job 2
command: echo "Hello again from $(hostname)!"
- name: job 3
command: echo "And once more from $(hostname)!"
Create the Workflow on the Server
torc workflows create workflow.yaml
Note the workflow ID in the output.
Run Workers on Remote Machines
Start workers on all remote machines. Each worker will poll for available jobs and execute them:
torc remote run --workers workers.txt <workflow-id> --poll-interval 5
This will:
- Check SSH connectivity to all machines
- Verify all machines have the same torc version
- Start a worker process on each machine (detached via nohup)
- Report which workers started successfully
Check Worker Status
Monitor which workers are still running:
torc remote status <workflow-id>
View Workflow Progress
Check job status from any machine:
torc jobs list <workflow-id>
Or use the interactive TUI:
torc tui
Collect Logs
After the workflow completes, collect logs from all workers:
torc remote collect-logs <workflow-id> --local-output-dir ./logs
This creates a tarball for each worker containing:
- Worker logs: torc_worker_<workflow_id>.log
- Job stdout/stderr: job_stdio/job_*.o and job_stdio/job_*.e
- Resource utilization data (if enabled): resource_utilization/resource_metrics_*.db
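As a rough sketch, assuming the tarballs were written to ./logs as in the command above (exact archive names and internal layout may vary), you can unpack and skim them with standard tools:
# Extract every collected archive in place
for archive in ./logs/*.tar*; do
    tar -xf "$archive" -C ./logs
done
# Locate job stderr files across all workers
find ./logs -name 'job_*.e'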
Stop Workers
If you need to stop workers before the workflow completes:
torc remote stop <workflow-id>
Add --force to send SIGKILL instead of SIGTERM.
Next Steps
- Remote Workers Guide - Detailed configuration and troubleshooting
- Creating Workflows - Workflow specification format
- Resource Monitoring - Track CPU/memory usage per job
Core Concepts
This section covers the fundamental concepts you need to understand to work effectively with Torc.
- Architecture Overview - How Torc's components work together
- Workflow Definition - Structure of workflows and jobs
- Job State Transitions - How jobs move through their lifecycle
- Dependency Resolution - How job dependencies are managed
- Job Runners - Executing jobs locally and on clusters
- Parallelization Strategies - Running jobs in parallel
- Workflow Reinitialization - Restarting workflows
Architecture
Overview
Torc uses a client-server architecture where a central server manages workflow state and coordination, while clients create workflows and job runners execute tasks on compute resources.
flowchart TB
subgraph ui["User Interfaces"]
DASH["torc-dash<br/>(web)"]
TUI["torc tui<br/>(terminal)"]
CLI["torc CLI"]
end
subgraph server["Server (torc-server)"]
API["HTTP API"]
DB[(SQLite)]
API <--> DB
end
subgraph workers["Job Runners"]
W1["Runner 1"]
W2["Runner 2"]
WN["Runner N"]
end
DASH -->|"HTTP"| API
DASH -->|"executes"| CLI
TUI -->|"HTTP"| API
TUI -->|"executes"| CLI
CLI -->|"HTTP"| API
W1 -->|"claim jobs"| API
W2 -->|"claim jobs"| API
WN -->|"claim jobs"| API
style DASH fill:#17a2b8,color:#fff
style TUI fill:#17a2b8,color:#fff
style CLI fill:#4a9eff,color:#fff
style API fill:#28a745,color:#fff
style DB fill:#28a745,color:#fff
style W1 fill:#ffc107,color:#000
style W2 fill:#ffc107,color:#000
style WN fill:#ffc107,color:#000
Key Components:
| Component | Description |
|---|---|
| torc-dash | Web dashboard for visual workflow management |
| torc tui | Terminal UI for monitoring in SSH environments |
| torc CLI | Command-line tool for all workflow operations |
| Server | HTTP API service that manages workflow state via SQLite |
| Job Runner | Worker process that polls for ready jobs, executes them, and reports results |
Workflow Definition
A workflow is a collection of jobs with dependencies. You define workflows in YAML, JSON, JSON5, or KDL files.
Minimal Example
name: hello_world
jobs:
- name: greet
command: echo "Hello, World!"
That's it. One job, no dependencies.
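Assuming a server is running locally and the spec above is saved as hello.yaml (an arbitrary file name), you can create and execute it in one step, as described in Quick Start (Local):
# Create, initialize, and run the workflow on the current machine
torc run hello.yaml --poll-interval 1
# Inspect the results
torc results list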
Jobs with Dependencies
name: two_stage
jobs:
- name: prepare
command: ./prepare.sh
- name: process
command: ./process.sh
depends_on: [prepare]
The process job waits for prepare to complete.
Job Parameterization
Create multiple jobs from a single definition using parameters:
name: parameter_sweep
jobs:
- name: task_{i}
command: ./run.sh --index {i}
parameters:
i: "1:10"
This expands to 10 jobs: task_1, task_2, ..., task_10.
Parameter Formats
| Format | Example | Expands To |
|---|---|---|
| Range | "1:5" | 1, 2, 3, 4, 5 |
| Range with step | "0:10:2" | 0, 2, 4, 6, 8, 10 |
| List | "[a,b,c]" | a, b, c |
| Float range | "0.0:1.0:0.25" | 0.0, 0.25, 0.5, 0.75, 1.0 |
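A quick way to check how parameter strings expand is to validate a spec with --dry-run before creating anything. This sketch uses made-up names (sweep.yaml, n, mode) and combines a stepped range with a list:
# Write a small parameterized spec
cat > sweep.yaml <<'EOF'
name: sweep_demo
jobs:
  - name: run_{n}_{mode}
    command: ./run.sh --n {n} --mode {mode}
    parameters:
      n: "0:10:2"
      mode: "[fast,slow]"
EOF
# Report the expanded job count without creating the workflow on the server
torc workflows create --dry-run sweep.yaml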
Format Specifiers
Control how values appear in names:
- name: job_{i:03d} # job_001, job_002, ...
parameters:
i: "1:100"
- name: lr_{lr:.4f} # lr_0.0010, lr_0.0100, ...
parameters:
lr: "[0.001,0.01,0.1]"
Resource Requirements
Specify what resources each job needs:
name: gpu_workflow
resource_requirements:
- name: gpu_job
num_cpus: 8
num_gpus: 1
memory: 16g
runtime: PT2H
jobs:
- name: train
command: python train.py
resource_requirements: gpu_job
Resource requirements are used for:
- Local execution: ensuring jobs don't exceed available resources
- HPC/Slurm: requesting appropriate allocations
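For example, assuming the spec above is saved as gpu_workflow.yaml and you have a Slurm account, the same file can be executed either way without modification:
# Local: the runner tracks CPU/GPU/memory and only starts jobs that fit
torc run gpu_workflow.yaml
# HPC: requirements are matched to Slurm partitions at submission time
torc submit-slurm --account <your-account> gpu_workflow.yaml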
Complete Example
name: data_pipeline
description: Process data in parallel, then aggregate
resource_requirements:
- name: worker
num_cpus: 4
memory: 8g
runtime: PT1H
jobs:
- name: process_{i}
command: python process.py --chunk {i} --output results/chunk_{i}.json
resource_requirements: worker
parameters:
i: "1:10"
- name: aggregate
command: python aggregate.py --input results/ --output final.json
resource_requirements: worker
depends_on:
- process_{i}
parameters:
i: "1:10"
This creates:
- 10 parallel process_* jobs
- 1 aggregate job that waits for all 10 to complete
Failure Recovery Options
Control how Torc handles job failures:
Default Behavior
By default, a job that fails without a matching failure handler receives the Failed status:
name: my_workflow
jobs:
- name: task
command: ./run.sh # If this fails, status = Failed
AI-Assisted Recovery (Opt-in)
Enable intelligent classification of ambiguous failures:
name: ml_training
use_pending_failed: true # Enable AI-assisted recovery
jobs:
- name: train_model
command: python train.py
With use_pending_failed: true:
- Jobs without matching failure handlers get PendingFailed status
- An AI agent can analyze stderr and decide whether to retry or fail
- See AI-Assisted Recovery for details
See Also
- Workflow Specification Formats — Complete syntax reference
- Job Parameterization — Advanced parameter options
- Dependency Resolution — How dependencies work
Job State Transitions
Jobs progress through a defined lifecycle:
stateDiagram-v2
[*] --> uninitialized
uninitialized --> ready: initialize_jobs
uninitialized --> blocked: has dependencies
blocked --> ready: dependencies met
ready --> pending: runner claims
pending --> running: execution starts
running --> completed: exit 0
running --> failed: exit != 0 (handler match + max retries)
running --> pending_failed: exit != 0 (no handler match)
running --> ready: exit != 0 (failure handler retry)
running --> canceled: user cancels
running --> terminated: system terminates
pending_failed --> failed: AI classifies as permanent
pending_failed --> ready: AI classifies as transient
pending_failed --> uninitialized: reset-status
completed --> [*]
failed --> [*]
canceled --> [*]
terminated --> [*]
classDef waiting fill:#6c757d,color:#fff
classDef ready fill:#17a2b8,color:#fff
classDef active fill:#ffc107,color:#000
classDef success fill:#28a745,color:#fff
classDef error fill:#dc3545,color:#fff
classDef stopped fill:#6f42c1,color:#fff
classDef classification fill:#fd7e14,color:#fff
class uninitialized,blocked waiting
class ready ready
class pending,running active
class completed success
class failed error
class canceled,terminated stopped
class pending_failed classification
State Descriptions
- uninitialized (0) - Job created but dependencies not evaluated
- blocked (1) - Waiting for dependencies to complete
- ready (2) - All dependencies satisfied, ready for execution
- pending (3) - Job claimed by runner
- running (4) - Currently executing
- completed (5) - Finished successfully (exit code 0)
- failed (6) - Finished with error (exit code != 0)
- canceled (7) - Explicitly canceled by user or torc. Never executed.
- terminated (8) - Explicitly terminated by system, such as at wall-time timeout
- pending_failed (10) - Job failed without a matching failure handler. Awaiting AI-assisted classification to determine if the error is transient (retry) or permanent (fail). See AI-Assisted Recovery.
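To inspect and act on these states from the command line, you can combine commands covered elsewhere in this documentation (IDs are placeholders):
# Show the current state of every job in a workflow
torc jobs list <workflow-id>
# Send failed jobs back to ready so they can be retried
torc workflows reset-status <workflow-id> --failed-only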
Dependency Resolution
Torc supports two types of dependencies: explicit (declared via depends_on) and implicit
(inferred from file/data relationships). For a hands-on tutorial, see
Diamond Workflow with File Dependencies.
Explicit Dependencies
Declared via depends_on:
jobs:
- name: preprocess
command: preprocess.sh
- name: analyze
command: analyze.sh
depends_on:
- preprocess
Implicit Dependencies via Files
Torc automatically infers dependencies from file relationships. When one job outputs a file and another job inputs it, the dependency is created automatically:
jobs:
- name: preprocess
command: "preprocess.sh -o ${files.output.intermediate}"
- name: analyze
command: "analyze.sh -i ${files.input.intermediate}"
files:
- name: intermediate
path: data/intermediate.json
This creates the following dependency chain:
flowchart TD
input([input.json])
preprocess[preprocess]
intermediate([intermediate.json])
analyze[analyze]
output([output.json])
input --> preprocess
preprocess --> intermediate
intermediate --> analyze
analyze --> output
style input fill:#d4edda,stroke:#28a745,color:#155724
style intermediate fill:#d4edda,stroke:#28a745,color:#155724
style output fill:#d4edda,stroke:#28a745,color:#155724
style preprocess fill:#4a9eff,color:#fff
style analyze fill:#4a9eff,color:#fff
No explicit depends_on needed — Torc infers that analyze depends on preprocess because they
share the intermediate file.
Implicit Dependencies via User Data
User data works like files but stores JSON in the database instead of the filesystem:
jobs:
- name: generate_config
command: |
torc user-data update ${user_data.output.config} --data '{"lr": 0.001}'
- name: run_simulation
command: |
CONFIG=$(torc user-data get ${user_data.input.config} | jq '.data')
python simulate.py --config "$CONFIG"
user_data:
- name: config
flowchart LR
gen[generate_config]
config[(config)]
sim[run_simulation]
gen -->|writes| config
config -->|reads| sim
style config fill:#fff3cd,stroke:#ffc107,color:#856404
style gen fill:#4a9eff,color:#fff
style sim fill:#4a9eff,color:#fff
Use user data for small configuration objects; use files for large datasets.
Resolution Process
During workflow creation, the server:
- Resolves all names to IDs
- Stores explicit dependencies in job_depends_on
- Stores file/user_data relationships in junction tables
- During initialize_jobs, queries junction tables to add implicit dependencies
Dependency Graph Evaluation
When initialize is called:
- All jobs start in the uninitialized state
- The server builds the complete dependency graph from explicit and implicit dependencies
- Jobs with no unsatisfied dependencies are marked ready
- Jobs waiting on dependencies are marked blocked
- As jobs complete, blocked jobs are re-evaluated and may become ready
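A minimal way to observe this evaluation, assuming a workflow built step by step with the CLI (see How to Create Workflows), is to initialize it and then list job states:
# Evaluate the dependency graph; jobs become ready or blocked
torc workflows initialize-jobs <workflow-id>
# Jobs with no unmet dependencies now show as ready, the rest as blocked
torc jobs list <workflow-id>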
Variable Substitution Syntax
In workflow specification files (YAML, JSON5, KDL), use these patterns to reference files and user data in job commands:
| Pattern | Description |
|---|---|
| ${files.input.NAME} | File path this job reads (creates implicit dependency) |
| ${files.output.NAME} | File path this job writes (satisfies dependencies) |
| ${user_data.input.NAME} | User data this job reads |
| ${user_data.output.NAME} | User data this job writes |
Example:
jobs:
- name: process
command: "python process.py -i ${files.input.raw} -o ${files.output.result}"
See Workflow Specification Formats for complete syntax details.
Job Runners
Job runners are worker processes that execute jobs on compute resources.
Job Runner Modes
Torc supports three execution modes:
- Local Runner (torc run) - Runs jobs on the local machine with resource tracking
- HPC/Slurm Runner (torc submit-slurm) - Runs jobs on Slurm-allocated compute nodes
- Remote Workers (torc remote run) - Distributes jobs across SSH-accessible machines
Local Runner
The local runner executes jobs directly on the current machine. Start it with:
torc run <workflow-id>
HPC/Slurm Runner
For HPC clusters, jobs run on Slurm-allocated compute nodes. The torc-slurm-job-runner binary is
launched by Slurm on each allocated node and polls the server for work.
Remote Workers
Remote workers enable distributed execution without a scheduler. The torc remote run command
SSH-es into multiple machines and starts a torc run process on each:
torc remote run --workers workers.txt <workflow-id>
Each remote worker runs as a detached process and polls the server for jobs, just like the local runner. The server coordinates job distribution to prevent double-allocation.
Job Allocation Strategies
The job runner supports two different strategies for retrieving and executing jobs:
Resource-Based Allocation (Default)
Used when: --max-parallel-jobs is NOT specified
Behavior:
- Retrieves jobs from the server via the claim_jobs_based_on_resources command
- Server filters jobs based on available compute node resources (CPU, memory, GPU)
- Only returns jobs that fit within the current resource capacity
- Prevents resource over-subscription and ensures jobs have required resources
- Defaults to requiring one CPU and 1 MB of memory for each job.
Use cases:
- When you want parallelization based on one CPU per job.
- When you have heterogeneous jobs with different resource requirements and want intelligent resource management.
Example 1: Run jobs at a queue depth equal to num_cpus (each job defaults to one CPU):
parameters:
i: "1:100"
jobs:
- name: "work_{i}"
command: bash my_script.sh {i}
use_parameters:
- i
Example 2: Resource-based parallelization:
resource_requirements:
- name: "work_resources"
num_cpus: 32
memory: "200g"
runtime: "PT4H"
num_nodes: 1
parameters:
i: "1:100"
jobs:
- name: "work_{i}"
command: bash my_script.sh {i}
resource_requirements: work_resources
use_parameters:
- i
Simple Queue-Based Allocation
Used when: --max-parallel-jobs is specified
Behavior:
- Retrieves jobs from the server via the claim_next_jobs command
- Server returns the next N ready jobs from the queue (up to the specified limit)
- Ignores job resource requirements completely
- Simply limits the number of concurrent jobs
Use cases: When all jobs have similar resource needs or when the resource bottleneck is not tracked by Torc, such as network or storage I/O. This is the only way to run jobs at a queue depth higher than the number of CPUs in the worker.
Example:
torc run $WORKFLOW_ID \
--max-parallel-jobs 10 \
--output-dir ./results
Job Runner Workflow
The job runner executes a continuous loop with these steps:
flowchart TD
Start([Start]) --> CheckStatus[Check workflow status]
CheckStatus --> IsComplete{Workflow complete<br/>or canceled?}
IsComplete -->|Yes| End([Exit])
IsComplete -->|No| MonitorJobs[Monitor running jobs]
MonitorJobs --> CompleteFinished[Complete finished jobs<br/>Update server status]
CompleteFinished --> ExecuteActions[Execute workflow actions<br/>e.g., schedule Slurm allocations]
ExecuteActions --> ClaimJobs[Claim new jobs from server]
ClaimJobs --> ResourceCheck{Allocation<br/>strategy?}
ResourceCheck -->|Resource-based| ClaimResources[claim_jobs_based_on_resources<br/>Filter by CPU/memory/GPU]
ResourceCheck -->|Queue-based| ClaimQueue[claim_next_jobs<br/>Up to max-parallel-jobs]
ClaimResources --> StartJobs
ClaimQueue --> StartJobs
StartJobs[Start claimed jobs] --> ForEachJob[For each job:<br/>1. Call start_job<br/>2. Execute command<br/>3. Record stdout/stderr]
ForEachJob --> Sleep[Sleep for poll interval]
Sleep --> CheckStatus
style Start fill:#10b981,stroke:#059669,color:#fff
style End fill:#ef4444,stroke:#dc2626,color:#fff
style IsComplete fill:#f59e0b,stroke:#d97706,color:#fff
style ResourceCheck fill:#f59e0b,stroke:#d97706,color:#fff
style CheckStatus fill:#3b82f6,stroke:#2563eb,color:#fff
style MonitorJobs fill:#3b82f6,stroke:#2563eb,color:#fff
style CompleteFinished fill:#3b82f6,stroke:#2563eb,color:#fff
style ExecuteActions fill:#3b82f6,stroke:#2563eb,color:#fff
style ClaimJobs fill:#3b82f6,stroke:#2563eb,color:#fff
style StartJobs fill:#3b82f6,stroke:#2563eb,color:#fff
style ForEachJob fill:#3b82f6,stroke:#2563eb,color:#fff
style Sleep fill:#6b7280,stroke:#4b5563,color:#fff
style ClaimResources fill:#8b5cf6,stroke:#7c3aed,color:#fff
style ClaimQueue fill:#ec4899,stroke:#db2777,color:#fff
- Check workflow status - Poll server to check if workflow is complete or canceled
- Monitor running jobs - Check status of currently executing jobs
- Execute workflow actions - Check for and execute any pending workflow actions, such as scheduling new Slurm allocations.
- Claim new jobs - Request ready jobs from server based on allocation strategy:
  - Resource-based: claim_jobs_based_on_resources
  - Queue-based: claim_next_jobs
- Start jobs - For each claimed job:
  - Call start_job to mark the job as started in the database
  - Execute the job command in a non-blocking subprocess
  - Record stdout/stderr output to files
- Complete jobs - When running jobs finish:
  - Call complete_job with the exit code and result
  - Server updates the job status and automatically marks dependent jobs as ready
- Sleep and repeat - Wait for job completion poll interval, then repeat loop
The runner continues until the workflow is complete or canceled.
Resource Management (Resource-Based Allocation Only)
When using resource-based allocation (default), the local job runner tracks:
- Number of CPUs in use
- Memory allocated to running jobs
- GPUs in use
- Job runtime limits
When a ready job is retrieved, the runner checks if sufficient resources are available before executing it.
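If you prefer to constrain or override what the runner detects, the capacity flags shown in the parallelization examples later in this document can be passed explicitly (a sketch with arbitrary values):
# Advertise an explicit capacity instead of relying on auto-detection
torc run <workflow-id> --num-cpus 16 --memory-gb 64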
Parallelization Strategies
Torc provides flexible parallelization strategies to accommodate different workflow patterns and resource allocation scenarios. Understanding these strategies helps you optimize job execution for your specific use case.
Overview
Torc supports two primary approaches to parallel job execution:
- Resource-aware allocation - Define per-job resource requirements and let runners intelligently select jobs that fit available resources
- Queue-depth parallelism - Control the number of concurrent jobs without resource tracking
The choice between these approaches depends on your workflow characteristics and execution environment.
Use Case 1: Resource-Aware Job Allocation
This strategy is ideal for heterogeneous workflows where jobs have varying resource requirements (CPU, memory, GPU, runtime). The server intelligently allocates jobs based on available compute node resources.
How It Works
When you define resource requirements for each job:
resource_requirements:
- name: small
num_cpus: 2
num_gpus: 0
memory: 4g
runtime: PT30M
- name: large
num_cpus: 16
num_gpus: 2
memory: 128g
runtime: PT8H
jobs:
- name: preprocessing
command: ./preprocess.sh
resource_requirements: small
- name: model_training
command: python train.py
resource_requirements: large
The job runner automatically detects its available resources and pulls jobs from the server that fit within them:
torc run $WORKFLOW_ID
The server's GET /workflows/{id}/claim_jobs_based_on_resources endpoint:
- Receives the runner's resource capacity
- Queries the ready queue for jobs that fit within those resources
- Returns a set of jobs that can run concurrently without over-subscription
- Updates job status from ready to pending atomically
Job Allocation Ambiguity: Two Approaches
When you have multiple compute nodes or schedulers with different capabilities, there are two ways to handle job allocation:
Approach 1: Sort Method (Flexible but Potentially Ambiguous)
How it works:
- Jobs do NOT specify a particular scheduler/compute node
- The server uses a job_sort_method parameter to prioritize jobs when allocating
- Any runner with sufficient resources can claim any ready job
Available sort methods: define the job_sort_method field in the workflow specification file (YAML/JSON/KDL)
- gpus_runtime_memory - Prioritize jobs by GPU count (desc), then runtime (desc), then memory (desc)
- gpus_memory_runtime - Prioritize jobs by GPU count (desc), then memory (desc), then runtime (desc)
- none - No sorting; jobs are selected in queue order
Tradeoffs:
✅ Advantages:
- Maximum flexibility - any runner can execute any compatible job
- Better resource utilization - if GPU runner is idle, it can pick up CPU-only jobs
- Simpler workflow specifications - no need to explicitly map jobs to schedulers
- Fault tolerance - if one runner fails, others can pick up its jobs
❌ Disadvantages:
- Ambiguity - no guarantee GPU jobs go to GPU runners
- Potential inefficiency - high-memory jobs might land on low-memory nodes if timing is unlucky
- Requires careful sort method selection
- Less predictable job placement
When to use:
- Homogeneous or mostly-homogeneous compute resources
- Workflows where job placement flexibility is valuable
- When you want runners to opportunistically pick up work
- Development and testing environments
Approach 2: Scheduler ID (Deterministic but Less Flexible)
How it works:
- Define scheduler configurations in your workflow spec
- Assign each job a specific scheduler_id
- Runners provide their scheduler_config_id when requesting jobs
- Server only returns jobs matching that scheduler ID
Example workflow specification:
slurm_schedulers:
- name: gpu_cluster
partition: gpu
account: myproject
- name: highmem_cluster
partition: highmem
account: myproject
jobs:
- name: model_training
command: python train.py
resource_requirements: large
slurm_scheduler: gpu_cluster # Binds to specific scheduler
- name: large_analysis
command: ./analyze.sh
resource_requirements: highmem
slurm_scheduler: highmem_cluster
Example runner invocation:
# GPU runner - only pulls jobs assigned to gpu_cluster
torc-slurm-job-runner $WORKFLOW_ID \
--scheduler-config-id 1 \
--num-cpus 32 \
--num-gpus 8
# High-memory runner - only pulls jobs assigned to highmem_cluster
torc-slurm-job-runner $WORKFLOW_ID \
--scheduler-config-id 2 \
--num-cpus 64 \
--memory-gb 512
Tradeoffs:
✅ Advantages:
- Zero ambiguity - jobs always run on intended schedulers
- Predictable job placement
- Prevents GPU jobs from landing on CPU-only nodes
- Clear workflow specification - explicit job→scheduler mapping
- Better for heterogeneous clusters (GPU vs CPU vs high-memory)
❌ Disadvantages:
- Less flexibility - idle runners can't help other queues
- Potential resource underutilization - GPU runner sits idle while CPU queue is full
- More complex workflow specifications
- If a scheduler fails, its jobs remain stuck until that scheduler returns
When to use:
- Highly heterogeneous compute resources (GPU clusters, high-memory nodes, specialized hardware)
- Production workflows requiring predictable job placement
- Multi-cluster environments
- When job-resource matching is critical (e.g., GPU-only codes, specific hardware requirements)
- Slurm or HPC scheduler integrations
Choosing Between Sort Method and Scheduler ID
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| All jobs can run anywhere | Sort method | Maximum flexibility, simplest spec |
| Some jobs need GPUs, some don't | Scheduler ID | Prevent GPU waste on CPU jobs |
| Multi-cluster Slurm environment | Scheduler ID | Jobs must target correct clusters |
| Development/testing | Sort method | Easier to experiment |
| Production with SLAs | Scheduler ID | Predictable resource usage |
| Homogeneous compute nodes | Sort method | No benefit to restricting |
| Specialized hardware (GPUs, high-memory, FPGAs) | Scheduler ID | Match jobs to capabilities |
You can also mix approaches: Use scheduler_id for jobs with strict requirements, leave it NULL
for flexible jobs.
Use Case 2: Queue-Depth Parallelism
This strategy is ideal for workflows with homogeneous resource requirements where you simply want to control the level of parallelism.
How It Works
Instead of tracking resources, you specify a maximum number of concurrent jobs:
torc run $WORKFLOW_ID \
--max-parallel-jobs 10 \
--output-dir ./results
or with Slurm:
torc slurm schedule-nodes $WORKFLOW_ID \
--scheduler-config-id 1 \
--num-hpc-jobs 4 \
--max-parallel-jobs 8
Server behavior:
The GET /workflows/{id}/claim_next_jobs endpoint:
- Accepts a limit parameter specifying the maximum number of jobs to return
- Ignores all resource requirements
- Returns the next N ready jobs from the queue
- Updates their status from ready to pending
Runner behavior:
- Maintains a count of running jobs
- When the count falls below max_parallel_jobs, requests more work
- Does NOT track CPU, memory, GPU, or other resources
- Simply enforces the concurrency limit
Ignoring Resource Consumption
This is a critical distinction: when using --max-parallel-jobs, the runner completely ignores
current resource consumption.
Normal resource-aware mode:
Runner has: 32 CPUs, 128 GB memory
Job A needs: 16 CPUs, 64 GB
Job B needs: 16 CPUs, 64 GB
Job C needs: 16 CPUs, 64 GB
Runner starts Job A and Job B (resources fully allocated)
Job C waits until resources free up
Queue-depth mode with --max-parallel-jobs 3:
Runner has: 32 CPUs, 128 GB memory (IGNORED)
Job A needs: 16 CPUs, 64 GB (IGNORED)
Job B needs: 16 CPUs, 64 GB (IGNORED)
Job C needs: 16 CPUs, 64 GB (IGNORED)
Runner starts Job A, Job B, and Job C simultaneously
Total requested: 48 CPUs, 192 GB (exceeds node capacity!)
System may: swap, OOM, or throttle performance
When to Use Queue-Depth Parallelism
✅ Use queue-depth parallelism when:
- All jobs have similar resource requirements
  # All jobs use ~4 CPUs, ~8GB memory
  jobs:
    - name: process_file_1
      command: ./process.sh file1.txt
    - name: process_file_2
      command: ./process.sh file2.txt
    # ... 100 similar jobs
- Resource requirements are negligible compared to node capacity
- Running 100 lightweight Python scripts on a 64-core machine
- I/O-bound jobs that don't consume much CPU/memory
- Jobs are I/O-bound or sleep frequently
- Data download jobs
- Jobs waiting on external services
- Polling or monitoring tasks
- You want simplicity over precision
- Quick prototypes
- Testing workflows
- Simple task queues
- Jobs self-limit their resource usage
- Application has built-in thread pools
- Container resource limits
- OS-level cgroups or resource controls
❌ Avoid queue-depth parallelism when:
- Jobs have heterogeneous resource requirements
- Mix of 2-CPU and 32-CPU jobs
- Some jobs need 4GB, others need 128GB
- Resource contention causes failures
- Out-of-memory errors
- CPU thrashing
- GPU memory exhaustion
- You need efficient bin-packing
- Maximizing node utilization
- Complex resource constraints
- Jobs are compute-intensive
- CPU-bound numerical simulations
- Large matrix operations
- Video encoding
Queue-Depth Parallelism in Practice
Example 1: Slurm with Queue Depth
# Schedule 4 Slurm nodes, each running up to 8 concurrent jobs
torc slurm schedule-nodes $WORKFLOW_ID \
--scheduler-config-id 1 \
--num-hpc-jobs 4 \
--max-parallel-jobs 8
This creates 4 Slurm job allocations. Each allocation runs a worker that:
- Pulls up to 8 jobs at a time
- Runs them concurrently
- Requests more when any job completes
Total concurrency: up to 32 jobs (4 nodes × 8 jobs/node)
Example 2: Local Runner with Queue Depth
# Run up to 20 jobs concurrently on local machine
torc run $WORKFLOW_ID \
--max-parallel-jobs 20 \
--output-dir ./output
Example 3: Mixed Approach
You can even run multiple runners with different strategies:
# Terminal 1: Resource-aware runner for large jobs
torc run $WORKFLOW_ID \
--num-cpus 32 \
--memory-gb 256
# Terminal 2: Queue-depth runner for small jobs
torc run $WORKFLOW_ID \
--max-parallel-jobs 50
The ready queue serves both runners. The resource-aware runner gets large jobs that fit its capacity, while the queue-depth runner gets small jobs for fast parallel execution.
Performance Characteristics
Resource-aware allocation:
- Query complexity: O(jobs in ready queue)
- Requires computing resource sums
- Slightly slower due to filtering and sorting
- Better resource utilization
Queue-depth allocation:
- Query complexity: O(1) with limit
- Simple LIMIT clause, no resource computation
- Faster queries
- Simpler logic
For workflows with thousands of ready jobs, queue-depth allocation has lower overhead.
Best Practices
- Start with resource-aware allocation for new workflows
- Better default behavior
- Prevents resource over-subscription
- Easier to debug resource issues
- Use scheduler_id for production multi-cluster workflows
- Explicit job placement
- Predictable resource usage
- Better for heterogeneous resources
- Use sort_method for flexible single-cluster workflows
- Simpler specifications
- Better resource utilization
- Good for homogeneous resources
- Use queue-depth parallelism for homogeneous task queues
- Many similar jobs
- I/O-bound workloads
- When simplicity matters more than precision
- Monitor resource usage when switching strategies
- Check for over-subscription
- Verify expected parallelism
- Look for resource contention
- Test with small workflows first
- Validate job allocation behavior
- Check resource accounting
- Ensure jobs run on intended schedulers
Summary
| Strategy | Use When | Allocation Method | Resource Tracking |
|---|---|---|---|
| Resource-aware + sort_method | Heterogeneous jobs, flexible allocation | Server filters by resources | Yes |
| Resource-aware + scheduler_id | Heterogeneous jobs, strict allocation | Server filters by resources AND scheduler | Yes |
| Queue-depth | Homogeneous jobs, simple parallelism | Server returns next N jobs | No |
Choose the strategy that best matches your workflow characteristics and execution environment. You can even mix strategies across different runners for maximum flexibility.
Workflow Reinitialization
When you modify input files or configuration after a workflow has run, you need a way to re-execute only the affected jobs. Reinitialization handles this by detecting what changed and marking the appropriate jobs for re-execution.
When to Use Reinitialization
Use torc workflows reinitialize when:
- Input files changed — You modified an input file and want dependent jobs to rerun
- Configuration updated — You changed user_data parameters
- Output files missing — Output files were deleted and need regeneration
- Job definition changed — You modified a job's command or other attributes
- Iterative development — You're refining a workflow and need quick iteration
Basic Usage
# Preview what would change (recommended first step)
torc workflows reinitialize <workflow_id> --dry-run
# Reinitialize the workflow
torc workflows reinitialize <workflow_id>
# Force reinitialization even with warnings
torc workflows reinitialize <workflow_id> --force
How Change Detection Works
Reinitialization detects changes through three mechanisms:
1. File Modification Times
For files tracked in the workflow, Torc compares the current st_mtime (modification time) against
the stored value. If a file was modified since the last run, jobs that use it as input are marked
for re-execution.
# Modify an input file
echo "new data" > input.json
# Reinitialize detects the change
torc workflows reinitialize <workflow_id>
# Output: Reset 3 jobs due to changed inputs
2. Job Attribute and User Data Hashing
Torc computes SHA256 hashes of critical job attributes (such as the command) and user_data input
values. If any hash differs from the stored value, the job is marked for re-execution. This detects
changes like modified commands, updated scripts, or changed configuration parameters.
3. Missing Output Files
If a job's output file no longer exists on disk, the job is marked for re-execution regardless of whether inputs changed.
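For example, deleting a previously generated output file and previewing the reinitialization shows which jobs would be reset (the path below is illustrative):
# Remove an output file produced by an earlier run
rm results/chunk_3.json
# The job that owns it, plus downstream jobs, would be marked for re-execution
torc workflows reinitialize <workflow-id> --dry-run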
The Reinitialization Process
When you run reinitialize, Torc performs these steps:
- Bump run_id — Increments the workflow's run counter for tracking
- Reset workflow status — Clears the previous run's completion state
- Check file modifications — Compares current st_mtime values to stored values
- Check missing outputs — Identifies jobs whose output files no longer exist
- Check user_data changes — Computes and compares input hashes
- Mark affected jobs — Sets jobs needing re-execution to uninitialized
- Re-evaluate dependencies — Runs initialize_jobs to set jobs to ready or blocked
Dependency Propagation
When a job is marked for re-execution, all downstream jobs that depend on its outputs are also marked. This ensures the entire dependency chain is re-executed:
preprocess (input changed) → marked for rerun
↓
process (depends on preprocess output) → also marked
↓
postprocess (depends on process output) → also marked
Dry Run Mode
Always use --dry-run first to preview changes without modifying anything:
torc workflows reinitialize <workflow_id> --dry-run
Example output:
Dry run: 5 jobs would be reset due to changed inputs
- preprocess
- analyze_batch_1
- analyze_batch_2
- merge_results
- generate_report
Retrying Failed Jobs
Important: Reinitialization does not automatically retry failed jobs. To retry failed jobs, use
reset-status:
# Reset failed jobs to ready status, then reinitialize to check for other changes
torc workflows reset-status <workflow_id> --failed-only --reinitialize
# Or just reset failed jobs without reinitialization
torc workflows reset-status <workflow_id> --failed-only
Comparison with Full Reset
| Scenario | Use reinitialize | Use reset-status |
|---|---|---|
| Input file changed | Yes | No |
| Job command changed | Yes | No |
| Want to rerun everything | No | Yes |
| Retry failed jobs only | No | Yes (--failed-only) |
| Iterative development | Yes | Depends |
| Changed workflow spec | Create new workflow | Create new workflow |
Working with Workflows
This section covers how to create, configure, and manage workflows.
- Creating Workflows - Getting started with workflow creation
- Workflow Specification Formats - JSON, YAML, and other formats
- Visualizing Workflow Structure - Viewing workflow graphs
- Exporting and Importing Workflows - Moving workflows between systems
- Archiving Workflows - Long-term workflow storage
How to Create Workflows
This guide shows different methods for creating Torc workflows, from the most common (specification files) to more advanced approaches (CLI, API).
Using Workflow Specification Files (Recommended)
The easiest way to create workflows is with specification files. Torc supports YAML, JSON5, and KDL formats.
Create from a YAML File
torc workflows create workflow.yaml
Create from JSON5 or KDL
torc workflows create workflow.json5
torc workflows create workflow.kdl
Torc detects the format from the file extension.
Create and Run in One Step
For quick iteration, combine creation and execution:
# Create and run locally
torc run workflow.yaml
# Create and submit to Slurm
torc submit workflow.yaml
For format syntax and examples, see the Workflow Specification Formats guide. For a complete reference of all fields, see the Workflow Specification Reference.
Using the CLI (Step by Step)
For programmatic workflow construction or when you need fine-grained control, create workflows piece by piece using the CLI.
Step 1: Create an Empty Workflow
torc workflows new \
--name "my_workflow" \
--description "My test workflow"
Output:
Successfully created workflow:
ID: 1
Name: my_workflow
User: dthom
Description: My test workflow
Note the workflow ID (1) for subsequent commands.
Step 2: Add Resource Requirements
torc resource-requirements create \
--name "small" \
--num-cpus 1 \
--memory "1g" \
--runtime "PT10M" \
1 # workflow ID
Output:
Successfully created resource requirements:
ID: 2
Workflow ID: 1
Name: small
Step 3: Add Files (Optional)
torc files create \
--name "input_file" \
--path "/data/input.txt" \
1 # workflow ID
Step 4: Add Jobs
torc jobs create \
--name "process_data" \
--command "python process.py" \
--resource-requirements-id 2 \
--input-file-ids 1 \
1 # workflow ID
Step 5: Initialize and Run
# Initialize the workflow (resolves dependencies)
torc workflows initialize-jobs 1
# Run the workflow
torc run 1
Using the Python API
For complex programmatic workflow construction, use the Python client:
from torc import make_api
from torc.openapi_client import (
WorkflowModel,
JobModel,
ResourceRequirementsModel,
)
# Connect to the server
api = make_api("http://localhost:8080/torc-service/v1")
# Create workflow
workflow = api.create_workflow(WorkflowModel(
name="my_workflow",
user="myuser",
description="Programmatically created workflow",
))
# Add resource requirements
rr = api.create_resource_requirements(ResourceRequirementsModel(
workflow_id=workflow.id,
name="small",
num_cpus=1,
memory="1g",
runtime="PT10M",
))
# Add jobs
api.create_job(JobModel(
workflow_id=workflow.id,
name="job1",
command="echo 'Hello World'",
resource_requirements_id=rr.id,
))
print(f"Created workflow {workflow.id}")
For more details, see the Map Python Functions tutorial.
Using the Julia API
The Julia client provides similar functionality for programmatic workflow construction:
using Torc
import Torc: APIClient
# Connect to the server
api = make_api("http://localhost:8080/torc-service/v1")
# Create workflow
workflow = send_api_command(
api,
APIClient.create_workflow,
APIClient.WorkflowModel(;
name = "my_workflow",
user = get_user(),
description = "Programmatically created workflow",
),
)
# Add resource requirements
rr = send_api_command(
api,
APIClient.create_resource_requirements,
APIClient.ResourceRequirementsModel(;
workflow_id = workflow.id,
name = "small",
num_cpus = 1,
memory = "1g",
runtime = "PT10M",
),
)
# Add jobs
send_api_command(
api,
APIClient.create_job,
APIClient.JobModel(;
workflow_id = workflow.id,
name = "job1",
command = "echo 'Hello World'",
resource_requirements_id = rr.id,
),
)
println("Created workflow $(workflow.id)")
The Julia client also supports map_function_to_jobs for mapping a function across parameters,
similar to the Python client.
Choosing a Method
| Method | Best For |
|---|---|
| Specification files | Most workflows; declarative, version-controllable |
| CLI step-by-step | Scripted workflows, testing individual components |
| Python API | Complex dynamic workflows, integration with Python pipelines |
| Julia API | Complex dynamic workflows, integration with Julia pipelines |
Common Tasks
Validate a Workflow File Without Creating
Use --dry-run to validate a workflow specification without creating it on the server:
torc workflows create --dry-run workflow.yaml
Example output:
Workflow Validation Results
===========================
Workflow: my_workflow
Description: A sample workflow
Components to be created:
Jobs: 100 (expanded from 1 parameterized job specs)
Files: 5
User data records: 2
Resource requirements: 2
Slurm schedulers: 2
Workflow actions: 3
Submission: Ready for scheduler submission (has on_workflow_start schedule_nodes action)
Validation: PASSED
For programmatic use (e.g., in scripts or the dashboard), get JSON output:
torc -f json workflows create --dry-run workflow.yaml
What Validation Checks
The dry-run performs comprehensive validation:
Structural Checks:
- Valid file format (YAML, JSON5, KDL, or JSON)
- Required fields present
- Parameter expansion (shows expanded job count vs. original spec count)
Reference Validation:
- `depends_on` references existing jobs
- `depends_on_regexes` patterns are valid and match at least one job
- `resource_requirements` references exist
- `scheduler` references exist
- `input_files` and `output_files` reference defined files
- `input_user_data` and `output_user_data` reference defined user data
- All regex patterns (`*_regexes` fields) are valid
Duplicate Detection:
- Duplicate job names
- Duplicate file names
- Duplicate user data names
- Duplicate resource requirement names
- Duplicate scheduler names
Dependency Analysis:
- Circular dependency detection (reports all jobs in the cycle)
Action Validation:
- Actions reference existing jobs and schedulers
- `schedule_nodes` actions have required `scheduler` and `scheduler_type`
Scheduler Configuration:
- Slurm scheduler node requirements are valid
- Warns about heterogeneous schedulers without `jobs_sort_method` (see below)
Heterogeneous Scheduler Warning
When you have multiple Slurm schedulers with different resource profiles (memory, GPUs, walltime, partition) and jobs without explicit scheduler assignments, the validation warns about potential suboptimal job-to-node matching:
Warnings (1):
- Workflow has 3 schedulers with different memory (mem), walltime but 10 job(s)
have no explicit scheduler assignment and jobs_sort_method is not set. The
default sort method 'gpus_runtime_memory' will be used (jobs sorted by GPUs,
then runtime, then memory). If this doesn't match your workload, consider
setting jobs_sort_method explicitly to 'gpus_memory_runtime' (prioritize
memory over runtime) or 'none' (no sorting).
This warning helps you avoid situations where:
- Long-walltime nodes pull short-runtime jobs
- High-memory nodes pull low-memory jobs
- GPU nodes pull non-GPU jobs
Solutions:
- Set `jobs_sort_method` explicitly in your workflow spec (see the sketch below)
- Assign jobs to specific schedulers using the `scheduler` field on each job
- Accept the default `gpus_runtime_memory` sorting if it matches your workload
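As a rough sketch, the first two options might look like the following. The placement of `jobs_sort_method` on the scheduler entry and the specific field values are illustrative assumptions; consult the Slurm Workflows guide and the Workflow Specification Reference for the authoritative field names and locations.
# Sketch only: jobs_sort_method placement and values are assumptions
slurm_schedulers:
  - name: gpu_sched
    account: myproject
    jobs_sort_method: gpus_memory_runtime   # prioritize memory over runtime
jobs:
  - name: train_model
    command: python train.py
    scheduler: gpu_sched   # explicit assignment avoids mismatched job-to-node pulls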
Bypassing Validation
To create a workflow despite validation warnings:
torc workflows create --skip-checks workflow.yaml
Note: This bypasses scheduler node validation checks (which are treated as errors), but does not bypass all errors. Errors such as missing references or circular dependencies will always prevent creation.
List Available Workflows
torc workflows list
Delete a Workflow
torc workflows delete <workflow_id>
View Workflow Details
torc workflows get <workflow_id>
Defining File Dependencies
Jobs often need to read input files and produce output files. Torc can automatically infer job dependencies from these file relationships using variable substitution:
files:
- name: raw_data
path: /data/raw.csv
- name: processed_data
path: /data/processed.csv
jobs:
- name: preprocess
command: "python preprocess.py -o ${files.output.raw_data}"
- name: analyze
command: "python analyze.py -i ${files.input.raw_data} -o ${files.output.processed_data}"
Key concepts:
- `${files.input.NAME}` - References a file this job reads (creates a dependency on the job that outputs it)
- `${files.output.NAME}` - References a file this job writes (satisfies dependencies for downstream jobs)
In the example above, analyze automatically depends on preprocess because it needs raw_data as
input, which preprocess produces as output.
For a complete walkthrough, see Tutorial: Diamond Workflow.
Next Steps
- Tutorial: Diamond Workflow - Learn file-based dependencies with the fan-out/fan-in pattern
- Workflow Specification Formats - Detailed format reference
- Workflow Specification Reference - Complete field reference for all data models
- Job Parameterization - Generate multiple jobs from templates
- Tutorial: Many Independent Jobs - Your first workflow
Workflow Specification Formats
Torc supports three workflow specification formats: YAML, JSON5, and KDL. All formats provide the same functionality with different syntaxes to suit different preferences and use cases.
Format Overview
| Feature | YAML | JSON5 | KDL |
|---|---|---|---|
| Parameter Expansion | ✓ | ✓ | ✓ |
| Comments | ✓ | ✓ | ✓ |
| Trailing Commas | ✗ | ✓ | N/A |
| Human-Readable | ✓✓✓ | ✓✓ | ✓✓✓ |
| Programmatic Generation | ✓✓ | ✓✓✓ | ✓ |
| Industry Standard | ✓✓✓ | ✓✓ | ✓ |
| Jobs, Files, Resources | ✓ | ✓ | ✓ |
| User Data | ✓ | ✓ | ✓ |
| Workflow Actions | ✓ | ✓ | ✓ |
| Resource Monitoring | ✓ | ✓ | ✓ |
| Slurm Schedulers | ✓ | ✓ | ✓ |
YAML Format
Best for: Most workflows, especially those using multi-line commands.
File Extension: .yaml or .yml
Example:
name: data_processing_workflow
user: datauser
description: Multi-stage data processing pipeline
# File definitions
files:
- name: raw_data
path: /data/input/raw_data.csv
- name: processed_data
path: /data/output/processed_data.csv
# Resource requirements
resource_requirements:
- name: small_job
num_cpus: 2
num_gpus: 0
num_nodes: 1
memory: 4g
runtime: PT30M
# Jobs
jobs:
- name: download_data
command: wget https://example.com/data.csv -O ${files.output.raw_data}
resource_requirements: small_job
- name: process_data
command: python process.py ${files.input.raw_data} -o ${files.output.processed_data}
resource_requirements: small_job
depends_on:
- download_data
# Workflow actions
actions:
- trigger_type: on_workflow_start
action_type: run_commands
commands:
- mkdir -p /data/input /data/output
- echo "Workflow started"
Advantages:
- Most widely used configuration format
- Excellent for complex workflows with many jobs
- Clean, readable syntax without brackets
Disadvantages:
- Indentation-sensitive
- Can be verbose for deeply nested structures
JSON5 Format
Best for: Programmatic workflow generation and JSON compatibility.
File Extension: .json5
Example:
{
name: "data_processing_workflow",
user: "datauser",
description: "Multi-stage data processing pipeline",
// File definitions
files: [
{name: "raw_data", path: "/data/input/raw_data.csv"},
{name: "processed_data", path: "/data/output/processed_data.csv"},
],
// Resource requirements
resource_requirements: [
{
name: "small_job",
num_cpus: 2,
num_gpus: 0,
num_nodes: 1,
memory: "4g",
runtime: "PT30M",
},
],
// Jobs
jobs: [
{
name: "download_data",
command: "wget https://example.com/data.csv -O ${files.output.raw_data}",
resource_requirements: "small_job",
},
{
name: "process_data",
command: "python process.py ${files.input.raw_data} -o ${files.output.processed_data}",
resource_requirements: "small_job",
depends_on: ["download_data"],
},
],
// Workflow actions
actions: [
{
trigger_type: "on_workflow_start",
action_type: "run_commands",
commands: [
"mkdir -p /data/input /data/output",
"echo 'Workflow started'",
],
},
],
}
Advantages:
- JSON-compatible (easy programmatic manipulation)
- Supports comments and trailing commas
- Familiar to JavaScript/JSON users
Disadvantages:
- More verbose than YAML
- More brackets and commas than YAML
KDL Format
Best for: Simple to moderate workflows with clean syntax.
File Extension: .kdl
Example:
name "data_processing_workflow"
user "datauser"
description "Multi-stage data processing pipeline"
// File definitions
file "raw_data" path="/data/input/raw_data.csv"
file "processed_data" path="/data/output/processed_data.csv"
// Resource requirements
resource_requirements "small_job" {
num_cpus 2
num_gpus 0
num_nodes 1
memory "4g"
runtime "PT30M"
}
// Jobs
job "download_data" {
command "wget https://example.com/data.csv -O ${files.output.raw_data}"
resource_requirements "small_job"
}
job "process_data" {
command "python process.py ${files.input.raw_data} -o ${files.output.processed_data}"
resource_requirements "small_job"
depends_on_job "download_data"
}
// Workflow actions
action {
trigger_type "on_workflow_start"
action_type "run_commands"
command "mkdir -p /data/input /data/output"
command "echo 'Workflow started'"
}
Advantages:
- Clean, minimal syntax
- No indentation requirements
- Supports all core Torc features
Disadvantages:
- Less familiar to most users
- Boolean values use special syntax (`#true`, `#false`)
KDL-Specific Syntax Notes
- Boolean values: Use `#true` and `#false` (not `true` or `false`):
  resource_monitor {
      enabled #true
      generate_plots #false
  }
- Repeated child nodes: Use multiple statements:
  action {
      command "echo 'First command'"
      command "echo 'Second command'"
  }
- User data: Requires child nodes for properties:
  user_data "metadata" {
      is_ephemeral #true
      data "{\"key\": \"value\"}"
  }
Common Features Across All Formats
Variable Substitution
All formats support the same variable substitution syntax:
- `${files.input.NAME}` - Input file path
- `${files.output.NAME}` - Output file path
- `${user_data.input.NAME}` - Input user data
- `${user_data.output.NAME}` - Output user data
Supported Fields
All formats support:
- Workflow metadata: name, user, description
- Jobs: name, command, dependencies, resource requirements
- Files: name, path, modification time
- User data: name, data (JSON), ephemeral flag
- Resource requirements: CPUs, GPUs, memory, runtime
- Slurm schedulers: account, partition, walltime, etc.
- Workflow actions: triggers, action types, commands
- Resource monitoring: enabled, granularity, sampling interval
Examples Directory
The Torc repository includes comprehensive examples in all three formats:
examples/
├── yaml/ # All workflows (15 examples)
├── json/ # All workflows (15 examples)
└── kdl/ # Non-parameterized workflows (9 examples)
Compare the same workflow in different formats to choose your preference; see the examples directory for the complete collection.
Creating Workflows
All formats use the same command:
torc workflows create examples/yaml/sample_workflow.yaml
torc workflows create examples/json/sample_workflow.json5
torc workflows create examples/kdl/sample_workflow.kdl
Or use the quick execution commands:
# Create and run locally
torc run examples/yaml/sample_workflow.yaml
# Create and submit to scheduler
torc submit examples/yaml/workflow_actions_data_pipeline.yaml
Recommendations
Start with YAML if you're unsure.
Switch to JSON5 if you need to programmatically generate workflows or prefer JSON syntax.
Try KDL if you prefer minimal syntax.
All three formats are fully supported and maintained. Choose based on your workflow complexity and personal preference.
Visualizing Workflow Structure
Understanding how your workflow will execute—which jobs run in parallel, how dependencies create stages, and when Slurm allocations are requested—is essential for debugging and optimization. Torc provides several tools for visualizing workflow structure.
Execution Plan Command
The torc workflows execution-plan command analyzes a workflow and displays its execution stages,
showing how jobs are grouped and when schedulers allocate resources.
Basic Usage
# From a specification file
torc workflows execution-plan workflow.yaml
# From an existing workflow
torc workflows execution-plan <workflow_id>
Example Output
For a workflow with two independent processing pipelines that merge at the end:
Workflow: two_subgraph_pipeline
Total Jobs: 15
▶ Stage 1: Workflow Start
Scheduler Allocations:
• prep_sched (slurm) - 1 allocation(s)
Jobs Becoming Ready:
• prep_a
• prep_b
→ Stage 2: When jobs 'prep_a', 'prep_b' complete
Scheduler Allocations:
• work_a_sched (slurm) - 1 allocation(s)
• work_b_sched (slurm) - 1 allocation(s)
Jobs Becoming Ready:
• work_a_{1..5}
• work_b_{1..5}
→ Stage 3: When 10 jobs complete
Scheduler Allocations:
• post_a_sched (slurm) - 1 allocation(s)
• post_b_sched (slurm) - 1 allocation(s)
Jobs Becoming Ready:
• post_a
• post_b
→ Stage 4: When jobs 'post_a', 'post_b' complete
Scheduler Allocations:
• final_sched (slurm) - 1 allocation(s)
Jobs Becoming Ready:
• final
Total Stages: 4
What the Execution Plan Shows
- Stages: Groups of jobs that become ready at the same time based on dependency resolution
- Scheduler Allocations: Which Slurm schedulers request resources at each stage (for workflows with Slurm configuration)
- Jobs Becoming Ready: Which jobs transition to "ready" status at each stage
- Subgraphs: Independent branches of the workflow that can execute in parallel
Workflows Without Slurm Schedulers
For workflows without pre-defined Slurm schedulers, the execution plan shows the job stages without scheduler information:
torc workflows execution-plan workflow_no_slurm.yaml
Workflow: my_pipeline
Total Jobs: 10
▶ Stage 1: Workflow Start
Jobs Becoming Ready:
• preprocess
→ Stage 2: When job 'preprocess' completes
Jobs Becoming Ready:
• work_{1..5}
→ Stage 3: When 5 jobs complete
Jobs Becoming Ready:
• postprocess
Total Stages: 3
This helps you understand the workflow topology before adding Slurm configuration with
torc slurm generate.
Use Cases
- Validate workflow structure: Ensure dependencies create the expected execution order
- Identify parallelism: See which jobs can run concurrently
- Debug slow workflows: Find stages that serialize unnecessarily
- Plan Slurm allocations: Understand when resources will be requested
- Verify auto-generated schedulers: Check that `torc slurm generate` created appropriate staging
DAG Visualization in the Dashboard
The web dashboard provides interactive DAG (Directed Acyclic Graph) visualization.
Viewing the DAG
- Navigate to the Details tab
- Select a workflow
- Click View DAG in the Visualization section
DAG Types
The dashboard supports three DAG visualization types:
| Type | Description |
|---|---|
| Job Dependencies | Shows explicit and implicit dependencies between jobs |
| Job-File Relations | Shows how jobs connect through input/output files |
| Job-UserData Relations | Shows how jobs connect through user data |
DAG Features
- Color-coded nodes: Jobs are colored by status (ready, running, completed, failed, etc.)
- Interactive: Zoom, pan, and click nodes for details
- Layout: Automatic hierarchical layout using Dagre algorithm
- Legend: Status color reference
TUI DAG View
The terminal UI (torc tui) also includes DAG visualization:
- Select a workflow
- Press `d` to toggle the DAG view
- Use arrow keys to navigate
Comparing Visualization Tools
| Tool | Best For |
|---|---|
| `execution-plan` | Understanding execution stages, Slurm allocation timing |
| Dashboard DAG | Interactive exploration, status monitoring |
| TUI DAG | Quick terminal-based visualization |
Example: Analyzing a Complex Workflow
Consider a workflow with preprocessing, parallel work, and aggregation:
# First, view the execution plan
torc workflows execution-plan examples/subgraphs/subgraphs_workflow.yaml
# If no schedulers, generate them
torc slurm generate --account myproject examples/subgraphs/subgraphs_workflow_no_slurm.yaml
# View the plan again to see scheduler allocations
torc workflows execution-plan examples/subgraphs/subgraphs_workflow.yaml
The execution plan helps you verify that:
- Independent subgraphs are correctly identified
- Stages align with your expected execution order
- Slurm allocations are timed appropriately
See Also
- Web Dashboard — Full dashboard documentation
- Slurm Workflows — Understanding Slurm integration
- Workflow Actions — How actions trigger scheduler allocations
- Subgraphs Example — Complete example with multiple subgraphs
How to Export and Import Workflows
This guide shows how to export workflows to portable JSON files and import them into the same or different Torc servers. This is useful for:
- Backup and restore: Save workflow definitions for disaster recovery
- Migration: Move workflows between development, staging, and production environments
- Sharing: Share workflow templates with teammates or the community
- Duplication: Create copies of workflows for testing or experimentation
Exporting Workflows
Basic Export
Export a workflow to a JSON file:
torc workflows export 123 --output my_workflow.json
This creates a self-contained JSON document containing:
- Workflow metadata
- All jobs with their dependencies
- Files and user data
- Resource requirements
- Slurm and local schedulers
- Workflow actions
Export with Results
Include job results (stdout, stderr, return codes) in the export:
torc workflows export 123 --output my_workflow.json --include-results
Export with Events
Include workflow events (job status changes, scheduler events):
torc workflows export 123 --output my_workflow.json --include-events
Export with Everything
Include both results and events:
torc workflows export 123 --output my_workflow.json --include-results --include-events
Export to Stdout
Omit --output to write to stdout (useful for piping):
torc workflows export 123 > my_workflow.json
JSON Output Format
Use --format json for machine-readable output with export statistics:
torc workflows export 123 --output my_workflow.json --format json
Output:
{
"success": true,
"workflow_id": 123,
"workflow_name": "my_workflow",
"output_file": "my_workflow.json",
"jobs": 5,
"files": 3,
"user_data": 2,
"results": 0,
"events": 0
}
Importing Workflows
Basic Import
Import a workflow from a JSON file:
torc workflows import my_workflow.json
Output:
Successfully imported workflow:
Workflow ID: 456
Name: my_workflow
Jobs: 5
Files: 3
User data: 2
Import with Custom Name
Override the workflow name during import:
torc workflows import my_workflow.json --name "new_workflow_name"
Skip Results During Import
If the export includes results but you don't want to import them:
torc workflows import my_workflow.json --skip-results
Skip Events During Import
If the export includes events but you don't want to import them:
torc workflows import my_workflow.json --skip-events
JSON Output Format
Use --format json for machine-readable output:
torc workflows import my_workflow.json --format json
Output:
{
"success": true,
"workflow_id": 456,
"workflow_name": "my_workflow",
"jobs": 5,
"files": 3,
"user_data": 2
}
How Import Works
ID Remapping
When importing, all entity IDs are remapped to new IDs assigned by the target server. This ensures no conflicts with existing workflows. Cross-references between entities (e.g., job dependencies on files) are automatically updated to use the new IDs.
Job Status Reset
Imported jobs always start in the uninitialized status, regardless of their status in the exported
file. After import, you need to initialize and run the workflow:
# Initialize the imported workflow
torc workflows initialize 456
# Run locally
torc workflows run 456
# Or submit to scheduler
torc workflows submit 456
Default Resource Requirements
Each workflow automatically gets a "default" resource requirements entry. During import, the exported "default" resource requirements are mapped to the new workflow's default entry.
Export Format
The export format is a versioned JSON document. Here's the structure:
{
"export_version": "1.0",
"exported_at": "2024-01-15T10:30:00Z",
"workflow": { ... },
"files": [ ... ],
"user_data": [ ... ],
"resource_requirements": [ ... ],
"slurm_schedulers": [ ... ],
"local_schedulers": [ ... ],
"jobs": [ ... ],
"workflow_actions": [ ... ],
"results": [ ... ],
"events": [ ... ]
}
The results and events fields are only present when --include-results or --include-events
are specified.
Common Workflows
Backup All Active Workflows
for id in $(torc workflows list --format json | jq -r '.items[].id'); do
torc workflows export $id --output "backup_workflow_${id}.json"
done
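To restore from these backups later (for example, on a rebuilt server), a matching import loop is enough. The glob below simply mirrors the file naming used in the backup loop above:
# Re-import every backup file; IDs are remapped on the target server
for f in backup_workflow_*.json; do
    torc workflows import "$f"
done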
Migrate to Another Server
# On source server
torc workflows export 123 --output workflow.json
# On target server (different TORC_API_URL)
export TORC_API_URL="http://new-server:8080/torc-service/v1"
torc workflows import workflow.json
Clone a Workflow for Testing
# Export existing workflow
torc workflows export 123 --output original.json
# Import as a new workflow with different name
torc workflows import original.json --name "test_copy"
Troubleshooting
Import Fails with "File not found"
Ensure the export file exists and the path is correct:
ls -la my_workflow.json
torc workflows import ./my_workflow.json
Import Fails with API Error
Check that:
- The Torc server is running and accessible
- You have permission to create workflows
- The export file is valid JSON (not corrupted)
Validate the export file:
python -m json.tool my_workflow.json > /dev/null && echo "Valid JSON"
Jobs Not Running After Import
Imported jobs start in uninitialized status. You must initialize the workflow:
torc workflows initialize 456
Then check job status:
torc jobs list 456
Jobs should now show ready or blocked status depending on their dependencies.
Archiving Workflows
Workflow archiving provides a way to hide completed or inactive workflows from default list views while preserving all workflow data and execution history. Archived workflows remain fully accessible but don't clutter everyday workflow management operations.
Purpose and Motivation
As projects mature and accumulate workflows over time, the list of active workflows can become difficult to navigate. Archiving addresses this by:
- Reducing visual clutter - Completed workflows no longer appear in default list views
- Preserving historical data - All workflow data, jobs, results, and logs remain accessible
- Improving usability - Users can focus on active workflows without losing access to past work
- Maintaining audit trails - Archived workflows can be retrieved for analysis, debugging, or compliance
Archiving is particularly useful for:
- Completed experiments that may need future reference
- Successful production runs that serve as historical records
- Development workflows that are no longer active but contain valuable examples
- Workflows from completed projects that need to be retained for documentation
How It Works
When you archive a workflow, it's marked with an "archived" flag. This flag controls whether the workflow appears in default list views:
- Active workflows (not archived): Appear in standard `workflows list` commands
- Archived workflows: Hidden from default lists but accessible with the `--archived-only` flag
The archive status is just metadata - it doesn't affect the workflow's data, results, or any other functionality.
Archiving Workflows
Use the workflows archive command to archive or unarchive workflows:
# Archive a specific workflow
torc workflows archive true <workflow_id>
# Archive multiple workflows
torc workflows archive true 123 456 789
# Interactive selection (prompts user to choose)
torc workflows archive true
# With JSON output
torc --format json workflows archive true <workflow_id>
The command will output confirmation messages:
Successfully archived workflow 123
Successfully archived workflow 456
Successfully archived workflow 789
Unarchiving Workflows
To restore an archived workflow to active status, use the same command with false:
# Unarchive a specific workflow
torc workflows archive false <workflow_id>
# Unarchive multiple workflows
torc workflows archive false 123 456 789
# Interactive selection
torc workflows archive false
Output:
Successfully unarchived workflow 123
Viewing Workflows
Default Behavior
By default, the workflows list command shows only non-archived workflows:
# Shows active (non-archived) workflows only
torc workflows list
# Shows active workflows for a specific user
torc workflows list --user alice
Viewing Archived Workflows
Use the --archived-only flag to see archived workflows:
# List only archived workflows for current user
torc workflows list --archived-only
Viewing All Workflows
Use the --include-archived flag to see all workflows:
torc workflows list --include-archived
Accessing Specific Workflows
You can always access a workflow directly by its ID, regardless of archive status:
# Get details of any workflow (archived or not)
torc workflows get <workflow_id>
# Check workflow status
torc workflows status <workflow_id>
Impact on Workflow Operations
Operations Restricted on Archived Workflows
Certain workflow operations are not allowed on archived workflows to prevent accidental modifications:
- ❌ Status reset: Cannot use `workflows reset-status` on archived workflows
  - Error message: "Cannot reset archived workflow status. Unarchive the workflow first."
  - To reset status, unarchive the workflow first, then reset
Interactive Selection Behavior
When commands prompt for interactive workflow selection (when workflow ID is not specified), archived workflows are excluded by default:
# These will NOT show archived workflows in the interactive menu
torc-client workflows delete
torc-client workflows status
torc-client workflows initialize
This prevents accidentally operating on archived workflows while still allowing explicit access by ID.
Archive vs. Delete
Understanding when to archive versus delete workflows:
| Operation | Data Preserved | Reversible | Use Case |
|---|---|---|---|
| Archive | ✅ Yes | ✅ Yes | Completed workflows you may reference later |
| Delete | ❌ No | ❌ No | Failed experiments, test workflows, unwanted data |
Archive when:
- Workflow completed successfully and may need future reference
- Results should be preserved for reproducibility or compliance
- Workflow represents a milestone or important historical run
- You want to declutter lists but maintain data integrity
Delete when:
- Workflow failed and results are not useful
- Workflow was created for testing purposes only
- Data is no longer needed and storage space is a concern
- Workflow contains errors that would confuse future users
Common Use Cases
Completed Experiments
After completing an experiment and validating results:
# Archive the completed experiment
torc-client workflows archive true 123
# Later, if you need to reference it
torc-client workflows get 123
torc-client results list 123
Development Cleanup
Clean up development workflows while preserving examples:
# Delete test workflows
torc-client workflows delete 301 302 303
# Archive useful development examples
torc-client workflows archive true 304 305
Periodic Maintenance
Regularly archive old workflows to keep lists manageable:
# List workflows, identify completed ones
torc-client workflows list
# Archive workflows from completed projects
torc workflows archive true 401 402 403 404 405
Best Practices
When to Archive
- After successful completion - Archive workflows once they've completed successfully and been validated
- Project milestones - Archive workflows representing project phases or releases
- Regular cleanup - Establish periodic archiving of workflows older than a certain timeframe
- Before major changes - Archive working versions before making significant modifications
Summary
Workflow archiving provides a simple, reversible way to hide completed or inactive workflows from default views while preserving all data and functionality. It's designed for long-term workflow management in active projects where historical data is valuable but visual clutter is undesirable.
Key points:
- Archive workflows with: `torc workflows archive true <id>`
- Unarchive workflows with: `torc workflows archive false <id>`
- Archived workflows are hidden from default lists but remain fully functional
- View archived workflows with: `torc workflows list --archived-only`
- Archiving is reversible and does not affect data storage
- Use archiving for completed workflows; use deletion for unwanted data
How-Tos
Step-by-step guides for common tasks.
- Track Workflow Status - Monitoring workflow progress
- Cancel a Workflow - Stopping running workflows
- View Job Logs - Accessing job output
- Debug a Failed Job - Troubleshooting failures
- Check Resource Utilization - Monitoring resource usage
- View Resource Plots - Visualizing resource metrics
- Parameterize Jobs with Files - Using file-based parameters
How to Track Workflow Status
Monitor a running workflow's progress using the CLI, TUI, or dashboard.
Quick Status Check
torc reports summary <workflow_id>
Example output:
Workflow 42: data_pipeline
Jobs by Status:
Completed: 45
Running: 5
Ready: 10
Blocked: 40
Continuous Monitoring
Watch status update every 10 seconds:
watch -n 10 torc reports summary <workflow_id>
Interactive TUI
Launch the terminal UI for a visual dashboard:
torc tui
The TUI shows:
- Job status breakdown with progress bars
- Running job details
- Failed job information
- Real-time updates
List Individual Jobs
View job-level status:
# All jobs
torc jobs list <workflow_id>
# Filter by status
torc jobs list <workflow_id> --status running
torc jobs list <workflow_id> --status failed
Check Completion
Verify if a workflow has finished:
torc workflows is-complete <workflow_id>
For scripting:
if torc -f json workflows is-complete "$WORKFLOW_ID" | jq -e '.is_complete' > /dev/null; then
echo "Workflow complete"
fi
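Building on that check, a simple polling loop can block a script until the workflow finishes (the 30-second interval is just an example):
# Poll until the workflow reports completion
while ! torc -f json workflows is-complete "$WORKFLOW_ID" | jq -e '.is_complete' > /dev/null; do
    sleep 30
done
echo "Workflow $WORKFLOW_ID is complete"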
See Also
- Terminal UI (TUI) — Interactive monitoring
- Web Dashboard — Visual workflow management
- Workflow Reports — Generate summary reports
How to Cancel a Workflow
Stop a running workflow and terminate its jobs.
Cancel a Workflow
torc workflows cancel <workflow_id>
This:
- Marks the workflow as canceled
- Stops claiming new jobs
- Sends SIGKILL to all running processes
- Sends `scancel` to all active or pending Slurm allocations
Check Cancellation Status
Verify the workflow was canceled:
torc workflows status <workflow_id>
Or check completion status:
torc workflows is-complete <workflow_id>
Output:
Workflow 42 completion status:
Is Complete: true
Is Canceled: true
Restart After Cancellation
To resume a canceled workflow:
# Reinitialize to reset canceled jobs
torc workflows reinitialize <workflow_id>
# Run again locally
torc workflows run <workflow_id>
# Or submit to the scheduler
torc workflows submit <workflow_id>
Jobs that completed before cancellation remain completed.
See Also
- Track Workflow Status — Monitor workflow progress
- Workflow Reinitialization — Resume after issues
How to View Job Logs
Find and read the stdout/stderr output from job execution.
Find Log File Paths
torc reports results <workflow_id>
torc reports results <workflow_id> --job-id 15
Output includes:
{
"job_id": 15,
"job_name": "work_2",
"status": "Failed",
"run_id": 1,
"return_code": 137,
"completion_time": "2026-01-06T20:30:00.200Z",
"exec_time_minutes": 0.09313333333333332,
"compute_node_id": 47,
"job_stdout": "output/job_stdio/job_wf43_j15_r1_a1.o",
"job_stderr": "output/job_stdio/job_wf43_j15_r1_a1.e",
"compute_node_type": "slurm"
},
Read Logs Directly
Once you have the path, view the logs:
# View stdout
cat output/job_stdio/job_wf43_j15_r1_a1.o
# View stderr
cat output/job_stdio/job_wf43_j15_r1_a1.e
# Follow logs in real-time (for running jobs)
tail -f output/job_stdio/job_wf43_j15_r1_a1.*
Default Log Location
By default, logs are stored in the output directory:
output/
└── job_stdio/
    ├── job_wf<workflow_id>_j<job_id>_r<run_id>_a<attempt_id>.o
    └── job_wf<workflow_id>_j<job_id>_r<run_id>_a<attempt_id>.e
The output directory can be configured via the run/submit CLI options.
View Logs for Failed Jobs
Quickly find logs for failed jobs:
# Get failed job IDs
torc jobs list <workflow_id> --status failed
# Then view each job's logs
torc reports results <workflow_id> --job-id <failed_job_id>
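To collect stderr paths for every failed job in one pass, you can combine the two commands above. This is a sketch: it assumes the JSON output of `torc jobs list` exposes `.jobs[].id` and that `torc reports results` returns an array of records with the `job_stderr` field shown earlier; adjust the `jq` filters to the actual shapes on your installation.
# Assumptions: .jobs[].id and .[].job_stderr field names
for job_id in $(torc -f json jobs list "$WORKFLOW_ID" --status failed | jq -r '.jobs[].id'); do
    torc -f json reports results "$WORKFLOW_ID" --job-id "$job_id" | jq -r '.[].job_stderr'
done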
View Logs in TUI or Dashboard
You can also view job logs interactively:
- TUI — Run `torc tui` and select a job to view its stdout/stderr in the interface. See Terminal UI.
- Dashboard — The web dashboard displays job logs when you click on a job. See Web Dashboard.
See Also
- Working with Logs — Log configuration and management
- Debug a Failed Job — Full debugging workflow
How to Debug a Failed Job
Systematically diagnose why a job failed.
Step 1: Identify the Failed Job
torc jobs list <workflow_id> --status failed
Note the job ID and name.
Step 2: Check the Exit Code
torc results get <workflow_id> --job-id <job_id>
Common exit codes:
| Code | Meaning |
|---|---|
| 1 | General error |
| 2 | Misuse of shell command |
| 126 | Permission denied |
| 127 | Command not found |
| 137 | Killed (SIGKILL) — often OOM |
| 139 | Segmentation fault |
| 143 | Terminated (SIGTERM) |
Step 3: Read the Logs
# Get log paths
torc reports results <workflow_id> --job-id <job_id>
# View stderr (usually contains error messages)
cat output/job_stdio/job_wf43_j15_r1_a1.e
# View stdout
cat output/job_stdio/job_wf43_j15_r1_a1.o
Step 4: Check Resource Usage
Did the job exceed its resource limits?
torc reports check-resource-utilization <workflow_id>
Look for:
- Memory exceeded — Job was likely OOM-killed (exit code 137)
- Runtime exceeded — Job was terminated for running too long
Step 5: Reproduce Locally
Get the exact command that was run:
torc jobs get <job_id>
Try running it manually to see the error:
# Copy the command from the output and run it
python process.py --input data.csv
Common Fixes
| Problem | Solution |
|---|---|
| OOM killed | Increase memory in resource requirements |
| File not found | Verify input files exist, check dependencies |
| Permission denied | Check file permissions, execution bits |
| Timeout | Increase runtime in resource requirements |
Step 6: Fix and Retry
After fixing the issue:
# Reinitialize to reset failed jobs
torc workflows reset-status --failed --reinitialize <workflow_id>
# Run again locally
torc workflows run <workflow_id>
# Or submit to Slurm
torc submit-slurm <workflow_id>
See Also
- View Job Logs — Finding log files
- Check Resource Utilization — Resource analysis
- Debugging Workflows — Comprehensive debugging guide
How to Check Resource Utilization
Compare actual resource usage against specified requirements to identify jobs that exceeded their limits.
Quick Start
torc reports check-resource-utilization <workflow_id>
Example output:
⚠ Found 2 resource over-utilization violations:
Job ID | Job Name | Resource | Specified | Peak Used | Over-Utilization
-------|-------------|----------|-----------|-----------|------------------
15 | train_model | Memory | 8.00 GB | 10.50 GB | +31.3%
15 | train_model | Runtime | 2h 0m 0s | 2h 45m 0s | +37.5%
Show All Jobs
Include jobs that stayed within limits:
torc reports check-resource-utilization <workflow_id> --all
Check a Specific Run
For workflows that have been reinitialized multiple times:
torc reports check-resource-utilization <workflow_id> --run-id 2
Adjusting Requirements
When jobs exceed their limits, update your workflow specification with a buffer:
resource_requirements:
- name: training
memory: 12g # 10.5 GB peak + 15% buffer
runtime: PT3H # 2h 45m actual + buffer
Guidelines:
- Memory: Add 10-20% above peak usage
- Runtime: Add 15-30% above actual duration
- CPU: Round up to next core count
See Also
- Resource Monitoring — Enable and configure monitoring
- Resource Requirements Reference — Specification format
How to View Resource Utilization Plots
Generate interactive visualizations of CPU and memory usage over time.
Prerequisites
Enable time series monitoring in your workflow specification:
resource_monitor:
granularity: "time_series"
sample_interval_seconds: 2
This creates a SQLite database with resource samples at regular intervals.
Generate Plots
After your workflow completes, generate plots from the collected data:
torc plot-resources output/resource_utilization/resource_metrics_*.db -o plots/
This creates:
- Individual job plots — CPU, memory, and process count over time for each job
- Overview plots — Comparison across all jobs
- Summary dashboard — Bar charts of peak and average usage
Plot Specific Jobs
Generate plots for only certain jobs:
torc plot-resources output/resource_utilization/resource_metrics_*.db \
-o plots/ \
--job-ids 15,16
View the Plots
Open the generated HTML files in your browser:
open plots/job_15_resources.html
Query Data Directly
For custom analysis, query the SQLite database:
sqlite3 -table output/resource_utilization/resource_metrics_1_1.db
-- View samples for a specific job
SELECT timestamp, cpu_percent, memory_bytes
FROM job_resource_samples
WHERE job_id = 1
ORDER BY timestamp;
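Other aggregate queries follow the same pattern. For example, to report the peak memory observed for each job from the same table (column names as shown above), you can pass the SQL directly to sqlite3:
sqlite3 -table output/resource_utilization/resource_metrics_1_1.db \
  "SELECT job_id, MAX(memory_bytes) AS peak_memory_bytes FROM job_resource_samples GROUP BY job_id;"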
See Also
- Resource Monitoring — Configuration options
- Resource Monitoring Database — Database schema reference
How to Parameterize Jobs with Files
Process multiple input files by combining parameterization with file path templating.
Basic Pattern
Use a parameter to generate jobs for each file:
name: process_files
jobs:
- name: process_{dataset}
command: python process.py --input data/{dataset}.csv --output results/{dataset}.json
parameters:
dataset: "[train, test, validation]"
This creates 3 jobs:
- `process_train` → processes `data/train.csv`
- `process_test` → processes `data/test.csv`
- `process_validation` → processes `data/validation.csv`
With File Dependencies
Combine parameterization with explicit file definitions for dependency tracking:
name: file_pipeline
files:
- name: raw_{dataset}
path: data/{dataset}.csv
- name: processed_{dataset}
path: results/{dataset}.json
jobs:
- name: process_{dataset}
command: python process.py -i ${files.input.raw_{dataset}} -o ${files.output.processed_{dataset}}
parameters:
dataset: "[train, test, validation]"
- name: aggregate
command: python aggregate.py --input results/ --output summary.json
depends_on:
- process_{dataset}
parameters:
dataset: "[train, test, validation]"
The aggregate job automatically waits for all process_* jobs to complete.
Processing Numbered Files
Use range syntax for numbered file sequences:
jobs:
- name: convert_{i:03d}
command: ffmpeg -i video_{i:03d}.mp4 audio_{i:03d}.mp3
parameters:
i: "1:100"
Creates jobs for video_001.mp4 through video_100.mp4.
Multi-Dimensional Sweeps
Combine multiple parameters for Cartesian product expansion:
jobs:
- name: analyze_{region}_{year}
command: python analyze.py --region {region} --year {year} --output results/{region}_{year}.json
parameters:
region: "[north, south, east, west]"
year: "2020:2024"
Creates 20 jobs (4 regions × 5 years).
See Also
- Simple Parameterization — Basic parameter tutorial
- Advanced Parameterization — Multi-dimensional sweeps
- Job Parameterization Reference — Complete syntax
Tutorials
Hands-on tutorials to learn Torc by example.
- Many Independent Jobs - Running large batches of independent jobs
- Diamond Workflow - Building workflows with dependencies
- Simple Parameterization - Creating parameterized job sweeps
- Advanced Parameterization - Complex parameter combinations
- Multi-Stage Workflows with Barriers - Coordinating workflow stages
- User Data Dependencies - Passing data between jobs
Tutorial 1: Many Independent Jobs
This tutorial teaches you how to create and run a workflow with many independent parallel jobs using Torc's parameterization feature.
Learning Objectives
By the end of this tutorial, you will:
- Understand how to define parameterized jobs that expand into multiple instances
- Learn how Torc executes independent jobs in parallel
- Know how to monitor job execution and view results
Prerequisites
- Torc server running (see Installation)
- Basic familiarity with YAML syntax
Use Cases
This pattern is ideal for:
- Parameter sweeps: Testing different configurations
- Monte Carlo simulations: Running many independent trials
- Batch processing: Processing many files with the same logic
- Embarrassingly parallel workloads: Any task that can be split into independent units
Step 1: Start the Torc Server
First, ensure the Torc server is running:
torc-server run
By default, the server listens on port 8080, making the API URL
http://localhost:8080/torc-service/v1.
If you use a custom port, set the environment variable:
export TORC_API_URL="http://localhost:8100/torc-service/v1"
Step 2: Create the Workflow Specification
Save the following as hundred_jobs.yaml:
name: hundred_jobs_parallel
description: 100 independent jobs that can run in parallel
jobs:
- name: job_{i:03d}
command: |
echo "Running job {i}"
sleep $((RANDOM % 10 + 1))
echo "Job {i} completed"
resource_requirements: minimal
parameters:
i: "1:100"
resource_requirements:
- name: minimal
num_cpus: 1
num_gpus: 0
num_nodes: 1
memory: 1g
runtime: PT5M
Understanding the Specification
Let's break down the key elements:
- `name: job_{i:03d}`: The `{i:03d}` is a parameter placeholder. The `:03d` format specifier means "3-digit zero-padded integer", so jobs will be named `job_001`, `job_002`, ..., `job_100`.
- `parameters: i: "1:100"`: This defines a parameter `i` that ranges from 1 to 100 (inclusive). Torc will create one job for each value.
- `resource_requirements: minimal`: Each job uses the "minimal" resource profile defined below.
When Torc processes this specification, it expands the single job definition into 100 separate jobs, each with its own parameter value substituted.
Step 3: Run the Workflow
Create and run the workflow in one command:
torc run hundred_jobs.yaml
This command:
- Creates the workflow on the server
- Expands the parameterized job into 100 individual jobs
- Initializes the dependency graph (in this case, no dependencies)
- Starts executing jobs in parallel
You'll see output showing the workflow ID and progress.
Step 4: Monitor Execution
While the workflow runs, you can monitor progress:
# Check workflow status
torc workflows status <workflow_id>
# List jobs and their states
torc jobs list <workflow_id>
# Or use the interactive TUI
torc tui
Since all 100 jobs are independent (no dependencies between them), Torc will run as many in parallel as your system resources allow.
Step 5: View Results
After completion, check the results:
torc results list <workflow_id>
This shows return codes, execution times, and resource usage for each job.
How It Works
When you run this workflow, Torc:
- Expands parameters: The single job definition becomes 100 jobs (`job_001` through `job_100`)
- Executes in parallel: The job runner claims and executes jobs based on available resources
- Tracks completion: Each job's return code and metrics are recorded
The job runner respects the resource requirements you specified. With num_cpus: 1 per job, if your
machine has 8 CPUs, approximately 8 jobs will run simultaneously.
What You Learned
In this tutorial, you learned how to:
- ✅ Use parameter expansion (`parameters: i: "1:100"`) to generate multiple jobs from one definition
- ✅ Use format specifiers (`{i:03d}`) for consistent naming
- ✅ Run independent parallel jobs with `torc run`
- ✅ Monitor workflow progress and view results
Example Files
See hundred_jobs_parameterized.yaml for a ready-to-run version of this workflow.
Next Steps
- Tutorial 2: Diamond Workflow - Learn how to create job dependencies using files
- Tutorial 4: Simple Parameterization - Explore more parameter expansion options
- Multi-Stage Workflows with Barriers - Scale to thousands of jobs efficiently
Tutorial 2: Diamond Workflow with File Dependencies
This tutorial teaches you how to create workflows where job dependencies are automatically inferred from file inputs and outputs—a core concept in Torc called implicit dependencies.
Learning Objectives
By the end of this tutorial, you will:
- Understand how Torc infers job dependencies from file relationships
- Learn the "diamond" workflow pattern (fan-out and fan-in)
- Know how to use file variable substitution (`${files.input.*}` and `${files.output.*}`)
Prerequisites
- Completed Tutorial 1: Many Independent Jobs
- Torc server running
The Diamond Pattern
The "diamond" pattern is a common workflow structure where:
- One job produces multiple outputs (fan-out)
- Multiple jobs process those outputs in parallel
- A final job combines all results (fan-in)
graph TD
Input(["input.txt"]) --> Preprocess["preprocess"]
Preprocess --> Int1(["intermediate1.txt"])
Preprocess --> Int2(["intermediate2.txt"])
Int1 --> Work1["work1"]
Int2 --> Work2["work2"]
Work1 --> Result1(["result1.txt"])
Work2 --> Result2(["result2.txt"])
Result1 --> Postprocess["postprocess"]
Result2 --> Postprocess
Postprocess --> Output(["output.txt"])
style Input fill:#d4edda,stroke:#28a745,color:#155724
style Int1 fill:#d4edda,stroke:#28a745,color:#155724
style Int2 fill:#d4edda,stroke:#28a745,color:#155724
style Result1 fill:#d4edda,stroke:#28a745,color:#155724
style Result2 fill:#d4edda,stroke:#28a745,color:#155724
style Output fill:#d4edda,stroke:#28a745,color:#155724
style Preprocess fill:#4a9eff,color:#fff
style Work1 fill:#4a9eff,color:#fff
style Work2 fill:#4a9eff,color:#fff
style Postprocess fill:#4a9eff,color:#fff
Notice that we never explicitly say "work1 depends on preprocess"—Torc figures this out
automatically because work1 needs intermediate1.txt as input, and preprocess produces it as
output.
Step 1: Create the Workflow Specification
Save as diamond.yaml:
name: diamond_workflow
description: Diamond workflow demonstrating fan-out and fan-in
jobs:
- name: preprocess
command: |
cat ${files.input.input_file} |
awk '{print $1}' > ${files.output.intermediate1}
cat ${files.input.input_file} |
awk '{print $2}' > ${files.output.intermediate2}
resource_requirements: small
- name: work1
command: |
cat ${files.input.intermediate1} |
sort | uniq > ${files.output.result1}
resource_requirements: medium
- name: work2
command: |
cat ${files.input.intermediate2} |
sort | uniq > ${files.output.result2}
resource_requirements: medium
- name: postprocess
command: |
paste ${files.input.result1} ${files.input.result2} > ${files.output.final_output}
resource_requirements: small
files:
- name: input_file
path: /tmp/input.txt
- name: intermediate1
path: /tmp/intermediate1.txt
- name: intermediate2
path: /tmp/intermediate2.txt
- name: result1
path: /tmp/result1.txt
- name: result2
path: /tmp/result2.txt
- name: final_output
path: /tmp/output.txt
resource_requirements:
- name: small
num_cpus: 1
num_gpus: 0
num_nodes: 1
memory: 1g
runtime: PT10M
- name: medium
num_cpus: 4
num_gpus: 0
num_nodes: 1
memory: 4g
runtime: PT30M
Understanding File Variable Substitution
The key concept here is file variable substitution:
- `${files.input.filename}` - References a file this job reads (creates a dependency)
- `${files.output.filename}` - References a file this job writes (satisfies dependencies)
When Torc processes the workflow:
- It sees `preprocess` outputs `intermediate1` and `intermediate2`
- It sees `work1` inputs `intermediate1` → dependency created
- It sees `work2` inputs `intermediate2` → dependency created
- It sees `postprocess` inputs `result1` and `result2` → dependencies created
This is more maintainable than explicit depends_on declarations because:
- Dependencies are derived from actual data flow
- Adding a new intermediate step automatically updates dependencies
- The workflow specification documents the data flow
Step 2: Create Input Data
# Create test input file
echo -e "apple red\nbanana yellow\ncherry red\ndate brown" > /tmp/input.txt
Step 3: Create and Initialize the Workflow
# Create the workflow and capture the ID
WORKFLOW_ID=$(torc workflows create diamond.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
# Ensure the input file timestamp is current
touch /tmp/input.txt
# Initialize the workflow (builds dependency graph)
torc workflows initialize-jobs $WORKFLOW_ID
The initialize-jobs command is where Torc:
- Analyzes file input/output relationships
- Builds the dependency graph
- Marks jobs with satisfied dependencies as "ready"
Step 4: Observe Dependency Resolution
# Check job statuses
torc jobs list $WORKFLOW_ID
Expected output:
╭────┬──────────────┬─────────┬────────╮
│ ID │ Name │ Status │ ... │
├────┼──────────────┼─────────┼────────┤
│ 1 │ preprocess │ ready │ ... │
│ 2 │ work1 │ blocked │ ... │
│ 3 │ work2 │ blocked │ ... │
│ 4 │ postprocess │ blocked │ ... │
╰────┴──────────────┴─────────┴────────╯
Only preprocess is ready because:
- Its only input (`input_file`) already exists
- The others are blocked waiting for files that don't exist yet
Step 5: Run the Workflow
torc run $WORKFLOW_ID
Watch the execution unfold:
1. `preprocess` runs first - Creates `intermediate1.txt` and `intermediate2.txt`
2. `work1` and `work2` unblock - Their input files now exist
3. `work1` and `work2` run in parallel - They have no dependency on each other
4. `postprocess` unblocks - Both `result1.txt` and `result2.txt` exist
5. `postprocess` runs - Creates the final output
Step 6: Verify Results
cat /tmp/output.txt
You should see the combined, sorted, unique values from both columns of the input.
How Implicit Dependencies Work
Torc determines job order through file relationships:
| Job | Inputs | Outputs | Blocked By |
|---|---|---|---|
| preprocess | input_file | intermediate1, intermediate2 | (nothing) |
| work1 | intermediate1 | result1 | preprocess |
| work2 | intermediate2 | result2 | preprocess |
| postprocess | result1, result2 | final_output | work1, work2 |
The dependency graph is built automatically from these relationships. If you later add a validation
step between preprocess and work1, you only need to update the file references—the dependencies
adjust automatically.
What You Learned
In this tutorial, you learned:
- ✅ How to define files with a `files:` section and reference them in jobs
- ✅ How `${files.input.*}` creates implicit dependencies
- ✅ How `${files.output.*}` satisfies dependencies for downstream jobs
- ✅ How Torc automatically determines execution order from data flow
When to Use File Dependencies vs Explicit Dependencies
Use file dependencies when:
- Jobs actually read/write files
- Data flow defines the natural ordering
- You want self-documenting workflows
Use explicit depends_on when:
- Dependencies are logical, not data-based
- Jobs communicate through side effects
- You need precise control over ordering
Example Files
See the diamond workflow examples in the examples directory, available in all three formats. A Python version is also available: diamond_workflow.py
Next Steps
- Tutorial 3: User Data Dependencies - Pass JSON data between jobs without files
- Tutorial 4: Simple Parameterization - Combine file dependencies with parameter expansion
Tutorial 4: Simple Job Parameterization
This tutorial teaches you how to create parameter sweeps—generating multiple related jobs from a single job definition using Torc's parameterization feature.
Learning Objectives
By the end of this tutorial, you will:
- Understand how parameterization expands one job definition into many jobs
- Learn the different parameter formats (lists, ranges)
- Know how to use format specifiers for consistent naming
Prerequisites
- Completed Tutorial 1: Many Independent Jobs
- Torc server running
Why Parameterization?
Without parameterization, running 10 similar jobs requires 10 separate definitions. With parameterization, you write one definition and Torc expands it:
# Without parameterization: 10 separate definitions
jobs:
- name: task_1
command: ./process.sh --index 1
- name: task_2
command: ./process.sh --index 2
# ... 8 more ...
# With parameterization: 1 definition
jobs:
- name: task_{i}
command: ./process.sh --index {i}
parameters:
i: "1:10"
Step 1: Create a Simple Parameterized Workflow
Save as parallel_tasks.yaml:
name: parallel_tasks
description: Run 10 parallel tasks
jobs:
- name: task_{i}
command: |
echo "Processing task {i}"
sleep 2
echo "Task {i} complete"
parameters:
i: "1:10"
This expands to 10 jobs: task_1, task_2, ..., task_10.
Step 2: Create and Run the Workflow
WORKFLOW_ID=$(torc workflows create parallel_tasks.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
torc run $WORKFLOW_ID
All 10 jobs run in parallel since they have no dependencies.
Step 3: Verify the Expansion
torc jobs list $WORKFLOW_ID -f json | jq -r '.jobs[].name' | sort
Output:
task_1
task_10
task_2
task_3
task_4
task_5
task_6
task_7
task_8
task_9
Adding Dependencies Between Parameterized Jobs
Now let's create a workflow where parameterized jobs have dependencies:
name: process_and_aggregate
description: Process chunks then aggregate results
jobs:
- name: process_{i}
command: |
echo "Processing chunk {i}" > results/chunk_{i}.txt
parameters:
i: "1:5"
- name: aggregate
command: |
cat results/chunk_*.txt > results/final.txt
echo "Aggregated all chunks"
depends_on:
- process_{i}
parameters:
i: "1:5"
This creates:
- 5 parallel `process_*` jobs
- 1 `aggregate` job that waits for all 5 to complete
The aggregate job lists all process_{i} jobs as dependencies, so Torc expands this to depend on
all 5 process jobs.
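The remainder of this tutorial walks through a learning-rate sweep defined in learning_rate_sweep.yaml. A minimal sketch of such a spec, reconstructed from the description below, might look like the following; the commands, paths, and exact job layout are illustrative, and the actual tutorial file may differ:
name: learning_rate_sweep
description: Train and evaluate a model across several learning rates, then compare
jobs:
  - name: train_lr{lr:.4f}
    command: python train.py --learning-rate={lr} --output=models/model_lr{lr:.4f}.pt
    output_files:
      - model_lr{lr:.4f}
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"
  - name: evaluate_lr{lr:.4f}
    command: python evaluate.py --model=models/model_lr{lr:.4f}.pt --metrics=results/metrics_lr{lr:.4f}.json
    input_files:
      - model_lr{lr:.4f}
    output_files:
      - metrics_lr{lr:.4f}
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"
  - name: compare_results
    command: python compare.py --metrics-dir=results/ --output=results/comparison.json
    input_files:
      - metrics_lr{lr:.4f}
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"
files:
  - name: model_lr{lr:.4f}
    path: models/model_lr{lr:.4f}.pt
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"
  - name: metrics_lr{lr:.4f}
    path: results/metrics_lr{lr:.4f}.json
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"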
Understanding the Specification
Parameter Syntax:
- `{lr}` - Simple substitution with the parameter value
- `{lr:.4f}` - Format specifier: 4 decimal places (e.g., `0.0010` not `0.001`)
Parameter Values:
- `"[0.0001,0.0005,0.001,0.005,0.01]"` - A list of 5 specific values
File Parameterization: Notice that both jobs AND files have `parameters:`. When Torc expands:
- Each `train_lr{lr:.4f}` job gets a corresponding `model_lr{lr:.4f}` file
- The file dependencies are matched by parameter value
Dependency Flow:
- `train_lr0.0001` → outputs `model_lr0.0001` → unblocks `evaluate_lr0.0001`
- `train_lr0.0005` → outputs `model_lr0.0005` → unblocks `evaluate_lr0.0005`
- (and so on for each learning rate)
- All `evaluate_*` jobs → unblock `compare_results`
Step 2: Create and Initialize the Workflow
WORKFLOW_ID=$(torc workflows create learning_rate_sweep.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
torc workflows initialize-jobs $WORKFLOW_ID
Step 3: Verify the Expansion
# Count jobs (should be 11: 5 train + 5 evaluate + 1 compare)
torc jobs list $WORKFLOW_ID -f json | jq '.jobs | length'
List the job names:
torc jobs list $WORKFLOW_ID -f json | jq -r '.jobs[].name' | sort
Output:
compare_results
evaluate_lr0.0001
evaluate_lr0.0005
evaluate_lr0.0010
evaluate_lr0.0050
evaluate_lr0.0100
train_lr0.0001
train_lr0.0005
train_lr0.0010
train_lr0.0050
train_lr0.0100
Notice:
- One job per parameter value for `train_*` and `evaluate_*`
- Only one `compare_results` job (it has the parameter for dependencies, but doesn't expand because its name has no `{lr}`)
Step 4: Check Dependencies
torc jobs list $WORKFLOW_ID
Expected statuses:
- All `train_*` jobs: ready (no input dependencies)
- All `evaluate_*` jobs: blocked (waiting for corresponding model file)
- `compare_results`: blocked (waiting for all metrics files)
Step 5: Run the Workflow
torc run $WORKFLOW_ID
Execution flow:
- All 5 training jobs run in parallel - They have no dependencies on each other
- Each evaluation unblocks independently - When `train_lr0.0001` finishes, `evaluate_lr0.0001` can start (doesn't wait for other training jobs)
- Compare runs last - Only after all 5 evaluations complete
This is more efficient than a simple two-stage workflow because evaluations can start as soon as their specific training job completes.
Parameter Format Reference
List Format
Explicit list of values:
parameters:
lr: "[0.0001,0.0005,0.001,0.005,0.01]" # Numbers
opt: "['adam','sgd','rmsprop']" # Strings (note the quotes)
Range Format
For integer or float sequences:
parameters:
i: "1:100" # Integers 1 to 100 (inclusive)
i: "0:100:10" # Integers 0, 10, 20, ..., 100 (with step)
lr: "0.0:1.0:0.1" # Floats 0.0, 0.1, 0.2, ..., 1.0
Format Specifiers
Control how values appear in names:
| Specifier | Example Value | Result |
|---|---|---|
{i} | 5 | 5 |
{i:03d} | 5 | 005 |
{lr:.4f} | 0.001 | 0.0010 |
{lr:.2e} | 0.001 | 1.00e-03 |
How Parameterization and File Dependencies Interact
When both jobs and files are parameterized with the same parameter:
jobs:
- name: train_{i}
output_files: [model_{i}]
parameters:
i: "1:3"
- name: eval_{i}
input_files: [model_{i}]
parameters:
i: "1:3"
files:
- name: model_{i}
path: /models/model_{i}.pt
parameters:
i: "1:3"
Torc creates these relationships:
- `train_1` → `model_1` → `eval_1`
- `train_2` → `model_2` → `eval_2`
- `train_3` → `model_3` → `eval_3`
Each chain is independent—eval_2 doesn't wait for train_1.
Parameter Modes: Product vs Zip
By default, multiple parameters create a Cartesian product (all combinations). For paired
parameters, use parameter_mode: zip:
jobs:
# Default (product): 3 × 3 = 9 jobs
- name: train_{dataset}_{model}
command: python train.py --dataset={dataset} --model={model}
parameters:
dataset: "['cifar10', 'mnist', 'imagenet']"
model: "['resnet', 'vgg', 'transformer']"
# Zip mode: 3 paired jobs (cifar10+resnet, mnist+vgg, imagenet+transformer)
- name: paired_{dataset}_{model}
command: python train.py --dataset={dataset} --model={model}
parameters:
dataset: "['cifar10', 'mnist', 'imagenet']"
model: "['resnet', 'vgg', 'transformer']"
parameter_mode: zip
Use zip mode when parameters have a 1:1 correspondence (e.g., input/output file pairs, pre-determined configurations).
See Parameterization Reference for details.
What You Learned
In this tutorial, you learned:
- ✅ How to use
parameters:to expand one job definition into many - ✅ List format (
"[a,b,c]") and range format ("1:100") - ✅ Format specifiers (
{i:03d},{lr:.4f}) for consistent naming - ✅ How parameterized files create one-to-one dependencies
- ✅ The efficiency of parameter-matched dependencies (each chain runs independently)
- ✅ The difference between product (default) and zip parameter modes
Next Steps
- Tutorial 5: Advanced Parameterization - Multi-dimensional grid searches
- Multi-Stage Workflows with Barriers - Scale to thousands of parameterized jobs
Tutorial 5: Advanced Multi-Dimensional Parameterization
This tutorial teaches you how to create multi-dimensional parameter sweeps—grid searches over multiple hyperparameters that generate all combinations automatically.
Learning Objectives
By the end of this tutorial, you will:
- Understand how multiple parameters create a Cartesian product (all combinations)
- Learn to structure complex workflows with data preparation, training, and aggregation stages
- Know how to combine parameterization with explicit dependencies
- See patterns for running large grid searches on HPC systems
Prerequisites
- Completed Tutorial 4: Simple Parameterization
- Torc server running
- Understanding of file dependencies
Multi-Dimensional Parameters: Cartesian Product
When a job has multiple parameters, Torc creates the Cartesian product—every combination of values:
parameters:
lr: "[0.001,0.01]" # 2 values
bs: "[16,32]" # 2 values
This generates 2 × 2 = 4 jobs:
lr=0.001, bs=16lr=0.001, bs=32lr=0.01, bs=16lr=0.01, bs=32
With three parameters:
parameters:
lr: "[0.0001,0.001,0.01]" # 3 values
bs: "[16,32,64]" # 3 values
opt: "['adam','sgd']" # 2 values
This generates 3 × 3 × 2 = 18 jobs.
Step 1: Create the Workflow Specification
Save as grid_search.yaml:
name: hyperparameter_grid_search
description: 3D grid search over learning rate, batch size, and optimizer
jobs:
# Data preparation (runs once, no parameters)
- name: prepare_data
command: python prepare_data.py --output=/data/processed.pkl
resource_requirements: data_prep
output_files:
- training_data
# Training jobs (one per parameter combination)
- name: train_lr{lr:.4f}_bs{bs}_opt{opt}
command: |
python train.py \
--data=/data/processed.pkl \
--learning-rate={lr} \
--batch-size={bs} \
--optimizer={opt} \
--output=/models/model_lr{lr:.4f}_bs{bs}_opt{opt}.pt \
--metrics=/results/metrics_lr{lr:.4f}_bs{bs}_opt{opt}.json
resource_requirements: gpu_training
input_files:
- training_data
output_files:
- model_lr{lr:.4f}_bs{bs}_opt{opt}
- metrics_lr{lr:.4f}_bs{bs}_opt{opt}
parameters:
lr: "[0.0001,0.001,0.01]"
bs: "[16,32,64]"
opt: "['adam','sgd']"
# Aggregate results (depends on ALL training jobs via file dependencies)
- name: aggregate_results
command: |
python aggregate.py \
--input-dir=/results \
--output=/results/summary.csv
resource_requirements: minimal
input_files:
- metrics_lr{lr:.4f}_bs{bs}_opt{opt}
parameters:
lr: "[0.0001,0.001,0.01]"
bs: "[16,32,64]"
opt: "['adam','sgd']"
# Find best model (explicit dependency, no parameters)
- name: select_best_model
command: |
python select_best.py \
--summary=/results/summary.csv \
--output=/results/best_config.json
resource_requirements: minimal
depends_on:
- aggregate_results
files:
- name: training_data
path: /data/processed.pkl
- name: model_lr{lr:.4f}_bs{bs}_opt{opt}
path: /models/model_lr{lr:.4f}_bs{bs}_opt{opt}.pt
parameters:
lr: "[0.0001,0.001,0.01]"
bs: "[16,32,64]"
opt: "['adam','sgd']"
- name: metrics_lr{lr:.4f}_bs{bs}_opt{opt}
path: /results/metrics_lr{lr:.4f}_bs{bs}_opt{opt}.json
parameters:
lr: "[0.0001,0.001,0.01]"
bs: "[16,32,64]"
opt: "['adam','sgd']"
resource_requirements:
- name: data_prep
num_cpus: 8
memory: 32g
runtime: PT1H
- name: gpu_training
num_cpus: 8
num_gpus: 1
memory: 16g
runtime: PT4H
- name: minimal
num_cpus: 1
memory: 2g
runtime: PT10M
Understanding the Structure
Four-stage workflow:
prepare_data(1 job) - No parameters, runs oncetrain_*(18 jobs) - Parameterized, all depend onprepare_dataaggregate_results(1 job) - Has parameters only for file dependency matchingselect_best_model(1 job) - Explicit dependency onaggregate_results
Key insight: Why aggregate_results has parameters
The aggregate_results job won't expand into multiple jobs (its name has no {}). However, it
needs parameters: to match the parameterized input_files. This tells Torc: "this job depends on
ALL 18 metrics files."
Step 2: Create and Initialize the Workflow
WORKFLOW_ID=$(torc workflows create grid_search.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
torc workflows initialize-jobs $WORKFLOW_ID
Step 3: Verify the Expansion
Count the jobs:
torc jobs list $WORKFLOW_ID -f json | jq '.jobs | length'
Expected: 21 jobs (1 prepare + 18 training + 1 aggregate + 1 select)
List the training jobs:
torc jobs list $WORKFLOW_ID -f json | jq -r '.jobs[] | select(.name | startswith("train_")) | .name' | sort
Output (18 training jobs):
train_lr0.0001_bs16_optadam
train_lr0.0001_bs16_optsgd
train_lr0.0001_bs32_optadam
train_lr0.0001_bs32_optsgd
train_lr0.0001_bs64_optadam
train_lr0.0001_bs64_optsgd
train_lr0.0010_bs16_optadam
train_lr0.0010_bs16_optsgd
train_lr0.0010_bs32_optadam
train_lr0.0010_bs32_optsgd
train_lr0.0010_bs64_optadam
train_lr0.0010_bs64_optsgd
train_lr0.0100_bs16_optadam
train_lr0.0100_bs16_optsgd
train_lr0.0100_bs32_optadam
train_lr0.0100_bs32_optsgd
train_lr0.0100_bs64_optadam
train_lr0.0100_bs64_optsgd
Step 4: Examine the Dependency Graph
torc jobs list $WORKFLOW_ID
Initial states:
prepare_data: ready (no dependencies)- All
train_*: blocked (waiting fortraining_datafile) aggregate_results: blocked (waiting for all 18 metrics files)select_best_model: blocked (waiting foraggregate_results)
Step 5: Run the Workflow
For local execution:
torc run $WORKFLOW_ID
Execution flow:
prepare_dataruns and producestraining_data- All 18
train_*jobs unblock and run in parallel (resource-limited) aggregate_resultswaits for all training jobs, then runsselect_best_modelruns last
Step 6: Monitor Progress
# Check status summary
torc workflows status $WORKFLOW_ID
# Watch job completion in real-time
watch -n 10 'torc jobs list-by-status $WORKFLOW_ID'
# Or use the TUI
torc tui
Step 7: Retrieve Results
After completion:
# View best configuration
cat /results/best_config.json
# View summary of all runs
cat /results/summary.csv
Scaling Considerations
Job Count Growth
Multi-dimensional parameters grow exponentially:
| Dimensions | Values per Dimension | Total Jobs |
|---|---|---|
| 1 | 10 | 10 |
| 2 | 10 × 10 | 100 |
| 3 | 10 × 10 × 10 | 1,000 |
| 4 | 10 × 10 × 10 × 10 | 10,000 |
Dependency Count
Without barriers, dependencies also grow quickly. In this tutorial:
- 18 training jobs each depend on 1 file = 18 dependencies
- 1 aggregate job depends on 18 files = 18 dependencies
- Total: ~36 dependencies
For larger sweeps (1000+ jobs), consider the barrier pattern to reduce dependencies from O(n²) to O(n).
Common Patterns
Mixing Fixed and Parameterized Jobs
jobs:
# Fixed job (no parameters)
- name: setup
command: ./setup.sh
# Parameterized jobs depend on fixed job
- name: experiment_{i}
command: ./run.sh {i}
depends_on: [setup]
parameters:
i: "1:100"
Aggregating Parameterized Results
Use the file dependency pattern shown in this tutorial:
- name: aggregate
input_files:
- result_{i} # Matches all parameterized result files
parameters:
i: "1:100" # Same parameters as producer jobs
Nested Parameter Sweeps
For workflows with multiple independent sweeps:
jobs:
# Sweep 1
- name: sweep1_job_{a}
parameters:
a: "1:10"
# Sweep 2 (independent of sweep 1)
- name: sweep2_job_{b}
parameters:
b: "1:10"
What You Learned
In this tutorial, you learned:
- ✅ How multiple parameters create a Cartesian product of jobs
- ✅ How to structure multi-stage workflows (prep → train → aggregate → select)
- ✅ How to use parameters in file dependencies to collect all outputs
- ✅ How to mix parameterized and non-parameterized jobs
- ✅ Scaling considerations for large grid searches
Example Files
See these example files for hyperparameter sweep patterns:
- hyperparameter_sweep.yaml - Basic 3×3×2 grid search
- hyperparameter_sweep_shared_params.yaml - Grid search with shared parameter definitions
Next Steps
- Multi-Stage Workflows with Barriers - Essential for scaling to thousands of jobs
- Working with Slurm - Deploy grid searches on HPC clusters
- Resource Monitoring - Track resource usage across your sweep
Multi-Stage Workflows with Barriers
This tutorial teaches you how to efficiently structure workflows with multiple stages using the barrier pattern. This is essential for scaling workflows to thousands of jobs.
Learning Objectives
By the end of this tutorial, you will:
- Understand the quadratic dependency problem in multi-stage workflows
- Use barrier jobs to efficiently synchronize between stages
- Scale workflows to thousands of jobs with minimal overhead
- Know when to use barriers vs. direct dependencies
Prerequisites
- Basic understanding of Torc workflows
- Completed the Many Independent Jobs tutorial
- Completed the Simple Parameterization tutorial
The Problem: Quadratic Dependencies
Let's start with a common but inefficient pattern. Suppose you want to:
- Stage 1: Run 1000 preprocessing jobs in parallel
- Stage 2: Run 1000 analysis jobs, but only after ALL stage 1 jobs complete
- Stage 3: Run a final aggregation job
Naive Approach (DON'T DO THIS!)
name: "Inefficient Multi-Stage Workflow"
description: "This creates 1,000,000 dependencies!"
jobs:
# Stage 1: 1000 preprocessing jobs
- name: "preprocess_{i:03d}"
command: "python preprocess.py --id {i}"
parameters:
i: "0:999"
# Stage 2: Each analysis job waits for ALL preprocessing jobs
- name: "analyze_{i:03d}"
command: "python analyze.py --id {i}"
depends_on_regexes: ["^preprocess_.*"] # ⚠️ Creates 1,000,000 dependencies!
parameters:
i: "0:999"
# Stage 3: Final aggregation
- name: "final_report"
command: "python generate_report.py"
depends_on_regexes: ["^analyze_.*"] # ⚠️ Creates 1,000 more dependencies
Why This is Bad
When Torc expands this workflow:
- Each of the 1000
analyze_*jobs gets a dependency on each of the 1000preprocess_*jobs - Total dependencies: 1000 × 1000 = 1,000,000 relationships
- Workflow creation takes minutes instead of seconds
- Database becomes bloated with dependency records
- Job initialization is slow
The Solution: Barrier Jobs
A barrier job is a lightweight synchronization point that:
- Depends on all jobs from the previous stage (using a regex)
- Is depended upon by all jobs in the next stage
- Reduces dependencies from O(n²) to O(n)
flowchart LR
subgraph stage1["Stage 1"]
P1["preprocess_1"]
P2["preprocess_2"]
PN["preprocess_N"]
end
B1(["barrier"])
subgraph stage2["Stage 2"]
A1["analyze_1"]
A2["analyze_2"]
AN["analyze_N"]
end
P1 --> B1
P2 --> B1
PN --> B1
B1 --> A1
B1 --> A2
B1 --> AN
style P1 fill:#4a9eff,color:#fff
style P2 fill:#4a9eff,color:#fff
style PN fill:#4a9eff,color:#fff
style B1 fill:#28a745,color:#fff
style A1 fill:#ffc107,color:#000
style A2 fill:#ffc107,color:#000
style AN fill:#ffc107,color:#000
Instead of N×N dependencies (every stage 2 job depending on every stage 1 job), you get 2N dependencies (N into the barrier, N out of the barrier).
Efficient Approach (DO THIS!)
name: "Efficient Multi-Stage Workflow"
description: "Uses barrier pattern with only ~3000 dependencies"
jobs:
# ═══════════════════════════════════════════════════════════
# STAGE 1: Preprocessing (1000 parallel jobs)
# ═══════════════════════════════════════════════════════════
- name: "preprocess_{i:03d}"
command: "python preprocess.py --id {i} --output data/stage1_{i:03d}.json"
resource_requirements: "medium"
parameters:
i: "0:999"
# ═══════════════════════════════════════════════════════════
# BARRIER: Wait for ALL stage 1 jobs
# ═══════════════════════════════════════════════════════════
- name: "barrier_stage1_complete"
command: "echo 'Stage 1 complete: 1000 files preprocessed' && date"
resource_requirements: "tiny"
depends_on_regexes: ["^preprocess_.*"] # ✓ 1000 dependencies
# ═══════════════════════════════════════════════════════════
# STAGE 2: Analysis (1000 parallel jobs)
# ═══════════════════════════════════════════════════════════
- name: "analyze_{i:03d}"
command: "python analyze.py --input data/stage1_{i:03d}.json --output data/stage2_{i:03d}.csv"
resource_requirements: "large"
depends_on: ["barrier_stage1_complete"] # ✓ 1000 dependencies (one per job)
parameters:
i: "0:999"
# ═══════════════════════════════════════════════════════════
# BARRIER: Wait for ALL stage 2 jobs
# ═══════════════════════════════════════════════════════════
- name: "barrier_stage2_complete"
command: "echo 'Stage 2 complete: 1000 analyses finished' && date"
resource_requirements: "tiny"
depends_on_regexes: ["^analyze_.*"] # ✓ 1000 dependencies
# ═══════════════════════════════════════════════════════════
# STAGE 3: Final report (single job)
# ═══════════════════════════════════════════════════════════
- name: "final_report"
command: "python generate_report.py --output final_report.html"
resource_requirements: "medium"
depends_on: ["barrier_stage2_complete"] # ✓ 1 dependency
resource_requirements:
- name: "tiny"
num_cpus: 1
num_gpus: 0
num_nodes: 1
memory: "100m"
runtime: "PT1M"
- name: "medium"
num_cpus: 4
num_gpus: 0
num_nodes: 1
memory: "4g"
runtime: "PT30M"
- name: "large"
num_cpus: 16
num_gpus: 1
num_nodes: 1
memory: "32g"
runtime: "PT2H"
Dependency Breakdown
Without barriers:
- Stage 1 → Stage 2: 1000 × 1000 = 1,000,000 dependencies
- Stage 2 → Stage 3: 1000 = 1,000 dependencies
- Total: 1,001,000 dependencies
With barriers:
- Stage 1 → Barrier 1: 1,000 dependencies
- Barrier 1 → Stage 2: 1,000 dependencies
- Stage 2 → Barrier 2: 1,000 dependencies
- Barrier 2 → Stage 3: 1 dependency
- Total: 3,001 dependencies ← 333× improvement!
Step-by-Step: Creating Your First Barrier Workflow
Let's create a simple 2-stage workflow.
Step 1: Create the Workflow Spec
Create barrier_demo.yaml:
name: "Barrier Pattern Demo"
description: "Simple demonstration of the barrier pattern"
jobs:
# Stage 1: Generate 100 data files
- name: "generate_data_{i:02d}"
command: "echo 'Data file {i}' > output/data_{i:02d}.txt"
parameters:
i: "0:99"
# Barrier: Wait for all data generation
- name: "data_generation_complete"
command: "echo 'All 100 data files generated' && ls -l output/ | wc -l"
depends_on_regexes: ["^generate_data_.*"]
# Stage 2: Process each data file
- name: "process_data_{i:02d}"
command: "cat output/data_{i:02d}.txt | wc -w > output/processed_{i:02d}.txt"
depends_on: ["data_generation_complete"]
parameters:
i: "0:99"
# Final barrier and report
- name: "processing_complete"
command: "echo 'All 100 files processed' && cat output/processed_*.txt | awk '{sum+=$1} END {print sum}'"
depends_on_regexes: ["^process_data_.*"]
Step 2: Create the Output Directory
mkdir -p output
Step 3: Create the Workflow
torc workflows create barrier_demo.yaml
You should see output like:
Created workflow with ID: 1
- Created 100 stage 1 jobs
- Created 1 barrier job
- Created 100 stage 2 jobs
- Created 1 final barrier
Total: 202 jobs, 201 dependencies
Compare this to 10,000 dependencies without barriers!
Step 4: Run the Workflow
torc workflows run 1
Step 5: Monitor Progress
torc tui
You'll see:
- All 100
generate_data_*jobs run in parallel - Once they finish,
data_generation_completeexecutes - Then all 100
process_data_*jobs run in parallel - Finally,
processing_completeexecutes
Making Effective Barrier Jobs
1. Keep Barriers Lightweight
Barriers should be quick and cheap:
✓ GOOD - Lightweight logging
- name: "stage1_complete"
command: "echo 'Stage 1 done' && date"
resource_requirements: "tiny"
✗ BAD - Heavy computation
- name: "stage1_complete"
command: "python expensive_validation.py" # Don't do this!
resource_requirements: "large"
If you need validation, create a separate job:
# Barrier - lightweight
- name: "stage1_complete"
command: "echo 'Stage 1 done'"
resource_requirements: "tiny"
depends_on_regexes: ["^stage1_.*"]
# Validation - heavier
- name: "validate_stage1"
command: "python validate_all_outputs.py"
resource_requirements: "medium"
depends_on: ["stage1_complete"]
# Stage 2 depends on validation passing
- name: "stage2_job_{i}"
command: "python stage2.py {i}"
depends_on: ["validate_stage1"]
parameters:
i: "0:999"
2. Use Descriptive Names
Names should clearly indicate what stage completed:
✓ GOOD
- name: "barrier_preprocessing_complete"
- name: "barrier_training_complete"
- name: "all_simulations_finished"
✗ BAD
- name: "barrier1"
- name: "sync"
- name: "wait"
3. Add Useful Information
Make barriers informative:
- name: "preprocessing_complete"
command: |
echo "════════════════════════════════════════"
echo "Preprocessing Complete: $(date)"
echo "Files generated: $(ls output/stage1_*.json | wc -l)"
echo "Total size: $(du -sh output/)"
echo "Proceeding to analysis stage..."
echo "════════════════════════════════════════"
depends_on_regexes: ["^preprocess_.*"]
4. Be Careful with Regex Patterns
Ensure your regex matches exactly what you intend:
✓ GOOD - Anchored patterns
depends_on_regexes: ["^stage1_job_.*"] # Matches "stage1_job_001", "stage1_job_042"
depends_on_regexes: ["^preprocess_\\d+$"] # Matches "preprocess_0", "preprocess_999"
✗ BAD - Too broad
depends_on_regexes: ["stage1"] # Matches "my_stage1_test" (unintended!)
depends_on_regexes: [".*"] # Matches EVERYTHING (disaster!)
Test your regex before deploying:
# Python regex tester
python3 -c "import re; print(re.match(r'^stage1_job_.*', 'stage1_job_001'))"
When NOT to Use Barriers
Barriers are not always the right solution:
1. One-to-One Dependencies
When each job in stage 2 only needs its corresponding stage 1 job:
# DON'T use a barrier here
jobs:
- name: "preprocess_{i}"
command: "preprocess.py {i}"
output_files: ["data_{i}.json"]
parameters:
i: "0:99"
# Each analysis only needs its own preprocessed file
- name: "analyze_{i}"
command: "analyze.py {i}"
input_files: ["data_{i}.json"] # ✓ Automatic dependency via files
parameters:
i: "0:99"
The file dependency system already handles this efficiently!
2. Specific Dependencies in DAGs
When you have a directed acyclic graph (DAG) with specific paths:
# Diamond pattern - specific dependencies
jobs:
- name: "fetch_data"
command: "fetch.py"
- name: "process_weather"
command: "process_weather.py"
depends_on: ["fetch_data"]
- name: "process_traffic"
command: "process_traffic.py"
depends_on: ["fetch_data"]
- name: "generate_report"
command: "report.py"
depends_on: ["process_weather", "process_traffic"] # ✓ Specific dependencies
Don't force this into stages - the specific dependencies are clearer!
3. Small Workflows
For small workflows (< 100 jobs), the overhead of barriers isn't worth it:
# Only 10 jobs - barriers not needed
jobs:
- name: "job_{i}"
command: "process.py {i}"
depends_on_regexes: ["^prepare_.*"] # This is fine for 10 jobs
parameters:
i: "0:9"
Scaling to Thousands of Jobs
The barrier pattern scales beautifully. Let's compare performance:
| Stage 1 Jobs | Stage 2 Jobs | Without Barriers | With Barriers | Speedup |
|---|---|---|---|---|
| 100 | 100 | 10,000 deps (~1s) | 200 deps (<0.1s) | 10× |
| 1,000 | 1,000 | 1,000,000 deps (~45s) | 2,000 deps (~0.5s) | 90× |
| 10,000 | 10,000 | 100,000,000 deps (hours) | 20,000 deps (~5s) | 1000×+ |
As you can see, barriers become essential for large-scale workflows.
Complete Example
See multi_stage_barrier_pattern.yaml for a comprehensive example with:
- 3 distinct stages (1000 → 1000 → 100 jobs)
- Informative barrier jobs with progress logging
- Different resource requirements per stage
- Comments explaining the pattern
Summary
✓ Use barrier jobs when all jobs in one stage must complete before any job in the next stage starts
✓ Use file/data dependencies for one-to-one job relationships
✓ Use specific dependencies for DAG patterns with clear paths
✓ Keep barriers lightweight - just logging and simple checks
✓ Use descriptive names to track workflow progress
The barrier pattern is your key to scaling Torc workflows from hundreds to thousands of jobs efficiently!
Next Steps
- Try modifying the demo workflow to have 3 or more stages
- Experiment with adding validation logic to barrier jobs
- Check out Advanced Parameterization for creating complex multi-stage pipelines
- Learn about Workflow Actions for conditional execution between stages
Tutorial 3: User Data Dependencies
This tutorial teaches you how to pass structured data (JSON) between jobs using Torc's user_data feature—an alternative to file-based dependencies that stores data directly in the database.
Learning Objectives
By the end of this tutorial, you will:
- Understand what user_data is and when to use it instead of files
- Learn how to define user_data entries and reference them in jobs
- Know how to update user_data from within a job
- See how user_data creates implicit dependencies (like files)
Prerequisites
- Completed Tutorial 2: Diamond Workflow
- Torc server running
jqcommand-line tool installed (for JSON parsing)
What is User Data?
User data is Torc's mechanism for passing small, structured data between jobs without creating actual files. The data is stored in the Torc database and can be:
- JSON objects (configurations, parameters)
- Arrays
- Simple values (strings, numbers)
Like files, user_data creates implicit dependencies: a job that reads user_data will be blocked until the job that writes it completes.
User Data vs Files
| Feature | User Data | Files |
|---|---|---|
| Storage | Torc database | Filesystem |
| Size | Small (KB) | Any size |
| Format | JSON | Any format |
| Access | Via torc user-data CLI | Direct file I/O |
| Best for | Config, params, metadata | Datasets, binaries, logs |
Step 1: Create the Workflow Specification
Save as user_data_workflow.yaml:
name: config_pipeline
description: Jobs that pass configuration via user_data
jobs:
- name: generate_config
command: |
echo '{"learning_rate": 0.001, "batch_size": 32, "epochs": 10}' > /tmp/config.json
torc user-data update ${user_data.output.ml_config} \
--data "$(cat /tmp/config.json)"
resource_requirements: minimal
- name: train_model
command: |
echo "Training with config:"
torc user-data get ${user_data.input.ml_config} | jq '.data'
# In a real workflow: python train.py --config="${user_data.input.ml_config}"
resource_requirements: gpu_large
- name: evaluate_model
command: |
echo "Evaluating with config:"
torc user-data get ${user_data.input.ml_config} | jq '.data'
# In a real workflow: python evaluate.py --config="${user_data.input.ml_config}"
resource_requirements: gpu_small
user_data:
- name: ml_config
data: null # Will be populated by generate_config job
resource_requirements:
- name: minimal
num_cpus: 1
memory: 1g
runtime: PT5M
- name: gpu_small
num_cpus: 4
num_gpus: 1
memory: 16g
runtime: PT1H
- name: gpu_large
num_cpus: 8
num_gpus: 2
memory: 32g
runtime: PT4H
Understanding the Specification
Key elements:
user_data:section - Defines data entries, similar tofiles:data: null- Initial value; will be populated by a job${user_data.output.ml_config}- Job will write to this user_data (creates it)${user_data.input.ml_config}- Job reads from this user_data (creates dependency)
The dependency flow:
generate_configoutputsml_config→ runs firsttrain_modelandevaluate_modelinputml_config→ blocked until step 1 completes- After
generate_configfinishes, both become ready and can run in parallel
Step 2: Create and Initialize the Workflow
# Create the workflow
WORKFLOW_ID=$(torc workflows create user_data_workflow.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
# Initialize jobs
torc workflows initialize-jobs $WORKFLOW_ID
Step 3: Check Initial State
Before running, examine the user_data:
# Check user_data - should be null
torc user-data list $WORKFLOW_ID
Output:
╭────┬───────────┬──────┬─────────────╮
│ ID │ Name │ Data │ Workflow ID │
├────┼───────────┼──────┼─────────────┤
│ 1 │ ml_config │ null │ 1 │
╰────┴───────────┴──────┴─────────────╯
Check job statuses:
torc jobs list $WORKFLOW_ID
You should see:
generate_config: ready (no input dependencies)train_model: blocked (waiting forml_config)evaluate_model: blocked (waiting forml_config)
Step 4: Run the Workflow
torc run $WORKFLOW_ID
Step 5: Observe the Data Flow
After generate_config completes, check the updated user_data:
torc user-data list $WORKFLOW_ID -f json | jq '.[] | {name, data}'
Output:
{
"name": "ml_config",
"data": {
"learning_rate": 0.001,
"batch_size": 32,
"epochs": 10
}
}
The data is now stored in the database. At this point:
train_modelandevaluate_modelunblock- Both can read the configuration and run in parallel
Step 6: Verify Completion
After the workflow completes:
torc results list $WORKFLOW_ID
All three jobs should show return code 0.
How User Data Dependencies Work
The mechanism is identical to file dependencies:
| Syntax | Meaning | Effect |
|---|---|---|
${user_data.input.name} | Job reads this data | Creates dependency on producer |
${user_data.output.name} | Job writes this data | Satisfies dependencies |
Torc substitutes these variables with the actual user_data ID at runtime, and the torc user-data
CLI commands use that ID to read/write the data.
Accessing User Data in Your Code
From within a job, you can:
Read user_data:
# Get the full record
torc user-data get $USER_DATA_ID
# Get just the data field
torc user-data get $USER_DATA_ID | jq '.data'
# Save to a file for your application
torc user-data get $USER_DATA_ID | jq '.data' > config.json
Write user_data:
# Update with JSON data
torc user-data update $USER_DATA_ID --data '{"key": "value"}'
# Update from a file
torc user-data update $USER_DATA_ID --data "$(cat results.json)"
What You Learned
In this tutorial, you learned:
- ✅ What user_data is: structured data stored in the Torc database
- ✅ When to use it: configurations, parameters, metadata (not large files)
- ✅ How to define user_data entries with the
user_data:section - ✅ How
${user_data.input.*}and${user_data.output.*}create dependencies - ✅ How to read and write user_data from within jobs
Common Patterns
Dynamic Configuration Generation
jobs:
- name: analyze_data
command: |
# Analyze data and determine optimal parameters
OPTIMAL_LR=$(python analyze.py --find-optimal-lr)
torc user-data update ${user_data.output.optimal_params} \
--data "{\"learning_rate\": $OPTIMAL_LR}"
Collecting Results from Multiple Jobs
jobs:
- name: worker_{i}
command: |
RESULT=$(python process.py --id {i})
torc user-data update ${user_data.output.result_{i}} --data "$RESULT"
parameters:
i: "1:10"
- name: aggregate
command: |
# Collect all results
for i in $(seq 1 10); do
torc user-data get ${user_data.input.result_$i} >> all_results.json
done
python aggregate.py all_results.json
Next Steps
- Tutorial 4: Simple Parameterization - Create parameter sweeps
- Tutorial 5: Advanced Parameterization - Multi-dimensional grid searches
Monitoring & Debugging
Tools and techniques for monitoring workflows and debugging issues.
- Web Dashboard - Browser-based workflow monitoring
- Terminal UI (TUI) - Interactive terminal interface
- Workflow Reports - Generating workflow summaries
- Resource Monitoring - Tracking resource usage
- Debugging Workflows - Troubleshooting workflow issues
- Working with Logs - Accessing and analyzing logs
Web Dashboard (torc-dash)
The Torc Dashboard (torc-dash) provides a modern web-based interface for monitoring and managing
workflows, offering an intuitive alternative to the command-line interface.
Overview
torc-dash is a Rust-based web application that allows you to:
- Monitor workflows and jobs with real-time status updates
- Create and run workflows by uploading specification files (YAML, JSON, JSON5, KDL)
- Visualize workflow DAGs with interactive dependency graphs
- Debug failed jobs with integrated log file viewer
- Generate resource plots from time series monitoring data
- Manage torc-server start/stop in standalone mode
- Live event streaming via Server-Sent Events (SSE) for real-time job and compute node events
Installation
Building from Source
torc-dash is built as part of the Torc workspace:
# Build torc-dash
cargo build --release -p torc-dash
# Binary location
./target/release/torc-dash
Prerequisites
- A running
torc-server(or use--standalonemode to auto-start one) - The
torcCLI binary in your PATH (for workflow execution features)
Running the Dashboard
Quick Start (Standalone Mode)
The easiest way to get started is standalone mode, which automatically starts torc-server:
torc-dash --standalone
This will:
- Start
torc-serveron an automatically-detected free port - Start the dashboard on http://127.0.0.1:8090
- Configure the dashboard to connect to the managed server
Connecting to an Existing Server
If you already have torc-server running:
# Use default API URL (http://localhost:8080/torc-service/v1)
torc-dash
# Specify custom API URL
torc-dash --api-url http://myserver:9000/torc-service/v1
# Or use environment variable
export TORC_API_URL="http://myserver:9000/torc-service/v1"
torc-dash
Command-Line Options
Options:
-p, --port <PORT> Dashboard port [default: 8090]
--host <HOST> Dashboard host [default: 127.0.0.1]
-a, --api-url <API_URL> Torc server API URL [default: http://localhost:8080/torc-service/v1]
--torc-bin <PATH> Path to torc CLI binary [default: torc]
--torc-server-bin Path to torc-server binary [default: torc-server]
--standalone Auto-start torc-server alongside dashboard
--server-port <PORT> Server port in standalone mode (0 = auto-detect) [default: 0]
--database <PATH> Database path for standalone server
--completion-check-interval-secs <SECS> Server polling interval [default: 5]
Features
Workflows Tab
The main workflows view provides:
- Workflow list with ID, name, timestamp, user, and description
- Create Workflow button to upload new workflow specifications
- Quick actions for each workflow:
- View details and DAG visualization
- Initialize/reinitialize workflow
- Run locally or submit to scheduler
- Delete workflow
Creating Workflows
Click "Create Workflow" to open the creation dialog:
- Upload a file: Drag and drop or click to select a workflow specification file
- Supports YAML, JSON, JSON5, and KDL formats
- Or enter a file path: Specify a path on the server filesystem
- Click "Create" to register the workflow
Details Tab
Explore workflow components with interactive tables:
- Jobs: View all jobs with status, name, command, and dependencies
- Files: Input/output files with paths and timestamps
- User Data: Key-value data passed between jobs
- Results: Execution results with return codes and resource metrics
- Compute Nodes: Available compute resources
- Resource Requirements: CPU, memory, GPU specifications
- Schedulers: Slurm scheduler configurations
Features:
- Workflow selector: Filter by workflow
- Column sorting: Click headers to sort
- Row filtering: Type in filter boxes (supports
column:valuesyntax) - Auto-refresh: Toggle automatic updates
DAG Visualization
Click "View" on any workflow to see an interactive dependency graph:
- Nodes represent jobs, colored by status
- Edges show dependencies (file-based and explicit)
- Zoom, pan, and click nodes for details
- Legend shows status colors
Debugging Tab
Investigate failed jobs with the integrated debugger:
- Select a workflow
- Configure output directory (where logs are stored)
- Toggle "Show only failed jobs" to focus on problems
- Click "Generate Report" to fetch results
- Click any job row to view its log files:
- stdout: Standard output from the job
- stderr: Error output and stack traces
- Copy file paths with one click
Events Tab (SSE Live Streaming)
Monitor workflow activity in real-time using Server-Sent Events (SSE):
- Live event streaming - events appear instantly without polling
- Connection status indicator - shows Live/Reconnecting/Disconnected status
- Event types displayed:
job_started/job_completed/job_failed- Job lifecycle eventscompute_node_started/compute_node_stopped- Worker node lifecycleworkflow_started/workflow_reinitialized- Workflow initialization eventsscheduler_node_created- Slurm scheduler events
- Clear button to reset the event list
- Auto-reconnect on connection loss
Resource Plots Tab
Visualize CPU and memory usage over time:
- Enter a base directory containing resource database files
- Click "Scan for Databases" to find
.dbfiles - Select databases to plot
- Click "Generate Plots" for interactive Plotly charts
Requires workflows run with granularity: "time_series" in resource_monitor config.
Configuration Tab
Server Management
Start and stop torc-server directly from the dashboard:
- Server Port: Port to listen on (0 = auto-detect free port)
- Database Path: SQLite database file location
- Completion Check Interval: How often to check for job completions
- Log Level: Server logging verbosity
Click "Start Server" to launch, "Stop Server" to terminate.
API Configuration
- API URL: Torc server endpoint
- Test Connection: Verify connectivity
Settings are saved to browser local storage.
Common Usage Patterns
Running a Workflow
- Navigate to Workflows tab
- Click Create Workflow
- Upload your specification file
- Click Create
- Click Initialize on the new workflow
- Click Run Locally (or Submit for Slurm)
- Monitor progress in the Details tab or Events tab
Debugging a Failed Workflow
- Go to the Debugging tab
- Select the workflow
- Check "Show only failed jobs"
- Click Generate Report
- Click on a failed job row
- Review the stderr tab for error messages
- Check stdout for context
Monitoring Active Jobs
- Open Details tab
- Select "Jobs" and your workflow
- Enable Auto-refresh
- Watch job statuses update in real-time
Security Considerations
- Network Access: By default, binds to 127.0.0.1 (localhost only)
- Remote Access: Use
--host 0.0.0.0with caution; consider a reverse proxy with HTTPS - Authentication: Torc server supports htpasswd-based authentication (see Authentication)
Troubleshooting
Cannot Connect to Server
- Verify torc-server is running:
curl http://localhost:8080/torc-service/v1/workflows - Check the API URL in Configuration tab
- In standalone mode, check server output for startup errors
Workflow Creation Fails
- Ensure workflow specification is valid YAML/JSON/KDL
- Check file paths are accessible from the server
- Review browser console for error details
Resource Plots Not Showing
- Verify workflow used
granularity: "time_series"mode - Confirm
.dbfiles exist in the specified directory - Check that database files contain data
Standalone Mode Server Won't Start
- Verify
torc-serverbinary is in PATH or specify--torc-server-bin - Check if the port is already in use
- Review console output for error messages
Architecture
torc-dash is a self-contained Rust binary with:
- Axum web framework for HTTP server
- Embedded static assets (HTML, CSS, JavaScript)
- API proxy to forward requests to torc-server
- CLI integration for workflow operations
The frontend uses vanilla JavaScript with:
- Cytoscape.js for DAG visualization
- Plotly.js for resource charts
- Custom components for tables and forms
Next Steps
- Dashboard Deployment Tutorial - Detailed deployment scenarios
- Authentication - Secure your deployment
- Server Deployment - Production server configuration
Terminal User Interface (TUI)
The Torc TUI provides a full-featured terminal interface for managing workflows, designed for HPC users working in terminal-over-SSH environments.
Quick Start
# Option 1: Connect to an existing server
torc-server run & # Start server in background
torc tui # Launch the TUI
# Option 2: Standalone mode (auto-starts server)
torc tui --standalone
# Option 3: Start TUI without server (manual connection)
torc tui # Shows warning, use 'S' to start server
Standalone Mode
Use --standalone for single-machine development or testing:
# Basic standalone mode
torc tui --standalone
# Custom port
torc tui --standalone --port 8090
# Custom database location
torc tui --standalone --database /path/to/workflows.db
In standalone mode, the TUI automatically starts a torc-server process with the specified
configuration.
Features
- Workflow Management: Create, initialize, run, submit, cancel, reset, and delete workflows
- Job Management: View details, logs, cancel, terminate, or retry jobs
- Real-time Monitoring: Auto-refresh, manual refresh, color-coded status
- Live Event Streaming: Server-Sent Events (SSE) for real-time job and compute node events
- Server Management: Start/stop torc-server directly from the TUI
- File Viewing: Preview workflow files with search and navigation
- DAG Visualization: Text-based dependency graph
Interface Overview
When the TUI starts, you'll see:
┌─ Torc Management Console ────────────────────────────────────────┐
│ ?: help | n: new | i: init | I: reinit | R: reset | x: run ... │
└──────────────────────────────────────────────────────────────────┘
┌─ Server ─────────────────────────────────────────────────────────┐
│ http://localhost:8080/torc-service/v1 S: start | K: stop | O: output │
└──────────────────────────────────────────────────────────────────┘
┌─ User Filter ────────────────────────────────────────────────────┐
│ Current: yourname (press 'w' to change, 'a' for all users) │
└──────────────────────────────────────────────────────────────────┘
┌─ Workflows [FOCUSED] ────────────────────────────────────────────┐
│ >> 1 | my-workflow | yourname | Example workflow │
│ 2 | data-pipeline | yourname | Data processing pipeline │
└──────────────────────────────────────────────────────────────────┘
Basic Navigation
| Key | Action |
|---|---|
↑ / ↓ | Move up/down in the current table |
← / → | Switch focus between Workflows and Details panes |
Tab | Switch between detail tabs (Jobs → Files → Events → Results → DAG) |
Enter | Load details for selected workflow |
q | Quit (or close popup/dialog) |
? | Show help popup with all keybindings |
Workflow Actions
Select a workflow and use these keys:
| Key | Action | Description |
|---|---|---|
n | New | Create workflow from spec file |
i | Initialize | Set up job dependencies, mark ready jobs |
I | Re-initialize | Reset and re-initialize (prompts if output files exist) |
R | Reset | Reset all job statuses |
x | Run | Run workflow locally (shows real-time output) |
s | Submit | Submit to HPC scheduler (Slurm) |
C | Cancel | Cancel running workflow |
d | Delete | Delete workflow (destructive!) |
All destructive actions show a confirmation dialog.
Handling Existing Output Files
When initializing or re-initializing a workflow, if existing output files are detected, the TUI will
show a confirmation dialog listing the files that will be deleted. Press y to proceed with
--force or n to cancel.
Job Management
Navigate to the Jobs tab (→ then Tab if needed) to manage individual jobs:
| Key | Action |
|---|---|
Enter | View job details |
l | View job logs (stdout/stderr) |
c | Cancel job |
t | Terminate job |
y | Retry failed job |
f | Filter jobs by column |
Job Status Colors
- Green: Completed
- Yellow: Running
- Red: Failed
- Magenta: Canceled/Terminated
- Blue: Pending/Scheduled
- Cyan: Ready
- Gray: Blocked
Log Viewer
Press l on a job to view its logs:
| Key | Action |
|---|---|
Tab | Switch between stdout and stderr |
↑ / ↓ | Scroll one line |
PgUp / PgDn | Scroll 20 lines |
g / G | Jump to top / bottom |
/ | Start search |
n / N | Next / previous search match |
q | Close log viewer |
File Viewer
Navigate to the Files tab and press Enter on a file to view its contents. The file viewer
supports:
- Files up to 1MB
- Binary files show a hex dump preview
- Same navigation keys as the log viewer
Events Tab (SSE Live Streaming)
The Events tab shows real-time events from the server using Server-Sent Events (SSE). When you select a workflow and navigate to the Events tab, the TUI establishes a live connection to receive events as they occur.
Event types displayed:
job_started/job_completed/job_failed- Job lifecycle eventscompute_node_started/compute_node_stopped- Worker node lifecycleworkflow_started/workflow_reinitialized- Workflow initialization eventsscheduler_node_created- Slurm scheduler events
Features:
- Events appear instantly without polling
- Newest events shown at the top
- Filter by Event Type or Data content using
f - Title shows "[SSE Live]" indicator when connected
Server Management
The TUI can start and manage a torc-server instance:
| Key | Action |
|---|---|
S | Start torc-server |
K | Stop/Kill server |
O | Show server output |
The server status indicator in the connection bar shows:
●(green): Server is running (managed by TUI)○(yellow): Server was started but has stopped- No indicator: External server (not managed by TUI)
Connection Settings
| Key | Action |
|---|---|
u | Change server URL |
w | Change user filter |
a | Toggle show all users |
Auto-Refresh
Press A to toggle auto-refresh (30-second interval). When enabled, the workflow list and details
refresh automatically.
Configuration
The TUI respects Torc's layered configuration system:
- Interactive changes in TUI (press
uto change server URL) - Environment variables (
TORC_CLIENT__API_URL) - Local config file (
./torc.toml) - User config file (
~/.config/torc/config.toml) - System config file (
/etc/torc/config.toml) - Default values
Troubleshooting
"Could not connect to server"
- Ensure the Torc server is running:
torc-server run - Check the server URL: press
uto update if needed - Verify network connectivity
"No log content available"
Logs may not be available if:
- The job hasn't run yet
- You're on a different machine than where jobs ran
- The output directory is in a different location
Screen rendering issues
- Ensure your terminal supports UTF-8 and 256 colors
- Try resizing your terminal window
- Press
rto force a refresh
TUI vs Web Dashboard
| Feature | TUI (torc tui) | Web (torc-dash) |
|---|---|---|
| Environment | Terminal/SSH | Web browser |
| Startup | Instant | ~2 seconds |
| Dependencies | None (single binary) | None (single binary) |
| Workflow actions | Yes | Yes |
| Job actions | Yes | Yes |
| Log viewing | Yes | Yes |
| DAG visualization | Text-based | Interactive graph |
| Resource plots | Planned | Yes |
Choose the TUI for: SSH sessions, HPC environments, quick operations, low-bandwidth connections.
Choose torc-dash for: Rich visualizations, resource plots, team dashboards.
How to Generate Workflow Reports
This guide shows how to check workflow completion status and generate summary reports.
Check if a Workflow is Complete
Before generating reports, verify that your workflow has finished:
torc workflows is-complete <workflow_id>
If you omit the workflow ID, you'll be prompted to select from your workflows:
torc workflows is-complete
Example output:
Workflow 42 completion status:
Is Complete: true
Is Canceled: false
Needs Completion Script: false
For JSON output:
torc -f json workflows is-complete <workflow_id>
Generate a Workflow Summary
Once a workflow is complete, generate a summary report:
torc reports summary <workflow_id>
If you omit the workflow ID, you'll be prompted to select from your workflows:
torc reports summary
Example output:
Workflow Summary
================
Workflow ID: 42
Name: data_processing_pipeline
User: jsmith
Job Status (total: 100):
Completed: 95 ✓
Failed: 5 ✗
Total Execution Time: 2h 30m 15s
Walltime: 3h 15m 42s
If all jobs succeeded:
Workflow Summary
================
Workflow ID: 42
Name: simulation_run
User: jsmith
Job Status (total: 50):
Completed: 50 ✓
Total Execution Time: 45m 30s
✓ All jobs completed successfully!
Only non-zero status counts are displayed.
Continuous Monitoring
This command can be very convenient, but be mindful of your workflow size (number of jobs) and network load if you are using a shared server.
watch -n 10 torc reports summary <workflow_id>
JSON Output
This is useful for scripts:
torc -f json reports summary <workflow_id>
{
"workflow_id": 42,
"workflow_name": "data_processing_pipeline",
"workflow_user": "jsmith",
"total_jobs": 100,
"jobs_by_status": {
"uninitialized": 0,
"blocked": 0,
"ready": 0,
"pending": 0,
"running": 0,
"completed": 95,
"failed": 5,
"canceled": 0,
"terminated": 0,
"disabled": 0
},
"total_exec_time_minutes": 150.25,
"total_exec_time_formatted": "2h 30m 15s",
"walltime_seconds": 11742.0,
"walltime_formatted": "3h 15m 42s"
}
Use in Scripts
Combine these commands in automation scripts:
#!/bin/bash
WORKFLOW_ID=$1
# Check completion status
if torc -f json workflows is-complete "$WORKFLOW_ID" | jq -e '.is_complete' > /dev/null; then
echo "Workflow complete, generating summary..."
torc -f json reports summary "$WORKFLOW_ID" > "summary_${WORKFLOW_ID}.json"
else
echo "Workflow not yet complete"
exit 1
fi
Check Resource Utilization
After a workflow completes, check if any jobs exceeded their resource limits:
torc reports check-resource-utilization <workflow_id>
Example output when jobs stayed within limits:
Resource Utilization Report for Workflow 42
===========================================
All 50 jobs completed within resource limits.
Example output when jobs exceeded limits:
Resource Utilization Report for Workflow 42
===========================================
Jobs exceeding resource limits:
Job ID Name Memory Limit Peak Memory Status
------ ------------- ------------ ----------- ------
123 train_model_1 16g 18.2g EXCEEDED
124 train_model_2 16g 17.8g EXCEEDED
Recommendation: Increase memory allocation for affected jobs.
This helps identify jobs that may have been killed due to out-of-memory conditions or that are at risk of failure in future runs.
Related Commands
torc workflows status <id>- View current job status countstorc results list <id>- List individual job resultstorc reports check-resource-utilization <id>- Check for resource violationstorc reports results <id>- Generate detailed results with log file paths
Next Steps
- Resource Monitoring - Track CPU and memory usage
- Debugging Workflows - Troubleshoot failed jobs
How to Monitor Resource Usage
This guide shows how to track CPU and memory usage of your workflow jobs and identify resource requirement mismatches.
Enable Resource Monitoring
Resource monitoring is enabled by default for all workflows. To explicitly configure it, add a
resource_monitor section to your workflow specification:
name: "My Workflow"
resource_monitor:
enabled: true
granularity: "summary" # or "time_series"
sample_interval_seconds: 5
jobs:
# ... your jobs
To disable monitoring when creating a workflow:
torc workflows create my_workflow.yaml --no-resource-monitoring
View Summary Metrics
For workflows using summary mode (default), view resource metrics with:
torc results list <workflow_id>
The output includes columns for peak and average CPU and memory usage.
Check for Resource Violations
Use check-resource-utilization to identify jobs that exceeded their specified requirements:
# Check latest run
torc reports check-resource-utilization <workflow_id>
# Check a specific run
torc reports check-resource-utilization <workflow_id> --run-id <run_id>
# Show all jobs, not just violations
torc reports check-resource-utilization <workflow_id> --all
Example output:
⚠ Found 3 resource over-utilization violations:
Job ID | Job Name | Resource | Specified | Peak Used | Over-Utilization
-------|------------------|----------|-----------|-----------|------------------
15 | train_model | Memory | 8.00 GB | 10.50 GB | +31.3%
15 | train_model | Runtime | 2h 0m 0s | 2h 45m 0s | +37.5%
16 | large_preprocess | CPU | 800% | 950.5% | +18.8%
Adjust Resource Requirements
After identifying violations, update your workflow specification:
# Before: job used 10.5 GB but was allocated 8 GB
resource_requirements:
- name: training
memory: 8g
runtime: PT2H
# After: increased with buffer
resource_requirements:
- name: training
memory: 12g # 10.5 GB peak + 15% buffer
runtime: PT3H # 2h 45m actual + buffer
Guidelines for buffers:
- Memory: Add 10-20% above peak usage
- Runtime: Add 15-30% above actual duration
- CPU: Round up to next core count
Enable Time Series Monitoring
For detailed resource analysis over time, switch to time series mode:
resource_monitor:
granularity: "time_series"
sample_interval_seconds: 2
This creates a SQLite database with samples at regular intervals.
Generate Resource Plots
Create interactive visualizations from time series data:
# Generate all plots
torc plot-resources output/resource_utilization/resource_metrics_*.db \
-o plots/
# Generate plots for specific jobs
torc plot-resources output/resource_utilization/resource_metrics_*.db \
-o plots/ \
--job-ids 15,16
The tool generates:
- Individual job plots showing CPU, memory, and process count over time
- Overview plots comparing all jobs
- Summary dashboard with bar charts
Query Time Series Data Directly
Access the SQLite database for custom analysis:
sqlite3 -table output/resource_utilization/resource_metrics_1_1.db
-- View samples for a specific job
SELECT job_id, timestamp, cpu_percent, memory_bytes, num_processes
FROM job_resource_samples
WHERE job_id = 1
ORDER BY timestamp;
-- View job metadata
SELECT * FROM job_metadata;
Troubleshooting
No metrics recorded
- Check that monitoring wasn't disabled with
--no-resource-monitoring - Ensure jobs run long enough for at least one sample (default: 5 seconds)
Time series database not created
- Verify the output directory is writable
- Confirm
granularity: "time_series"is set in the workflow spec
Missing child process metrics
- Decrease
sample_interval_secondsto catch short-lived processes
Next Steps
- Resource Monitoring Reference - Configuration options and database schema
- Managing Resources - Define job resource requirements
Debugging Workflows
When workflows fail or produce unexpected results, Torc provides comprehensive debugging tools to help you identify and resolve issues. The primary debugging tools are:
torc results list: Prints a table of return codes for each job execution (non-zero means failure)torc reports results: Generates a detailed JSON report containing job results and all associated log file pathstorc logs analyze <output-dir>: Analyzes log files for known error patterns (see Working with Logs)- torc-dash Debug tab: Interactive web interface for visual debugging with log file viewer
Overview
Torc automatically captures return codes and multiple log files for each job execution:
- Job stdout/stderr: Output from your job commands
- Job runner logs: Internal logs from the Torc job runner
- Slurm logs: Additional logs when using Slurm scheduler (see Debugging Slurm Workflows)
The reports results command consolidates all this information into a single JSON report, making it
easy to locate and examine relevant log files for debugging.
Quick Start
View the job return codes in a table:
torc results list <workflow_id>
Results for workflow ID 2:
╭────┬────────┬───────┬────────┬─────────────┬───────────┬──────────┬────────────┬──────────────────────────┬────────╮
│ ID │ Job ID │ WF ID │ Run ID │ Return Code │ Exec Time │ Peak Mem │ Peak CPU % │ Completion Time │ Status │
├────┼────────┼───────┼────────┼─────────────┼───────────┼──────────┼────────────┼──────────────────────────┼────────┤
│ 4 │ 6 │ 2 │ 1 │ 1 │ 1.01 │ 73.8MB │ 21.9% │ 2025-11-13T13:35:43.289Z │ Done │
│ 5 │ 4 │ 2 │ 1 │ 0 │ 1.01 │ 118.1MB │ 301.3% │ 2025-11-13T13:35:43.393Z │ Done │
│ 6 │ 5 │ 2 │ 1 │ 0 │ 1.01 │ 413.6MB │ 19.9% │ 2025-11-13T13:35:43.499Z │ Done │
╰────┴────────┴───────┴────────┴─────────────┴───────────┴──────────┴────────────┴──────────────────────────┴────────╯
Total: 3 results
View only failed jobs:
torc results list <workflow_id> --failed
Results for workflow ID 2:
╭────┬────────┬───────┬────────┬─────────────┬───────────┬──────────┬────────────┬──────────────────────────┬────────╮
│ ID │ Job ID │ WF ID │ Run ID │ Return Code │ Exec Time │ Peak Mem │ Peak CPU % │ Completion Time │ Status │
├────┼────────┼───────┼────────┼─────────────┼───────────┼──────────┼────────────┼──────────────────────────┼────────┤
│ 4 │ 6 │ 2 │ 1 │ 1 │ 1.01 │ 73.8MB │ 21.9% │ 2025-11-13T13:35:43.289Z │ Done │
╰────┴────────┴───────┴────────┴─────────────┴───────────┴──────────┴────────────┴──────────────────────────┴────────╯
Generate a debugging report for a workflow:
# Generate report for a specific workflow
torc reports results <workflow_id>
# Specify custom output directory (default: "output")
torc reports results <workflow_id> --output-dir /path/to/output
# Include all workflow runs (default: only latest run)
torc reports results <workflow_id> --all-runs
# Interactive workflow selection (if workflow_id omitted)
torc reports results
The command outputs a comprehensive JSON report to stdout. Redirect it to a file for easier analysis:
torc reports results <workflow_id> > debug_report.json
Report Structure
Top-Level Fields
The JSON report includes workflow-level information:
{
"workflow_id": 123,
"workflow_name": "my_pipeline",
"workflow_user": "researcher",
"all_runs": false,
"total_results": 5,
"results": [...]
}
Fields:
workflow_id: Unique identifier for the workflowworkflow_name: Human-readable workflow nameworkflow_user: Owner of the workflowall_runs: Whether report includes all historical runs or just the latesttotal_results: Number of job results in the reportresults: Array of individual job result records
Job Result Records
Each entry in the results array contains detailed information about a single job execution:
{
"job_id": 456,
"job_name": "preprocess_data",
"status": "Done",
"run_id": 1,
"return_code": 0,
"completion_time": "2024-01-15T14:30:00.000Z",
"exec_time_minutes": 5.2,
"compute_node_id": 789,
"compute_node_type": "local",
"job_stdout": "output/job_stdio/job_456.o",
"job_stderr": "output/job_stdio/job_456.e",
"job_runner_log": "output/job_runner_hostname_123_1.log"
}
Core Fields:
job_id: Unique identifier for the jobjob_name: Human-readable job name from workflow specstatus: Job status (Done, Terminated, Failed, etc.)run_id: Workflow run number (increments on reinitialization)return_code: Exit code from job command (0 = success)completion_time: ISO 8601 timestamp when job completedexec_time_minutes: Duration of job execution in minutes
Compute Node Fields:
compute_node_id: ID of the compute node that executed the jobcompute_node_type: Type of compute node ("local" or "slurm")
Log File Paths
The report includes paths to all log files associated with each job. The specific files depend on the compute node type.
Local Runner Log Files
For jobs executed by the local job runner (compute_node_type: "local"):
{
"job_stdout": "output/job_stdio/job_456.o",
"job_stderr": "output/job_stdio/job_456.e",
"job_runner_log": "output/job_runner_hostname_123_1.log"
}
Log File Descriptions:
-
job_stdout (
output/job_stdio/job_<workflow_id>_<job_id>_<run_id>.o):- Standard output from your job command
- Contains print statements, normal program output
- Use for: Checking expected output, debugging logic errors
-
job_stderr (
output/job_stdio/job_<workflow_id>_<job_id>_<run_id>.e):- Standard error from your job command
- Contains error messages, warnings, stack traces
- Use for: Investigating crashes, exceptions, error messages
-
job_runner_log (
output/job_runner_<hostname>_<workflow_id>_<run_id>.log):- Internal Torc job runner logging
- Shows job lifecycle events, resource monitoring, process management
- Use for: Understanding Torc's job execution behavior, timing issues
Log path format conventions:
- Job stdio logs use job ID in filename
- Runner logs use hostname, workflow ID, and run ID
- All paths are relative to the specified
--output-dir
Slurm Runner Log Files
For jobs executed via Slurm scheduler (compute_node_type: "slurm"), additional log files are
available including Slurm stdout/stderr, environment logs, and dmesg logs.
See Debugging Slurm Workflows for detailed information about Slurm-specific log files and debugging tools.
Using the torc-dash Debugging Tab
The torc-dash web interface provides an interactive Debugging tab for visual debugging of workflow jobs. This is often the quickest way to investigate failed jobs without using command-line tools.
Accessing the Debugging Tab
- Start torc-dash (standalone mode recommended for quick setup):
  torc-dash --standalone
- Open your browser to http://localhost:8090
- Select a workflow from the dropdown in the sidebar
- Click the Debugging tab in the navigation
Features
Job Results Report
The Debugging tab provides a report generator with the following options:
- Output Directory: Specify where job logs are stored (default: output). This must match the directory used during workflow execution.
- Include all runs: Check this to see results from all workflow runs, not just the latest. Useful for comparing job behavior across reinitializations.
- Show only failed jobs: Filter to display only jobs with non-zero return codes. This is checked by default to help you focus on problematic jobs.
Click Generate Report to fetch job results from the server.
Job Results Table
After generating a report, the Debugging tab displays an interactive table showing:
- Job ID: Unique identifier for the job
- Job Name: Human-readable name from the workflow spec
- Status: Job completion status (Done, Terminated, etc.)
- Return Code: Exit code (0 = success, non-zero = failure)
- Execution Time: Duration in minutes
- Run ID: Which workflow run the result is from
Click any row to select a job and view its log files.
Log File Viewer
When you select a job from the table, the Log File Viewer displays:
- stdout tab: Standard output from the job command
  - Shows print statements and normal program output
  - Useful for checking expected behavior and debugging logic
- stderr tab: Standard error from the job command
  - Shows error messages, warnings, and stack traces
  - Primary location for investigating crashes and exceptions
Each tab includes:
- Copy Path button: Copy the full file path to clipboard
- File path display: Shows where the log file is located
- Scrollable content viewer: Dark-themed viewer for easy reading
Quick Debugging Workflow with torc-dash
- Open torc-dash and select your workflow from the sidebar
- Go to the Debugging tab
- Ensure "Show only failed jobs" is checked
- Click Generate Report
- Click on a failed job in the results table
- Review the stderr tab for error messages
- Check the stdout tab for context about what the job was doing
When to Use torc-dash vs CLI
Use torc-dash Debugging tab when:
- You want a visual, interactive debugging experience
- You need to quickly scan multiple failed jobs
- You're investigating jobs and want to easily switch between stdout/stderr
- You prefer not to construct jq queries manually
Use CLI tools (torc reports results) when:
- You need to automate failure detection in CI/CD
- You want to save reports for archival or version control
- You're working on a remote server without browser access
- You need to process results programmatically
Common Debugging Workflows
Investigating Failed Jobs
When a job fails, follow these steps:
- Generate the debug report:
  torc reports results <workflow_id> > debug_report.json
- Find the failed job using jq or a similar tool:
  # Find jobs with non-zero return codes
  jq '.results[] | select(.return_code != 0)' debug_report.json
  # Find jobs with a specific status
  jq '.results[] | select(.status == "Done")' debug_report.json
- Check the job's stderr for error messages:
  # Extract the stderr path for a specific job
  STDERR_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_stderr' debug_report.json)
  # View the error output
  cat "$STDERR_PATH"
- Review the job's stdout for context:
  STDOUT_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_stdout' debug_report.json)
  cat "$STDOUT_PATH"
- Check runner logs for execution issues:
  LOG_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_runner_log' debug_report.json)
  cat "$LOG_PATH"
Searching Log Files with Grep
Torc's log messages use a structured key=value format that makes them easy to search with grep.
This is especially useful for tracing specific jobs or workflows across multiple log files.
Search for all log entries related to a specific workflow:
# Find all log lines for workflow 123
grep -r "workflow_id=123" output/
# Find all log lines for workflow 123 in job runner logs only
grep -r "workflow_id=123" output/job_runner_*.log
Search for a specific job:
# Find all log lines for job 456
grep -r "job_id=456" output/
# Find log lines for job 456 with more context (2 lines before/after)
grep -r -C 2 "job_id=456" output/
Combine workflow and job searches:
# Find log lines for job 456 in workflow 123
grep -r "workflow_id=123" output/ | grep "job_id=456"
# Alternative using extended regex
grep -rE "workflow_id=123.*job_id=456" output/
Search for specific runs or attempts:
# Find all log lines for run 2 of workflow 123
grep -r "workflow_id=123" output/ | grep "run_id=2"
# Find retry attempts for a specific job
grep -r "job_id=456" output/ | grep "attempt_id="
# Find entries for a specific compute node
grep -r "compute_node_id=789" output/
Common log message patterns to search for:
# Find job start events
grep -r "Job started workflow_id=" output/
# Find job completion events
grep -r "Job completed workflow_id=" output/
# Find failed jobs
grep -r "status=failed" output/
# Find all job process completions with return codes
grep -r "Job process completed" output/ | grep "return_code="
Tip: Redirect grep output to a file for easier analysis of large result sets:
grep -r "workflow_id=123" output/ > workflow_123_logs.txt
Example: Complete Debugging Session
# 1. Generate report
torc reports results 123 > report.json
# 2. Check overall success/failure counts
echo "Total jobs: $(jq '.total_results' report.json)"
echo "Failed jobs: $(jq '[.results[] | select(.return_code != 0)] | length' report.json)"
# 3. List all failed jobs with their names
jq -r '.results[] | select(.return_code != 0) | "\(.job_id): \(.job_name) (exit code: \(.return_code))"' report.json
# Output:
# 456: process_batch_2 (exit code: 1)
# 789: validate_results (exit code: 2)
# 4. Examine stderr for first failure
jq -r '.results[] | select(.job_id == 456) | .job_stderr' report.json | xargs cat
# Output might show:
# FileNotFoundError: [Errno 2] No such file or directory: 'input/batch_2.csv'
# 5. Check if job dependencies completed successfully
# (The missing file might be an output from a previous job)
jq -r '.results[] | select(.job_name == "generate_batch_2") | "\(.status) (exit code: \(.return_code))"' report.json
Debugging Across Multiple Runs
When a workflow has been reinitialized multiple times, compare runs to identify regressions:
# Generate report with all historical runs
torc reports results <workflow_id> --all-runs > full_history.json
# Compare return codes across runs for a specific job
jq -r '.results[] | select(.job_name == "flaky_job") | "Run \(.run_id): exit code \(.return_code)"' full_history.json
# Output:
# Run 1: exit code 0
# Run 2: exit code 1
# Run 3: exit code 0
# Run 4: exit code 1
# Extract stderr paths for failed runs
jq -r '.results[] | select(.job_name == "flaky_job" and .return_code != 0) | "Run \(.run_id): \(.job_stderr)"' full_history.json
Log File Missing Warnings
The reports results command automatically checks for log file existence and prints warnings to
stderr if files are missing:
Warning: job stdout log file does not exist for job 456: output/job_stdio/job_456.o
Warning: job runner log file does not exist for job 456: output/job_runner_host1_123_1.log
Common causes of missing log files:
- Wrong output directory: Ensure --output-dir matches the directory used during workflow execution
- Logs not yet written: Job may still be running or failed to start
- Logs cleaned up: Files may have been manually deleted
- Path mismatch: Output directory moved or renamed after execution
Solution: Verify the output directory and ensure it matches what was passed to torc run or
torc slurm schedule-nodes.
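To verify which log paths from a report actually exist on disk, a small shell loop like the following can help (debug_report.json is an illustrative saved report; paths containing spaces would need a while-read loop instead):
# Print any stdout/stderr/runner log paths from the report that are missing on disk
for p in $(jq -r '.results[] | .job_stdout, .job_stderr, .job_runner_log' debug_report.json); do
    [ -f "$p" ] || echo "missing: $p"
done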
Output Directory Management
The --output-dir parameter must match the directory used during workflow execution:
Local Runner
# Execute workflow with specific output directory
torc run <workflow_id> /path/to/my_output
# Generate report using the same directory
torc reports results <workflow_id> --output-dir /path/to/my_output
Slurm Scheduler
# Submit jobs to Slurm with output directory
torc slurm schedule-nodes <workflow_id> --output-dir /path/to/my_output
# Generate report using the same directory
torc reports results <workflow_id> --output-dir /path/to/my_output
Default behavior: If --output-dir is not specified, both the runner and reports command
default to ./output.
Best Practices
- Generate reports after each run: Create a debug report immediately after workflow execution for easier troubleshooting
- Archive reports with logs: Store the JSON report alongside log files for future reference
  torc reports results "$WF_ID" > "output/report_${WF_ID}_$(date +%Y%m%d_%H%M%S).json"
- Use version control: Commit debug reports for important workflow runs to track changes over time
- Automate failure detection: Use the report in CI/CD pipelines to automatically detect and report failures (see the sketch after this list)
- Check warnings: Pay attention to warnings about missing log files - they often indicate configuration issues
- Combine with resource monitoring: Use reports results for log files and reports check-resource-utilization for performance issues
  # Check if a job failed due to resource constraints
  torc reports check-resource-utilization "$WF_ID"
  torc reports results "$WF_ID" > report.json
- Filter large reports: For workflows with many jobs, filter the report to focus on relevant jobs
  # Only include failed jobs in the filtered report
  jq '{workflow_id, workflow_name, results: [.results[] | select(.return_code != 0)]}' report.json
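A minimal sketch of the automated failure detection mentioned in the list above, suitable for a CI/CD step (WF_ID and the report path are illustrative; adjust to your pipeline):
#!/usr/bin/env bash
set -euo pipefail
WF_ID=123   # illustrative workflow ID, typically injected by the pipeline
torc reports results "$WF_ID" > report.json
FAILED=$(jq '[.results[] | select(.return_code != 0)] | length' report.json)
if [ "$FAILED" -gt 0 ]; then
    echo "Detected $FAILED failed job(s) in workflow $WF_ID"
    jq -r '.results[] | select(.return_code != 0) | "\(.job_id): \(.job_name) (exit \(.return_code))"' report.json
    exit 1
fi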
Troubleshooting Common Issues
"Output directory does not exist" Error
Cause: The specified --output-dir path doesn't exist.
Solution: Verify the directory exists and the path is correct:
ls -ld output/ # Check if directory exists
torc reports results <workflow_id> --output-dir "$(pwd)/output"
Empty Results Array
Cause: No job results exist for the workflow (jobs not yet executed or initialized).
Solution: Check workflow status and ensure jobs have been completed:
torc workflows status <workflow_id>
torc results list <workflow_id> # Verify results exist
All Log Paths Show Warnings
Cause: Output directory mismatch between execution and report generation.
Solution: Verify the output directory used during execution:
# Check where logs actually are
find . -name "job_*.o" -o -name "job_runner_*.log"
# Use correct output directory in report
torc reports results <workflow_id> --output-dir <correct_path>
Related Commands
- torc results list: View a summary of job results in table format
- torc workflows status: Check overall workflow status
- torc reports results: Generate a debug report with all log file paths
- torc reports check-resource-utilization: Analyze resource usage and find over-utilized jobs
- torc jobs list: View all jobs and their current status
- torc-dash: Launch the web interface with the interactive Debugging tab
- torc tui: Launch the terminal UI for workflow monitoring
See Also
- Working with Logs — Bundling and analyzing logs
- Debugging Slurm Workflows — Slurm-specific debugging tools
Working with Logs
Torc provides tools for bundling and analyzing workflow logs. These are useful for:
- Sharing logs with colleagues for help debugging
- Archiving completed workflow logs for later reference
- Scanning for errors across all log files at once
Log File Overview
Torc generates several types of log files during workflow execution:
| Log Type | Path Pattern | Contents |
|---|---|---|
| Job stdout | output/job_stdio/job_wf<id>_j<job>_r<run>.o | Standard output from job commands |
| Job stderr | output/job_stdio/job_wf<id>_j<job>_r<run>.e | Error output, stack traces |
| Job runner | output/job_runner_*.log | Torc job runner internal logs |
| Slurm stdout | output/slurm_output_wf<id>_sl<slurm_id>.o | Slurm job allocation output |
| Slurm stderr | output/slurm_output_wf<id>_sl<slurm_id>.e | Slurm-specific errors |
| Slurm env | output/slurm_env_*.log | Slurm environment variables |
| dmesg | output/dmesg_slurm_*.log | Kernel messages (on failure) |
For detailed information about log file contents, see Debugging Workflows and Debugging Slurm Workflows.
Bundling Logs
The torc logs bundle command packages all logs for a workflow into a compressed tarball:
# Bundle all logs for a workflow
torc logs bundle <workflow_id>
# Specify custom output directory (where logs are located)
torc logs bundle <workflow_id> --output-dir /path/to/output
# Save bundle to a specific directory
torc logs bundle <workflow_id> --bundle-dir /path/to/bundles
This creates a wf<id>.tar.gz file containing:
- All job stdout/stderr files (job_wf*_j*_r*.o/e)
- Job runner logs (job_runner_*.log)
- Slurm output files (slurm_output_wf*_sl*.o/e)
- Slurm environment logs (slurm_env_wf*_sl*.log)
- dmesg logs (dmesg_slurm_wf*_sl*.log)
- Bundle metadata (workflow info, collection timestamp)
Example: Sharing Logs
# Bundle workflow logs
torc logs bundle 123 --bundle-dir ./bundles
# Share the bundle
ls ./bundles/
# wf123.tar.gz
# Recipient can extract and analyze
tar -xzf wf123.tar.gz
torc logs analyze wf123/
Analyzing Logs
The torc logs analyze command scans log files for known error patterns:
# Analyze a log bundle tarball
torc logs analyze wf123.tar.gz
# Analyze a log directory directly (auto-detects workflow if only one present)
torc logs analyze output/
# Analyze a directory with multiple workflows (specify which one)
torc logs analyze output/ --workflow-id 123
Detected Error Patterns
The analyzer scans for common failure patterns including:
Memory Errors:
- Out of memory, OOM kills
- std::bad_alloc (C++)
- MemoryError (Python)
Slurm Errors:
- Time limit exceeded
- Node failures
- Preemption
GPU/CUDA Errors:
- CUDA out of memory
- GPU memory exceeded
Crashes:
- Segmentation faults
- Bus errors
- Signal kills
Python Errors:
- Tracebacks
- Import errors
File System Errors:
- No space left on device
- Permission denied
Network Errors:
- Connection refused/timed out
Example Output
Log Analysis Results
====================
Analyzing: output/
Files with detected errors:
output/job_stdio/job_wf123_j456_r1.e
Line 42: MemoryError: Unable to allocate 8.00 GiB
Severity: critical
Type: Python Memory Error
output/slurm_output_wf123_sl789.e
Line 15: slurmstepd: error: Detected 1 oom-kill event(s)
Severity: critical
Type: Out of Memory (OOM) Kill
Summary:
Total files scanned: 24
Files with errors: 2
Error types found: MemoryError, OOM Kill
Excluding Files
Environment variable files (slurm_env_*.log) are automatically excluded from error analysis since
they contain configuration data, not error logs.
Workflow: Bundle and Share
A common pattern when asking for help:
# 1. Bundle the workflow logs
torc logs bundle <workflow_id>
# 2. Analyze locally first to understand the issue
torc logs analyze wf<id>.tar.gz
# 3. Share the bundle with your colleague/support
# They can extract and analyze:
tar -xzf wf<id>.tar.gz
torc logs analyze wf<id>/
Related Commands
- torc reports results: Generate a JSON report with all log file paths
- torc results list: View a summary table of job return codes
- torc slurm parse-logs: Parse Slurm logs for error patterns (Slurm-specific)
- torc slurm sacct: Collect Slurm accounting data
See Also
- Debugging Workflows — General debugging workflow and log file details
- Debugging Slurm Workflows — Slurm-specific debugging tools
Reference
Comprehensive reference documentation.
- CLI Reference - Command-line interface documentation
- Job Parameterization - Parameter syntax and options
- Resource Requirements - Specifying job resources
- Environment Variables - Configuration via environment
- Resource Monitoring Database - Metrics database schema
- Configuration - Configuration file reference
- OpenAPI Specification - REST API documentation
CLI Reference
This documentation is automatically generated from the CLI help text.
To regenerate, run:
cargo run --bin generate-cli-docs --features "client,tui,plot_resources"
Command-Line Help for torc
This document contains the help content for the torc command-line program.
torc
Torc workflow orchestration system
Usage: torc [OPTIONS] <COMMAND>
Subcommands:
- run — Run a workflow locally (create from spec file or run existing workflow by ID)
- submit — Submit a workflow to scheduler (create from spec file or submit existing workflow by ID)
- submit-slurm — Submit a workflow to Slurm with auto-generated schedulers
- watch — Watch a workflow and automatically recover from failures
- workflows — Workflow management commands
- compute-nodes — Compute node management commands
- files — File management commands
- jobs — Job management commands
- job-dependencies — Job dependency and relationship queries
- resource-requirements — Resource requirements management commands
- events — Event management commands
- results — Result management commands
- user-data — User data management commands
- slurm — Slurm scheduler commands
- remote — Remote worker execution commands (SSH-based distributed execution)
- scheduled-compute-nodes — Scheduled compute node management commands
- hpc — HPC system profiles and partition information
- reports — Generate reports and analytics
- config — Manage configuration files and settings
- tui — Interactive terminal UI for managing workflows
- plot-resources — Generate interactive HTML plots from resource monitoring data
- completions — Generate shell completions
Options:
- --log-level <LOG_LEVEL> — Log level (error, warn, info, debug, trace)
- -f, --format <FORMAT> — Output format (table or json)
  Default value: table
- --url <URL> — URL of torc server
- --username <USERNAME> — Username for basic authentication
- --password <PASSWORD> — Password for basic authentication (will prompt if username provided but password not)
torc run
Run a workflow locally (create from spec file or run existing workflow by ID)
Usage: torc run [OPTIONS] <WORKFLOW_SPEC_OR_ID>
Arguments:
- <WORKFLOW_SPEC_OR_ID> — Path to workflow spec file (JSON/JSON5/YAML) or workflow ID
Options:
- --max-parallel-jobs <MAX_PARALLEL_JOBS> — Maximum number of parallel jobs to run concurrently
- --num-cpus <NUM_CPUS> — Number of CPUs available
- --memory-gb <MEMORY_GB> — Memory in GB
- --num-gpus <NUM_GPUS> — Number of GPUs available
- -p, --poll-interval <POLL_INTERVAL> — Job completion poll interval in seconds
- -o, --output-dir <OUTPUT_DIR> — Output directory for jobs
- --skip-checks — Skip validation checks (e.g., scheduler node requirements). Use with caution
  Default value: false
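For example (spec file, workflow ID, and paths are illustrative):
# Run a workflow directly from a spec file with a custom output directory
torc run my_workflow.yaml --output-dir /path/to/my_output
# Run an existing workflow by ID, capping parallelism
torc run 123 --max-parallel-jobs 8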
torc submit
Submit a workflow to scheduler (create from spec file or submit existing workflow by ID)
Requires the workflow to have an on_workflow_start action with schedule_nodes. For Slurm workflows without pre-configured schedulers, use submit-slurm instead.
Usage: torc submit [OPTIONS] <WORKFLOW_SPEC_OR_ID>
Arguments:
- <WORKFLOW_SPEC_OR_ID> — Path to workflow spec file (JSON/JSON5/YAML) or workflow ID
Options:
- -i, --ignore-missing-data — Ignore missing data (defaults to false)
  Default value: false
- --skip-checks — Skip validation checks (e.g., scheduler node requirements). Use with caution
  Default value: false
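For example (spec file and workflow ID are illustrative):
# Create and submit from a spec file (the spec must define an on_workflow_start schedule_nodes action)
torc submit my_workflow.yaml
# Submit an existing workflow by ID
torc submit 123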
torc submit-slurm
Submit a workflow to Slurm with auto-generated schedulers
Automatically generates Slurm schedulers based on job resource requirements and HPC profile.
WARNING: This command uses heuristics to generate schedulers and workflow actions. For complex workflows with unusual dependency patterns, the generated configuration may not be optimal and could waste allocation time.
RECOMMENDED: Preview the generated configuration first with:
torc slurm generate --account
Review the schedulers and actions to ensure they are appropriate for your workflow before submitting. You can save the output and submit manually:
torc slurm generate --account
Usage: torc submit-slurm [OPTIONS] --account <ACCOUNT> <WORKFLOW_SPEC>
Arguments:
- <WORKFLOW_SPEC> — Path to workflow spec file (JSON/JSON5/YAML/KDL)
Options:
- --account <ACCOUNT> — Slurm account to use for allocations
- --hpc-profile <HPC_PROFILE> — HPC profile to use (auto-detected if not specified)
- --single-allocation — Bundle all nodes into a single Slurm allocation per scheduler
  By default, creates one Slurm allocation per node (N×1 mode), which allows jobs to start as nodes become available and provides better fault tolerance.
  With this flag, creates one large allocation with all nodes (1×N mode), which requires all nodes to be available simultaneously but uses a single sbatch.
- -i, --ignore-missing-data — Ignore missing data (defaults to false)
  Default value: false
- --skip-checks — Skip validation checks (e.g., scheduler node requirements). Use with caution
  Default value: false
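For example (account name and spec file are illustrative):
# Submit with one allocation per node (default N×1 mode)
torc submit-slurm --account my_account my_workflow.yaml
# Bundle all nodes into a single allocation per scheduler (1×N mode)
torc submit-slurm --account my_account --single-allocation my_workflow.yaml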
torc watch
Watch a workflow and automatically recover from failures.
Monitors a workflow until completion. With --recover, automatically diagnoses failures, adjusts
resource requirements, and resubmits jobs.
Usage: torc watch [OPTIONS] <WORKFLOW_ID>
Usage Modes
- Basic monitoring (no recovery):
  torc watch 123
  Reports failures and exits. Use for manual intervention or AI-assisted recovery.
- With automatic recovery (--recover):
  torc watch 123 --recover
  Automatically diagnoses OOM/timeout failures, adjusts resources, and retries. Runs until all jobs complete or max retries are exceeded.
- With auto-scheduling (--auto-schedule):
  torc watch 123 --auto-schedule
  Automatically submits new Slurm allocations when retry jobs are waiting. Essential for workflows using failure handlers that create retry jobs.
Arguments
- <WORKFLOW_ID> — Workflow ID to watch
Options
Polling:
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds (default: 60)
- -o, --output-dir <OUTPUT_DIR> — Output directory for job files (default: output)
- -s, --show-job-counts — Show job counts by status during polling. WARNING: Can cause high server load for large workflows.
Recovery:
- -r, --recover — Enable automatic failure recovery
- -m, --max-retries <MAX_RETRIES> — Maximum number of recovery attempts (default: 3)
- --memory-multiplier <MEMORY_MULTIPLIER> — Memory multiplier for OOM failures (default: 1.5)
- --runtime-multiplier <RUNTIME_MULTIPLIER> — Runtime multiplier for timeout failures (default: 1.5)
- --retry-unknown — Also retry jobs with unknown failure causes (not just OOM or timeout)
- --recovery-hook <RECOVERY_HOOK> — Custom recovery script for unknown failures. The workflow ID is passed as an argument and via the TORC_WORKFLOW_ID environment variable.
Auto-scheduling:
- --auto-schedule — Automatically schedule new compute nodes when needed
- --auto-schedule-threshold <N> — Minimum retry jobs before auto-scheduling when schedulers exist (default: 5)
- --auto-schedule-cooldown <SECONDS> — Cooldown between auto-schedule attempts (default: 1800 / 30 min)
- --auto-schedule-stranded-timeout <SECONDS> — Schedule stranded jobs after this timeout even if below threshold (default: 7200 / 2 hrs). Set to 0 to disable.
Auto-Scheduling Behavior
When --auto-schedule is enabled:
- No schedulers available: Immediately submits new allocations if ready jobs exist.
- Threshold exceeded: If retry jobs (attempt_id > 1) exceed --auto-schedule-threshold while schedulers are running, submits additional allocations after cooldown.
- Stranded jobs: If retry jobs are below the threshold but have been waiting longer than --auto-schedule-stranded-timeout, schedules anyway to prevent indefinite waiting.
Examples
# Basic: watch until completion, report failures
torc watch 123
# Recovery: automatically fix OOM/timeout failures
torc watch 123 --recover
# Recovery with aggressive resource increases
torc watch 123 --recover --memory-multiplier 2.0 --runtime-multiplier 2.0
# Recovery including unknown failures (transient errors)
torc watch 123 --recover --retry-unknown
# Auto-schedule: ensure retry jobs get scheduled
torc watch 123 --auto-schedule
# Full production setup: recovery + auto-scheduling
torc watch 123 --recover --auto-schedule
# Custom auto-schedule settings
torc watch 123 --auto-schedule \
--auto-schedule-threshold 10 \
--auto-schedule-cooldown 3600 \
--auto-schedule-stranded-timeout 14400
See Also
- torc recover — One-shot recovery (no continuous monitoring)
- Automatic Failure Recovery — Detailed guide
torc recover
Recover a Slurm workflow from failures (one-shot).
Diagnoses job failures (OOM, timeout), adjusts resource requirements, and resubmits jobs. Use after
a workflow has completed with failures. For continuous monitoring, use torc watch --recover
instead.
Usage: torc recover [OPTIONS] <WORKFLOW_ID>
Arguments
- <WORKFLOW_ID> — Workflow ID to recover
Options
- -o, --output-dir <OUTPUT_DIR> — Output directory for job files (default: output)
- --memory-multiplier <MEMORY_MULTIPLIER> — Memory multiplier for OOM failures (default: 1.5)
- --runtime-multiplier <RUNTIME_MULTIPLIER> — Runtime multiplier for timeout failures (default: 1.4)
- --retry-unknown — Also retry jobs with unknown failure causes
- --recovery-hook <RECOVERY_HOOK> — Custom recovery script for unknown failures
- --dry-run — Show what would be done without making any changes
When to Use
Use torc recover for:
- One-shot recovery after a workflow has completed with failures
- Manual investigation before retrying (use --dry-run first)
- Workflows where you want to inspect failures before retrying
Use torc watch --recover instead for:
- Continuous monitoring of long-running workflows
- Fully automated recovery without manual intervention
- Production workflows that should self-heal
Examples
# Basic recovery
torc recover 123
# Dry run to preview changes without modifying anything
torc recover 123 --dry-run
# Custom resource multipliers
torc recover 123 --memory-multiplier 2.0 --runtime-multiplier 1.5
# Also retry unknown failures (not just OOM/timeout)
torc recover 123 --retry-unknown
# With custom recovery hook for domain-specific fixes
torc recover 123 --recovery-hook 'bash fix-cluster.sh'
See Also
- torc watch --recover — Continuous monitoring with automatic recovery
- Automatic Failure Recovery — Detailed guide
torc workflows
Workflow management commands
Usage: torc workflows <COMMAND>
Subcommands:
- create — Create a workflow from a specification file (supports JSON, JSON5, YAML, and KDL formats)
- create-slurm — Create a workflow with auto-generated Slurm schedulers
- new — Create a new empty workflow
- list — List workflows
- get — Get a specific workflow by ID
- update — Update an existing workflow
- cancel — Cancel a workflow and all associated Slurm jobs
- delete — Delete one or more workflows
- archive — Archive or unarchive one or more workflows
- submit — Submit a workflow: initialize if needed and schedule nodes for on_workflow_start actions. This command requires the workflow to have an on_workflow_start action with schedule_nodes
- run — Run a workflow locally on the current node
- initialize — Initialize a workflow, including all job statuses
- reinitialize — Reinitialize a workflow. This will reinitialize all jobs with a status of canceled, submitting, pending, or terminated. Jobs with a status of done will also be reinitialized if an input_file or user_data record has changed
- status — Get workflow status
- reset-status — Reset workflow and job status
- execution-plan — Show the execution plan for a workflow specification or existing workflow
- list-actions — List workflow actions and their statuses (useful for debugging action triggers)
- is-complete — Check if a workflow is complete
- export — Export a workflow to a portable JSON file
- import — Import a workflow from an exported JSON file
- sync-status — Synchronize job statuses with Slurm (detect and fail orphaned jobs)
torc workflows create
Create a workflow from a specification file (supports JSON, JSON5, YAML, and KDL formats)
Usage: torc workflows create [OPTIONS] --user <USER> <FILE>
Arguments:
- <FILE> — Path to specification file containing WorkflowSpec
  Supported formats:
  - JSON (.json): Standard JSON format
  - JSON5 (.json5): JSON with comments and trailing commas
  - YAML (.yaml, .yml): Human-readable YAML format
  - KDL (.kdl): KDL document format
  Format is auto-detected from the file extension, with fallback parsing attempted
Options:
- -u, --user <USER> — User that owns the workflow (defaults to USER environment variable)
- --no-resource-monitoring — Disable resource monitoring (default: enabled with summary granularity and 5s sample rate)
  Default value: false
- --skip-checks — Skip validation checks (e.g., scheduler node requirements). Use with caution
  Default value: false
- --dry-run — Validate the workflow specification without creating it (dry-run mode). Returns a summary of what would be created, including job count after parameter expansion
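For example (file name is illustrative):
# Validate the specification without creating anything
torc workflows create my_workflow.yaml --user "$USER" --dry-run
# Create the workflow once the dry run looks correct
torc workflows create my_workflow.yaml --user "$USER"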
torc workflows create-slurm
Create a workflow with auto-generated Slurm schedulers
Automatically generates Slurm schedulers based on job resource requirements and HPC profile. For Slurm workflows without pre-configured schedulers.
Usage: torc workflows create-slurm [OPTIONS] --account <ACCOUNT> --user <USER> <FILE>
Arguments:
- <FILE> — Path to specification file containing WorkflowSpec
Options:
- --account <ACCOUNT> — Slurm account to use for allocations
- --hpc-profile <HPC_PROFILE> — HPC profile to use (auto-detected if not specified)
- --single-allocation — Bundle all nodes into a single Slurm allocation per scheduler
  By default, creates one Slurm allocation per node (N×1 mode), which allows jobs to start as nodes become available and provides better fault tolerance.
  With this flag, creates one large allocation with all nodes (1×N mode), which requires all nodes to be available simultaneously but uses a single sbatch.
- -u, --user <USER> — User that owns the workflow (defaults to USER environment variable)
- --no-resource-monitoring — Disable resource monitoring (default: enabled with summary granularity and 5s sample rate)
  Default value: false
- --skip-checks — Skip validation checks (e.g., scheduler node requirements). Use with caution
  Default value: false
- --dry-run — Validate the workflow specification without creating it (dry-run mode)
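For example (account and file name are illustrative):
torc workflows create-slurm --account my_account --user "$USER" my_workflow.yaml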
torc workflows new
Create a new empty workflow
Usage: torc workflows new [OPTIONS] --name <NAME> --user <USER>
Options:
- -n, --name <NAME> — Name of the workflow
- -d, --description <DESCRIPTION> — Description of the workflow
- -u, --user <USER> — User that owns the workflow (defaults to USER environment variable)
torc workflows list
List workflows
Usage: torc workflows list [OPTIONS]
Options:
- -u, --user <USER> — User to filter by (defaults to USER environment variable)
- --all-users — List workflows for all users (overrides --user)
- -l, --limit <LIMIT> — Maximum number of workflows to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
- --archived-only — Show only archived workflows
  Default value: false
- --include-archived — Include both archived and non-archived workflows
  Default value: false
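For example:
# List up to 20 of your workflows
torc workflows list --limit 20
# Include archived workflows from all users, as JSON for scripting
torc -f json workflows list --all-users --include-archived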
torc workflows get
Get a specific workflow by ID
Usage: torc workflows get [OPTIONS] [ID]
Arguments:
- <ID> — ID of the workflow to get (optional - will prompt if not provided)
Options:
- -u, --user <USER> — User to filter by (defaults to USER environment variable)
torc workflows update
Update an existing workflow
Usage: torc workflows update [OPTIONS] [ID]
Arguments:
- <ID> — ID of the workflow to update (optional - will prompt if not provided)
Options:
- -n, --name <NAME> — Name of the workflow
- -d, --description <DESCRIPTION> — Description of the workflow
- --owner-user <OWNER_USER> — User that owns the workflow
torc workflows cancel
Cancel a workflow and all associated Slurm jobs
Usage: torc workflows cancel [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to cancel (optional - will prompt if not provided)
torc workflows delete
Delete one or more workflows
Usage: torc workflows delete [OPTIONS] [IDS]...
Arguments:
- <IDS> — IDs of workflows to remove (optional - will prompt if not provided)
Options:
- --no-prompts — Skip confirmation prompt
- --force — Force deletion even if workflow belongs to a different user
torc workflows archive
Archive or unarchive one or more workflows
Usage: torc workflows archive <IS_ARCHIVED> [WORKFLOW_IDS]...
Arguments:
- <IS_ARCHIVED> — Set to true to archive, false to unarchive
- <WORKFLOW_IDS> — IDs of workflows to archive/unarchive (if empty, will prompt for selection)
torc workflows submit
Submit a workflow: initialize if needed and schedule nodes for on_workflow_start actions. This command requires the workflow to have an on_workflow_start action with schedule_nodes.
Usage: torc workflows submit [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to submit (optional - will prompt if not provided)
Options:
- --force — If false, fail the operation if missing data is present (defaults to false)
  Default value: false
torc workflows run
Run a workflow locally on the current node
Usage: torc workflows run [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to run (optional - will prompt if not provided)
Options:
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds for checking job completion
  Default value: 5.0
- --max-parallel-jobs <MAX_PARALLEL_JOBS> — Maximum number of parallel jobs to run (defaults to available CPUs)
- --output-dir <OUTPUT_DIR> — Output directory for job logs and results
  Default value: output
torc workflows initialize
Initialize a workflow, including all job statuses
Usage: torc workflows initialize [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to start (optional - will prompt if not provided)
Options:
- --force — If false, fail the operation if missing data is present (defaults to false)
  Default value: false
- --no-prompts — Skip confirmation prompt
- --dry-run — Perform a dry run without making changes
torc workflows reinitialize
Reinitialize a workflow. This will reinitialize all jobs with a status of canceled, submitting, pending, or terminated. Jobs with a status of done will also be reinitialized if an input_file or user_data record has changed.
Usage: torc workflows reinitialize [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to reinitialize (optional - will prompt if not provided)
Options:
- --force — If false, fail the operation if missing data is present (defaults to false)
  Default value: false
- --dry-run — Perform a dry run without making changes
torc workflows status
Get workflow status
Usage: torc workflows status [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to get status for (optional - will prompt if not provided)
Options:
- -u, --user <USER> — User to filter by (defaults to USER environment variable)
torc workflows reset-status
Reset workflow and job status
Usage: torc workflows reset-status [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to reset status for (optional - will prompt if not provided)
Options:
- --failed-only — Only reset failed jobs
  Default value: false
- -r, --reinitialize — Reinitialize the workflow after resetting status
  Default value: false
- --force — Force reset even if there are active jobs (ignores running/pending jobs check)
  Default value: false
- --no-prompts — Skip confirmation prompt
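For example (workflow ID is illustrative):
# Reset only failed jobs and reinitialize them so they can run again
torc workflows reset-status 123 --failed-only --reinitialize
# Skip the confirmation prompt in scripts
torc workflows reset-status 123 --failed-only --no-prompts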
torc workflows execution-plan
Show the execution plan for a workflow specification or existing workflow
Usage: torc workflows execution-plan <SPEC_OR_ID>
Arguments:
- <SPEC_OR_ID> — Path to specification file OR workflow ID
torc workflows list-actions
List workflow actions and their statuses (useful for debugging action triggers)
Usage: torc workflows list-actions [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to show actions for (optional - will prompt if not provided)
Options:
- -u, --user <USER> — User to filter by when selecting workflow interactively (defaults to USER environment variable)
torc workflows is-complete
Check if a workflow is complete
Usage: torc workflows is-complete [ID]
Arguments:
- <ID> — ID of the workflow to check (optional - will prompt if not provided)
torc workflows export
Export a workflow to a portable JSON file
Creates a self-contained export that can be imported into the same or different torc-server instance. All entity IDs are preserved in the export and remapped during import.
Usage: torc workflows export [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to export (optional - will prompt if not provided)
Options:
- -o, --output <OUTPUT> — Output file path (default: stdout)
- --include-results — Include job results in export
- --include-events — Include events (workflow history) in export
Examples:
# Export workflow to stdout
torc workflows export 123
# Export to a file
torc workflows export 123 -o workflow.json
# Include job results in export
torc workflows export 123 --include-results -o backup.json
# Export with all optional data
torc workflows export 123 --include-results --include-events -o complete.json
torc workflows import
Import a workflow from an exported JSON file
Imports a workflow that was previously exported. All entity IDs are remapped to new IDs assigned by the server. By default, all job statuses are reset to uninitialized for a fresh start.
Usage: torc workflows import [OPTIONS] <FILE>
Arguments:
- <FILE> — Path to the exported workflow JSON file (use '-' for stdin)
Options:
- --name <NAME> — Override the workflow name
- --skip-results — Skip importing results even if present in export
- --skip-events — Skip importing events even if present in export
Examples:
# Import a workflow (resets job statuses by default)
torc workflows import workflow.json
# Import from stdin
cat workflow.json | torc workflows import -
# Import with a different name
torc workflows import workflow.json --name 'my-copy'
# Skip importing results even if present in file
torc workflows import workflow.json --skip-results
torc workflows sync-status
Synchronize job statuses with Slurm (detect and fail orphaned jobs)
This command detects jobs that are stuck in "running" status because their Slurm allocation terminated unexpectedly (e.g., due to timeout, node failure, or admin intervention). It marks these orphaned jobs as failed so the workflow can be recovered or restarted.
Use this when:
- torc recover reports "there are active Slurm allocations" but squeue shows none
- Jobs appear stuck in "running" status after a Slurm allocation ended
- You want to clean up workflow state before running torc recover
Usage: torc workflows sync-status [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to sync (optional - will prompt if not provided)
Options:
- --dry-run — Preview changes without applying them
Examples:
# Preview what would be cleaned up
torc workflows sync-status 123 --dry-run
# Clean up orphaned jobs
torc workflows sync-status 123
# Get JSON output for scripting
torc -f json workflows sync-status 123
torc compute-nodes
Compute node management commands
Usage: torc compute-nodes <COMMAND>
Subcommands:
- get — Get a specific compute node by ID
- list — List compute nodes for a workflow
torc compute-nodes get
Get a specific compute node by ID
Usage: torc compute-nodes get <ID>
Arguments:
- <ID> — ID of the compute node
torc compute-nodes list
List compute nodes for a workflow
Usage: torc compute-nodes list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List compute nodes for this workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of compute nodes to return
  Default value: 10000
- -o, --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- -s, --sort-by <SORT_BY> — Field to sort by
- -r, --reverse-sort — Reverse sort order
  Default value: false
- --scheduled-compute-node <SCHEDULED_COMPUTE_NODE> — Filter by scheduled compute node ID
torc files
File management commands
Usage: torc files <COMMAND>
Subcommands:
- create — Create a new file
- list — List files
- get — Get a specific file by ID
- update — Update an existing file
- delete — Delete a file
- list-required-existing — List required existing files for a workflow
torc files create
Create a new file
Usage: torc files create --name <NAME> --path <PATH> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Create the file in this workflow
Options:
- -n, --name <NAME> — Name of the file
- -p, --path <PATH> — Path of the file
torc files list
List files
Usage: torc files list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List files for this workflow (optional - will prompt if not provided)
Options:
- --produced-by-job-id <PRODUCED_BY_JOB_ID> — Filter by job ID that produced the files
- -l, --limit <LIMIT> — Maximum number of files to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
torc files get
Get a specific file by ID
Usage: torc files get <ID>
Arguments:
- <ID> — ID of the file to get
torc files update
Update an existing file
Usage: torc files update [OPTIONS] <ID>
Arguments:
- <ID> — ID of the file to update
Options:
- -n, --name <NAME> — Name of the file
- -p, --path <PATH> — Path of the file
torc files delete
Delete a file
Usage: torc files delete <ID>
Arguments:
- <ID> — ID of the file to remove
torc files list-required-existing
List required existing files for a workflow
Usage: torc files list-required-existing [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List required existing files for this workflow (optional - will prompt if not provided)
torc jobs
Job management commands
Usage: torc jobs <COMMAND>
Subcommands:
- create — Create a new job
- create-from-file — Create multiple jobs from a text file containing one command per line
- list — List jobs
- get — Get a specific job by ID
- update — Update an existing job
- delete — Delete one or more jobs
- delete-all — Delete all jobs for a workflow
- list-resource-requirements — List jobs with their resource requirements
torc jobs create
Create a new job
Usage: torc jobs create [OPTIONS] --name <NAME> --command <COMMAND> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Create the job in this workflow
Options:
- -n, --name <NAME> — Name of the job
- -c, --command <COMMAND> — Command to execute
- -r, --resource-requirements-id <RESOURCE_REQUIREMENTS_ID> — Resource requirements ID for this job
- -b, --blocking-job-ids <BLOCKING_JOB_IDS> — Job IDs that block this job
- -i, --input-file-ids <INPUT_FILE_IDS> — Input files needed by this job
- -o, --output-file-ids <OUTPUT_FILE_IDS> — Output files produced by this job
torc jobs create-from-file
Create multiple jobs from a text file containing one command per line
This command reads a text file where each line contains a job command. Lines starting with '#' are treated as comments and ignored. Empty lines are also ignored.
Jobs will be named sequentially as job1, job2, job3, etc., starting from the current job count + 1 to avoid naming conflicts.
All jobs created will share the same resource requirements, which are automatically created and assigned.
Example: torc jobs create-from-file 123 batch_jobs.txt --cpus-per-job 4 --memory-per-job 8g
Usage: torc jobs create-from-file [OPTIONS] <WORKFLOW_ID> <FILE>
Arguments:
- <WORKFLOW_ID> — Workflow ID to create jobs for
- <FILE> — Path to text file containing job commands (one per line)
  File format: one command per line; lines starting with '#' are comments (ignored); empty lines are ignored.
  Example file content:
  # Data processing jobs
  python process.py --batch 1
  python process.py --batch 2
  python process.py --batch 3
Options:
- --cpus-per-job <CPUS_PER_JOB> — Number of CPUs per job
  Default value: 1
- --memory-per-job <MEMORY_PER_JOB> — Memory per job (e.g., "1m", "2g", "16g")
  Default value: 1m
- --runtime-per-job <RUNTIME_PER_JOB> — Runtime per job (ISO 8601 duration format)
  Examples: P0DT1M = 1 minute, P0DT30M = 30 minutes, P0DT2H = 2 hours, P1DT0H = 1 day
  Default value: P0DT1M
torc jobs list
List jobs
Usage: torc jobs list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List jobs for this workflow (optional - will prompt if not provided)
Options:
- -s, --status <STATUS> — Filter by job status
- --upstream-job-id <UPSTREAM_JOB_ID> — Filter by upstream job ID (jobs that depend on this job)
- -l, --limit <LIMIT> — Maximum number of jobs to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
- --include-relationships — Include job relationships (depends_on_job_ids, input/output file/user_data IDs); slower but more complete
torc jobs get
Get a specific job by ID
Usage: torc jobs get <ID>
Arguments:
- <ID> — ID of the job to get
torc jobs update
Update an existing job
Usage: torc jobs update [OPTIONS] <ID>
Arguments:
- <ID> — ID of the job to update
Options:
- -n, --name <NAME> — Name of the job
- -c, --command <COMMAND> — Command to execute
torc jobs delete
Delete one or more jobs
Usage: torc jobs delete [IDS]...
Arguments:
- <IDS> — IDs of the jobs to remove
torc jobs delete-all
Delete all jobs for a workflow
Usage: torc jobs delete-all [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to delete all jobs from (optional - will prompt if not provided)
torc jobs list-resource-requirements
List jobs with their resource requirements
Usage: torc jobs list-resource-requirements [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to list jobs from (optional - will prompt if not provided)
Options:
- -j, --job-id <JOB_ID> — Filter by specific job ID
torc job-dependencies
Job dependency and relationship queries
Usage: torc job-dependencies <COMMAND>
Subcommands:
- job-job — List job-to-job dependencies for a workflow
- job-file — List job-file relationships for a workflow
- job-user-data — List job-user_data relationships for a workflow
torc job-dependencies job-job
List job-to-job dependencies for a workflow
Usage: torc job-dependencies job-job [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of dependencies to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
torc job-dependencies job-file
List job-file relationships for a workflow
Usage: torc job-dependencies job-file [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of relationships to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
torc job-dependencies job-user-data
List job-user_data relationships for a workflow
Usage: torc job-dependencies job-user-data [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of relationships to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
torc resource-requirements
Resource requirements management commands
Usage: torc resource-requirements <COMMAND>
Subcommands:
- create — Create new resource requirements
- list — List resource requirements
- get — Get a specific resource requirement by ID
- update — Update existing resource requirements
- delete — Delete resource requirements
torc resource-requirements create
Create new resource requirements
Usage: torc resource-requirements create [OPTIONS] --name <NAME> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Create resource requirements in this workflow
Options:
- -n, --name <NAME> — Name of the resource requirements
- --num-cpus <NUM_CPUS> — Number of CPUs required
  Default value: 1
- --num-gpus <NUM_GPUS> — Number of GPUs required
  Default value: 0
- --num-nodes <NUM_NODES> — Number of nodes required
  Default value: 1
- -m, --memory <MEMORY> — Amount of memory required (e.g., "20g")
  Default value: 1m
- -r, --runtime <RUNTIME> — Maximum runtime in ISO 8601 duration format (e.g., "P0DT1H")
  Default value: P0DT1M
torc resource-requirements list
List resource requirements
Usage: torc resource-requirements list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List resource requirements for this workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of resource requirements to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
torc resource-requirements get
Get a specific resource requirement by ID
Usage: torc resource-requirements get <ID>
Arguments:
- <ID> — ID of the resource requirement to get
torc resource-requirements update
Update existing resource requirements
Usage: torc resource-requirements update [OPTIONS] <ID>
Arguments:
- <ID> — ID of the resource requirement to update
Options:
- -n, --name <NAME> — Name of the resource requirements
- --num-cpus <NUM_CPUS> — Number of CPUs required
- --num-gpus <NUM_GPUS> — Number of GPUs required
- --num-nodes <NUM_NODES> — Number of nodes required
- --memory <MEMORY> — Amount of memory required (e.g., "20g")
- --runtime <RUNTIME> — Maximum runtime (e.g., "1h", "30m")
torc resource-requirements delete
Delete resource requirements
Usage: torc resource-requirements delete <ID>
Arguments:
- <ID> — ID of the resource requirement to remove
torc events
Event management commands
Usage: torc events <COMMAND>
Subcommands:
- create — Create a new event
- list — List events for a workflow
- monitor — Monitor events for a workflow in real-time
- get-latest-event — Get the latest event for a workflow
- delete — Delete an event
torc events create
Create a new event
Usage: torc events create --data <DATA> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Create the event in this workflow
Options:
- -d, --data <DATA> — JSON data for the event
torc events list
List events for a workflow
Usage: torc events list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List events for this workflow (optional - will prompt if not provided)
Options:
- -c, --category <CATEGORY> — Filter events by category
- -l, --limit <LIMIT> — Maximum number of events to return
  Default value: 10000
- -o, --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- -s, --sort-by <SORT_BY> — Field to sort by
- -r, --reverse-sort — Reverse sort order
  Default value: false
torc events monitor
Monitor events for a workflow in real-time
Usage: torc events monitor [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Monitor events for this workflow (optional - will prompt if not provided)
Options:
- -d, --duration <DURATION> — Duration to monitor in minutes (default: infinite)
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds
  Default value: 60
- -c, --category <CATEGORY> — Filter events by category
torc events get-latest-event
Get the latest event for a workflow
Usage: torc events get-latest-event [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Get the latest event for this workflow (optional - will prompt if not provided)
torc events delete
Delete an event
Usage: torc events delete <ID>
Arguments:
- <ID> — ID of the event to remove
torc results
Result management commands
Usage: torc results <COMMAND>
Subcommands:
- list — List results
- get — Get a specific result by ID
- delete — Delete a result
torc results list
List results
Usage: torc results list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List results for this workflow (optional - will prompt if not provided). By default, only lists results for the latest run of the workflow
Options:
- -j, --job-id <JOB_ID> — List results for this job
- -r, --run-id <RUN_ID> — List results for this run_id
- --return-code <RETURN_CODE> — Filter by return code
- --failed — Show only failed jobs (non-zero return code)
- -s, --status <STATUS> — Filter by job status (uninitialized, blocked, canceled, terminated, done, ready, scheduled, running, pending, disabled)
- -l, --limit <LIMIT> — Maximum number of results to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
- --all-runs — Show all historical results (default: false, only shows current results)
- --compute-node <COMPUTE_NODE> — Filter by compute node ID
torc results get
Get a specific result by ID
Usage: torc results get <ID>
Arguments:
- <ID> — ID of the result to get
torc results delete
Delete a result
Usage: torc results delete <ID>
Arguments:
- <ID> — ID of the result to remove
torc user-data
User data management commands
Usage: torc user-data <COMMAND>
Subcommands:
- create — Create a new user data record
- list — List user data records
- get — Get a specific user data record
- update — Update a user data record
- delete — Delete a user data record
- delete-all — Delete all user data records for a workflow
- list-missing — List missing user data for a workflow
torc user-data create
Create a new user data record
Usage: torc user-data create [OPTIONS] --name <NAME> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -n, --name <NAME> — Name of the data object
- -d, --data <DATA> — JSON data content
- --ephemeral — Whether the data is ephemeral (cleared between runs)
- --consumer-job-id <CONSUMER_JOB_ID> — Consumer job ID (optional)
- --producer-job-id <PRODUCER_JOB_ID> — Producer job ID (optional)
torc user-data list
List user data records
Usage: torc user-data list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (if not provided, will be selected interactively)
Options:
- -l, --limit <LIMIT> — Maximum number of records to return
  Default value: 50
- -o, --offset <OFFSET> — Number of records to skip
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
- --name <NAME> — Filter by name
- --is-ephemeral <IS_EPHEMERAL> — Filter by ephemeral status
  Possible values: true, false
- --consumer-job-id <CONSUMER_JOB_ID> — Filter by consumer job ID
- --producer-job-id <PRODUCER_JOB_ID> — Filter by producer job ID
torc user-data get
Get a specific user data record
Usage: torc user-data get <ID>
Arguments:
- <ID> — User data record ID
torc user-data update
Update a user data record
Usage: torc user-data update [OPTIONS] <ID>
Arguments:
- <ID> — User data record ID
Options:
- -n, --name <NAME> — New name for the data object
- -d, --data <DATA> — New JSON data content
- --ephemeral <EPHEMERAL> — Update ephemeral status
  Possible values: true, false
torc user-data delete
Delete a user data record
Usage: torc user-data delete <ID>
Arguments:
- <ID> — User data record ID
torc user-data delete-all
Delete all user data records for a workflow
Usage: torc user-data delete-all <WORKFLOW_ID>
Arguments:
- <WORKFLOW_ID> — Workflow ID
torc user-data list-missing
List missing user data for a workflow
Usage: torc user-data list-missing <WORKFLOW_ID>
Arguments:
- <WORKFLOW_ID> — Workflow ID
torc slurm
Slurm scheduler commands
Usage: torc slurm <COMMAND>
Subcommands:
- create — Add a Slurm config to the database
- update — Modify a Slurm config in the database
- list — Show the current Slurm configs in the database
- get — Get a specific Slurm config by ID
- delete — Delete a Slurm config by ID
- schedule-nodes — Schedule compute nodes using Slurm
- parse-logs — Parse Slurm log files for known error messages
- sacct — Call sacct for scheduled compute nodes and display summary
- generate — Generate Slurm schedulers for a workflow based on job resource requirements
- regenerate — Regenerate Slurm schedulers for an existing workflow based on pending jobs
torc slurm create
Add a Slurm config to the database
Usage: torc slurm create [OPTIONS] --name <NAME> --account <ACCOUNT> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -n, --name <NAME> — Name of config
- -a, --account <ACCOUNT> — HPC account
- -g, --gres <GRES> — Request nodes that have at least this number of GPUs. Ex: 'gpu:2'
- -m, --mem <MEM> — Request nodes that have at least this amount of memory. Ex: '180G'
- -N, --nodes <NODES> — Number of nodes to use for each job
  Default value: 1
- -p, --partition <PARTITION> — HPC partition. Default is determined by the scheduler
- -q, --qos <QOS> — Controls priority of the jobs
  Default value: normal
- -t, --tmp <TMP> — Request nodes that have at least this amount of storage scratch space
- -W, --walltime <WALLTIME> — Slurm job walltime
  Default value: 04:00:00
- -e, --extra <EXTRA> — Add extra Slurm parameters, for example --extra='--reservation=my-reservation'
torc slurm update
Modify a Slurm config in the database
Usage: torc slurm update [OPTIONS] <SCHEDULER_ID>
Arguments:
- <SCHEDULER_ID>
Options:
- -N, --name <NAME> — Name of config
- -a, --account <ACCOUNT> — HPC account
- -g, --gres <GRES> — Request nodes that have at least this number of GPUs. Ex: 'gpu:2'
- -m, --mem <MEM> — Request nodes that have at least this amount of memory. Ex: '180G'
- -n, --nodes <NODES> — Number of nodes to use for each job
- -p, --partition <PARTITION> — HPC partition
- -q, --qos <QOS> — Controls priority of the jobs
- -t, --tmp <TMP> — Request nodes that have at least this amount of storage scratch space
- --walltime <WALLTIME> — Slurm job walltime
- -e, --extra <EXTRA> — Add extra Slurm parameters
torc slurm list
Show the current Slurm configs in the database
Usage: torc slurm list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -l, --limit <LIMIT> — Maximum number of configs to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
torc slurm get
Get a specific Slurm config by ID
Usage: torc slurm get <ID>
Arguments:
- <ID> — ID of the Slurm config to get
torc slurm delete
Delete a Slurm config by ID
Usage: torc slurm delete <ID>
Arguments:
- <ID> — ID of the Slurm config to delete
torc slurm schedule-nodes
Schedule compute nodes using Slurm
Usage: torc slurm schedule-nodes [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -j, --job-prefix <JOB_PREFIX> — Job prefix for the Slurm job names
  Default value: worker
- --keep-submission-scripts — Keep submission scripts after job submission
  Default value: false
- -m, --max-parallel-jobs <MAX_PARALLEL_JOBS> — Maximum number of parallel jobs
- -n, --num-hpc-jobs <NUM_HPC_JOBS> — Number of HPC jobs to submit
  Default value: 1
- -o, --output <OUTPUT> — Output directory for job output files
  Default value: output
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds
  Default value: 60
- --scheduler-config-id <SCHEDULER_CONFIG_ID> — Scheduler config ID
- --start-one-worker-per-node — Start one worker per node
  Default value: false
torc slurm parse-logs
Parse Slurm log files for known error messages
Usage: torc slurm parse-logs [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -o, --output-dir <OUTPUT_DIR> — Output directory containing Slurm log files
  Default value: output
- --errors-only — Only show errors (skip warnings)
  Default value: false
torc slurm sacct
Call sacct for scheduled compute nodes and display summary
Usage: torc slurm sacct [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -o, --output-dir <OUTPUT_DIR> — Output directory for sacct JSON files (only used with --save-json)
  Default value: output
- --save-json — Save full JSON output to files in addition to displaying summary
  Default value: false
torc slurm generate
Generate Slurm schedulers for a workflow based on job resource requirements
Usage: torc slurm generate [OPTIONS] --account <ACCOUNT> <WORKFLOW_FILE>
Arguments:
- <WORKFLOW_FILE> — Path to workflow specification file (YAML, JSON, JSON5, or KDL)
Options:
- --account <ACCOUNT> — Slurm account to use
- --profile <PROFILE> — HPC profile to use (if not specified, tries to detect current system)
- -o, --output <OUTPUT> — Output file path (if not specified, prints to stdout)
- --single-allocation — Bundle all nodes into a single Slurm allocation per scheduler
  By default, creates one Slurm allocation per node (N×1 mode), which allows jobs to start as nodes become available and provides better fault tolerance. With this flag, creates one large allocation with all nodes (1×N mode), which requires all nodes to be available simultaneously but uses a single sbatch.
- --group-by <GROUP_BY> — Strategy for grouping jobs into schedulers
  Possible values: resource-requirements (default), partition
- --walltime-strategy <STRATEGY> — Strategy for determining Slurm job walltime
  Possible values: max-job-runtime (default), max-partition-time
  max-job-runtime: Uses the maximum job runtime multiplied by --walltime-multiplier. This typically results in shorter walltime requests, improving queue priority.
  max-partition-time: Uses the partition's maximum allowed walltime. More conservative but may negatively impact queue scheduling.
- --walltime-multiplier <MULTIPLIER> — Multiplier for job runtime when using --walltime-strategy=max-job-runtime
  Default value: 1.5
  The maximum job runtime is multiplied by this value to provide a safety margin. For example, 1.5 means requesting 50% more time than the longest job estimate.
- --no-actions — Don't add workflow actions for scheduling nodes
- --overwrite — Overwrite existing schedulers in the workflow
- --dry-run — Show what would be generated without writing to output
torc slurm regenerate
Regenerate Slurm schedulers for an existing workflow based on pending jobs
Analyzes jobs that are uninitialized, ready, or blocked and generates new Slurm schedulers to run them. Uses existing scheduler configurations as defaults for account, partition, and other settings.
This is useful for recovery after job failures: update job resources, reset failed jobs, then regenerate schedulers to submit new allocations.
Usage: torc slurm regenerate [OPTIONS] <WORKFLOW_ID>
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- --account <ACCOUNT> — Slurm account to use (defaults to account from existing schedulers)
- --profile <PROFILE> — HPC profile to use (if not specified, tries to detect current system)
- --single-allocation — Bundle all nodes into a single Slurm allocation per scheduler
- --submit — Submit the generated allocations immediately
- -o, --output-dir <OUTPUT_DIR> — Output directory for job output files (used when submitting)
  Default value: output
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds (used when submitting)
  Default value: 60
- --group-by <GROUP_BY> — Strategy for grouping jobs into schedulers
  Possible values: resource-requirements (default), partition
- --walltime-strategy <STRATEGY> — Strategy for determining Slurm job walltime
  Possible values: max-job-runtime (default), max-partition-time
- --walltime-multiplier <MULTIPLIER> — Multiplier for job runtime when using --walltime-strategy=max-job-runtime
  Default value: 1.5
- --dry-run — Show what would be created without making changes
- --include-job-ids <JOB_IDS> — Include specific job IDs in planning regardless of their status (useful for recovery dry-run to include failed jobs)
torc remote
Remote worker execution commands (SSH-based distributed execution)
Usage: torc remote <COMMAND>
Subcommands:
- add-workers — Add one or more remote workers to a workflow
- add-workers-from-file — Add remote workers to a workflow from a file
- remove-worker — Remove a remote worker from a workflow
- list-workers — List remote workers stored in the database for a workflow
- run — Run workers on remote machines via SSH
- status — Check status of remote workers
- stop — Stop workers on remote machines
- collect-logs — Collect logs from remote workers
- delete-logs — Delete logs from remote workers
torc remote add-workers
Add one or more remote workers to a workflow
Workers are stored in the database and used by subsequent commands. Format: [user@]hostname[:port]
Usage: torc remote add-workers <WORKFLOW_ID> <WORKERS>...
Arguments:
- <WORKFLOW_ID> — Workflow ID
- <WORKERS> — Worker addresses (format: [user@]hostname[:port])
torc remote add-workers-from-file
Add remote workers to a workflow from a file
Each line in the file should be a worker address. Lines starting with # are comments.
Usage: torc remote add-workers-from-file <WORKER_FILE> [WORKFLOW_ID]
Arguments:
- <WORKER_FILE> — Path to worker file listing remote machines
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
torc remote remove-worker
Remove a remote worker from a workflow
Usage: torc remote remove-worker <WORKER> [WORKFLOW_ID]
Arguments:
- <WORKER> — Worker address to remove
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
torc remote list-workers
List remote workers stored in the database for a workflow
Usage: torc remote list-workers [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
torc remote run
Run workers on remote machines via SSH
SSH into each stored worker and start a torc worker process. Workers run detached (via nohup) and survive SSH disconnection. Use add-workers first, or provide --workers to add and run in one step.
Usage: torc remote run [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to run (optional - will prompt if not provided)
Options:
- -w, --workers <WORKERS> — Path to worker file (optional - adds workers before running)
- -o, --output-dir <OUTPUT_DIR> — Output directory on remote machines (relative to home)
  Default value: torc_output
- --max-parallel-ssh <MAX_PARALLEL_SSH> — Maximum parallel SSH connections
  Default value: 10
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds for workers
  Default value: 5.0
- --max-parallel-jobs <MAX_PARALLEL_JOBS> — Maximum number of parallel jobs per worker
- --num-cpus <NUM_CPUS> — Number of CPUs per worker (auto-detect if not specified)
- --memory-gb <MEMORY_GB> — Memory in GB per worker (auto-detect if not specified)
- --num-gpus <NUM_GPUS> — Number of GPUs per worker (auto-detect if not specified)
- --skip-version-check — Skip version check (not recommended)
  Default value: false
torc remote status
Check status of remote workers
Usage: torc remote status [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
Options:
- --output-dir <OUTPUT_DIR> — Remote output directory (must match what was used in run)
  Default value: torc_output
- --max-parallel-ssh <MAX_PARALLEL_SSH> — Maximum parallel SSH connections
  Default value: 10
torc remote stop
Stop workers on remote machines
Usage: torc remote stop [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
Options:
- --output-dir <OUTPUT_DIR> — Remote output directory (must match what was used in run)
  Default value: torc_output
- --max-parallel-ssh <MAX_PARALLEL_SSH> — Maximum parallel SSH connections
  Default value: 10
- --force — Force kill (SIGKILL instead of SIGTERM)
  Default value: false
torc remote collect-logs
Collect logs from remote workers
Usage: torc remote collect-logs [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
Options:
- -l, --local-output-dir <LOCAL_OUTPUT_DIR> — Local directory to save collected logs
  Default value: remote_logs
- --remote-output-dir <REMOTE_OUTPUT_DIR> — Remote output directory (must match what was used in run)
  Default value: torc_output
- --max-parallel-ssh <MAX_PARALLEL_SSH> — Maximum parallel SSH connections
  Default value: 10
- --delete — Delete remote logs after successful collection
  Default value: false
torc remote delete-logs
Delete logs from remote workers
Removes the output directory from all remote workers. Use collect-logs --delete to safely collect before deleting.
Usage: torc remote delete-logs [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
Options:
- --remote-output-dir <REMOTE_OUTPUT_DIR> — Remote output directory to delete (must match what was used in run)
  Default value: torc_output
- --max-parallel-ssh <MAX_PARALLEL_SSH> — Maximum parallel SSH connections
  Default value: 10
torc scheduled-compute-nodes
Scheduled compute node management commands
Usage: torc scheduled-compute-nodes <COMMAND>
Subcommands:
- get — Get a scheduled compute node by ID
- list — List scheduled compute nodes for a workflow
- list-jobs — List jobs that ran under a scheduled compute node
torc scheduled-compute-nodes get
Get a scheduled compute node by ID
Usage: torc scheduled-compute-nodes get <ID>
Arguments:
- <ID> — ID of the scheduled compute node
torc scheduled-compute-nodes list
List scheduled compute nodes for a workflow
Usage: torc scheduled-compute-nodes list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List scheduled compute nodes for this workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of scheduled compute nodes to return
  Default value: 10000
- -o, --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- -s, --sort-by <SORT_BY> — Field to sort by
- -r, --reverse-sort — Reverse sort order
  Default value: false
- --scheduler-id <SCHEDULER_ID> — Filter by scheduler ID
- --scheduler-config-id <SCHEDULER_CONFIG_ID> — Filter by scheduler config ID
- --status <STATUS> — Filter by status
torc scheduled-compute-nodes list-jobs
List jobs that ran under a scheduled compute node
Usage: torc scheduled-compute-nodes list-jobs <ID>
Arguments:
- <ID> — ID of the scheduled compute node
torc hpc
HPC system profiles and partition information
Usage: torc hpc <COMMAND>
Subcommands:
- list — List known HPC system profiles
- detect — Detect the current HPC system
- show — Show details of an HPC profile
- partitions — Show partitions for an HPC profile
- match — Find partitions matching resource requirements
torc hpc list
List known HPC system profiles
Usage: torc hpc list
torc hpc detect
Detect the current HPC system
Usage: torc hpc detect
torc hpc show
Show details of an HPC profile
Usage: torc hpc show <NAME>
Arguments:
- <NAME> — Profile name (e.g., "kestrel")
torc hpc partitions
Show partitions for an HPC profile
Usage: torc hpc partitions [OPTIONS] [NAME]
Arguments:
- <NAME> — Profile name (e.g., "kestrel"). If not specified, tries to detect current system
Options:
- --gpu — Filter to GPU partitions only
- --cpu — Filter to CPU-only partitions
- --shared — Filter to shared partitions
torc hpc match
Find partitions matching resource requirements
Usage: torc hpc match [OPTIONS]
Options:
- --cpus <CPUS> — Number of CPUs required
  Default value: 1
- --memory <MEMORY> — Memory required (e.g., "100g", "512m", or MB as number)
  Default value: 1g
- --walltime <WALLTIME> — Wall time required (e.g., "4:00:00", "2-00:00:00")
  Default value: 1:00:00
- --gpus <GPUS> — Number of GPUs required
- --profile <PROFILE> — Profile name (if not specified, tries to detect current system)
torc reports
Generate reports and analytics
Usage: torc reports <COMMAND>
Subcommands:
- check-resource-utilization — Check resource utilization and report jobs that exceeded their specified requirements
- results — Generate a comprehensive JSON report of job results including all log file paths
- summary — Generate a summary of workflow results (requires workflow to be complete)
torc reports check-resource-utilization
Check resource utilization and report jobs that exceeded their specified requirements
Usage: torc reports check-resource-utilization [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to analyze (optional - will prompt if not provided)
Options:
- -r, --run-id <RUN_ID> — Run ID to analyze (optional - analyzes latest run if not provided)
- -a, --all — Show all jobs (default: only show jobs that exceeded requirements)
- --include-failed — Include failed and terminated jobs in the analysis (for recovery diagnostics)
torc reports results
Generate a comprehensive JSON report of job results including all log file paths
Usage: torc reports results [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to analyze (optional - will prompt if not provided)
Options:
- -o, --output-dir <OUTPUT_DIR> — Output directory (where job logs are stored, passed in torc run and torc submit)
  Default value: output
- --all-runs — Include all runs for each job (default: only latest run)
torc reports summary
Generate a summary of workflow results (requires workflow to be complete)
Usage: torc reports summary [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to summarize (optional - will prompt if not provided)
torc config
Manage configuration files and settings
Usage: torc config <COMMAND>
Subcommands:
- show — Show the effective configuration (merged from all sources)
- paths — Show configuration file paths
- init — Initialize a configuration file with defaults
- validate — Validate the current configuration
torc config show
Show the effective configuration (merged from all sources)
Usage: torc config show [OPTIONS]
Options:
- -f, --format <FORMAT> — Output format (toml or json)
  Default value: toml
torc config paths
Show configuration file paths
Usage: torc config paths
torc config init
Initialize a configuration file with defaults
Usage: torc config init [OPTIONS]
Options:
- --system — Create system-wide config (/etc/torc/config.toml)
- --user — Create user config (~/.config/torc/config.toml)
- --local — Create project-local config (./torc.toml)
- -f, --force — Force overwrite if file exists
torc config validate
Validate the current configuration
Usage: torc config validate
torc tui
Interactive terminal UI for managing workflows
Usage: torc tui [OPTIONS]
Options:
- --standalone — Start in standalone mode: automatically start a torc-server
- --port <PORT> — Port for the server in standalone mode
  Default value: 8080
- --database <DATABASE> — Database path for standalone mode
torc plot-resources
Generate interactive HTML plots from resource monitoring data
Usage: torc plot-resources [OPTIONS] <DB_PATHS>...
Arguments:
- <DB_PATHS> — Path to the resource metrics database file(s)
Options:
- -o, --output-dir <OUTPUT_DIR> — Output directory for generated plots (default: current directory)
  Default value: .
- -j, --job-ids <JOB_IDS> — Only plot specific job IDs (comma-separated)
- -p, --prefix <PREFIX> — Prefix for output filenames
  Default value: resource_plot
- -f, --format <FORMAT> — Output format: html or json
  Default value: html
torc completions
Generate shell completions
Usage: torc completions <SHELL>
Arguments:
- <SHELL> — The shell to generate completions for
  Possible values: bash, elvish, fish, powershell, zsh
This document was generated automatically by clap-markdown.
Workflow Specification Reference
This page documents all data models used in workflow specification files. Workflow specs can be written in YAML, JSON, JSON5, or KDL formats.
WorkflowSpec
The top-level container for a complete workflow definition.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the workflow |
user | string | current user | User who owns this workflow |
description | string | none | Description of the workflow |
parameters | map<string, string> | none | Shared parameters that can be used by jobs and files via use_parameters |
jobs | [JobSpec] | required | Jobs that make up this workflow |
files | [FileSpec] | none | Files associated with this workflow |
user_data | [UserDataSpec] | none | User data associated with this workflow |
resource_requirements | [ResourceRequirementsSpec] | none | Resource requirements available for this workflow |
failure_handlers | [FailureHandlerSpec] | none | Failure handlers available for this workflow |
slurm_schedulers | [SlurmSchedulerSpec] | none | Slurm schedulers available for this workflow |
slurm_defaults | SlurmDefaultsSpec | none | Default Slurm parameters to apply to all schedulers |
resource_monitor | ResourceMonitorConfig | none | Resource monitoring configuration |
actions | [WorkflowActionSpec] | none | Actions to execute based on workflow/job state transitions |
use_pending_failed | boolean | false | Use PendingFailed status for failed jobs (enables AI-assisted recovery) |
compute_node_expiration_buffer_seconds | integer | none | Shut down compute nodes this many seconds before expiration |
compute_node_wait_for_new_jobs_seconds | integer | none | Compute nodes wait for new jobs this long before exiting |
compute_node_ignore_workflow_completion | boolean | false | Compute nodes hold allocations even after workflow completes |
compute_node_wait_for_healthy_database_minutes | integer | none | Compute nodes wait this many minutes for database recovery |
jobs_sort_method | ClaimJobsSortMethod | none | Method for sorting jobs when claiming them |
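For orientation, here is a minimal sketch of a WorkflowSpec in YAML. Every field name comes from the table above; the workflow name, job names, and commands are purely illustrative:
name: example_workflow
description: Minimal two-step workflow
resource_requirements:
  - name: small
    num_cpus: 1
    memory: 2g
    runtime: PT30M
jobs:
  - name: step1
    command: bash step1.sh
    resource_requirements: small
  - name: step2
    command: bash step2.sh
    depends_on:
      - step1
    resource_requirements: small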
JobSpec
Defines a single computational task within a workflow.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the job |
command | string | required | Command to execute for this job |
invocation_script | string | none | Optional script for job invocation |
resource_requirements | string | none | Name of a ResourceRequirementsSpec to use |
failure_handler | string | none | Name of a FailureHandlerSpec to use |
scheduler | string | none | Name of the scheduler to use for this job |
cancel_on_blocking_job_failure | boolean | false | Cancel this job if a blocking job fails |
supports_termination | boolean | false | Whether this job supports graceful termination |
depends_on | [string] | none | Job names that must complete before this job runs (exact matches) |
depends_on_regexes | [string] | none | Regex patterns for job dependencies |
input_files | [string] | none | File names this job reads (exact matches) |
input_file_regexes | [string] | none | Regex patterns for input files |
output_files | [string] | none | File names this job produces (exact matches) |
output_file_regexes | [string] | none | Regex patterns for output files |
input_user_data | [string] | none | User data names this job reads (exact matches) |
input_user_data_regexes | [string] | none | Regex patterns for input user data |
output_user_data | [string] | none | User data names this job produces (exact matches) |
output_user_data_regexes | [string] | none | Regex patterns for output user data |
parameters | map<string, string> | none | Local parameters for generating multiple jobs |
parameter_mode | string | "product" | How to combine parameters: "product" (Cartesian) or "zip" |
use_parameters | [string] | none | Workflow parameter names to use for this job |
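As a sketch of how these fields combine (the job name, command, file regex, and the "medium" resource requirements name are illustrative), a JobSpec can mix file-based and explicit dependencies:
jobs:
  - name: aggregate
    command: python aggregate.py --out summary.json
    resource_requirements: medium
    input_file_regexes:
      - "result_.*"
    depends_on:
      - preprocess
    cancel_on_blocking_job_failure: true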
FileSpec
Defines input/output file artifacts that establish implicit job dependencies.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the file (used for referencing in jobs) |
path | string | required | File system path |
parameters | map<string, string> | none | Parameters for generating multiple files |
parameter_mode | string | "product" | How to combine parameters: "product" (Cartesian) or "zip" |
use_parameters | [string] | none | Workflow parameter names to use for this file |
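For example, a parameterized FileSpec (names and paths illustrative) expands into one file entry per parameter value, which jobs can then reference by name:
files:
  - name: result_{i}
    path: results/result_{i}.json
    parameters:
      i: "1:10"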
UserDataSpec
Arbitrary JSON data that can establish dependencies between jobs.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | none | Name of the user data (used for referencing in jobs) |
data | JSON | none | The data content as a JSON value |
is_ephemeral | boolean | false | Whether the user data is ephemeral |
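A small sketch (name and data values illustrative) of a UserDataSpec that jobs could consume through input_user_data:
user_data:
  - name: run_config
    is_ephemeral: false
    data:
      seed: 42
      tag: baseline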
ResourceRequirementsSpec
Defines compute resource requirements for jobs.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of this resource configuration (referenced by jobs) |
num_cpus | integer | required | Number of CPUs required |
memory | string | required | Memory requirement (e.g., "1m", "2g", "512k") |
num_gpus | integer | 0 | Number of GPUs required |
num_nodes | integer | 1 | Number of nodes required |
runtime | string | "PT1H" | Runtime limit in ISO8601 duration format (e.g., "PT30M", "PT2H") |
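For example (values illustrative), a ResourceRequirementsSpec that jobs reference by name through their resource_requirements field:
resource_requirements:
  - name: gpu_large
    num_cpus: 16
    memory: 64g
    num_gpus: 2
    num_nodes: 1
    runtime: PT4H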
FailureHandlerSpec
Defines error recovery strategies for jobs.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the failure handler (referenced by jobs) |
rules | [FailureHandlerRuleSpec] | required | Rules for handling different exit codes |
FailureHandlerRuleSpec
A single rule within a failure handler for handling specific exit codes.
| Name | Type | Default | Description |
|---|---|---|---|
exit_codes | [integer] | [] | Exit codes that trigger this rule |
match_all_exit_codes | boolean | false | If true, matches any non-zero exit code |
recovery_script | string | none | Optional script to run before retrying |
max_retries | integer | 3 | Maximum number of retry attempts |
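As a sketch (the handler name, exit code, and recovery script path are illustrative), a FailureHandlerSpec with two rules might look like this; a job opts in by setting failure_handler: retry_handler:
failure_handlers:
  - name: retry_handler
    rules:
      - exit_codes: [137]
        recovery_script: ./increase_memory.sh
        max_retries: 2
      - match_all_exit_codes: true
        max_retries: 1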
SlurmSchedulerSpec
Defines a Slurm HPC job scheduler configuration.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | none | Name of the scheduler (used for referencing) |
account | string | required | Slurm account |
partition | string | none | Slurm partition name |
nodes | integer | 1 | Number of nodes to allocate |
walltime | string | "01:00:00" | Wall time limit |
mem | string | none | Memory specification |
gres | string | none | Generic resources (e.g., GPUs) |
qos | string | none | Quality of service |
ntasks_per_node | integer | none | Number of tasks per node |
tmp | string | none | Temporary storage specification |
extra | string | none | Additional Slurm parameters |
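A sketch of a SlurmSchedulerSpec; the account, partition, and other values are placeholders for your site's settings:
slurm_schedulers:
  - name: gpu_nodes
    account: my_account
    partition: gpu
    nodes: 2
    walltime: "08:00:00"
    gres: "gpu:2"
    qos: normal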
SlurmDefaultsSpec
Workflow-level default parameters applied to all Slurm schedulers. This is a map of parameter names to values.
Any valid sbatch long option can be specified (without the leading --), except for parameters
managed by torc: partition, nodes, walltime, time, mem, gres, name, job-name.
The account parameter is allowed as a workflow-level default.
Example:
slurm_defaults:
qos: "high"
constraint: "cpu"
mail-user: "user@example.com"
mail-type: "END,FAIL"
WorkflowActionSpec
Defines conditional actions triggered by workflow or job state changes.
| Name | Type | Default | Description |
|---|---|---|---|
trigger_type | string | required | When to trigger: "on_workflow_start", "on_workflow_complete", "on_jobs_ready", "on_jobs_complete" |
action_type | string | required | What to do: "run_commands", "schedule_nodes" |
jobs | [string] | none | For job triggers: exact job names to match |
job_name_regexes | [string] | none | For job triggers: regex patterns to match job names |
commands | [string] | none | For run_commands: commands to execute |
scheduler | string | none | For schedule_nodes: scheduler name |
scheduler_type | string | none | For schedule_nodes: scheduler type ("slurm", "local") |
num_allocations | integer | none | For schedule_nodes: number of node allocations |
start_one_worker_per_node | boolean | none | For schedule_nodes: start one worker per allocated node |
max_parallel_jobs | integer | none | For schedule_nodes: maximum parallel jobs |
persistent | boolean | false | Whether the action persists and can be claimed by multiple workers |
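For example (the scheduler name and regex are illustrative), a WorkflowActionSpec that schedules Slurm nodes when matching jobs become ready:
actions:
  - trigger_type: on_jobs_ready
    action_type: schedule_nodes
    job_name_regexes:
      - "train_.*"
    scheduler: gpu_nodes
    scheduler_type: slurm
    num_allocations: 4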
ResourceMonitorConfig
Configuration for resource usage monitoring.
| Name | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable resource monitoring |
granularity | MonitorGranularity | "Summary" | Level of detail for metrics collection |
sample_interval_seconds | integer | 5 | Sampling interval in seconds |
generate_plots | boolean | false | Generate resource usage plots |
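For example, enabling time-series monitoring with a 10-second sample interval (values illustrative):
resource_monitor:
  enabled: true
  granularity: TimeSeries
  sample_interval_seconds: 10
  generate_plots: true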
MonitorGranularity
Enum specifying the level of detail for resource monitoring.
| Value | Description |
|---|---|
Summary | Collect summary statistics only |
TimeSeries | Collect detailed time series data |
ClaimJobsSortMethod
Enum specifying how jobs are sorted when being claimed by workers.
| Value | Description |
|---|---|
none | No sorting (default) |
gpus_runtime_memory | Sort by GPUs, then runtime, then memory |
gpus_memory_runtime | Sort by GPUs, then memory, then runtime |
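This is set at the workflow level through the jobs_sort_method field, for example:
jobs_sort_method: gpus_runtime_memory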
Parameter Formats
Parameters support several formats for generating multiple jobs or files:
| Format | Example | Description |
|---|---|---|
| Integer range | "1:100" | Inclusive range from 1 to 100 |
| Integer range with step | "0:100:10" | Range with step size |
| Float range | "0.0:1.0:0.1" | Float range with step |
| Integer list | "[1,5,10,100]" | Explicit list of integers |
| Float list | "[0.1,0.5,0.9]" | Explicit list of floats |
| String list | "['adam','sgd','rmsprop']" | Explicit list of strings |
Template substitution in strings:
- Basic: {param_name} - Replace with parameter value
- Formatted integer: {i:03d} - Zero-padded (001, 042, 100)
- Formatted float: {lr:.4f} - Precision (0.0010, 0.1000)
See the Job Parameterization reference for more details.
Job Parameterization
Parameterization allows creating multiple jobs/files from a single specification by expanding parameter ranges.
Parameter Formats
Integer Ranges
parameters:
i: "1:10" # Expands to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
i: "0:100:10" # Expands to [0, 10, 20, 30, ..., 90, 100] (with step)
Float Ranges
parameters:
lr: "0.0001:0.01:10" # 10 values from 0.0001 to 0.01 (log scale)
alpha: "0.0:1.0:0.1" # [0.0, 0.1, 0.2, ..., 0.9, 1.0]
Lists (Integer)
parameters:
batch_size: "[16,32,64,128]"
Lists (Float)
parameters:
threshold: "[0.1,0.5,0.9]"
Lists (String)
parameters:
optimizer: "['adam','sgd','rmsprop']"
dataset: "['train','test','validation']"
Template Substitution
Use parameter values in job/file specifications with {param_name} syntax:
Basic Substitution
jobs:
- name: job_{i}
command: python train.py --run={i}
parameters:
i: "1:5"
Expands to:
jobs:
- name: job_1
command: python train.py --run=1
- name: job_2
command: python train.py --run=2
# ... etc
Format Specifiers
Zero-padded integers:
jobs:
- name: job_{i:03d}
command: echo {i}
parameters:
i: "1:100"
Expands to: job_001, job_002, ..., job_100
Float precision:
jobs:
- name: train_lr{lr:.4f}
command: python train.py --lr={lr}
parameters:
lr: "[0.0001,0.001,0.01]"
Expands to: train_lr0.0001, train_lr0.0010, train_lr0.0100
Multiple decimals:
files:
- name: result_{threshold:.2f}
path: /results/threshold_{threshold:.2f}.csv
parameters:
threshold: "0.1:1.0:0.1"
Expands to: result_0.10, result_0.20, ..., result_1.00
Multi-Dimensional Parameterization
Use multiple parameters to create Cartesian products:
Example: Hyperparameter Sweep
jobs:
- name: train_lr{lr:.4f}_bs{batch_size}
command: |
python train.py \
--learning-rate={lr} \
--batch-size={batch_size}
parameters:
lr: "[0.0001,0.001,0.01]"
batch_size: "[16,32,64]"
This expands to 3 × 3 = 9 jobs:
- train_lr0.0001_bs16
- train_lr0.0001_bs32
- train_lr0.0001_bs64
- train_lr0.0010_bs16
- ... (9 total)
Example: Multi-Dataset Processing
jobs:
- name: process_{dataset}_rep{rep:02d}
command: python process.py --data={dataset} --replicate={rep}
parameters:
dataset: "['train','validation','test']"
rep: "1:5"
This expands to 3 × 5 = 15 jobs.
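For example: process_train_rep01, process_validation_rep03, and process_test_rep05.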
Parameterized Dependencies
Parameters work in dependency specifications:
jobs:
# Generate data for each configuration
- name: generate_{config}
command: python generate.py --config={config}
output_files:
- data_{config}
parameters:
config: "['A','B','C']"
# Process each generated dataset
- name: process_{config}
command: python process.py --input=data_{config}.pkl
input_files:
- data_{config}
depends_on:
- generate_{config}
parameters:
config: "['A','B','C']"
This creates 6 jobs with proper dependencies:
- generate_A → process_A
- generate_B → process_B
- generate_C → process_C
Parameterized Files and User Data
Files:
files:
- name: model_{run_id:03d}
path: /models/run_{run_id:03d}.pt
parameters:
run_id: "1:100"
User Data:
user_data:
- name: config_{experiment}
data:
experiment: "{experiment}"
learning_rate: 0.001
parameters:
experiment: "['baseline','ablation','full']"
Shared (Workflow-Level) Parameters
Define parameters once at the workflow level and reuse them across multiple jobs and files using
use_parameters:
Basic Usage
name: hyperparameter_sweep
parameters:
lr: "[0.0001,0.001,0.01]"
batch_size: "[16,32,64]"
optimizer: "['adam','sgd']"
jobs:
# Training jobs - inherit parameters via use_parameters
- name: train_lr{lr:.4f}_bs{batch_size}_opt{optimizer}
command: python train.py --lr={lr} --batch-size={batch_size} --optimizer={optimizer}
use_parameters:
- lr
- batch_size
- optimizer
# Aggregate results - also uses shared parameters
- name: aggregate_results
command: python aggregate.py
depends_on:
- train_lr{lr:.4f}_bs{batch_size}_opt{optimizer}
use_parameters:
- lr
- batch_size
- optimizer
files:
- name: model_lr{lr:.4f}_bs{batch_size}_opt{optimizer}
path: /models/model_lr{lr:.4f}_bs{batch_size}_opt{optimizer}.pt
use_parameters:
- lr
- batch_size
- optimizer
Benefits
- DRY (Don't Repeat Yourself) - Define parameter ranges once, use everywhere
- Consistency - Ensures all jobs use the same parameter values
- Maintainability - Change parameters in one place, affects all uses
- Selective inheritance - Jobs can choose which parameters to use
Selective Parameter Inheritance
Jobs don't have to use all workflow parameters:
parameters:
lr: "[0.0001,0.001,0.01]"
batch_size: "[16,32,64]"
dataset: "['train','validation']"
jobs:
# Only uses lr and batch_size (9 jobs)
- name: train_lr{lr:.4f}_bs{batch_size}
command: python train.py --lr={lr} --batch-size={batch_size}
use_parameters:
- lr
- batch_size
# Only uses dataset (2 jobs)
- name: prepare_{dataset}
command: python prepare.py --dataset={dataset}
use_parameters:
- dataset
Local Parameters Override Shared
Jobs can define local parameters that take precedence over workflow-level parameters:
parameters:
lr: "[0.0001,0.001,0.01]"
jobs:
# Uses workflow parameter (3 jobs)
- name: train_lr{lr:.4f}
command: python train.py --lr={lr}
use_parameters:
- lr
# Uses local override (2 jobs instead of 3)
- name: special_lr{lr:.4f}
command: python special.py --lr={lr}
parameters:
lr: "[0.01,0.1]" # Local override - ignores workflow's lr
KDL Syntax
parameters {
lr "[0.0001,0.001,0.01]"
batch_size "[16,32,64]"
}
job "train_lr{lr:.4f}_bs{batch_size}" {
command "python train.py --lr={lr} --batch-size={batch_size}"
use_parameters "lr" "batch_size"
}
JSON5 Syntax
{
parameters: {
lr: "[0.0001,0.001,0.01]",
batch_size: "[16,32,64]"
},
jobs: [
{
name: "train_lr{lr:.4f}_bs{batch_size}",
command: "python train.py --lr={lr} --batch-size={batch_size}",
use_parameters: ["lr", "batch_size"]
}
]
}
Parameter Modes
By default, when multiple parameters are specified, Torc generates the Cartesian product of all
parameter values. You can change this behavior using parameter_mode.
Product Mode (Default)
The default mode generates all possible combinations:
jobs:
- name: job_{a}_{b}
command: echo {a} {b}
parameters:
a: "[1, 2, 3]"
b: "['x', 'y', 'z']"
# parameter_mode: product # This is the default
This creates 3 × 3 = 9 jobs: job_1_x, job_1_y, job_1_z, job_2_x, etc.
Zip Mode
Use parameter_mode: zip to pair parameters element-wise (like Python's zip() function). All
parameter lists must have the same length.
jobs:
- name: train_{dataset}_{model}
command: python train.py --dataset={dataset} --model={model}
parameters:
dataset: "['cifar10', 'mnist', 'imagenet']"
model: "['resnet', 'cnn', 'transformer']"
parameter_mode: zip
This creates 3 jobs (not 9):
- train_cifar10_resnet
- train_mnist_cnn
- train_imagenet_transformer
When to use zip mode:
- Pre-determined parameter pairings (dataset A always uses model X)
- Corresponding input/output file pairs
- Parallel arrays where position matters
Error handling: If parameter lists have different lengths in zip mode, Torc will return an error:
All parameters must have the same number of values when using 'zip' mode.
Parameter 'dataset' has 3 values, but 'model' has 2 values.
KDL Syntax
job "train_{dataset}_{model}" {
command "python train.py --dataset={dataset} --model={model}"
parameters {
dataset "['cifar10', 'mnist', 'imagenet']"
model "['resnet', 'cnn', 'transformer']"
}
parameter_mode "zip"
}
JSON5 Syntax
{
name: "train_{dataset}_{model}",
command: "python train.py --dataset={dataset} --model={model}",
parameters: {
dataset: "['cifar10', 'mnist', 'imagenet']",
model: "['resnet', 'cnn', 'transformer']"
},
parameter_mode: "zip"
}
Best Practices
- Use descriptive parameter names - lr, not x; batch_size, not b
- Format numbers consistently - Use :03d for run IDs, :.4f for learning rates
- Keep parameter counts reasonable - 3×3×3 = 27 jobs is manageable, 10×10×10 = 1000 may overwhelm the system
- Match parameter ranges across related jobs - Use the same parameter values for generator and consumer jobs
- Consider parameter dependencies - Some parameter combinations may be invalid
- Prefer shared parameters for multi-job workflows - Use use_parameters to avoid repeating definitions
- Use selective inheritance - Only inherit the parameters each job actually needs
- Use zip mode for paired parameters - When parameters have a 1:1 correspondence, use parameter_mode: zip
Resource Requirements Reference
Technical reference for job resource specifications and allocation strategies.
Resource Requirements Fields
| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Identifier to reference from jobs |
num_cpus | integer | No | Number of CPU cores |
num_gpus | integer | No | Number of GPUs |
num_nodes | integer | No | Number of compute nodes |
memory | string | No | Memory allocation (see format below) |
runtime | string | No | Maximum runtime (ISO 8601 duration) |
Example
resource_requirements:
- name: small
num_cpus: 2
num_gpus: 0
num_nodes: 1
memory: 4g
runtime: PT30M
- name: large
num_cpus: 16
num_gpus: 2
num_nodes: 1
memory: 128g
runtime: PT8H
Memory Format
String format with unit suffix:
| Suffix | Unit | Example |
|---|---|---|
k | Kilobytes | 512k |
m | Megabytes | 512m |
g | Gigabytes | 16g |
Examples:
memory: 512m # 512 MB
memory: 1g # 1 GB
memory: 16g # 16 GB
Runtime Format
ISO 8601 duration format:
| Format | Description | Example |
|---|---|---|
PTnM | Minutes | PT30M (30 minutes) |
PTnH | Hours | PT2H (2 hours) |
PnD | Days | P1D (1 day) |
PnDTnH | Days and hours | P1DT12H (1.5 days) |
Examples:
runtime: PT10M # 10 minutes
runtime: PT4H # 4 hours
runtime: P1D # 1 day
runtime: P1DT12H # 1 day, 12 hours
Job Allocation Strategies
Resource-Based Allocation (Default)
The server considers each job's resource requirements and only returns jobs that fit within available compute node resources.
Behavior:
- Considers CPU, memory, and GPU requirements
- Prevents resource over-subscription
- Enables efficient packing of heterogeneous workloads
Configuration: Run without --max-parallel-jobs:
torc run $WORKFLOW_ID
Queue-Based Allocation
The server returns the next N ready jobs regardless of resource requirements.
Behavior:
- Ignores job resource requirements
- Only limits concurrent job count
- Simpler and faster (no resource calculation)
Configuration: Run with --max-parallel-jobs:
torc run $WORKFLOW_ID --max-parallel-jobs 10
Use cases:
- Homogeneous workloads where all jobs need similar resources
- Simple task queues
- When resource tracking overhead is not wanted
Resource Tracking
When using resource-based allocation, the job runner tracks:
| Resource | Description |
|---|---|
| CPUs | Number of CPU cores in use |
| Memory | Total memory allocated to running jobs |
| GPUs | Number of GPUs in use |
| Nodes | Number of jobs running per node |
Jobs are only started when sufficient resources are available.
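For example, if each ready job requires 2 CPUs and 4g of memory, a job runner with 8 CPUs and 32 GB available starts at most four of them at once: CPUs run out first (8 / 2 = 4), well before memory (32 / 4 = 8).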
Environment Variables
When Torc executes jobs, it automatically sets several environment variables that provide context about the job and enable communication with the Torc server. These variables are available to all job commands during execution.
Variables Set During Job Execution
TORC_WORKFLOW_ID
The unique identifier of the workflow that contains this job.
- Type: Integer (provided as string)
- Example: "42"
- Use case: Jobs can use this to query workflow information or to organize output files by workflow
# Example: Create a workflow-specific output directory
mkdir -p "/data/results/workflow_${TORC_WORKFLOW_ID}"
echo "Processing data..." > "/data/results/workflow_${TORC_WORKFLOW_ID}/output.txt"
TORC_JOB_ID
The unique identifier of the currently executing job.
- Type: Integer (provided as string)
- Example: "123"
- Use case: Jobs can use this for logging, creating job-specific output files, or querying job metadata
# Example: Log job-specific information
echo "Job ${TORC_JOB_ID} started at $(date)" >> "/var/log/torc/job_${TORC_JOB_ID}.log"
TORC_API_URL
The URL of the Torc API server that the job runner is communicating with.
- Type: String (URL)
- Example: "http://localhost:8080/torc-service/v1"
- Use case: Jobs can make API calls to the Torc server to query data, create files, update user data, or perform other operations
# Example: Query workflow information from within a job
curl -s "${TORC_API_URL}/workflows/${TORC_WORKFLOW_ID}" | jq '.name'
# Example: Create a file entry in Torc
curl -X POST "${TORC_API_URL}/files" \
-H "Content-Type: application/json" \
-d "{
\"workflow_id\": ${TORC_WORKFLOW_ID},
\"name\": \"result_${TORC_JOB_ID}\",
\"path\": \"/data/results/output.txt\"
}"
TORC_JOB_NAME
The name of the currently executing job as defined in the workflow specification.
- Type: String
- Example: "train_model"
- Use case: Jobs can use this for logging or creating human-readable output file names
# Example: Log with job name
echo "[${TORC_JOB_NAME}] Processing started at $(date)"
TORC_OUTPUT_DIR
The output directory where job logs and artifacts are stored.
- Type: String (path)
- Example: "/path/to/output"
- Use case: Jobs can write additional output files to this directory alongside the standard stdout/stderr logs
# Example: Write job artifacts to output directory
cp results.json "${TORC_OUTPUT_DIR}/job_${TORC_JOB_ID}_results.json"
TORC_ATTEMPT_ID
The current attempt number for this job execution. Starts at 1 and increments with each retry when using failure handlers.
- Type: Integer (provided as string)
- Example: "1" (first attempt), "2" (first retry), etc.
- Use case: Jobs can adjust behavior based on retry attempt, or include attempt information in logs
# Example: Log attempt information
echo "Running attempt ${TORC_ATTEMPT_ID} of job ${TORC_JOB_NAME}"
# Example: Adjust behavior on retry
if [ "${TORC_ATTEMPT_ID}" -gt 1 ]; then
echo "This is a retry - using more conservative settings"
BATCH_SIZE=16
else
BATCH_SIZE=64
fi
Variables Set During Recovery Script Execution
When a job fails and has a failure handler configured, Torc may run a recovery script before retrying the job. Recovery scripts receive all the standard job environment variables plus additional context about the failure.
TORC_RETURN_CODE
The exit code from the failed job that triggered the recovery script. Only available in recovery scripts, not during normal job execution.
- Type: Integer (provided as string)
- Example: "137" (OOM killed), "1" (general error)
- Use case: Recovery scripts can inspect the exit code to determine appropriate recovery actions
# Example: Recovery script that handles different exit codes
#!/bin/bash
echo "Job ${TORC_JOB_NAME} failed with exit code ${TORC_RETURN_CODE}"
case ${TORC_RETURN_CODE} in
137)
echo "Out of memory - reducing batch size for retry"
# Modify config for next attempt
;;
139)
echo "Segmentation fault - checking for corrupted data"
# Clean up corrupted files
;;
*)
echo "Unknown error - attempting general recovery"
;;
esac
exit 0 # Exit 0 to proceed with retry, non-zero to abort
Complete Example
Here's a complete example of a job that uses the environment variables:
name: "Environment Variables Demo"
user: "demo"
jobs:
- name: "example_job"
command: |
#!/bin/bash
set -e
echo "=== Job Environment ==="
echo "Workflow ID: ${TORC_WORKFLOW_ID}"
echo "Job ID: ${TORC_JOB_ID}"
echo "Job Name: ${TORC_JOB_NAME}"
echo "Attempt: ${TORC_ATTEMPT_ID}"
echo "Output Dir: ${TORC_OUTPUT_DIR}"
echo "API URL: ${TORC_API_URL}"
# Create job-specific output directory
OUTPUT_DIR="/tmp/workflow_${TORC_WORKFLOW_ID}/job_${TORC_JOB_ID}"
mkdir -p "${OUTPUT_DIR}"
# Do some work
echo "Processing data..." > "${OUTPUT_DIR}/status.txt"
date >> "${OUTPUT_DIR}/status.txt"
echo "Job completed successfully!"
Summary Table
| Variable | Type | Available In | Description |
|---|---|---|---|
TORC_WORKFLOW_ID | Integer | Jobs, Recovery Scripts | Workflow identifier |
TORC_JOB_ID | Integer | Jobs, Recovery Scripts | Job identifier |
TORC_JOB_NAME | String | Jobs, Recovery Scripts | Job name from workflow spec |
TORC_API_URL | URL | Jobs, Recovery Scripts | Torc server API endpoint |
TORC_OUTPUT_DIR | Path | Jobs, Recovery Scripts | Output directory for logs/artifacts |
TORC_ATTEMPT_ID | Integer | Jobs, Recovery Scripts | Current attempt number (1, 2, 3...) |
TORC_RETURN_CODE | Integer | Recovery Scripts only | Exit code that triggered recovery |
Notes
- All environment variables are set as strings, even numeric values like workflow and job IDs
- The TORC_API_URL includes the full base path to the API (e.g., /torc-service/v1)
- Jobs inherit all other environment variables from the job runner process
- These variables are available in both local and Slurm-scheduled job executions
- TORC_ATTEMPT_ID starts at 1 for the first execution and increments with each retry
- TORC_RETURN_CODE is only available in recovery scripts, not during normal job execution
Resource Monitoring Reference
Technical reference for Torc's resource monitoring system.
Configuration Options
The resource_monitor section in workflow specifications accepts the following fields:
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | true | Enable or disable monitoring |
granularity | string | "summary" | "summary" or "time_series" |
sample_interval_seconds | integer | 5 | Seconds between resource samples |
generate_plots | boolean | false | Reserved for future use |
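For example, a minimal sketch enabling detailed collection; the sample interval is chosen for illustration:
resource_monitor:
  enabled: true
  granularity: "time_series"
  sample_interval_seconds: 10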
Granularity Modes
Summary mode ("summary"):
- Stores only peak and average values per job
- Metrics stored in the main database results table
- Minimal storage overhead
Time series mode ("time_series"):
- Stores samples at regular intervals
- Creates separate SQLite database per workflow run
- Database location:
<output_dir>/resource_utilization/resource_metrics_<hostname>_<workflow_id>_<run_id>.db
Sample Interval Guidelines
| Job Duration | Recommended Interval |
|---|---|
| < 1 hour | 1-2 seconds |
| 1-4 hours | 5 seconds (default) |
| > 4 hours | 10-30 seconds |
Time Series Database Schema
job_resource_samples Table
| Column | Type | Description |
|---|---|---|
id | INTEGER | Primary key |
job_id | INTEGER | Torc job ID |
timestamp | REAL | Unix timestamp |
cpu_percent | REAL | CPU utilization percentage |
memory_bytes | INTEGER | Memory usage in bytes |
num_processes | INTEGER | Process count including children |
job_metadata Table
| Column | Type | Description |
|---|---|---|
job_id | INTEGER | Primary key, Torc job ID |
job_name | TEXT | Human-readable job name |
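Assuming the schema above, peak memory per job can be summarized directly with the sqlite3 command-line tool; the database filename below is a placeholder following the naming pattern shown earlier:
# Peak memory (MiB) per job from a time-series database
sqlite3 output/resource_utilization/resource_metrics_node01_1_1.db \
  "SELECT m.job_name, MAX(s.memory_bytes) / 1048576.0 AS peak_mib
   FROM job_resource_samples s
   JOIN job_metadata m ON m.job_id = s.job_id
   GROUP BY s.job_id;"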
Summary Metrics in Results
When using summary mode, the following fields are added to job results:
| Field | Type | Description |
|---|---|---|
peak_cpu_percent | float | Maximum CPU percentage observed |
avg_cpu_percent | float | Average CPU percentage |
peak_memory_gb | float | Maximum memory in GB |
avg_memory_gb | float | Average memory in GB |
check-resource-utilization JSON Output
When using --format json:
{
"workflow_id": 123,
"run_id": null,
"total_results": 10,
"over_utilization_count": 3,
"violations": [
{
"job_id": 15,
"job_name": "train_model",
"resource_type": "Memory",
"specified": "8.00 GB",
"peak_used": "10.50 GB",
"over_utilization": "+31.3%"
}
]
}
| Field | Description |
|---|---|
workflow_id | Workflow being analyzed |
run_id | Specific run ID if provided, otherwise null for latest |
total_results | Total number of completed jobs analyzed |
over_utilization_count | Number of violations found |
violations | Array of violation details |
Violation Object
| Field | Description |
|---|---|
job_id | Job ID with violation |
job_name | Human-readable job name |
resource_type | "Memory", "CPU", or "Runtime" |
specified | Resource requirement from workflow spec |
peak_used | Actual peak usage observed |
over_utilization | Percentage over/under specification |
plot-resources Output Files
| File | Description |
|---|---|
resource_plot_job_<id>.html | Per-job timeline with CPU, memory, process count |
resource_plot_cpu_all_jobs.html | CPU comparison across all jobs |
resource_plot_memory_all_jobs.html | Memory comparison across all jobs |
resource_plot_summary.html | Bar chart dashboard of peak vs average |
All plots are self-contained HTML files using Plotly.js with:
- Interactive hover tooltips
- Zoom and pan controls
- Legend toggling
- Export options (PNG, SVG)
Monitored Metrics
| Metric | Unit | Description |
|---|---|---|
| CPU percentage | % | Total CPU utilization across all cores |
| Memory usage | bytes | Resident memory consumption |
| Process count | count | Number of processes in job's process tree |
Process Tree Tracking
The monitoring system automatically tracks child processes spawned by jobs. When a job creates worker processes (e.g., Python multiprocessing), all descendants are included in the aggregated metrics.
Performance Characteristics
- Single background monitoring thread regardless of job count
- Typical overhead: <1% CPU even with 1-second sampling
- Uses native OS APIs via the
sysinfocrate - Non-blocking async design
Configuration Reference
Complete reference for Torc configuration options.
Configuration Sources
Torc loads configuration from multiple sources in this order (later sources override earlier):
- Built-in defaults (lowest priority)
- System config: /etc/torc/config.toml
- User config: ~/.config/torc/config.toml (platform-dependent)
- Project config: ./torc.toml
- Environment variables: TORC_* prefix
- CLI arguments (highest priority)
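For example, assuming the user config file sets client.log_level = "info", an environment variable overrides it for a single invocation, and a CLI flag would take precedence over both:
# Overrides client.log_level from the config file for this invocation only
TORC_CLIENT__LOG_LEVEL=debug torc config show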
Configuration Commands
torc config show # Show effective configuration
torc config show --format json # Show as JSON
torc config paths # Show configuration file locations
torc config init --user # Create user config file
torc config init --local # Create project config file
torc config init --system # Create system config file
torc config validate # Validate current configuration
Client Configuration
Settings for the torc CLI.
[client] Section
| Option | Type | Default | Description |
|---|---|---|---|
api_url | string | http://localhost:8080/torc-service/v1 | Torc server API URL |
format | string | table | Output format: table or json |
log_level | string | info | Log level: error, warn, info, debug, trace |
username | string | (none) | Username for basic authentication |
[client.run] Section
Settings for torc run command.
| Option | Type | Default | Description |
|---|---|---|---|
poll_interval | float | 5.0 | Job completion poll interval (seconds) |
output_dir | path | output | Output directory for job logs |
max_parallel_jobs | int | (none) | Maximum parallel jobs (overrides resource-based) |
num_cpus | int | (none) | Available CPUs for resource-based scheduling |
memory_gb | float | (none) | Available memory (GB) for resource-based scheduling |
num_gpus | int | (none) | Available GPUs for resource-based scheduling |
Example
[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table"
log_level = "info"
username = "myuser"
[client.run]
poll_interval = 5.0
output_dir = "output"
max_parallel_jobs = 4
num_cpus = 8
memory_gb = 32.0
num_gpus = 1
[client.hpc] Section
Settings for HPC profile system (used by torc hpc and torc slurm commands).
| Option | Type | Default | Description |
|---|---|---|---|
profile_overrides | table | {} | Override settings for built-in HPC profiles |
custom_profiles | table | {} | Define custom HPC profiles |
[client.hpc.profile_overrides.<profile>] Section
Override settings for built-in profiles (e.g., kestrel).
| Option | Type | Default | Description |
|---|---|---|---|
default_account | string | (none) | Default Slurm account for this profile |
[client.hpc.custom_profiles.<name>] Section
Define a custom HPC profile.
| Option | Type | Required | Description |
|---|---|---|---|
display_name | string | No | Human-readable name |
description | string | No | Profile description |
detect_env_var | string | No | Environment variable for detection (NAME=value) |
detect_hostname | string | No | Regex pattern for hostname detection |
default_account | string | No | Default Slurm account |
partitions | array | Yes | List of partition configurations |
[[client.hpc.custom_profiles.<name>.partitions]] Section
Define partitions for a custom profile.
| Option | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Partition name |
cpus_per_node | int | Yes | CPU cores per node |
memory_mb | int | Yes | Memory per node in MB |
max_walltime_secs | int | Yes | Maximum walltime in seconds |
gpus_per_node | int | No | GPUs per node |
gpu_type | string | No | GPU model (e.g., "H100") |
shared | bool | No | Whether partition supports shared jobs |
min_nodes | int | No | Minimum required nodes |
requires_explicit_request | bool | No | Must be explicitly requested |
HPC Example
[client.hpc.profile_overrides.kestrel]
default_account = "my_default_account"
[client.hpc.custom_profiles.mycluster]
display_name = "My Research Cluster"
description = "Internal research HPC system"
detect_env_var = "MY_CLUSTER=research"
default_account = "default_project"
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "compute"
cpus_per_node = 64
memory_mb = 256000
max_walltime_secs = 172800
shared = false
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 128000
max_walltime_secs = 86400
gpus_per_node = 4
gpu_type = "A100"
shared = false
Server Configuration
Settings for torc-server.
[server] Section
| Option | Type | Default | Description |
|---|---|---|---|
log_level | string | info | Log level |
https | bool | false | Enable HTTPS |
url | string | localhost | Hostname/IP to bind to |
port | int | 8080 | Port to listen on |
threads | int | 1 | Number of worker threads |
database | string | (none) | SQLite database path (falls back to DATABASE_URL env) |
auth_file | string | (none) | Path to htpasswd file |
require_auth | bool | false | Require authentication for all requests |
enforce_access_control | bool | false | Enforce access control based on workflow ownership |
admin_users | string array | [] | Users to add to the admin group |
completion_check_interval_secs | float | 30.0 | Background job processing interval |
[server.logging] Section
| Option | Type | Default | Description |
|---|---|---|---|
log_dir | path | (none) | Directory for log files (enables file logging) |
json_logs | bool | false | Use JSON format for log files |
Example
[server]
url = "0.0.0.0"
port = 8080
threads = 4
database = "/var/lib/torc/torc.db"
auth_file = "/etc/torc/htpasswd"
require_auth = true
enforce_access_control = true
admin_users = ["alice", "bob"]
completion_check_interval_secs = 30.0
log_level = "info"
https = false
[server.logging]
log_dir = "/var/log/torc"
json_logs = false
Dashboard Configuration
Settings for torc-dash.
[dash] Section
| Option | Type | Default | Description |
|---|---|---|---|
host | string | 127.0.0.1 | Hostname/IP to bind to |
port | int | 8090 | Port to listen on |
api_url | string | http://localhost:8080/torc-service/v1 | Torc server API URL |
torc_bin | string | torc | Path to torc CLI binary |
torc_server_bin | string | torc-server | Path to torc-server binary |
standalone | bool | false | Auto-start torc-server |
server_port | int | 0 | Server port for standalone mode (0 = auto) |
database | string | (none) | Database path for standalone mode |
completion_check_interval_secs | int | 5 | Completion check interval (standalone mode) |
Example
[dash]
host = "0.0.0.0"
port = 8090
api_url = "http://localhost:8080/torc-service/v1"
torc_bin = "/usr/local/bin/torc"
torc_server_bin = "/usr/local/bin/torc-server"
standalone = true
server_port = 0
completion_check_interval_secs = 5
Environment Variables
Environment variables use double underscore (__) to separate nested keys.
Client Variables
| Variable | Maps To |
|---|---|
TORC_CLIENT__API_URL | client.api_url |
TORC_CLIENT__FORMAT | client.format |
TORC_CLIENT__LOG_LEVEL | client.log_level |
TORC_CLIENT__USERNAME | client.username |
TORC_CLIENT__RUN__POLL_INTERVAL | client.run.poll_interval |
TORC_CLIENT__RUN__OUTPUT_DIR | client.run.output_dir |
TORC_CLIENT__RUN__MAX_PARALLEL_JOBS | client.run.max_parallel_jobs |
TORC_CLIENT__RUN__NUM_CPUS | client.run.num_cpus |
TORC_CLIENT__RUN__MEMORY_GB | client.run.memory_gb |
TORC_CLIENT__RUN__NUM_GPUS | client.run.num_gpus |
Server Variables
| Variable | Maps To |
|---|---|
TORC_SERVER__URL | server.url |
TORC_SERVER__PORT | server.port |
TORC_SERVER__THREADS | server.threads |
TORC_SERVER__DATABASE | server.database |
TORC_SERVER__AUTH_FILE | server.auth_file |
TORC_SERVER__REQUIRE_AUTH | server.require_auth |
TORC_SERVER__ENFORCE_ACCESS_CONTROL | server.enforce_access_control |
TORC_SERVER__LOG_LEVEL | server.log_level |
TORC_SERVER__COMPLETION_CHECK_INTERVAL_SECS | server.completion_check_interval_secs |
TORC_SERVER__LOGGING__LOG_DIR | server.logging.log_dir |
TORC_SERVER__LOGGING__JSON_LOGS | server.logging.json_logs |
Dashboard Variables
| Variable | Maps To |
|---|---|
TORC_DASH__HOST | dash.host |
TORC_DASH__PORT | dash.port |
TORC_DASH__API_URL | dash.api_url |
TORC_DASH__STANDALONE | dash.standalone |
Legacy Variables
These environment variables are still supported directly by clap:
| Variable | Component | Description |
|---|---|---|
TORC_API_URL | Client | Server API URL (CLI only) |
TORC_PASSWORD | Client | Authentication password (CLI only) |
TORC_AUTH_FILE | Server | htpasswd file path |
TORC_LOG_DIR | Server | Log directory |
TORC_COMPLETION_CHECK_INTERVAL_SECS | Server | Completion check interval |
TORC_ADMIN_USERS | Server | Comma-separated list of admin usernames |
DATABASE_URL | Server | SQLite database URL |
RUST_LOG | All | Log level filter |
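For example, these can be exported in a shell session to point the CLI at a different server and raise log verbosity without touching any config file; the hostname is a placeholder:
export TORC_API_URL="http://hpc-login1:8080/torc-service/v1"
export RUST_LOG=debug
torc run $WORKFLOW_ID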
Complete Example
# ~/.config/torc/config.toml
[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table"
log_level = "info"
username = "developer"
[client.run]
poll_interval = 5.0
output_dir = "output"
num_cpus = 8
memory_gb = 32.0
num_gpus = 1
[server]
log_level = "info"
https = false
url = "localhost"
port = 8080
threads = 4
database = "/var/lib/torc/torc.db"
auth_file = "/etc/torc/htpasswd"
require_auth = true
enforce_access_control = true
admin_users = ["alice", "bob"]
completion_check_interval_secs = 30.0
[server.logging]
log_dir = "/var/log/torc"
json_logs = false
[dash]
host = "127.0.0.1"
port = 8090
api_url = "http://localhost:8080/torc-service/v1"
torc_bin = "torc"
torc_server_bin = "torc-server"
standalone = false
server_port = 0
completion_check_interval_secs = 5
See Also
OpenAPI Specification
The Torc server implements an HTTP API defined in api/openapi.yaml. All endpoints are prefixed with
/torc-service/v1.
Core Endpoints
Workflows
Create Workflow
# curl
curl -X POST http://localhost:8080/torc-service/v1/workflows \
-H "Content-Type: application/json" \
-d '{
"name": "test_workflow",
"user": "alice",
"description": "Test workflow"
}' | jq '.'
# nushell
http post http://localhost:8080/torc-service/v1/workflows {
name: "test_workflow"
user: "alice"
description: "Test workflow"
}
Response:
{
"id": 1,
"name": "test_workflow",
"user": "alice",
"description": "Test workflow",
"timestamp": 1699000000.0
}
List Workflows
# curl with jq
curl "http://localhost:8080/torc-service/v1/workflows?offset=0&limit=10" | jq '.workflows'
# nushell (native JSON parsing)
http get "http://localhost:8080/torc-service/v1/workflows?offset=0&limit=10" | get workflows
Get Workflow
# curl
curl http://localhost:8080/torc-service/v1/workflows/1 | jq '.'
# nushell
http get http://localhost:8080/torc-service/v1/workflows/1
Initialize Jobs
# curl
curl -X POST http://localhost:8080/torc-service/v1/workflows/1/initialize_jobs \
-H "Content-Type: application/json" \
-d '{"reinitialize": false, "ignore_missing_data": false}' | jq '.'
# nushell
http post http://localhost:8080/torc-service/v1/workflows/1/initialize_jobs {
reinitialize: false
ignore_missing_data: false
}
Jobs
Create Job
# curl
curl -X POST http://localhost:8080/torc-service/v1/jobs \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"name": "job1",
"command": "echo hello",
"resource_requirements_id": 1,
"input_file_ids": [],
"output_file_ids": [],
"depends_on_job_ids": []
}' | jq '.'
List Jobs
# curl - filter by status
curl "http://localhost:8080/torc-service/v1/jobs?workflow_id=1&status=ready" \
| jq '.jobs[] | {name, status, id}'
# nushell - filter and format
http get "http://localhost:8080/torc-service/v1/jobs?workflow_id=1"
| get jobs
| where status == "ready"
| select name status id
Update Job Status
# curl
curl -X POST http://localhost:8080/torc-service/v1/jobs/1/manage_status_change \
-H "Content-Type: application/json" \
-d '{"target_status": "running"}' | jq '.'
Files
Create File
# curl
curl -X POST http://localhost:8080/torc-service/v1/files \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"name": "input_data",
"path": "/data/input.csv"
}' | jq '.'
List Files
curl "http://localhost:8080/torc-service/v1/files?workflow_id=1" | jq '.files'
User Data
Create User Data
curl -X POST http://localhost:8080/torc-service/v1/user_data \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"name": "config",
"data": {"learning_rate": 0.001, "batch_size": 32}
}' | jq '.'
Update User Data
curl -X PUT http://localhost:8080/torc-service/v1/user_data/1 \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"name": "config",
"data": {"learning_rate": 0.01, "batch_size": 64}
}' | jq '.'
Resource Requirements
Create Resource Requirements
curl -X POST http://localhost:8080/torc-service/v1/resource_requirements \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"name": "gpu_large",
"num_cpus": 16,
"num_gpus": 4,
"num_nodes": 1,
"memory": "128g",
"runtime": "PT8H"
}' | jq '.'
Memory Format: String with suffix: 1m (MB), 2g (GB), 512k (KB)
Runtime Format: ISO 8601 duration: PT30M (30 minutes), PT2H (2 hours), P1DT12H (1.5 days)
Compute Nodes
Create Compute Node
curl -X POST http://localhost:8080/torc-service/v1/compute_nodes \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"hostname": "compute-01",
"num_cpus": 32,
"memory": "256g",
"num_gpus": 8,
"is_active": true
}' | jq '.'
List Active Compute Nodes
curl "http://localhost:8080/torc-service/v1/compute_nodes?workflow_id=1&is_active=true" \
| jq '.compute_nodes[] | {hostname, num_cpus, num_gpus}'
Results
Create Result
curl -X POST http://localhost:8080/torc-service/v1/results \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"job_id": 1,
"exit_code": 0,
"stdout": "Job completed successfully",
"stderr": ""
}' | jq '.'
Events
List Events
curl "http://localhost:8080/torc-service/v1/events?workflow_id=1&limit=20" \
| jq '.events[] | {timestamp, data}'
Advanced Endpoints
Prepare Next Jobs for Submission (Job Runner)
curl -X POST "http://localhost:8080/torc-service/v1/workflows/1/claim_next_jobs?num_jobs=5" \
-H "Content-Type: application/json" \
-d '{}' | jq '.jobs'
Process Changed Job Inputs (Reinitialization)
curl -X POST "http://localhost:8080/torc-service/v1/workflows/1/process_changed_job_inputs?dry_run=true" \
-H "Content-Type: application/json" \
-d '{}' | jq '.reinitialized_jobs'
HPC & Slurm
Running Torc workflows on HPC clusters with Slurm.
- Submit a Workflow to Slurm - Running workflows on HPC clusters
- Slurm Workflows - Configuring Slurm-based workflows
- Debugging Slurm Workflows - Troubleshooting Slurm issues
- Working with Slurm - Slurm integration details
- HPC Profiles - Pre-configured HPC settings
- HPC Profiles Reference - Available profiles
- HPC Deployment - Deploying Torc on HPC systems
- Custom HPC Profile - Creating custom profiles
How to Submit a Workflow to Slurm
Submit a workflow specification to a Slurm-based HPC system with automatic scheduler generation.
Quick Start
torc submit-slurm --account <your-account> workflow.yaml
Torc will:
- Detect your HPC system (e.g., NREL Kestrel, Eagle)
- Match job requirements to appropriate partitions
- Generate Slurm scheduler configurations
- Submit everything for execution
Preview Before Submitting
Always preview the generated configuration first:
torc slurm generate --account <your-account> workflow.yaml
This shows the Slurm schedulers and workflow actions that would be created without submitting.
Requirements
Your workflow must define resource requirements for jobs:
name: my_workflow
resource_requirements:
- name: standard
num_cpus: 4
memory: 8g
runtime: PT1H
jobs:
- name: process_data
command: python process.py
resource_requirements: standard
Options
# See all options
torc submit-slurm --help
See Also
- Slurm Workflows — Full Slurm integration guide
- HPC Profiles — Available HPC system configurations
Slurm Workflows
This document explains how Torc simplifies running workflows on Slurm-based HPC systems. The key insight is that you don't need to understand Slurm schedulers or workflow actions to run workflows on HPC systems—Torc handles this automatically.
The Simple Approach
Running a workflow on Slurm requires just two things:
- Define your jobs with resource requirements
- Submit with submit-slurm
That's it. Torc will analyze your workflow, generate appropriate Slurm configurations, and submit everything for execution.
⚠️ Important: The submit-slurm command uses heuristics to auto-generate Slurm schedulers and workflow actions. For complex workflows with unusual dependency patterns, the generated configuration may not be optimal, resulting in poor allocation timing. Always preview the configuration first using torc slurm generate (see Previewing Generated Configuration) before submitting production workflows.
Example Workflow
Here's a complete workflow specification that runs on Slurm:
name: data_analysis_pipeline
description: Analyze experimental data with preprocessing, training, and evaluation
resource_requirements:
- name: light
num_cpus: 4
memory: 8g
runtime: PT30M
- name: compute
num_cpus: 32
memory: 64g
runtime: PT2H
- name: gpu
num_cpus: 16
num_gpus: 2
memory: 128g
runtime: PT4H
jobs:
- name: preprocess
command: python preprocess.py --input data/ --output processed/
resource_requirements: light
- name: train_model
command: python train.py --data processed/ --output model/
resource_requirements: gpu
depends_on: [preprocess]
- name: evaluate
command: python evaluate.py --model model/ --output results/
resource_requirements: compute
depends_on: [train_model]
- name: generate_report
command: python report.py --results results/
resource_requirements: light
depends_on: [evaluate]
Submitting the Workflow
torc submit-slurm --account myproject workflow.yaml
Torc will:
- Detect which HPC system you're on (e.g., NREL Kestrel)
- Match each job's requirements to appropriate partitions
- Generate Slurm scheduler configurations
- Create workflow actions that stage resource allocation based on dependencies
- Submit the workflow for execution
How It Works
When you use submit-slurm, Torc performs intelligent analysis of your workflow:
1. Per-Job Scheduler Generation
Each job gets its own Slurm scheduler configuration based on its resource requirements. This means:
- Jobs are matched to the most appropriate partition
- Memory, CPU, and GPU requirements are correctly specified
- Walltime is set to the partition's maximum (explained below)
2. Staged Resource Allocation
Torc analyzes job dependencies and creates staged workflow actions:
- Jobs without dependencies trigger on_workflow_start — resources are allocated immediately
- Jobs with dependencies trigger on_jobs_ready — resources are allocated only when the job becomes ready to run
This prevents wasting allocation time on resources that aren't needed yet. For example, in the workflow above:
- preprocess resources are allocated at workflow start
- train_model resources are allocated when preprocess completes
- evaluate resources are allocated when train_model completes
- generate_report resources are allocated when evaluate completes
3. Walltime Calculation
By default, Torc sets the walltime to 1.5× your longest job's runtime (capped at the partition's maximum). This provides headroom for jobs that run slightly longer than expected.
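For example, if the longest job's runtime is PT4H, the default 1.5× multiplier yields a 6-hour walltime request (06:00:00), unless that exceeds the partition's maximum.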
You can customize this behavior:
- --walltime-strategy max-job-runtime (default): Uses longest job runtime × multiplier
- --walltime-strategy max-partition-time: Uses the partition's maximum walltime
- --walltime-multiplier 2.0: Change the safety multiplier (default: 1.5)
See Walltime Strategy Options for details.
4. HPC Profile Knowledge
Torc includes built-in knowledge of HPC systems like NREL Kestrel, including:
- Available partitions and their resource limits
- GPU configurations
- Memory and CPU specifications
- Special requirements (e.g., minimum node counts for high-bandwidth partitions)
Using an unsupported HPC? Please request built-in support so everyone benefits. You can also create a custom profile for immediate use.
Resource Requirements Specification
Resource requirements are the key to the simplified workflow. Define them once and reference them from jobs:
resource_requirements:
- name: small
num_cpus: 4
num_gpus: 0
num_nodes: 1
memory: 8g
runtime: PT1H
- name: gpu_training
num_cpus: 32
num_gpus: 4
num_nodes: 1
memory: 256g
runtime: PT8H
Fields
| Field | Description | Example |
|---|---|---|
name | Reference name for jobs | "compute" |
num_cpus | CPU cores required | 32 |
num_gpus | GPUs required (0 if none) | 2 |
num_nodes | Nodes required | 1 |
memory | Memory with unit suffix | "64g", "512m" |
runtime | ISO8601 duration | "PT2H", "PT30M" |
Runtime Format
Use ISO8601 duration format:
- PT30M — 30 minutes
- PT2H — 2 hours
- PT1H30M — 1 hour 30 minutes
- P1D — 1 day
- P2DT4H — 2 days 4 hours
Job Dependencies
Define dependencies explicitly or implicitly through file/data relationships:
Explicit Dependencies
jobs:
- name: step1
command: ./step1.sh
resource_requirements: small
- name: step2
command: ./step2.sh
resource_requirements: small
depends_on: [step1]
- name: step3
command: ./step3.sh
resource_requirements: small
depends_on: [step1, step2] # Waits for both
Implicit Dependencies (via Files)
files:
- name: raw_data
path: /data/raw.csv
- name: processed_data
path: /data/processed.csv
jobs:
- name: process
command: python process.py
input_files: [raw_data]
output_files: [processed_data]
resource_requirements: compute
- name: analyze
command: python analyze.py
input_files: [processed_data] # Creates implicit dependency on 'process'
resource_requirements: compute
Previewing Generated Configuration
Recommended Practice: Always preview the generated configuration before submitting to Slurm, especially for complex workflows. This allows you to verify that schedulers and actions are appropriate for your workflow structure.
Viewing the Execution Plan
Before generating schedulers, visualize how your workflow will execute in stages:
torc workflows execution-plan workflow.yaml
This shows the execution stages, which jobs run at each stage, and (if schedulers are defined) when Slurm allocations are requested. See Visualizing Workflow Structure for detailed examples.
Generating Slurm Configuration
Preview what Torc will generate:
torc slurm generate --account myproject --profile kestrel workflow.yaml
This outputs the complete workflow with generated schedulers and actions; an example of the generated output is shown after the option descriptions below.
Scheduler Grouping Options
By default, Torc creates one scheduler per unique resource_requirements name. This means if
you have three jobs with three different resource requirement definitions (e.g., cpu, memory,
mixed), you get three schedulers—even if all three would fit on the same partition.
The --group-by option controls how jobs are grouped into schedulers:
# Default: one scheduler per resource_requirements name
torc slurm generate --account myproject workflow.yaml
torc slurm generate --account myproject --group-by resource-requirements workflow.yaml
# Result: 3 schedulers (cpu_scheduler, memory_scheduler, mixed_scheduler)
# Group by partition: one scheduler per partition
torc slurm generate --account myproject --group-by partition workflow.yaml
# Result: 1 scheduler (short_scheduler) if all jobs fit on the "short" partition
When to use --group-by partition:
- Your workflow has many small resource requirement definitions that all fit on the same partition
- You want to minimize Slurm queue overhead by reducing the number of allocations
- Jobs have similar characteristics and can share nodes efficiently
When to use --group-by resource-requirements (default):
- Jobs have significantly different resource profiles that benefit from separate allocations
- You want fine-grained control over which jobs share resources
- You're debugging and want clear separation between job types
When grouping by partition, the scheduler uses the maximum resource values from all grouped requirements (max memory, max CPUs, max runtime, etc.) to ensure all jobs can run.
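For example, grouping a requirement of 4 CPUs and 8g of memory with one of 16 CPUs and 4g produces a single scheduler sized for 16 CPUs and 8g, with the walltime taken from the longer of the two runtimes.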
Walltime Strategy Options
The --walltime-strategy option controls how Torc calculates the walltime for generated schedulers:
# Default: use max job runtime with a safety multiplier (1.5x)
torc slurm generate --account myproject workflow.yaml
torc slurm generate --account myproject --walltime-strategy max-job-runtime workflow.yaml
# Use the partition's maximum allowed walltime
torc slurm generate --account myproject --walltime-strategy max-partition-time workflow.yaml
Walltime strategies:
| Strategy | Description |
|---|---|
max-job-runtime | Uses the longest job's runtime × multiplier (default: 1.5x). Capped at partition max. |
max-partition-time | Uses the partition's maximum walltime. More conservative but may impact queue scheduling. |
Customizing the multiplier:
The --walltime-multiplier option (default: 1.5) provides a safety margin when using
max-job-runtime:
# Use 2x the max job runtime for extra buffer
torc slurm generate --account myproject --walltime-multiplier 2.0 workflow.yaml
# Use exact job runtime (no buffer - use with caution)
torc slurm generate --account myproject --walltime-multiplier 1.0 workflow.yaml
When to use max-job-runtime (default):
- You want better queue scheduling (shorter walltime requests often get prioritized)
- Your job runtime estimates are reasonably accurate
- You prefer the Torc runner to exit early rather than holding idle allocations
When to use max-partition-time:
- Your job runtimes are highly variable or unpredictable
- You consistently underestimate job runtimes
- Queue priority is not a concern
The generated output looks similar to this:
name: data_analysis_pipeline
# ... original content ...
jobs:
- name: preprocess
command: python preprocess.py --input data/ --output processed/
resource_requirements: light
scheduler: preprocess_scheduler
# ... more jobs ...
slurm_schedulers:
- name: preprocess_scheduler
account: myproject
mem: 8g
nodes: 1
walltime: "04:00:00"
- name: train_model_scheduler
account: myproject
mem: 128g
nodes: 1
gres: "gpu:2"
walltime: "04:00:00"
# ... more schedulers ...
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: preprocess_scheduler
scheduler_type: slurm
num_allocations: 1
- trigger_type: on_jobs_ready
action_type: schedule_nodes
jobs: [train_model]
scheduler: train_model_scheduler
scheduler_type: slurm
num_allocations: 1
# ... more actions ...
Save the output to inspect or modify before submission:
torc slurm generate --account myproject workflow.yaml -o workflow_with_schedulers.yaml
Torc Server Considerations
The Torc server must be accessible to compute nodes. Options include:
- Shared server (Recommended): A team member allocates a dedicated server in the HPC environment
- Login node: Suitable for small workflows with few, long-running jobs
For large workflows with many short jobs, a dedicated server prevents overloading login nodes.
Best Practices
1. Focus on Resource Requirements
Spend time accurately defining resource requirements. Torc handles the rest:
resource_requirements:
# Be specific about what each job type needs
- name: io_heavy
num_cpus: 4
memory: 32g # High memory for data loading
runtime: PT1H
- name: compute_heavy
num_cpus: 64
memory: 16g # Less memory, more CPU
runtime: PT4H
2. Use Meaningful Names
Name resource requirements by their purpose, not by partition:
# Good - describes the workload
resource_requirements:
- name: data_preprocessing
- name: model_training
- name: inference
# Avoid - ties you to specific infrastructure
resource_requirements:
- name: short_partition
- name: gpu_h100
3. Group Similar Jobs
Jobs with similar requirements can share resource requirement definitions:
resource_requirements:
- name: quick_task
num_cpus: 2
memory: 4g
runtime: PT15M
jobs:
- name: validate_input
command: ./validate.sh
resource_requirements: quick_task
- name: check_output
command: ./check.sh
resource_requirements: quick_task
depends_on: [main_process]
4. Test Locally First
Validate your workflow logic locally before submitting to HPC:
# Run locally (without Slurm)
torc run workflow.yaml
# Then submit to HPC
torc submit-slurm --account myproject workflow.yaml
Limitations and Caveats
The auto-generation in submit-slurm uses heuristics that work well for common workflow patterns
but may not be optimal for all cases:
When Auto-Generation Works Well
- Linear pipelines: A → B → C → D
- Fan-out patterns: One job unblocks many (e.g., preprocess → 100 work jobs)
- Fan-in patterns: Many jobs unblock one (e.g., 100 work jobs → postprocess)
- Simple DAGs: Clear dependency structures with distinct resource tiers
When to Use Manual Configuration
Consider using torc slurm generate to preview and manually adjust, or define schedulers manually,
when:
- Complex dependency graphs: Multiple interleaved dependency patterns
- Shared schedulers: You want multiple jobs to share the same Slurm allocation
- Custom timing: Specific requirements for when allocations should be requested
- Resource optimization: Fine-tuning to minimize allocation waste
- Multi-node jobs: Jobs requiring coordination across multiple nodes
What Could Go Wrong
Without previewing, auto-generation might:
- Request allocations too early: Wasting queue time waiting for dependencies
- Request allocations too late: Adding latency to job startup
- Create suboptimal scheduler groupings: Not sharing allocations when beneficial
- Miss optimization opportunities: Not recognizing patterns that could share resources
Best Practice: For production workflows, always run torc slurm generate first, review the
output, and submit the reviewed configuration with torc submit.
Advanced: Manual Scheduler Configuration
For advanced users who need fine-grained control, you can define schedulers and actions manually. See Working with Slurm for details.
Common reasons for manual configuration:
- Non-standard partition requirements
- Custom Slurm directives (e.g., --constraint)
- Multi-node jobs with specific topology requirements
- Reusing allocations across multiple jobs for efficiency
Troubleshooting
"No partition found for job"
Your resource requirements exceed what's available. Check:
- Memory doesn't exceed partition limits
- Runtime doesn't exceed partition walltime
- GPU count is available on GPU partitions
Use torc hpc partitions <profile> to see available resources.
Jobs Not Starting
Ensure the Torc server is accessible from compute nodes:
# From a compute node
curl $TORC_API_URL/health
Wrong Partition Selected
Use torc hpc match to see which partitions match your requirements:
torc hpc match kestrel --cpus 32 --memory 64g --walltime 2h --gpus 2
See Also
- Visualizing Workflow Structure — Execution plans and DAG visualization
- HPC Profiles — Detailed HPC profile usage
- Working with Slurm — Advanced Slurm configuration
- Resource Requirements Reference — Complete specification
- Workflow Actions — Understanding actions
Debugging Slurm Workflows
When running workflows on Slurm clusters, Torc provides additional debugging tools specifically designed for Slurm environments. This guide covers Slurm-specific debugging techniques and tools.
For general debugging concepts and tools that apply to all workflows, see Debugging Workflows.
Overview
Slurm workflows generate additional log files beyond the standard job logs:
- Slurm stdout/stderr: Output from Slurm's perspective (job allocation, environment setup)
- Slurm environment logs: All SLURM environment variables captured at job runner startup
- dmesg logs: Kernel message buffer captured when the Slurm job runner exits
These logs help diagnose issues specific to the cluster environment, such as resource allocation failures, node problems, and system-level errors.
Slurm Log File Structure
For jobs executed via Slurm scheduler (compute_node_type: "slurm"), the debug report includes
these additional log paths:
{
"job_stdout": "output/job_stdio/job_wf1_j456_r1.o",
"job_stderr": "output/job_stdio/job_wf1_j456_r1.e",
"job_runner_log": "output/job_runner_slurm_wf1_sl12345_n0_pid67890.log",
"slurm_stdout": "output/slurm_output_wf1_sl12345.o",
"slurm_stderr": "output/slurm_output_wf1_sl12345.e",
"slurm_env_log": "output/slurm_env_wf1_sl12345_n0_pid67890.log",
"dmesg_log": "output/dmesg_slurm_wf1_sl12345_n0_pid67890.log"
}
All Slurm log files include the workflow ID (wf<id>) prefix, making it easy to identify and
collect logs for a specific workflow.
Log File Descriptions
- slurm_stdout (output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.o):
  - Standard output from Slurm's perspective
  - Includes Slurm environment setup, job allocation info
  - Use for: Debugging Slurm job submission issues
- slurm_stderr (output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.e):
  - Standard error from Slurm's perspective
  - Contains Slurm-specific errors (allocation failures, node issues)
  - Use for: Investigating Slurm scheduler problems
- job_runner_log (output/job_runner_slurm_wf<id>_sl<slurm_job_id>_n<node>_pid<pid>.log):
  - Log output from the Torc Slurm job runner process
  - Contains job execution details, status updates, and runner-level errors
  - Use for: Debugging job runner issues, understanding job execution flow
- slurm_env_log (output/slurm_env_wf<id>_sl<slurm_job_id>_n<node_id>_pid<task_pid>.log):
  - All SLURM environment variables captured at job runner startup
  - Contains job allocation details, resource limits, node assignments
  - Use for: Verifying Slurm job configuration, debugging resource allocation issues
- dmesg_log (output/dmesg_slurm_wf<id>_sl<slurm_job_id>_n<node_id>_pid<task_pid>.log):
  - Kernel message buffer captured when the Slurm job runner exits (only on failure)
  - Contains system-level events: OOM killer activity, hardware errors, kernel panics
  - Use for: Investigating job failures caused by system-level issues (e.g., out-of-memory kills, hardware failures)
Note: All Slurm log files include the workflow ID, Slurm job ID, node ID, and task PID in the filename for easy filtering and correlation with Slurm's own logs.
Parsing Slurm Log Files for Errors
The torc slurm parse-logs command scans Slurm stdout/stderr log files for known error patterns and
correlates them with affected Torc jobs:
# Parse logs for a specific workflow
torc slurm parse-logs <workflow_id>
# Specify custom output directory
torc slurm parse-logs <workflow_id> --output-dir /path/to/output
# Output as JSON for programmatic processing
torc slurm parse-logs <workflow_id> --format json
Detected Error Patterns
The command detects common Slurm failure patterns including:
Memory Errors:
- out of memory, oom-kill, cannot allocate memory
- memory cgroup out of memory, Exceeded job memory limit
- task/cgroup: .*: Killed
- std::bad_alloc (C++), MemoryError (Python)
Slurm-Specific Errors:
- slurmstepd: error:, srun: error:
- DUE TO TIME LIMIT, DUE TO PREEMPTION
- NODE_FAIL, FAILED, CANCELLED
- Exceeded.*step.*limit
GPU/CUDA Errors:
- CUDA out of memory, CUDA error, GPU memory.*exceeded
Signal/Crash Errors:
- Segmentation fault, SIGSEGV
- Bus error, SIGBUS
- killed by signal, core dumped
Python Errors:
- Traceback (most recent call last)
- ModuleNotFoundError, ImportError
File System Errors:
- No space left on device, Disk quota exceeded
- Read-only file system, Permission denied
Network Errors:
- Connection refused, Connection timed out, Network is unreachable
Example Output
Table format:
Slurm Log Analysis Results
==========================
Found 2 error(s) in log files:
╭─────────────────────────────┬──────────────┬──────┬─────────────────────────────┬──────────┬──────────────────────────────╮
│ File │ Slurm Job ID │ Line │ Pattern │ Severity │ Affected Torc Jobs │
├─────────────────────────────┼──────────────┼──────┼─────────────────────────────┼──────────┼──────────────────────────────┤
│ slurm_output_sl12345.e │ 12345 │ 42 │ Out of Memory (OOM) Kill │ critical │ process_data (ID: 456) │
│ slurm_output_sl12346.e │ 12346 │ 15 │ CUDA out of memory │ error │ train_model (ID: 789) │
╰─────────────────────────────┴──────────────┴──────┴─────────────────────────────┴──────────┴──────────────────────────────╯
Viewing Slurm Accounting Data
The torc slurm sacct command displays a summary of Slurm job accounting data for all scheduled
compute nodes in a workflow:
# Display sacct summary table for a workflow
torc slurm sacct <workflow_id>
# Also save full JSON files for detailed analysis
torc slurm sacct <workflow_id> --save-json --output-dir /path/to/output
# Output as JSON for programmatic processing
torc slurm sacct <workflow_id> --format json
Summary Table Fields
The command displays a summary table with key metrics:
- Slurm Job: The Slurm job ID
- Job Step: Name of the job step (e.g., "worker_1", "batch")
- State: Job state (COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, etc.)
- Exit Code: Exit code of the job step
- Elapsed: Wall clock time for the job step
- Max RSS: Maximum resident set size (memory usage)
- CPU Time: Total CPU time consumed
- Nodes: Compute nodes used
Example Output
Slurm Accounting Summary for Workflow 123
╭────────────┬───────────┬───────────┬───────────┬─────────┬─────────┬──────────┬─────────╮
│ Slurm Job │ Job Step │ State │ Exit Code │ Elapsed │ Max RSS │ CPU Time │ Nodes │
├────────────┼───────────┼───────────┼───────────┼─────────┼─────────┼──────────┼─────────┤
│ 12345 │ worker_1 │ COMPLETED │ 0 │ 2h 15m │ 4.5GB │ 4h 30m │ node01 │
│ 12345 │ batch │ COMPLETED │ 0 │ 2h 16m │ 128.0MB │ 1m 30s │ node01 │
│ 12346 │ worker_1 │ FAILED │ 1 │ 45m 30s │ 8.2GB │ 1h 30m │ node02 │
╰────────────┴───────────┴───────────┴───────────┴─────────┴─────────┴──────────┴─────────╯
Total: 3 job steps
Saving Full JSON Output
Use --save-json to save full sacct JSON output to files for detailed analysis:
torc slurm sacct 123 --save-json --output-dir output
# Creates: output/sacct_12345.json, output/sacct_12346.json, etc.
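To scan every saved file in one pass, a small shell loop works. This sketch reuses the .jobs[].steps[].tres.requested path shown later in this guide; the exact JSON layout depends on your Slurm version:
# Print requested TRES for every step in each saved sacct file
for f in output/sacct_*.json; do
  echo "== $f =="
  jq '.jobs[].steps[].tres.requested' "$f"
done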
Viewing Slurm Logs in torc-dash
The torc-dash web interface provides two ways to view Slurm logs:
Debugging Tab - Slurm Log Analysis
The Debugging tab includes a "Slurm Log Analysis" section:
- Navigate to the Debugging tab
- Find the Slurm Log Analysis section
- Enter the output directory path (default: output)
- Click Analyze Slurm Logs
The results show all detected errors with their Slurm job IDs, line numbers, error patterns, severity levels, and affected Torc jobs.
Debugging Tab - Slurm Accounting Data
The Debugging tab also includes a "Slurm Accounting Data" section:
- Navigate to the Debugging tab
- Find the Slurm Accounting Data section
- Click Collect sacct Data
This displays a summary table showing job state, exit codes, elapsed time, memory usage (Max RSS), CPU time, and nodes for all Slurm job steps. The table helps quickly identify failed jobs and resource usage patterns.
Scheduled Nodes Tab - View Slurm Logs
You can view individual Slurm job logs directly from the Details view:
- Select a workflow
- Go to the Details tab
- Switch to the Scheduled Nodes sub-tab
- Find a Slurm scheduled node in the table
- Click the View Logs button in the Logs column
This opens a modal with tabs for viewing the Slurm job's stdout and stderr files.
Viewing Slurm Logs in the TUI
The torc tui terminal interface also supports Slurm log viewing:
- Launch the TUI: torc tui
- Select a workflow and press Enter to load details
- Press Tab to switch to the Scheduled Nodes tab
- Navigate to a Slurm scheduled node using arrow keys
- Press l to view the Slurm job's logs
The log viewer shows:
- stdout tab: Slurm job standard output (slurm_output_wf<id>_sl<slurm_job_id>.o)
- stderr tab: Slurm job standard error (slurm_output_wf<id>_sl<slurm_job_id>.e)
Use Tab to switch between stdout/stderr, arrow keys to scroll, / to search, and q to close.
Debugging Slurm Job Failures
When a Slurm job fails, follow this debugging workflow:
1. Parse logs for known errors:
   torc slurm parse-logs <workflow_id>
2. If OOM or resource issues are detected, collect sacct data:
   torc slurm sacct <workflow_id>
   cat output/sacct_<slurm_job_id>.json | jq '.jobs[].steps[].tres.requested'
3. View the specific Slurm log files:
   - Use torc-dash: Details → Scheduled Nodes → View Logs
   - Or use the TUI: Scheduled Nodes tab → press l
   - Or directly: cat output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.e
4. Check the job's own stderr for application errors:
   torc reports results <workflow_id> > report.json
   jq -r '.results[] | select(.return_code != 0) | .job_stderr' report.json | xargs cat
5. Review dmesg logs for system-level issues:
   cat output/dmesg_slurm_wf<workflow_id>_sl<slurm_job_id>_*.log
Orphaned Jobs and Status Synchronization
When a Slurm allocation terminates unexpectedly (e.g., due to timeout, node failure, or admin intervention), jobs may become "orphaned" - stuck in "running" status in Torc's database even though no process is actually executing them.
Detecting Orphaned Jobs
Common signs of orphaned jobs:
- Jobs remain in "running" status long after the Slurm allocation ended
- torc recover reports "there are active Slurm allocations" but squeue shows none
- Workflow appears stuck but no Slurm jobs are actually running
Synchronizing Status with Slurm
The torc workflows sync-status command detects and fixes orphaned jobs by checking the actual
Slurm state:
# Preview what would be cleaned up (recommended first)
torc workflows sync-status <workflow_id> --dry-run
# Clean up orphaned jobs
torc workflows sync-status <workflow_id>
# Get JSON output for scripting
torc -f json workflows sync-status <workflow_id>
This command:
- Checks each "active" scheduled compute node against squeue
- If Slurm reports the job is no longer running, marks associated Torc jobs as failed
- Updates scheduled compute node status to "complete"
- Also handles "pending" allocations that were cancelled before starting
Example Output
Synchronizing job statuses for workflow 42...
Cleaned up orphaned jobs:
- 3 job(s) from terminated Slurm allocations
- 1 pending allocation(s) that no longer exist in Slurm
Affected jobs:
- Job 107 (train_model_7): Allocation terminated (Slurm job 12345)
- Job 112 (train_model_12): Allocation terminated (Slurm job 12345)
- Job 123 (train_model_23): Allocation terminated (Slurm job 12345)
Total: 3 job(s) marked as failed
You can now run `torc recover 42` to retry failed jobs.
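The JSON output is convenient for scripting. As one illustration, a periodic dry-run check from a login-node shell (the interval and dry-run-only pattern are illustrative, not required):
# Report orphaned jobs every 10 minutes without changing anything
while true; do
  torc workflows sync-status <workflow_id> --dry-run
  sleep 600
done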
Automatic Cleanup in Recovery
The torc recover command automatically performs orphan detection as its first step, so you
typically don't need to run sync-status manually before recovery. However, sync-status is useful
when:
- You want to clean up orphaned jobs without triggering a full recovery
- You want to preview what recover would clean up (using --dry-run)
- You're debugging why recover reports active allocations
Common Slurm Issues and Solutions
Out of Memory (OOM) Kills
Symptoms:
- torc slurm parse-logs shows "Out of Memory (OOM) Kill"
- Job exits with signal 9 (SIGKILL)
- dmesg log shows "oom-kill" entries
Solutions:
- Increase memory request in job specification
- Check torc slurm sacct output for actual memory usage (Max RSS); see the sketch after this list
- Consider splitting the job into smaller chunks
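In practice, the quickest loop is to confirm the actual peak usage and then let recovery raise the limit. A minimal sketch (the 2.0 multiplier is an illustrative choice; the default is 1.5, as described under Automatic Failure Recovery):
# Confirm Max RSS for the failed step, then preview and apply a larger memory multiplier
torc slurm sacct <workflow_id>
torc recover <workflow_id> --dry-run --memory-multiplier 2.0
torc recover <workflow_id> --memory-multiplier 2.0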
Time Limit Exceeded
Symptoms:
- torc slurm parse-logs shows "DUE TO TIME LIMIT"
- Job state in sacct shows "TIMEOUT"
Solutions:
- Increase runtime in job specification
- Check if job is stuck (review stdout for progress)
- Consider optimizing the job or splitting into phases
Node Failures
Symptoms:
- torc slurm parse-logs shows "NODE_FAIL"
- Job may have completed partially
Solutions:
- Reinitialize workflow to retry failed jobs
- Check cluster status with sinfo
- Review dmesg logs for hardware issues
GPU/CUDA Errors
Symptoms:
- torc slurm parse-logs shows "CUDA out of memory" or "CUDA error"
Solutions:
- Reduce batch size or model size
- Check GPU memory with nvidia-smi in the job script (see the sketch after this list)
- Ensure the correct CUDA version is loaded
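A low-effort safeguard is to log GPU memory at the start of the job script so failures can be correlated with device state (the flags below are standard nvidia-smi options):
# Record GPU name plus total and used memory before the workload starts
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv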
Related Commands
- torc slurm parse-logs: Parse Slurm logs for known error patterns
- torc slurm sacct: Collect Slurm accounting data for workflow jobs
- torc workflows sync-status: Detect and fix orphaned jobs from terminated Slurm allocations
- torc reports results: Generate debug report with all log file paths
- torc results list: View summary of job results in table format
- torc-dash: Launch web interface with Slurm log viewing
- torc tui: Launch terminal UI with Slurm log viewing
See Also
- Debugging Workflows — General debugging tools and workflows
- Working with Logs — Bundling and analyzing logs
Working with Slurm
This guide covers advanced Slurm configuration for users who need fine-grained control over their HPC workflows.
For most users: See Slurm Workflows for the recommended approach using
torc submit-slurm. You don't need to manually configure schedulers or actions—Torc handles this automatically.
When to Use Manual Configuration
Manual Slurm configuration is useful when you need:
- Custom Slurm directives (e.g., --constraint, --exclusive)
- Multi-node jobs with specific topology requirements
- Shared allocations across multiple jobs for efficiency
- Non-standard partition configurations
- Fine-tuned control over allocation timing
Torc Server Requirements
The Torc server must be accessible from compute nodes:
- External server (Recommended): A team member allocates a shared server in the HPC environment. This is recommended if your operations team provides this capability.
- Login node: Suitable for small workflows. The server runs single-threaded by default. If you have many thousands of short jobs, check with your operations team about resource limits.
Manual Scheduler Configuration
Defining Slurm Schedulers
Define schedulers in your workflow specification:
slurm_schedulers:
- name: standard
account: my_project
nodes: 1
walltime: "12:00:00"
partition: compute
mem: 64G
- name: gpu_nodes
account: my_project
nodes: 1
walltime: "08:00:00"
partition: gpu
gres: "gpu:4"
mem: 256G
Scheduler Fields
| Field | Description | Required |
|---|---|---|
name | Scheduler identifier | Yes |
account | Slurm account/allocation | Yes |
nodes | Number of nodes | Yes |
walltime | Time limit (HH:MM:SS or D-HH:MM:SS) | Yes |
partition | Slurm partition | No |
mem | Memory per node | No |
gres | Generic resources (e.g., GPUs) | No |
qos | Quality of Service | No |
ntasks_per_node | Tasks per node | No |
tmp | Temporary disk space | No |
extra | Additional sbatch arguments | No |
Defining Workflow Actions
Actions trigger scheduler allocations:
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: standard
scheduler_type: slurm
num_allocations: 1
- trigger_type: on_jobs_ready
action_type: schedule_nodes
jobs: [train_model]
scheduler: gpu_nodes
scheduler_type: slurm
num_allocations: 2
Action Trigger Types
| Trigger | Description |
|---|---|
on_workflow_start | Fires when workflow is submitted |
on_jobs_ready | Fires when specified jobs become ready |
on_jobs_complete | Fires when specified jobs complete |
on_workflow_complete | Fires when all jobs complete |
Assigning Jobs to Schedulers
Reference schedulers in job definitions:
jobs:
- name: preprocess
command: ./preprocess.sh
scheduler: standard
- name: train
command: python train.py
scheduler: gpu_nodes
depends_on: [preprocess]
Scheduling Strategies
Strategy 1: Many Single-Node Allocations
Submit multiple Slurm jobs, each with its own Torc worker:
slurm_schedulers:
- name: work_scheduler
account: my_account
nodes: 1
walltime: "04:00:00"
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: work_scheduler
scheduler_type: slurm
num_allocations: 10
When to use:
- Jobs have diverse resource requirements
- Want independent time limits per job
- Cluster has low queue wait times
Benefits:
- Maximum scheduling flexibility
- Independent time limits per allocation
- Fault isolation
Drawbacks:
- More Slurm queue overhead
- Multiple jobs to schedule
Strategy 2: Multi-Node Allocation, One Worker Per Node
Launch multiple workers within a single allocation:
slurm_schedulers:
- name: work_scheduler
account: my_account
nodes: 10
walltime: "04:00:00"
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: work_scheduler
scheduler_type: slurm
num_allocations: 1
start_one_worker_per_node: true
When to use:
- Many jobs with similar requirements
- Want faster queue scheduling (larger jobs often prioritized)
Benefits:
- Single queue wait
- Often prioritized by Slurm scheduler
Drawbacks:
- Shared time limit for all workers
- Less flexibility
Strategy 3: Single Worker Per Allocation
One Torc worker handles all nodes:
slurm_schedulers:
- name: work_scheduler
account: my_account
nodes: 10
walltime: "04:00:00"
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: work_scheduler
scheduler_type: slurm
num_allocations: 1
When to use:
- Your application manages node coordination
- Need full control over compute resources
Staged Allocations
For pipelines with distinct phases, stage allocations to avoid wasted resources:
slurm_schedulers:
- name: preprocess_sched
account: my_project
nodes: 2
walltime: "01:00:00"
- name: compute_sched
account: my_project
nodes: 20
walltime: "08:00:00"
- name: postprocess_sched
account: my_project
nodes: 1
walltime: "00:30:00"
actions:
# Preprocessing starts immediately
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: preprocess_sched
scheduler_type: slurm
num_allocations: 1
# Compute nodes allocated when compute jobs are ready
- trigger_type: on_jobs_ready
action_type: schedule_nodes
jobs: [compute_step]
scheduler: compute_sched
scheduler_type: slurm
num_allocations: 1
start_one_worker_per_node: true
# Postprocessing allocated when those jobs are ready
- trigger_type: on_jobs_ready
action_type: schedule_nodes
jobs: [postprocess]
scheduler: postprocess_sched
scheduler_type: slurm
num_allocations: 1
Note: The torc submit-slurm command handles this automatically by analyzing job dependencies.
Custom Slurm Directives
Use the extra field for additional sbatch arguments:
slurm_schedulers:
- name: exclusive_nodes
account: my_project
nodes: 4
walltime: "04:00:00"
extra: "--exclusive --constraint=skylake"
Submitting Workflows
With Manual Configuration
# Submit workflow with pre-defined schedulers and actions
torc submit workflow.yaml
Scheduling Additional Nodes
Add more allocations to a running workflow:
torc slurm schedule-nodes -n 5 $WORKFLOW_ID
Debugging
Check Slurm Job Status
squeue --me
View Torc Worker Logs
Workers log to the Slurm output file. Check:
cat slurm-<jobid>.out
Verify Server Connectivity
From a compute node:
curl $TORC_API_URL/health
See Also
- Slurm Workflows — Simplified workflow approach
- HPC Profiles — Automatic partition matching
- Workflow Actions — Action system details
- Debugging Slurm Workflows — Troubleshooting guide
Working with HPC Profiles
HPC (High-Performance Computing) profiles provide pre-configured knowledge about specific HPC systems, including their partitions, resource limits, and optimal settings. Torc uses this information to automatically match job requirements to appropriate partitions.
Overview
HPC profiles contain:
- Partition definitions: Available queues with their resource limits (CPUs, memory, walltime, GPUs)
- Detection rules: How to identify when you're on a specific HPC system
- Default settings: Account names and other system-specific defaults
Built-in profiles are available for systems like NREL's Kestrel. You can also define custom profiles for private clusters.
Listing Available Profiles
View all known HPC profiles:
torc hpc list
Example output:
Known HPC profiles:
╭─────────┬──────────────┬────────────┬──────────╮
│ Name │ Display Name │ Partitions │ Detected │
├─────────┼──────────────┼────────────┼──────────┤
│ kestrel │ NREL Kestrel │ 15 │ ✓ │
╰─────────┴──────────────┴────────────┴──────────╯
The "Detected" column shows if Torc recognizes you're currently on that system.
Detecting the Current System
Torc can automatically detect which HPC system you're on:
torc hpc detect
Detection works through environment variables. For example, NREL Kestrel is detected when
NREL_CLUSTER=kestrel is set.
Viewing Profile Details
See detailed information about a specific profile:
torc hpc show kestrel
This displays:
- Profile name and description
- Detection method
- Default account (if configured)
- Number of partitions
Viewing Available Partitions
List all partitions for a profile:
torc hpc partitions kestrel
Example output:
Partitions for kestrel:
╭──────────┬─────────────┬───────────┬─────────────────┬─────────────────╮
│ Name │ CPUs/Node │ Mem/Node │ Max Walltime │ GPUs │
├──────────┼─────────────┼───────────┼─────────────────┼─────────────────┤
│ debug │ 104 │ 240 GB │ 1h │ - │
│ short │ 104 │ 240 GB │ 4h │ - │
│ standard │ 104 │ 240 GB │ 48h │ - │
│ gpu-h100 │ 2 │ 240 GB │ 48h │ 4 (H100) │
│ ... │ ... │ ... │ ... │ ... │
╰──────────┴─────────────┴───────────┴─────────────────┴─────────────────╯
Finding Matching Partitions
Find partitions that can satisfy specific resource requirements:
torc hpc match kestrel --cpus 32 --memory 64g --walltime 2h
Options:
- --cpus <N>: Required CPU cores
- --memory <SIZE>: Required memory (e.g., 64g, 512m)
- --walltime <DURATION>: Required walltime (e.g., 2h, 4:00:00)
- --gpus <N>: Required GPUs (optional)
This is useful for understanding which partitions your jobs will be assigned to.
Custom HPC Profiles
If your HPC system doesn't have a built-in profile, you have two options:
Request Built-in Support (Recommended)
If your HPC is widely used, please open an issue requesting built-in support. Include:
- Your HPC system name and organization
- Partition names with resource limits (CPUs, memory, walltime, GPUs)
- Detection method (environment variable or hostname pattern)
Built-in profiles benefit everyone using that system and are maintained by the Torc team.
If you need to use your HPC immediately or have a private cluster, you can define a custom profile in your configuration file. See the Custom HPC Profile Tutorial for a complete walkthrough.
Quick Example
Define custom profiles in your configuration file:
# ~/.config/torc/config.toml
[client.hpc.custom_profiles.mycluster]
display_name = "My Research Cluster"
description = "Internal research HPC system"
detect_env_var = "MY_CLUSTER=research"
default_account = "default_project"
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "compute"
cpus_per_node = 64
memory_mb = 256000
max_walltime_secs = 172800
shared = false
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 128000
max_walltime_secs = 86400
gpus_per_node = 4
gpu_type = "A100"
shared = false
See Configuration Reference for full configuration options.
Using Profiles with Slurm Workflows
HPC profiles are used by Slurm-related commands to automatically generate scheduler configurations. See Working with Slurm for details on:
- torc submit-slurm - Submit workflows with auto-generated schedulers
- torc workflows create-slurm - Create workflows with auto-generated schedulers
See Also
- Working with Slurm
- Custom HPC Profile Tutorial
- HPC Profiles Reference
- Configuration Reference
- Resource Requirements Reference
HPC Profiles Reference
Complete reference for HPC profile system and CLI commands.
Overview
HPC profiles contain pre-configured knowledge about High-Performance Computing systems, enabling automatic Slurm scheduler generation based on job resource requirements.
CLI Commands
torc hpc list
List all available HPC profiles.
torc hpc list [OPTIONS]
Options:
| Option | Description |
|---|---|
-f, --format <FORMAT> | Output format: table or json |
Output columns:
- Name: Profile identifier used in commands
- Display Name: Human-readable name
- Partitions: Number of configured partitions
- Detected: Whether current system matches this profile
torc hpc detect
Detect the current HPC system.
torc hpc detect [OPTIONS]
Options:
| Option | Description |
|---|---|
-f, --format <FORMAT> | Output format: table or json |
Returns the detected profile name, or indicates no match.
torc hpc show
Display detailed information about an HPC profile.
torc hpc show <PROFILE> [OPTIONS]
Arguments:
| Argument | Description |
|---|---|
<PROFILE> | Profile name (e.g., kestrel) |
Options:
| Option | Description |
|---|---|
-f, --format <FORMAT> | Output format: table or json |
torc hpc partitions
List partitions for an HPC profile.
torc hpc partitions <PROFILE> [OPTIONS]
Arguments:
| Argument | Description |
|---|---|
<PROFILE> | Profile name (e.g., kestrel) |
Options:
| Option | Description |
|---|---|
-f, --format <FORMAT> | Output format: table or json |
Output columns:
- Name: Partition name
- CPUs/Node: CPU cores per node
- Mem/Node: Memory per node
- Max Walltime: Maximum job duration
- GPUs: GPU count and type (if applicable)
- Shared: Whether partition supports shared jobs
- Notes: Special requirements or features
torc hpc match
Find partitions matching resource requirements.
torc hpc match <PROFILE> [OPTIONS]
Arguments:
| Argument | Description |
|---|---|
<PROFILE> | Profile name (e.g., kestrel) |
Options:
| Option | Description |
|---|---|
--cpus <N> | Required CPU cores |
--memory <SIZE> | Required memory (e.g., 64g, 512m) |
--walltime <DURATION> | Required walltime (e.g., 2h, 4:00:00) |
--gpus <N> | Required GPUs |
-f, --format <FORMAT> | Output format: table or json |
Memory format: <number><unit> where unit is k, m, g, or t (case-insensitive).
Walltime formats:
- HH:MM:SS (e.g., 04:00:00)
- <N>h (e.g., 4h)
- <N>m (e.g., 30m)
- <N>s (e.g., 3600s)
torc hpc generate
Generate an HPC profile configuration from the current Slurm cluster.
torc hpc generate [OPTIONS]
Options:
| Option | Description |
|---|---|
--name <NAME> | Profile name (defaults to cluster name or hostname) |
--display-name <NAME> | Human-readable display name |
-o, --output <FILE> | Output file path (prints to stdout if not specified) |
--skip-stdby | Skip standby partitions (names ending in -stdby) |
How it works:
- Queries sinfo to get partition names, CPUs, memory, time limits, and GRES
- Queries scontrol show partition for each partition to get additional details
- Parses GRES strings to extract GPU count and type
- Generates hostname-based detection pattern from current hostname
- Outputs TOML configuration ready to add to your config file
Example:
# Generate profile from current cluster
torc hpc generate
# Output:
# [client.hpc.custom_profiles.mycluster]
# display_name = "Mycluster"
# detect_hostname = ".*\\.mycluster\\.edu"
#
# [[client.hpc.custom_profiles.mycluster.partitions]]
# name = "compute"
# cpus_per_node = 64
# memory_mb = 256000
# max_walltime_secs = 172800
# ...
Fields extracted automatically:
- Partition name, CPUs per node, memory (MB), max walltime (seconds)
- GPU count and type from GRES (e.g., gpu:a100:4)
- Shared node support from the OverSubscribe setting
Fields that may need manual adjustment:
- requires_explicit_request: Defaults to false; set to true for partitions that shouldn't be auto-selected
- description: Not available from Slurm; add human-readable descriptions
- gpu_memory_gb: Not available from Slurm; add if known
torc slurm generate
Generate Slurm schedulers for a workflow based on job resource requirements.
torc slurm generate [OPTIONS] --account <ACCOUNT> <WORKFLOW_FILE>
Arguments:
| Argument | Description |
|---|---|
<WORKFLOW_FILE> | Path to workflow specification file (YAML, JSON, or JSON5) |
Options:
| Option | Description |
|---|---|
--account <ACCOUNT> | Slurm account to use (required) |
--profile <PROFILE> | HPC profile to use (auto-detected if not specified) |
-o, --output <FILE> | Output file path (prints to stdout if not specified) |
--no-actions | Don't add workflow actions for scheduling nodes |
--force | Overwrite existing schedulers in the workflow |
Generated artifacts:
- Slurm schedulers: One for each unique resource requirement
- Job scheduler assignments: Each job linked to appropriate scheduler
- Workflow actions: on_workflow_start / schedule_nodes actions (unless --no-actions)
Scheduler naming: <resource_requirement_name>_scheduler
Built-in Profiles
NREL Kestrel
Profile name: kestrel
Detection: Environment variable NREL_CLUSTER=kestrel
Partitions:
| Partition | CPUs | Memory | Max Walltime | GPUs | Notes |
|---|---|---|---|---|---|
debug | 104 | 240 GB | 1h | - | Quick testing |
short | 104 | 240 GB | 4h | - | Short jobs |
standard | 104 | 240 GB | 48h | - | General workloads |
long | 104 | 240 GB | 240h | - | Extended jobs |
medmem | 104 | 480 GB | 48h | - | Medium memory |
bigmem | 104 | 2048 GB | 48h | - | High memory |
shared | 104 | 240 GB | 48h | - | Shared node access |
hbw | 104 | 240 GB | 48h | - | High-bandwidth memory, min 10 nodes |
nvme | 104 | 240 GB | 48h | - | NVMe local storage |
gpu-h100 | 2 | 240 GB | 48h | 4x H100 | GPU compute |
Node specifications:
- Standard nodes: 104 cores (2x Intel Xeon Sapphire Rapids), 240 GB RAM
- GPU nodes: 4x NVIDIA H100 80GB HBM3, 128 cores, 2 TB RAM
Configuration
Custom Profiles
Don't see your HPC? Please request built-in support so everyone benefits. See the Custom HPC Profile Tutorial for creating a profile while you wait.
Define custom profiles in your Torc configuration file:
# ~/.config/torc/config.toml
[client.hpc.custom_profiles.mycluster]
display_name = "My Cluster"
description = "Description of the cluster"
detect_env_var = "CLUSTER_NAME=mycluster"
detect_hostname = ".*\\.mycluster\\.org"
default_account = "myproject"
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "compute"
cpus_per_node = 64
memory_mb = 256000
max_walltime_secs = 172800
shared = false
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 128000
max_walltime_secs = 86400
gpus_per_node = 4
gpu_type = "A100"
shared = false
Profile Override
Override settings for built-in profiles:
[client.hpc.profile_overrides.kestrel]
default_account = "my_default_account"
Configuration Options
[client.hpc] Section:
| Option | Type | Description |
|---|---|---|
profile_overrides | table | Override settings for built-in profiles |
custom_profiles | table | Define custom HPC profiles |
Profile override options:
| Option | Type | Description |
|---|---|---|
default_account | string | Default Slurm account for this profile |
Custom profile options:
| Option | Type | Required | Description |
|---|---|---|---|
display_name | string | No | Human-readable name |
description | string | No | Profile description |
detect_env_var | string | No | Environment variable for detection (NAME=value) |
detect_hostname | string | No | Regex pattern for hostname detection |
default_account | string | No | Default Slurm account |
partitions | array | Yes | List of partition configurations |
Partition options:
| Option | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Partition name |
cpus_per_node | int | Yes | CPU cores per node |
memory_mb | int | Yes | Memory per node in MB |
max_walltime_secs | int | Yes | Maximum walltime in seconds |
gpus_per_node | int | No | GPUs per node |
gpu_type | string | No | GPU model (e.g., "H100") |
shared | bool | No | Whether partition supports shared jobs |
min_nodes | int | No | Minimum required nodes |
requires_explicit_request | bool | No | Must be explicitly requested |
Resource Matching Algorithm
When generating schedulers, Torc uses this algorithm to match resource requirements to partitions:
1. Filter by resources: Partitions must satisfy:
   - CPUs >= required CPUs
   - Memory >= required memory
   - GPUs >= required GPUs (if specified)
   - Max walltime >= required runtime
2. Exclude debug partitions: Unless no other partition matches
3. Prefer best fit:
   - Partitions that exactly match resource needs
   - Non-shared partitions over shared
   - Shorter max walltime over longer
4. Handle special requirements:
   - GPU jobs only match GPU partitions
   - Respect the requires_explicit_request flag
   - Honor min_nodes constraints
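You can sanity-check these rules against a profile with torc hpc match. Using the built-in kestrel partitions listed above as an example:
# debug (1h max walltime) fails the filter for a 2-hour request; short, standard,
# long, and several others pass. Scheduler generation would then prefer short
# (non-shared, shortest adequate walltime).
torc hpc match kestrel --cpus 16 --memory 32g --walltime 2h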
Generated Scheduler Format
Example generated Slurm scheduler:
slurm_schedulers:
- name: medium_scheduler
account: myproject
nodes: 1
mem: 64g
walltime: 04:00:00
gres: null
partition: null # Let Slurm choose based on resources
Corresponding workflow action:
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: medium_scheduler
scheduler_type: slurm
num_allocations: 1
Runtime Format Parsing
Resource requirements use ISO 8601 duration format for runtime:
| Format | Example | Meaning |
|---|---|---|
PTnH | PT4H | 4 hours |
PTnM | PT30M | 30 minutes |
PTnS | PT3600S | 3600 seconds |
PTnHnM | PT2H30M | 2 hours 30 minutes |
PnDTnH | P1DT12H | 1 day 12 hours |
Generated walltime uses HH:MM:SS format (e.g., 04:00:00).
See Also
- Working with HPC Profiles
- Custom HPC Profile Tutorial
- Working with Slurm
- Resource Requirements
- Configuration Reference
HPC Deployment Reference
Configuration guide for deploying Torc on High-Performance Computing systems.
Overview
Running Torc on HPC systems requires special configuration to ensure:
- Compute nodes can reach the torc-server running on a login node
- The database is stored on a filesystem accessible to all nodes
- Network paths use the correct hostnames for the HPC interconnect
Server Configuration on Login Nodes
Hostname Requirements
On most HPC systems, login nodes have multiple network interfaces:
- External hostname: Used for SSH access from outside (e.g., kl3.hpc.nrel.gov)
- Internal hostname: Used by compute nodes via the high-speed interconnect (e.g., kl3.hsn.cm.kestrel.hpc.nrel.gov)
When running torc-server on a login node, you must use the internal hostname so compute nodes
can connect.
NREL Kestrel Example
On NREL's Kestrel system, login nodes use the High-Speed Network (HSN) for internal communication:
| Login Node | External Hostname | Internal Hostname (for -u flag) |
|---|---|---|
| kl1 | kl1.hpc.nrel.gov | kl1.hsn.cm.kestrel.hpc.nrel.gov |
| kl2 | kl2.hpc.nrel.gov | kl2.hsn.cm.kestrel.hpc.nrel.gov |
| kl3 | kl3.hpc.nrel.gov | kl3.hsn.cm.kestrel.hpc.nrel.gov |
Starting the server:
# On login node kl3, use the internal hostname
torc-server run \
--database /scratch/$USER/torc.db \
-u kl3.hsn.cm.kestrel.hpc.nrel.gov \
--port 8085
Connecting clients:
# Set the API URL using the internal hostname
export TORC_API_URL="http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"
# Now torc commands will use this URL
torc workflows list
Finding the Internal Hostname
If you're unsure of your system's internal hostname, try these approaches:
# Check all network interfaces
hostname -A
# Look for hostnames in the hosts file
grep $(hostname -s) /etc/hosts
# Check Slurm configuration for the control machine
scontrol show config | grep ControlMachine
Consult your HPC system's documentation or support team for the correct internal hostname format.
Database Placement
The SQLite database must be on a filesystem accessible to both:
- The login node running torc-server
- All compute nodes running jobs
Recommended Locations
| Filesystem | Pros | Cons |
|---|---|---|
Scratch (/scratch/$USER/) | Fast, shared, high capacity | May be purged periodically |
Project (/projects/) | Persistent, shared | May have quotas |
Home (~) | Persistent | Often slow, limited space |
Best practice: Use scratch for active workflows, backup completed workflows to project storage.
# Create a dedicated directory
mkdir -p /scratch/$USER/torc
# Start server with scratch database
torc-server run \
--database /scratch/$USER/torc/workflows.db \
-u $(hostname -s).hsn.cm.kestrel.hpc.nrel.gov \
--port 8085
Database Backup
For long-running workflows, periodically backup the database:
# SQLite backup (safe while server is running)
sqlite3 /scratch/$USER/torc.db ".backup /projects/$USER/torc_backup.db"
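To automate this, a periodic backup from a login-node crontab is one option (the schedule and paths here are illustrative):
# Illustrative crontab entry: hourly backup of the active database to project storage
0 * * * * sqlite3 /scratch/myuser/torc.db ".backup /projects/myproject/torc_backup.db"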
Port Selection
Login nodes are shared resources. To avoid conflicts:
- Use a non-default port: Choose a port in the range 8000-9999
- Check for conflicts: lsof -i :8085
- Consider using your UID: --port $((8000 + UID % 1000))
# Use a unique port based on your user ID
MY_PORT=$((8000 + $(id -u) % 1000))
torc-server run \
--database /scratch/$USER/torc.db \
-u kl3.hsn.cm.kestrel.hpc.nrel.gov \
--port $MY_PORT
Running in tmux/screen
Always run torc-server in a terminal multiplexer to prevent loss on disconnect:
# Start a tmux session
tmux new -s torc
# Start the server
torc-server run \
--database /scratch/$USER/torc.db \
-u kl3.hsn.cm.kestrel.hpc.nrel.gov \
--port 8085
# Detach with Ctrl+b, then d
# Reattach later with: tmux attach -t torc
Complete Configuration Example
Server Configuration File
Create ~/.config/torc/config.toml:
[server]
# Use internal hostname for compute node access
url = "kl3.hsn.cm.kestrel.hpc.nrel.gov"
port = 8085
database = "/scratch/myuser/torc/workflows.db"
threads = 4
completion_check_interval_secs = 30.0
log_level = "info"
[server.logging]
log_dir = "/scratch/myuser/torc/logs"
Client Configuration File
Create ~/.config/torc/config.toml (or add to existing):
[client]
# Match the server's internal hostname and port
api_url = "http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"
format = "table"
[client.run]
output_dir = "/scratch/myuser/torc/output"
Environment Variables
Alternatively, set environment variables in your shell profile:
# Add to ~/.bashrc or ~/.bash_profile
export TORC_API_URL="http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"
export TORC_CLIENT__RUN__OUTPUT_DIR="/scratch/$USER/torc/output"
Slurm Job Runner Configuration
When submitting workflows to Slurm, the job runners on compute nodes need to reach the server. The
TORC_API_URL is automatically passed to Slurm jobs.
Verify connectivity from a compute node:
# Submit an interactive job
salloc -N 1 -t 00:10:00
# Test connectivity to the server
curl -s "$TORC_API_URL/workflows" | head
# Exit the allocation
exit
Troubleshooting
"Connection refused" from compute nodes
1. Verify the server is using the internal hostname:
   torc-server run -u <internal-hostname> --port 8085
2. Check the server is listening on all interfaces:
   netstat -tlnp | grep 8085
3. Verify no firewall blocks the port:
   # From a compute node
   nc -zv <internal-hostname> 8085
Database locked errors
SQLite may report locking issues on network filesystems:
- Ensure only one torc-server instance is running
- Use a local scratch filesystem rather than NFS home directories
- Consider increasing completion_check_interval_secs to reduce database contention
Server stops when SSH disconnects
Always use tmux or screen (see above). If the server dies unexpectedly:
# Check if the server is still running
pgrep -f torc-server
# Check server logs
tail -100 /scratch/$USER/torc/logs/torc-server*.log
See Also
Creating a Custom HPC Profile
This tutorial walks you through creating a custom HPC profile for a cluster that Torc doesn't have built-in support for.
Before You Start
Request Built-in Support First!
If your HPC system is widely used, consider requesting that Torc developers add it as a built-in profile. This benefits everyone using that system.
Open an issue at github.com/NREL/torc/issues with:
- Your HPC system name and organization
- Partition names and their resource limits (CPUs, memory, walltime, GPUs)
- How to detect the system (environment variable or hostname pattern)
- Any special requirements (minimum nodes, exclusive partitions, etc.)
Built-in profiles are maintained by the Torc team and stay up-to-date as systems change.
When to Create a Custom Profile
Create a custom profile when:
- Your HPC isn't supported and you need to use it immediately
- You have a private or internal cluster
- You want to test profile configurations before submitting upstream
Quick Start: Auto-Generate from Slurm
If you're on a Slurm cluster, you can automatically generate a profile from the cluster configuration:
# Generate profile from current Slurm cluster
torc hpc generate
# Specify a custom name
torc hpc generate --name mycluster --display-name "My Research Cluster"
# Skip standby/preemptible partitions
torc hpc generate --skip-stdby
# Save to a file
torc hpc generate --skip-stdby -o mycluster-profile.toml
This queries sinfo and scontrol to extract:
- Partition names, CPUs, memory, and time limits
- GPU configuration from GRES
- Node sharing settings
- Hostname-based detection pattern
The generated profile can be added directly to your config file. You may want to review and adjust:
- requires_explicit_request: Set to true for partitions that shouldn't be auto-selected
- description: Add human-readable descriptions for each partition
After generation, skip to Step 4: Verify the Profile.
Manual Profile Creation
If automatic generation isn't available or you need more control, follow these steps.
Step 1: Gather Partition Information
Collect information about your HPC's partitions. On most Slurm systems:
# List all partitions
sinfo -s
# Get detailed partition info
sinfo -o "%P %c %m %l %G"
For this tutorial, let's say your cluster "ResearchCluster" has these partitions:
| Partition | CPUs/Node | Memory | Max Walltime | GPUs |
|---|---|---|---|---|
batch | 48 | 192 GB | 72 hours | - |
short | 48 | 192 GB | 4 hours | - |
gpu | 32 | 256 GB | 48 hours | 4x A100 |
himem | 48 | 1024 GB | 48 hours | - |
Step 2: Identify Detection Method
Determine how Torc can detect when you're on this system. Common methods:
Environment variable (most common):
echo $CLUSTER_NAME # e.g., "research"
echo $SLURM_CLUSTER # e.g., "researchcluster"
Hostname pattern:
hostname # e.g., "login01.research.edu"
For this tutorial, we'll use the environment variable CLUSTER_NAME=research.
Step 3: Create the Configuration File
Create or edit your Torc configuration file:
# Linux
mkdir -p ~/.config/torc
nano ~/.config/torc/config.toml
# macOS
mkdir -p ~/Library/Application\ Support/torc
nano ~/Library/Application\ Support/torc/config.toml
Add your custom profile:
# Custom HPC Profile for ResearchCluster
[client.hpc.custom_profiles.research]
display_name = "Research Cluster"
description = "University Research HPC System"
detect_env_var = "CLUSTER_NAME=research"
default_account = "my_project"
# Batch partition - general purpose
[[client.hpc.custom_profiles.research.partitions]]
name = "batch"
cpus_per_node = 48
memory_mb = 192000 # 192 GB in MB
max_walltime_secs = 259200 # 72 hours in seconds
shared = false
# Short partition - quick jobs
[[client.hpc.custom_profiles.research.partitions]]
name = "short"
cpus_per_node = 48
memory_mb = 192000
max_walltime_secs = 14400 # 4 hours
shared = true # Allows sharing nodes
# GPU partition
[[client.hpc.custom_profiles.research.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 256000 # 256 GB
max_walltime_secs = 172800 # 48 hours
gpus_per_node = 4
gpu_type = "A100"
shared = false
# High memory partition
[[client.hpc.custom_profiles.research.partitions]]
name = "himem"
cpus_per_node = 48
memory_mb = 1048576 # 1024 GB (1 TB)
max_walltime_secs = 172800 # 48 hours
shared = false
Step 4: Verify the Profile
Check that Torc recognizes your profile:
# List all profiles
torc hpc list
You should see your custom profile:
Known HPC profiles:
╭──────────┬──────────────────┬────────────┬──────────╮
│ Name │ Display Name │ Partitions │ Detected │
├──────────┼──────────────────┼────────────┼──────────┤
│ kestrel │ NREL Kestrel │ 15 │ │
│ research │ Research Cluster │ 4 │ ✓ │
╰──────────┴──────────────────┴────────────┴──────────╯
View the partitions:
torc hpc partitions research
Partitions for research:
╭─────────┬───────────┬───────────┬─────────────┬──────────╮
│ Name │ CPUs/Node │ Mem/Node │ Max Walltime│ GPUs │
├─────────┼───────────┼───────────┼─────────────┼──────────┤
│ batch │ 48 │ 192 GB │ 72h │ - │
│ short │ 48 │ 192 GB │ 4h │ - │
│ gpu │ 32 │ 256 GB │ 48h │ 4 (A100) │
│ himem │ 48 │ 1024 GB │ 48h │ - │
╰─────────┴───────────┴───────────┴─────────────┴──────────╯
Step 5: Test Partition Matching
Verify that Torc correctly matches resource requirements to partitions:
# Should match 'short' partition
torc hpc match research --cpus 8 --memory 16g --walltime 2h
# Should match 'gpu' partition
torc hpc match research --cpus 16 --memory 64g --walltime 8h --gpus 2
# Should match 'himem' partition
torc hpc match research --cpus 24 --memory 512g --walltime 24h
Step 6: Test Scheduler Generation
Create a test workflow to verify scheduler generation:
# test_workflow.yaml
name: profile_test
description: Test custom HPC profile
resource_requirements:
- name: standard
num_cpus: 16
memory: 64g
runtime: PT2H
- name: gpu_compute
num_cpus: 16
num_gpus: 2
memory: 128g
runtime: PT8H
jobs:
- name: preprocess
command: echo "preprocessing"
resource_requirements: standard
- name: train
command: echo "training"
resource_requirements: gpu_compute
depends_on: [preprocess]
Generate schedulers:
torc slurm generate --account my_project --profile research test_workflow.yaml
You should see the generated workflow with appropriate schedulers for each partition.
Step 7: Use Your Profile
Now you can submit workflows using your custom profile:
# Auto-detect the profile (if on the cluster)
torc submit-slurm --account my_project workflow.yaml
# Or explicitly specify the profile
torc submit-slurm --account my_project --hpc-profile research workflow.yaml
Advanced Configuration
Hostname-Based Detection
If your cluster doesn't set a unique environment variable, use hostname detection:
[client.hpc.custom_profiles.research]
display_name = "Research Cluster"
detect_hostname = ".*\\.research\\.edu" # Regex pattern
Minimum Node Requirements
Some partitions require a minimum number of nodes:
[[client.hpc.custom_profiles.research.partitions]]
name = "large_scale"
cpus_per_node = 128
memory_mb = 512000
max_walltime_secs = 172800
min_nodes = 16 # Must request at least 16 nodes
Explicit Request Partitions
Some partitions shouldn't be auto-selected:
[[client.hpc.custom_profiles.research.partitions]]
name = "priority"
cpus_per_node = 48
memory_mb = 192000
max_walltime_secs = 86400
requires_explicit_request = true # Only used when explicitly requested
Troubleshooting
Profile Not Detected
If torc hpc detect doesn't find your profile:
1. Check the environment variable or hostname:
   echo $CLUSTER_NAME
   hostname
2. Verify the detection pattern in your config matches exactly
3. Test with explicit profile specification:
   torc hpc show research
No Partition Found for Job
If torc slurm generate can't find a matching partition:
1. Check whether any partition satisfies all requirements:
   torc hpc match research --cpus 32 --memory 128g --walltime 8h
2. Verify memory is specified in MB in the config (not GB); a quick conversion check follows this list
3. Verify walltime is in seconds (not hours)
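A quick shell check of the unit conversions used in the config (illustrative arithmetic only):
# 192 GB expressed in MB, matching the batch partition example above
echo $(( 192 * 1000 ))    # 192000
# 72 hours expressed in seconds
echo $(( 72 * 3600 ))     # 259200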
Configuration File Location
Torc looks for config files in these locations:
- Linux: ~/.config/torc/config.toml
- macOS: ~/Library/Application Support/torc/config.toml
- Windows: %APPDATA%\torc\config.toml
You can also use the TORC_CONFIG environment variable to specify a custom path.
Contributing Your Profile
If your HPC is used by others, please contribute it upstream:
- Fork the Torc repository
- Add your profile to src/client/hpc_profiles.rs
- Add tests for your profile
- Submit a pull request
Or simply open an issue with your partition information and we'll add it for you.
See Also
- Working with HPC Profiles - General HPC profile usage
- HPC Profiles Reference - Complete configuration options
- Slurm Workflows - Simplified Slurm approach
Remote Workers
Running Torc jobs on remote machines.
- Setting Up Remote Workers - Configuring remote worker nodes
Remote Workers
Run workflows across multiple machines via SSH without requiring an HPC scheduler.
Overview
Torc supports three execution modes:
- Local (torc run) - Jobs run on the current machine
- HPC (torc submit-slurm) - Jobs run on Slurm-allocated nodes
- Remote Workers (torc remote run) - Jobs run on SSH-accessible machines
Remote workers are ideal for:
- Ad-hoc clusters of workstations or cloud VMs
- Environments without a scheduler
- Testing distributed workflows before HPC deployment
Worker File Format
Create a text file listing remote machines:
# Lines starting with # are comments
# Format: [user@]hostname[:port]
# Simple hostname
worker1.example.com
# With username
alice@worker2.example.com
# With custom SSH port
admin@192.168.1.10:2222
# IPv4 address
10.0.0.5
# IPv6 address (must be in brackets for port specification)
[2001:db8::1]
[::1]:2222
Each host can only appear once. Duplicate hosts will cause an error.
Worker Management
Workers are stored in the database and persist across command invocations. This means you only need to specify workers once, and subsequent commands can reference them by workflow ID.
Add Workers
torc remote add-workers <workflow-id> <worker>...
Add one or more workers directly on the command line:
torc remote add-workers 42 worker1.example.com alice@worker2.example.com admin@192.168.1.10:2222
Add Workers from File
torc remote add-workers-from-file <worker-file> [workflow-id]
Example:
torc remote add-workers-from-file workers.txt 42
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
List Workers
torc remote list-workers [workflow-id]
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Remove a Worker
torc remote remove-worker <worker> [workflow-id]
Example:
torc remote remove-worker worker1.example.com 42
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Commands
Start Workers
torc remote run [workflow-id] [options]
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Workers are fetched from the database. If you want to add workers from a file at the same time:
torc remote run <workflow-id> --workers <worker-file> [options]
Options:
| Option | Default | Description |
|---|---|---|
--workers | none | Worker file to add before starting |
-o, --output-dir | torc_output | Output directory on remote machines |
--max-parallel-ssh | 10 | Maximum parallel SSH connections |
-p, --poll-interval | 5.0 | How often workers poll for jobs (seconds) |
--max-parallel-jobs | auto | Maximum parallel jobs per worker |
--num-cpus | auto | CPUs per worker (auto-detected if not specified) |
--memory-gb | auto | Memory per worker (auto-detected if not specified) |
--num-gpus | auto | GPUs per worker (auto-detected if not specified) |
--skip-version-check | false | Skip version verification (not recommended) |
Example:
# First time: add workers and start
torc remote run 42 --workers workers.txt \
--output-dir /data/torc_output \
--poll-interval 10
# Subsequent runs: workers already in database
torc remote run 42 --output-dir /data/torc_output
Check Status
torc remote status [workflow-id] [options]
Shows which workers are still running. Workers are fetched from the database. If workflow-id is
omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
--max-parallel-ssh | 10 | Maximum parallel SSH connections |
Stop Workers
torc remote stop [workflow-id] [options]
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
--force | false | Send SIGKILL instead of SIGTERM |
--max-parallel-ssh | 10 | Maximum parallel SSH connections |
Collect Logs
torc remote collect-logs [workflow-id] [options]
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
-l, --local-output-dir | remote_logs | Local directory for collected logs |
--remote-output-dir | torc_output | Remote output directory |
--delete | false | Delete remote logs after successful collection |
--max-parallel-ssh | 10 | Maximum parallel SSH connections |
Example with deletion:
# Collect logs and clean up remote workers
torc remote collect-logs 42 --delete
Delete Logs
torc remote delete-logs [workflow-id] [options]
Delete the output directory from all remote workers without collecting logs first. Use
collect-logs --delete if you want to save logs before deleting.
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
--remote-output-dir | torc_output | Remote output directory |
--max-parallel-ssh | 10 | Maximum parallel SSH connections |
Typical Workflow
1. Create a workflow:
   torc workflows create my_workflow.yaml
2. Add workers:
   # From the command line
   torc remote add-workers 42 worker1.example.com worker2.example.com
   # Or from a file
   torc remote add-workers-from-file workers.txt 42
3. Start workers:
   torc remote run 42
4. Monitor status:
   torc remote status 42
5. Collect logs when complete:
   torc remote collect-logs 42 -l ./logs
Or combine steps 2 and 3:
torc remote run 42 --workers workers.txt
How It Works
- Version Check: Verifies all remote machines have the same torc version
- Worker Start: Uses nohup to start detached workers that survive SSH disconnection
- Job Execution: Each worker polls the server for available jobs and executes them locally
- Completion: Workers exit when the workflow is complete or canceled
The server coordinates job distribution. Multiple workers can safely poll the same workflow without double-allocating jobs.
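Because job claiming is coordinated by the server, you can usually add capacity while a workflow is still active. A minimal sketch using the commands above (assumes the workflow has not finished and the new host passes the version check):
# Register an additional worker and start it against the same workflow
torc remote add-workers 42 worker3.example.com
torc remote run 42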
SSH Configuration
Workers connect using SSH with these options:
- ConnectTimeout=30 - 30 second connection timeout
- BatchMode=yes - No password prompts (requires key-based auth)
- StrictHostKeyChecking=accept-new - Accept new host keys automatically
For custom SSH configuration, use ~/.ssh/config on the local machine:
Host worker1
HostName worker1.example.com
User alice
Port 2222
IdentityFile ~/.ssh/worker_key
Then reference the alias in your worker file:
worker1
worker2
worker3
Resource Monitoring
If your workflow has resource monitoring enabled, each worker collects utilization data:
name: my_workflow
resource_monitor_config:
enabled: true
granularity: time_series
sample_interval_seconds: 5
The collect-logs command retrieves these databases along with job logs.
Troubleshooting
No Workers Configured
No workers configured for workflow 42. Use 'torc remote add-workers' or '--workers' flag.
Add workers to the workflow using torc remote add-workers or the --workers flag on run.
Version Mismatch
Error: Version check failed on 2 worker(s):
worker1: Version mismatch: local=0.7.0, worker1=0.6.5
worker2: Version mismatch: local=0.7.0, worker2=0.6.5
Install the same torc version on all machines, or use --skip-version-check (not recommended for
production).
SSH Connection Failed
Error: SSH connectivity check failed for 1 worker(s):
worker1: SSH connection failed to worker1: Permission denied (publickey)
Verify SSH key-based authentication works:
ssh worker1.example.com true
Worker Died Immediately
[FAILED] worker1: Process died immediately. Last log:
Error: connection refused...
The worker couldn't connect to the server. Check:
- Server is accessible from the remote machine
- Firewall allows connections on the server port
- The --url points to the correct server address
Workers Not Claiming Jobs
If workers start but don't claim jobs:
- Check the workflow is initialized: torc workflows status <id>
- Check jobs are ready: torc jobs list <id>
- Check resource requirements match available resources
Comparison with Slurm
| Feature | Remote Workers | Slurm |
|---|---|---|
| Scheduler required | No | Yes |
| Resource allocation | Manual (worker file) | Automatic |
| Fault tolerance | Limited | Full (job requeue) |
| Walltime limits | No | Yes |
| Priority/queuing | No | Yes |
| Best for | Ad-hoc clusters, testing | Production HPC |
Security Considerations
- Workers authenticate to the torc server (if authentication is enabled)
- SSH keys should be properly secured
- Workers run with the permissions of the SSH user on each machine
- The torc server URL is passed to workers and visible in process lists
Fault Tolerance & Recovery
Handling failures and recovering workflows automatically.
- Automatic Failure Recovery - Automatic retry and resource adjustment
- Configurable Failure Handlers - Per-job retry logic based on exit codes
- AI-Assisted Recovery - Intelligent error classification with AI agents
- Job Checkpointing - Saving and restoring job state
Automatic Failure Recovery
This guide explains how to use Torc's automatic recovery features to handle workflow failures without manual intervention.
Overview
Torc provides automatic failure recovery through two commands:
- torc recover - One-shot recovery for Slurm workflows
- torc watch --recover - Continuous monitoring with automatic recovery
When jobs fail, the system:
- Diagnoses the failure cause (OOM, timeout, or unknown)
- Applies heuristics to adjust resource requirements
- Resets failed jobs and submits new Slurm allocations
- (watch only) Resumes monitoring until completion or max retries
This deterministic approach handles the majority of HPC failures without human intervention.
Why Deterministic Recovery?
Most HPC job failures fall into predictable categories:
| Failure Type | Frequency | Solution |
|---|---|---|
| Out of Memory | ~60% | Increase memory allocation |
| Timeout | ~25% | Increase runtime limit |
| Transient errors | ~10% | Simple retry |
| Code bugs | ~5% | Manual intervention |
For 85-90% of failures, the solution is mechanical: increase resources and retry. This doesn't require AI judgment—simple heuristics work well.
Recovery Architecture
flowchart LR
A[torc watch<br/>polling] --> B{Workflow<br/>complete?}
B -->|No| A
B -->|Yes, with failures| C[Diagnose failures<br/>check resources]
C --> D[Apply heuristics<br/>adjust resources]
D --> E[Submit new<br/>allocations]
E --> A
B -->|Yes, success| F[Exit 0]
style A fill:#4a9eff,color:#fff
style B fill:#6c757d,color:#fff
style C fill:#ffc107,color:#000
style D fill:#ffc107,color:#000
style E fill:#28a745,color:#fff
style F fill:#28a745,color:#fff
Failure Detection
Torc tracks resource usage during job execution:
- Memory usage (RSS and peak)
- CPU utilization
- Execution time
This data is analyzed to determine failure causes:
OOM Detection:
- Peak memory exceeds specified limit
- Exit code 137 (SIGKILL from OOM killer)
- Flag: likely_oom: true
Timeout Detection:
- Execution time within 10% of runtime limit
- Job was killed (not graceful exit)
- Flag: likely_timeout: true
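These flags surface in the job results, so they can be checked directly. A hedged sketch, assuming the debug report exposes likely_oom on each result along with a job name field (only .results[] and the flag names are taken from this guide; the other field names are assumptions):
# List jobs the diagnostics flagged as probable OOM kills (.job_name is assumed)
torc reports results <workflow_id> > report.json
jq -r '.results[] | select(.likely_oom == true) | .job_name' report.json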
Recovery Heuristics
| Failure Type | Detection | Default Action |
|---|---|---|
| Out of Memory | Peak memory > limit, exit code 137 | Increase memory by 1.5x |
| Timeout | Execution time near limit | Increase runtime by 1.5x |
| Unknown | Other exit codes | Skip (likely bug) |
Note: By default, jobs with unknown failure causes are not retried, since they likely have script or data bugs that won't be fixed by retrying. Use
--retry-unknown to also retry these jobs (e.g., to handle transient errors like network issues).
The torc recover Command
For one-shot recovery when a workflow has failed:
# Preview what would be done (recommended first step)
torc recover 42 --dry-run
# Execute the recovery
torc recover 42
This command:
- Detects and cleans up orphaned jobs from terminated Slurm allocations
- Checks that the workflow is complete and no workers are active
- Diagnoses failure causes (OOM, timeout, etc.)
- Adjusts resource requirements based on heuristics
- Runs optional recovery hook for custom logic
- Resets failed jobs and regenerates Slurm schedulers
- Submits new allocations
Note: Step 1 (orphan cleanup) handles the case where Slurm terminated an allocation unexpectedly, leaving jobs stuck in "running" status. This is done automatically before checking preconditions.
Options
torc recover <workflow_id> \
--memory-multiplier 1.5 \ # Memory increase factor for OOM (default: 1.5)
--runtime-multiplier 1.4 \ # Runtime increase factor for timeout (default: 1.4)
--retry-unknown \ # Also retry jobs with unknown failure causes
--recovery-hook "bash fix.sh" \ # Custom script for unknown failures
--dry-run # Preview without making changes
Example Output
Diagnosing failures...
Applying recovery heuristics...
Job 107 (train_model): OOM detected, increasing memory 8g -> 12g
Applied fixes: 1 OOM, 0 timeout
Resetting 1 job(s) for retry...
Reset 1 job(s)
Reinitializing workflow...
Regenerating Slurm schedulers...
Submitted Slurm allocation with 1 job
Recovery complete for workflow 42
- 1 job(s) had memory increased
Reset 1 job(s). Slurm schedulers regenerated and submitted.
The torc watch --recover Command
The torc watch command can automatically recover from common failures:
torc watch 42 --recover
This will:
- Poll the workflow until completion
- On failure, diagnose the cause (OOM, timeout, etc.)
- Adjust resource requirements based on heuristics
- Reset failed jobs and submit new Slurm allocations
- Resume monitoring
- Repeat until success or max retries exceeded
Options
torc watch <workflow_id> \
-r \ # Enable automatic recovery (--recover)
-m 3 \ # Maximum recovery attempts (--max-retries)
--memory-multiplier 1.5 \ # Memory increase factor for OOM
--runtime-multiplier 1.5 \ # Runtime increase factor for timeout
--retry-unknown \ # Also retry jobs with unknown failures
--recovery-hook "bash fix.sh" \ # Custom recovery script
-p 60 \ # Seconds between status checks (--poll-interval)
-o output \ # Directory for job output files (--output-dir)
-s \ # Display job counts during polling (--show-job-counts)
--auto-schedule \ # Automatically schedule nodes for stranded jobs
--auto-schedule-threshold 5 \ # Min retry jobs before scheduling (default: 5)
--auto-schedule-cooldown 1800 \ # Seconds between auto-schedule attempts (default: 1800)
--auto-schedule-stranded-timeout 7200 # Schedule stranded jobs after this time (default: 7200)
Custom Recovery Hooks
For failures that torc can't handle automatically (not OOM or timeout), you can provide a custom
recovery script using --recovery-hook. This is useful for domain-specific recovery logic, such as
adjusting Apache Spark cluster sizes or fixing configuration issues.
torc watch 42 --recover --recovery-hook "bash fix-spark-cluster.sh"
The hook receives the workflow ID in two ways:
- As an argument: bash fix-spark-cluster.sh 42
- As an environment variable: TORC_WORKFLOW_ID=42
Your script can use torc CLI commands to query and modify the workflow:
#!/bin/bash
# fix-spark-cluster.sh - Example recovery hook for Spark jobs
WORKFLOW_ID=$1 # or use $TORC_WORKFLOW_ID
# Find failed jobs
FAILED_JOBS=$(torc jobs list $WORKFLOW_ID --status failed -f json | jq -r '.[].id')
for JOB_ID in $FAILED_JOBS; do
# Get current resource requirements
JOB_INFO=$(torc jobs get $JOB_ID -f json)
RR_ID=$(echo "$JOB_INFO" | jq -r '.resource_requirements_id')
# Check if this is a Spark job that needs more nodes
# (your logic here - parse logs, check error messages, etc.)
# Update resource requirements
torc resource-requirements update $RR_ID --num-nodes 16
echo "Updated job $JOB_ID to use 16 nodes"
done
When a recovery hook is provided:
- Jobs with unknown failures are automatically included for retry
- The hook runs before reset-status is called
- If the hook fails (non-zero exit), auto-recovery stops with an error
- After the hook succeeds, failed jobs are reset and retried
Auto-Scheduling for Failure Handlers
When using failure handlers that create retry jobs, the originally planned
compute capacity may not be sufficient. The --auto-schedule option enables automatic scheduling of
additional Slurm nodes when:
- No schedulers available: If there are ready jobs but no active or pending Slurm allocations, new schedulers are immediately regenerated and submitted.
- Retry jobs accumulating: If there are active schedulers but retry jobs (jobs with attempt_id > 1) are accumulating beyond the threshold, additional schedulers are submitted after the cooldown period.
This is particularly useful for workflows with failure handlers that retry failed jobs, ensuring those retries get scheduled without manual intervention.
Example: Failure Handler with Auto-Scheduling
# Submit a workflow with failure handlers
torc submit-slurm --account my_project workflow.yaml
# Watch with auto-scheduling enabled (uses defaults)
torc watch $WORKFLOW_ID --auto-schedule
With default settings:
- If all Slurm allocations complete but retry jobs remain, new allocations are submitted
- If 5+ retry jobs accumulate while allocations are running, additional capacity is scheduled
- After scheduling, the system waits 30 minutes before considering another auto-schedule
- If fewer than 5 retry jobs are waiting for 2 hours, they're scheduled anyway (stranded timeout)
Choosing the Right Command
| Use Case | Command |
|---|---|
| One-shot recovery after failure | torc recover |
| Continuous monitoring | torc watch -r |
| Preview what recovery would do | torc recover --dry-run |
| Production long-running workflows | torc watch -r |
| Manual investigation, then retry | torc recover |
Complete Workflow Example
1. Submit a Workflow
torc submit-slurm --account myproject workflow.yaml
Output:
Created workflow 42 with 100 jobs
Submitted to Slurm with 10 allocations
2. Start Watching with Auto-Recovery
torc watch 42 --recover --max-retries 3 --show-job-counts
Note: The --show-job-counts flag is optional. Without it, the command polls silently until completion, which reduces server load for large workflows.
Output:
Watching workflow 42 (poll interval: 60s, recover enabled, max retries: 3, job counts enabled)
completed=0, running=10, pending=0, failed=0, blocked=90
completed=25, running=10, pending=0, failed=0, blocked=65
...
completed=95, running=0, pending=0, failed=5, blocked=0
Workflow 42 is complete
Workflow completed with failures:
- Failed: 5
- Canceled: 0
- Terminated: 0
- Completed: 95
Attempting automatic recovery (attempt 1/3)
Diagnosing failures...
Applying recovery heuristics...
Job 107 (train_model_7): OOM detected, increasing memory 8g -> 12g
Job 112 (train_model_12): OOM detected, increasing memory 8g -> 12g
Job 123 (train_model_23): OOM detected, increasing memory 8g -> 12g
Job 131 (train_model_31): OOM detected, increasing memory 8g -> 12g
Job 145 (train_model_45): OOM detected, increasing memory 8g -> 12g
Applied fixes: 5 OOM, 0 timeout
Resetting failed jobs...
Regenerating Slurm schedulers and submitting...
Recovery initiated. Resuming monitoring...
Watching workflow 42 (poll interval: 60s, recover enabled, max retries: 3, job counts enabled)
completed=95, running=5, pending=0, failed=0, blocked=0
...
Workflow 42 is complete
Workflow completed successfully (100 jobs)
3. If No Recoverable Jobs Found
If all failures are from unknown causes (not OOM or timeout):
Applying recovery heuristics...
2 job(s) with unknown failure cause (skipped, use --retry-unknown to include)
No recoverable jobs found. 2 job(s) failed with unknown causes.
Use --retry-unknown to retry jobs with unknown failure causes.
Or use the Torc MCP server with your AI assistant to investigate.
This prevents wasting allocation time on jobs that likely have script or data bugs.
4. If Max Retries Exceeded
If failures persist after max retries:
Max retries (3) exceeded. Manual intervention required.
Use the Torc MCP server with your AI assistant to investigate.
At this point, you can use the MCP server with an AI assistant to investigate the root cause.
Log Files
All torc watch output is logged to both the terminal and a log file:
<output-dir>/watch_<hostname>_<workflow_id>.log
For example: output/watch_myhost_42.log
This ensures you have a complete record of the watch session even if your terminal disconnects.
When to Use Manual Recovery
Automatic recovery works well for resource-related failures, but some situations require manual intervention:
Use Manual Recovery When:
- Jobs keep failing after max retries
  - The heuristics aren't solving the problem
  - Need to investigate root cause
- Unknown failure modes
  - Exit codes that don't indicate OOM/timeout
  - Application-specific errors
- Code bugs
  - Jobs fail consistently with same error
  - No resource issue detected
- Cost optimization
  - Want to analyze actual usage before increasing
  - Need to decide whether the job is worth more resources
MCP Server for Manual Recovery
The Torc MCP server provides tools for AI-assisted investigation:
| Tool | Purpose |
|---|---|
| get_workflow_status | Get overall workflow status |
| list_failed_jobs | List failed jobs with error info |
| get_job_logs | Read stdout/stderr logs |
| check_resource_utilization | Detailed resource analysis |
| update_job_resources | Manually adjust resources |
| resubmit_workflow | Regenerate Slurm schedulers |
Best Practices
1. Start with Conservative Resources
Set initial resource requests lower and let auto-recovery increase them:
- Jobs that succeed keep their original allocation
- Only failing jobs get increased resources
- Avoids wasting HPC resources on over-provisioned jobs
2. Set Reasonable Max Retries
--max-retries 3 # Good for most workflows
Too many retries can waste allocation time on jobs that will never succeed.
3. Use Appropriate Multipliers
For memory-bound jobs:
--memory-multiplier 2.0 # Double on OOM
For time-sensitive jobs where you want larger increases:
--runtime-multiplier 2.0 # Double runtime on timeout
4. Run in tmux or screen
Always run torc watch inside tmux or screen for long-running workflows. HPC workflows can run
for hours or days, and you don't want to lose your monitoring session if:
- Your SSH connection drops
- Your laptop goes to sleep
- You need to disconnect and reconnect later
Using tmux (recommended):
# Start a new tmux session
tmux new -s torc-watch
# Run the watch command
torc watch 42 --recover --poll-interval 300 --show-job-counts
# Detach from session: press Ctrl+b, then d
# Reattach later: tmux attach -t torc-watch
Using screen:
screen -S torc-watch
torc watch 42 --recover --poll-interval 300 --show-job-counts
# Detach: Ctrl+a, then d
# Reattach: screen -r torc-watch
5. Check Resource Utilization Afterward
After completion, review actual usage:
torc reports check-resource-utilization 42
This helps tune future job specifications.
Troubleshooting
Jobs Stuck in "Running" Status
If jobs appear stuck in "running" status after a Slurm allocation ended:
- This usually means the allocation was terminated unexpectedly (timeout, node failure, etc.)
- The torc recover command automatically handles this as its first step
- To manually clean up without triggering recovery, use: torc workflows sync-status <workflow_id>
- To preview what would be cleaned up: torc workflows sync-status <workflow_id> --dry-run
See Debugging Slurm Workflows for more details.
Jobs Keep Failing After Recovery
If jobs fail repeatedly with the same error:
- Check if the error is resource-related (OOM/timeout)
- Review job logs: torc jobs logs <job_id>
- Check if there's a code bug
- Use MCP server with AI assistant to investigate
No Slurm Schedulers Generated
If slurm regenerate fails:
- Ensure the workflow was created with the --account option
- Check that the HPC profile is detected: torc hpc detect
- Specify the profile explicitly: --profile kestrel
Resource Limits Too High
If jobs are requesting more resources than partitions allow:
- Check partition limits: torc hpc partitions <profile>
- Use smaller multipliers
- Consider splitting jobs into smaller pieces
Comparison: Automatic vs Manual Recovery
| Feature | Automatic | Manual/AI-Assisted |
|---|---|---|
| Human involvement | None | Interactive |
| Speed | Fast | Depends on human |
| Handles OOM/timeout | Yes | Yes |
| Handles unknown errors | Retry only | Full investigation |
| Cost optimization | Basic | Can be sophisticated |
| Use case | Production workflows | Debugging, optimization |
Implementation Details
The Watch Command Flow
- Poll the is_workflow_complete API
- Print status updates
- On completion, check for failures
- If failures and recover enabled:
  - Run torc reports check-resource-utilization --include-failed
  - Parse results for likely_oom and likely_timeout flags
  - Update resource requirements via API
  - Run torc workflows reset-status --failed-only --reinitialize
  - Run torc slurm regenerate --submit
  - Increment retry counter
  - Resume polling
- Exit 0 on success, exit 1 on max retries exceeded
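Pictured as code, that flow is roughly the loop below. The workflow_is_complete and has_failures helpers are stand-ins for Torc's internal API calls; only the quoted CLI commands come from the steps above, and their exact arguments may differ.

# Rough sketch of the torc watch --recover loop described above; not the real
# implementation. The two helper functions stand in for Torc API calls.
import subprocess
import time

def workflow_is_complete(workflow_id: int) -> bool:
    raise NotImplementedError("stand-in for the is_workflow_complete API")

def has_failures(workflow_id: int) -> bool:
    raise NotImplementedError("stand-in for checking failed-job counts")

def run(cmd: str) -> None:
    subprocess.run(cmd, shell=True, check=True)

def watch_with_recover(workflow_id: int, max_retries: int = 3, poll: int = 60) -> int:
    retries = 0
    while True:
        while not workflow_is_complete(workflow_id):
            time.sleep(poll)                      # poll until the workflow completes
        if not has_failures(workflow_id):
            return 0                              # success
        if retries >= max_retries:
            return 1                              # max retries exceeded
        run(f"torc reports check-resource-utilization {workflow_id} --include-failed")
        # ...parse likely_oom / likely_timeout flags and adjust resources here...
        run(f"torc workflows reset-status {workflow_id} --failed-only --reinitialize")
        run("torc slurm regenerate --submit")
        retries += 1                              # then resume polling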
The Regenerate Command Flow
- Query jobs with status uninitialized/ready/blocked
- Group by resource requirements
- For each group:
  - Find best partition using HPC profile
  - Calculate jobs per node
  - Determine number of allocations needed
  - Create scheduler config
  - Update jobs with new scheduler reference
- Submit allocations via sbatch
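To make the per-group arithmetic concrete, a much simplified version of the grouping and allocation count could look like the sketch below; the 104-CPU node size and the field names are assumptions, not Torc's actual scheduler logic.

# Simplified illustration of the regenerate flow: group pending jobs by their
# resource requirements, then estimate how many allocations each group needs.
from collections import defaultdict
from math import ceil

def plan_allocations(jobs, node_cpus=104):
    """jobs: list of dicts with 'status' and 'num_cpus' (illustrative fields)."""
    pending = [j for j in jobs if j["status"] in ("uninitialized", "ready", "blocked")]
    groups = defaultdict(list)
    for job in pending:
        groups[job["num_cpus"]].append(job)        # group by resource requirements
    plan = {}
    for num_cpus, members in groups.items():
        jobs_per_node = max(1, node_cpus // num_cpus)        # jobs that fit per node
        plan[num_cpus] = ceil(len(members) / jobs_per_node)  # allocations needed
    return plan

print(plan_allocations([{"status": "ready", "num_cpus": 8} for _ in range(100)]))
# {8: 8} with the assumed 104-CPU nodes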
See Also
- Configurable Failure Handlers - Per-job retry with exit-code-specific recovery
- Resource Monitoring - Understanding resource tracking
Tutorial: Configurable Failure Handlers
This tutorial shows how to configure automatic job retry based on specific exit codes, with optional recovery scripts.
Learning Objectives
By the end of this tutorial, you will:
- Understand failure handlers and rules
- Configure exit-code-specific recovery
- Write recovery scripts
- Monitor retry attempts
Prerequisites
- Torc installed with the client feature
- A running Torc server
What Are Failure Handlers?
Failure handlers provide per-job automatic retry logic based on exit codes. Unlike
torc watch --recover which applies workflow-level recovery heuristics, failure handlers let you
define:
- Exit codes to match: Which return codes trigger recovery
- Recovery scripts: Optional scripts to run before retry
- Max retries: How many times to retry per exit code
Quick Start
Try the runnable demo:
torc run examples/yaml/failure_handler_demo.yaml
This workflow includes jobs that randomly fail with different exit codes, demonstrating how failure handlers automatically retry them.
Basic Example
Create a workflow with a failure handler:
failure_handlers:
- name: job_recovery
rules:
- exit_codes: [10, 11, 12]
recovery_script: ./recovery.sh
max_retries: 3
jobs:
- name: process_data
command: python process.py
failure_handler: job_recovery
How It Works
When a job with a failure handler fails:
- JobRunner checks the exit code against handler rules
- If a matching rule is found and attempt_id < max_retries:
  - Run recovery_script (if defined) with environment variables
  - If recovery succeeds: job is reset to Ready with incremented attempt_id
  - If recovery fails: job is marked as Failed
- If no match or max retries exceeded: job is marked as Failed
flowchart TD
fail["Job Fails<br/>(exit code 10)"]
match{"Matching rule<br/>for exit code?"}
retry{"attempt_id<br/>< max_retries?"}
recovery["Run recovery script"]
success{"Recovery<br/>succeeded?"}
reset["Reset to Ready<br/>attempt_id += 1"]
failed["Mark Failed"]
fail --> match
match -->|Yes| retry
match -->|No| failed
retry -->|Yes| recovery
retry -->|No| failed
recovery --> success
success -->|Yes| reset
success -->|No| failed
style fail fill:#dc3545,color:#fff
style reset fill:#28a745,color:#fff
style failed fill:#6c757d,color:#fff
style recovery fill:#ffc107,color:#000
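In code form, the rule matching in the flowchart is essentially "first matching rule wins". The sketch below mirrors the spec fields (exit_codes, match_all_exit_codes, max_retries) but is not the JobRunner's actual implementation:

# Minimal sketch of failure-handler rule matching; illustrative only.

def find_matching_rule(rules, exit_code):
    """Rules are evaluated in order; the first match wins."""
    for rule in rules:
        if rule.get("match_all_exit_codes") or exit_code in rule.get("exit_codes", []):
            return rule
    return None

def decide(rules, exit_code, attempt_id):
    rule = find_matching_rule(rules, exit_code)
    if rule is None:
        return "failed"   # or pending_failed; see "What Happens Without a Matching Handler"
    if attempt_id >= rule["max_retries"]:
        return "failed"   # max retries exceeded
    # A real run would execute rule.get("recovery_script") here before retrying.
    return "retry"

rules = [{"exit_codes": [10, 11, 12], "recovery_script": "./recovery.sh", "max_retries": 3}]
print(decide(rules, 10, attempt_id=1))  # retry
print(decide(rules, 1, attempt_id=1))   # failed (no matching rule)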
Environment Variables
Recovery scripts receive the following environment variables:
| Variable | Description |
|---|---|
| TORC_WORKFLOW_ID | Workflow ID |
| TORC_JOB_ID | Job ID |
| TORC_JOB_NAME | Job name |
| TORC_API_URL | Server URL |
| TORC_OUTPUT_DIR | Output directory |
| TORC_ATTEMPT_ID | Current attempt (1, 2, 3...) |
| TORC_RETURN_CODE | Exit code that triggered recovery |
Jobs also receive these environment variables during execution:
| Variable | Description |
|---|---|
| TORC_WORKFLOW_ID | Workflow ID |
| TORC_JOB_ID | Job ID |
| TORC_JOB_NAME | Job name |
| TORC_API_URL | Server URL |
| TORC_OUTPUT_DIR | Output directory |
| TORC_ATTEMPT_ID | Current attempt (starts at 1) |
Writing Recovery Scripts
Example recovery script:
#!/bin/bash
# recovery.sh - Run before retrying a failed job
echo "=== Recovery Script ==="
echo "Job: $TORC_JOB_NAME (attempt $TORC_ATTEMPT_ID)"
echo "Failed with exit code: $TORC_RETURN_CODE"
# Log the recovery attempt
LOG_FILE="$TORC_OUTPUT_DIR/recovery.log"
echo "$(date): Recovery for $TORC_JOB_NAME (exit $TORC_RETURN_CODE)" >> "$LOG_FILE"
# Take action based on exit code
case $TORC_RETURN_CODE in
10)
echo "Handling convergence error - adjusting parameters"
# Modify config files, adjust parameters, etc.
;;
11)
echo "Handling resource error - cleaning up"
# Free resources, clean temp files, etc.
;;
12)
echo "Handling transient error - no action needed"
;;
esac
exit 0 # Zero = proceed with retry
Make sure your script is executable:
chmod +x recovery.sh
Multiple Rules
Handle different exit codes with different strategies:
failure_handlers:
- name: comprehensive_recovery
rules:
# Convergence errors: adjust parameters
- exit_codes: [10]
recovery_script: ./adjust_params.sh
max_retries: 3
# Resource errors: clean up and retry
- exit_codes: [11]
recovery_script: ./cleanup.sh
max_retries: 2
# Transient errors: simple retry
- exit_codes: [12]
max_retries: 3
# Exit code 1 is NOT listed - jobs with exit 1 won't retry
Shared Handlers
Multiple jobs can share a failure handler:
failure_handlers:
- name: simulation_recovery
rules:
- exit_codes: [10, 11, 12]
recovery_script: ./recovery.sh
max_retries: 3
jobs:
- name: simulation_1
command: python simulate.py --config config1.yaml
failure_handler: simulation_recovery
- name: simulation_2
command: python simulate.py --config config2.yaml
failure_handler: simulation_recovery
Simple Retry for Any Failure
For jobs that are simply flaky and need retrying on any failure, use match_all_exit_codes:
failure_handlers:
- name: simple_retry
rules:
- match_all_exit_codes: true
max_retries: 3
jobs:
- name: flaky_job
command: ./flaky_script.sh
failure_handler: simple_retry
This retries the job up to 3 times on any non-zero exit code, without running a recovery script.
You can also combine match_all_exit_codes with specific exit code rules. Rules are evaluated in
order, so put specific rules first:
failure_handlers:
- name: mixed_recovery
rules:
# Specific handling for known error codes
- exit_codes: [10]
recovery_script: ./fix_convergence.sh
max_retries: 3
# Catch-all for any other failures
- match_all_exit_codes: true
max_retries: 2
Log Files
Each attempt gets separate log files, preserving history across retries:
output/job_stdio/job_wf1_j42_r1_a1.o # Attempt 1 stdout
output/job_stdio/job_wf1_j42_r1_a1.e # Attempt 1 stderr
output/job_stdio/job_wf1_j42_r1_a2.o # Attempt 2 stdout
output/job_stdio/job_wf1_j42_r1_a2.e # Attempt 2 stderr
The a{N} suffix indicates the attempt number.
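Because each attempt keeps its own files, you can collect a job's retry history with a simple glob. The pattern below just follows the example filenames above (with the run number wildcarded) and is illustrative only:

# Gather per-attempt stderr files for one job, following the naming shown above.
from pathlib import Path

def attempt_stderr_files(output_dir: str, workflow_id: int, job_id: int):
    pattern = f"job_wf{workflow_id}_j{job_id}_*_a*.e"
    return sorted(Path(output_dir, "job_stdio").glob(pattern))

for path in attempt_stderr_files("output", 1, 42):
    print(path)  # e.g. output/job_stdio/job_wf1_j42_r1_a1.e, ..._a2.e, ...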
Comparison with torc watch --recover
| Feature | Failure Handlers | torc watch --recover |
|---|---|---|
| Scope | Per-job, exit-code-specific | Workflow-wide |
| Triggers | Specific exit codes | OOM, timeout detection |
| Recovery | Custom scripts | Resource adjustment |
| Timing | Immediate (during run) | After workflow completes |
| Configuration | In workflow spec | Command-line options |
Use both together for comprehensive recovery:
- Failure handlers for immediate, exit-code-specific retry
- torc watch --recover for workflow-level resource adjustments
Slurm Integration with --auto-schedule
When using failure handlers with Slurm workflows, retried jobs need compute nodes to run on. The original Slurm allocations may have already completed or may not have enough capacity for the retries.
Use torc watch --auto-schedule to automatically submit new Slurm allocations when retry jobs are
waiting:
# Submit a workflow with failure handlers
torc submit-slurm --account my_project workflow.yaml
# Watch with auto-scheduling enabled (uses defaults)
torc watch $WORKFLOW_ID --auto-schedule
How It Works
- No schedulers available: If there are ready jobs but no active or pending Slurm allocations, new schedulers are immediately regenerated and submitted.
- Retry jobs accumulating: If there are active schedulers but retry jobs (jobs with attempt_id > 1) exceed the threshold, additional schedulers are submitted after the cooldown period.
- Stranded jobs: If retry jobs exist but are below the threshold and have been waiting longer than the stranded timeout, schedulers are submitted anyway. This prevents jobs from being stranded indefinitely when not enough failures occur to reach the threshold.
Options
| Option | Default | Description |
|---|---|---|
| --auto-schedule | false | Enable automatic scheduling for stranded jobs |
| --auto-schedule-threshold | 5 | Minimum retry jobs before scheduling (when active) |
| --auto-schedule-cooldown | 1800 | Seconds between auto-schedule attempts (30 min) |
| --auto-schedule-stranded-timeout | 7200 | Seconds before scheduling stranded jobs anyway (2 hrs) |
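Put together, the three conditions and the options above combine roughly as in this sketch. The real decision is made inside torc watch; whether the stranded-timeout check also honors the cooldown is an assumption here.

# Rough sketch of the --auto-schedule decision; inputs are simplified and would
# come from the Torc server in the real command.
import time

def should_schedule(ready_jobs, retry_jobs, active_schedulers,
                    last_schedule_time, retry_waiting_since,
                    threshold=5, cooldown=1800, stranded_timeout=7200):
    now = time.time()
    if ready_jobs and not active_schedulers:
        return True                                    # no schedulers available
    if now - last_schedule_time < cooldown:
        return False                                   # respect the cooldown
    if retry_jobs >= threshold:
        return True                                    # retry jobs accumulating
    if retry_jobs and now - retry_waiting_since >= stranded_timeout:
        return True                                    # stranded jobs waited too long
    return False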
Example Scenario
Workflow starts with 100 jobs across 10 Slurm allocations
Jobs 15, 23, 47 fail with exit code 10 → failure handler retries them
→ Jobs reset to Ready with attempt_id=2
All 10 allocations finish (97 jobs completed)
torc watch detects:
- No active schedulers
- 3 ready jobs (all retries)
→ Auto-schedules new allocation
New allocation starts, runs the 3 retry jobs
→ Job 15 succeeds
→ Jobs 23, 47 fail again → retry with attempt_id=3
Process continues until all jobs succeed or max_retries exceeded
Complete Example
Here's a complete workflow specification with failure handlers. See
examples/yaml/failure_handler_simulation.yaml for the full runnable example.
name: failure_handler_simulation
description: Simulation sweep with automatic failure recovery
failure_handlers:
- name: simulation_recovery
rules:
# Convergence issues: run recovery script, retry up to 3 times
- exit_codes: [10]
recovery_script: examples/scripts/recovery_demo.sh
max_retries: 3
# Resource issues: run recovery script, retry up to 2 times
- exit_codes: [11]
recovery_script: examples/scripts/recovery_demo.sh
max_retries: 2
# Transient errors: simple retry, no recovery script
- exit_codes: [12]
max_retries: 3
# Note: exit code 1 is intentionally NOT included (unrecoverable)
jobs:
# Parameterized jobs that may fail with different error codes
- name: simulate_m{model}_s{scenario}
command: bash examples/scripts/failure_demo_job.sh --fail-rate 0.7
failure_handler: simulation_recovery
parameters:
model: "0:2"
scenario: "0:2"
# Runs after all simulations complete
- name: aggregate_results
command: echo "All simulations completed successfully!"
depends_on:
- simulate_m{model}_s{scenario}
parameters:
model: "0:2"
scenario: "0:2"
Run this example with:
torc run examples/yaml/failure_handler_simulation.yaml
Tips and Best Practices
1. Use Exit Codes Consistently
Define meaningful exit codes in your scripts:
# simulate.py
import sys
# ConvergenceError, ResourceError, and TransientError are application-specific
# exceptions assumed to be defined elsewhere in your code.
try:
    # Simulation code
    pass
except ConvergenceError:
    sys.exit(10)  # Will trigger recovery
except ResourceError:
    sys.exit(11)  # Will trigger recovery
except TransientError:
    sys.exit(12)  # Will trigger recovery
except Exception as e:
    print(f"Unrecoverable error: {e}")
    sys.exit(1)  # Will NOT trigger recovery
2. Keep Recovery Scripts Simple
Recovery scripts should be quick and focused:
# Good: Simple, focused recovery
#!/bin/bash
echo "Cleaning up before retry..."
rm -f /tmp/lockfile
exit 0
# Avoid: Complex logic that might fail
3. Set Reasonable Max Retries
rules:
- exit_codes: [10]
max_retries: 3 # Good for transient errors
- exit_codes: [1]
max_retries: 1 # Generic errors - don't retry many times
4. Log Recovery Actions
Your recovery scripts should log what they're doing:
#!/bin/bash
LOG_FILE="$TORC_OUTPUT_DIR/recovery_${TORC_JOB_ID}.log"
echo "$(date): Recovery attempt $TORC_ATTEMPT_ID for exit code $TORC_RETURN_CODE" >> "$LOG_FILE"
Troubleshooting
Recovery Script Not Running
- Check that the script is executable: chmod +x script.sh
- Verify the script path is correct (relative to where torc runs)
- Check that the exit code matches a rule
Job Keeps Failing After Max Retries
- Check the job logs for all attempts
- Review the recovery script output
- Consider increasing max_retries or fixing the underlying issue
Environment Variables Not Set
Ensure you're accessing the variables correctly:
# Correct
echo $TORC_JOB_ID
# Also correct
echo ${TORC_JOB_ID}
Runnable Examples
The following examples are included with Torc and can be run directly:
- Simple demo: examples/yaml/failure_handler_demo.yaml
- Parameterized simulation: examples/yaml/failure_handler_simulation.yaml
These examples use:
- examples/scripts/failure_demo_job.sh - A bash script that randomly fails with different exit codes
- examples/scripts/recovery_demo.sh - A recovery script that logs actions
What Happens Without a Matching Handler
When a job fails with an exit code that doesn't match any failure handler rule, the job enters
pending_failed status instead of failed. This provides an opportunity for intelligent recovery:
flowchart TD
FAIL["Job fails<br/>(exit code 1)"]
MATCH{"Failure handler<br/>rule matches?"}
RETRY["Retry via<br/>failure handler"]
PENDING["Status: pending_failed<br/>Awaiting classification"]
AI["AI agent or user<br/>classifies error"]
TRANSIENT["Retry<br/>(transient error)"]
PERMANENT["Fail<br/>(permanent error)"]
FAIL --> MATCH
MATCH -->|Yes| RETRY
MATCH -->|No| PENDING
PENDING --> AI
AI -->|Transient| TRANSIENT
AI -->|Permanent| PERMANENT
style FAIL fill:#dc3545,color:#fff
style RETRY fill:#28a745,color:#fff
style PENDING fill:#ffc107,color:#000
style AI fill:#4a9eff,color:#fff
style TRANSIENT fill:#28a745,color:#fff
style PERMANENT fill:#6c757d,color:#fff
Benefits of pending_failed
- No immediate cascade: Downstream jobs stay blocked instead of being canceled
- Time to analyze: Errors can be reviewed before deciding retry vs fail
- AI-assisted recovery: MCP tools allow AI agents to classify errors intelligently
Handling pending_failed Jobs
Option 1: Manual reset
# Reset all pending_failed jobs (along with failed/canceled/terminated)
torc workflows reset-status $WORKFLOW_ID --failed-only
Option 2: AI-assisted classification
Use an AI agent with the torc MCP server:
- list_pending_failed_jobs - See jobs with their stderr
- classify_and_resolve_failures - Apply retry/fail decisions
See AI-Assisted Recovery for details.
Option 3: Catch-all failure handler
To prevent pending_failed status entirely, add a catch-all rule:
failure_handlers:
- name: comprehensive_recovery
rules:
# Specific handling for known codes
- exit_codes: [10, 11]
recovery_script: ./recover.sh
max_retries: 3
# Catch-all for any other failures
- match_all_exit_codes: true
max_retries: 1
With match_all_exit_codes: true, all failures are handled by the failure handler and will never
reach pending_failed status.
Summary
Failure handlers provide fine-grained control over job retry behavior:
- Define rules for specific exit codes
- Run recovery scripts before retry
- Limit retries to prevent infinite loops
- Share handlers across multiple jobs
- Unmatched failures enter pending_failed for AI-assisted or manual classification
Use failure handlers for immediate, exit-code-specific recovery, and combine with
torc watch --recover for comprehensive workflow resilience.
Tutorial: AI-Assisted Failure Recovery
🧪 EXPERIMENTAL: This feature is new and not yet well-tested. The API and behavior may change based on user feedback.
This tutorial shows how to use AI agents (Claude Code, GitHub Copilot, or custom MCP clients) to intelligently classify and recover from workflow failures that can't be handled by rule-based mechanisms.
Learning Objectives
By the end of this tutorial, you will:
- Understand the pending_failed job status
- Configure workflows for AI-assisted recovery
- Use the torc MCP server with AI agents
- Classify transient vs permanent errors
Prerequisites
- Torc installed with the client feature
- A running Torc server
- An MCP-compatible AI agent (Claude Code, GitHub Copilot, or custom)
- torc-mcp-server installed
The Problem
Traditional recovery mechanisms have limitations:
| Mechanism | Limitation |
|---|---|
| Failure handlers | Require predefined exit codes |
| torc watch --recover | Only handles OOM and timeout patterns |
| --retry-unknown | Blindly retries everything |
Some failures require intelligent classification:
- Transient errors: Network timeouts, node failures, service outages - should retry
- Permanent errors: Code bugs, missing files, invalid inputs - should fail
AI agents can analyze error logs and make these distinctions.
How It Works
flowchart TD
JOB["Job exits with<br/>non-zero code"]
HANDLER{"Failure handler<br/>matches?"}
RETRY["Retry via<br/>failure handler"]
PENDING["Status: pending_failed<br/>Awaiting classification"]
WATCH["torc watch/recover<br/>+ AI agent"]
CLASSIFY["AI analyzes stderr"]
TRANSIENT["Transient error<br/>→ Retry"]
PERMANENT["Permanent error<br/>→ Fail"]
JOB --> HANDLER
HANDLER -->|Yes| RETRY
HANDLER -->|No| PENDING
PENDING --> WATCH
WATCH --> CLASSIFY
CLASSIFY --> TRANSIENT
CLASSIFY --> PERMANENT
style JOB fill:#dc3545,color:#fff
style PENDING fill:#ffc107,color:#000
style CLASSIFY fill:#4a9eff,color:#fff
style TRANSIENT fill:#28a745,color:#fff
style PERMANENT fill:#6c757d,color:#fff
When a job fails without a matching failure handler rule, it enters the pending_failed status
instead of failed. This prevents immediate downstream job cancellation and gives the AI agent time
to classify the error.
Quick Start
Option A: Automatic AI Agent Invocation (Recommended)
Use the --ai-recovery flag to automatically invoke the Claude CLI for classification:
# One-shot recovery with AI classification
torc recover 123 --ai-recovery
# Continuous monitoring with AI classification
torc watch 123 --ai-recovery
# Specify a different AI agent
torc recover 123 --ai-recovery --ai-agent claude # Default
torc recover 123 --ai-recovery --ai-agent copilot # GitHub Copilot
When --ai-recovery is enabled:
- Torc detects jobs in pending_failed status
- Automatically invokes the AI agent CLI with the torc MCP server
- AI agent analyzes stderr and classifies each job as transient (retry) or permanent (fail)
- Classifications are applied via MCP tools
- Recovery continues with the newly classified jobs
Requirements:
- Claude: Claude Code CLI installed (claude command available)
- GitHub Copilot: GitHub CLI with Copilot installed (gh copilot command available)
- Torc MCP server configured in your AI agent's MCP settings
Option B: Manual AI Agent Invocation
If you prefer manual control, configure your AI agent and invoke it yourself.
1. Start the MCP Server
torc-mcp-server --url http://localhost:8080/torc-service/v1
2. Configure Your AI Agent
Add the torc MCP server to your agent's configuration:
Claude Code (~/.claude/mcp_servers.json):
{
"mcpServers": {
"torc": {
"command": "torc-mcp-server",
"args": ["--url", "http://localhost:8080/torc-service/v1"]
}
}
}
GitHub Copilot (.github/copilot/mcp-config.json or global config):
{
"mcpServers": {
"torc": {
"command": "torc-mcp-server",
"args": ["--url", "http://localhost:8080/torc-service/v1"]
}
}
}
3. Run a Workflow
torc run my_workflow.yaml
4. Monitor with AI Recovery
When jobs fail, use your AI agent to:
- List pending failures:
  Agent: Use list_pending_failed_jobs with workflow_id=123
- Analyze the errors:
  Agent: The stderr shows "Connection refused to storage.example.com:443". This is a transient network error - the storage server was temporarily down.
- Classify and resolve:
  Agent: Use classify_and_resolve_failures to retry these jobs
MCP Tools
The torc MCP server provides these tools for AI-assisted recovery:
list_pending_failed_jobs
Lists jobs with pending_failed status, including their stderr output.
Input:
{
"workflow_id": 123
}
Output:
{
"workflow_id": 123,
"pending_failed_count": 2,
"pending_failed_jobs": [
{
"job_id": 456,
"name": "process_data",
"return_code": 1,
"stderr_tail": "ConnectionError: Connection refused..."
}
],
"guidance": "Analyze the stderr output to classify each failure..."
}
classify_and_resolve_failures
Applies classifications to pending_failed jobs.
Input:
{
"workflow_id": 123,
"classifications": [
{
"job_id": 456,
"action": "retry",
"reason": "Transient network error - storage server was down"
},
{
"job_id": 789,
"action": "fail",
"reason": "SyntaxError in user code - requires fix"
}
],
"dry_run": true
}
Actions:
- retry: Reset to ready status with a bumped attempt_id
- fail: Set to failed status (triggers downstream cancellation)
Optional resource adjustments:
{
"job_id": 456,
"action": "retry",
"memory": "16g",
"runtime": "PT4H",
"reason": "OOM detected in stderr, increasing memory"
}
Error Classification Guide
Transient Errors (Should Retry)
| Error Pattern | Category |
|---|---|
| Connection refused, Connection timed out | Network |
| NCCL timeout, GPU communication error | GPU/HPC |
| EIO, Input/output error | Hardware |
| Slurm: node failure, PREEMPTED | HPC scheduling |
| Service Unavailable, 503 | External service |
Permanent Errors (Should Fail)
| Error Pattern | Category |
|---|---|
| SyntaxError, IndentationError | Code bug |
| ModuleNotFoundError, ImportError | Missing dependency |
| FileNotFoundError (for input files) | Missing data |
| IndexError, KeyError | Logic error |
| PermissionDenied (consistent) | Configuration issue |
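The two tables can be read as a rough rule of thumb. A hand-rolled classifier over stderr (which an AI agent performs far more flexibly) might look like this sketch:

# Rule-of-thumb classifier based on the error patterns tabulated above.
# An AI agent does this classification more flexibly; this is only a sketch.
import re

TRANSIENT = [r"Connection refused", r"Connection timed out", r"NCCL timeout",
             r"Input/output error", r"node failure", r"PREEMPTED",
             r"Service Unavailable", r"\b503\b"]
PERMANENT = [r"SyntaxError", r"IndentationError", r"ModuleNotFoundError",
             r"ImportError", r"FileNotFoundError", r"IndexError", r"KeyError",
             r"PermissionDenied"]

def classify(stderr_tail: str) -> str:
    if any(re.search(p, stderr_tail) for p in PERMANENT):
        return "fail"     # permanent: needs a code or data fix
    if any(re.search(p, stderr_tail) for p in TRANSIENT):
        return "retry"    # transient: safe to retry
    return "unknown"      # leave for human or AI review

print(classify("ConnectionError: Connection refused to storage.internal:443"))  # retry
print(classify("ValueError: Column 'timestamp' has invalid format"))            # unknown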
Integration with Existing Recovery
AI-assisted recovery works alongside other mechanisms:
failure_handlers:
- name: known_errors
rules:
# Known recoverable exit codes handled immediately
- exit_codes: [10, 11]
recovery_script: ./recover.sh
max_retries: 3
# Unknown errors go to pending_failed for AI classification
When a job fails with an exit code not covered by the failure handler, it becomes pending_failed
instead of failed.
Using with torc recover
The torc recover command can work with AI agents:
# Check for pending_failed jobs
torc jobs list $WORKFLOW_ID --status pending_failed
# Use AI agent to classify and resolve
# (via MCP tools list_pending_failed_jobs and classify_and_resolve_failures)
# After AI classification, reinitialize and continue
torc workflows reinitialize $WORKFLOW_ID
torc workflows run $WORKFLOW_ID
Custom MCP Servers
You can create domain-specific MCP servers that the AI agent uses alongside torc:
flowchart LR
subgraph agent["AI Agent"]
LLM["Claude/Copilot"]
end
subgraph mcp["MCP Servers"]
TORC["torc-mcp-server"]
CUSTOM["custom-mcp-server"]
end
subgraph systems["Systems"]
API["Torc API"]
DOMAIN["Domain APIs"]
end
LLM --> TORC
LLM --> CUSTOM
TORC --> API
CUSTOM --> DOMAIN
style LLM fill:#4a9eff,color:#fff
style TORC fill:#28a745,color:#fff
style CUSTOM fill:#ffc107,color:#000
For example, a custom MCP server could:
- Check external service status (was the storage server down?)
- Look up known error patterns in a database
- Query infrastructure monitoring systems
Workflow Reset
If AI classification doesn't happen (e.g., no agent is running), jobs stay in pending_failed
status indefinitely. To reset them:
# Reset all failed/pending_failed jobs
torc workflows reset-status $WORKFLOW_ID --failed-only
Example Session
Here's an example conversation with an AI agent:
User: My workflow 123 has some failures. Can you help?
Agent: Let me check what's happening.
[Uses get_workflow_status tool]
The workflow has 3 jobs in pending_failed status.
Agent: [Uses list_pending_failed_jobs tool]
I found 3 jobs awaiting classification:
Job 456 (process_batch_1): stderr shows
"ConnectionError: Connection refused to storage.internal:443"
Job 457 (process_batch_2): stderr shows the same connection error
Job 458 (validate_data): stderr shows
"ValueError: Column 'timestamp' has invalid format"
Agent: Analysis:
- Jobs 456, 457: Transient network error. The storage server was
temporarily unreachable. These should be retried.
- Job 458: Permanent error. There's a data format issue that
requires fixing the input data.
Agent: [Uses classify_and_resolve_failures with dry_run=true]
Here's my proposed classification:
- Job 456: retry (transient network error)
- Job 457: retry (transient network error)
- Job 458: fail (invalid input data)
Should I apply these classifications?
User: Yes, go ahead.
Agent: [Uses classify_and_resolve_failures with dry_run=false]
Done! Jobs 456 and 457 are now ready to retry.
Job 458 has been marked as failed.
Note: You'll need to fix the timestamp format in your input data
before rerunning job 458.
Summary
AI-assisted recovery provides intelligent failure classification:
- pending_failed status: Defers classification for AI analysis
- MCP tools: list_pending_failed_jobs, classify_and_resolve_failures
- Error patterns: Transient vs permanent classification
- Integration: Works with failure handlers and torc recover
When running jobs on HPC systems like Slurm, your job may be terminated when the allocated wall-time expires. Torc supports graceful termination, allowing jobs to save checkpoints before exiting. This guide explains how to configure Slurm and your jobs to handle wall-time timeouts gracefully.
Overview
When Slurm is about to reach wall-time, it can be configured to send a SIGTERM signal to the Torc worker process. Torc then:
- Sends SIGTERM to jobs with supports_termination: true
- Sends SIGKILL to jobs with supports_termination: false (or unset)
- Waits for all processes to exit
- Reports job status as terminated to the server
Jobs that support termination can catch SIGTERM and perform cleanup operations like saving checkpoints, flushing buffers, or releasing resources.
Enabling Graceful Termination
Configuring Slurm to Send a Signal Before Timeout
By default, Slurm does not send any signal before the job's end time. When the wall-time limit
is reached, Slurm immediately terminates all processes. To receive a warning signal before timeout,
you must explicitly configure it using the --signal option in the extra field of your Slurm
scheduler specification:
slurm_schedulers:
- name: gpu_scheduler
account: my_project
partition: gpu
nodes: 1
walltime: "04:00:00"
extra: "--signal=B:TERM@300" # Send SIGTERM to batch script 300 seconds before timeout
The --signal option format is [B:]<sig_num>[@sig_time]:
- The B: prefix sends the signal only to the batch shell (by default, all job steps are signaled but not the batch shell itself)
- sig_num is the signal name or number (e.g., TERM, USR1, 10)
- sig_time is the number of seconds before the time limit to send the signal (default: 60 if not specified)
Note: Due to Slurm's event handling resolution, the signal may be sent up to 60 seconds earlier than specified.
To enable graceful termination for a job, set supports_termination: true in your job
specification:
Configuring a Torc job to be terminated gracefully
jobs:
- name: training_job
command: python train.py --checkpoint-dir /scratch/checkpoints
supports_termination: true
resource_requirements:
num_cpus: 4
memory: 16g
runtime: PT2H
Writing a Job That Handles SIGTERM
Your job script must catch SIGTERM and save its state. Here's a Python example:
import signal
import sys
import pickle

# Global state
checkpoint_path = "/scratch/checkpoints/model.pkl"
model_state = None

def save_checkpoint():
    """Save current model state to disk."""
    print("Saving checkpoint...")
    with open(checkpoint_path, 'wb') as f:
        pickle.dump(model_state, f)
    print(f"Checkpoint saved to {checkpoint_path}")

def handle_sigterm(signum, frame):
    """Handle SIGTERM by saving checkpoint and exiting."""
    print("Received SIGTERM - saving checkpoint before exit")
    save_checkpoint()
    sys.exit(0)  # Exit cleanly after saving

# Register the signal handler
signal.signal(signal.SIGTERM, handle_sigterm)

# Main training loop
def train():
    global model_state
    for epoch in range(1000):
        # Training logic here...
        model_state = {"epoch": epoch, "weights": [...]}
        # Optionally save periodic checkpoints
        if epoch % 100 == 0:
            save_checkpoint()

if __name__ == "__main__":
    train()
Bash Script Example
For shell scripts, use trap to catch SIGTERM:
#!/bin/bash
CHECKPOINT_FILE="/scratch/checkpoints/progress.txt"
# Function to save checkpoint
save_checkpoint() {
echo "Saving checkpoint at iteration $ITERATION"
echo "$ITERATION" > "$CHECKPOINT_FILE"
}
# Trap SIGTERM and save checkpoint
trap 'save_checkpoint; exit 0' SIGTERM
# Load checkpoint if exists
if [ -f "$CHECKPOINT_FILE" ]; then
ITERATION=$(cat "$CHECKPOINT_FILE")
echo "Resuming from iteration $ITERATION"
else
ITERATION=0
fi
# Main loop
while [ $ITERATION -lt 1000 ]; do
# Do work...
ITERATION=$((ITERATION + 1))
sleep 1
done
Complete Workflow Example
name: ml_training_workflow
user: researcher
jobs:
- name: preprocess
command: python preprocess.py
supports_termination: false # Quick job, no checkpointing needed
- name: train_model
command: python train.py --checkpoint-dir /scratch/checkpoints
supports_termination: true # Long job, needs checkpointing
depends_on:
- preprocess
resource_requirements:
num_cpus: 8
memory: 32g
num_gpus: 1
runtime: PT4H
- name: evaluate
command: python evaluate.py
supports_termination: true
depends_on:
- train_model
slurm_schedulers:
- name: gpu_scheduler
account: my_project
partition: gpu
nodes: 1
walltime: "04:00:00"
extra: "--signal=B:TERM@300" # Send SIGTERM to batch script 300 seconds before timeout
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: gpu_scheduler
scheduler_type: slurm
num_allocations: 1
Restarting After Termination
When a job is terminated due to wall-time, it will have status terminated. To continue the
workflow:
- Re-submit the workflow to allocate new compute time: torc workflows submit $WORKFLOW_ID
- Reinitialize terminated jobs to make them ready again: torc workflows reinitialize $WORKFLOW_ID
Your job script should detect existing checkpoints and resume from where it left off.
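For completeness, the resume side of the Python example above might look like this sketch (same checkpoint path and pickle format as that example; adapt it to your own job):

# Sketch of resuming from the checkpoint written by the earlier example.
import os
import pickle

checkpoint_path = "/scratch/checkpoints/model.pkl"

def load_checkpoint():
    """Return the saved state, or None to start from scratch."""
    if not os.path.exists(checkpoint_path):
        return None
    with open(checkpoint_path, "rb") as f:
        return pickle.load(f)

state = load_checkpoint()
start_epoch = state["epoch"] + 1 if state else 0
print(f"Resuming training at epoch {start_epoch}")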
Best Practices
1. Verify Checkpoint Integrity
Add validation to ensure checkpoints are complete:
import os  # needed for the atomic rename below

def save_checkpoint():
    temp_path = checkpoint_path + ".tmp"
    with open(temp_path, 'wb') as f:
        pickle.dump(model_state, f)
    # Atomic rename ensures complete checkpoint
    os.rename(temp_path, checkpoint_path)
2. Handle Multiple Termination Signals
Some systems send multiple signals. Ensure your handler is idempotent:
checkpoint_saved = False

def handle_sigterm(signum, frame):
    global checkpoint_saved
    if not checkpoint_saved:
        save_checkpoint()
        checkpoint_saved = True
    sys.exit(0)
3. Test Locally
Test your SIGTERM handling locally before running on the cluster:
# Start your job
python train.py &
PID=$!
# Wait a bit, then send SIGTERM
sleep 10
kill -TERM $PID
# Verify checkpoint was saved
ls -la /scratch/checkpoints/
Troubleshooting
Job Killed Without Checkpointing
Symptoms: Job status is terminated but no checkpoint was saved.
Causes:
- supports_termination not set to true
- Signal handler not registered before training started
- Checkpoint save took longer than the buffer time
Solutions:
- Verify supports_termination: true in the job spec
- Register signal handlers early in your script
Checkpoint File Corrupted
Symptoms: Job fails to load checkpoint on restart.
Causes:
- Job was killed during checkpoint write
- Disk space exhausted
Solutions:
- Use atomic file operations (write to temp, then rename)
- Check available disk space before checkpointing
- Implement checkpoint validation on load
Job Doesn't Receive SIGTERM
Symptoms: Job runs until hard kill with no graceful shutdown.
Causes:
- Job running in a subprocess that doesn't forward signals
- Container or wrapper script intercepting signals
Solutions:
- Use exec in wrapper scripts to replace the shell
- Configure signal forwarding in containers
- Run the job directly without wrapper scripts
See Also
- Working with Slurm - General Slurm configuration
- Managing Resources - Resource requirements configuration
- Debugging Workflows - Troubleshooting workflow issues
- Slurm sbatch --signal option - Customize which signal is sent and when before wall-time timeout
Administration & Security
Server deployment, configuration, and security.
- Server Deployment - Deploying the Torc server
- Configuration Files - Server configuration
- Authentication - User authentication setup
- Access Groups - Managing user permissions
- Team-Based Access Control Tutorial - Setting up team access
- Security Reference - Security best practices
Server Deployment
This guide covers deploying and operating the Torc server in production environments, including logging configuration, daemonization, and service management.
Server Subcommands
The torc-server binary has two main subcommands:
torc-server run
Use torc-server run for:
- HPC login nodes - Run the server in a tmux session while your jobs are running.
- Development and testing - Run the server interactively in a terminal
- Manual startup - When you want to control when the server starts and stops
- Custom deployment - Integration with external process managers (e.g., supervisord, custom scripts)
- Debugging - Running with verbose logging to troubleshoot issues
# Basic usage
torc-server run
# With options
torc-server run --port 8080 --database ./torc.db --log-level debug
torc-server run --completion-check-interval-secs 5
torc-server service
Use torc-server service for:
- Production deployment - Install as a system service that starts on boot
- Reliability - Automatic restart on failure
- Managed lifecycle - Standard start/stop/status commands
- Platform integration - Uses systemd (Linux), launchd (macOS), or Windows Services
# Install and start as a user service
torc-server service install --user
torc-server service start --user
# Or as a system service (requires root)
sudo torc-server service install
sudo torc-server service start
Which to choose?
- For HPC login nodes/development/testing: Use torc-server run
- For production servers/standalone computers: Use torc-server service install
Quick Start
User Service (Development)
For development, install as a user service (no root required):
# Install with automatic defaults (logs to ~/.torc/logs, db at ~/.torc/torc.db)
torc-server service install --user
# Start the service
torc-server service start --user
System Service (Production)
For production deployment, install as a system service:
# Install with automatic defaults (logs to /var/log/torc, db at /var/lib/torc/torc.db)
sudo torc-server service install
# Start the service
sudo torc-server service start
The service will automatically start on boot and restart on failure. Logs are automatically configured to rotate when they reach 10 MiB (keeping 5 files max). See the Service Management section for customization options.
Logging System
Torc-server uses the tracing ecosystem for structured, high-performance logging with automatic
size-based file rotation.
Console Logging (Default)
By default, logs are written to stdout/stderr only:
torc-server run --log-level info
File Logging with Size-Based Rotation
Enable file logging by specifying a log directory:
torc-server run --log-dir /var/log/torc
This will:
- Write logs to both console and file
- Automatically rotate when log file reaches 10 MiB
- Keep up to 5 rotated log files (torc-server.log, torc-server.log.1, ..., torc-server.log.5)
- Oldest files are automatically deleted when limit is exceeded
JSON Format Logs
For structured log aggregation (e.g., ELK stack, Splunk):
torc-server run --log-dir /var/log/torc --json-logs
This writes JSON-formatted logs to the file while keeping human-readable logs on console.
Log Levels
Control verbosity with the --log-level flag or RUST_LOG environment variable:
# Available levels: error, warn, info, debug, trace
torc-server run --log-level debug --log-dir /var/log/torc
# Or using environment variable
RUST_LOG=debug torc-server run --log-dir /var/log/torc
Environment Variables
- TORC_LOG_DIR: Default log directory
- RUST_LOG: Default log level
Example:
export TORC_LOG_DIR=/var/log/torc
export RUST_LOG=info
torc-server run
Daemonization (Unix/Linux Only)
Run torc-server as a background daemon:
torc-server run --daemon --log-dir /var/log/torc
Important:
- Daemonization is only available on Unix/Linux systems
- When running as a daemon, you must use --log-dir since console output is lost
- The daemon creates a PID file (default: /var/run/torc-server.pid)
Custom PID File Location
torc-server run --daemon --pid-file /var/run/torc/server.pid --log-dir /var/log/torc
Stopping a Daemon
# Find the PID
cat /var/run/torc-server.pid
# Kill the process
kill $(cat /var/run/torc-server.pid)
# Or forcefully
kill -9 $(cat /var/run/torc-server.pid)
Complete Example: Production Deployment
#!/bin/bash
# Production deployment script
# Create required directories
sudo mkdir -p /var/log/torc
sudo mkdir -p /var/run/torc
sudo mkdir -p /var/lib/torc
# Set permissions (adjust as needed)
sudo chown -R torc:torc /var/log/torc
sudo chown -R torc:torc /var/run/torc
sudo chown -R torc:torc /var/lib/torc
# Start server as daemon
torc-server run \
--daemon \
--log-dir /var/log/torc \
--log-level info \
--json-logs \
--pid-file /var/run/torc/server.pid \
--database /var/lib/torc/torc.db \
--url 0.0.0.0 \
--port 8080 \
--threads 8 \
--auth-file /etc/torc/htpasswd \
--require-auth
Service Management (Recommended for Production)
Automatic Installation
The easiest way to install torc-server as a service is using the built-in service management commands.
User Service (No Root Required)
Install as a user service that runs under your user account (recommended for development):
# Install with defaults (logs to ~/.torc/logs, database at ~/.torc/torc.db)
torc-server service install --user
# Or customize the configuration
torc-server service install --user \
--log-dir ~/custom/logs \
--database ~/custom/torc.db \
--url 0.0.0.0 \
--port 8080 \
--threads 4
# Start the user service
torc-server service start --user
# Check status
torc-server service status --user
# Stop the service
torc-server service stop --user
# Uninstall the service
torc-server service uninstall --user
User Service Defaults:
- Log directory: ~/.torc/logs
- Database: ~/.torc/torc.db
- Listen address: 0.0.0.0:8080
- Worker threads: 4
System Service (Requires Root)
Install as a system-wide service (recommended for production):
# Install with defaults
sudo torc-server service install
# Or customize the configuration
sudo torc-server service install \
--log-dir /var/log/torc \
--database /var/lib/torc/torc.db \
--url 0.0.0.0 \
--port 8080 \
--threads 8 \
--auth-file /etc/torc/htpasswd \
--require-auth \
--json-logs
# Start the system service
sudo torc-server service start
# Check status
torc-server service status
# Stop the service
sudo torc-server service stop
# Uninstall the service
sudo torc-server service uninstall
System Service Defaults:
- Log directory: /var/log/torc
- Database: /var/lib/torc/torc.db
- Listen address: 0.0.0.0:8080
- Worker threads: 4
This automatically creates the appropriate service configuration for your platform:
- Linux: systemd service (user: ~/.config/systemd/user/, system: /etc/systemd/system/)
- macOS: launchd service (user: ~/Library/LaunchAgents/, system: /Library/LaunchDaemons/)
- Windows: Windows Service
Manual Systemd Service (Linux)
Alternatively, you can manually create a systemd service:
# /etc/systemd/system/torc-server.service
[Unit]
Description=Torc Workflow Orchestration Server
After=network.target
[Service]
Type=simple
User=torc
Group=torc
WorkingDirectory=/var/lib/torc
Environment="RUST_LOG=info"
Environment="TORC_LOG_DIR=/var/log/torc"
ExecStart=/usr/local/bin/torc-server run \
--log-dir /var/log/torc \
--json-logs \
--database /var/lib/torc/torc.db \
--url 0.0.0.0 \
--port 8080 \
--threads 8 \
--auth-file /etc/torc/htpasswd \
--require-auth
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
Then:
sudo systemctl daemon-reload
sudo systemctl enable torc-server
sudo systemctl start torc-server
sudo systemctl status torc-server
# View logs
journalctl -u torc-server -f
Log Rotation Strategy
The server uses automatic size-based rotation with the following defaults:
- Max file size: 10 MiB per file
- Max files: 5 rotated files (plus the current log file)
- Total disk usage: Maximum of ~50 MiB for all log files
When the current log file reaches 10 MiB, it is automatically rotated:
- torc-server.log → torc-server.log.1
- torc-server.log.1 → torc-server.log.2
- And so on...
- Oldest file (torc-server.log.5) is deleted
This ensures predictable disk usage without external tools like logrotate.
Timing Instrumentation
For advanced performance monitoring, enable timing instrumentation:
TORC_TIMING_ENABLED=true torc-server run --log-dir /var/log/torc
This adds detailed timing information for all instrumented functions. Note that timing instrumentation works with both console and file logging.
Troubleshooting
Daemon won't start
-
Check permissions on log directory:
ls -la /var/log/torc -
Check if PID file directory exists:
ls -la /var/run/ -
Try running in foreground first:
torc-server run --log-dir /var/log/torc
No log files created
- Verify
--log-diris specified - Check directory permissions
- Check disk space:
df -h
Logs not rotating
Log rotation happens automatically when a log file reaches 10 MiB. If you need to verify rotation is working:
- Check the log directory for numbered files (e.g.,
torc-server.log.1) - Monitor disk usage - it should never exceed ~50 MiB for all log files
- For testing, you can generate large amounts of logs with
--log-level trace
Using Configuration Files
This guide shows how to set up and use configuration files for Torc components.
Quick Start
Create a user configuration file:
torc config init --user
Edit the file at ~/.config/torc/config.toml to set your defaults.
Configuration File Locations
| Location | Purpose | Priority |
|---|---|---|
| /etc/torc/config.toml | System-wide defaults | 1 (lowest) |
| ~/.config/torc/config.toml | User preferences | 2 |
| ./torc.toml | Project-specific | 3 |
| Environment variables | Runtime overrides | 4 |
| CLI arguments | Explicit overrides | 5 (highest) |
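The priorities in the table behave like a layered merge, with later layers overriding earlier ones. Conceptually (the values here are just examples):

# Illustrative only: configuration precedence as a left-to-right dict merge.
system = {"api_url": "http://localhost:8080/torc-service/v1", "format": "table"}
user = {"format": "json"}                                  # ~/.config/torc/config.toml
project = {}                                               # ./torc.toml
env = {"api_url": "http://server:8080/torc-service/v1"}    # TORC_CLIENT__API_URL
cli = {}                                                   # --url / --format flags

effective = {**system, **user, **project, **env, **cli}
print(effective)
# {'api_url': 'http://server:8080/torc-service/v1', 'format': 'json'}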
Available Commands
# Show configuration file locations
torc config paths
# Show effective (merged) configuration
torc config show
# Show as JSON
torc config show --format json
# Create configuration file
torc config init --user # User config
torc config init --local # Project config
torc config init --system # System config (requires root)
# Validate configuration
torc config validate
Client Configuration
Common client settings:
[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table" # or "json"
log_level = "info"
username = "myuser"
[client.run]
poll_interval = 5.0
output_dir = "output"
max_parallel_jobs = 4
num_cpus = 8
memory_gb = 32.0
num_gpus = 1
Server Configuration
For torc-server:
[server]
url = "0.0.0.0"
port = 8080
threads = 4
database = "/path/to/torc.db"
auth_file = "/path/to/htpasswd"
require_auth = false
completion_check_interval_secs = 30.0
log_level = "info"
https = false
[server.logging]
log_dir = "/var/log/torc"
json_logs = false
Dashboard Configuration
For torc-dash:
[dash]
host = "127.0.0.1"
port = 8090
api_url = "http://localhost:8080/torc-service/v1"
torc_bin = "torc"
torc_server_bin = "torc-server"
standalone = false
server_port = 0
completion_check_interval_secs = 5
Environment Variables
Use environment variables for runtime configuration. Use double underscore (__) to separate nested
keys:
# Client settings
export TORC_CLIENT__API_URL="http://server:8080/torc-service/v1"
export TORC_CLIENT__FORMAT="json"
# Server settings
export TORC_SERVER__PORT="9999"
export TORC_SERVER__THREADS="8"
# Dashboard settings
export TORC_DASH__PORT="8090"
Overriding with CLI Arguments
CLI arguments always take precedence:
# Uses config file for api_url, but CLI for format
torc --format json workflows list
# CLI url overrides config file
torc --url http://other:8080/torc-service/v1 workflows list
Common Patterns
Development Environment
# ~/.config/torc/config.toml
[client]
api_url = "http://localhost:8080/torc-service/v1"
log_level = "debug"
[client.run]
poll_interval = 2.0
Team Shared Server
# ~/.config/torc/config.toml
[client]
api_url = "http://torc.internal.company.com:8080/torc-service/v1"
username = "developer"
CI/CD Pipeline
#!/bin/bash
export TORC_CLIENT__API_URL="${CI_TORC_SERVER}"
export TORC_CLIENT__FORMAT="json"
torc run workflow.yaml
result=$(torc workflows status $WORKFLOW_ID | jq -r '.status')
HPC Cluster
# Project-local torc.toml
[client]
api_url = "http://login-node:8080/torc-service/v1"
[client.run]
num_cpus = 64
memory_gb = 256.0
num_gpus = 8
output_dir = "/scratch/user/workflow_output"
Troubleshooting
Configuration not applied?
- Check which files are loaded: torc config validate
- View effective config: torc config show
- Verify file permissions and syntax
Environment variable not working?
Use double underscore for nesting: TORC_CLIENT__API_URL (not TORC_CLIENT_API_URL)
Invalid configuration?
Run validation: torc config validate
Authentication
Torc supports HTTP Basic authentication to secure access to your workflow orchestration server. This guide explains how to set up and use authentication.
Overview
Torc's authentication system provides:
- Multi-user support via htpasswd files
- Bcrypt password hashing for secure credential storage
- Backward compatibility - authentication is optional by default
- Flexible deployment - can require authentication or allow mixed access
- CLI and environment variable support for credentials
Server-Side Setup
1. Create User Accounts
Use the torc-htpasswd utility to manage user accounts:
# Add a user (will prompt for password)
torc-htpasswd add --file /path/to/htpasswd username
# Add a user with password on command line
torc-htpasswd add --file /path/to/htpasswd --password mypassword username
# Add a user with custom bcrypt cost (higher = more secure but slower)
torc-htpasswd add --file /path/to/htpasswd --cost 14 username
# Generate a password hash for remote registration (see below)
torc-htpasswd hash username
# List all users
torc-htpasswd list --file /path/to/htpasswd
# Verify a password
torc-htpasswd verify --file /path/to/htpasswd username
# Remove a user
torc-htpasswd remove --file /path/to/htpasswd username
The htpasswd file format is simple:
# Torc htpasswd file
# Format: username:bcrypt_hash
alice:$2b$12$abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOP
bob:$2b$12$zyxwvutsrqponmlkjihgfedcba0987654321ZYXWVUTSRQPONMLK
2. Start Server with Authentication
# Optional authentication (backward compatible mode)
torc-server run --auth-file /path/to/htpasswd
# Required authentication (all requests must authenticate)
torc-server run --auth-file /path/to/htpasswd --require-auth
# With access control enforcement and admin users
torc-server run --auth-file /path/to/htpasswd --require-auth \
--enforce-access-control --admin-user alice --admin-user bob
# Can also use environment variable
export TORC_AUTH_FILE=/path/to/htpasswd
export TORC_ADMIN_USERS=alice,bob
torc-server run
Authentication Modes:
- No --auth-file: Authentication disabled, all requests allowed (default)
- --auth-file only: Authentication optional - authenticated requests are logged, unauthenticated requests allowed
- --auth-file --require-auth: Authentication required - unauthenticated requests are rejected
Access Control:
- --enforce-access-control: Users can only access workflows they own or have group access to
- --admin-user: Adds users to the admin group (can specify multiple times)
3. Server Logs
The server logs authentication events:
INFO torc_server: Loading htpasswd file from: /path/to/htpasswd
INFO torc_server: Loaded 3 users from htpasswd file
INFO torc_server: Authentication is REQUIRED for all requests
...
DEBUG torc::server::auth: User 'alice' authenticated successfully
WARN torc::server::auth: Authentication failed for user 'bob'
WARN torc::server::auth: Authentication required but no credentials provided
Client-Side Usage
Using Command-Line Flags
# Provide credentials via flags
torc --username alice --password mypassword workflows list
# Username via flag, password will be prompted
torc --username alice workflows list
Password: ****
# All commands support authentication
torc --username alice --password mypassword workflows create workflow.yaml
Using Environment Variables
# Set credentials in environment
export TORC_PASSWORD=mypassword
# Run commands without flags
torc workflows list
torc jobs list my-workflow-id
Mixed Approach
# Username from env, password prompted
torc workflows list
Password: ****
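To keep the password out of your shell history and the process list, you can also read it into TORC_PASSWORD interactively before running commands (the same pattern used in the tutorials later in this guide):
# Prompt for the password without echoing it, then export it for subsequent commands
read -s TORC_PASSWORD && export TORC_PASSWORD
torc workflows list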
Security Best Practices
1. Use HTTPS in Production
Basic authentication sends base64-encoded credentials (easily decoded). Always use HTTPS when authentication is enabled:
# Start server with HTTPS
torc-server run --https --auth-file /path/to/htpasswd --require-auth
# Client connects via HTTPS
torc --url https://torc.example.com/torc-service/v1 --username alice workflows list
2. Secure Credential Storage
Do:
- Store htpasswd files with restrictive permissions: chmod 600 /path/to/htpasswd
- Use environment variables for passwords in scripts
- Use password prompting for interactive sessions
- Rotate passwords periodically
Don't:
- Commit htpasswd files to version control
- Share htpasswd files between environments
- Pass passwords as command-line arguments in production (visible in process list)
- Use weak passwords or low bcrypt costs
3. Bcrypt Cost Factor
The cost factor determines password hashing strength:
- Cost 4-8: Fast but weaker (testing only)
- Cost 10-12: Balanced (default: 12)
- Cost 13-15: Strong (production systems)
- Cost 16+: Very strong (high-security environments)
# Use higher cost for production
torc-htpasswd add --file prod_htpasswd --cost 14 alice
4. Audit Logging
Monitor authentication events in server logs:
# Run server with debug logging for auth events
torc-server run --log-level debug --auth-file /path/to/htpasswd
# Or use RUST_LOG for granular control
RUST_LOG=torc::server::auth=debug torc-server run --auth-file /path/to/htpasswd
Common Workflows
Development Environment
# 1. Create test user
torc-htpasswd add --file dev_htpasswd --password devpass developer
# 2. Start server (auth optional)
torc-server run --auth-file dev_htpasswd --database dev.db
# 3. Use client without auth (still works)
torc workflows list
# 4. Or with auth
torc --username developer --password devpass workflows list
Production Deployment
# 1. Create production users with strong passwords and high cost
torc-htpasswd add --file /etc/torc/htpasswd --cost 14 alice
torc-htpasswd add --file /etc/torc/htpasswd --cost 14 bob
# 2. Secure the file
chmod 600 /etc/torc/htpasswd
chown torc-server:torc-server /etc/torc/htpasswd
# 3. Start server with required auth, access control, and HTTPS
torc-server run \
--https \
--auth-file /etc/torc/htpasswd \
--require-auth \
--enforce-access-control \
--admin-user alice \
--database /var/lib/torc/production.db
# 4. Clients must authenticate
torc --url https://torc.example.com/torc-service/v1 --prompt-password workflows list
Password: ****
CI/CD Pipeline
# Store credentials as CI secrets
# TORC_PASSWORD=<secure-password>
# Use in pipeline
export TORC_PASSWORD="${TORC_PASSWORD}"
export TORC_API_URL=https://torc.example.com/torc-service/v1
# Run workflow
WORKFLOW_ID=$(torc workflows create pipeline.yaml --format json | jq -r '.id')
torc workflows start "${WORKFLOW_ID}"
Remote User Registration (HPC Environments)
When users cannot directly access the server (e.g., HPC users connecting to a server they don't have
login access to), use the hash command to generate credentials:
User (on HPC):
# Generate password hash (username defaults to $USER)
torc-htpasswd hash
Password for 'alice':
Hashing password (cost=12)...
alice:$2b$12$xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Send the line above to your server administrator.
# Or specify a different username
torc-htpasswd hash myusername
The hash output (alice:$2b$12$...) can be safely sent to the server administrator via email,
Slack, or any other channel - the bcrypt hash cannot be used to recover the original password.
Administrator (on server):
# Append the user's hash line to the htpasswd file
echo "alice:\$2b\$12\$xxxxx..." >> /etc/torc/htpasswd
# Or manually edit the file and paste the line
vim /etc/torc/htpasswd
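As a quick sanity check, the administrator can confirm the new entry is present using the list subcommand shown earlier:
# Confirm the user now appears in the htpasswd file
torc-htpasswd list --file /etc/torc/htpasswd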
Notes:
- The password is entered on the user's machine and never transmitted in plaintext
- The bcrypt hash is safe to transmit - it can only verify passwords, not recover them
- Users can customize the cost factor with --cost if needed
- For scripting, use the --password flag (though less secure)
Migrating from No Auth to Required Auth
# 1. Start: No authentication
torc-server run --database prod.db
# 2. Add authentication file (optional mode)
torc-server run --auth-file /etc/torc/htpasswd --database prod.db
# 3. Monitor logs, ensure clients are authenticating
# Look for "User 'X' authenticated successfully" messages
# 4. Once all clients authenticate, enable required auth
torc-server run --auth-file /etc/torc/htpasswd --require-auth --database prod.db
Troubleshooting
"Authentication required but no credentials provided"
Cause: Server has --require-auth but client didn't send credentials.
Solution:
# Add username and password
torc --username alice --password mypass workflows list
"Authentication failed for user 'alice'"
Cause: Wrong password or user doesn't exist in htpasswd file.
Solutions:
# 1. Verify user exists
torc-htpasswd list --file /path/to/htpasswd
# 2. Verify password
torc-htpasswd verify --file /path/to/htpasswd alice
# 3. Reset password
torc-htpasswd add --file /path/to/htpasswd alice
"No credentials provided, allowing anonymous access"
Cause: Server has --auth-file but not --require-auth, and client didn't authenticate.
Solution: This is normal in optional auth mode. To require auth:
torc-server run --auth-file /path/to/htpasswd --require-auth
Password Prompting in Non-Interactive Sessions
Problem: Scripts or CI/CD fail waiting for password prompt.
Solutions:
# Use environment variable
export TORC_PASSWORD=mypassword
torc --username alice workflows list
# Or pass as flag (less secure - visible in process list)
torc --username alice --password mypassword workflows list
Advanced Topics
Multiple Environments
Maintain separate htpasswd files per environment:
# Development
torc-htpasswd add --file ~/.torc/dev_htpasswd --password devpass developer
# Staging
torc-htpasswd add --file /etc/torc/staging_htpasswd --cost 12 alice
# Production
torc-htpasswd add --file /etc/torc/prod_htpasswd --cost 14 alice
Programmatic Access
When using Torc's Rust, Python, or Julia clients programmatically:
Rust:
use torc::client::apis::configuration::Configuration;

let mut config = Configuration::new();
config.base_path = "http://localhost:8080/torc-service/v1".to_string();
config.basic_auth = Some(("alice".to_string(), Some("password".to_string())));
Python:
from torc import Configuration, ApiClient
config = Configuration(
host="http://localhost:8080/torc-service/v1",
username="alice",
password="password"
)
Julia:
using Torc
using Base64
import OpenAPI
client = OpenAPI.Clients.Client(
"http://localhost:8080/torc-service/v1";
headers = Dict("Authorization" => "Basic " * base64encode("alice:password"))
)
api = Torc.APIClient.DefaultApi(client)
Load Balancer Considerations
When running multiple Torc servers behind a load balancer:
- Share the same htpasswd file across all servers (via NFS, S3, etc.)
- Or use a configuration management tool to sync htpasswd files
- Monitor for htpasswd file changes and reload if needed
Access Groups
Torc supports team-based access control through access groups. This allows teams to share access to workflows while restricting access from other teams.
Overview
Access groups provide:
- Team-based access control - Share workflows with specific groups
- Role-based membership - Users can be members or admins of groups
- Multiple group membership - Users can belong to multiple groups
- Workflow sharing - Associate workflows with one or more groups
Concepts
Admin Group
The admin group is a special system group that controls who can create and manage access groups. Admin group membership is managed via server configuration, not through the CLI.
- Only members of the admin group can create, delete, or modify access groups
- The admin group is created automatically on server startup
- Admin users are specified via the --admin-user CLI flag or the admin_users config option
- The admin group cannot be deleted or have its membership modified via the API
Access Groups
An access group is a named collection of users who share access to workflows. Groups have:
- Name - A unique identifier for the group
- Description - Optional description of the group's purpose
Memberships
Users are added to groups as members. Each membership has:
- User name - The username being added
- Role - Either "member" or "admin" (for future use)
Workflow Access
Workflows can be associated with multiple groups. When a workflow is associated with a group, all members of that group can access the workflow.
Access Rules
Access to a workflow is granted if any of these conditions are met:
- Ownership - The user created the workflow
- Group membership - The user belongs to a group that has access to the workflow
CLI Commands
Group Management
Note: Creating, deleting, and modifying access groups requires admin access. Only users who are members of the admin group can perform these operations.
# Create a new group (admin only)
torc access-groups create "data-science" --description "Data science team"
# List all groups
torc access-groups list
# Get a specific group
torc access-groups get 1
# Delete a group (admin only)
torc access-groups delete 1
Membership Management
Note: Adding and removing users from groups requires admin access or group admin role.
# Add a user to a group (admin or group admin only)
torc access-groups add-user 1 alice --role member
# List members of a group
torc access-groups list-members 1
# Remove a user from a group (admin or group admin only)
torc access-groups remove-user 1 alice
# List groups a user belongs to
torc access-groups list-user-groups alice
Workflow Access
Note: Adding and removing workflows from groups requires workflow ownership or admin access.
# Add a workflow to a group (owner or admin only)
torc access-groups add-workflow 42 1
# List groups that have access to a workflow
torc access-groups list-workflow-groups 42
# Remove a workflow from a group (owner or admin only)
torc access-groups remove-workflow 42 1
Common Workflows
Setting Up a Team
As an admin user:
# 1. Create the team group (requires admin access)
torc access-groups create "ml-team" --description "Machine learning team"
# Output: Successfully created access group:
# ID: 1
# Name: ml-team
# Description: Machine learning team
# 2. Add team members (requires admin access)
torc access-groups add-user 1 alice
torc access-groups add-user 1 bob
Sharing a Workflow with a Team
# 1. Create a workflow (using any method)
torc workflows create examples/sample_workflow.yaml
# Output: Created workflow 42
# 2. Add the workflow to the team's group
torc access-groups add-workflow 42 1
# Now all members of ml-team (group 1) can access workflow 42
Multi-Team Access
A workflow can be shared with multiple teams:
# Share with data science team (group 1)
torc access-groups add-workflow 42 1
# Also share with DevOps team (group 2)
torc access-groups add-workflow 42 2
# Both teams can now access the workflow
Checking Group Membership
# List all members in a group
torc access-groups list-members 1
# List all groups a user belongs to
torc access-groups list-user-groups alice
# List all groups with access to a workflow
torc access-groups list-workflow-groups 42
JSON Output
All commands support JSON output format for scripting:
# List groups in JSON format
torc access-groups list --format json
# Get group details in JSON
torc access-groups get 1 --format json
Database Schema
Access groups use three tables:
access_group
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| name | TEXT | Unique group name |
| description | TEXT | Optional description |
| is_system | INTEGER | 1 if system group (cannot be deleted), 0 if not |
| created_at | TEXT | Timestamp of creation |
user_group_membership
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| user_name | TEXT | Username of the member |
| group_id | INTEGER | Foreign key to access_group |
| role | TEXT | Role in the group (member/admin) |
| created_at | TEXT | Timestamp of membership creation |
workflow_access_group
| Column | Type | Description |
|---|---|---|
| workflow_id | INTEGER | Foreign key to workflow |
| group_id | INTEGER | Foreign key to access_group |
| created_at | TEXT | Timestamp of association |
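For troubleshooting, these tables can be inspected directly. The following is a minimal sketch, assuming read access to the server's SQLite database file; it lists users who have group-based access to workflow 42 by joining the three tables above:
# Example only: list users with group-based access to workflow 42
sqlite3 /var/lib/torc/torc.db \
  "SELECT DISTINCT m.user_name, g.name AS group_name
   FROM workflow_access_group w
   JOIN access_group g ON g.id = w.group_id
   JOIN user_group_membership m ON m.group_id = w.group_id
   WHERE w.workflow_id = 42;"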
Enabling Access Control Enforcement
By default, access groups are not enforced - all authenticated users can access all workflows. To
enable enforcement, start the server with the --enforce-access-control flag:
torc-server run --enforce-access-control --auth-file /path/to/htpasswd
When enforcement is enabled:
- Users can only access workflows they own or have group access to
- Anonymous access is denied
- API requests to inaccessible workflows return a 403 Forbidden error
- Only admin group members can create and manage access groups
The enforcement setting can also be configured in the torc configuration file:
[server]
enforce_access_control = true
Configuring Admin Users
Admin users have permission to create, delete, and modify access groups. Configure admin users via:
CLI Flag
torc-server run --admin-user alice --admin-user bob --enforce-access-control
Environment Variable
export TORC_ADMIN_USERS="alice,bob"
torc-server run --enforce-access-control
Configuration File
[server]
admin_users = ["alice", "bob"]
enforce_access_control = true
On server startup, the admin group is automatically created or updated to include the configured users. The admin group is a system group that cannot be deleted or modified via the API.
Future Enhancements
- Group admin role - Users with the "admin" role in a group can manage that group's membership
Tutorial: Team-Based Access Control with Access Groups
This tutorial walks you through setting up team-based access control so that workflows can be shared within teams while remaining private from other users.
Learning Objectives
By the end of this tutorial, you will:
- Understand how Torc's access control system works
- Set up authentication with htpasswd files
- Create access groups and add team members
- Share workflows with specific teams
- Enable access control enforcement on the server
Prerequisites
- Torc server and CLI installed
- Basic familiarity with the command line
- Administrative access to start/restart the server
Scenario
You're setting up Torc for an organization with two teams:
- ML Team: Alice and Bob work on machine learning workflows
- Data Team: Carol and Dave work on data processing workflows
Each team should only be able to see and manage their own workflows, but some workflows may need to be shared between teams.
Step 1: Create an htpasswd File
First, create an htpasswd file with user credentials. Torc uses bcrypt-hashed passwords for security.
# Create the htpasswd directory
mkdir -p /etc/torc
# Add users using torc-htpasswd utility
torc-htpasswd add --file /etc/torc/htpasswd alice
# Enter password when prompted
torc-htpasswd add --file /etc/torc/htpasswd bob
torc-htpasswd add --file /etc/torc/htpasswd carol
torc-htpasswd add --file /etc/torc/htpasswd dave
Verify the file was created:
cat /etc/torc/htpasswd
Expected output (hashes will differ):
alice:$2b$12$LQv3c1yqBWVHxkd0LHAkCOYz6TtxMQJqhN82lfIptSLnyJqRZaJ/K
bob:$2b$12$...
carol:$2b$12$...
dave:$2b$12$...
Step 2: Start the Server with Authentication and Access Control
Start the server with authentication required, access control enforcement enabled, and Alice as an admin user:
torc-server run \
--database /var/lib/torc/torc.db \
--auth-file /etc/torc/htpasswd \
--require-auth \
--enforce-access-control \
--admin-user alice
You should see:
INFO Starting torc-server version=0.8.0 (abc1234)
INFO Loaded 4 users from htpasswd file
INFO Authentication is REQUIRED for all requests
INFO Access control is ENABLED - users can only access their own workflows and workflows shared via access groups
INFO Admin users configured: ["alice"]
INFO Listening on localhost:8080
Note: The --admin-user flag specifies users who can create and manage access groups. Only
admin users can create, delete, or modify groups.
Step 3: Configure CLI Authentication
Set up credentials for each user. In a new terminal:
# Set the API URL
export TORC_API_URL="http://localhost:8080/torc-service/v1"
# Set credentials for Alice
read -s TORC_PASSWORD && export TORC_PASSWORD
# It will wait for your password input without displaying it.
# Verify connection
torc ping
Expected output:
{ "status": "ok" }
Step 4: Create Access Groups
As Alice (who is an admin user), create the two team groups:
# Create the ML team group (requires admin access)
torc access-groups create "ml-team" --description "Machine Learning Team"
Output:
Successfully created access group:
ID: 1
Name: ml-team
Description: Machine Learning Team
# Create the Data team group
torc access-groups create "data-team" --description "Data Processing Team"
Output:
Successfully created access group:
ID: 2
Name: data-team
Description: Data Processing Team
List the groups to verify:
torc access-groups list
Output:
╭────┬────────────┬─────────────────────────╮
│ ID │ Name │ Description │
├────┼────────────┼─────────────────────────┤
│ 1 │ ml-team │ Machine Learning Team │
│ 2 │ data-team │ Data Processing Team │
╰────┴────────────┴─────────────────────────╯
Step 5: Add Team Members
Add users to their respective teams:
# Add Alice and Bob to the ML team
torc access-groups add-user 1 alice
torc access-groups add-user 1 bob
# Add Carol and Dave to the Data team
torc access-groups add-user 2 carol
torc access-groups add-user 2 dave
Verify team membership:
# List ML team members
torc access-groups list-members 1
Output:
╭───────────┬────────╮
│ User Name │ Role │
├───────────┼────────┤
│ alice │ member │
│ bob │ member │
╰───────────┴────────╯
# Check which groups Alice belongs to
torc access-groups list-user-groups alice
Output:
╭────┬─────────┬───────────────────────╮
│ ID │ Name │ Description │
├────┼─────────┼───────────────────────┤
│ 1 │ ml-team │ Machine Learning Team │
╰────┴─────────┴───────────────────────╯
Step 6: Create Workflows as Different Users
Now let's create workflows and see how access control works.
As Alice (ML Team)
export TORC_PASSWORD="alice_password"
# Create a workflow
cat > /tmp/ml_training.yaml << 'EOF'
name: ml_training_workflow
description: Train a machine learning model
jobs:
  - name: train_model
    command: echo "Training model..."
    resource_requirements: small
resource_requirements:
  - name: small
    num_cpus: 1
    memory: 1g
    runtime: PT10M
EOF
WORKFLOW_ID=$(torc workflows create /tmp/ml_training.yaml -f json | jq -r '.id')
echo "Alice created workflow: $WORKFLOW_ID"
As Carol (Data Team)
export TORC_PASSWORD="carol_password"
# Create a different workflow
cat > /tmp/data_pipeline.yaml << 'EOF'
name: data_pipeline
description: Process incoming data
jobs:
  - name: process_data
    command: echo "Processing data..."
    resource_requirements: small
resource_requirements:
  - name: small
    num_cpus: 1
    memory: 1g
    runtime: PT10M
EOF
torc workflows create /tmp/data_pipeline.yaml
Step 7: Observe Access Control in Action
Carol Cannot Access Alice's Workflow
Still as Carol, try to access Alice's workflow:
# Try to get Alice's workflow (assuming ID 1)
torc workflows get 1
Output:
{
"error": "Forbidden",
"message": "User 'carol' does not have access to workflow 1"
}
Carol Can Only See Her Own Workflows
torc workflows list
Output:
╭────┬───────────────┬─────────────────────────┬───────╮
│ ID │ Name │ Description │ User │
├────┼───────────────┼─────────────────────────┼───────┤
│ 2 │ data_pipeline │ Process incoming data │ carol │
╰────┴───────────────┴─────────────────────────┴───────╯
Carol only sees her own workflow, not Alice's.
Step 8: Share a Workflow with Another Team
Sometimes workflows need to be shared between teams. Alice can share her workflow with the Data team.
As Alice, Share the Workflow
export TORC_PASSWORD="alice_password"
# Share workflow 1 with the data team (group 2)
torc access-groups add-workflow 1 2
echo "Shared workflow 1 with data-team"
Verify the Sharing
# List groups that have access to workflow 1
torc access-groups list-workflow-groups 1
Output:
╭────┬────────────┬─────────────────────────╮
│ ID │ Name │ Description │
├────┼────────────┼─────────────────────────┤
│ 2 │ data-team │ Data Processing Team │
╰────┴────────────┴─────────────────────────╯
Carol Can Now Access the Shared Workflow
export TORC_PASSWORD="carol_password"
# Now Carol can access the workflow
torc workflows get 1
Output:
╭────────────────────────────────────────┬────────────────────────────╮
│ Field │ Value │
├────────────────────────────────────────┼────────────────────────────┤
│ ID │ 1 │
│ Name │ ml_training_workflow │
│ User │ alice │
│ Description │ Train a machine learning │
│ │ model │
╰────────────────────────────────────────┴────────────────────────────╯
Carol can now see and interact with Alice's workflow because she's a member of the data-team, which has been granted access.
Step 9: Revoke Access
If you need to remove access:
export TORC_PASSWORD="alice_password"
# Remove the data team's access to workflow 1
torc access-groups remove-workflow 1 2
echo "Revoked data-team access to workflow 1"
Now Carol can no longer access the workflow.
Access Control Summary
Here's how access is determined:
Can user access workflow?
├── Is user the workflow owner? → YES → ALLOWED
├── Is user in a group with access to this workflow? → YES → ALLOWED
└── Otherwise → DENIED
Access Rules
- Ownership: Users always have access to workflows they created
- Group Membership: Users have access to workflows shared with any group they belong to
- No Inheritance: Access is explicit—being in one group doesn't grant access to another group's workflows
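You can verify these rules for a specific user or workflow with the same commands used earlier in this tutorial:
# Which groups does the user belong to?
torc access-groups list-user-groups carol
# Which groups have access to the workflow?
torc access-groups list-workflow-groups 1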
Configuration Reference
Server Flags
| Flag | Description |
|---|---|
| --auth-file | Path to htpasswd file |
| --require-auth | Require authentication for all requests |
| --enforce-access-control | Enable access control enforcement |
| --admin-user | Add user to admin group (can be repeated) |
Configuration File
You can also configure these in config.toml:
[server]
auth_file = "/etc/torc/htpasswd"
require_auth = true
enforce_access_control = true
admin_users = ["alice", "bob"]
Troubleshooting
"Anonymous access not allowed"
This error appears when:
- No credentials are provided
- --require-auth is enabled
Solution: Set the TORC_PASSWORD environment variable.
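For example, using the same pattern as in Step 3:
# Prompt for the password without echoing it, then export it
read -s TORC_PASSWORD && export TORC_PASSWORD
torc workflows list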
"User is not a system administrator"
This error appears when trying to create, delete, or modify access groups without admin privileges.
Solution: Either:
- Add the user to the admin group in the server configuration using --admin-user or admin_users in config.toml
- Use an account that is already an admin
"User does not have access to workflow"
This error appears when:
- The user is not the workflow owner
- The user is not in any group with access to the workflow
- --enforce-access-control is enabled
Solution: Either the workflow owner needs to share it with a group the user belongs to, or add the user to an appropriate group.
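For example (IDs and usernames are placeholders):
# Option 1: the owner shares the workflow with a group the user belongs to
torc access-groups add-workflow <workflow_id> <group_id>
# Option 2: an admin adds the user to a group that already has access
torc access-groups add-user <group_id> <username> --role member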
Authentication Working but Access Control Not Enforced
Check that --enforce-access-control flag is set when starting the server.
What You Learned
In this tutorial, you learned:
- How to create an htpasswd file with user credentials
- How to start the server with authentication and access control
- How to create and manage access groups
- How to add users to groups
- How to share workflows with teams
- How access control decisions are made
Next Steps
- Learn about Configuration Files to set up persistent configuration
- Explore Server Deployment for production setups
- See the Access Groups Reference for all available commands
Security Reference
This document describes Torc's security features, threat model, and best practices for secure deployments.
Authentication & Authorization
HTTP Basic Authentication
Torc uses HTTP Basic authentication with bcrypt password hashing.
Security Properties:
- ✅ Industry-standard authentication method
- ✅ Bcrypt hashing with configurable work factor (cost 4-31)
- ✅ No plaintext password storage
- ✅ Salt automatically generated per password
- ⚠️ Credentials sent base64-encoded (requires HTTPS)
Architecture:
Client Request
↓
[Authorization: Basic base64(username:password)]
↓
Server Middleware
↓
Extract credentials → Verify against htpasswd file (bcrypt)
↓
Success: Add username to request context
Failure: Return None authorization (logged)
↓
API Handler (receives authorization context)
Authentication Modes
| Mode | Configuration | Behavior |
|---|---|---|
| Disabled | No --auth-file | All requests allowed, no authentication |
| Optional | --auth-file only | Valid credentials logged, invalid/missing allowed |
| Required | --auth-file --require-auth | Invalid/missing credentials rejected |
Recommendation: Use Required mode in production.
Transport Security
HTTPS/TLS
When to use HTTPS:
- ✅ Always when authentication is enabled
- ✅ When transmitting sensitive workflow data
- ✅ Over untrusted networks (internet, shared networks)
- ✅ Compliance requirements (PCI-DSS, HIPAA, etc.)
Configuration:
# Server
torc-server run --https --auth-file /etc/torc/htpasswd
# Client
torc --url https://torc.example.com/torc-service/v1 workflows list
TLS Version: Torc uses the system's OpenSSL/native-tls library. Ensure:
- TLS 1.2 minimum (TLS 1.3 preferred)
- Strong cipher suites enabled
- Valid certificates from trusted CA
Network Security
Deployment Patterns:
Pattern 1: Internal Network Only
[Torc Clients] ←→ [Torc Server]
(Trusted internal network)
- May use HTTP if network is truly isolated
- Still recommend HTTPS for defense in depth
Pattern 2: Load Balancer with TLS Termination
[Torc Clients] ←HTTPS→ [Load Balancer] ←HTTP→ [Torc Server]
(Internet) (Internal trusted network)
- TLS terminates at load balancer
- Internal traffic may use HTTP
- Ensure load balancer validates certificates
Pattern 3: End-to-End TLS
[Torc Clients] ←HTTPS→ [Torc Server]
(Internet or untrusted network)
- Most secure pattern
- TLS all the way to Torc server
- Required for compliance scenarios
Internal Network Quick Setup
For internal, trusted networks where the primary goal is preventing accidental access rather than defending against malicious attacks, use this optimized configuration:
# 1. Create htpasswd with lower cost factor (faster auth)
torc-htpasswd add --cost 8 /etc/torc/htpasswd alice
torc-htpasswd add --cost 8 /etc/torc/htpasswd bob
# 2. Run server with auth and access control
torc-server run \
--auth-file /etc/torc/htpasswd \
--require-auth \
--enforce-access-control \
--credential-cache-ttl-secs 60
This provides:
- User isolation (users only see their own workflows)
- Fast CLI response (~10ms first auth, <1ms cached)
- Protection against accidental modifications
- Simple username/password authentication
For higher security requirements (internet-facing, compliance, sensitive data), use the default cost factor (12) and HTTPS.
Credential Management
Password Requirements
Recommendations:
- Minimum 12 characters
- Mix of uppercase, lowercase, numbers, symbols
- No dictionary words or common patterns
- Unique per user and environment
Bcrypt Cost Factor:
| Cost | Approx Time | Use Case |
|---|---|---|
| 4 | ~1ms | Testing only |
| 8 | ~10ms | Internal trusted networks |
| 10 | ~50ms | Low-security internal |
| 12 | ~250ms | Default, general use, internet-facing |
| 14 | ~1s | High security, sensitive data |
| 15+ | > 2s | Maximum security, infrequent logins |
Cost Selection Criteria:
- Higher cost = more CPU, slower login
- Balance security vs. user experience
- Consider attack surface (internet-facing vs. internal)
- For internal networks with trusted users, cost 8 provides reasonable security with fast response
Creating passwords with a specific cost:
# Default cost (12) - good for most deployments
torc-htpasswd add /etc/torc/htpasswd alice
# Lower cost (8) - faster, suitable for internal networks
torc-htpasswd add --cost 8 /etc/torc/htpasswd alice
# Higher cost (14) - slower, for high-security environments
torc-htpasswd add --cost 14 /etc/torc/htpasswd alice
Credential Caching
To improve CLI responsiveness, the server caches successful authentications for a configurable duration (default: 60 seconds). This avoids repeated bcrypt verification for the same credentials.
Configuration:
# CLI option
torc-server run --credential-cache-ttl-secs 60
# Environment variable
export TORC_CREDENTIAL_CACHE_TTL_SECS=60
# Config file (torc.toml)
[server]
credential_cache_ttl_secs = 60
Performance impact:
| Scenario | Response Time |
|---|---|
| First auth (cost 12) | ~250ms |
| First auth (cost 8) | ~10ms |
| Cached auth (any cost) | < 1ms |
Security notes:
- Passwords are never stored in plaintext in the cache
- Cache keys are SHA-256 hashes of credentials
- Cache entries auto-expire after TTL
- Failed authentications are never cached
- Set --credential-cache-ttl-secs 0 to disable caching (see the example below)
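For deployments where even short-lived caching is undesirable, the cache can be switched off entirely using the flags documented above:
# Disable credential caching; every request performs a full bcrypt verification
torc-server run --credential-cache-ttl-secs 0 --auth-file /etc/torc/htpasswd --require-auth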
Htpasswd File Security
File Permissions:
# Restrict to server process owner only
chmod 600 /etc/torc/htpasswd
chown torc-server:torc-server /etc/torc/htpasswd
Storage Best Practices:
- ❌ Never commit to version control
- ❌ Never share between environments
- ✅ Store in secure configuration management (Ansible Vault, HashiCorp Vault)
- ✅ Backup with encryption
- ✅ Rotate regularly (quarterly recommended)
File Format Security:
# Comments allowed
username:$2b$12$hash...
- Only bcrypt hashes accepted ($2a$, $2b$, or $2y$)
- No plaintext passwords
- No MD5, SHA-1, or weak hashes
Client Credential Storage
Best Practices:
| Method | Security | Use Case |
|---|---|---|
| Environment variables | ⭐⭐⭐ | Scripts, automation, CI/CD |
| Password prompt | ⭐⭐⭐⭐⭐ | Interactive sessions |
| Config files | ⭐ | Not recommended |
| Command-line args | ⚠️ | Visible in process list, avoid |
Examples:
# Good: Environment variables
read -s TORC_PASSWORD && export TORC_PASSWORD
torc workflows list
# Good: Password prompt
torc --username alice workflows list
Password: ****
# Acceptable: CI/CD with secrets
TORC_PASSWORD=${{ secrets.TORC_PASSWORD }} torc workflows create
# Bad: Command-line argument (visible in `ps`)
torc --password mypassword workflows list
Threat Model
Threats Mitigated
| Threat | Mitigation | Effectiveness |
|---|---|---|
| Unauthorized API access | Required authentication | ✅ High |
| Credential stuffing | Bcrypt work factor, rate limiting | ✅ Medium-High |
| Password cracking | Bcrypt (cost ≥12) | ✅ High |
| Man-in-the-middle | HTTPS/TLS | ✅ High |
| Credential theft (database) | No plaintext storage, bcrypt | ✅ High |
Threats Not Mitigated
| Threat | Impact | Recommendation |
|---|---|---|
| DDoS attacks | High | Use rate limiting, firewalls, CDN |
| SQL injection | Medium | Use parameterized queries (Torc does) |
| Insider threats | High | Audit logging, least privilege |
| Compromised client | High | Network segmentation, monitoring |
| Side-channel attacks | Low | Constant-time operations (bcrypt does) |
Attack Scenarios
Scenario 1: Compromised htpasswd file
Impact: Attacker has password hashes
Risk: Medium - Bcrypt makes cracking difficult
Mitigation:
- Immediately revoke all user accounts
- Generate new htpasswd file with fresh passwords
- Investigate how file was compromised
- Increase bcrypt cost if needed
Scenario 2: Leaked credentials in logs
Impact: Credentials in plaintext in logs
Risk: High
Prevention:
- Never log passwords
- Sanitize logs before sharing
- Restrict log access
Response:
- Rotate affected credentials immediately
- Audit all log access
- Review code for password logging
Scenario 3: Network eavesdropping (HTTP)
Impact: Credentials intercepted in transit
Risk: Critical over untrusted networks
Prevention:
- Always use HTTPS when authentication is enabled
- Especially critical for internet-facing deployments
Response:
- Enable HTTPS immediately
- Rotate all credentials (assume compromised)
- Review access logs for suspicious activity
Audit & Monitoring
Authentication Events
Server logs authentication events:
# Successful authentication
DEBUG torc::server::auth: User 'alice' authenticated successfully
# Failed authentication (wrong password)
WARN torc::server::auth: Authentication failed for user 'alice'
# Missing credentials when required
WARN torc::server::auth: Authentication required but no credentials provided
# No authentication configured
DEBUG torc::server::auth: No authentication configured, allowing request
Recommended Monitoring
Metrics to track:
- Failed authentication attempts (per user, total)
- Successful authentications (per user)
- Requests without credentials (when auth enabled)
- Unusual access patterns (time, volume, endpoints)
Alerting thresholds:
- 5+ failed attempts from same user in 5 minutes
- 100+ failed attempts total in 1 hour
- Authentication from unexpected IP ranges
- Access during unusual hours (if applicable)
Log aggregation:
# Collect auth events
grep "torc::server::auth" /var/log/torc-server.log
# Count failed attempts per user
grep "Authentication failed" /var/log/torc-server.log | \
awk '{print $(NF)}' | sort | uniq -c
# Monitor in real-time
tail -f /var/log/torc-server.log | grep "WARN.*auth"
Compliance Considerations
GDPR / Privacy
User data in htpasswd:
- Usernames may be personal data (email addresses)
- Password hashes are not personal data (irreversible)
Recommendations:
- Allow users to request account deletion
- Don't use email addresses as usernames (use aliases)
- Document data retention policies
PCI-DSS / SOC2
Requirements that apply:
- Transport encryption: Use HTTPS
- Access control: Enable required authentication
- Password complexity: Enforce strong passwords
- Audit logging: Enable and monitor auth logs
- Regular reviews: Audit user accounts quarterly
Configuration:
# PCI-DSS compliant setup
torc-server run \
--https \
--auth-file /etc/torc/htpasswd \
--require-auth \
--log-level info
Security Checklist
Server Deployment
- HTTPS enabled in production
- Strong TLS configuration (TLS 1.2+, strong ciphers)
- Valid certificate from trusted CA
- Required authentication enabled (--require-auth)
- Htpasswd file permissions: chmod 600
- Htpasswd file owned by server process user
- Bcrypt cost ≥ 12 (≥14 for high-security)
- Strong passwords enforced
- Audit logging enabled
- Log rotation configured
- Firewall rules limit access
- Server runs as non-root user
- Regular security updates applied
Client Usage
- HTTPS URLs used when auth enabled
- Credentials stored in environment variables (not command-line)
- Passwords not logged
- Passwords not committed to version control
- Password prompting used for interactive sessions
- CI/CD secrets used for automation
- Regular password rotation
Operational
- User accounts reviewed quarterly
- Inactive accounts disabled/removed
- Failed login attempts monitored
- Access logs reviewed regularly
- Incident response plan documented
- Backup htpasswd files encrypted
- Disaster recovery tested
Access Control
Admin Group
Torc uses a special admin group to control who can manage access groups. Key properties:
- Config-driven: Admin users are configured via the --admin-user flag or the admin_users config option
- Auto-created: The admin group is created/synced automatically on server startup
- System group: Cannot be deleted or have membership modified via the API
- Required for management: Only admin members can create, delete, or modify access groups
Configuration:
# Via CLI
torc-server run --admin-user alice --admin-user bob --enforce-access-control
# Via config file
[server]
admin_users = ["alice", "bob"]
enforce_access_control = true
Access Group Permissions
| Operation | Required Permission |
|---|---|
| Create group | System admin |
| Delete group | System admin (non-system groups only) |
| Add user to group | System admin or group admin |
| Remove user from group | System admin or group admin |
| Add workflow to group | Workflow owner or group admin |
| Remove workflow from group | Workflow owner or group admin |
| List groups | Any authenticated user |
| Get group details | Any authenticated user |
Future Enhancements
Planned security features:
- Token-based authentication: JWT/OAuth2 support
- API keys: Long-lived tokens for automation
- LDAP/Active Directory integration: Enterprise SSO
- Rate limiting: Prevent brute force attacks
- 2FA/MFA support: Multi-factor authentication
- Session management: Token expiration, refresh
- Audit trail: Detailed access logging
Resources
- OWASP Authentication Cheat Sheet
- bcrypt Wikipedia
- HTTP Basic Authentication RFC 7617
- NIST Password Guidelines
Tools & Integrations
Additional tools and third-party integrations.
- Workflow Wizard - Interactive workflow creation
- Configuration Files Tutorial - Working with config files
- Dashboard Deployment - Deploying the web dashboard
- Configuring AI Assistants - Setting up AI integration
- AI-Assisted Workflow Management - Using AI for workflow management
- Map Python Functions Across Workers - Python integration
- Filtering CLI Output with Nushell - Advanced CLI usage
- Shell Completions - Tab completion setup
Creating Workflows with the Dashboard Wizard
This tutorial walks you through creating a workflow using the interactive wizard in the Torc dashboard. The wizard provides a guided, step-by-step interface for building workflows without writing YAML or JSON files.
Learning Objectives
By the end of this tutorial, you will:
- Create a multi-job workflow using the dashboard wizard
- Define job dependencies visually
- Configure Slurm schedulers for HPC execution
- Set up workflow actions to automatically schedule nodes
- Understand how the wizard generates workflow specifications
Prerequisites
- Torc dashboard running (see Dashboard Deployment)
- Basic understanding of Torc workflows
Overview
The workflow wizard guides you through five steps:
- Basics - Workflow name and description
- Jobs - Define computational tasks
- Schedulers - Configure Slurm schedulers (optional)
- Actions - Set up automatic node scheduling (optional)
- Review - Preview and create the workflow
Step 1: Open the Create Workflow Modal
- Open the Torc dashboard in your browser
- Click the Create Workflow button in the top-right corner
- Select the Wizard tab at the top of the modal
You'll see the wizard interface with step indicators showing your progress.
Step 2: Configure Basics
Enter the basic workflow information:
- Workflow Name (required): A unique identifier for your workflow (e.g., data-pipeline)
- Description (optional): A brief description of what the workflow does
Click Next to proceed.
Step 3: Add Jobs
This is where you define the computational tasks in your workflow.
Adding Your First Job
- Click + Add Job
- Fill in the job details:
- Job Name: A unique name (e.g., preprocess)
- Command: The shell command to execute (e.g., python preprocess.py)
Setting Dependencies
The Blocked By field lets you specify which jobs must complete before this job can run:
- Click the Blocked By dropdown
- Select one or more jobs that must complete first
- Hold Ctrl/Cmd to select multiple jobs
Configuring Resources
Choose a resource preset or customize:
- Small: 1 CPU, 1GB memory
- Medium: 8 CPUs, 50GB memory
- GPU: 1 CPU, 10GB memory, 1 GPU
- Custom: Specify exact requirements
Example: Three-Job Pipeline
Let's create a simple pipeline:
Job 1: preprocess
- Name: preprocess
- Command: echo "Preprocessing..." && sleep 5
- Blocked By: (none - this runs first)
- Resources: Small
Job 2: analyze
- Name: analyze
- Command: echo "Analyzing..." && sleep 10
- Blocked By: preprocess
- Resources: Medium
Job 3: report
- Name: report
- Command: echo "Generating report..." && sleep 3
- Blocked By: analyze
- Resources: Small
Click Next when all jobs are configured.
Step 4: Configure Schedulers (Optional)
If you're running on an HPC system with Slurm, you can define scheduler configurations here. Skip this step for local execution.
Adding a Scheduler
- Click + Add Scheduler
- Fill in the required fields:
- Scheduler Name: A reference name (e.g., compute_scheduler)
- Account: Your Slurm account name
- Configure optional settings:
- Nodes: Number of nodes to request
- Wall Time: Maximum runtime (HH:MM:SS format)
- Partition: Slurm partition name
- QoS: Quality of service level
- GRES: GPU resources (e.g., gpu:2)
- Memory: Memory per node (e.g., 64G)
- Temp Storage: Local scratch space
- Extra Slurm Options: Additional sbatch flags
Example: Basic Compute Scheduler
- Scheduler Name: compute
- Account: my_project
- Nodes: 1
- Wall Time: 02:00:00
- Partition: standard
Assigning Jobs to Schedulers
After defining schedulers, you can assign jobs to them:
- Go back to the Jobs step (click Back)
- In each job card, find the Scheduler dropdown
- Select the scheduler to use for that job
Jobs without a scheduler assigned will run locally.
Click Next when scheduler configuration is complete.
Step 5: Configure Actions (Optional)
Actions automatically schedule Slurm nodes when certain events occur. This is useful for dynamic resource allocation.
Trigger Types
- When workflow starts: Schedule nodes immediately when the workflow begins
- When jobs become ready: Schedule nodes when specific jobs are ready to run
- When jobs complete: Schedule nodes after specific jobs finish
Adding an Action
- Click + Add Action
- Select the Trigger type
- Select the Scheduler to use
- For job-based triggers, select which Jobs trigger the action
- Set the Number of Allocations (how many Slurm jobs to submit)
Example: Stage-Based Scheduling
For a workflow with setup, compute, and finalize stages:
Action 1: Setup Stage
- Trigger: When workflow starts
- Scheduler: setup_scheduler
- Allocations: 1
Action 2: Compute Stage
- Trigger: When jobs become ready
- Jobs: compute_job1, compute_job2, compute_job3
- Scheduler: compute_scheduler
- Allocations: 3
Action 3: Finalize Stage
- Trigger: When jobs become ready
- Jobs: finalize
- Scheduler: finalize_scheduler
- Allocations: 1
Click Next to proceed to review.
Step 6: Review and Create
The review step shows the generated workflow specification in JSON format. This is exactly what will be submitted to the server.
Reviewing the Spec
Examine the generated specification to verify:
- All jobs are included with correct names and commands
- Dependencies (depends_on) match your intended workflow structure
- Resource requirements are correctly assigned
- Schedulers have the right configuration
- Actions trigger on the expected events
Creating the Workflow
- Review the Options below the wizard:
- Initialize workflow after creation: Builds the dependency graph (recommended)
- Run workflow immediately: Starts execution right away
- Click Create to submit the workflow
If successful, you'll see a success notification and the workflow will appear in your workflow list.
Example: Complete Diamond Workflow
Here's how to create a diamond-pattern workflow using the wizard:
preprocess
/ \
work1 work2
\ /
postprocess
Jobs Configuration
| Job | Command | Blocked By | Resources |
|---|---|---|---|
| preprocess | ./preprocess.sh | (none) | Small |
| work1 | ./work1.sh | preprocess | Medium |
| work2 | ./work2.sh | preprocess | Medium |
| postprocess | ./postprocess.sh | work1, work2 | Small |
Generated Spec Preview
The wizard generates a spec like this:
{
"name": "diamond-workflow",
"description": "Fan-out and fan-in example",
"jobs": [
{
"name": "preprocess",
"command": "./preprocess.sh",
"resource_requirements": "res_1cpu_1g"
},
{
"name": "work1",
"command": "./work1.sh",
"depends_on": ["preprocess"],
"resource_requirements": "res_8cpu_50g"
},
{
"name": "work2",
"command": "./work2.sh",
"depends_on": ["preprocess"],
"resource_requirements": "res_8cpu_50g"
},
{
"name": "postprocess",
"command": "./postprocess.sh",
"depends_on": ["work1", "work2"],
"resource_requirements": "res_1cpu_1g"
}
],
"resource_requirements": [
{"name": "res_1cpu_1g", "num_cpus": 1, "memory": "1g", "num_gpus": 0, "num_nodes": 1, "runtime": "PT1H"},
{"name": "res_8cpu_50g", "num_cpus": 8, "memory": "50g", "num_gpus": 0, "num_nodes": 1, "runtime": "PT1H"}
]
}
Using Parameterized Jobs
The wizard supports job parameterization for creating multiple similar jobs:
- In a job card, find the Parameters field
- Enter parameters in the format:
param_name: "value_spec"
Parameter Formats
- Range: i: "1:10" creates jobs for i=1,2,3,...,10
- Range with step: i: "0:100:10" creates jobs for i=0,10,20,...,100
- List: dataset: "['train', 'test', 'validation']"
Example: Parameterized Processing
- Job Name: process_{i}
- Command: python process.py --index {i}
- Parameters: i: "1:5"
This creates 5 jobs: process_1 through process_5.
Tips and Best Practices
Job Naming
- Use descriptive, unique names
- Avoid spaces and special characters
- For parameterized jobs, include the parameter in the name (e.g., job_{i})
Dependencies
- Keep dependency chains as short as possible
- Use the fan-out/fan-in pattern for parallelism
- Avoid circular dependencies (the server will reject them)
Schedulers
- Create separate schedulers for different resource needs
- Use descriptive names that indicate the scheduler's purpose
- Set realistic wall times to avoid queue priority penalties
Actions
- Use on_workflow_start for initial resource allocation
- Use on_jobs_ready for just-in-time scheduling
- Match allocations to the number of parallel jobs
What You Learned
In this tutorial, you learned:
- How to navigate the five-step workflow wizard
- How to create jobs with commands, dependencies, and resources
- How to configure Slurm schedulers for HPC execution
- How to set up actions for automatic node scheduling
- How the wizard generates workflow specifications
Next Steps
- Diamond Workflow - Learn about file-based implicit dependencies
- Simple Parameterization - Create parameter sweeps programmatically
- Working with Slurm - Advanced Slurm configuration
Configuration Files Tutorial
This tutorial walks you through setting up Torc configuration files to customize your workflows without specifying options on every command.
What You'll Learn
- How to create a configuration file
- Configuration file locations and priority
- Using environment variables for configuration
- Common configuration patterns
Prerequisites
- Torc CLI installed
- Basic familiarity with TOML format
Step 1: Check Current Configuration
First, let's see what configuration Torc is using:
torc config paths
Output:
Configuration file paths (in priority order):
System: /etc/torc/config.toml (not found)
User: ~/.config/torc/config.toml (not found)
Local: torc.toml (not found)
Environment variables (highest priority):
Use double underscore (__) to separate nested keys:
TORC_CLIENT__API_URL, TORC_CLIENT__FORMAT, TORC_SERVER__PORT, etc.
No configuration files found. Run 'torc config init --user' to create one.
View the effective configuration (defaults):
torc config show
Step 2: Create a User Configuration File
Create a configuration file in your home directory that applies to all your Torc usage:
torc config init --user
This creates ~/.config/torc/config.toml (Linux/macOS) or the equivalent on your platform.
Step 3: Edit the Configuration
Open the configuration file in your editor:
# Linux/macOS
$EDITOR ~/.config/torc/config.toml
# Or find the path
torc config paths
Here's a typical user configuration:
[client]
# Connect to your team's Torc server
api_url = "http://torc-server.internal:8080/torc-service/v1"
# Default to JSON output for scripting
format = "json"
# Enable debug logging
log_level = "debug"
# Username for authentication
username = "alice"
[client.run]
# Default poll interval for local runs
poll_interval = 10.0
# Default output directory
output_dir = "workflow_output"
# Resource limits for local execution
num_cpus = 8
memory_gb = 32.0
num_gpus = 1
Step 4: Validate Your Configuration
After editing, validate the configuration:
torc config validate
Output:
Validating configuration...
Loading configuration from:
- /home/alice/.config/torc/config.toml
Configuration is valid.
Key settings:
client.api_url = http://torc-server.internal:8080/torc-service/v1
client.format = json
server.port = 8080
dash.port = 8090
Step 5: Create a Project-Local Configuration
For project-specific settings, create a torc.toml in your project directory:
cd ~/myproject
torc config init --local
Edit torc.toml:
[client]
# Project-specific server (overrides user config)
api_url = "http://localhost:8080/torc-service/v1"
[client.run]
# Project-specific output directory
output_dir = "results"
# This project needs more memory
memory_gb = 64.0
Step 6: Understanding Priority
Configuration sources are loaded in this order (later sources override earlier):
- Built-in defaults (lowest priority)
- System config (/etc/torc/config.toml)
- User config (~/.config/torc/config.toml)
- Project-local config (./torc.toml)
- Environment variables (TORC_*)
- CLI arguments (highest priority)
Example: If you have api_url set in your user config but run:
torc --url http://other-server:8080/torc-service/v1 workflows list
The CLI argument takes precedence.
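To see which value wins, you can check the effective configuration after setting an override (the staging URL below is just a placeholder):
# The environment variable overrides the config files for this shell session
export TORC_CLIENT__API_URL="http://staging:8080/torc-service/v1"
torc config show | grep api_url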
Step 7: Using Environment Variables
Environment variables are useful for CI/CD pipelines and temporary overrides.
Use double underscore (__) to separate nested keys:
# Override client.api_url
export TORC_CLIENT__API_URL="http://ci-server:8080/torc-service/v1"
# Override client.format
export TORC_CLIENT__FORMAT="json"
# Override server.port
export TORC_SERVER__PORT="9999"
# Verify
torc config show | grep api_url
Step 8: Server Configuration
If you're running torc-server, you can configure it too:
[server]
# Bind to all interfaces
url = "0.0.0.0"
port = 8080
# Use 4 worker threads
threads = 4
# Database location
database = "/var/lib/torc/torc.db"
# Authentication
auth_file = "/etc/torc/htpasswd"
require_auth = true
# Background job processing interval
completion_check_interval_secs = 30.0
# Log level
log_level = "info"
[server.logging]
# Enable file logging
log_dir = "/var/log/torc"
json_logs = true
Step 9: Dashboard Configuration
Configure torc-dash:
[dash]
# Bind address
host = "0.0.0.0"
port = 8090
# API server to connect to
api_url = "http://localhost:8080/torc-service/v1"
# Standalone mode settings
standalone = false
Common Configuration Patterns
Development Setup
# ~/.config/torc/config.toml
[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table"
log_level = "debug"
[client.run]
poll_interval = 2.0
output_dir = "output"
Production Server
# /etc/torc/config.toml
[server]
url = "0.0.0.0"
port = 8080
threads = 8
database = "/var/lib/torc/production.db"
require_auth = true
auth_file = "/etc/torc/htpasswd"
completion_check_interval_secs = 30.0
log_level = "info"
[server.logging]
log_dir = "/var/log/torc"
json_logs = true
CI/CD Pipeline
# In CI script
export TORC_CLIENT__API_URL="${CI_TORC_SERVER_URL}"
export TORC_CLIENT__FORMAT="json"
torc run workflow.yaml
Troubleshooting
Configuration Not Loading
Check which files are being loaded:
torc config validate
Environment Variables Not Working
Remember to use double underscore (__) for nesting:
# Correct
TORC_CLIENT__API_URL=http://...
# Wrong (single underscore)
TORC_CLIENT_API_URL=http://...
View Effective Configuration
See the merged result of all configuration sources:
torc config show
Next Steps
- See the Configuration Reference for all available options
- Learn about Server Deployment for production setups
- Set up Authentication for secure access
Dashboard Deployment Tutorial
This tutorial covers three common deployment scenarios for the Torc web dashboard (torc-dash).
Each scenario addresses different environments and use cases.
Prefer the terminal? If you work primarily in SSH sessions or terminal environments, consider using the Terminal UI (TUI) instead. The TUI provides the same workflow and job management capabilities without requiring a web browser or SSH tunnels.
Overview of Deployment Scenarios
| Scenario | Environment | Use Case |
|---|---|---|
| 1. Standalone | Local computer | Single-computer workflows, development, testing |
| 2. All-in-One Login Node | HPC login node | Small HPC workflows (< 100 jobs) |
| 3. Shared Server | HPC login node + dedicated server | Large-scale multi-user HPC workflows |
Prerequisites
Before starting, ensure you have:
- Built Torc binaries (see Installation): cargo build --release --workspace
- Added binaries to PATH: export PATH="$PATH:/path/to/torc/target/release"
- Initialized the database (if not using standalone mode): sqlx database setup
Scenario 1: Local Development (Standalone Mode)
Best for: Single-computer workflows on your laptop or workstation. Also ideal for development, testing, and learning Torc.
This is the simplest setup - everything runs on one machine with a single command. Use this when you want to run workflows entirely on your local computer without HPC resources.
Architecture
flowchart TB
subgraph computer["Your Computer"]
browser["Browser"]
dash["torc-dash<br/>(web UI)"]
server["torc-server<br/>(managed)"]
cli["torc CLI"]
db[("SQLite DB")]
browser --> dash
dash -->|"HTTP API"| server
dash -->|"executes"| cli
cli -->|"HTTP API"| server
server --> db
end
Setup
Step 1: Start the dashboard in standalone mode
torc-dash --standalone
This single command:
- Automatically starts torc-server on a free port
- Starts the dashboard on http://127.0.0.1:8090
- Configures the dashboard to connect to the managed server
Step 2: Open your browser
Navigate to http://localhost:8090
Step 3: Create and run a workflow
- Click Create Workflow
- Upload a workflow specification file (YAML, JSON, or KDL)
- Click Create
- Click Initialize on the new workflow
- Click Run Locally to execute
Configuration Options
# Custom dashboard port
torc-dash --standalone --port 8080
# Specify database location
torc-dash --standalone --database /path/to/my.db
# Faster job completion detection
torc-dash --standalone --completion-check-interval-secs 2
# Specify binary paths (if not in PATH)
torc-dash --standalone \
--torc-bin /path/to/torc \
--torc-server-bin /path/to/torc-server
Stopping
Press Ctrl+C in the terminal. This stops both the dashboard and the managed server.
Scenario 2: All-in-One Login Node
Best for: Small HPC workflows (fewer than 100 jobs) where you want the complete Torc stack running on the login node, with jobs submitted to Slurm.
This is the simplest HPC setup - everything runs on the login node. It's ideal for individual users running small HPC workflows without needing a dedicated server infrastructure.
Important: Login nodes are shared resources. The torc-dash and torc-server applications consume minimal resources when workflows are small (e.g., less than 100 jobs). If you run these applications on bigger workflows, especially with faster job completion interval checks, you may impact other users.
Architecture
flowchart TB
subgraph local["Your Local Machine"]
browser["Browser"]
end
subgraph login["Login Node"]
dash["torc-dash<br/>(port 8090)"]
server["torc-server<br/>(port 8080)"]
cli["torc CLI"]
db[("SQLite DB")]
slurm["sbatch/squeue"]
dash -->|"HTTP API"| server
dash -->|"executes"| cli
cli -->|"HTTP API"| server
server --> db
cli --> slurm
end
subgraph compute["Compute Nodes (Slurm)"]
runner1["torc-slurm-job-runner<br/>(job 1)"]
runner2["torc-slurm-job-runner<br/>(job 2)"]
runnerN["torc-slurm-job-runner<br/>(job N)"]
runner1 -->|"HTTP API"| server
runner2 -->|"HTTP API"| server
runnerN -->|"HTTP API"| server
end
browser -->|"SSH tunnel"| dash
slurm --> compute
Setup
Step 1: Start torc-server on the login node
# Start server
torc-server run \
--port 8080 \
--database $SCRATCH/torc.db \
--completion-check-interval-secs 60
Or as a background process:
nohup torc-server run \
--port 8080 \
--database $SCRATCH/torc.db \
> $SCRATCH/torc-server.log 2>&1 &
Step 2: Start torc-dash on the same login node
# Set API URL to local server
export TORC_API_URL="http://localhost:8080/torc-service/v1"
# Start dashboard
torc-dash --port 8090
Or in the background:
nohup torc-dash --port 8090 > $SCRATCH/torc-dash.log 2>&1 &
Step 3: Access via SSH tunnel
From your local machine:
ssh -L 8090:localhost:8090 user@login-node
Important: Use `localhost` in the tunnel command, not the login node's hostname. This works because torc-dash binds to 127.0.0.1 by default.
Open http://localhost:8090 in your browser.
Submitting to Slurm
Via Dashboard:
- Create a workflow with Slurm scheduler configuration
- Click Initialize
- Click Submit (not "Run Locally")
Via CLI:
export TORC_API_URL="http://localhost:8080/torc-service/v1"
# Create workflow with Slurm actions
torc workflows create my_slurm_workflow.yaml
# Submit to Slurm
torc submit <workflow_id>
Monitoring Slurm Jobs
The dashboard shows job status updates as Slurm jobs progress:
- Go to Details tab
- Select Jobs
- Enable Auto-refresh
- Watch status change from `pending` → `running` → `completed`
You can also monitor via:
- Events tab for state transitions
- Debugging tab for job logs after completion
Scenario 3: Shared Server on HPC
Best for: Large-scale multi-user HPC environments where a central torc-server runs
persistently on a dedicated server, and multiple users access it via torc-dash from login nodes.
This is the most scalable setup, suitable for production deployments with many concurrent users and large workflows.
Architecture
flowchart TB
subgraph local["Your Local Machine"]
browser["Browser"]
end
subgraph login["Login Node"]
dash["torc-dash<br/>(port 8090)"]
cli["torc CLI"]
dash -->|"executes"| cli
end
subgraph shared["Shared Server"]
server["torc-server<br/>(port 8080)"]
db[("SQLite DB")]
server --> db
end
browser -->|"SSH tunnel"| dash
dash -->|"HTTP API"| server
cli -->|"HTTP API"| server
Setup
Step 1: Start torc-server on the shared server
On the shared server (e.g., a dedicated service node):
# Start server with production settings
torc-server run \
--port 8080 \
--database /shared/storage/torc.db \
--completion-check-interval-secs 60
For production, consider running as a systemd service:
torc-server service install --user \
--port 8080 \
--database /shared/storage/torc.db
Step 2: Start torc-dash on a login node
SSH to the login node and start the dashboard:
# Connect to the shared server
export TORC_API_URL="http://shared-server:8080/torc-service/v1"
# Start dashboard (accessible only from login node by default)
torc-dash --port 8090
Step 3: Access the dashboard via SSH tunnel
From your local machine, create an SSH tunnel:
ssh -L 8090:localhost:8090 user@login-node
Important: Use `localhost` in the tunnel command, not the login node's hostname. The tunnel forwards your local port to `localhost:8090` as seen from the login node, which matches where torc-dash binds (127.0.0.1:8090).
Then open http://localhost:8090 in your local browser.
Using the CLI
Users can also interact with the shared server via CLI:
# Set the API URL
export TORC_API_URL="http://shared-server:8080/torc-service/v1"
# Create and run workflows
torc workflows create my_workflow.yaml
torc workflows run <workflow_id>
Authentication
For multi-user environments, enable authentication:
# Create htpasswd file with users
torc-htpasswd create /path/to/htpasswd
torc-htpasswd add /path/to/htpasswd alice
torc-htpasswd add /path/to/htpasswd bob
# Start server with authentication
torc-server run \
--port 8080 \
--auth-file /path/to/htpasswd \
--require-auth
See Authentication for details.
Comparison Summary
| Feature | Standalone | All-in-One Login Node | Shared Server |
|---|---|---|---|
| Setup complexity | Low | Medium | Medium-High |
| Multi-user support | No | Single user | Yes |
| Slurm integration | No | Yes | Yes |
| Database location | Local | Login node | Shared storage |
| Persistence | Session only | Depends on setup | Persistent |
| Best for | Single-computer workflows | Small HPC workflows (< 100 jobs) | Large-scale production |
Troubleshooting
Cannot connect to server
# Check if server is running
curl http://localhost:8080/torc-service/v1/workflows
# Check server logs
tail -f torc-server.log
SSH tunnel not working
# Verify tunnel is established
lsof -i :8090
# Check for port conflicts
netstat -tuln | grep 8090
Slurm jobs not starting
# Check Slurm queue
squeue --me
# Check Slurm job logs
cat output/slurm_output_*.e
Dashboard shows "Disconnected"
- Verify API URL in Configuration tab
- Check network connectivity to server
- Ensure server is running and accessible
Next Steps
- Web Dashboard Guide - Complete feature reference
- Working with Slurm - Detailed Slurm configuration
- Server Deployment - Production server setup
- Authentication - Securing your deployment
Configuring AI Assistants
Complete guide for configuring AI assistants (Claude Code, GitHub Copilot) to work with Torc.
Overview
Torc provides an MCP (Model Context Protocol) server that enables AI assistants to interact with
workflows. The torc-mcp-server binary acts as a bridge between AI assistants and the Torc HTTP
API.
Available Tools
The AI assistant has access to these Torc operations:
| Tool | Description |
|---|---|
| `get_workflow_status` | Get workflow info with job counts by status |
| `get_job_details` | Get detailed job info including resource requirements |
| `get_job_logs` | Read stdout/stderr from job log files |
| `list_failed_jobs` | List all failed jobs in a workflow |
| `list_jobs_by_status` | Filter jobs by status |
| `check_resource_utilization` | Analyze resource usage and detect OOM/timeout issues |
| `update_job_resources` | Modify job resource requirements |
| `restart_jobs` | Reset and restart failed jobs |
| `resubmit_workflow` | Regenerate Slurm schedulers and submit new allocations |
| `cancel_jobs` | Cancel specific jobs |
| `create_workflow_from_spec` | Create a workflow from JSON specification |
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `TORC_API_URL` | Torc server URL | http://localhost:8080/torc-service/v1 |
| `TORC_OUTPUT_DIR` | Directory containing job logs | output |
| `TORC_PASSWORD` | Password for authentication (optional) | — |
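Before configuring an assistant, it can help to confirm that the values you plan to pass actually work. The short Python sketch below is illustrative only (not part of Torc); it reads the same environment variables and queries the workflows endpoint used later in the troubleshooting section:

```python
# Quick sanity check of the environment the MCP server will use (illustrative only).
import os
import urllib.request

api_url = os.environ.get("TORC_API_URL", "http://localhost:8080/torc-service/v1")
output_dir = os.environ.get("TORC_OUTPUT_DIR", "output")

# The /workflows endpoint is the same one used in the troubleshooting examples.
with urllib.request.urlopen(f"{api_url}/workflows") as resp:
    print(f"Torc server reachable at {api_url} (HTTP {resp.status})")
print(f"Job logs expected under: {os.path.abspath(output_dir)}")
```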
Claude Code Configuration
Configuration Scopes
Claude Code supports MCP configuration at three scopes:
| Scope | File | Use Case |
|---|---|---|
| Project | .mcp.json in project root | Team-shared configuration (commit to git) |
| Local | .mcp.json with --scope local | Personal project settings (gitignored) |
| User | ~/.claude.json | Cross-project personal tools |
CLI Commands
# Add the Torc MCP server
claude mcp add torc \
--scope project \
-e TORC_API_URL=http://localhost:8080/torc-service/v1 \
-e TORC_OUTPUT_DIR=/path/to/your/output \
-- /path/to/torc-mcp-server
# List configured MCP servers
claude mcp list
# Get details about the torc server
claude mcp get torc
# Remove the MCP server
claude mcp remove torc
Manual Configuration
Create or edit .mcp.json in your project root:
{
"mcpServers": {
"torc": {
"command": "/path/to/torc-mcp-server",
"env": {
"TORC_API_URL": "http://localhost:8080/torc-service/v1",
"TORC_OUTPUT_DIR": "/path/to/your/output"
}
}
}
}
Environment Variable Expansion
You can use environment variable expansion in .mcp.json:
{
"mcpServers": {
"torc": {
"command": "/path/to/torc-mcp-server",
"env": {
"TORC_API_URL": "${TORC_API_URL:-http://localhost:8080/torc-service/v1}",
"TORC_OUTPUT_DIR": "${TORC_OUTPUT_DIR:-./output}"
}
}
}
}
VS Code + GitHub Copilot Configuration
Prerequisites
- VS Code 1.99 or later
- GitHub Copilot extension installed
- GitHub Copilot subscription (Business, Enterprise, Pro, or Pro+)
Configuration
Create .vscode/mcp.json in your project root:
{
"servers": {
"torc": {
"command": "/path/to/torc-mcp-server",
"env": {
"TORC_API_URL": "http://localhost:8080/torc-service/v1",
"TORC_OUTPUT_DIR": "./output"
}
}
}
}
Verify Setup
- Open the Command Palette (`Ctrl+Shift+P` / `Cmd+Shift+P`)
- Run "MCP: List Servers"
- Verify "torc" appears in the list
Usage
In Copilot Chat, use Agent Mode (@workspace or the agent icon) to access MCP tools.
VS Code Remote SSH for HPC
For users running Torc on HPC clusters, VS Code's Remote SSH extension allows you to use Copilot Chat with the MCP server running directly on the cluster.
Architecture
┌─────────────────────┐ ┌─────────────────────────────────────┐
│ Local Machine │ SSH │ HPC Cluster │
│ │◄───────►│ │
│ VS Code │ │ torc-mcp-server ◄──► torc-server │
│ (Copilot Chat) │ │ ▲ │
│ │ │ │ │
└─────────────────────┘ │ .vscode/mcp.json │
└─────────────────────────────────────┘
The MCP server runs on the HPC, communicates with the Torc server on the HPC, and VS Code proxies requests through SSH. No ports need to be exposed to your local machine.
Step 1: Build torc-mcp-server on the HPC
# On the HPC (via SSH or login node)
cd /path/to/torc
cargo build --release -p torc-mcp-server
Step 2: Configure MCP in your project
Create .vscode/mcp.json in your project directory on the HPC:
{
"servers": {
"torc": {
"command": "/path/on/hpc/torc/target/release/torc-mcp-server",
"env": {
"TORC_API_URL": "http://localhost:8080/torc-service/v1",
"TORC_OUTPUT_DIR": "./output"
}
}
}
}
Important: MCP servers configured in workspace settings (`.vscode/mcp.json`) run on the remote host, not your local machine.
Step 3: Connect and use
- Install the Remote - SSH extension
- Connect to the HPC: `Remote-SSH: Connect to Host...`
- Open your project folder on the HPC
- Open Copilot Chat and use Agent Mode
HPC-Specific Tips
- Module systems: If your HPC uses modules, you may need to set `PATH` in the env to include required dependencies
- Shared filesystems: Place `.vscode/mcp.json` in a project directory on a shared filesystem accessible from compute nodes
- Firewalls: The MCP server only needs to reach the Torc server on the HPC's internal network
How It Works
Torc uses the Model Context Protocol (MCP), an open standard for connecting AI assistants to
external tools. The torc-mcp-server binary:
- Receives tool calls from the AI assistant via stdio
- Translates them to Torc HTTP API calls
- Returns results in a format the assistant can understand
The server is stateless—it simply proxies requests to your running Torc server. All workflow state remains in Torc's database.
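To make the bridging idea concrete, here is a deliberately simplified Python sketch. It is not the real MCP protocol (MCP uses JSON-RPC framing and tool schemas), and the `list_workflows` tool name and one-JSON-object-per-line framing are invented for illustration; the HTTP endpoint is the workflows listing used elsewhere in this guide:

```python
# Minimal "stdio in, HTTP out" bridge sketch. Real MCP servers speak JSON-RPC
# with a declared tool schema; this only illustrates the translation step.
import json
import os
import sys
import urllib.request

API_URL = os.environ.get("TORC_API_URL", "http://localhost:8080/torc-service/v1")

def list_workflows() -> object:
    """Translate a tool call into a Torc HTTP API request."""
    with urllib.request.urlopen(f"{API_URL}/workflows") as resp:
        return json.load(resp)

# One JSON object per line on stdin, e.g. {"tool": "list_workflows"}
for line in sys.stdin:
    call = json.loads(line)
    if call.get("tool") == "list_workflows":
        result = list_workflows()
    else:
        result = {"error": f"unknown tool: {call.get('tool')}"}
    print(json.dumps(result), flush=True)
```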
Security Considerations
- The MCP server has full access to your Torc server
- Consider using authentication if your Torc server is exposed
- The server can modify workflows (restart, cancel, update resources)
- Review proposed actions before they execute
Troubleshooting
Claude doesn't see the tools
- Verify the MCP server is configured: `claude mcp list`
- Check the config file is valid JSON: `cat .mcp.json | jq .`
- Check that the path to `torc-mcp-server` is correct and the binary exists
- Start a new Claude Code session (MCP servers are loaded at startup)
"Failed to connect to server"
- Ensure your Torc server is running
- Check that `TORC_API_URL` is correct
- Verify network connectivity
"Permission denied" or "Authentication failed"
- Set `TORC_PASSWORD` if your server requires auth
- Check that the credentials are correct
Logs not found
- Ensure `TORC_OUTPUT_DIR` points to your job output directory
- Check that jobs have actually run (logs are created at runtime)
See Also
Tutorial: AI-Assisted Workflow Management
This tutorial shows how to use AI assistants to manage Torc workflows using natural language.
What You'll Learn
- Set up an AI assistant to work with Torc
- Create and manage workflows through conversation
- Debug failures by asking questions
Prerequisites
- Torc installed and server running
- Claude Code or VS Code with GitHub Copilot
What Can AI Assistants Do?
With Torc's AI integration, you can manage workflows using natural language:
| Task | Example |
|---|---|
| Create workflows | "Create a workflow with 10 parallel jobs" |
| Check status | "What's the status of workflow 42?" |
| Debug failures | "Why did a job in workflow 5 fail?" |
| Fix problems | "Restart the failed jobs with doubled memory" |
| Investigate | "Check if any jobs exceeded their memory limits" |
Quick Setup: Claude Code
Step 1: Add the Torc MCP server to your project:
claude mcp add torc \
--scope project \
-e TORC_API_URL=http://localhost:8080/torc-service/v1 \
-e TORC_OUTPUT_DIR=./output \
-- /path/to/torc-mcp-server
Step 2: Verify it's configured:
claude mcp list
Step 3: Start a new Claude Code session and try:
"What workflows exist?"
Quick Setup: VS Code + Copilot
Step 1: Create .vscode/mcp.json in your project:
{
"servers": {
"torc": {
"command": "/path/to/torc-mcp-server",
"env": {
"TORC_API_URL": "http://localhost:8080/torc-service/v1",
"TORC_OUTPUT_DIR": "./output"
}
}
}
}
Step 2: In Copilot Chat, use Agent Mode and try:
"What workflows exist?"
Example: Creating a Workflow
> Create a workflow with 3 stages: a preprocess script, 10 work jobs that each take an index, and a postprocess job. The work jobs each require
10 CPUs and 20 GB of memory. The other jobs have trivial resource requirements.
⏺ I'll create a workflow spec file with the 3-stage fan-out/fan-in pattern you described.
⏺ Created the workflow spec at ./workflow.json with:
Structure:
- Stage 1: preprocess - runs first, outputs preprocess_output.txt
- Stage 2: work_0 through work_9 - 10 parallel jobs, each reads preprocess output
- Stage 3: postprocess - waits for all work jobs to complete (fan-in pattern)
Resource Requirements:
| Name | CPUs | Memory | Runtime |
|-------|------|--------|---------|
| small | 1 | 1g | 1 hour |
| large | 10 | 20g | 2 hours |
Next steps:
1. Edit the spec to replace placeholder commands with your actual scripts
2. Run locally: torc run workflow.json
3. Or submit to Slurm: torc submit workflow.json (requires adding an account)
You can create much more complex workflows. Output is not shown, but this prompt will create a workflow with job dependencies defined implicitly through job-file relationships:
Create a workflow with four stages of jobs. There are 3 initial jobs each dependent on an input file. Each of those jobs fans out into 10 work jobs through
output files. Each of those jobs creates a file. In the third stage there is a script that aggregates each of the 10 jobs into one file. The final
postprocess job aggregates those three files into one. I will run it with Slurm: account=my_account. The work jobs require 10 cpus and 30 GB of
memory. The other jobs have trivial resource requirements.
Example: Adjusting jobs' resource requirements
Suppose that jobs fail because they use too much CPU or memory.
> Check the resource utilization for workflow 20
⏺ torc - check_resource_utilization (MCP)(workflow_id: 20)
⎿ {
"over_utilization_count": 1,
"run_id": null,
… +17 lines (ctrl+o to expand)
⏺ Resource Utilization for Workflow 20
Summary: 3 job results analyzed, 1 resource violation found
CPU Over-Utilization
| Job ID | Job Name | Specified | Peak Used | Over By |
|--------|---------------|----------------|-----------|---------|
| 264 | cpu_heavy_job | 3 cores (300%) | 602.5% | +100.8% |
The cpu_heavy_job used approximately 6 CPU cores at peak, but was only allocated 3 cores.
Would you like me to update the resource requirements for this job? I can increase the CPU allocation to 7 cores to provide headroom for future runs.
> yes
⏺ torc - update_job_resources (MCP)(job_id: 264, num_cpus: 7)
⎿ {
"job_id": 264,
"resource_requirements_id": 45,
… +8 lines (ctrl+o to expand)
⏺ Done. Updated cpu_heavy_job (job 264) resources:
| Setting | Value |
|---------|------------|
| CPUs | 7 (was 3) |
| Memory | 1g |
| Runtime | 15 minutes |
The job is now configured with enough CPU capacity for future runs.
Example: Recovering a failed workflow that timed out
The workflow in tests/workflows/timeout_auto_recovery_test is used to test torc's auto-recovery of
walltime timeouts. In this example, the workflow was left in a failed state.
Here is a conversation in Claude Code after running the workflow:
> analyze the logs for workflow 6
Workflow 6 Log Analysis Summary
Workflow: timeout_auto_recovery_test - Test workflow for automatic timeout recovery
Job Results
| Job | Name | Status | Runtime | Return Code |
|-----|------------|---------------|---------|--------------|
| 31 | (fast job) | ✅ Completed | 1.0 min | 0 |
| 32 | job_slow | ⛔ Terminated | 5.9 min | -9 (SIGKILL) |
Terminated Job Details (job_slow)
The job was killed by signal 9 (SIGKILL) after running for ~6 minutes, which matches the expected behavior for a timeout test.
stdout log shows the job ran through minute 6 before being killed:
Slow job starting at Sat Dec 27 15:40:59 MST 2025
This job will run for 10 minutes (exceeds 5 min runtime and 8 min walltime)
It should be killed by Slurm and recovered by the watcher
Minute 1 of 10...
Minute 2 of 10...
...
Minute 6 of 10...
stderr log: Empty (no errors written)
Conclusion
The timeout auto-recovery test is working as expected:
- The slow job exceeded its 5-minute runtime limit
- It was killed by the system (SIGKILL, return code -9) at ~6 minutes
- Peak memory usage was ~4.1 MB (no memory issues)
Example: Debugging a failed workflow with over-consumption of memory
The workflow in tests/workflows/oom_auto_recovery_test is used to test torc's diagnosis of memory
over-consumption errors.
Here is a conversation in Claude Code after running the workflow:
Analyze the logs for workflow 16
● torc - analyze_workflow_logs (MCP)(workflow_id: 16) ⎿ { "error_count": 0, "errors_by_type": [], … +7 lines (ctrl+o to expand)
● torc - get_workflow_status (MCP)(workflow_id: 16) ⎿ { "description": "Test workflow with an OOM", "job_counts_by_status": { … +8 lines (ctrl+o to expand)
● The workflow has 3 failed jobs out of 10 total, but the log analyzer didn't find error patterns. Let me dig deeper:
● torc - list_failed_jobs (MCP)(workflow_id: 16) ⎿ { "failed_job_count": 3, "failed_jobs": [ … +18 lines (ctrl+o to expand)
● torc - check_resource_utilization (MCP)(workflow_id: 16, include_failed: true) ⎿ { "failed_jobs": [ { … +137 lines (ctrl+o to expand)
● Workflow 16 Analysis Summary
Workflow: oom_test - "Test workflow with an OOM" Status: 7 completed, 3 failed (10 total jobs)
Root Cause: Out of Memory (OOM)
All 3 failed jobs were killed due to OOM (return code 137):
| Job ID | Name | Peak Memory | Configured | Over-Utilization |
|---|---|---|---|---|
| 103 | work_1 | N/A (killed early) | 10g | SIGKILL 137 |
| 109 | work_7 | 29.91 GB | 10g | +199.1% |
| 110 | work_8 | 26.89 GB | 10g | +168.9% |
Critical Warning: More Failures Expected
9 jobs exceeded their 10 GB memory limit - including 6 jobs that happened to complete but used ~30 GB each:
- work_2 through work_6, work_9, work_10 all peaked at ~30.01 GB (+200.1% over limit)
- These completed due to available system memory but are also misconfigured
Recommendation
- Increase memory allocation to 32g for all work jobs
- Use automated recovery - I can run a dry-run of the recovery tool to preview fixes
Would you like me to run recover_workflow with dry_run=true to preview the recovery actions?
yes
● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true) ⎿ { "dry_run": true, "memory_multiplier": 1.5, … +9 lines (ctrl+o to expand)
● Recovery Preview (Dry Run)
Proposed Actions:
- 3 failed jobs would have memory increased (1.5x multiplier: 10g → 15g)
- Failed jobs would be reset to ready status
- Slurm schedulers would be regenerated
⚠️ Important Consideration
The default 1.5x memory multiplier would increase memory from 10g → 15g, but the jobs actually peaked at ~30 GB. This means:
- 15g may still cause OOM failures
- A 3.5x multiplier (10g → 35g) would be safer based on actual usage
Would you like me to proceed with:
- Default recovery (1.5x = 15g) - may require another recovery cycle
- Custom recovery with a higher memory multiplier (e.g., 3.5x = 35g)
2
● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true, memory_multiplier: 3.5) ⎿ { "dry_run": true, "memory_multiplier": 3.5, … +9 lines (ctrl+o to expand)
● Recovery Preview (3.5x Memory Multiplier)
Proposed Actions:
- 3 failed jobs would have memory increased: 10g → 35g
- Failed jobs would be reset to ready status
- Slurm schedulers would be regenerated
This should safely cover the ~30 GB peak memory usage observed.
Would you like me to proceed with these recovery actions?
yes
● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: false, memory_multiplier: 3.5) ⎿ { "dry_run": false, "memory_multiplier": 3.5, … +7 lines (ctrl+o to expand)
● Recovery Complete ✓
Actions taken:
- 3 jobs had memory increased from 10g → 35g
- Failed jobs reset to ready status
- Slurm schedulers regenerated and submitted
● torc - get_workflow_status (MCP)(workflow_id: 16) ⎿ { "description": "Test workflow with an OOM", "job_counts_by_status": { … +8 lines (ctrl+o to expand)
● Current Status:
| Status | Count |
|---|---|
| Completed | 7 |
| Ready | 3 |
The 3 previously failed jobs (work_1, work_7, work_8) are now ready to run with 35g memory. Slurm allocations have been submitted and will execute these jobs automatically.
What You Learned
- How to configure AI assistants with Torc (one command for Claude Code, one file for VS Code)
- How to create workflows through conversation
- How to debug and fix failures using natural language
Next Steps
- Configuring AI Assistants — Full configuration options, all tools, troubleshooting
- Automatic Failure Recovery — Use `torc watch` for automated recovery
- Configuration Files — Set up Torc configuration
Map a Python function to compute nodes
This tutorial will teach you how to build a workflow from Python functions instead of CLI executables and run it on an HPC with Slurm.
Pre-requisites
This tutorial requires installation of the python package torc-client. Until the latest version is
published at pypi.org, you must clone this repository and install the package in a virtual environment.
Use Python 3.11 or later.
git clone https://github.com/NREL/torc
cd torc/python_client
python -m venv .venv
source .venv/bin/activate
pip install -e .
Workflow Description
Let's suppose that your code is in a module called simulation.py and looks something like this:
def run(job_name: str, input_params: dict) -> dict:
"""Runs one simulation on a set of input parameters.
Parameters
----------
job_name : str
    Name of the job.
input_params : dict
    Input parameters for the simulation.

Returns
-------
dict
    Result of the simulation.
"""
return {
"inputs": input_params,
"result": 5,
"output_data_path": f"/projects/my-project/{job_name}",
}
def postprocess(results: list[dict]) -> dict:
"""Collects the results of the workers and performs post-processing.
Parameters
----------
results : list[dict]
Results from each simulation
Returns
-------
dict
Final result
"""
total = 0
paths = []
for result in results:
assert "result" in result
assert "output_data_path" in result
total += result["result"]
paths.append(result["output_data_path"])
return {"total": total, "output_data_paths": paths}
You need to run this function on hundreds of sets of input parameters and want torc to help you scale this work on an HPC.
The recommended approach is to use torc's Python API, as shown below. The goal is to mimic the behavior of Python's concurrent.futures.ProcessPoolExecutor.map as closely as possible.
Similar functionality is also available with Dask.
Resource Constraints
- Each function call needs 4 CPUs and 20 GiB of memory.
- The function call takes 1 hour to run.
Compute nodes with 92 GiB of memory are easiest to acquire but would only be able to run 4 jobs at a time. The 180 GiB nodes are fewer in number but would use fewer AUs because they can run 8 jobs at a time.
Torc Overview
Here is what torc does to solve this problem:
- User creates a workflow in Python.
- User passes a callable function as well as a list of all input parameters that need to be mapped to the function.
- For each set of input parameters torc creates a record in the `user_data` table in the database, creates a job with a relationship to that record as an input, and creates a placeholder for data to be created by that job.
- When the function completes, torc stores any returned data in the database.
- When all workers complete torc collects all result data from the database into a list and passes that to the postprocess function. It also stores any returned data from that function into the database.
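Conceptually, the "import the user's function and call it" step looks like the sketch below. This is not torc's actual runner code, and exactly how torc passes the job name and stored parameters to your function is internal to torc; the sketch simply follows the signatures used in this tutorial and assumes simulation.py is importable:

```python
# Conceptual sketch of the map step: import the user's module dynamically and
# call the target function with parameters read from the database.
import importlib

def call_mapped_function(module_name: str, func_name: str, job_name: str, params: dict):
    module = importlib.import_module(module_name)  # e.g. simulation.py
    func = getattr(module, func_name)              # e.g. the run() function above
    return func(job_name, params)                  # returned data goes back to the database

result = call_mapped_function(
    "simulation", "run", "work_1", {"input1": 1, "input2": 2, "input3": 3}
)
print(result)
```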
Build the workflow
- Write a script to create the workflow. Note that you need to correct the API URL and the Slurm `account`.
import getpass
import os
from torc import make_api, map_function_to_jobs, setup_logging
from torc.openapi_client import (
DefaultApi,
ResourceRequirementsModel,
SlurmSchedulerModel,
WorkflowModel,
)
TORC_API_URL = os.getenv("TORC_API_URL", "http://localhost:8080/torc-service/v1")
def create_workflow(api: DefaultApi) -> WorkflowModel:
"""Create the workflow"""
workflow = WorkflowModel(
user=getpass.getuser(),
name="map_function_workflow",
description="Example workflow that maps a function across workers",
)
return api.create_workflow(workflow)
def build_workflow(api: DefaultApi, workflow: WorkflowModel):
"""Creates a workflow with implicit job dependencies declared through files."""
workflow_id = workflow.id
assert workflow_id is not None
params = [
{"input1": 1, "input2": 2, "input3": 3},
{"input1": 4, "input2": 5, "input3": 6},
{"input1": 7, "input2": 8, "input3": 9},
]
assert workflow.id is not None
rr = api.create_resource_requirements(
ResourceRequirementsModel(
workflow_id=workflow_id,
name="medium",
num_cpus=4,
memory="20g",
runtime="P0DT1H",
),
)
api.create_slurm_scheduler(
SlurmSchedulerModel(
workflow_id=workflow_id,
name="short",
account="my_account",
mem="180224",
walltime="04:00:00",
nodes=1,
),
)
jobs = map_function_to_jobs(
api,
workflow_id,
"simulation",
"run",
params,
resource_requirements_id=rr.id,
# Note that this is optional.
postprocess_func="postprocess",
)
print(f"Created workflow with ID {workflow_id} {len(jobs)} jobs.")
def main():
setup_logging()
api = make_api(TORC_API_URL)
workflow = create_workflow(api)
try:
build_workflow(api, workflow)
except Exception:
api.delete_workflow(workflow.id)
raise
if __name__ == "__main__":
main()
Requirements:
- Your run function should raise an exception if there is a failure. If that happens, torc will record a non-zero return code for the job.
- If you want torc to store result data in the database, return it from your run function. Note: be careful about how much result data you return. If you are using a custom database for one workflow, store as much as you want. If you are using a shared server, ensure that you are following its administrator's policies. You could consider storing large data in files and only storing file paths in the database.
- If you choose to define a postprocess function and want torc to store the final data in the database, return it from that function.
- The `params` must be serializable in JSON format because they will be stored in the database. Basic types like numbers and strings, and lists and dictionaries of those, will work fine. If you need to store complex, custom types, consider these options:
  - Define data models with Pydantic. You can use their existing serialization/de-serialization methods or define custom methods (see the sketch after this list).
  - Pickle your data and store the result as a string. Your run function would need to understand how to de-serialize it. Note that this has portability limitations. (Please contact the developers if you would like to see this happen automatically.)
- Torc must be able to import simulation.py from Python. Here are some options:
  - Put the script in the current directory.
  - Install it in the environment.
  - Specify its parent directory like this: `map_function_to_jobs(..., module_directory="my_module")`
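A minimal sketch of the Pydantic option, assuming Pydantic v2 and an invented `SimulationParams` model; `map_function_to_jobs` and its arguments are the ones from the script above:

```python
# Hypothetical Pydantic model describing one simulation's inputs.
from pydantic import BaseModel

class SimulationParams(BaseModel):
    input1: int
    input2: int
    input3: int

models = [
    SimulationParams(input1=1, input2=2, input3=3),
    SimulationParams(input1=4, input2=5, input3=6),
]

# model_dump(mode="json") produces plain, JSON-serializable dicts, which is
# what torc stores in the user_data table.
params = [m.model_dump(mode="json") for m in models]

# Pass `params` to map_function_to_jobs exactly as in the script above.
# Inside run(), rebuild the typed object with SimulationParams(**input_params).
```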
- Create the workflow.
python examples/python/map_function_across_workers.py
Created workflow 342 with 4 jobs.
- Run the workflow.
$ torc run 342
- View the result data overall or by job (if your run and postprocess functions return something). Note that listing all user-data will return input parameters.
$ torc -f json user-data list 342
Other jobs
You could add "normal" jobs to the workflow as well. For example, you might have preprocessing and
post-processing work to do. You can add those jobs through the API. You could also add multiple
rounds of mapped functions. map_function_to_jobs provides a depends_on_job_ids parameter to
specify ordering. You could also define job-job relationships through files or user-data as
discussed elsewhere in this documentation.
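For example, continuing the `build_workflow` function above, a hypothetical second round of mapped work could be ordered after the first round. In this sketch, `refine` and `more_params` are invented names, and it is assumed that the job objects returned by `map_function_to_jobs` expose an `id` field like the other models in this tutorial:

```python
# Sketch: order a second round of mapped functions after the first.
first_round = map_function_to_jobs(
    api, workflow_id, "simulation", "run", params,
    resource_requirements_id=rr.id,
)
second_round = map_function_to_jobs(
    api, workflow_id, "simulation", "refine", more_params,  # hypothetical second stage
    resource_requirements_id=rr.id,
    depends_on_job_ids=[job.id for job in first_round],     # run after the first round
)
```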
Tutorial 11: Filtering CLI Output with Nushell
This tutorial teaches you how to filter and analyze Torc CLI output using Nushell, a modern shell with powerful structured data capabilities.
Learning Objectives
By the end of this tutorial, you will:
- Understand why Nushell is useful for filtering Torc output
- Know how to filter jobs by status, name, and other fields
- Be able to analyze results and find failures quickly
- Create complex queries combining multiple conditions
Prerequisites
- Torc CLI installed and configured
- A workflow with jobs (ideally one with various statuses)
Why Nushell?
Torc's CLI can output JSON with the -f json flag. While tools like jq can process JSON, Nushell
offers a more readable, SQL-like syntax that's easier to learn and use interactively.
Compare filtering failed jobs:
# jq (cryptic syntax)
torc jobs list 123 -f json | jq '.jobs[] | select(.status == "failed")'
# Nushell (readable, SQL-like)
torc jobs list 123 -f json | from json | get jobs | where status == "failed"
Nushell is:
- Cross-platform: Works on Linux, macOS, and Windows
- Readable: Uses intuitive commands like `where`, `select`, `sort-by`
- Interactive: Tab completion and helpful error messages
- Powerful: Built-in support for JSON, YAML, CSV, and more
Installing Nushell
Install Nushell from nushell.sh/book/installation:
# macOS
brew install nushell
# Windows
winget install nushell
# Linux (various methods available)
cargo install nu
After installation, run nu to start a Nushell session. You can use Nushell interactively or run
individual commands with nu -c "command".
Basic Filtering
Setup: Get JSON Output
All examples assume you have a workflow ID. Replace $WORKFLOW_ID with your actual ID:
# In Nushell, set your workflow ID
let wf = 123
List All Jobs
torc jobs list $wf -f json | from json | get jobs
This parses the JSON and extracts the jobs array into a table.
Filter by Status
Find all failed jobs:
torc jobs list $wf -f json | from json | get jobs | where status == "failed"
Find jobs that are ready or running:
torc jobs list $wf -f json | from json | get jobs | where status in ["ready", "running"]
Filter by Name Pattern
Find jobs with "train" in the name:
torc jobs list $wf -f json | from json | get jobs | where name =~ "train"
The =~ operator performs substring/regex matching.
Combine Conditions
Find failed jobs with "process" in the name:
torc jobs list $wf -f json | from json | get jobs | where status == "failed" and name =~ "process"
Find jobs that failed or were canceled:
torc jobs list $wf -f json | from json | get jobs | where status == "failed" or status == "canceled"
Selecting and Formatting Output
Select Specific Columns
Show only name and status:
torc jobs list $wf -f json | from json | get jobs | select name status
Sort Results
Sort by name:
torc jobs list $wf -f json | from json | get jobs | sort-by name
Sort failed jobs by ID (descending):
torc jobs list $wf -f json | from json | get jobs | where status == "failed" | sort-by id -r
Count Results
Count jobs by status:
torc jobs list $wf -f json | from json | get jobs | group-by status | transpose status jobs | each { |row| { status: $row.status, count: ($row.jobs | length) } }
Or more simply, count failed jobs:
torc jobs list $wf -f json | from json | get jobs | where status == "failed" | length
Analyzing Results
Find Jobs with Non-Zero Return Codes
torc results list $wf -f json | from json | get results | where return_code != 0
Find Results with Specific Errors
torc results list $wf -f json | from json | get results | where return_code != 0 | select job_id return_code
Join Jobs with Results
Get job names for failed results:
let jobs = (torc jobs list $wf -f json | from json | get jobs)
let results = (torc results list $wf -f json | from json | get results | where return_code != 0)
$results | each { |r|
let job = ($jobs | where id == $r.job_id | first)
{ name: $job.name, return_code: $r.return_code, job_id: $r.job_id }
}
Working with User Data
List User Data Entries
torc user-data list $wf -f json | from json | get user_data
Filter by Key
Find user data with a specific key:
torc user-data list $wf -f json | from json | get user_data | where key =~ "config"
Parse JSON Values
User data values are JSON strings. Parse and filter them:
torc user-data list $wf -f json | from json | get user_data | each { |ud|
{ key: $ud.key, value: ($ud.value | from json) }
}
Practical Examples
Example 1: Debug Failed Jobs
Find failed jobs and get their result details:
# Get failed job IDs
let failed_ids = (torc jobs list $wf -f json | from json | get jobs | where status == "failed" | get id)
# Show results for those jobs
torc results list $wf -f json | from json | get results | where job_id in $failed_ids | select job_id return_code
Example 2: Find Stuck Jobs
Find jobs that have been running for a long time (status is "running"):
torc jobs list $wf -f json | from json | get jobs | where status == "running" | select id name
Example 3: Parameter Sweep Analysis
For a parameterized workflow, find which parameter values failed:
torc jobs list $wf -f json | from json | get jobs | where status == "failed" and name =~ "lr" | get name
Example 4: Export to CSV
Export failed jobs to CSV for further analysis:
torc jobs list $wf -f json | from json | get jobs | where status == "failed" | to csv | save failed_jobs.csv
Quick Reference
| Operation | Nushell Command |
|---|---|
| Parse JSON | from json |
| Get field | get jobs |
| Filter rows | where status == "failed" |
| Select columns | select name status id |
| Sort | sort-by name |
| Sort descending | sort-by id -r |
| Count | length |
| Substring match | where name =~ "pattern" |
| Multiple conditions | where status == "failed" and name =~ "x" |
| In list | where status in ["ready", "running"] |
| Group by | group-by status |
| Save to file | save output.json |
| Convert to CSV | to csv |
Tips
- Use `nu` interactively: Start a Nushell session to explore data step by step
- Tab completion: Nushell provides completions for commands and field names
- Pipeline debugging: Add `| first 5` to see a sample before processing all data
- Save queries: Create shell aliases or scripts for common filters
What You Learned
In this tutorial, you learned:
- Why Nushell is a great tool for filtering Torc CLI output
- How to filter jobs by status and name patterns
- How to analyze results and find failures
- How to work with user data
- Practical examples for debugging workflows
Next Steps
- Nushell Documentation - Learn more about Nushell's capabilities
- Torc CLI Reference - Full list of CLI commands and their JSON output
Shell Completions
Torc provides shell completion scripts to make working with the CLI faster and more convenient. Completions help you discover commands, avoid typos, and speed up your workflow.
Overview
Shell completions provide:
- Command completion - Tab-complete `torc` subcommands and options
- Flag completion - Tab-complete command-line flags and their values
- Multi-shell support - Bash, Zsh, Fish, Elvish, and PowerShell
- Automatic updates - Completions are generated from the CLI structure
Generating Completions
Use the torc completions command to generate completion scripts for your shell:
# See available shells
torc completions --help
# Generate for a specific shell
torc completions bash
torc completions zsh
torc completions fish
torc completions elvish
torc completions powershell
Installation
Bash
User installation
# Create completions directory if it doesn't exist
mkdir -p ~/.local/share/bash-completion/completions
# Generate and install completions
torc completions bash > ~/.local/share/bash-completion/completions/torc
# Source the completion file in your current shell
source ~/.local/share/bash-completion/completions/torc
Verify installation:
# Restart your shell or source the completion file
source ~/.local/share/bash-completion/completions/torc
# Test completions
torc wor<TAB> # Should complete to "workflows"
torc workflows <TAB> # Should show workflow subcommands
Zsh
Option 1: User installation (recommended)
# Create completions directory in your home directory
mkdir -p ~/.zfunc
# Add to fpath in your ~/.zshrc if not already present
echo 'fpath=(~/.zfunc $fpath)' >> ~/.zshrc
echo 'autoload -Uz compinit && compinit' >> ~/.zshrc
# Generate and install completions
torc completions zsh > ~/.zfunc/_torc
# Restart shell or source ~/.zshrc
source ~/.zshrc
Option 2: Using custom location
# Generate to a custom location
mkdir -p ~/my-completions
torc completions zsh > ~/my-completions/_torc
# Add to ~/.zshrc
echo 'fpath=(~/my-completions $fpath)' >> ~/.zshrc
echo 'autoload -Uz compinit && compinit' >> ~/.zshrc
# Restart shell
exec zsh
Troubleshooting Zsh completions:
If completions aren't working, try rebuilding the completion cache:
# Remove completion cache
rm -f ~/.zcompdump
# Restart shell
exec zsh
Fish
# Fish automatically loads completions from ~/.config/fish/completions/
mkdir -p ~/.config/fish/completions
# Generate and install completions
torc completions fish > ~/.config/fish/completions/torc.fish
# Fish will automatically load the completions
# Test immediately (no shell restart needed)
torc wor<TAB>
Elvish
# Create completions directory
mkdir -p ~/.elvish/lib
# Generate completions
torc completions elvish > ~/.elvish/lib/torc.elv
# Add to your ~/.elvish/rc.elv
echo 'use torc' >> ~/.elvish/rc.elv
# Restart shell
PowerShell
Windows PowerShell / PowerShell Core:
# Create profile directory if it doesn't exist
New-Item -ItemType Directory -Force -Path (Split-Path -Parent $PROFILE)
# Generate completions to a file
torc completions powershell > $HOME\.config\torc_completions.ps1
# Add to your PowerShell profile
Add-Content -Path $PROFILE -Value '. $HOME\.config\torc_completions.ps1'
# Reload profile
. $PROFILE
Alternative: Source inline
# Generate and add directly to profile
torc completions powershell | Out-File -Append -FilePath $PROFILE
# Reload profile
. $PROFILE
Using Completions
Once installed, use Tab to trigger completions:
Command Completion
# Complete subcommands
torc <TAB>
# Shows: workflows, jobs, files, events, run, submit, tui, ...
torc work<TAB>
# Completes to: torc workflows
torc workflows <TAB>
# Shows: create, list, get, delete, submit, run, ...
Flag Completion
# Complete flags
torc --<TAB>
# Shows: --url, --username, --password, --format, --log-level, --help
torc workflows list --<TAB>
# Shows available flags for the list command
# Complete flag values (where applicable)
torc workflows list --format <TAB>
# Shows: table, json
Workflow ID Completion
# Some shells support dynamic completion
torc workflows get <TAB>
# May show available workflow IDs
Examples
Here are some common completion patterns:
# Discover available commands
torc <TAB><TAB>
# Complete command names
torc w<TAB> # workflows
torc wo<TAB> # workflows
torc j<TAB> # jobs
# Navigate subcommands
torc workflows <TAB> # create, list, get, delete, ...
torc jobs <TAB> # list, get, update, ...
# Complete flags
torc --u<TAB> # --url, --username
torc --url <type-url>
torc --format <TAB> # table, json
# Complex commands
torc workflows create --<TAB>
# Shows all available flags for the create command
Updating Completions
When you update Torc to a new version, regenerate the completion scripts to get the latest commands and flags:
# Bash
torc completions bash > ~/.local/share/bash-completion/completions/torc
source ~/.local/share/bash-completion/completions/torc
# Zsh
torc completions zsh > ~/.zfunc/_torc
rm -f ~/.zcompdump && exec zsh
# Fish
torc completions fish > ~/.config/fish/completions/torc.fish
# Fish reloads automatically
# PowerShell
torc completions powershell > $HOME\.config\torc_completions.ps1
. $PROFILE
Automation
You can automate completion installation in your dotfiles or setup scripts:
Bash Setup Script
#!/bin/bash
# install-torc-completions.sh
COMPLETION_DIR="$HOME/.local/share/bash-completion/completions"
mkdir -p "$COMPLETION_DIR"
if command -v torc &> /dev/null; then
torc completions bash > "$COMPLETION_DIR/torc"
echo "Torc completions installed for Bash"
echo "Run: source $COMPLETION_DIR/torc"
else
echo "Error: torc command not found"
exit 1
fi
Zsh Setup Script
#!/bin/zsh
# install-torc-completions.zsh
COMPLETION_DIR="$HOME/.zfunc"
mkdir -p "$COMPLETION_DIR"
if command -v torc &> /dev/null; then
torc completions zsh > "$COMPLETION_DIR/_torc"
# Add fpath to .zshrc if not already present
if ! grep -q "fpath=(.*\.zfunc" ~/.zshrc; then
echo 'fpath=(~/.zfunc $fpath)' >> ~/.zshrc
echo 'autoload -Uz compinit && compinit' >> ~/.zshrc
fi
echo "Torc completions installed for Zsh"
echo "Run: exec zsh"
else
echo "Error: torc command not found"
exit 1
fi
Post-Installation Check
#!/bin/bash
# verify-completions.sh
# Test if completions are working
if complete -p torc &> /dev/null; then
echo "✓ Torc completions are installed"
else
echo "✗ Torc completions are not installed"
echo "Run: torc completions bash > ~/.local/share/bash-completion/completions/torc"
fi
Troubleshooting
Completions Not Working
Problem: Tab completion doesn't show torc commands.
Solutions:
- Verify torc is in your PATH:
  which torc  # Should show path to torc binary
- Check if the completion file exists:
  # Bash
  ls -l ~/.local/share/bash-completion/completions/torc
  # Zsh
  ls -l ~/.zfunc/_torc
  # Fish
  ls -l ~/.config/fish/completions/torc.fish
- Verify the completion is loaded:
  # Bash
  complete -p torc
  # Zsh
  which _torc
- Reload your shell or source the completion file:
  # Bash
  source ~/.local/share/bash-completion/completions/torc
  # Zsh
  exec zsh
  # Fish (automatic)
Outdated Completions
Problem: New commands or flags don't show in completions.
Solution: Regenerate the completion file after updating Torc:
# Bash
torc completions bash > ~/.local/share/bash-completion/completions/torc
source ~/.local/share/bash-completion/completions/torc
# Zsh
torc completions zsh > ~/.zfunc/_torc
rm ~/.zcompdump && exec zsh
# Fish
torc completions fish > ~/.config/fish/completions/torc.fish
Permission Denied
Problem: Cannot write to system completion directory.
Solution: Use user-level completion directory or sudo:
# Use user directory (recommended)
mkdir -p ~/.local/share/bash-completion/completions
torc completions bash > ~/.local/share/bash-completion/completions/torc
# Or use sudo for system-wide
torc completions bash | sudo tee /etc/bash_completion.d/torc > /dev/null
Zsh "command not found: compdef"
Problem: Zsh completion system not initialized.
Solution: Add to your ~/.zshrc:
autoload -Uz compinit && compinit
PowerShell Execution Policy
Problem: Cannot run completion script due to execution policy.
Solution: Adjust execution policy:
# Check current policy
Get-ExecutionPolicy
# Set policy to allow local scripts
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Shell-Specific Features
Bash
- Case-insensitive completion (if configured in `.inputrc`)
- Partial matching support
- Menu completion available
Zsh
- Advanced completion with descriptions
- Correction suggestions
- Menu selection
- Color support for completions
Fish
- Rich descriptions for each option
- Real-time syntax highlighting
- Automatic paging for long completion lists
- Fuzzy matching support
PowerShell
- IntelliSense-style completions
- Parameter descriptions
- Type-aware completions
Best Practices
- Keep completions updated: Regenerate after each Torc update
- Use version control: Include completion installation in dotfiles
- Automate installation: Add to setup scripts for new machines
- Test after updates: Verify completions work after shell or Torc updates
- Document in team wikis: Help teammates set up completions
Additional Resources
- Bash Completion Documentation
- Zsh Completion System
- Fish Completion Tutorial
- PowerShell Tab Completion
Design & Architecture
Internal design documentation for developers.
- Client Architecture - Torc client design
- Workflow Actions - Workflow action system
- Web Dashboard Design - Dashboard architecture
- Server API Handler - Server implementation
- Central Database - Database schema and design
- Workflow Recovery Design - Recovery system architecture
- Failure Handler Design - Failure handler implementation
- AI-Assisted Recovery Design - AI-assisted error classification
- Workflow Graph - Dependency graph implementation
- Interface Architecture - Interface design patterns
Client
Torc provides client libraries in multiple languages for workflow management.
Rust Client (Primary)
The Rust client provides both CLI and library interfaces:
Workflow Creation
- Parse workflow specification files (JSON, JSON5, YAML, KDL)
- Expand parameterized job/file specifications
- Create all workflow components atomically via API calls
- Handle name-to-ID resolution for dependencies
Workflow Manager
- Start/restart/reinitialize workflow execution
- Track file changes and update database
- Detect changed user_data inputs
- Validate workflow state before initialization
API Integration
- Auto-generated client from OpenAPI spec
- Pagination support for large result sets
- Retry logic and error handling
Client Modes
The Rust client operates in multiple modes:
- CLI Mode - Command-line interface for interactive use
- Library Mode - Programmatic API for integration with other tools
- Specification Parser - Reads and expands workflow specifications
- API Client - HTTP client for communicating with the server
Python Client
The Python client (torc package) provides programmatic workflow management for Python users:
- OpenAPI-generated client for full API access
- `make_api()` helper for easy server connection
- `map_function_to_jobs()` for mapping Python functions across parameters
- Integration with Python data science and ML pipelines
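A compressed sketch of that flow, following the map-function tutorial earlier in this book (the module name, function, and parameters below are placeholders):

```python
import getpass

from torc import make_api, map_function_to_jobs
from torc.openapi_client import WorkflowModel

api = make_api("http://localhost:8080/torc-service/v1")
workflow = api.create_workflow(
    WorkflowModel(user=getpass.getuser(), name="example", description="demo")
)
params = [{"x": 1}, {"x": 2}, {"x": 3}]
# Map my_module.run across the parameter sets; see the tutorial for the full version.
jobs = map_function_to_jobs(api, workflow.id, "my_module", "run", params)
```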
See Creating Workflows for usage examples.
Julia Client
The Julia client (Torc.jl package) provides programmatic workflow management for Julia users:
- OpenAPI-generated client for full API access
- `make_api()` helper for easy server connection
- `send_api_command()` wrapper with error handling
- `add_jobs()` for batch job creation
- `map_function_to_jobs()` for mapping Julia functions across parameters
See Creating Workflows for usage examples.
Workflow Actions
Workflow actions enable automatic execution of commands and resource allocation in response to workflow lifecycle events. Actions provide hooks for setup, teardown, monitoring, and dynamic resource management throughout workflow execution.
Overview
Actions consist of three components:
- Trigger - The condition that activates the action
- Action Type - The operation to perform
- Configuration - Parameters specific to the action
actions:
- trigger_type: "on_workflow_start"
action_type: "run_commands"
commands:
- "mkdir -p output logs"
- "echo 'Workflow started' > logs/status.txt"
Trigger Types
Workflow Lifecycle Triggers
on_workflow_start
Executes once when the workflow is initialized.
When it fires: During initialize_jobs after jobs are transitioned from uninitialized to
ready/blocked states.
Typical use cases:
- Scheduling Slurm allocations
- Creating directory structures
- Copying initial data
- trigger_type: "on_workflow_start"
action_type: "run_commands"
commands:
- "mkdir -p output checkpoints temp"
- "echo 'Workflow started at $(date)' > workflow.log"
on_workflow_complete
Executes once when all jobs reach terminal states (completed, failed, or canceled).
When it fires: After the last job completes, as detected by the job runner.
Typical use cases:
- Archiving final results
- Uploading to remote storage
- Cleanup of temporary files
- Generating summary reports
- trigger_type: "on_workflow_complete"
action_type: "run_commands"
commands:
- "tar -czf results.tar.gz output/"
- "aws s3 cp results.tar.gz s3://bucket/results/"
- "rm -rf temp/"
Job-Based Triggers
on_jobs_ready
Executes when all specified jobs transition to the "ready" state.
When it fires: When the last specified job becomes ready to execute (all dependencies satisfied).
Typical use cases:
- Scheduling Slurm allocations
- Starting phase-specific monitoring
- Pre-computation setup
- Notifications before expensive operations
- trigger_type: "on_jobs_ready"
action_type: "schedule_nodes"
jobs: ["train_model_001", "train_model_002", "train_model_003"]
scheduler: "gpu_cluster"
scheduler_type: "slurm"
num_allocations: 2
Important: The action triggers only when all matching jobs are ready, not individually as each job becomes ready.
on_jobs_complete
Executes when all specified jobs reach terminal states (completed, failed, or canceled).
When it fires: When the last specified job finishes execution.
Typical use cases:
- Scheduling Slurm allocations
- Cleaning up intermediate files
- Archiving phase results
- Freeing storage space
- Phase-specific reporting
- trigger_type: "on_jobs_complete"
action_type: "run_commands"
jobs: ["preprocess_1", "preprocess_2", "preprocess_3"]
commands:
- "echo 'Preprocessing phase complete' >> workflow.log"
- "rm -rf raw_data/"
Worker Lifecycle Triggers
Worker lifecycle triggers are persistent by default, meaning they execute once per worker (job runner), not once per workflow.
on_worker_start
Executes when each worker (job runner) starts.
When it fires: After a job runner starts and checks for workflow actions, before claiming any jobs.
Typical use cases:
- Worker-specific initialization
- Setting up worker-local logging
- Copying data to compute node local storage
- Initializing worker-specific resources
- Recording worker startup metrics
- trigger_type: "on_worker_start"
action_type: "run_commands"
persistent: true # Each worker executes this
commands:
- "echo 'Worker started on $(hostname) at $(date)' >> worker.log"
- "mkdir -p worker_temp"
on_worker_complete
Executes when each worker completes (exits the main loop).
When it fires: After a worker finishes processing jobs and before it shuts down.
Typical use cases:
- Worker-specific cleanup
- Uploading worker-specific logs
- Recording worker completion metrics
- Cleaning up worker-local resources
- trigger_type: "on_worker_complete"
action_type: "run_commands"
persistent: true # Each worker executes this
commands:
- "echo 'Worker completed on $(hostname) at $(date)' >> worker.log"
- "rm -rf worker_temp"
Job Selection
For on_jobs_ready and on_jobs_complete triggers, specify which jobs to monitor.
Exact Job Names
- trigger_type: "on_jobs_complete"
action_type: "run_commands"
jobs: ["job1", "job2", "job3"]
commands:
- "echo 'Specific jobs complete'"
Regular Expressions
- trigger_type: "on_jobs_ready"
action_type: "schedule_nodes"
job_name_regexes: ["train_model_[0-9]+", "eval_.*"]
scheduler: "gpu_cluster"
scheduler_type: "slurm"
num_allocations: 2
Common regex patterns:
"train_.*"- All jobs starting with "train_""model_[0-9]+"- Jobs like "model_1", "model_2"".*_stage1"- All jobs ending with "_stage1""job_(a|b|c)"- Jobs "job_a", "job_b", or "job_c"
Combining Selection Methods
You can use both together - the action triggers when all matching jobs meet the condition:
jobs: ["critical_job"]
job_name_regexes: ["batch_.*"]
# Triggers when "critical_job" AND all "batch_*" jobs are ready/complete
Action Types
run_commands
Execute shell commands sequentially on a compute node.
Configuration:
- trigger_type: "on_workflow_complete"
action_type: "run_commands"
commands:
- "tar -czf results.tar.gz output/"
- "aws s3 cp results.tar.gz s3://bucket/"
Execution details:
- Commands run in the workflow's output directory
- Commands execute sequentially (one after another)
- If a command fails, the action fails (but workflow continues)
- Commands run on compute nodes, not the submission node
- Uses the shell environment of the job runner process
schedule_nodes
Dynamically allocate compute resources from a Slurm scheduler.
Configuration:
- trigger_type: "on_jobs_ready"
action_type: "schedule_nodes"
jobs: ["train_model_1", "train_model_2"]
scheduler: "gpu_cluster"
scheduler_type: "slurm"
num_allocations: 2
start_one_worker_per_node: true
max_parallel_jobs: 8
Parameters:
- `scheduler` (required) - Name of the Slurm scheduler configuration (must exist in `slurm_schedulers`)
- `scheduler_type` (required) - Must be "slurm"
- `num_allocations` (required) - Number of Slurm allocation requests to submit
- `start_one_worker_per_node` (optional) - Start one job runner per node (default: false)
- `max_parallel_jobs` (optional) - Maximum concurrent jobs per runner
Use cases:
- Just-in-time resource allocation
- Cost optimization (allocate only when needed)
- Separating workflow phases with different resource requirements
Complete Examples
Refer to this example
Execution Model
Action Claiming and Execution
- Atomic Claiming: Actions are claimed atomically by workers to prevent duplicate execution
- Non-Persistent Actions: Execute once per workflow (first worker to claim executes)
- Persistent Actions: Can be claimed and executed by multiple workers
- Trigger Counting: Job-based triggers increment a counter as jobs transition; action becomes pending when count reaches threshold
- Immediate Availability: Worker lifecycle actions are immediately available after workflow initialization
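The trigger-counting rule can be pictured with a short sketch (conceptual only; the actual bookkeeping is done in the server's database, not in worker memory):

```python
# Conceptual illustration of trigger counting for an on_jobs_complete action.
class ActionTrigger:
    def __init__(self, watched_jobs: set[str]):
        self.watched = watched_jobs  # jobs selected by the action
        self.count = 0               # incremented as watched jobs reach terminal states

    def on_job_terminal(self, job_name: str) -> None:
        if job_name in self.watched:
            self.count += 1

    @property
    def pending(self) -> bool:
        # The action becomes claimable once every watched job has finished.
        return self.count >= len(self.watched)

trigger = ActionTrigger({"preprocess_1", "preprocess_2", "preprocess_3"})
for name in ("preprocess_1", "preprocess_2", "preprocess_3"):
    trigger.on_job_terminal(name)
assert trigger.pending
```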
Action Lifecycle
[Workflow Created]
↓
[initialize_jobs called]
↓
├─→ on_workflow_start actions become pending
├─→ on_worker_start actions become pending (persistent)
├─→ on_worker_complete actions become pending (persistent)
└─→ on_jobs_ready actions wait for job transitions
↓
[Worker Claims and Executes Actions]
↓
[Jobs Execute]
↓
[Jobs Complete]
↓
├─→ on_jobs_complete actions become pending when all specified jobs complete
└─→ on_workflow_complete actions become pending when all jobs complete
↓
[Workers Exit]
↓
[on_worker_complete actions execute per worker]
Important Characteristics
- No Rollback: Failed actions don't affect workflow execution
- Compute Node Execution: Actions run on compute nodes via job runners
- One-Time Triggers: Non-persistent actions trigger once when conditions are first met
- No Inter-Action Dependencies: Actions don't depend on other actions
- Concurrent Workers: Multiple workers can execute different actions simultaneously
Workflow Reinitialization
When a workflow is reinitialized (e.g., after resetting failed jobs), actions are reset to allow them to trigger again:
- Executed flags are cleared: All actions can be claimed and executed again
- Trigger counts are recalculated: For `on_jobs_ready` and `on_jobs_complete` actions, the trigger count is set based on current job states
Example scenario:
- job1 and job2 are independent jobs
- postprocess_job depends on both job1 and job2
- An `on_jobs_ready` action triggers when postprocess_job becomes ready
After first run completes:
- job1 fails, job2 succeeds
- User resets failed jobs and reinitializes
- job2 is already Completed, so it counts toward the trigger count
- When job1 completes in the second run, postprocess_job becomes ready
- The action triggers again because the trigger count reaches the required threshold
This ensures actions properly re-trigger after workflow reinitialization, even when some jobs remain in their completed state.
Limitations
- No Action Dependencies: Actions cannot depend on other actions completing
- No Conditional Execution: Actions cannot have conditional logic (use multiple actions with different job selections instead)
- No Action Retries: Failed actions are not automatically retried
- Single Action Type: Each action has one action_type (cannot combine run_commands and schedule_nodes)
- No Dynamic Job Selection: Job names/patterns are fixed at action creation time
For complex workflows requiring these features, consider:
- Using job dependencies to order operations
- Creating separate jobs for conditional logic
- Implementing retry logic within command scripts
- Creating multiple actions for different scenarios
Web Dashboard (torc-dash)
The torc-dash application is a web gateway that provides a browser-based UI for managing Torc
workflows. It bridges a web frontend with the torc ecosystem by proxying API requests and executing
CLI commands.
Architecture
flowchart LR
Browser["Browser<br/>(Web UI)"] <--> Dashboard["torc-dash<br/>(Gateway)"]
Dashboard <--> Server["torc-server<br/>(API)"]
Dashboard --> CLI["torc CLI<br/>(subprocess)"]
style Browser fill:#17a2b8,color:#fff
style Dashboard fill:#4a9eff,color:#fff
style Server fill:#28a745,color:#fff
style CLI fill:#4a9eff,color:#fff
The dashboard acts as a gateway layer that:
- Serves embedded static assets - HTML, CSS, and JavaScript bundled into the binary
- Proxies API requests - Forwards /torc-service/* requests to a remote torc-server
- Executes CLI commands - Runs torc CLI as subprocesses for complex operations
- Manages server lifecycle - Optionally spawns and manages a torc-server instance
Core Components
Embedded Static Assets
Uses the rust_embed crate to bundle all files from the static/ directory directly into the
binary at compile time:
#[derive(Embed)]
#[folder = "static/"]
struct Assets;
This enables single-binary deployment with no external file dependencies.
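A minimal sketch of how an embedded asset can be looked up and served, assuming the Assets type above plus the mime_guess crate; the handler shape is illustrative rather than the dashboard's actual routing code.

use axum::{
    body::Body,
    http::{header, StatusCode},
    response::Response,
};

// Look up a path in the embedded bundle and answer with a guessed MIME type.
fn serve_embedded(path: &str) -> Response {
    match Assets::get(path) {
        Some(file) => Response::builder()
            .header(
                header::CONTENT_TYPE,
                mime_guess::from_path(path).first_or_octet_stream().as_ref(),
            )
            .body(Body::from(file.data.into_owned()))
            .unwrap(),
        None => Response::builder()
            .status(StatusCode::NOT_FOUND)
            .body(Body::empty())
            .unwrap(),
    }
}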
Application State
Shared state across all request handlers:
struct AppState {
    api_url: String,                      // Remote torc-server URL
    client: reqwest::Client,              // HTTP client for proxying
    torc_bin: String,                     // Path to torc CLI binary
    torc_server_bin: String,              // Path to torc-server binary
    managed_server: Mutex<ManagedServer>, // Optional embedded server state
}
Standalone Mode
When launched with --standalone, torc-dash automatically spawns a torc-server subprocess:
- Starts torc-server with configurable port (0 for auto-detection)
- Reads TORC_SERVER_PORT=<port> from stdout to discover actual port
- Configures API URL to point to the managed server
- Tracks process ID for lifecycle management
This enables single-command deployment for local development or simple production setups.
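For local development that typically looks like a single command (only the --standalone flag described above is shown; other options are omitted):

torc-dash --standalone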
Request Routing
Static File Routes
| Route | Handler | Purpose |
|---|---|---|
| / | index_handler | Serves index.html |
| /static/* | static_handler | Serves embedded assets with MIME types |
API Proxy
All /torc-service/* requests are transparently proxied to the remote torc-server:
Browser: GET /torc-service/v1/workflows
↓
torc-dash: forwards to http://localhost:8080/torc-service/v1/workflows
↓
torc-server: responds with workflow list
↓
torc-dash: returns response to browser
The proxy preserves HTTP methods (GET, POST, PUT, PATCH, DELETE), headers, and request bodies.
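A minimal sketch of the forwarding step, assuming the method, headers, and body have already been pulled out of the incoming request; the names and error handling are simplified and not the dashboard's actual handler.

use reqwest::{header::HeaderMap, Client, Method};

// Replay the browser's request against the upstream torc-server and return
// the upstream status code and body bytes.
async fn forward(
    client: &Client,
    api_url: &str,        // e.g. the remote torc-server base URL
    path_and_query: &str, // e.g. /torc-service/v1/workflows
    method: Method,
    headers: HeaderMap,
    body: Vec<u8>,
) -> Result<(u16, Vec<u8>), reqwest::Error> {
    let resp = client
        .request(method, format!("{api_url}{path_and_query}"))
        .headers(headers)
        .body(body)
        .send()
        .await?;
    let status = resp.status().as_u16();
    let bytes = resp.bytes().await?.to_vec();
    Ok((status, bytes))
}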
CLI Command Endpoints
These endpoints execute the torc CLI as subprocesses, enabling operations that require local file
access or complex orchestration:
| Endpoint | CLI Command | Purpose |
|---|---|---|
| POST /api/cli/create | torc workflows create | Create workflow from spec file |
| POST /api/cli/run | torc workflows run | Run workflow locally |
| POST /api/cli/submit | torc workflows submit | Submit to scheduler |
| POST /api/cli/initialize | torc workflows initialize | Initialize job dependencies |
| POST /api/cli/delete | torc workflows delete | Delete workflow |
| POST /api/cli/reinitialize | torc workflows reinitialize | Reinitialize workflow |
| POST /api/cli/reset-status | torc workflows reset-status | Reset job statuses |
| GET /api/cli/run-stream | torc workflows run | SSE streaming execution |
Server Management Endpoints
| Endpoint | Purpose |
|---|---|
| POST /api/server/start | Start a managed torc-server |
| POST /api/server/stop | Stop the managed server |
| GET /api/server/status | Check server running status |
Utility Endpoints
| Endpoint | Purpose |
|---|---|
| POST /api/cli/read-file | Read local file contents |
| POST /api/cli/plot-resources | Generate resource plots from DB |
| POST /api/cli/list-resource-dbs | Find resource database files |
Key Features
Streaming Workflow Execution
The /api/cli/run-stream endpoint uses Server-Sent Events (SSE) to provide real-time feedback:
Event: start
Data: Running workflow abc123
Event: stdout
Data: Job job_1 started
Event: status
Data: Jobs: 3 running, 7 completed (total: 10)
Event: stdout
Data: Job job_1 completed
Event: end
Data: success
Event: exit_code
Data: 0
The stream includes:
- stdout/stderr from the torc CLI process
- Periodic status updates fetched from the API every 3 seconds
- Exit code when the process completes
CLI Execution Pattern
All CLI commands follow a consistent execution pattern:
async fn run_torc_command(torc_bin: &str, args: &[&str], api_url: &str) -> CliResponse {
    Command::new(torc_bin)
        .args(args)
        .env("TORC_API_URL", api_url) // Pass server URL to CLI
        .output()
        .await
}
Returns structured JSON:
{
"success": true,
"stdout": "Workflow created: abc123",
"stderr": "",
"exit_code": 0
}
Configuration Merging
Configuration is merged from multiple sources (highest to lowest priority):
- CLI arguments - Command-line flags
- Environment variables - TORC_API_URL, TORC_BIN, etc.
- Configuration file - TorcConfig from ~/.torc.toml or similar
Design Rationale
Why Proxy Instead of Direct API Access?
- CORS avoidance - Browser same-origin policy doesn't apply to server-side requests
- Authentication layer - Can add authentication/authorization without modifying torc-server
- Request transformation - Can modify requests/responses as needed
- Logging and monitoring - Centralized request logging
Why CLI Delegation?
Complex operations like workflow creation are delegated to the existing torc CLI rather than
reimplementing:
- Code reuse - Leverages tested CLI implementation
- Local file access - CLI can read workflow specs from the filesystem
- Consistent behavior - Same behavior as command-line usage
- Maintenance - Single implementation to maintain
Why Standalone Mode?
- Single-binary deployment - One command starts everything needed
- Development convenience - Quick local testing without separate server
- Port auto-detection - Avoids port conflicts with port 0 support
Server API Handler
The server is a Rust async web service built with Tokio and uses:
- Multi-threaded Tokio runtime for concurrent request handling
- Modular API structure with separate modules per resource type (workflows.rs, jobs.rs, files.rs, etc.)
- OpenAPI-generated types for consistent API contracts
- Database-level locking (BEGIN IMMEDIATE TRANSACTION) for critical sections
Key Endpoints
The server implements these key endpoints:
- POST /workflows - Create new workflows
- POST /workflows/{id}/initialize_jobs - Build dependency graph and mark jobs ready
- POST /workflows/{id}/claim_next_jobs - Thread-safe job allocation to workers
- POST /jobs/{id}/manage_status_change - Update job status with cascade effects
- POST /workflows/{id}/process_changed_job_inputs - Detect changed inputs and reset jobs
Thread Safety
The claim_next_jobs endpoint uses database-level write locks so that the same job cannot be allocated to more than one worker. This is critical for maintaining consistency in distributed execution.
API Organization
Each resource type (workflows, jobs, files, events, etc.) has its own module in
server/src/bin/server/api/, keeping the codebase organized and maintainable. The main routing
logic delegates to these specialized modules.
Central Database
The SQLite database is the heart of Torc's coordination model. All workflow state lives in the database, enabling:
- Stateless clients and workers - All state persists in the database
- Multiple concurrent workers - Workers coordinate through database locks
- Fault tolerance - Workers can crash and restart; state is preserved
- Workflow resumption - Workflows can be stopped and restarted without losing progress
Core Database Tables
- workflow - Top-level workflow records with name, user, description
- workflow_status - Workflow execution state (run_id, status)
- job - Individual computational tasks with commands and status
- job_internal - Internal job data (input hashes for change detection)
- job_depends_on - Explicit and implicit job dependencies
- file - File artifacts with paths and modification times
- user_data - JSON data artifacts for passing information between jobs
- job_input_file, job_output_file - Job-file relationships
- job_input_user_data, job_output_user_data - Job-user_data relationships
- resource_requirements - CPU, memory, GPU, runtime specifications
- compute_node - Available compute resources
- scheduled_compute_node - Compute nodes allocated to workflows
- local_scheduler, slurm_scheduler - Execution environment configurations
- result - Job execution results (exit code, stdout, stderr)
- event - Audit log of workflow events
Foreign Key Cascades
The schema uses foreign key constraints with cascading deletes. Deleting a workflow automatically removes all associated jobs, files, events, and other related records, ensuring referential integrity.
Concurrency Model
SQLite uses database-level locking with BEGIN IMMEDIATE TRANSACTION to prevent race conditions in
critical sections, particularly during job allocation when multiple workers request jobs
simultaneously.
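A minimal sketch of that pattern with sqlx, assuming a raw SQLite connection; the real claim logic selects ready jobs and marks them pending inside the locked section.

use sqlx::SqliteConnection;

// BEGIN IMMEDIATE acquires the write lock up front, serializing workers
// that try to claim jobs at the same time.
async fn claim_jobs_exclusively(conn: &mut SqliteConnection) -> Result<(), sqlx::Error> {
    sqlx::query("BEGIN IMMEDIATE TRANSACTION").execute(&mut *conn).await?;
    // ... select ready jobs and assign them to the requesting worker ...
    sqlx::query("COMMIT").execute(&mut *conn).await?;
    Ok(())
}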
Workflow Recovery
Torc provides mechanisms for recovering workflows when Slurm allocations are preempted or fail
before completing all jobs. The torc slurm regenerate command creates new schedulers and
allocations for pending jobs.
The Recovery Problem
When running workflows on Slurm, allocations can fail or be preempted before all jobs complete. This leaves workflows in a partial state with:
- Ready/uninitialized jobs - Jobs that were waiting to run but never got scheduled
- Blocked jobs - Jobs whose dependencies haven't completed yet
- Orphaned running jobs - Jobs still marked as "running" in the database even though their Slurm allocation has terminated
Simply creating new Slurm schedulers and submitting allocations isn't enough because:
- Orphaned jobs block recovery: Jobs stuck in "running" status prevent the workflow from being considered complete, blocking recovery precondition checks
- Duplicate allocations: If the workflow had on_workflow_start actions to schedule nodes, those actions would fire again when the workflow is reinitialized, creating duplicate allocations
- Missing allocations for blocked jobs: Blocked jobs will eventually become ready, but there's no mechanism to schedule new allocations for them
Orphan Detection
Before recovery can proceed, orphaned jobs must be detected and their status corrected. This is
handled by the orphan detection module (src/client/commands/orphan_detection.rs).
How It Works
The orphan detection logic checks for three types of orphaned resources:
- Active allocations with terminated Slurm jobs: ScheduledComputeNodes marked as "active" in the database, but whose Slurm job is no longer running (verified via squeue)
- Pending allocations that disappeared: ScheduledComputeNodes marked as "pending" whose Slurm job no longer exists (cancelled or failed before starting)
- Running jobs with no active compute nodes: Jobs marked as "running" but with no active compute nodes to process them (fallback for non-Slurm cases)
flowchart TD
A[Start Orphan Detection] --> B[List active ScheduledComputeNodes]
B --> C{For each Slurm allocation}
C --> D[Check squeue for job status]
D --> E{Job still running?}
E -->|Yes| C
E -->|No| F[Find jobs on this allocation]
F --> G[Mark jobs as failed]
G --> H[Update ScheduledComputeNode to complete]
H --> C
C --> I[List pending ScheduledComputeNodes]
I --> J{For each pending allocation}
J --> K[Check squeue for job status]
K --> L{Job exists?}
L -->|Yes| J
L -->|No| M[Update ScheduledComputeNode to complete]
M --> J
J --> N[Check for running jobs with no active nodes]
N --> O[Mark orphaned jobs as failed]
O --> P[Done]
style A fill:#4a9eff,color:#fff
style B fill:#4a9eff,color:#fff
style C fill:#6c757d,color:#fff
style D fill:#4a9eff,color:#fff
style E fill:#6c757d,color:#fff
style F fill:#4a9eff,color:#fff
style G fill:#dc3545,color:#fff
style H fill:#4a9eff,color:#fff
style I fill:#4a9eff,color:#fff
style J fill:#6c757d,color:#fff
style K fill:#4a9eff,color:#fff
style L fill:#6c757d,color:#fff
style M fill:#4a9eff,color:#fff
style N fill:#4a9eff,color:#fff
style O fill:#dc3545,color:#fff
style P fill:#28a745,color:#fff
Integration Points
Orphan detection is integrated into two commands:
- torc recover: Runs orphan detection automatically as the first step before checking preconditions. This ensures that orphaned jobs don't block recovery.
- torc workflows sync-status: Standalone command to run orphan detection without triggering a full recovery. Useful for debugging or when you want to clean up orphaned jobs without submitting new allocations.
The torc watch Command
The torc watch command also performs orphan detection during its polling loop. When it detects
that no valid Slurm allocations exist (via a quick squeue check), it runs the full orphan
detection logic to clean up any orphaned jobs before checking if the workflow can make progress.
Recovery Actions
The recovery system uses ephemeral recovery actions to solve these problems.
How It Works
When torc slurm regenerate runs:
flowchart TD
A[torc slurm regenerate] --> B[Fetch pending jobs]
B --> C{Has pending jobs?}
C -->|No| D[Exit - nothing to do]
C -->|Yes| E[Build WorkflowGraph from pending jobs]
E --> F[Mark existing schedule_nodes actions as executed]
F --> G[Group jobs using scheduler_groups]
G --> H[Create schedulers for each group]
H --> I[Update jobs with scheduler assignments]
I --> J[Create on_jobs_ready recovery actions for deferred groups]
J --> K{Submit allocations?}
K -->|Yes| L[Submit Slurm allocations]
K -->|No| M[Done]
L --> M
style A fill:#4a9eff,color:#fff
style B fill:#4a9eff,color:#fff
style C fill:#6c757d,color:#fff
style D fill:#6c757d,color:#fff
style E fill:#4a9eff,color:#fff
style F fill:#4a9eff,color:#fff
style G fill:#4a9eff,color:#fff
style H fill:#4a9eff,color:#fff
style I fill:#4a9eff,color:#fff
style J fill:#ffc107,color:#000
style K fill:#6c757d,color:#fff
style L fill:#ffc107,color:#000
style M fill:#28a745,color:#fff
Step 1: Mark Existing Actions as Executed
All existing schedule_nodes actions are marked as executed using the claim_action API. This
prevents them from firing again and creating duplicate allocations:
sequenceDiagram
participant R as regenerate
participant S as Server
participant A as workflow_action table
R->>S: get_workflow_actions(workflow_id)
S-->>R: [action1, action2, ...]
loop For each schedule_nodes action
R->>S: claim_action(action_id)
S->>A: UPDATE executed=1, executed_at=NOW()
end
Step 2: Group Jobs Using WorkflowGraph
The system builds a WorkflowGraph from pending jobs and uses scheduler_groups() to group them by
(resource_requirements, has_dependencies). This aligns with the behavior of
torc workflows create-slurm:
- Jobs without dependencies: Can be scheduled immediately with on_workflow_start
- Jobs with dependencies (deferred): Need on_jobs_ready recovery actions to schedule when they become ready
flowchart TD
subgraph pending["Pending Jobs"]
A[Job A: no deps, rr=default]
B[Job B: no deps, rr=default]
C[Job C: depends on A, rr=default]
D[Job D: no deps, rr=gpu]
end
subgraph groups["Scheduler Groups"]
G1[Group 1: default, no deps<br/>Jobs: A, B]
G2[Group 2: default, has deps<br/>Jobs: C]
G3[Group 3: gpu, no deps<br/>Jobs: D]
end
A --> G1
B --> G1
C --> G2
D --> G3
style A fill:#4a9eff,color:#fff
style B fill:#4a9eff,color:#fff
style C fill:#ffc107,color:#000
style D fill:#17a2b8,color:#fff
style G1 fill:#28a745,color:#fff
style G2 fill:#28a745,color:#fff
style G3 fill:#28a745,color:#fff
Step 3: Create Recovery Actions for Deferred Groups
For groups with has_dependencies = true, the system creates on_jobs_ready recovery actions.
These actions:
- Have is_recovery = true to mark them as ephemeral
- Use a _deferred suffix in the scheduler name
- Trigger when the blocked jobs become ready
- Schedule additional Slurm allocations for those jobs
flowchart LR
subgraph workflow["Original Workflow"]
A[Job A: blocked] --> C[Job C: blocked]
B[Job B: blocked] --> C
end
subgraph actions["Recovery Actions"]
RA["on_jobs_ready: schedule_nodes<br/>job_ids: (A, B)<br/>is_recovery: true"]
RC["on_jobs_ready: schedule_nodes<br/>job_ids: (C)<br/>is_recovery: true"]
end
style A fill:#6c757d,color:#fff
style B fill:#6c757d,color:#fff
style C fill:#6c757d,color:#fff
style RA fill:#ffc107,color:#000
style RC fill:#ffc107,color:#000
Recovery Action Lifecycle
Recovery actions are ephemeral - they exist only during the recovery period:
stateDiagram-v2
[*] --> Created: regenerate creates action
Created --> Executed: Jobs become ready, action triggers
Executed --> Deleted: Workflow reinitialized
Created --> Deleted: Workflow reinitialized
classDef created fill:#ffc107,color:#000
classDef executed fill:#28a745,color:#fff
classDef deleted fill:#6c757d,color:#fff
class Created created
class Executed executed
class Deleted deleted
When a workflow is reinitialized (e.g., after resetting jobs), all recovery actions are deleted and original actions are reset to their initial state. This ensures a clean slate for the next run.
Database Schema
Recovery actions are tracked using the is_recovery column in the workflow_action table:
| Column | Type | Description |
|---|---|---|
| is_recovery | INTEGER | 0 = normal action, 1 = recovery action |
Behavior Differences
| Operation | Normal Actions | Recovery Actions |
|---|---|---|
| On reset_actions_for_reinitialize | Reset executed to 0 | Deleted entirely |
| Created by | Workflow spec | torc slurm regenerate |
| Purpose | Configured behavior | Temporary recovery |
Usage
# Regenerate schedulers for pending jobs
torc slurm regenerate <workflow_id> --account <account>
# With automatic submission
torc slurm regenerate <workflow_id> --account <account> --submit
# Using a specific HPC profile
torc slurm regenerate <workflow_id> --account <account> --profile kestrel
Implementation Details
The recovery logic is implemented in:
- src/client/commands/orphan_detection.rs: Shared orphan detection logic used by recover, watch, and workflows sync-status
- src/client/commands/recover.rs: Main recovery command implementation
- src/client/commands/slurm.rs: handle_regenerate function for Slurm scheduler regeneration
- src/client/workflow_graph.rs: WorkflowGraph::from_jobs() and scheduler_groups() methods
- src/server/api/workflow_actions.rs: reset_actions_for_reinitialize function
- migrations/20251225000000_add_is_recovery_to_workflow_action.up.sql: Schema migration
Key implementation notes:
- WorkflowGraph construction: A WorkflowGraph is built from pending jobs using from_jobs(), which reconstructs the dependency structure from depends_on_job_ids
- Scheduler grouping: Jobs are grouped using scheduler_groups() by (resource_requirements, has_dependencies), matching create-slurm behavior
- Deferred schedulers: Groups with dependencies get a _deferred suffix in the scheduler name
- Allocation calculation: Number of allocations is based on job count and resources per node
- Recovery actions: Only deferred groups (jobs with dependencies) get on_jobs_ready recovery actions
Failure Handler Design
This document describes the internal architecture and implementation of failure handlers in Torc. For a user-focused tutorial, see Configurable Failure Handlers.
Overview
Failure handlers provide per-job automatic retry logic based on exit codes. They allow workflows to recover from transient failures without manual intervention or workflow-level recovery heuristics.
flowchart LR
subgraph workflow["Workflow Specification"]
FH["failure_handlers:<br/>- name: handler1<br/> rules: [...]"]
JOB["jobs:<br/>- name: my_job<br/> failure_handler: handler1"]
end
subgraph server["Server"]
DB[(Database)]
API["REST API"]
end
subgraph client["Job Runner"]
RUNNER["JobRunner"]
RECOVERY["Recovery Logic"]
end
FH --> DB
JOB --> DB
RUNNER --> API
API --> DB
RUNNER --> RECOVERY
style FH fill:#4a9eff,color:#fff
style JOB fill:#4a9eff,color:#fff
style DB fill:#ffc107,color:#000
style API fill:#28a745,color:#fff
style RUNNER fill:#17a2b8,color:#fff
style RECOVERY fill:#dc3545,color:#fff
Problem Statement
When jobs fail, workflows traditionally have two recovery options:
- Manual intervention: User investigates and restarts failed jobs
- Workflow-level recovery: torc watch --recover applies heuristics based on detected failure patterns (OOM, timeout, etc.)
Neither approach handles application-specific failures where:
- The job itself knows why it failed (via exit code)
- A specific recovery action can fix the issue
- Immediate retry is appropriate
Failure handlers solve this by allowing jobs to define exit-code-specific retry behavior with optional recovery scripts.
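At the specification level this looks roughly like the sketch below. The field names mirror the failure_handlers/rules structure shown in the diagram below and the rule struct later in this section; the job command and recovery script path are placeholders, and the exact schema may differ.

failure_handlers:
  - name: handler1
    rules:
      # Retry these application-specific exit codes, after a cleanup script
      - exit_codes: [10, 11]
        max_retries: 3
        recovery_script: ./cleanup_scratch.sh   # placeholder path
      # Catch-all for any other non-zero exit code
      - match_all_exit_codes: true
        max_retries: 1

jobs:
  - name: my_job
    command: python simulate.py   # placeholder command
    failure_handler: handler1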
Architecture
Component Interaction
sequenceDiagram
participant WS as Workflow Spec
participant API as Server API
participant DB as Database
participant JR as JobRunner
participant RS as Recovery Script
participant JOB as Job Process
Note over WS,DB: Workflow Creation
WS->>API: Create workflow with failure_handlers
API->>DB: INSERT failure_handler
API->>DB: INSERT job with failure_handler_id
Note over JR,JOB: Job Execution
JR->>API: Claim job
JR->>JOB: Execute command
JOB-->>JR: Exit code (e.g., 10)
Note over JR,API: Failure Recovery
JR->>API: GET failure_handler
API->>DB: SELECT rules
DB-->>API: Rules JSON
API-->>JR: FailureHandlerModel
JR->>JR: Match exit code to rule
JR->>API: POST retry_job (reserves retry)
alt Recovery Script Defined
JR->>RS: Execute with env vars
RS-->>JR: Exit code
end
JR->>JR: Job returns to Ready queue
Data Model
erDiagram
WORKFLOW ||--o{ JOB : contains
WORKFLOW ||--o{ FAILURE_HANDLER : contains
FAILURE_HANDLER ||--o{ JOB : "referenced by"
WORKFLOW {
int id PK
string name
int status_id FK
}
FAILURE_HANDLER {
int id PK
int workflow_id FK
string name
string rules "JSON array"
}
JOB {
int id PK
int workflow_id FK
string name
string command
int status
int failure_handler_id FK "nullable"
int attempt_id "starts at 1"
}
Rule Matching
Failure handler rules are stored as a JSON array. When a job fails, rules are evaluated in a specific order to find a match.
Rule Structure
pub struct FailureHandlerRule {
    pub exit_codes: Vec<i32>,            // Specific codes to match
    pub match_all_exit_codes: bool,      // Catch-all flag
    pub recovery_script: Option<String>,
    pub max_retries: i32,                // Default: 3
}
Matching Priority
Rules are evaluated with specific matches taking priority over catch-all rules:
flowchart TD
START["Job fails with exit code X"]
SPECIFIC{"Find rule where<br/>exit_codes contains X?"}
CATCHALL{"Find rule where<br/>match_all_exit_codes = true?"}
FOUND["Rule matched"]
NONE["No matching rule<br/>Job marked Failed"]
START --> SPECIFIC
SPECIFIC -->|Found| FOUND
SPECIFIC -->|Not found| CATCHALL
CATCHALL -->|Found| FOUND
CATCHALL -->|Not found| NONE
style START fill:#dc3545,color:#fff
style SPECIFIC fill:#4a9eff,color:#fff
style CATCHALL fill:#ffc107,color:#000
style FOUND fill:#28a745,color:#fff
style NONE fill:#6c757d,color:#fff
This ensures that specific exit code handlers always take precedence, regardless of rule order in the JSON array.
Implementation (job_runner.rs):
let matching_rule = rules
    .iter()
    .find(|rule| rule.exit_codes.contains(&(exit_code as i32)))
    .or_else(|| rules.iter().find(|rule| rule.match_all_exit_codes));
Recovery Flow
The recovery process is designed to be atomic and safe:
flowchart TD
subgraph JobRunner["JobRunner (Client)"]
FAIL["Job fails"]
FETCH["Fetch failure handler"]
MATCH["Match rule to exit code"]
CHECK{"attempt_id<br/>< max_retries?"}
RESERVE["POST /jobs/{id}/retry/{run_id}<br/>Reserves retry slot"]
SCRIPT{"Recovery<br/>script defined?"}
RUN["Execute recovery script"]
DONE["Job queued for retry"]
FAILED["Mark job as Failed"]
end
subgraph Server["Server (API)"]
VALIDATE["Validate run_id matches"]
STATUS["Check job status"]
MAX["Validate max_retries"]
UPDATE["UPDATE job<br/>status=Ready<br/>attempt_id += 1"]
EVENT["INSERT event record"]
COMMIT["COMMIT transaction"]
end
FAIL --> FETCH
FETCH --> MATCH
MATCH --> CHECK
CHECK -->|Yes| RESERVE
CHECK -->|No| FAILED
RESERVE --> VALIDATE
VALIDATE --> STATUS
STATUS --> MAX
MAX --> UPDATE
UPDATE --> EVENT
EVENT --> COMMIT
COMMIT --> SCRIPT
SCRIPT -->|Yes| RUN
SCRIPT -->|No| DONE
RUN -->|Success or Failure| DONE
style FAIL fill:#dc3545,color:#fff
style RESERVE fill:#4a9eff,color:#fff
style RUN fill:#ffc107,color:#000
style DONE fill:#28a745,color:#fff
style FAILED fill:#6c757d,color:#fff
style UPDATE fill:#17a2b8,color:#fff
style COMMIT fill:#17a2b8,color:#fff
Key Design Decisions
- Retry reservation before recovery script: The retry_job API is called before the recovery script runs. This ensures:
  - The retry slot is reserved atomically
  - Recovery scripts don't run for retries that won't happen
  - External resources modified by recovery scripts are consistent
- Recovery script failure is non-fatal: If the recovery script fails, the job is still retried. This prevents recovery script bugs from blocking legitimate retries.
- Transaction isolation: The retry_job API uses BEGIN IMMEDIATE to prevent race conditions where multiple processes might try to retry the same job.
API Endpoints
GET /failure_handlers/{id}
Fetches a failure handler by ID.
Response:
{
"id": 1,
"workflow_id": 42,
"name": "simulation_recovery",
"rules": "[{\"exit_codes\":[10,11],\"max_retries\":3}]"
}
POST /jobs/{id}/retry/{run_id}?max_retries=N
Retries a failed job by resetting its status to Ready.
Query Parameters:
- max_retries (required): Maximum retries allowed by the matching rule
Validations:
- Job must exist
- run_id must match workflow's current run
- Job status must be Running, Failed, or Terminated
- attempt_id must be less than max_retries
Transaction Safety:
BEGIN IMMEDIATE; -- Acquire write lock
SELECT j.*, ws.run_id as workflow_run_id
FROM job j
JOIN workflow w ON j.workflow_id = w.id
JOIN workflow_status ws ON w.status_id = ws.id
WHERE j.id = ?;
-- Validate conditions...
UPDATE job SET status = 2, attempt_id = ? WHERE id = ?;
INSERT INTO event (workflow_id, timestamp, data) VALUES (?, ?, ?);
COMMIT;
Response:
{
"id": 123,
"workflow_id": 42,
"name": "my_job",
"status": "ready",
"attempt_id": 2
}
Recovery Script Execution
Recovery scripts run in a subprocess with environment variables providing context:
flowchart LR
subgraph env["Environment Variables"]
WID["TORC_WORKFLOW_ID"]
JID["TORC_JOB_ID"]
JN["TORC_JOB_NAME"]
URL["TORC_API_URL"]
OUT["TORC_OUTPUT_DIR"]
AID["TORC_ATTEMPT_ID"]
RC["TORC_RETURN_CODE"]
end
subgraph script["Recovery Script"]
SHELL["bash -c<br/>(or cmd /C on Windows)"]
CODE["User script code"]
end
env --> SHELL
SHELL --> CODE
style WID fill:#4a9eff,color:#fff
style JID fill:#4a9eff,color:#fff
style JN fill:#4a9eff,color:#fff
style URL fill:#4a9eff,color:#fff
style OUT fill:#4a9eff,color:#fff
style AID fill:#ffc107,color:#000
style RC fill:#dc3545,color:#fff
style SHELL fill:#6c757d,color:#fff
style CODE fill:#28a745,color:#fff
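A minimal sketch of the subprocess launch, using the same tokio Command style as the CLI execution pattern earlier; only a subset of the environment variables from the diagram is shown, and the function shape is illustrative.

use tokio::process::Command;

// Run the rule's recovery script through a shell with job context in the
// environment. A non-zero exit is logged but does not cancel the retry.
async fn run_recovery_script(
    script: &str,
    workflow_id: i64,
    job_id: i64,
    job_name: &str,
    attempt_id: i64,
    return_code: i32,
) -> std::io::Result<std::process::ExitStatus> {
    Command::new("bash")
        .arg("-c")
        .arg(script)
        .env("TORC_WORKFLOW_ID", workflow_id.to_string())
        .env("TORC_JOB_ID", job_id.to_string())
        .env("TORC_JOB_NAME", job_name)
        .env("TORC_ATTEMPT_ID", attempt_id.to_string())
        .env("TORC_RETURN_CODE", return_code.to_string())
        .status()
        .await
}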
Log File Naming
Each job attempt produces separate log files to preserve history:
output/job_stdio/
├── job_wf{W}_j{J}_r{R}_a1.o # Attempt 1 stdout
├── job_wf{W}_j{J}_r{R}_a1.e # Attempt 1 stderr
├── job_wf{W}_j{J}_r{R}_a2.o # Attempt 2 stdout
├── job_wf{W}_j{J}_r{R}_a2.e # Attempt 2 stderr
└── ...
Where:
- W = workflow_id
- J = job_id
- R = run_id
- aN = attempt number
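The naming scheme amounts to a simple format string; a small illustrative helper (not the actual Torc function) might look like this:

use std::path::PathBuf;

// Each attempt gets its own stdout (.o) and stderr (.e) file, so logs from
// earlier attempts are never overwritten.
fn attempt_log_paths(
    output_dir: &str,
    workflow_id: i64,
    job_id: i64,
    run_id: i64,
    attempt: i64,
) -> (PathBuf, PathBuf) {
    let stem = format!("job_wf{workflow_id}_j{job_id}_r{run_id}_a{attempt}");
    let dir = PathBuf::from(output_dir).join("job_stdio");
    (dir.join(format!("{stem}.o")), dir.join(format!("{stem}.e")))
}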
Database Schema
failure_handler Table
CREATE TABLE failure_handler (
id INTEGER PRIMARY KEY AUTOINCREMENT,
workflow_id INTEGER NOT NULL REFERENCES workflow(id) ON DELETE CASCADE,
name TEXT NOT NULL,
rules TEXT NOT NULL, -- JSON array of FailureHandlerRule
UNIQUE(workflow_id, name)
);
job Table (relevant columns)
ALTER TABLE job ADD COLUMN failure_handler_id INTEGER
REFERENCES failure_handler(id) ON DELETE SET NULL;
ALTER TABLE job ADD COLUMN attempt_id INTEGER NOT NULL DEFAULT 1;
Slurm Integration
When a job is retried, it returns to the Ready queue and will be picked up by any available compute node. For Slurm workflows, this may require additional allocations if existing nodes have terminated.
flowchart TD
RETRY["Job retried<br/>(status = Ready)"]
CHECK{"Compute nodes<br/>available?"}
RUN["Job runs on<br/>existing allocation"]
SCHEDULE["Auto-schedule triggers<br/>new Slurm allocation"]
WAIT["Job waits for<br/>allocation to start"]
EXEC["Job executes"]
RETRY --> CHECK
CHECK -->|Yes| RUN
CHECK -->|No| SCHEDULE
SCHEDULE --> WAIT
WAIT --> EXEC
RUN --> EXEC
style RETRY fill:#28a745,color:#fff
style CHECK fill:#6c757d,color:#fff
style RUN fill:#4a9eff,color:#fff
style SCHEDULE fill:#ffc107,color:#000
style WAIT fill:#17a2b8,color:#fff
style EXEC fill:#28a745,color:#fff
If auto_schedule_on_ready_jobs actions are configured, new Slurm allocations will be created
automatically when retried jobs become ready. See Workflow Actions for
details.
Implementation Files
| File | Purpose |
|---|---|
| src/client/job_runner.rs | try_recover_job(), rule matching |
| src/client/utils.rs | shell_command() cross-platform shell |
| src/server/api/jobs.rs | retry_job() API endpoint |
| src/server/api/failure_handlers.rs | CRUD operations for failure handlers |
| src/client/workflow_spec.rs | Parsing failure handlers from specs |
| migrations/20260110*.sql | Database schema for failure handlers |
Comparison with Workflow Recovery
| Aspect | Failure Handlers | Workflow Recovery (torc watch) |
|---|---|---|
| Scope | Per-job | Workflow-wide |
| Trigger | Specific exit codes | OOM detection, timeout patterns |
| Timing | Immediate (during job run) | After job completion |
| Recovery Action | Custom scripts | Resource adjustment, resubmission |
| Configuration | In workflow spec | Command-line flags |
| State | Preserved (same workflow run) | May start new run |
| Slurm | Reuses or auto-schedules nodes | Creates new schedulers |
Recommendation: Use both mechanisms together:
- Failure handlers for immediate, exit-code-specific recovery
- torc watch --recover for workflow-level resource adjustments and allocation recovery
Recovery Outcome and pending_failed Status
When try_recover_job is called, it returns a RecoveryOutcome enum that determines the final job
status:
pub enum RecoveryOutcome {
    /// Job was successfully scheduled for retry
    Retried,
    /// No failure handler defined - use PendingFailed status
    NoHandler,
    /// Failure handler exists but no rule matched - use PendingFailed status
    NoMatchingRule,
    /// Max retries exceeded - use Failed status
    MaxRetriesExceeded,
    /// API call or other error - use Failed status
    Error(String),
}
Status Assignment Flow
flowchart TD
FAIL["Job fails"]
TRY["try_recover_job()"]
RETRIED{"Outcome?"}
READY["Status: ready<br/>attempt_id += 1"]
PENDING["Status: pending_failed"]
FAILED["Status: failed"]
FAIL --> TRY
TRY --> RETRIED
RETRIED -->|Retried| READY
RETRIED -->|NoHandler / NoMatchingRule| PENDING
RETRIED -->|MaxRetriesExceeded / Error| FAILED
style FAIL fill:#dc3545,color:#fff
style READY fill:#28a745,color:#fff
style PENDING fill:#ffc107,color:#000
style FAILED fill:#6c757d,color:#fff
pending_failed Status (value 10)
The pending_failed status is a new job state that indicates:
- The job failed with a non-zero exit code
- No failure handler rule matched the exit code
- The job is awaiting classification (retry or fail)
Key properties:
- Not terminal: Workflow is not considered complete while jobs are pending_failed
- Downstream blocked: Dependent jobs remain in blocked status (not canceled)
- Resettable: reset-status --failed-only includes pending_failed jobs
Integration with AI-Assisted Recovery
Jobs in pending_failed status can be classified by an AI agent using MCP tools:
sequenceDiagram
participant JR as JobRunner
participant API as Torc API
participant MCP as torc-mcp-server
participant AI as AI Agent
JR->>API: complete_job(status=pending_failed)
Note over JR,API: Job awaiting classification
AI->>MCP: list_pending_failed_jobs(workflow_id)
MCP->>API: GET /jobs?status=pending_failed
API-->>MCP: Jobs with stderr content
MCP-->>AI: Pending jobs + stderr
AI->>AI: Analyze error patterns
AI->>MCP: classify_and_resolve_failures(classifications)
alt action = retry
MCP->>API: PUT /jobs/{id} status=ready
Note over API: Triggers re-execution
else action = fail
MCP->>API: PUT /jobs/{id} status=failed
Note over API: Triggers downstream cancellation
end
See AI-Assisted Recovery Design for full details.
AI-Assisted Recovery Design
🧪 EXPERIMENTAL: This feature is new and not yet well-tested. The API and behavior may change based on user feedback.
This document describes the architecture and implementation of AI-assisted failure recovery in Torc. For a user-focused tutorial, see AI-Assisted Failure Recovery.
Overview
AI-assisted recovery enables intelligent classification of job failures that can't be handled by
rule-based mechanisms (failure handlers, OOM/timeout detection). It introduces a new job status
(pending_failed) that defers the fail/retry decision to an AI agent.
flowchart TD
subgraph traditional["Traditional Recovery"]
FAIL1["Job fails"]
HANDLER{"Failure handler?"}
OOM{"OOM/timeout?"}
FAILED1["Status: failed"]
RETRY1["Retry"]
end
subgraph ai["AI-Assisted Recovery"]
FAIL2["Job fails"]
PENDING["Status: pending_failed"]
AGENT["AI agent classifies"]
FAILED2["Status: failed"]
RETRY2["Retry"]
end
FAIL1 --> HANDLER
HANDLER -->|Match| RETRY1
HANDLER -->|No match| OOM
OOM -->|Yes| RETRY1
OOM -->|No| FAILED1
FAIL2 --> PENDING
PENDING --> AGENT
AGENT -->|Permanent| FAILED2
AGENT -->|Transient| RETRY2
style FAIL1 fill:#dc3545,color:#fff
style FAIL2 fill:#dc3545,color:#fff
style PENDING fill:#ffc107,color:#000
style AGENT fill:#4a9eff,color:#fff
style FAILED1 fill:#6c757d,color:#fff
style FAILED2 fill:#6c757d,color:#fff
style RETRY1 fill:#28a745,color:#fff
style RETRY2 fill:#28a745,color:#fff
Problem Statement
Current recovery mechanisms have blind spots:
- Failure handlers: Require predefined exit codes. Many failures use generic exit code 1.
- OOM/timeout detection: Only handles resource exhaustion patterns.
- --retry-unknown: Blindly retries all failures, wasting compute on unfixable bugs.
Real-world failures often require contextual analysis:
| Error | Analysis Required | Decision |
|---|---|---|
| Connection refused to storage.internal:443 | Was the storage server down? | Retry if transient |
| NCCL timeout after 1800 seconds | Is this a node failure or code bug? | Retry if node issue |
| SyntaxError: invalid syntax | Is the code broken? | Fail - needs fix |
| FileNotFoundError: input.csv | Missing input or wrong path? | Depends on context |
AI agents can analyze stderr, correlate with external systems, and make informed decisions.
Architecture
Component Overview
flowchart LR
subgraph client["Torc Client"]
RUNNER["JobRunner"]
WATCH["torc watch"]
RECOVER["torc recover"]
end
subgraph server["Torc Server"]
API["REST API"]
DB[(SQLite)]
end
subgraph mcp["MCP Layer"]
MCPSRV["torc-mcp-server"]
CUSTOM["Custom MCP servers"]
end
subgraph agent["AI Agent"]
LLM["Claude/Copilot/Custom"]
end
RUNNER --> API
WATCH --> RECOVER
RECOVER --> API
API --> DB
MCPSRV --> API
LLM --> MCPSRV
LLM --> CUSTOM
style RUNNER fill:#17a2b8,color:#fff
style WATCH fill:#17a2b8,color:#fff
style RECOVER fill:#17a2b8,color:#fff
style API fill:#28a745,color:#fff
style DB fill:#ffc107,color:#000
style MCPSRV fill:#4a9eff,color:#fff
style LLM fill:#dc3545,color:#fff
Data Flow
sequenceDiagram
participant JR as JobRunner
participant API as Torc API
participant DB as Database
participant MCP as torc-mcp-server
participant AI as AI Agent
Note over JR,DB: Job Failure
JR->>JR: Job exits with code 1
JR->>JR: No failure handler match
JR->>API: complete_job(status=pending_failed)
API->>DB: UPDATE job SET status=10
Note over AI,DB: AI Classification
AI->>MCP: list_pending_failed_jobs(workflow_id)
MCP->>API: GET /jobs?status=pending_failed
API->>DB: SELECT * FROM job WHERE status=10
DB-->>API: Jobs with pending_failed
API-->>MCP: Job list
MCP->>MCP: Read stderr files
MCP-->>AI: Jobs + stderr content
AI->>AI: Analyze errors
AI->>MCP: classify_and_resolve_failures(classifications)
alt action = retry
MCP->>API: PUT /jobs/{id} status=ready
API->>DB: UPDATE job SET status=2, attempt_id+=1
else action = fail
MCP->>API: PUT /jobs/{id} status=failed
API->>DB: UPDATE job SET status=6
Note over API,DB: Triggers downstream cancellation
end
Job Status: pending_failed
Status Values
| Value | Name | Description |
|---|---|---|
| 0 | uninitialized | Not yet initialized |
| 1 | blocked | Waiting on dependencies |
| 2 | ready | Ready to run |
| 3 | pending | Claimed by worker |
| 4 | running | Currently executing |
| 5 | completed | Finished successfully |
| 6 | failed | Failed (terminal) |
| 7 | canceled | Canceled by user |
| 8 | terminated | Killed by signal |
| 9 | disabled | Skipped |
| 10 | pending_failed | Awaiting AI classification |
Status Transitions
stateDiagram-v2
[*] --> uninitialized
uninitialized --> blocked : initialize
uninitialized --> ready : no dependencies
blocked --> ready : dependencies met
ready --> pending : claimed
pending --> running : started
running --> completed : exit 0
running --> failed : handler match + max retries
running --> pending_failed : no handler match
running --> ready : failure handler match
running --> terminated : signal
state "pending_failed" as pending_failed
pending_failed --> failed : AI classifies permanent
pending_failed --> ready : AI classifies transient
pending_failed --> uninitialized : reset-status
failed --> [*]
completed --> [*]
canceled --> [*]
terminated --> [*]
Workflow Completion Semantics
A workflow with pending_failed jobs is not complete:
fn is_workflow_complete(workflow_id: i64) -> bool {
    // Jobs in these statuses are "complete"
    let complete_statuses = [
        JobStatus::Completed,
        JobStatus::Failed,
        JobStatus::Canceled,
        JobStatus::Terminated,
        JobStatus::Disabled,
    ];
    // pending_failed is NOT in this list,
    // so workflows with pending_failed jobs are incomplete
    !jobs.iter().any(|j| !complete_statuses.contains(&j.status))
}
This ensures:
- torc watch continues monitoring
- Downstream jobs remain blocked (not canceled)
- The workflow doesn't appear "done" prematurely
Recovery Outcome Enum
The try_recover_job function returns detailed outcomes:
pub enum RecoveryOutcome {
    /// Job was successfully scheduled for retry
    Retried,
    /// No failure handler defined - use PendingFailed status
    NoHandler,
    /// Failure handler exists but no rule matched - use PendingFailed status
    NoMatchingRule,
    /// Max retries exceeded - use Failed status
    MaxRetriesExceeded,
    /// API call or other error - use Failed status
    Error(String),
}
Usage in handle_job_completion:
match self.try_recover_job(job_id, ...) {
    RecoveryOutcome::Retried => {
        // Job queued for retry, clean up
        return;
    }
    RecoveryOutcome::NoHandler | RecoveryOutcome::NoMatchingRule => {
        // Check if workflow has use_pending_failed enabled
        if self.workflow.use_pending_failed.unwrap_or(false) {
            // Use pending_failed for AI classification
            final_result.status = JobStatus::PendingFailed;
        } else {
            // Use failed status (default behavior)
            // (status already Failed)
        }
    }
    RecoveryOutcome::MaxRetriesExceeded | RecoveryOutcome::Error(_) => {
        // Use failed - no recovery possible
        // (status already Failed)
    }
}
Enabling AI-Assisted Recovery
AI-assisted recovery is opt-in per workflow using the use_pending_failed flag. By default,
jobs that fail without a matching failure handler get the Failed status.
Workflow Specification
Add use_pending_failed: true to your workflow spec to enable:
name: ml_training
use_pending_failed: true # Enable AI-assisted recovery
jobs:
- name: train_model
command: python train.py
Without this flag (or with use_pending_failed: false), jobs use the traditional behavior:
- Failure handler match → retry
- No failure handler → Failed status
- Max retries exceeded → Failed status
With use_pending_failed: true:
- Failure handler match → retry
- No failure handler → PendingFailed status (awaiting AI classification)
- Max retries exceeded → Failed status
Why Opt-In?
The default behavior prioritizes predictability and backward compatibility:
- Existing workflows continue to work - no breaking changes
- Clear failure semantics - jobs either retry or fail immediately
- No external dependencies - doesn't require AI agent integration
Opt-in when you want:
- Intelligent classification of ambiguous failures
- Human/AI review before retry decisions
- Reduced compute waste from blind retries
MCP Server Tools
list_pending_failed_jobs
Lists jobs awaiting classification with their stderr content.
Implementation:
pub fn list_pending_failed_jobs(
    config: &Configuration,
    workflow_id: i64,
    output_dir: &Path,
) -> Result<CallToolResult, McpError> {
    // 1. Query jobs with pending_failed status
    let jobs = paginate_jobs(
        config,
        workflow_id,
        JobListParams::new().with_status(JobStatus::PendingFailed),
    );

    // 2. For each job, fetch result and read stderr tail
    for job in &jobs {
        let result = get_latest_result(job.id);
        let stderr_path = get_job_stderr_path(output_dir, ...);
        let stderr_tail = read_last_n_lines(&stderr_path, 50);
        // Include in response
    }

    // 3. Return structured response with guidance
}
classify_and_resolve_failures
Applies AI classifications to jobs.
Classification struct:
pub struct FailureClassification {
    pub job_id: i64,
    pub action: String,            // "retry" or "fail"
    pub memory: Option<String>,    // Optional resource adjustment
    pub runtime: Option<String>,
    pub reason: Option<String>,    // For audit trail
}
Implementation:
pub fn classify_and_resolve_failures(
    config: &Configuration,
    workflow_id: i64,
    classifications: Vec<FailureClassification>,
    dry_run: bool,
) -> Result<CallToolResult, McpError> {
    // 0. Validate workflow has use_pending_failed enabled
    let workflow = get_workflow(config, workflow_id)?;
    if !workflow.use_pending_failed.unwrap_or(false) {
        return Err(invalid_params(
            "Workflow does not have use_pending_failed enabled"
        ));
    }

    for classification in &classifications {
        // 1. Validate job is in pending_failed status
        // 2. Apply resource adjustments if specified
        // 3. Set status based on action:
        //    - "retry": status = ready, attempt_id += 1
        //    - "fail": status = failed (triggers cascade)
    }
}
Validation:
The tool validates that the workflow has use_pending_failed: true before allowing any
classifications. This prevents accidental modification of workflows that don't opt into AI-assisted
recovery.
Integration with reset-status
The reset-status --failed-only command also resets pending_failed jobs:
-- reset_failed_jobs_only query
SELECT id, status FROM job
WHERE workflow_id = $1
AND status IN (
$failed_status,
$canceled_status,
$terminated_status,
$pending_failed_status -- Added
)
This allows users to reset pending_failed jobs without AI classification if desired.
Error Classification Patterns
The AI agent should recognize common patterns:
Transient Errors
const TRANSIENT_PATTERNS: &[&str] = &[
    // Network
    "Connection refused",
    "Connection timed out",
    "Network is unreachable",
    "DNS resolution failed",
    "Service Unavailable",
    // GPU/HPC
    "NCCL timeout",
    "GPU communication error",
    "CUDA out of memory", // Could be transient if memory is shared
    // Hardware
    "EIO",
    "Input/output error",
    // Slurm
    "PREEMPTED",
    "NODE_FAIL",
    "TIMEOUT", // Slurm walltime, not job timeout
];
Permanent Errors
const PERMANENT_PATTERNS: &[&str] = &[
    // Python
    "SyntaxError",
    "IndentationError",
    "ModuleNotFoundError",
    "ImportError",
    "NameError",
    "TypeError",
    "ValueError",
    // General
    "FileNotFoundError", // For input files
    "PermissionDenied",
    "AssertionError",
    "IndexError",
    "KeyError",
];
These patterns are guidance for AI agents, not hard-coded rules. The AI can use context to
override (e.g., FileNotFoundError for a file that should be created by an upstream job might be
transient if the upstream job is being retried).
Slurm Integration
When pending_failed jobs are classified as "retry", they return to ready status. For Slurm
workflows:
- If active allocations exist, jobs may run immediately
- If no allocations, torc watch --auto-schedule will create new ones
- Manual recovery: torc slurm regenerate --submit
Design Decisions
Why a New Status vs. a Flag?
Alternative considered: Add needs_classification: bool flag to jobs.
Decision: New status is cleaner because:
- Status is already used for state machine transitions
- is_workflow_complete naturally excludes pending_failed
- No schema changes to existing status column
- Clearer semantics in logs and UI
Why Defer to AI vs. Built-in Heuristics?
Alternative considered: Build pattern matching into Torc directly.
Decision: AI-assisted approach because:
- Error patterns are domain-specific and evolving
- AI can use context (multiple errors, timing, external systems)
- Users can customize via custom MCP servers
- Avoids bloating Torc with error classification logic
Why Not Block on AI Response?
Alternative considered: Job runner waits for AI classification.
Decision: Asynchronous classification because:
- AI inference adds latency (seconds to minutes)
- AI service may be unavailable
- Human oversight is valuable for production workflows
- Jobs can accumulate for batch classification
CLI Integration
The torc recover and torc watch commands support automatic AI agent invocation:
Command-Line Options
| Option | Default | Description |
|---|---|---|
| --ai-recovery | false | Enable AI-assisted classification |
| --ai-agent | claude | AI agent CLI to invoke (currently only claude) |
Invocation Flow
When --ai-recovery is enabled:
pub fn invoke_ai_agent(workflow_id: i64, agent: &str, output_dir: &Path) -> Result<(), String> {
    // 1. Check if agent CLI is available (e.g., `which claude`)
    // 2. Build prompt with workflow context
    // 3. Spawn agent with --print flag for non-interactive mode
    // 4. Capture and log output
    // 5. Return success/failure
}
The prompt instructs the AI agent to:
- Call list_pending_failed_jobs to get jobs with stderr
- Analyze each job's error to classify as transient or permanent
- Call classify_and_resolve_failures with classifications
Agent Requirements
For the claude agent:
- Claude Code CLI must be installed (claude command in PATH)
- Torc MCP server must be configured in ~/.claude/mcp_servers.json
- The --print flag is used for non-interactive execution
Implementation Files
| File | Purpose |
|---|---|
| src/models.rs | JobStatus::PendingFailed enum variant |
| src/client/job_runner.rs | RecoveryOutcome enum, status assignment |
| src/client/commands/recover.rs | invoke_ai_agent function, CLI integration |
| src/server/api/jobs.rs | reset_failed_jobs_only includes pending_failed |
| torc-mcp-server/src/tools.rs | MCP tool implementations |
| torc-mcp-server/src/server.rs | MCP server handlers |
Future Enhancements
- Confidence thresholds: AI classifies with confidence score; low confidence escalates to user
- Learning from outcomes: Track whether AI classifications led to successful retries
- Batch scheduling optimization: AI recommends optimal Slurm allocations for retry jobs
- Custom MCP server examples: Templates for domain-specific error classification
Workflow Graph
The WorkflowGraph module provides a directed acyclic graph (DAG) representation of workflow jobs
and their dependencies. It serves as the core data structure for dependency analysis, scheduler
planning, and execution visualization.
Purpose
The graph abstraction addresses several key challenges:
- Unified Representation: Works with both workflow specifications (at creation time) and database models (at runtime), providing a consistent interface for graph algorithms
- Dependency Analysis: Enables topological sorting, level computation, and critical path detection
- Scheduler Planning: Groups jobs by resource requirements and dependency status for efficient scheduler generation
- Sub-workflow Detection: Identifies connected components that can be scheduled independently
Data Structures
JobNode
Represents a single job (or parameterized job template) in the graph:
pub struct JobNode {
    pub name: String,                          // Job name (may contain {param} placeholders)
    pub resource_requirements: Option<String>, // Resource requirements name
    pub instance_count: usize,                 // 1 for normal jobs, N for parameterized
    pub name_pattern: String,                  // Regex pattern matching all instances
    pub scheduler: Option<String>,             // Assigned scheduler
    pub command: String,                       // Command to execute
}
WorkflowGraph
The main graph structure with bidirectional edges for efficient traversal:
pub struct WorkflowGraph {
    nodes: HashMap<String, JobNode>,               // Jobs indexed by name
    depends_on: HashMap<String, HashSet<String>>,  // Forward edges (blockers)
    depended_by: HashMap<String, HashSet<String>>, // Reverse edges (dependents)
    levels: Option<Vec<Vec<String>>>,              // Cached topological levels
    components: Option<Vec<WorkflowComponent>>,    // Cached connected components
}
SchedulerGroup
Groups jobs that share scheduling characteristics:
pub struct SchedulerGroup {
    pub resource_requirements: String,  // Common RR name
    pub has_dependencies: bool,         // Whether jobs have blockers
    pub job_count: usize,               // Total instances across jobs
    pub job_name_patterns: Vec<String>, // Regex patterns for matching
    pub job_names: Vec<String>,         // Job names in this group
}
Construction Methods
From Workflow Specification
WorkflowGraph::from_spec(&spec) -> Result<Self, Error>
Builds the graph at workflow creation time:
- Creates nodes for each job specification
- Resolves explicit dependencies (depends_on)
- Resolves regex dependencies (depends_on_regexes)
- Computes implicit dependencies from input/output files and user data
From Database Models
WorkflowGraph::from_jobs(jobs, resource_requirements) -> Result<Self, Error>
Builds the graph from fetched database records (used for recovery and visualization):
- Creates nodes from JobModel records
- Resolves dependencies via depends_on_job_ids (if available)
- Falls back to computing dependencies from file relationships
Key Operations
Topological Levels
Groups jobs by dependency depth for parallel execution planning:
- Level 0: Jobs with no dependencies (can start immediately)
- Level N: Jobs whose dependencies are all in levels < N
Used for execution ordering and TUI visualization.
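The level computation is a layered variant of Kahn's algorithm over the depends_on map; the sketch below is illustrative rather than the actual WorkflowGraph method.

use std::collections::{HashMap, HashSet};

// Peel off, level by level, every job whose blockers have already been placed.
fn topological_levels(
    nodes: &HashSet<String>,
    depends_on: &HashMap<String, HashSet<String>>,
) -> Vec<Vec<String>> {
    let mut remaining: HashSet<String> = nodes.clone();
    let mut placed: HashSet<String> = HashSet::new();
    let mut levels = Vec::new();
    while !remaining.is_empty() {
        let level: Vec<String> = remaining
            .iter()
            .filter(|j| {
                depends_on
                    .get(*j)
                    .map_or(true, |deps| deps.iter().all(|d| placed.contains(d)))
            })
            .cloned()
            .collect();
        if level.is_empty() {
            break; // cycle guard; never reached for a valid DAG
        }
        for j in &level {
            remaining.remove(j);
        }
        placed.extend(level.iter().cloned());
        levels.push(level);
    }
    levels
}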
Connected Components
Identifies independent sub-workflows using BFS traversal:
- Each component can be scheduled independently
- Enables parallel execution of unrelated job pipelines
- Useful for large workflows with multiple independent processing chains
Scheduler Groups
Groups jobs by (resource_requirements, has_dependencies) for scheduler generation:
- Jobs without dependencies: Submitted at workflow start
- Jobs with dependencies: Submitted on-demand when jobs become ready
- Enables the shared generate_scheduler_plan() function used by both torc slurm generate and torc slurm regenerate
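A minimal sketch of the grouping key, assuming jobs have already been reduced to (name, resource requirements, has-dependencies) tuples; the real method works on JobNode values, but the idea is the same.

use std::collections::HashMap;

// Jobs sharing the same resource requirements and the same "has blockers"
// flag land in one scheduler group.
fn group_jobs(jobs: &[(String, String, bool)]) -> HashMap<(String, bool), Vec<&str>> {
    let mut groups: HashMap<(String, bool), Vec<&str>> = HashMap::new();
    for (name, rr, has_deps) in jobs {
        groups
            .entry((rr.clone(), *has_deps))
            .or_default()
            .push(name.as_str());
    }
    groups
}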
Critical Path
Finds the longest path through the graph (by instance count):
- Identifies bottleneck jobs that limit parallelism
- Used for estimating minimum execution time
- Helps prioritize optimization efforts
Integration Points
Scheduler Plan Generation
The SchedulerPlan module uses WorkflowGraph::scheduler_groups() to generate Slurm schedulers:
let graph = WorkflowGraph::from_spec(&spec)?;
let groups = graph.scheduler_groups();
let plan = generate_scheduler_plan(&graph, &resource_requirements, &profile, ...);
Execution Plan Visualization
The execution plan display uses WorkflowGraph::from_jobs() for runtime visualization:
let graph = WorkflowGraph::from_jobs(&jobs, &resource_requirements)?;
let levels = graph.topological_levels()?;
// Render DAG visualization in TUI
Recovery Scenarios
The regenerate command uses the graph to determine scheduler groupings for failed workflows:
let graph = WorkflowGraph::from_jobs(&jobs, &resource_requirements)?;
let plan = generate_scheduler_plan(&graph, ...);
// Apply plan to recreate schedulers and actions
Design Decisions
Bidirectional Edges
The graph maintains both depends_on and depended_by maps for O(1) traversal in either direction.
This is critical for:
- Finding what becomes ready when a job completes
- Computing connected components efficiently
- Building subgraphs for partial analysis
Lazy Computation with Caching
Topological levels and connected components are computed on-demand and cached. This avoids unnecessary computation for simple queries while ensuring efficient repeated access.
Parameterized Job Handling
Parameterized jobs are represented as single nodes with instance_count > 1. The name_pattern
field provides a regex for matching expanded instances, enabling scheduler grouping without full
expansion.
Interface Architecture
This document describes the design decisions behind Torc's multi-interface architecture and the patterns used to expose functionality consistently across all user-facing interfaces.
Problem Statement
Torc needs to serve diverse user workflows:
- Automation scripts need programmatic access via CLI and API clients
- Interactive monitoring requires real-time updates via TUI and web dashboard
- AI assistants need structured tool access via the MCP protocol
- External integrations require language-agnostic HTTP APIs
Each interface has different requirements for output format, error handling, and interaction patterns, yet all must provide consistent access to core functionality.
Design Goals
- Consistency: All interfaces expose the same core operations with consistent semantics
- Single Source of Truth: Business logic lives in the server; clients are thin wrappers
- Interface-Appropriate UX: Each interface adapts presentation to its context
- Maintainability: Adding features should require minimal interface-specific code
- Discoverability: Users should easily find available operations in each interface
Solution Overview
The architecture follows a layered approach where the CLI serves as the foundation for Rust-based interfaces, while external clients communicate directly with the HTTP API.
graph TD
subgraph ui["User Interfaces"]
CLI["CLI<br/>(torc)"]
TUI["TUI<br/>(torc tui)"]
DASH_BE["Dashboard Backend<br/>(torc-dash)"]
DASH_FE["Dashboard Frontend<br/>(JavaScript)"]
MCP["MCP Server<br/>(torc-mcp-server)"]
PY["Python Client"]
JL["Julia Client"]
end
subgraph lib["Rust Client Library"]
API["src/client/apis/<br/>Generated OpenAPI client"]
CMD["src/client/commands/<br/>CLI command handlers"]
end
subgraph server["Server"]
HTTP["HTTP API<br/>(torc-server)"]
DB[(SQLite Database)]
end
CLI --> CMD
CMD --> API
TUI --> API
TUI --> HTTP
DASH_BE --> API
MCP --> API
DASH_FE --> HTTP
PY --> HTTP
JL --> HTTP
API --> HTTP
HTTP --> DB
style CLI fill:#4a9eff,color:#fff
style TUI fill:#17a2b8,color:#fff
style DASH_BE fill:#17a2b8,color:#fff
style DASH_FE fill:#17a2b8,color:#fff
style MCP fill:#6f42c1,color:#fff
style PY fill:#ffc107,color:#000
style JL fill:#ffc107,color:#000
style API fill:#4a9eff,color:#fff
style CMD fill:#4a9eff,color:#fff
style HTTP fill:#28a745,color:#fff
style DB fill:#28a745,color:#fff
Key architectural decisions:
- CLI as foundation: The CLI (src/client/commands/) provides the command implementations that other Rust interfaces can reuse.
- Shared Rust client library: The TUI, Dashboard backend, and MCP server all use the generated Rust API client (src/client/apis/) as the CLI does. This client library makes HTTP requests to the torc-server.
- Direct HTTP access: The TUI, Dashboard JavaScript frontend, Python client, and Julia client also communicate directly with the HTTP API for certain operations.
Interface Implementations
CLI (Command Line Interface)
Location: src/client/commands/
Design Pattern: Subcommand dispatch with format-aware output
The CLI uses clap for argument parsing with a hierarchical command structure:
torc
├── workflows
│ ├── create
│ ├── list
│ ├── run
│ └── ...
├── jobs
│ ├── list
│ ├── get
│ └── ...
└── ...
Key Design Decisions:
- Dual Output Formats: Every list/get command supports --format table (human-readable) and --format json (machine-parseable). This enables both interactive use and scripting.
- Pagination Built-In: All list commands include --offset and --limit flags, mirroring the API's pagination model directly.
- Environment Variable Fallbacks: Common parameters like --url fall back to environment variables (TORC_API_URL), reducing repetition in scripts.
- Consistent Error Output: Errors write to stderr with context, while successful output goes to stdout, enabling clean piping.
Implementation Pattern:
pub fn handle_list(config: &Configuration, format: &str) {
    match list_items(config) {
        Ok(items) => match format {
            "json" => println!("{}", serde_json::to_string_pretty(&items).unwrap()),
            _ => display_table(&items),
        },
        Err(e) => eprintln!("Error: {}", e),
    }
}
TUI (Terminal User Interface)
Location: src/tui/
Design Pattern: Component-based reactive UI with polling updates
Key Design Decisions:
- Separation of Concerns:
  - app.rs: Application state and business logic
  - ui.rs: Rendering logic using ratatui
  - api.rs: API client with anyhow::Result error handling
  - components.rs: Reusable UI widgets (dialogs, lists)
- Blocking API Client: Unlike the async server, the TUI uses reqwest::blocking to simplify the event loop. API calls happen on the main thread between render cycles.
- Modal Dialogs: Confirmation dialogs for destructive actions (delete, cancel) prevent accidental data loss in the fast-paced terminal environment.
- Vim-Style Navigation: Keyboard shortcuts follow vim conventions (j/k for navigation, Enter for selection) for power users.
State Management:
pub struct App {
    pub workflows: Vec<WorkflowModel>,
    pub selected_workflow: Option<usize>,
    pub detail_view: DetailViewType,
    pub confirmation_dialog: Option<ConfirmationDialog>,
}

impl App {
    pub fn handle_key_event(&mut self, key: KeyEvent) -> AppAction {
        if self.confirmation_dialog.is_some() {
            return self.handle_dialog_key(key);
        }
        // Normal key handling
    }
}
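The examples above cover state and key handling but not the loop that drives them. The sketch below is a hypothetical version of such a polling loop (ui::render, the refresh interval, and the exact App methods are assumptions, not Torc's actual code): draw a frame, wait briefly for input via crossterm's event::poll, and make blocking API calls between render cycles.

use std::time::{Duration, Instant};
use crossterm::event::{self, Event};
use ratatui::{backend::Backend, Terminal};

fn run<B: Backend>(app: &mut App, terminal: &mut Terminal<B>) -> anyhow::Result<()> {
    let refresh_interval = Duration::from_secs(2); // assumed polling cadence
    let mut last_refresh = Instant::now();
    loop {
        // Render the current application state.
        terminal.draw(|frame| ui::render(frame, app))?;

        // Wait up to 250 ms for a key event so the loop stays responsive.
        if event::poll(Duration::from_millis(250))? {
            if let Event::Key(key) = event::read()? {
                if matches!(app.handle_key_event(key), AppAction::Quit) {
                    return Ok(());
                }
            }
        }

        // Blocking API calls happen here, between render cycles.
        if last_refresh.elapsed() >= refresh_interval {
            app.refresh_data();
            last_refresh = Instant::now();
        }
    }
}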
Dashboard (Web UI)
Location: torc-dash/src/
Design Pattern: Axum web server using the Rust client library
Key Design Decisions:
- Shared Client Library: The dashboard uses the same Rust API client (src/client/apis/) as the CLI, TUI, and MCP server, ensuring consistent behavior across all Rust-based interfaces.
- Embedded Assets: Static files (HTML, CSS, JS) are embedded at compile time using rust-embed, producing a single binary for deployment.
- Server-Sent Events: Real-time updates use SSE for workflow status changes, avoiding the complexity of WebSocket state management (see the sketch below).
- Separate Binary: The dashboard runs as torc-dash, not as part of the main torc binary, allowing independent deployment and scaling.
API Integration Pattern:
async fn list_workflows(
    State(state): State<AppState>,
) -> Result<Json<Vec<WorkflowModel>>, StatusCode> {
    let workflows = default_api::list_workflows(&state.config, None, None, None, None)
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
    Ok(Json(workflows.items))
}
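The Server-Sent Events decision above can be sketched as an additional Axum handler. This is a hypothetical example, not the dashboard's actual endpoint: it assumes the generated models implement Serialize, polls the API every couple of seconds, and pushes the result to the browser as a named SSE event.

use std::{convert::Infallible, time::Duration};
use axum::{extract::State, response::sse::{Event, KeepAlive, Sse}};
use futures::stream::{self, Stream};

// Hypothetical SSE endpoint for live workflow status updates.
async fn workflow_events(
    State(state): State<AppState>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let stream = stream::unfold(state, |state| async move {
        tokio::time::sleep(Duration::from_secs(2)).await;
        // Blocking client call; a production handler would likely offload
        // this with tokio::task::spawn_blocking.
        let payload = match default_api::list_workflows(&state.config, None, None, None, None) {
            Ok(page) => serde_json::to_string(&page.items).unwrap_or_default(),
            Err(e) => format!(r#"{{"error":"{}"}}"#, e),
        };
        Some((Ok(Event::default().event("status").data(payload)), state))
    });
    Sse::new(stream).keep_alive(KeepAlive::default())
}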
MCP Server (AI Assistant Interface)
Location: torc-mcp-server/src/
Design Pattern: Tool-based RPC with structured outputs
Key Design Decisions:
- Blocking Client, Async Transport: The MCP server creates a blocking reqwest client before spawning the tokio runtime. This avoids nested-runtime issues when the MCP transport is async but the Torc client expects blocking calls.
- Structured JSON Responses: Tool outputs are JSON objects with consistent fields, making them easy for AI models to parse and reason about.
- Error as Content: Errors are returned as structured content (not transport failures), giving AI assistants context to retry or explain failures.
- Operation Scoping: Tools are scoped to common high-level operations (list workflows, get status, run workflow) rather than exposing every API endpoint.
Tool Implementation:
pub fn list_workflows(config: &Configuration) -> Result<CallToolResult, McpError> {
    let workflows = default_api::list_workflows(config, None, None, None, None)
        .map_err(|e| McpError::internal_error(
            format!("Failed to list workflows: {}", e),
            None,
        ))?;

    let summary: Vec<_> = workflows.items.iter().map(|w| json!({
        "id": w.id,
        "name": w.name,
        "status": format!("{:?}", w.status),
    })).collect();

    Ok(CallToolResult::success(vec![
        Content::text(serde_json::to_string_pretty(&summary).unwrap()),
    ]))
}
Python/Julia API Clients
Location: Generated in python_client/ and julia_client/
Design Pattern: OpenAPI-generated clients with language-idiomatic wrappers
Key Design Decisions:
- Generated Code: Core API clients are generated from api/openapi.yaml using openapi-generator. This ensures type safety and automatic updates when the API changes.
- No Manual Edits: Generated files in openapi_client/ directories should never be manually edited. Customizations go in wrapper modules.
- Synchronous and Async: The Python client supports both sync and async usage patterns via the generated client's configuration.
Regeneration Workflow:
cd api
bash make_api_clients.sh # Regenerates both Python and Julia clients
Alternatives Considered
GraphQL Instead of REST
Rejected because:
- REST's simplicity matches Torc's CRUD-heavy operations
- OpenAPI provides excellent code generation for multiple languages
- Pagination and filtering are well-handled by query parameters
- GraphQL's flexibility isn't needed for the fixed data model
Single Unified Binary
Rejected because:
- The dashboard has significant web dependencies (static assets, tower middleware)
- Separate binaries allow independent deployment
- Feature flags still provide unified builds when desired
gRPC for Internal Communication
Rejected because:
- HTTP/JSON is more debuggable and accessible
- Browser-based dashboard would need a proxy anyway
- Python/Julia clients benefit from REST's simplicity
Implementation Guidelines
When adding a new feature that should be exposed to users:
- Start with the API: Define the endpoint in api/openapi.yaml with proper schemas
- Implement server-side: Add handler in src/server/api/
- Regenerate clients: Run api/make_api_clients.sh
- Add CLI command: Create handler in src/client/commands/
- Update TUI if applicable: Add to relevant view in src/tui/
- Update Dashboard if applicable: Add route in torc-dash/src/
- Add MCP tool if user-facing: Add tool function in torc-mcp-server/src/
- Document in all interfaces: Update reference docs for each affected interface
Summary
Torc's multi-interface architecture achieves consistency through:
- Centralized business logic in the server
- Generated API clients from a single OpenAPI spec
- Interface-specific adapters that translate between user expectations and API semantics
- Consistent data models shared across all implementations
Contributing
Contributions to Torc are welcome! This guide will help you get started.
Development Setup
- Fork and clone the repository:
git clone https://github.com/your-username/torc.git
cd torc
- Install Rust and dependencies:
Make sure you have Rust 1.70 or later installed:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- Install SQLx CLI:
cargo install sqlx-cli --no-default-features --features sqlite
- Set up the database:
# Create .env file
echo "DATABASE_URL=sqlite:torc.db" > .env
# Run migrations
sqlx migrate run
- Build and test:
cargo build
cargo test
Making Changes
Code Style
Run formatting and linting before committing:
# Format code
cargo fmt
# Run clippy
cargo clippy --all --all-targets --all-features -- -D warnings
# Run all checks
cargo fmt --check && cargo clippy --all --all-targets --all-features -- -D warnings
Adding Tests
All new functionality should include tests:
# Run specific test
cargo test test_name -- --nocapture
# Run with logging
RUST_LOG=debug cargo test -- --nocapture
Database Migrations
If you need to modify the database schema:
# Create new migration
sqlx migrate add <migration_name>
# Edit the generated SQL file in migrations/
# Run migration
sqlx migrate run
# To revert
sqlx migrate revert
Submitting Changes
- Create a feature branch:
git checkout -b feature/my-new-feature
- Make your changes and commit:
git add .
git commit -m "Add feature: description"
- Ensure all tests pass:
cargo test
cargo fmt --check
cargo clippy --all-targets --all-features -- -D warnings
- Push to your fork:
git push origin feature/my-new-feature
- Open a Pull Request:
Go to the original repository and open a pull request with:
- Clear description of changes
- Reference to any related issues
- Test results
Pull Request Guidelines
- Keep PRs focused - One feature or fix per PR
- Add tests - All new code should be tested
- Update documentation - Update README.md, DOCUMENTATION.md, or inline docs as needed
- Follow style guidelines - Run cargo fmt and cargo clippy
- Write clear commit messages - Describe what and why, not just how
Areas for Contribution
High Priority
- Performance optimizations for large workflows
- Additional job runner implementations (Kubernetes, etc.)
- Improved error messages and logging
- Documentation improvements
Features
- Workflow visualization tools
- Job retry policies and error handling
- Workflow templates and libraries
- Integration with external systems
Testing
- Additional integration tests
- Performance benchmarks
- Stress testing with large workflows
Code of Conduct
Be respectful and constructive in all interactions. We're all here to make Torc better.
Questions?
- Open an issue for bugs or feature requests
- Start a discussion for questions or ideas
- Check existing issues and discussions first
License
By contributing, you agree that your contributions will be licensed under the BSD 3-Clause License.
Rust Developer Style Guide
This guide establishes coding standards, conventions, and workflows for Rust developers contributing to Torc. Following these guidelines ensures consistency across the codebase and streamlines the review process.
Pre-commit Hooks
Always rely on the pre-commit hooks provided. The repository uses cargo-husky to install Git
hooks automatically. Before each commit, the following checks run:
cargo fmt -- --check # Rust formatting
cargo clippy --all --all-targets --all-features -- -D warnings
dprint check # Markdown formatting
If any check fails, the commit is blocked. Fix the issues before committing.
Installing Pre-commit Hooks
Hooks are installed automatically when you run cargo build for the first time. If you need to
reinstall them manually:
cargo install cargo-husky
cargo build # Triggers hook installation
Code Formatting
Rust Formatting (rustfmt)
All Rust code must pass cargo fmt --check. Run cargo fmt before committing to auto-format your
code.
Key conventions enforced:
- 4-space indentation
- Max line width of 100 characters
- Consistent brace placement
- Sorted imports
Clippy Compliance
All code must compile without clippy warnings when run with -D warnings:
cargo clippy --all --all-targets --all-features -- -D warnings
Common clippy lints to watch for:
- clippy::unwrap_used - Prefer expect() with descriptive messages or proper error handling
- clippy::clone_on_copy - Avoid cloning Copy types
- clippy::needless_return - Omit unnecessary return keywords
- clippy::redundant_closure - Use method references where possible
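As a hypothetical before/after covering three of these lints (the Job type is a stand-in for a generated model, not real Torc code):

// Stand-in type for illustration; in Torc this would be a generated model.
struct Job {
    name: String,
}

fn format_name(job: &Job) -> String {
    job.name.clone()
}

// Before: trips clippy::unwrap_used, clippy::needless_return,
// and clippy::redundant_closure.
fn summary_before(jobs: &[Job]) -> String {
    let first = jobs.first().unwrap();
    let names: Vec<String> = jobs.iter().map(|j| format_name(j)).collect();
    return format!("{} jobs, starting with {}", names.len(), first.name);
}

// After: descriptive expect(), a function reference, and no trailing return.
fn summary_after(jobs: &[Job]) -> String {
    let first = jobs.first().expect("job list should not be empty");
    let names: Vec<String> = jobs.iter().map(format_name).collect();
    format!("{} jobs, starting with {}", names.len(), first.name)
}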
Markdown Formatting (dprint)
All Markdown files in docs/ must comply with dprint formatting:
dprint check # Verify formatting
dprint fmt # Auto-format
Critical requirement: Maximum line length of 100 characters for all Markdown files.
The dprint.json configuration enforces:
{
"lineWidth": 100,
"markdown": {
"lineWidth": 100,
"textWrap": "always"
}
}
Documentation Standards
All features must be documented in Markdown in the docs/ directory following the
Diataxis framework:
Diataxis Categories
| Category | Location | Purpose |
|---|---|---|
| Tutorials | docs/src/tutorials/ | Learning-oriented, step-by-step lessons |
| How-To Guides | docs/src/how-to/ | Task-oriented, problem-solving guides |
| Explanation | docs/src/explanation/ | Understanding-oriented, conceptual discussions |
| Reference | docs/src/reference/ | Information-oriented, technical descriptions |
Design Documentation
Significant design choices must be documented in docs/src/explanation/design/. Each design
document should cover:
- Problem Statement: What problem does this solve?
- Design Goals: What are the requirements and constraints?
- Solution Overview: High-level architecture description
- Implementation Details: Key technical decisions and trade-offs
- Alternatives Considered: What other approaches were evaluated?
Existing design documents include:
- server.md - API handler design and request processing
- database.md - SQLite schema and concurrency model
- dashboard.md - Web dashboard architecture
- recovery.md - Workflow recovery mechanisms
- workflow-graph.md - Dependency graph implementation
Documentation Workflow
- Write documentation alongside code changes
- Add new pages to docs/src/SUMMARY.md
- Run dprint fmt to ensure formatting compliance
- Build and preview with mdbook serve docs/
Testing with rstest
All code must include tests using the rstest library for fixtures and parameterized testing.
Test Organization
tests/
├── common.rs # Shared test utilities and fixtures
├── test_full_workflows.rs # Integration tests
├── test_job_runner.rs # Job runner tests
└── scripts/ # Helper scripts for tests
Common Patterns
Fixture Pattern:
use rstest::rstest;
use serial_test::serial;

mod common;
use common::{start_server, ServerProcess};

#[rstest]
#[serial]
fn test_workflow_creation(start_server: &ServerProcess) {
    let config = &start_server.config;
    // Test code using the server fixture
}
Parameterized Tests:
#[rstest]
#[case(0, "immediate")]
#[case(60, "one_minute")]
#[case(3600, "one_hour")]
#[serial]
fn test_timeout_handling(#[case] timeout_secs: u64, #[case] description: &str) {
    // Test runs once for each case
}
Shared Test Utilities (tests/common.rs):
pub struct ServerProcess {
    pub config: Configuration,
    child: std::process::Child,
}

impl Drop for ServerProcess {
    fn drop(&mut self) {
        // Automatic cleanup on test completion
        let _ = self.child.kill();
    }
}

pub fn start_server() -> ServerProcess {
    let port = find_available_port();
    // Start server and wait for readiness
}
Test Guidelines
- Use #[serial] for integration tests that share resources (ports, database)
- Use descriptive expect() messages instead of .unwrap()
- Clean up resources using the Drop trait or explicit cleanup functions
- Test error conditions, not just happy paths
- Keep tests focused - one behavior per test function
HTTP API Changes
Changes to the HTTP API require updating the OpenAPI specification and regenerating client libraries.
Workflow
- Modify the OpenAPI spec:
  # Edit the specification
  vim api/openapi.yaml
- Regenerate API clients:
  cd api
  bash make_api_clients.sh
  This regenerates:
  - Python client: python_client/src/torc/openapi_client/
  - Julia client: julia_client/Torc/src/api/
- Update Rust client code: The Rust client in src/client/apis/ is generated separately. After modifying the OpenAPI spec, regenerate it with the OpenAPI generator for Rust.
- Test all clients:
  # Rust
  cargo test
  # Python
  cd python_client && pytest
  # Julia
  julia --project=julia_client/Torc -e "import Pkg; Pkg.test()"
OpenAPI Conventions
- Use descriptive operationId values (e.g., create_workflow, list_jobs)
- Include comprehensive request/response schemas
- Document all parameters with descriptions
- Use appropriate HTTP status codes (200, 400, 404, 500)
Feature Implementation Across Interfaces
When implementing a user-facing feature, ensure it is exposed through the appropriate interfaces. The following table shows where features should be implemented:
| Interface | Location | Primary Use Case |
|---|---|---|
| CLI | src/client/commands/ | Command-line automation, scripting |
| HTTP API | api/openapi.yaml | Python/Julia integration, external tools |
| Dashboard | torc-dash/src/ | Web-based monitoring and management |
| TUI | src/tui/ | Interactive terminal monitoring |
| MCP Server | torc-mcp-server/src/ | AI assistant integration |
CLI Implementation
Commands are implemented using clap with subcommand enums:
// In src/client/commands/<feature>.rs
#[derive(Subcommand, Debug, Clone)]
pub enum FeatureCommands {
    /// Create a new resource
    Create {
        /// Name of the resource
        #[arg(short, long)]
        name: String,
    },
    /// List all resources
    List {
        #[arg(long, default_value = "table")]
        format: String,
    },
}

pub fn handle_feature_commands(
    config: &Configuration,
    command: &FeatureCommands,
    format: &str,
) {
    match command {
        FeatureCommands::Create { name } => handle_create(config, name, format),
        FeatureCommands::List { format: fmt } => handle_list(config, fmt),
    }
}
CLI Conventions:
- Support both --format table and --format json output
- Use tabled for table formatting, with #[tabled(rename = "...")] for column headers
- Include pagination support via --offset and --limit flags (see the sketch after this list)
- Provide helpful error messages with context
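A hypothetical argument struct tying these conventions together might look like the following (field names, defaults, and the environment fallback are illustrative, not the exact definitions in src/client/commands/; clap's env attribute requires the crate's env feature):

use clap::Args;

// Illustrative only; not the actual Torc argument definitions.
#[derive(Args, Debug, Clone)]
pub struct CommonListArgs {
    /// Server URL; falls back to the TORC_API_URL environment variable.
    #[arg(long, env = "TORC_API_URL")]
    pub url: Option<String>,

    /// Output format: "table" (human-readable) or "json" (machine-parseable).
    #[arg(long, default_value = "table")]
    pub format: String,

    /// Pagination offset, mirroring the API's pagination model.
    #[arg(long, default_value_t = 0)]
    pub offset: u32,

    /// Maximum number of items to return.
    #[arg(long, default_value_t = 100)]
    pub limit: u32,
}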
HTTP API (Python/Julia)
After updating api/openapi.yaml, the Python and Julia clients are auto-generated. Ensure:
- All new endpoints have proper request/response schemas
- Query parameters are documented
- Error responses are specified
- Run make_api_clients.sh to regenerate clients
Dashboard (torc-dash)
The dashboard is an Axum-based web server with embedded static assets:
// In torc-dash/src/main.rs
async fn handle_feature_list(
    State(state): State<AppState>,
) -> Result<Json<Vec<Feature>>, StatusCode> {
    // Proxy request to Torc API server
    let features = state.client
        .get(&format!("{}/features", state.api_url))
        .send()
        .await
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?
        .json()
        .await
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
    Ok(Json(features))
}
Dashboard Conventions:
- Proxy API requests to the Torc server
- Use Axum extractors for request handling
- Return JSON for API endpoints
- Serve static files for the frontend
TUI (Terminal User Interface)
The TUI uses ratatui with a component-based architecture:
// In src/tui/app.rs
pub struct App {
    pub workflows: Vec<WorkflowModel>,
    pub selected_workflow: Option<usize>,
    pub detail_view: DetailViewType,
}

impl App {
    pub fn handle_key_event(&mut self, key: KeyEvent) -> AppAction {
        match key.code {
            KeyCode::Enter => self.select_current(),
            KeyCode::Char('r') => self.refresh_data(),
            KeyCode::Char('q') => AppAction::Quit,
            _ => AppAction::None,
        }
    }
}
TUI Conventions:
- Use anyhow::Result for error handling
- Separate state (app.rs), rendering (ui.rs), and API calls (api.rs)
- Support keyboard navigation with vim-style bindings
- Display confirmation dialogs for destructive actions
MCP Server (AI Assistant)
The MCP server exposes tools for AI assistants:
// In torc-mcp-server/src/main.rs
pub fn get_workflow_status(
    config: &Configuration,
    workflow_id: i64,
) -> Result<CallToolResult, McpError> {
    let workflow = default_api::get_workflow(config, workflow_id)
        .map_err(|e| internal_error(format!("Failed to get workflow: {}", e)))?;

    let result = serde_json::json!({
        "workflow_id": workflow.id,
        "name": workflow.name,
        "status": workflow.status,
    });

    Ok(CallToolResult::success(vec![
        rmcp::model::Content::text(serde_json::to_string_pretty(&result).unwrap_or_default()),
    ]))
}
MCP Conventions:
- Return structured JSON for tool results
- Use descriptive error messages via McpError
- Support common workflow operations (list, status, run, cancel)
- Keep tool descriptions clear for AI consumption
Error Handling Strategy
Application Code (CLI, TUI, binaries)
Use anyhow::Result for flexible error handling:
use anyhow::{Context, Result};

pub fn run_workflow(path: &Path) -> Result<()> {
    let spec = load_spec(path)
        .context("Failed to load workflow specification")?;
    create_workflow(&spec)
        .context("Failed to create workflow")?;
    Ok(())
}
Library Code
Use typed errors with thiserror:
use thiserror::Error;

#[derive(Error, Debug)]
pub enum WorkflowError {
    #[error("Job {job_id} not found in workflow {workflow_id}")]
    JobNotFound { job_id: i64, workflow_id: i64 },

    #[error("Invalid status transition from {from} to {to}")]
    InvalidTransition { from: String, to: String },

    #[error("API error: {0}")]
    ApiError(#[from] reqwest::Error),
}
Test Code
Use .expect() with descriptive messages:
let workflow = create_workflow(&spec)
    .expect("Test workflow creation should succeed");

let job = get_job(config, job_id)
    .expect("Job should exist after creation");
Common Patterns
Configuration Priority
CLI arguments override environment variables, which override config files:
let api_url = cli_args.url
    .or_else(|| env::var("TORC_API_URL").ok())
    .or_else(|| config.client.as_ref()?.api_url.clone())
    .unwrap_or_else(|| "http://localhost:8080/torc-service/v1".to_string());
Table Display
Use the tabled crate for CLI table output:
use tabled::{Table, Tabled};

#[derive(Tabled)]
struct JobRow {
    #[tabled(rename = "ID")]
    id: i64,
    #[tabled(rename = "Name")]
    name: String,
    #[tabled(rename = "Status")]
    status: String,
}

fn display_jobs(jobs: &[JobModel]) {
    let rows: Vec<JobRow> = jobs.iter().map(|j| JobRow {
        id: j.id.unwrap_or(0),
        name: j.name.clone(),
        status: format!("{:?}", j.status),
    }).collect();
    println!("{}", Table::new(rows));
}
Feature Flags
Use Cargo features to conditionally compile components:
# In Cargo.toml
[features]
default = ["client"]
client = ["dep:reqwest", "dep:clap"]
server = ["dep:sqlx", "dep:axum"]
tui = ["client", "dep:ratatui"]

// In code
#[cfg(feature = "client")]
pub mod client;
#[cfg(feature = "server")]
pub mod server;
Async Runtime
Create blocking clients before spawning the async runtime to avoid nested runtime issues:
fn main() -> Result<()> {
    // Create blocking client BEFORE the async runtime
    let client = reqwest::blocking::Client::new();
    let server = MyServer::new(client);

    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .build()?;

    runtime.block_on(async_main(server))
}
Logging
Use tracing for structured logging:
use tracing::{debug, info, warn, error, instrument};

#[instrument(skip(config))]
pub fn process_job(config: &Configuration, job_id: i64) -> Result<()> {
    info!(job_id, "Processing job");
    match run_job(job_id) {
        Ok(result) => {
            debug!(job_id, ?result, "Job completed successfully");
            Ok(())
        }
        Err(e) => {
            error!(job_id, error = %e, "Job failed");
            Err(e)
        }
    }
}
Enable debug logging with:
RUST_LOG=debug cargo run
RUST_LOG=torc=debug,sqlx=warn cargo run # Fine-grained control
Summary Checklist
Before submitting a pull request, verify:
- cargo fmt --check passes
- cargo clippy --all --all-targets --all-features -- -D warnings passes
- dprint check passes (for Markdown changes)
- All tests pass with cargo test
- New features have tests using rstest
- Documentation added in the appropriate Diataxis category
- Design decisions documented in docs/src/explanation/design/ if applicable
- API changes reflected in api/openapi.yaml
- Client libraries regenerated with api/make_api_clients.sh
- Feature exposed through appropriate interfaces (CLI, API, TUI, etc.)