Introduction
Torc is a distributed workflow orchestration system for managing computational pipelines, from simple sets of independent jobs that need to run in parallel to complex workflows with job dependencies, mixed resource requirements, and multiple stages.
Key Features
- Declarative Workflow Definitions — Define workflows in YAML, JSON, JSON5, or KDL
- Automatic Dependency Resolution — Dependencies inferred from file and data relationships
- Distributed Execution — Run jobs across local machines, HPC clusters, and networked compute nodes
- Resource Management — Track CPU and memory usage across all jobs
- Automatic Failure Recovery — Detect OOM/timeout failures and retry with adjusted resources
- Fault Tolerance — Resume workflows after failures without losing progress
- AI-Assisted Configuration and Management — Use AI tools like Claude Code and GitHub Copilot to configure workflows and diagnose problems
Job Parameterization
Create parameter sweeps with simple syntax:
jobs:
- name: job_{index}
command: bash work.sh {index}
parameters:
index: "1:100"
This expands to 100 jobs.
Implicit Dependencies
Dependencies between jobs are automatically inferred from file relationships. Consider this diamond workflow where one job fans out to parallel jobs, which then converge:
name: diamond_workflow
jobs:
- name: preprocess
command: "preprocess.sh -i ${files.input.f1} -o ${files.output.f2} -o ${files.output.f3}"
- name: work1
command: "work.sh -i ${files.input.f2} -o ${files.output.f4}"
- name: work2
command: "work.sh -i ${files.input.f3} -o ${files.output.f5}"
- name: postprocess
command: "postprocess.sh -i ${files.input.f4} -i ${files.input.f5} -o ${files.output.f6}"
files:
- name: f1
path: input.json
- name: f2
path: intermediate_a.json
- name: f3
path: intermediate_b.json
- name: f4
path: result_a.json
- name: f5
path: result_b.json
- name: f6
path: final_output.json
Torc analyzes which jobs produce and consume each file, automatically building the dependency graph:
flowchart TD
f1([input.json])
preprocess[preprocess]
f2([intermediate_a.json])
f3([intermediate_b.json])
work1[work1]
work2[work2]
f4([result_a.json])
f5([result_b.json])
postprocess[postprocess]
f6([final_output.json])
f1 --> preprocess
preprocess --> f2 & f3
f2 --> work1
f3 --> work2
work1 --> f4
work2 --> f5
f4 & f5 --> postprocess
postprocess --> f6
style f1 fill:#d4edda,stroke:#28a745,color:#155724
style f2 fill:#d4edda,stroke:#28a745,color:#155724
style f3 fill:#d4edda,stroke:#28a745,color:#155724
style f4 fill:#d4edda,stroke:#28a745,color:#155724
style f5 fill:#d4edda,stroke:#28a745,color:#155724
style f6 fill:#d4edda,stroke:#28a745,color:#155724
style preprocess fill:#4a9eff,color:#fff
style work1 fill:#4a9eff,color:#fff
style work2 fill:#4a9eff,color:#fff
style postprocess fill:#4a9eff,color:#fff
No explicit depends_on declarations needed — Torc infers that work1 and work2 depend on
preprocess, and postprocess waits for both to complete.
Who Should Use Torc?
Torc is designed for:
- HPC Users who need to parallelize jobs across cluster resources
- Computational Scientists running parameter sweeps and simulations
- Data Engineers building complex data processing pipelines
- ML/AI Researchers managing training workflows and hyperparameter searches
- Anyone who needs reliable, resumable workflow orchestration
Documentation Structure
This documentation is organized to help you find what you need quickly:
Core Documentation (for all users):
- Getting Started — Installation and local quick start
- Core Concepts — Architecture, job states, and dependencies
- Working with Workflows — Creating and managing workflows
- Tutorials — Step-by-step workflow patterns
- Monitoring & Debugging — TUI, reports, and troubleshooting
- Reference — CLI commands, formats, and configuration
Specialized Topics (below the separator in the sidebar):
- HPC & Slurm — Running on compute clusters
- Fault Tolerance & Recovery — Failure handlers, checkpointing, automatic recovery
- Administration & Security — Server deployment, authentication, access control
- Design & Architecture — Internal design for contributors
Next Steps
- New to Torc? Start with Quick Start (Local)
- Want to understand how it works? Read the Architecture Overview
- Ready to create workflows? Jump to Creating Workflows
- On an HPC cluster? See Quick Start (HPC)
Getting Started
Torc is a distributed workflow orchestration system for managing complex computational pipelines with job dependencies, resource requirements, and distributed execution.
Torc uses a client-server architecture where a central server manages workflow state and coordination, while clients create workflows and job runners execute tasks on compute resources.
How You Interact with Torc
Choose the interface that fits your workflow:
- CLI — Primary interface that provides access to all Torc functionality
- Dashboard — Web UI (torc-dash) for visual configuration and monitoring
- TUI — Terminal User Interface (torc tui) for visual monitoring in a terminal
- AI Assistants — Use Claude Code or GitHub Copilot to manage workflows through natural language: "Create a workflow with 10 parallel jobs" or "Why did job 5 fail?"
- Spec Files — YAML, KDL, or JSON5 files for version-controlled workflow definitions
- Python/Julia APIs — Libraries for programmatic workflow generation
All interfaces work with the same server — mix and match as needed.
Architecture
flowchart LR
subgraph you["You"]
CLI["torc CLI"]
end
subgraph server["Server"]
API["torc-server"]
end
subgraph workers["Workers"]
W1["Job Runner"]
W2["Job Runner"]
end
CLI -->|"create workflow"| API
CLI -->|"start workers"| W1
CLI -->|"start workers"| W2
W1 -->|"claim & report"| API
W2 -->|"claim & report"| API
style CLI fill:#4a9eff,color:#fff
style API fill:#28a745,color:#fff
style W1 fill:#ffc107,color:#000
style W2 fill:#ffc107,color:#000
- You create workflows and start job runners via the CLI
- Server tracks workflow state and job dependencies
- Workers poll the server, claim ready jobs, execute them, and report results
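To make the roles concrete, here is an illustrative command sequence, assuming a personal server with a local database and a spec file named workflow.yaml (the commands are introduced in later sections):
# Server: start a personal instance backed by a local SQLite database
torc-server run --database torc.db --completion-check-interval-secs 5
# You (in a new terminal): create the workflow on the server and note the workflow ID it returns
torc workflows create workflow.yaml
# Workers: start a local job runner that claims ready jobs and reports results
torc run <workflow-id>
# Check progress and results
torc jobs list <workflow-id>
torc results list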
Example Files
The repository includes ready-to-run workflow specifications in YAML, JSON5, and KDL formats:
| Example | Description | Tutorial |
|---|---|---|
| diamond_workflow.yaml | Fan-out/fan-in pattern | Diamond Workflow |
| hundred_jobs_parameterized.yaml | 100 parallel jobs via parameterization | Many Jobs |
| hyperparameter_sweep.yaml | ML grid search (3×3×2 = 18 jobs) | Advanced Params |
| multi_stage_barrier_pattern.yaml | Efficient multi-stage workflow | Barriers |
| resource_monitoring_demo.yaml | CPU/memory tracking | — |
| workflow_actions_simple_slurm.yaml | Automated Slurm scheduling | — |
Browse all examples:
See the examples README for the complete list.
Choose Your Execution Mode
Torc supports three fundamentally different execution environments. Choose the one that matches your use case:
Local Execution
Best for: Development, testing, small-scale workflows on your workstation or a single server
- Jobs run directly on the machine where you start the job runner
- No scheduler needed — simple setup with torc run
- Resource management via local CPU/memory/GPU tracking
- → Quick Start (Local)
HPC/Slurm
Best for: Large-scale computations on institutional HPC clusters
- Jobs submitted to Slurm scheduler for compute node allocation
- Automatic resource matching to partitions/QOS
- Built-in profiles for common HPC systems
- → Quick Start (HPC/Slurm)
Remote Workers
Best for: Distributed execution across multiple machines you control via SSH
- Jobs distributed to remote workers over SSH
- No HPC scheduler required — you manage the machines
- Flexible heterogeneous resources (mix of CPU/GPU machines)
- → Quick Start (Remote Workers)
All three modes:
- Share the same workflow specification format
- Use the same server API for coordination
- Support the same monitoring tools (CLI, TUI, Dashboard)
- Can be used together (e.g., develop locally, deploy to HPC)
Continue to the Quick Start guide to run your first workflow.
Installation
Precompiled Binaries (Recommended)
1. Download the appropriate archive for your platform from the releases page:
   - Linux: torc-<version>-x86_64-unknown-linux-gnu.tar.gz
   - macOS (Intel): torc-<version>-x86_64-apple-darwin.tar.gz
   - macOS (Apple Silicon): torc-<version>-aarch64-apple-darwin.tar.gz
2. Extract the archive:
   # For .tar.gz files
   tar -xzf torc-<version>-<platform>.tar.gz
   # For .zip files
   unzip torc-<version>-<platform>.zip
3. Add the binaries to a directory in your system PATH:
   # Option 1: Copy to an existing PATH directory
   cp torc* ~/.local/bin/
   # Option 2: Add the extracted directory to your PATH
   export PATH="/path/to/extracted/torc:$PATH"
   To make the PATH change permanent, add the export line to your shell configuration file (~/.bashrc, ~/.zshrc, etc.).
macOS users: The precompiled binaries are not signed with an Apple Developer certificate. macOS Gatekeeper will block them by default. To allow the binaries to run, remove the quarantine attribute after downloading:
xattr -cr /path/to/torc*
Alternatively, you can right-click each binary and select "Open" to add a security exception.
Site-Specific Installations
Some HPC facilities maintain pre-installed Torc binaries and shared servers. Check if your site is listed below.
NREL Kestrel
Pre-installed binaries are available at:
/scratch/dthom/torc/
├── 0.8.0/
├── ...
└── latest -> 0.8.0 (symlink to current version)
Recommended: Use the latest directory. Torc maintains backwards compatibility, so you'll automatically receive updates and bug fixes without changing your configuration.
Add to your PATH:
export PATH="/scratch/dthom/torc/latest:$PATH"
Or add to your ~/.bashrc for persistence:
echo 'export PATH="/scratch/dthom/torc/latest:$PATH"' >> ~/.bashrc
Shared server: A torc-server instance runs on a dedicated VM within the Kestrel environment.
Contact Daniel Thom for access credentials and the server URL. Once you have access:
export TORC_API_URL="http://<server-address>/torc-service/v1"
export TORC_PASSWORD="<your-password>"
Building from Source
Prerequisites
- Rust 1.70 or later
- SQLite 3.35 or later (usually included with Rust via sqlx)
Clone the Repository
git clone https://github.com/NREL/torc.git
cd torc
Building All Components
Note that the file .env sets the database URL to ./db/sqlite/dev.db. Change it as desired or set the DATABASE_URL environment variable.
Initialize the database
# Install sqlx-cli if needed
cargo install sqlx-cli --no-default-features --features sqlite
sqlx database setup
Build everything (server, client, dashboard, job runners):
# Development build
cargo build --workspace
# Release build (optimized, recommended)
cargo build --workspace --release
Build individual components:
# Server
cargo build --release -p torc-server
# Client CLI
cargo build --release -p torc
# Web Dashboard
cargo build --release -p torc-dash
# Slurm job runner
cargo build --release -p torc-slurm-job-runner
Binaries will be in target/release/.
Required: Add this directory to your system PATH or copy the binaries to a directory already on your PATH (e.g., ~/.local/bin/).
Python Client
The Python client provides programmatic workflow management for Python users.
Prerequisites
- Python 3.11 or later
Installation
pip install torc-client
The pytorc command will be available after installation.
Julia Client
The Julia client provides programmatic workflow management for Julia users.
Prerequisites
- Julia 1.10 or later
Installation
Since the package is not yet registered in the Julia General registry, install it directly from GitHub:
using Pkg
Pkg.add(url="https://github.com/NREL/torc.git", subdir="julia_client/Torc")
Then use it in your code:
using Torc
For Developers
Running Tests
# Run all tests
cargo test -- --test-threads=1
# Run specific test
cargo test --test test_workflow_manager test_initialize_files_with_updated_files
# Run with debug logging
RUST_LOG=debug cargo test -- --nocapture
Setting Up the Server
Start the server:
# Development mode
cargo run -p torc-server -- run
# Production mode (release build)
./target/release/torc-server run
# Custom port
./target/release/torc-server run --port 8080
Server will start on http://localhost:8080.
When running small workflows for testing and demonstration purposes, we recommend setting --completion-check-interval-secs so that the server detects job completions faster than the default of 30 seconds:
./target/release/torc-server run --completion-check-interval-secs 5
Quick Start (Local)
This guide walks you through creating and running your first Torc workflow with local execution. Jobs run directly on the current machine, making this ideal for testing, development, or non-HPC environments.
For running workflows on HPC clusters with Slurm, see Quick Start (HPC).
Start the Server
Start a Torc server with a local database. Setting --completion-check-interval-secs ensures job
completions are processed quickly (use this for personal servers, not shared deployments).
torc-server run --database torc.db --completion-check-interval-secs 5
Test the Connection
In a new terminal, verify the client can connect:
torc workflows list
Create a Workflow
Save this as workflow.yaml:
name: hello_world
description: Simple hello world workflow
jobs:
- name: job 1
command: echo "Hello from torc!"
- name: job 2
command: echo "Hello again from torc!"
Note: Torc also accepts .json, .json5, and .kdl workflow specifications. See Workflow Specification Formats for details.
Run the Workflow
Run jobs locally with a short poll interval for demo purposes:
torc run workflow.yaml --poll-interval 1
This creates the workflow, initializes it, and runs all jobs on the current machine.
View Results
torc results list
Or use the TUI for an interactive view:
torc tui
Example: Diamond Workflow
A workflow with fan-out and fan-in dependencies:
name: diamond_workflow
description: Example workflow with implicit dependencies
jobs:
- name: preprocess
command: "bash tests/scripts/preprocess.sh -i ${files.input.f1} -o ${files.output.f2} -o ${files.output.f3}"
- name: work1
command: "bash tests/scripts/work.sh -i ${files.input.f2} -o ${files.output.f4}"
- name: work2
command: "bash tests/scripts/work.sh -i ${files.input.f3} -o ${files.output.f5}"
- name: postprocess
command: "bash tests/scripts/postprocess.sh -i ${files.input.f4} -i ${files.input.f5} -o ${files.output.f6}"
files:
- name: f1
path: f1.json
- name: f2
path: f2.json
- name: f3
path: f3.json
- name: f4
path: f4.json
- name: f5
path: f5.json
- name: f6
path: f6.json
Dependencies are automatically inferred from file inputs/outputs:
- work1 and work2 wait for preprocess (they depend on its output files)
- postprocess waits for both work1 and work2 to complete
More Examples
The examples directory contains many more workflow examples in YAML, JSON5, and KDL formats.
Next Steps
- Quick Start (HPC) - Run workflows on Slurm clusters
- Creating Workflows - Detailed workflow creation guide
- Terminal UI - Interactive workflow monitoring
Quick Start (HPC)
This guide walks you through running your first Torc workflow on an HPC cluster with Slurm. Jobs are submitted to Slurm and run on compute nodes.
For local execution (testing, development, or non-HPC environments), see Quick Start (Local).
Prerequisites
- Access to an HPC cluster with Slurm
- A Slurm account/allocation for submitting jobs
- Torc installed (see Installation)
Start the Server
On the login node, start a Torc server with a local database:
torc-server run --database torc.db --completion-check-interval-secs 5
Note: For larger deployments, your team may provide a shared Torc server. In that case, skip this step and set
TORC_API_URL to the shared server address.
Check Your HPC Profile
Torc includes built-in profiles for common HPC systems. Check if your system is detected:
torc hpc detect
If detected, you'll see your HPC system name. To see available partitions:
torc hpc partitions <profile-name>
Note: If your HPC system isn't detected, see Custom HPC Profile or request built-in support.
Create a Workflow with Resource Requirements
Save this as workflow.yaml:
name: hpc_hello_world
description: Simple HPC workflow
resource_requirements:
- name: small
num_cpus: 4
memory: 8g
runtime: PT30M
jobs:
- name: job1
command: echo "Hello from compute node!" && hostname
resource_requirements: small
- name: job2
command: echo "Hello again!" && hostname
resource_requirements: small
depends_on: [job1]
Key differences from local workflows:
- resource_requirements: Define CPU, memory, and runtime needs
- Jobs reference these requirements by name
- Torc matches requirements to appropriate Slurm partitions
Submit the Workflow
Submit with your Slurm account:
torc submit-slurm --account <your-account> workflow.yaml
Torc will:
- Detect your HPC system
- Match job requirements to appropriate partitions
- Generate Slurm scheduler configurations
- Create and submit the workflow
Monitor Progress
Check workflow status:
torc workflows list
torc jobs list <workflow-id>
Or use the interactive TUI:
torc tui
Check Slurm queue:
squeue --me
View Results
Once jobs complete:
torc results list <workflow-id>
Job output is stored in the output/ directory by default.
Example: Multi-Stage Pipeline
A more realistic workflow with different resource requirements per stage:
name: analysis_pipeline
description: Data processing pipeline
resource_requirements:
- name: light
num_cpus: 4
memory: 8g
runtime: PT30M
- name: compute
num_cpus: 32
memory: 64g
runtime: PT2H
- name: gpu
num_cpus: 8
num_gpus: 1
memory: 32g
runtime: PT1H
jobs:
- name: preprocess
command: python preprocess.py
resource_requirements: light
- name: train
command: python train.py
resource_requirements: gpu
depends_on: [preprocess]
- name: evaluate
command: python evaluate.py
resource_requirements: compute
depends_on: [train]
Torc stages resource allocation based on dependencies:
- preprocess resources are allocated at workflow start
- train resources are allocated when preprocess completes
- evaluate resources are allocated when train completes
This prevents wasting allocation time on resources that aren't needed yet.
Preview Before Submitting
For production workflows, preview the generated Slurm configuration first:
torc slurm generate --account <your-account> workflow.yaml
This shows what schedulers and actions Torc will create without submitting anything.
Next Steps
- Slurm Workflows — How Torc manages Slurm
- Resource Requirements — All resource options
- HPC Profiles — Managing HPC configurations
- Working with Slurm — Advanced Slurm configuration
- Debugging Slurm Workflows — Troubleshooting
Quick Start (Remote Workers)
This guide walks you through running a Torc workflow on multiple remote machines via SSH. Jobs are distributed across workers without requiring an HPC scheduler like Slurm.
For local execution, see Quick Start (Local). For HPC/Slurm execution, see Quick Start (HPC).
Prerequisites
- SSH key-based authentication to all remote machines (no password prompts)
- Torc installed on all machines with matching versions
- Torc server accessible from all machines
Start the Server
Start a Torc server that's accessible from the remote machines. This typically means binding to a network interface (not just localhost):
torc-server run --database torc.db --host 0.0.0.0 --port 8080
Create a Worker File
Create a file listing the remote machines. Each line contains one machine in the format
[user@]hostname[:port]:
# workers.txt
worker1.example.com
alice@worker2.example.com
admin@192.168.1.10:2222
Lines starting with # are comments. Empty lines are ignored.
Create a Workflow
Save this as workflow.yaml:
name: distributed_hello
description: Distributed hello world workflow
jobs:
- name: job 1
command: echo "Hello from $(hostname)!"
- name: job 2
command: echo "Hello again from $(hostname)!"
- name: job 3
command: echo "And once more from $(hostname)!"
Create the Workflow on the Server
torc workflows create workflow.yaml
Note the workflow ID in the output.
Run Workers on Remote Machines
Start workers on all remote machines. Each worker will poll for available jobs and execute them:
torc remote run --workers workers.txt <workflow-id> --poll-interval 5
This will:
- Check SSH connectivity to all machines
- Verify all machines have the same torc version
- Start a worker process on each machine (detached via nohup)
- Report which workers started successfully
Check Worker Status
Monitor which workers are still running:
torc remote status <workflow-id>
View Workflow Progress
Check job status from any machine:
torc jobs list <workflow-id>
Or use the interactive TUI:
torc tui
Collect Logs
After the workflow completes, collect logs from all workers:
torc remote collect-logs <workflow-id> --local-output-dir ./logs
This creates a tarball for each worker containing:
- Worker logs: torc_worker_<workflow_id>.log
- Job stdout/stderr: job_stdio/job_*.o and job_stdio/job_*.e
- Resource utilization data (if enabled): resource_utilization/resource_metrics_*.db
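As a rough sketch, assuming the tarballs were written to ./logs as in the command above (exact archive names and internal layout may vary), you can unpack and skim them with standard tools:
# Extract every collected archive in place
for archive in ./logs/*.tar*; do
    tar -xf "$archive" -C ./logs
done
# Locate job stderr files across all workers
find ./logs -name 'job_*.e'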
Stop Workers
If you need to stop workers before the workflow completes:
torc remote stop <workflow-id>
Add --force to send SIGKILL instead of SIGTERM.
Next Steps
- Remote Workers Guide - Detailed configuration and troubleshooting
- Creating Workflows - Workflow specification format
- Resource Monitoring - Track CPU/memory usage per job
Core Concepts
This section covers the fundamental concepts you need to understand to work effectively with Torc.
- Architecture Overview - How Torc's components work together
- Workflow Definition - Structure of workflows and jobs
- Job State Transitions - How jobs move through their lifecycle
- Dependency Resolution - How job dependencies are managed
- Job Runners - Executing jobs locally and on clusters
- Parallelization Strategies - Running jobs in parallel
- Workflow Reinitialization - Restarting workflows
Architecture
Overview
Torc uses a client-server architecture where a central server manages workflow state and coordination, while clients create workflows and job runners execute tasks on compute resources.
flowchart TB
subgraph ui["User Interfaces"]
DASH["torc-dash<br/>(web)"]
TUI["torc tui<br/>(terminal)"]
CLI["torc CLI"]
end
subgraph server["Server (torc-server)"]
API["HTTP API"]
DB[(SQLite)]
API <--> DB
end
subgraph workers["Job Runners"]
W1["Runner 1"]
W2["Runner 2"]
WN["Runner N"]
end
DASH -->|"HTTP"| API
DASH -->|"executes"| CLI
TUI -->|"HTTP"| API
TUI -->|"executes"| CLI
CLI -->|"HTTP"| API
W1 -->|"claim jobs"| API
W2 -->|"claim jobs"| API
WN -->|"claim jobs"| API
style DASH fill:#17a2b8,color:#fff
style TUI fill:#17a2b8,color:#fff
style CLI fill:#4a9eff,color:#fff
style API fill:#28a745,color:#fff
style DB fill:#28a745,color:#fff
style W1 fill:#ffc107,color:#000
style W2 fill:#ffc107,color:#000
style WN fill:#ffc107,color:#000
Key Components:
| Component | Description |
|---|---|
| torc-dash | Web dashboard for visual workflow management |
| torc tui | Terminal UI for monitoring in SSH environments |
| torc CLI | Command-line tool for all workflow operations |
| Server | HTTP API service that manages workflow state via SQLite |
| Job Runner | Worker process that polls for ready jobs, executes them, and reports results |
Workflow Definition
A workflow is a collection of jobs with dependencies. You define workflows in YAML, JSON, JSON5, or KDL files.
Minimal Example
name: hello_world
jobs:
- name: greet
command: echo "Hello, World!"
That's it. One job, no dependencies.
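Assuming a server is running locally and the spec above is saved as hello.yaml (an arbitrary file name), you can create and execute it in one step, as described in Quick Start (Local):
# Create, initialize, and run the workflow on the current machine
torc run hello.yaml --poll-interval 1
# Inspect the results
torc results list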
Jobs with Dependencies
name: two_stage
jobs:
- name: prepare
command: ./prepare.sh
- name: process
command: ./process.sh
depends_on: [prepare]
The process job waits for prepare to complete.
Job Parameterization
Create multiple jobs from a single definition using parameters:
name: parameter_sweep
jobs:
- name: task_{i}
command: ./run.sh --index {i}
parameters:
i: "1:10"
This expands to 10 jobs: task_1, task_2, ..., task_10.
Parameter Formats
| Format | Example | Expands To |
|---|---|---|
| Range | "1:5" | 1, 2, 3, 4, 5 |
| Range with step | "0:10:2" | 0, 2, 4, 6, 8, 10 |
| List | "[a,b,c]" | a, b, c |
| Float range | "0.0:1.0:0.25" | 0.0, 0.25, 0.5, 0.75, 1.0 |
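A quick way to check how parameter strings expand is to validate a spec with --dry-run before creating anything. This sketch uses made-up names (sweep.yaml, n, mode) and combines a stepped range with a list:
# Write a small parameterized spec
cat > sweep.yaml <<'EOF'
name: sweep_demo
jobs:
  - name: run_{n}_{mode}
    command: ./run.sh --n {n} --mode {mode}
    parameters:
      n: "0:10:2"
      mode: "[fast,slow]"
EOF
# Report the expanded job count without creating the workflow on the server
torc workflows create --dry-run sweep.yaml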
Format Specifiers
Control how values appear in names:
- name: job_{i:03d} # job_001, job_002, ...
parameters:
i: "1:100"
- name: lr_{lr:.4f} # lr_0.0010, lr_0.0100, ...
parameters:
lr: "[0.001,0.01,0.1]"
Resource Requirements
Specify what resources each job needs:
name: gpu_workflow
resource_requirements:
- name: gpu_job
num_cpus: 8
num_gpus: 1
memory: 16g
runtime: PT2H
jobs:
- name: train
command: python train.py
resource_requirements: gpu_job
Resource requirements are used for:
- Local execution: ensuring jobs don't exceed available resources
- HPC/Slurm: requesting appropriate allocations
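For example, assuming the spec above is saved as gpu_workflow.yaml and you have a Slurm account, the same file can be executed either way without modification:
# Local: the runner tracks CPU/GPU/memory and only starts jobs that fit
torc run gpu_workflow.yaml
# HPC: requirements are matched to Slurm partitions at submission time
torc submit-slurm --account <your-account> gpu_workflow.yaml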
Complete Example
name: data_pipeline
description: Process data in parallel, then aggregate
resource_requirements:
- name: worker
num_cpus: 4
memory: 8g
runtime: PT1H
jobs:
- name: process_{i}
command: python process.py --chunk {i} --output results/chunk_{i}.json
resource_requirements: worker
parameters:
i: "1:10"
- name: aggregate
command: python aggregate.py --input results/ --output final.json
resource_requirements: worker
depends_on:
- process_{i}
parameters:
i: "1:10"
This creates:
- 10 parallel process_* jobs
- 1 aggregate job that waits for all 10 to complete
Failure Recovery Options
Control how Torc handles job failures:
Default Behavior
By default, a job that fails without a matching failure handler receives the Failed status:
name: my_workflow
jobs:
- name: task
command: ./run.sh # If this fails, status = Failed
AI-Assisted Recovery (Opt-in)
Enable intelligent classification of ambiguous failures:
name: ml_training
use_pending_failed: true # Enable AI-assisted recovery
jobs:
- name: train_model
command: python train.py
With use_pending_failed: true:
- Jobs without matching failure handlers get PendingFailed status
- An AI agent can analyze stderr and decide whether to retry or fail
- See AI-Assisted Recovery for details
See Also
- Workflow Specification Formats — Complete syntax reference
- Job Parameterization — Advanced parameter options
- Dependency Resolution — How dependencies work
Job State Transitions
Jobs progress through a defined lifecycle:
stateDiagram-v2
[*] --> uninitialized
uninitialized --> ready: initialize_jobs
uninitialized --> blocked: has dependencies
blocked --> ready: dependencies met
ready --> pending: runner claims
pending --> running: execution starts
running --> completed: exit 0
running --> failed: exit != 0 (handler match + max retries)
running --> pending_failed: exit != 0 (no handler match)
running --> ready: exit != 0 (failure handler retry)
running --> canceled: user cancels
running --> terminated: system terminates
pending_failed --> failed: AI classifies as permanent
pending_failed --> ready: AI classifies as transient
pending_failed --> uninitialized: reset-status
completed --> [*]
failed --> [*]
canceled --> [*]
terminated --> [*]
classDef waiting fill:#6c757d,color:#fff
classDef ready fill:#17a2b8,color:#fff
classDef active fill:#ffc107,color:#000
classDef success fill:#28a745,color:#fff
classDef error fill:#dc3545,color:#fff
classDef stopped fill:#6f42c1,color:#fff
classDef classification fill:#fd7e14,color:#fff
class uninitialized,blocked waiting
class ready ready
class pending,running active
class completed success
class failed error
class canceled,terminated stopped
class pending_failed classification
State Descriptions
- uninitialized (0) - Job created but dependencies not evaluated
- blocked (1) - Waiting for dependencies to complete
- ready (2) - All dependencies satisfied, ready for execution
- pending (3) - Job claimed by runner
- running (4) - Currently executing
- completed (5) - Finished successfully (exit code 0)
- failed (6) - Finished with error (exit code != 0)
- canceled (7) - Explicitly canceled by user or torc. Never executed.
- terminated (8) - Explicitly terminated by system, such as at wall-time timeout
- pending_failed (10) - Job failed without a matching failure handler. Awaiting AI-assisted classification to determine if the error is transient (retry) or permanent (fail). See AI-Assisted Recovery.
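To inspect and act on these states from the command line, you can combine commands covered elsewhere in this documentation (IDs are placeholders):
# Show the current state of every job in a workflow
torc jobs list <workflow-id>
# Send failed jobs back to ready so they can be retried
torc workflows reset-status <workflow-id> --failed-only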
Dependency Resolution
Torc supports two types of dependencies: explicit (declared via depends_on) and implicit
(inferred from file/data relationships). For a hands-on tutorial, see
Diamond Workflow with File Dependencies.
Explicit Dependencies
Declared via depends_on:
jobs:
- name: preprocess
command: preprocess.sh
- name: analyze
command: analyze.sh
depends_on:
- preprocess
Implicit Dependencies via Files
Torc automatically infers dependencies from file relationships. When one job outputs a file and another job inputs it, the dependency is created automatically:
jobs:
- name: preprocess
command: "preprocess.sh -o ${files.output.intermediate}"
- name: analyze
command: "analyze.sh -i ${files.input.intermediate}"
files:
- name: intermediate
path: data/intermediate.json
This creates the following dependency chain:
flowchart TD
input([input.json])
preprocess[preprocess]
intermediate([intermediate.json])
analyze[analyze]
output([output.json])
input --> preprocess
preprocess --> intermediate
intermediate --> analyze
analyze --> output
style input fill:#d4edda,stroke:#28a745,color:#155724
style intermediate fill:#d4edda,stroke:#28a745,color:#155724
style output fill:#d4edda,stroke:#28a745,color:#155724
style preprocess fill:#4a9eff,color:#fff
style analyze fill:#4a9eff,color:#fff
No explicit depends_on needed — Torc infers that analyze depends on preprocess because they
share the intermediate file.
Implicit Dependencies via User Data
User data works like files but stores JSON in the database instead of the filesystem:
jobs:
- name: generate_config
command: |
torc user-data update ${user_data.output.config} --data '{"lr": 0.001}'
- name: run_simulation
command: |
CONFIG=$(torc user-data get ${user_data.input.config} | jq '.data')
python simulate.py --config "$CONFIG"
user_data:
- name: config
flowchart LR
gen[generate_config]
config[(config)]
sim[run_simulation]
gen -->|writes| config
config -->|reads| sim
style config fill:#fff3cd,stroke:#ffc107,color:#856404
style gen fill:#4a9eff,color:#fff
style sim fill:#4a9eff,color:#fff
Use user data for small configuration objects; use files for large datasets.
Resolution Process
During workflow creation, the server:
- Resolves all names to IDs
- Stores explicit dependencies in job_depends_on
- Stores file/user_data relationships in junction tables
- During initialize_jobs, queries junction tables to add implicit dependencies
Dependency Graph Evaluation
When initialize is called:
- All jobs start in the uninitialized state
- The server builds the complete dependency graph from explicit and implicit dependencies
- Jobs with no unsatisfied dependencies are marked ready
- Jobs waiting on dependencies are marked blocked
- As jobs complete, blocked jobs are re-evaluated and may become ready
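A minimal way to observe this evaluation, assuming a workflow built step by step with the CLI (see How to Create Workflows), is to initialize it and then list job states:
# Evaluate the dependency graph; jobs become ready or blocked
torc workflows initialize-jobs <workflow-id>
# Jobs with no unmet dependencies now show as ready, the rest as blocked
torc jobs list <workflow-id>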
Variable Substitution Syntax
In workflow specification files (YAML, JSON5, KDL), use these patterns to reference files and user data in job commands:
| Pattern | Description |
|---|---|
| ${files.input.NAME} | File path this job reads (creates implicit dependency) |
| ${files.output.NAME} | File path this job writes (satisfies dependencies) |
| ${user_data.input.NAME} | User data this job reads |
| ${user_data.output.NAME} | User data this job writes |
Example:
jobs:
- name: process
command: "python process.py -i ${files.input.raw} -o ${files.output.result}"
See Workflow Specification Formats for complete syntax details.
Job Runners
Job runners are worker processes that execute jobs on compute resources.
Job Runner Modes
Torc supports three execution modes:
- Local Runner (torc run) - Runs jobs on the local machine with resource tracking
- HPC/Slurm Runner (torc submit-slurm) - Runs jobs on Slurm-allocated compute nodes
- Remote Workers (torc remote run) - Distributes jobs across SSH-accessible machines
Local Runner
The local runner executes jobs directly on the current machine. Start it with:
torc run <workflow-id>
HPC/Slurm Runner
For HPC clusters, jobs run on Slurm-allocated compute nodes. The torc-slurm-job-runner binary is
launched by Slurm on each allocated node and polls the server for work.
Remote Workers
Remote workers enable distributed execution without a scheduler. The torc remote run command
SSH-es into multiple machines and starts a torc run process on each:
torc remote run --workers workers.txt <workflow-id>
Each remote worker runs as a detached process and polls the server for jobs, just like the local runner. The server coordinates job distribution to prevent double-allocation.
Job Allocation Strategies
The job runner supports two different strategies for retrieving and executing jobs:
Resource-Based Allocation (Default)
Used when: --max-parallel-jobs is NOT specified
Behavior:
- Retrieves jobs from the server via the claim_jobs_based_on_resources command
- Server filters jobs based on available compute node resources (CPU, memory, GPU)
- Only returns jobs that fit within the current resource capacity
- Prevents resource over-subscription and ensures jobs have required resources
- Defaults to requiring one CPU and 1 MB of memory for each job.
Use cases:
- When you want parallelization based on one CPU per job.
- When you have heterogeneous jobs with different resource requirements and want intelligent resource management.
Example 1: Run jobs at a queue depth equal to num_cpus (each job defaults to one CPU):
parameters:
i: "1:100"
jobs:
- name: "work_{i}"
command: bash my_script.sh {i}
use_parameters:
- i
Example 2: Resource-based parallelization:
resource_requirements:
- name: "work_resources"
num_cpus: 32
memory: "200g"
runtime: "PT4H"
num_nodes: 1
parameters:
i: "1:100"
jobs:
- name: "work_{i}"
command: bash my_script.sh {i}
resource_requirements: work_resources
use_parameters:
- i
Simple Queue-Based Allocation
Used when: --max-parallel-jobs is specified
Behavior:
- Retrieves jobs from the server via the claim_next_jobs command
- Server returns the next N ready jobs from the queue (up to the specified limit)
- Ignores job resource requirements completely
- Simply limits the number of concurrent jobs
Use cases: When all jobs have similar resource needs or when the resource bottleneck is not tracked by Torc, such as network or storage I/O. This is the only way to run jobs at a queue depth higher than the number of CPUs in the worker.
Example:
torc run $WORKFLOW_ID \
--max-parallel-jobs 10 \
--output-dir ./results
Job Runner Workflow
The job runner executes a continuous loop with these steps:
flowchart TD
Start([Start]) --> CheckStatus[Check workflow status]
CheckStatus --> IsComplete{Workflow complete<br/>or canceled?}
IsComplete -->|Yes| End([Exit])
IsComplete -->|No| MonitorJobs[Monitor running jobs]
MonitorJobs --> CompleteFinished[Complete finished jobs<br/>Update server status]
CompleteFinished --> ExecuteActions[Execute workflow actions<br/>e.g., schedule Slurm allocations]
ExecuteActions --> ClaimJobs[Claim new jobs from server]
ClaimJobs --> ResourceCheck{Allocation<br/>strategy?}
ResourceCheck -->|Resource-based| ClaimResources[claim_jobs_based_on_resources<br/>Filter by CPU/memory/GPU]
ResourceCheck -->|Queue-based| ClaimQueue[claim_next_jobs<br/>Up to max-parallel-jobs]
ClaimResources --> StartJobs
ClaimQueue --> StartJobs
StartJobs[Start claimed jobs] --> ForEachJob[For each job:<br/>1. Call start_job<br/>2. Execute command<br/>3. Record stdout/stderr]
ForEachJob --> Sleep[Sleep for poll interval]
Sleep --> CheckStatus
style Start fill:#10b981,stroke:#059669,color:#fff
style End fill:#ef4444,stroke:#dc2626,color:#fff
style IsComplete fill:#f59e0b,stroke:#d97706,color:#fff
style ResourceCheck fill:#f59e0b,stroke:#d97706,color:#fff
style CheckStatus fill:#3b82f6,stroke:#2563eb,color:#fff
style MonitorJobs fill:#3b82f6,stroke:#2563eb,color:#fff
style CompleteFinished fill:#3b82f6,stroke:#2563eb,color:#fff
style ExecuteActions fill:#3b82f6,stroke:#2563eb,color:#fff
style ClaimJobs fill:#3b82f6,stroke:#2563eb,color:#fff
style StartJobs fill:#3b82f6,stroke:#2563eb,color:#fff
style ForEachJob fill:#3b82f6,stroke:#2563eb,color:#fff
style Sleep fill:#6b7280,stroke:#4b5563,color:#fff
style ClaimResources fill:#8b5cf6,stroke:#7c3aed,color:#fff
style ClaimQueue fill:#ec4899,stroke:#db2777,color:#fff
- Check workflow status - Poll server to check if workflow is complete or canceled
- Monitor running jobs - Check status of currently executing jobs
- Execute workflow actions - Check for and execute any pending workflow actions, such as scheduling new Slurm allocations.
- Claim new jobs - Request ready jobs from server based on allocation strategy:
  - Resource-based: claim_jobs_based_on_resources
  - Queue-based: claim_next_jobs
- Start jobs - For each claimed job:
  - Call start_job to mark the job as started in the database
  - Execute the job command in a non-blocking subprocess
  - Record stdout/stderr output to files
- Complete jobs - When running jobs finish:
  - Call complete_job with the exit code and result
  - Server updates the job status and automatically marks dependent jobs as ready
- Sleep and repeat - Wait for job completion poll interval, then repeat loop
The runner continues until the workflow is complete or canceled.
Resource Management (Resource-Based Allocation Only)
When using resource-based allocation (default), the local job runner tracks:
- Number of CPUs in use
- Memory allocated to running jobs
- GPUs in use
- Job runtime limits
When a ready job is retrieved, the runner checks if sufficient resources are available before executing it.
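If you prefer to constrain or override what the runner detects, the capacity flags shown in the parallelization examples later in this document can be passed explicitly (a sketch with arbitrary values):
# Advertise an explicit capacity instead of relying on auto-detection
torc run <workflow-id> --num-cpus 16 --memory-gb 64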
Parallelization Strategies
Torc provides flexible parallelization strategies to accommodate different workflow patterns and resource allocation scenarios. Understanding these strategies helps you optimize job execution for your specific use case.
Overview
Torc supports two primary approaches to parallel job execution:
- Resource-aware allocation - Define per-job resource requirements and let runners intelligently select jobs that fit available resources
- Queue-depth parallelism - Control the number of concurrent jobs without resource tracking
The choice between these approaches depends on your workflow characteristics and execution environment.
Use Case 1: Resource-Aware Job Allocation
This strategy is ideal for heterogeneous workflows where jobs have varying resource requirements (CPU, memory, GPU, runtime). The server intelligently allocates jobs based on available compute node resources.
How It Works
When you define resource requirements for each job:
resource_requirements:
- name: small
num_cpus: 2
num_gpus: 0
memory: 4g
runtime: PT30M
- name: large
num_cpus: 16
num_gpus: 2
memory: 128g
runtime: PT8H
jobs:
- name: preprocessing
command: ./preprocess.sh
resource_requirements: small
- name: model_training
command: python train.py
resource_requirements: large
The job runner automatically detects its available resources and pulls jobs from the server that fit within them:
torc run $WORKFLOW_ID
The server's GET /workflows/{id}/claim_jobs_based_on_resources endpoint:
- Receives the runner's resource capacity
- Queries the ready queue for jobs that fit within those resources
- Returns a set of jobs that can run concurrently without over-subscription
- Updates job status from ready to pending atomically
Job Allocation Ambiguity: Two Approaches
When you have multiple compute nodes or schedulers with different capabilities, there are two ways to handle job allocation:
Approach 1: Sort Method (Flexible but Potentially Ambiguous)
How it works:
- Jobs do NOT specify a particular scheduler/compute node
- The server uses a job_sort_method parameter to prioritize jobs when allocating
- Any runner with sufficient resources can claim any ready job
Available sort methods: define the job_sort_method field in the workflow specification file (YAML/JSON/KDL)
- gpus_runtime_memory - Prioritize jobs by GPU count (desc), then runtime (desc), then memory (desc)
- gpus_memory_runtime - Prioritize jobs by GPU count (desc), then memory (desc), then runtime (desc)
- none - No sorting; jobs are selected in queue order
Tradeoffs:
✅ Advantages:
- Maximum flexibility - any runner can execute any compatible job
- Better resource utilization - if GPU runner is idle, it can pick up CPU-only jobs
- Simpler workflow specifications - no need to explicitly map jobs to schedulers
- Fault tolerance - if one runner fails, others can pick up its jobs
❌ Disadvantages:
- Ambiguity - no guarantee GPU jobs go to GPU runners
- Potential inefficiency - high-memory jobs might land on low-memory nodes if timing is unlucky
- Requires careful sort method selection
- Less predictable job placement
When to use:
- Homogeneous or mostly-homogeneous compute resources
- Workflows where job placement flexibility is valuable
- When you want runners to opportunistically pick up work
- Development and testing environments
Approach 2: Scheduler ID (Deterministic but Less Flexible)
How it works:
- Define scheduler configurations in your workflow spec
- Assign each job a specific scheduler_id
- Runners provide their scheduler_config_id when requesting jobs
- Server only returns jobs matching that scheduler ID
Example workflow specification:
slurm_schedulers:
- name: gpu_cluster
partition: gpu
account: myproject
- name: highmem_cluster
partition: highmem
account: myproject
jobs:
- name: model_training
command: python train.py
resource_requirements: large
slurm_scheduler: gpu_cluster # Binds to specific scheduler
- name: large_analysis
command: ./analyze.sh
resource_requirements: highmem
slurm_scheduler: highmem_cluster
Example runner invocation:
# GPU runner - only pulls jobs assigned to gpu_cluster
torc-slurm-job-runner $WORKFLOW_ID \
--scheduler-config-id 1 \
--num-cpus 32 \
--num-gpus 8
# High-memory runner - only pulls jobs assigned to highmem_cluster
torc-slurm-job-runner $WORKFLOW_ID \
--scheduler-config-id 2 \
--num-cpus 64 \
--memory-gb 512
Tradeoffs:
✅ Advantages:
- Zero ambiguity - jobs always run on intended schedulers
- Predictable job placement
- Prevents GPU jobs from landing on CPU-only nodes
- Clear workflow specification - explicit job→scheduler mapping
- Better for heterogeneous clusters (GPU vs CPU vs high-memory)
❌ Disadvantages:
- Less flexibility - idle runners can't help other queues
- Potential resource underutilization - GPU runner sits idle while CPU queue is full
- More complex workflow specifications
- If a scheduler fails, its jobs remain stuck until that scheduler returns
When to use:
- Highly heterogeneous compute resources (GPU clusters, high-memory nodes, specialized hardware)
- Production workflows requiring predictable job placement
- Multi-cluster environments
- When job-resource matching is critical (e.g., GPU-only codes, specific hardware requirements)
- Slurm or HPC scheduler integrations
Choosing Between Sort Method and Scheduler ID
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| All jobs can run anywhere | Sort method | Maximum flexibility, simplest spec |
| Some jobs need GPUs, some don't | Scheduler ID | Prevent GPU waste on CPU jobs |
| Multi-cluster Slurm environment | Scheduler ID | Jobs must target correct clusters |
| Development/testing | Sort method | Easier to experiment |
| Production with SLAs | Scheduler ID | Predictable resource usage |
| Homogeneous compute nodes | Sort method | No benefit to restricting |
| Specialized hardware (GPUs, high-memory, FPGAs) | Scheduler ID | Match jobs to capabilities |
You can also mix approaches: Use scheduler_id for jobs with strict requirements, leave it NULL
for flexible jobs.
Use Case 2: Queue-Depth Parallelism
This strategy is ideal for workflows with homogeneous resource requirements where you simply want to control the level of parallelism.
How It Works
Instead of tracking resources, you specify a maximum number of concurrent jobs:
torc run $WORKFLOW_ID \
--max-parallel-jobs 10 \
--output-dir ./results
or with Slurm:
torc slurm schedule-nodes $WORKFLOW_ID \
--scheduler-config-id 1 \
--num-hpc-jobs 4 \
--max-parallel-jobs 8
Server behavior:
The GET /workflows/{id}/claim_next_jobs endpoint:
- Accepts a limit parameter specifying the maximum number of jobs to return
- Ignores all resource requirements
- Returns the next N ready jobs from the queue
- Updates their status from ready to pending
Runner behavior:
- Maintains a count of running jobs
- When the count falls below max_parallel_jobs, requests more work
- Does NOT track CPU, memory, GPU, or other resources
- Simply enforces the concurrency limit
Ignoring Resource Consumption
This is a critical distinction: when using --max-parallel-jobs, the runner completely ignores
current resource consumption.
Normal resource-aware mode:
Runner has: 32 CPUs, 128 GB memory
Job A needs: 16 CPUs, 64 GB
Job B needs: 16 CPUs, 64 GB
Job C needs: 16 CPUs, 64 GB
Runner starts Job A and Job B (resources fully allocated)
Job C waits until resources free up
Queue-depth mode with --max-parallel-jobs 3:
Runner has: 32 CPUs, 128 GB memory (IGNORED)
Job A needs: 16 CPUs, 64 GB (IGNORED)
Job B needs: 16 CPUs, 64 GB (IGNORED)
Job C needs: 16 CPUs, 64 GB (IGNORED)
Runner starts Job A, Job B, and Job C simultaneously
Total requested: 48 CPUs, 192 GB (exceeds node capacity!)
System may: swap, OOM, or throttle performance
When to Use Queue-Depth Parallelism
✅ Use queue-depth parallelism when:
- All jobs have similar resource requirements
  # All jobs use ~4 CPUs, ~8GB memory
  jobs:
    - name: process_file_1
      command: ./process.sh file1.txt
    - name: process_file_2
      command: ./process.sh file2.txt
    # ... 100 similar jobs
- Resource requirements are negligible compared to node capacity
- Running 100 lightweight Python scripts on a 64-core machine
- I/O-bound jobs that don't consume much CPU/memory
- Jobs are I/O-bound or sleep frequently
- Data download jobs
- Jobs waiting on external services
- Polling or monitoring tasks
- You want simplicity over precision
- Quick prototypes
- Testing workflows
- Simple task queues
- Jobs self-limit their resource usage
- Application has built-in thread pools
- Container resource limits
- OS-level cgroups or resource controls
❌ Avoid queue-depth parallelism when:
- Jobs have heterogeneous resource requirements
- Mix of 2-CPU and 32-CPU jobs
- Some jobs need 4GB, others need 128GB
- Resource contention causes failures
- Out-of-memory errors
- CPU thrashing
- GPU memory exhaustion
- You need efficient bin-packing
- Maximizing node utilization
- Complex resource constraints
- Jobs are compute-intensive
- CPU-bound numerical simulations
- Large matrix operations
- Video encoding
Queue-Depth Parallelism in Practice
Example 1: Slurm with Queue Depth
# Schedule 4 Slurm nodes, each running up to 8 concurrent jobs
torc slurm schedule-nodes $WORKFLOW_ID \
--scheduler-config-id 1 \
--num-hpc-jobs 4 \
--max-parallel-jobs 8
This creates 4 Slurm job allocations. Each allocation runs a worker that:
- Pulls up to 8 jobs at a time
- Runs them concurrently
- Requests more when any job completes
Total concurrency: up to 32 jobs (4 nodes × 8 jobs/node)
Example 2: Local Runner with Queue Depth
# Run up to 20 jobs concurrently on local machine
torc run $WORKFLOW_ID \
--max-parallel-jobs 20 \
--output-dir ./output
Example 3: Mixed Approach
You can even run multiple runners with different strategies:
# Terminal 1: Resource-aware runner for large jobs
torc run $WORKFLOW_ID \
--num-cpus 32 \
--memory-gb 256
# Terminal 2: Queue-depth runner for small jobs
torc run $WORKFLOW_ID \
--max-parallel-jobs 50
The ready queue serves both runners. The resource-aware runner gets large jobs that fit its capacity, while the queue-depth runner gets small jobs for fast parallel execution.
Performance Characteristics
Resource-aware allocation:
- Query complexity: O(jobs in ready queue)
- Requires computing resource sums
- Slightly slower due to filtering and sorting
- Better resource utilization
Queue-depth allocation:
- Query complexity: O(1) with limit
- Simple LIMIT clause, no resource computation
- Faster queries
- Simpler logic
For workflows with thousands of ready jobs, queue-depth allocation has lower overhead.
Best Practices
- Start with resource-aware allocation for new workflows
- Better default behavior
- Prevents resource over-subscription
- Easier to debug resource issues
- Use scheduler_id for production multi-cluster workflows
- Explicit job placement
- Predictable resource usage
- Better for heterogeneous resources
- Use sort_method for flexible single-cluster workflows
- Simpler specifications
- Better resource utilization
- Good for homogeneous resources
- Use queue-depth parallelism for homogeneous task queues
- Many similar jobs
- I/O-bound workloads
- When simplicity matters more than precision
- Monitor resource usage when switching strategies
- Check for over-subscription
- Verify expected parallelism
- Look for resource contention
- Test with small workflows first
- Validate job allocation behavior
- Check resource accounting
- Ensure jobs run on intended schedulers
Summary
| Strategy | Use When | Allocation Method | Resource Tracking |
|---|---|---|---|
| Resource-aware + sort_method | Heterogeneous jobs, flexible allocation | Server filters by resources | Yes |
| Resource-aware + scheduler_id | Heterogeneous jobs, strict allocation | Server filters by resources AND scheduler | Yes |
| Queue-depth | Homogeneous jobs, simple parallelism | Server returns next N jobs | No |
Choose the strategy that best matches your workflow characteristics and execution environment. You can even mix strategies across different runners for maximum flexibility.
Workflow Reinitialization
When you modify input files or configuration after a workflow has run, you need a way to re-execute only the affected jobs. Reinitialization handles this by detecting what changed and marking the appropriate jobs for re-execution.
When to Use Reinitialization
Use torc workflows reinitialize when:
- Input files changed — You modified an input file and want dependent jobs to rerun
- Configuration updated — You changed user_data parameters
- Output files missing — Output files were deleted and need regeneration
- Job definition changed — You modified a job's command or other attributes
- Iterative development — You're refining a workflow and need quick iteration
Basic Usage
# Preview what would change (recommended first step)
torc workflows reinitialize <workflow_id> --dry-run
# Reinitialize the workflow
torc workflows reinitialize <workflow_id>
# Force reinitialization even with warnings
torc workflows reinitialize <workflow_id> --force
How Change Detection Works
Reinitialization detects changes through three mechanisms:
1. File Modification Times
For files tracked in the workflow, Torc compares the current st_mtime (modification time) against
the stored value. If a file was modified since the last run, jobs that use it as input are marked
for re-execution.
# Modify an input file
echo "new data" > input.json
# Reinitialize detects the change
torc workflows reinitialize <workflow_id>
# Output: Reset 3 jobs due to changed inputs
2. Job Attribute and User Data Hashing
Torc computes SHA256 hashes of critical job attributes (such as the command) and user_data input
values. If any hash differs from the stored value, the job is marked for re-execution. This detects
changes like modified commands, updated scripts, or changed configuration parameters.
3. Missing Output Files
If a job's output file no longer exists on disk, the job is marked for re-execution regardless of whether inputs changed.
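For example, deleting a previously generated output file and previewing the reinitialization shows which jobs would be reset (the path below is illustrative):
# Remove an output file produced by an earlier run
rm results/chunk_3.json
# The job that owns it, plus downstream jobs, would be marked for re-execution
torc workflows reinitialize <workflow-id> --dry-run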
The Reinitialization Process
When you run reinitialize, Torc performs these steps:
- Bump run_id — Increments the workflow's run counter for tracking
- Reset workflow status — Clears the previous run's completion state
- Check file modifications — Compares current st_mtime values to stored values
- Check missing outputs — Identifies jobs whose output files no longer exist
- Check user_data changes — Computes and compares input hashes
- Mark affected jobs — Sets jobs needing re-execution to uninitialized
- Re-evaluate dependencies — Runs initialize_jobs to set jobs to ready or blocked
Dependency Propagation
When a job is marked for re-execution, all downstream jobs that depend on its outputs are also marked. This ensures the entire dependency chain is re-executed:
preprocess (input changed) → marked for rerun
↓
process (depends on preprocess output) → also marked
↓
postprocess (depends on process output) → also marked
Dry Run Mode
Always use --dry-run first to preview changes without modifying anything:
torc workflows reinitialize <workflow_id> --dry-run
Example output:
Dry run: 5 jobs would be reset due to changed inputs
- preprocess
- analyze_batch_1
- analyze_batch_2
- merge_results
- generate_report
Retrying Failed Jobs
Important: Reinitialization does not automatically retry failed jobs. To retry failed jobs, use
reset-status:
# Reset failed jobs to ready status, then reinitialize to check for other changes
torc workflows reset-status <workflow_id> --failed-only --reinitialize
# Or just reset failed jobs without reinitialization
torc workflows reset-status <workflow_id> --failed-only
Comparison with Full Reset
| Scenario | Use reinitialize | Use reset-status |
|---|---|---|
| Input file changed | Yes | No |
| Job command changed | Yes | No |
| Want to rerun everything | No | Yes |
| Retry failed jobs only | No | Yes (--failed-only) |
| Iterative development | Yes | Depends |
| Changed workflow spec | Create new workflow | Create new workflow |
Working with Workflows
This section covers how to create, configure, and manage workflows.
- Creating Workflows - Getting started with workflow creation
- Workflow Specification Formats - JSON, YAML, and other formats
- Visualizing Workflow Structure - Viewing workflow graphs
- Exporting and Importing Workflows - Moving workflows between systems
- Archiving Workflows - Long-term workflow storage
How to Create Workflows
This guide shows different methods for creating Torc workflows, from the most common (specification files) to more advanced approaches (CLI, API).
Using Workflow Specification Files (Recommended)
The easiest way to create workflows is with specification files. Torc supports YAML, JSON5, and KDL formats.
Create from a YAML File
torc workflows create workflow.yaml
Create from JSON5 or KDL
torc workflows create workflow.json5
torc workflows create workflow.kdl
Torc detects the format from the file extension.
Create and Run in One Step
For quick iteration, combine creation and execution:
# Create and run locally
torc run workflow.yaml
# Create and submit to Slurm
torc submit workflow.yaml
For format syntax and examples, see the Workflow Specification Formats guide. For a complete reference of all fields, see the Workflow Specification Reference.
Using the CLI (Step by Step)
For programmatic workflow construction or when you need fine-grained control, create workflows piece by piece using the CLI.
Step 1: Create an Empty Workflow
torc workflows new \
--name "my_workflow" \
--description "My test workflow"
Output:
Successfully created workflow:
ID: 1
Name: my_workflow
User: dthom
Description: My test workflow
Note the workflow ID (1) for subsequent commands.
Step 2: Add Resource Requirements
torc resource-requirements create \
--name "small" \
--num-cpus 1 \
--memory "1g" \
--runtime "PT10M" \
1 # workflow ID
Output:
Successfully created resource requirements:
ID: 2
Workflow ID: 1
Name: small
Step 3: Add Files (Optional)
torc files create \
--name "input_file" \
--path "/data/input.txt" \
1 # workflow ID
Step 4: Add Jobs
torc jobs create \
--name "process_data" \
--command "python process.py" \
--resource-requirements-id 2 \
--input-file-ids 1 \
1 # workflow ID
Step 5: Initialize and Run
# Initialize the workflow (resolves dependencies)
torc workflows initialize-jobs 1
# Run the workflow
torc run 1
Using the Python API
For complex programmatic workflow construction, use the Python client:
from torc import make_api
from torc.openapi_client import (
WorkflowModel,
JobModel,
ResourceRequirementsModel,
)
# Connect to the server
api = make_api("http://localhost:8080/torc-service/v1")
# Create workflow
workflow = api.create_workflow(WorkflowModel(
name="my_workflow",
user="myuser",
description="Programmatically created workflow",
))
# Add resource requirements
rr = api.create_resource_requirements(ResourceRequirementsModel(
workflow_id=workflow.id,
name="small",
num_cpus=1,
memory="1g",
runtime="PT10M",
))
# Add jobs
api.create_job(JobModel(
workflow_id=workflow.id,
name="job1",
command="echo 'Hello World'",
resource_requirements_id=rr.id,
))
print(f"Created workflow {workflow.id}")
For more details, see the Map Python Functions tutorial.
Using the Julia API
The Julia client provides similar functionality for programmatic workflow construction:
using Torc
import Torc: APIClient
# Connect to the server
api = make_api("http://localhost:8080/torc-service/v1")
# Create workflow
workflow = send_api_command(
api,
APIClient.create_workflow,
APIClient.WorkflowModel(;
name = "my_workflow",
user = get_user(),
description = "Programmatically created workflow",
),
)
# Add resource requirements
rr = send_api_command(
api,
APIClient.create_resource_requirements,
APIClient.ResourceRequirementsModel(;
workflow_id = workflow.id,
name = "small",
num_cpus = 1,
memory = "1g",
runtime = "PT10M",
),
)
# Add jobs
send_api_command(
api,
APIClient.create_job,
APIClient.JobModel(;
workflow_id = workflow.id,
name = "job1",
command = "echo 'Hello World'",
resource_requirements_id = rr.id,
),
)
println("Created workflow $(workflow.id)")
The Julia client also supports map_function_to_jobs for mapping a function across parameters,
similar to the Python client.
Choosing a Method
| Method | Best For |
|---|---|
| Specification files | Most workflows; declarative, version-controllable |
| CLI step-by-step | Scripted workflows, testing individual components |
| Python API | Complex dynamic workflows, integration with Python pipelines |
| Julia API | Complex dynamic workflows, integration with Julia pipelines |
Common Tasks
Validate a Workflow File Without Creating
Use --dry-run to validate a workflow specification without creating it on the server:
torc workflows create --dry-run workflow.yaml
Example output:
Workflow Validation Results
===========================
Workflow: my_workflow
Description: A sample workflow
Components to be created:
Jobs: 100 (expanded from 1 parameterized job specs)
Files: 5
User data records: 2
Resource requirements: 2
Slurm schedulers: 2
Workflow actions: 3
Submission: Ready for scheduler submission (has on_workflow_start schedule_nodes action)
Validation: PASSED
For programmatic use (e.g., in scripts or the dashboard), get JSON output:
torc -f json workflows create --dry-run workflow.yaml
What Validation Checks
The dry-run performs comprehensive validation:
Structural Checks:
- Valid file format (YAML, JSON5, KDL, or JSON)
- Required fields present
- Parameter expansion (shows expanded job count vs. original spec count)
Reference Validation:
- `depends_on` references existing jobs
- `depends_on_regexes` patterns are valid and match at least one job
- `resource_requirements` references exist
- `scheduler` references exist
- `input_files` and `output_files` reference defined files
- `input_user_data` and `output_user_data` reference defined user data
- All regex patterns (`*_regexes` fields) are valid
Duplicate Detection:
- Duplicate job names
- Duplicate file names
- Duplicate user data names
- Duplicate resource requirement names
- Duplicate scheduler names
Dependency Analysis:
- Circular dependency detection (reports all jobs in the cycle)
Action Validation:
- Actions reference existing jobs and schedulers
- `schedule_nodes` actions have required `scheduler` and `scheduler_type`
Scheduler Configuration:
- Slurm scheduler node requirements are valid
- Warns about heterogeneous schedulers without `jobs_sort_method` (see below)
Heterogeneous Scheduler Warning
When you have multiple Slurm schedulers with different resource profiles (memory, GPUs, walltime, partition) and jobs without explicit scheduler assignments, the validation warns about potential suboptimal job-to-node matching:
Warnings (1):
- Workflow has 3 schedulers with different memory (mem), walltime but 10 job(s)
have no explicit scheduler assignment and jobs_sort_method is not set. The
default sort method 'gpus_runtime_memory' will be used (jobs sorted by GPUs,
then runtime, then memory). If this doesn't match your workload, consider
setting jobs_sort_method explicitly to 'gpus_memory_runtime' (prioritize
memory over runtime) or 'none' (no sorting).
This warning helps you avoid situations where:
- Long-walltime nodes pull short-runtime jobs
- High-memory nodes pull low-memory jobs
- GPU nodes pull non-GPU jobs
Solutions:
- Set `jobs_sort_method` explicitly in your workflow spec (see the sketch below)
- Assign jobs to specific schedulers using the `scheduler` field on each job
- Accept the default `gpus_runtime_memory` sorting if it matches your workload
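As a rough sketch, the first two options might look like the following. The placement of `jobs_sort_method` on the scheduler entry and the specific field values are illustrative assumptions; consult the Slurm Workflows guide and the Workflow Specification Reference for the authoritative field names and locations.
# Sketch only: jobs_sort_method placement and values are assumptions
slurm_schedulers:
  - name: gpu_sched
    account: myproject
    jobs_sort_method: gpus_memory_runtime   # prioritize memory over runtime
jobs:
  - name: train_model
    command: python train.py
    scheduler: gpu_sched   # explicit assignment avoids mismatched job-to-node pulls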
Bypassing Validation
To create a workflow despite validation warnings:
torc workflows create --skip-checks workflow.yaml
Note: This bypasses scheduler node validation checks (which are treated as errors), but does not bypass all errors. Errors such as missing references or circular dependencies will always prevent creation.
List Available Workflows
torc workflows list
Delete a Workflow
torc workflows delete <workflow_id>
View Workflow Details
torc workflows get <workflow_id>
Defining File Dependencies
Jobs often need to read input files and produce output files. Torc can automatically infer job dependencies from these file relationships using variable substitution:
files:
- name: raw_data
path: /data/raw.csv
- name: processed_data
path: /data/processed.csv
jobs:
- name: preprocess
command: "python preprocess.py -o ${files.output.raw_data}"
- name: analyze
command: "python analyze.py -i ${files.input.raw_data} -o ${files.output.processed_data}"
Key concepts:
- `${files.input.NAME}` - References a file this job reads (creates a dependency on the job that outputs it)
- `${files.output.NAME}` - References a file this job writes (satisfies dependencies for downstream jobs)
In the example above, analyze automatically depends on preprocess because it needs raw_data as
input, which preprocess produces as output.
For a complete walkthrough, see Tutorial: Diamond Workflow.
Next Steps
- Tutorial: Diamond Workflow - Learn file-based dependencies with the fan-out/fan-in pattern
- Workflow Specification Formats - Detailed format reference
- Workflow Specification Reference - Complete field reference for all data models
- Job Parameterization - Generate multiple jobs from templates
- Tutorial: Many Independent Jobs - Your first workflow
Workflow Specification Formats
Torc supports three workflow specification formats: YAML, JSON5, and KDL. All formats provide the same functionality with different syntaxes to suit different preferences and use cases.
Format Overview
| Feature | YAML | JSON5 | KDL |
|---|---|---|---|
| Parameter Expansion | ✓ | ✓ | ✓ |
| Comments | ✓ | ✓ | ✓ |
| Trailing Commas | ✗ | ✓ | N/A |
| Human-Readable | ✓✓✓ | ✓✓ | ✓✓✓ |
| Programmatic Generation | ✓✓ | ✓✓✓ | ✓ |
| Industry Standard | ✓✓✓ | ✓✓ | ✓ |
| Jobs, Files, Resources | ✓ | ✓ | ✓ |
| User Data | ✓ | ✓ | ✓ |
| Workflow Actions | ✓ | ✓ | ✓ |
| Resource Monitoring | ✓ | ✓ | ✓ |
| Slurm Schedulers | ✓ | ✓ | ✓ |
YAML Format
Best for: Most workflows, especially those using multi-line commands.
File Extension: .yaml or .yml
Example:
name: data_processing_workflow
user: datauser
description: Multi-stage data processing pipeline
# File definitions
files:
- name: raw_data
path: /data/input/raw_data.csv
- name: processed_data
path: /data/output/processed_data.csv
# Resource requirements
resource_requirements:
- name: small_job
num_cpus: 2
num_gpus: 0
num_nodes: 1
memory: 4g
runtime: PT30M
# Jobs
jobs:
- name: download_data
command: wget https://example.com/data.csv -O ${files.output.raw_data}
resource_requirements: small_job
- name: process_data
command: python process.py ${files.input.raw_data} -o ${files.output.processed_data}
resource_requirements: small_job
depends_on:
- download_data
# Workflow actions
actions:
- trigger_type: on_workflow_start
action_type: run_commands
commands:
- mkdir -p /data/input /data/output
- echo "Workflow started"
Advantages:
- Most widely used configuration format
- Excellent for complex workflows with many jobs
- Clean, readable syntax without brackets
Disadvantages:
- Indentation-sensitive
- Can be verbose for deeply nested structures
JSON5 Format
Best for: Programmatic workflow generation and JSON compatibility.
File Extension: .json5
Example:
{
name: "data_processing_workflow",
user: "datauser",
description: "Multi-stage data processing pipeline",
// File definitions
files: [
{name: "raw_data", path: "/data/input/raw_data.csv"},
{name: "processed_data", path: "/data/output/processed_data.csv"},
],
// Resource requirements
resource_requirements: [
{
name: "small_job",
num_cpus: 2,
num_gpus: 0,
num_nodes: 1,
memory: "4g",
runtime: "PT30M",
},
],
// Jobs
jobs: [
{
name: "download_data",
command: "wget https://example.com/data.csv -O ${files.output.raw_data}",
resource_requirements: "small_job",
},
{
name: "process_data",
command: "python process.py ${files.input.raw_data} -o ${files.output.processed_data}",
resource_requirements: "small_job",
depends_on: ["download_data"],
},
],
// Workflow actions
actions: [
{
trigger_type: "on_workflow_start",
action_type: "run_commands",
commands: [
"mkdir -p /data/input /data/output",
"echo 'Workflow started'",
],
},
],
}
Advantages:
- JSON-compatible (easy programmatic manipulation)
- Supports comments and trailing commas
- Familiar to JavaScript/JSON users
Disadvantages:
- More verbose than YAML
- More brackets and commas than YAML
KDL Format
Best for: Simple to moderate workflows with clean syntax.
File Extension: .kdl
Example:
name "data_processing_workflow"
user "datauser"
description "Multi-stage data processing pipeline"
// File definitions
file "raw_data" path="/data/input/raw_data.csv"
file "processed_data" path="/data/output/processed_data.csv"
// Resource requirements
resource_requirements "small_job" {
num_cpus 2
num_gpus 0
num_nodes 1
memory "4g"
runtime "PT30M"
}
// Jobs
job "download_data" {
command "wget https://example.com/data.csv -O ${files.output.raw_data}"
resource_requirements "small_job"
}
job "process_data" {
command "python process.py ${files.input.raw_data} -o ${files.output.processed_data}"
resource_requirements "small_job"
depends_on_job "download_data"
}
// Workflow actions
action {
trigger_type "on_workflow_start"
action_type "run_commands"
command "mkdir -p /data/input /data/output"
command "echo 'Workflow started'"
}
Advantages:
- Clean, minimal syntax
- No indentation requirements
- Supports all core Torc features
Disadvantages:
- Less familiar to most users
- Boolean values use special syntax (`#true`, `#false`)
KDL-Specific Syntax Notes
- Boolean values: Use `#true` and `#false` (not `true` or `false`):
  resource_monitor {
      enabled #true
      generate_plots #false
  }
- Repeated child nodes: Use multiple statements:
  action {
      command "echo 'First command'"
      command "echo 'Second command'"
  }
- User data: Requires child nodes for properties:
  user_data "metadata" {
      is_ephemeral #true
      data "{\"key\": \"value\"}"
  }
Common Features Across All Formats
Variable Substitution
All formats support the same variable substitution syntax:
- `${files.input.NAME}` - Input file path
- `${files.output.NAME}` - Output file path
- `${user_data.input.NAME}` - Input user data
- `${user_data.output.NAME}` - Output user data
Supported Fields
All formats support:
- Workflow metadata: name, user, description
- Jobs: name, command, dependencies, resource requirements
- Files: name, path, modification time
- User data: name, data (JSON), ephemeral flag
- Resource requirements: CPUs, GPUs, memory, runtime
- Slurm schedulers: account, partition, walltime, etc.
- Workflow actions: triggers, action types, commands
- Resource monitoring: enabled, granularity, sampling interval
Examples Directory
The Torc repository includes comprehensive examples in all three formats:
examples/
├── yaml/ # All workflows (15 examples)
├── json/ # All workflows (15 examples)
└── kdl/ # Non-parameterized workflows (9 examples)
Compare the same workflow in different formats to choose your preference; see the examples directory for the complete collection.
Creating Workflows
All formats use the same command:
torc workflows create examples/yaml/sample_workflow.yaml
torc workflows create examples/json/sample_workflow.json5
torc workflows create examples/kdl/sample_workflow.kdl
Or use the quick execution commands:
# Create and run locally
torc run examples/yaml/sample_workflow.yaml
# Create and submit to scheduler
torc submit examples/yaml/workflow_actions_data_pipeline.yaml
Recommendations
Start with YAML if you're unsure.
Switch to JSON5 if you need to programmatically generate workflows or prefer JSON syntax.
Try KDL if you prefer minimal syntax.
All three formats are fully supported and maintained. Choose based on your workflow complexity and personal preference.
Visualizing Workflow Structure
Understanding how your workflow will execute—which jobs run in parallel, how dependencies create stages, and when Slurm allocations are requested—is essential for debugging and optimization. Torc provides several tools for visualizing workflow structure.
Execution Plan Command
The torc workflows execution-plan command analyzes a workflow and displays its execution stages,
showing how jobs are grouped and when schedulers allocate resources.
Basic Usage
# From a specification file
torc workflows execution-plan workflow.yaml
# From an existing workflow
torc workflows execution-plan <workflow_id>
Example Output
For a workflow with two independent processing pipelines that merge at the end:
Workflow: two_subgraph_pipeline
Total Jobs: 15
▶ Stage 1: Workflow Start
Scheduler Allocations:
• prep_sched (slurm) - 1 allocation(s)
Jobs Becoming Ready:
• prep_a
• prep_b
→ Stage 2: When jobs 'prep_a', 'prep_b' complete
Scheduler Allocations:
• work_a_sched (slurm) - 1 allocation(s)
• work_b_sched (slurm) - 1 allocation(s)
Jobs Becoming Ready:
• work_a_{1..5}
• work_b_{1..5}
→ Stage 3: When 10 jobs complete
Scheduler Allocations:
• post_a_sched (slurm) - 1 allocation(s)
• post_b_sched (slurm) - 1 allocation(s)
Jobs Becoming Ready:
• post_a
• post_b
→ Stage 4: When jobs 'post_a', 'post_b' complete
Scheduler Allocations:
• final_sched (slurm) - 1 allocation(s)
Jobs Becoming Ready:
• final
Total Stages: 4
What the Execution Plan Shows
- Stages: Groups of jobs that become ready at the same time based on dependency resolution
- Scheduler Allocations: Which Slurm schedulers request resources at each stage (for workflows with Slurm configuration)
- Jobs Becoming Ready: Which jobs transition to "ready" status at each stage
- Subgraphs: Independent branches of the workflow that can execute in parallel
Workflows Without Slurm Schedulers
For workflows without pre-defined Slurm schedulers, the execution plan shows the job stages without scheduler information:
torc workflows execution-plan workflow_no_slurm.yaml
Workflow: my_pipeline
Total Jobs: 10
▶ Stage 1: Workflow Start
Jobs Becoming Ready:
• preprocess
→ Stage 2: When job 'preprocess' completes
Jobs Becoming Ready:
• work_{1..5}
→ Stage 3: When 5 jobs complete
Jobs Becoming Ready:
• postprocess
Total Stages: 3
This helps you understand the workflow topology before adding Slurm configuration with
torc slurm generate.
Use Cases
- Validate workflow structure: Ensure dependencies create the expected execution order
- Identify parallelism: See which jobs can run concurrently
- Debug slow workflows: Find stages that serialize unnecessarily
- Plan Slurm allocations: Understand when resources will be requested
- Verify auto-generated schedulers: Check that `torc slurm generate` created appropriate staging
DAG Visualization in the Dashboard
The web dashboard provides interactive DAG (Directed Acyclic Graph) visualization.
Viewing the DAG
- Navigate to the Details tab
- Select a workflow
- Click View DAG in the Visualization section
DAG Types
The dashboard supports three DAG visualization types:
| Type | Description |
|---|---|
| Job Dependencies | Shows explicit and implicit dependencies between jobs |
| Job-File Relations | Shows how jobs connect through input/output files |
| Job-UserData Relations | Shows how jobs connect through user data |
DAG Features
- Color-coded nodes: Jobs are colored by status (ready, running, completed, failed, etc.)
- Interactive: Zoom, pan, and click nodes for details
- Layout: Automatic hierarchical layout using Dagre algorithm
- Legend: Status color reference
TUI DAG View
The terminal UI (torc tui) also includes DAG visualization:
- Select a workflow
- Press `d` to toggle the DAG view
- Use arrow keys to navigate
Comparing Visualization Tools
| Tool | Best For |
|---|---|
| `execution-plan` | Understanding execution stages, Slurm allocation timing |
| Dashboard DAG | Interactive exploration, status monitoring |
| TUI DAG | Quick terminal-based visualization |
Example: Analyzing a Complex Workflow
Consider a workflow with preprocessing, parallel work, and aggregation:
# First, view the execution plan
torc workflows execution-plan examples/subgraphs/subgraphs_workflow.yaml
# If no schedulers, generate them
torc slurm generate --account myproject examples/subgraphs/subgraphs_workflow_no_slurm.yaml
# View the plan again to see scheduler allocations
torc workflows execution-plan examples/subgraphs/subgraphs_workflow.yaml
The execution plan helps you verify that:
- Independent subgraphs are correctly identified
- Stages align with your expected execution order
- Slurm allocations are timed appropriately
See Also
- Web Dashboard — Full dashboard documentation
- Slurm Workflows — Understanding Slurm integration
- Workflow Actions — How actions trigger scheduler allocations
- Subgraphs Example — Complete example with multiple subgraphs
How to Export and Import Workflows
This guide shows how to export workflows to portable JSON files and import them into the same or different Torc servers. This is useful for:
- Backup and restore: Save workflow definitions for disaster recovery
- Migration: Move workflows between development, staging, and production environments
- Sharing: Share workflow templates with teammates or the community
- Duplication: Create copies of workflows for testing or experimentation
Exporting Workflows
Basic Export
Export a workflow to a JSON file:
torc workflows export 123 --output my_workflow.json
This creates a self-contained JSON document containing:
- Workflow metadata
- All jobs with their dependencies
- Files and user data
- Resource requirements
- Slurm and local schedulers
- Workflow actions
Export with Results
Include job results (stdout, stderr, return codes) in the export:
torc workflows export 123 --output my_workflow.json --include-results
Export with Events
Include workflow events (job status changes, scheduler events):
torc workflows export 123 --output my_workflow.json --include-events
Export with Everything
Include both results and events:
torc workflows export 123 --output my_workflow.json --include-results --include-events
Export to Stdout
Omit --output to write to stdout (useful for piping):
torc workflows export 123 > my_workflow.json
JSON Output Format
Use --format json for machine-readable output with export statistics:
torc workflows export 123 --output my_workflow.json --format json
Output:
{
"success": true,
"workflow_id": 123,
"workflow_name": "my_workflow",
"output_file": "my_workflow.json",
"jobs": 5,
"files": 3,
"user_data": 2,
"results": 0,
"events": 0
}
Importing Workflows
Basic Import
Import a workflow from a JSON file:
torc workflows import my_workflow.json
Output:
Successfully imported workflow:
Workflow ID: 456
Name: my_workflow
Jobs: 5
Files: 3
User data: 2
Import with Custom Name
Override the workflow name during import:
torc workflows import my_workflow.json --name "new_workflow_name"
Skip Results During Import
If the export includes results but you don't want to import them:
torc workflows import my_workflow.json --skip-results
Skip Events During Import
If the export includes events but you don't want to import them:
torc workflows import my_workflow.json --skip-events
JSON Output Format
Use --format json for machine-readable output:
torc workflows import my_workflow.json --format json
Output:
{
"success": true,
"workflow_id": 456,
"workflow_name": "my_workflow",
"jobs": 5,
"files": 3,
"user_data": 2
}
How Import Works
ID Remapping
When importing, all entity IDs are remapped to new IDs assigned by the target server. This ensures no conflicts with existing workflows. Cross-references between entities (e.g., job dependencies on files) are automatically updated to use the new IDs.
Job Status Reset
Imported jobs always start in the uninitialized status, regardless of their status in the exported
file. After import, you need to initialize and run the workflow:
# Initialize the imported workflow
torc workflows initialize 456
# Run locally
torc workflows run 456
# Or submit to scheduler
torc workflows submit 456
Default Resource Requirements
Each workflow automatically gets a "default" resource requirements entry. During import, the exported "default" resource requirements are mapped to the new workflow's default entry.
Export Format
The export format is a versioned JSON document. Here's the structure:
{
"export_version": "1.0",
"exported_at": "2024-01-15T10:30:00Z",
"workflow": { ... },
"files": [ ... ],
"user_data": [ ... ],
"resource_requirements": [ ... ],
"slurm_schedulers": [ ... ],
"local_schedulers": [ ... ],
"jobs": [ ... ],
"workflow_actions": [ ... ],
"results": [ ... ],
"events": [ ... ]
}
The results and events fields are only present when --include-results or --include-events
are specified.
Common Workflows
Backup All Active Workflows
for id in $(torc workflows list --format json | jq -r '.items[].id'); do
torc workflows export $id --output "backup_workflow_${id}.json"
done
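To restore from these backups later (for example, on a rebuilt server), a matching import loop is enough. The glob below simply mirrors the file naming used in the backup loop above:
# Re-import every backup file; IDs are remapped on the target server
for f in backup_workflow_*.json; do
    torc workflows import "$f"
done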
Migrate to Another Server
# On source server
torc workflows export 123 --output workflow.json
# On target server (different TORC_API_URL)
export TORC_API_URL="http://new-server:8080/torc-service/v1"
torc workflows import workflow.json
Clone a Workflow for Testing
# Export existing workflow
torc workflows export 123 --output original.json
# Import as a new workflow with different name
torc workflows import original.json --name "test_copy"
Troubleshooting
Import Fails with "File not found"
Ensure the export file exists and the path is correct:
ls -la my_workflow.json
torc workflows import ./my_workflow.json
Import Fails with API Error
Check that:
- The Torc server is running and accessible
- You have permission to create workflows
- The export file is valid JSON (not corrupted)
Validate the export file:
python -m json.tool my_workflow.json > /dev/null && echo "Valid JSON"
Jobs Not Running After Import
Imported jobs start in uninitialized status. You must initialize the workflow:
torc workflows initialize 456
Then check job status:
torc jobs list 456
Jobs should now show ready or blocked status depending on their dependencies.
Archiving Workflows
Workflow archiving provides a way to hide completed or inactive workflows from default list views while preserving all workflow data and execution history. Archived workflows remain fully accessible but don't clutter everyday workflow management operations.
Purpose and Motivation
As projects mature and accumulate workflows over time, the list of active workflows can become difficult to navigate. Archiving addresses this by:
- Reducing visual clutter - Completed workflows no longer appear in default list views
- Preserving historical data - All workflow data, jobs, results, and logs remain accessible
- Improving usability - Users can focus on active workflows without losing access to past work
- Maintaining audit trails - Archived workflows can be retrieved for analysis, debugging, or compliance
Archiving is particularly useful for:
- Completed experiments that may need future reference
- Successful production runs that serve as historical records
- Development workflows that are no longer active but contain valuable examples
- Workflows from completed projects that need to be retained for documentation
How It Works
When you archive a workflow, it's marked with an "archived" flag. This flag controls whether the workflow appears in default list views:
- Active workflows (not archived): Appear in standard `workflows list` commands
- Archived workflows: Hidden from default lists but accessible with the `--archived-only` flag
The archive status is just metadata - it doesn't affect the workflow's data, results, or any other functionality.
Archiving Workflows
Use the workflows archive command to archive or unarchive workflows:
# Archive a specific workflow
torc workflows archive true <workflow_id>
# Archive multiple workflows
torc workflows archive true 123 456 789
# Interactive selection (prompts user to choose)
torc workflows archive true
# With JSON output
torc --format json workflows archive true <workflow_id>
The command will output confirmation messages:
Successfully archived workflow 123
Successfully archived workflow 456
Successfully archived workflow 789
Unarchiving Workflows
To restore an archived workflow to active status, use the same command with false:
# Unarchive a specific workflow
torc workflows archive false <workflow_id>
# Unarchive multiple workflows
torc workflows archive false 123 456 789
# Interactive selection
torc workflows archive false
Output:
Successfully unarchived workflow 123
Viewing Workflows
Default Behavior
By default, the workflows list command shows only non-archived workflows:
# Shows active (non-archived) workflows only
torc workflows list
# Shows active workflows for a specific user
torc workflows list --user alice
Viewing Archived Workflows
Use the --archived-only flag to see archived workflows:
# List only archived workflows for current user
torc workflows list --archived-only
Viewing All Workflows
Use the --include-archived flag to see all workflows:
torc workflows list --include-archived
Accessing Specific Workflows
You can always access a workflow directly by its ID, regardless of archive status:
# Get details of any workflow (archived or not)
torc workflows get <workflow_id>
# Check workflow status
torc workflows status <workflow_id>
Impact on Workflow Operations
Operations Restricted on Archived Workflows
Certain workflow operations are not allowed on archived workflows to prevent accidental modifications:
- ❌ Status reset: Cannot use `workflows reset-status` on archived workflows
  - Error message: "Cannot reset archived workflow status. Unarchive the workflow first."
  - To reset status, unarchive the workflow first, then reset
Interactive Selection Behavior
When commands prompt for interactive workflow selection (when workflow ID is not specified), archived workflows are excluded by default:
# These will NOT show archived workflows in the interactive menu
torc-client workflows delete
torc-client workflows status
torc-client workflows initialize
This prevents accidentally operating on archived workflows while still allowing explicit access by ID.
Archive vs. Delete
Understanding when to archive versus delete workflows:
| Operation | Data Preserved | Reversible | Use Case |
|---|---|---|---|
| Archive | ✅ Yes | ✅ Yes | Completed workflows you may reference later |
| Delete | ❌ No | ❌ No | Failed experiments, test workflows, unwanted data |
Archive when:
- Workflow completed successfully and may need future reference
- Results should be preserved for reproducibility or compliance
- Workflow represents a milestone or important historical run
- You want to declutter lists but maintain data integrity
Delete when:
- Workflow failed and results are not useful
- Workflow was created for testing purposes only
- Data is no longer needed and storage space is a concern
- Workflow contains errors that would confuse future users
Common Use Cases
Completed Experiments
After completing an experiment and validating results:
# Archive the completed experiment
torc-client workflows archive true 123
# Later, if you need to reference it
torc-client workflows get 123
torc-client results list 123
Development Cleanup
Clean up development workflows while preserving examples:
# Delete test workflows
torc-client workflows delete 301 302 303
# Archive useful development examples
torc-client workflows archive true 304 305
Periodic Maintenance
Regularly archive old workflows to keep lists manageable:
# List workflows, identify completed ones
torc-client workflows list
# Archive workflows from completed projects
torc workflows archive true 401 402 403 404 405
Best Practices
When to Archive
- After successful completion - Archive workflows once they've completed successfully and been validated
- Project milestones - Archive workflows representing project phases or releases
- Regular cleanup - Establish periodic archiving of workflows older than a certain timeframe
- Before major changes - Archive working versions before making significant modifications
Summary
Workflow archiving provides a simple, reversible way to hide completed or inactive workflows from default views while preserving all data and functionality. It's designed for long-term workflow management in active projects where historical data is valuable but visual clutter is undesirable.
Key points:
- Archive workflows with: `torc workflows archive true <id>`
- Unarchive workflows with: `torc workflows archive false <id>`
- Archived workflows are hidden from default lists but remain fully functional
- View archived workflows with: `torc workflows list --archived-only`
- Archiving is reversible and does not affect data storage
- Use archiving for completed workflows; use deletion for unwanted data
How-Tos
Step-by-step guides for common tasks.
- Track Workflow Status - Monitoring workflow progress
- Cancel a Workflow - Stopping running workflows
- View Job Logs - Accessing job output
- Debug a Failed Job - Troubleshooting failures
- Check Resource Utilization - Monitoring resource usage
- View Resource Plots - Visualizing resource metrics
- Parameterize Jobs with Files - Using file-based parameters
How to Track Workflow Status
Monitor a running workflow's progress using the CLI, TUI, or dashboard.
Quick Status Check
torc reports summary <workflow_id>
Example output:
Workflow 42: data_pipeline
Jobs by Status:
Completed: 45
Running: 5
Ready: 10
Blocked: 40
Continuous Monitoring
Watch status update every 10 seconds:
watch -n 10 torc reports summary <workflow_id>
Interactive TUI
Launch the terminal UI for a visual dashboard:
torc tui
The TUI shows:
- Job status breakdown with progress bars
- Running job details
- Failed job information
- Real-time updates
List Individual Jobs
View job-level status:
# All jobs
torc jobs list <workflow_id>
# Filter by status
torc jobs list <workflow_id> --status running
torc jobs list <workflow_id> --status failed
Check Completion
Verify if a workflow has finished:
torc workflows is-complete <workflow_id>
For scripting:
if torc -f json workflows is-complete "$WORKFLOW_ID" | jq -e '.is_complete' > /dev/null; then
echo "Workflow complete"
fi
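Building on that check, a simple polling loop can block a script until the workflow finishes (the 30-second interval is just an example):
# Poll until the workflow reports completion
while ! torc -f json workflows is-complete "$WORKFLOW_ID" | jq -e '.is_complete' > /dev/null; do
    sleep 30
done
echo "Workflow $WORKFLOW_ID is complete"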
See Also
- Terminal UI (TUI) — Interactive monitoring
- Web Dashboard — Visual workflow management
- Workflow Reports — Generate summary reports
How to Cancel a Workflow
Stop a running workflow and terminate its jobs.
Cancel a Workflow
torc workflows cancel <workflow_id>
This:
- Marks the workflow as canceled
- Stops claiming new jobs
- Sends SIGKILL to all running processes
- Sends `scancel` to all active or pending Slurm allocations
Check Cancellation Status
Verify the workflow was canceled:
torc workflows status <workflow_id>
Or check completion status:
torc workflows is-complete <workflow_id>
Output:
Workflow 42 completion status:
Is Complete: true
Is Canceled: true
Restart After Cancellation
To resume a canceled workflow:
# Reinitialize to reset canceled jobs
torc workflows reinitialize <workflow_id>
# Run again locally
torc workflows run <workflow_id>
# Or submit to the scheduler
torc workflows submit <workflow_id>
Jobs that completed before cancellation remain completed.
See Also
- Track Workflow Status — Monitor workflow progress
- Workflow Reinitialization — Resume after issues
How to View Job Logs
Find and read the stdout/stderr output from job execution.
Find Log File Paths
torc reports results <workflow_id>
torc reports results <workflow_id> --job-id 15
Output includes:
{
"job_id": 15,
"job_name": "work_2",
"status": "Failed",
"run_id": 1,
"return_code": 137,
"completion_time": "2026-01-06T20:30:00.200Z",
"exec_time_minutes": 0.09313333333333332,
"compute_node_id": 47,
"job_stdout": "output/job_stdio/job_wf43_j15_r1_a1.o",
"job_stderr": "output/job_stdio/job_wf43_j15_r1_a1.e",
"compute_node_type": "slurm"
},
Read Logs Directly
Once you have the path, view the logs:
# View stdout
cat output/job_stdio/job_wf43_j15_r1_a1.o
# View stderr
cat output/job_stdio/job_wf43_j15_r1_a1.e
# Follow logs in real-time (for running jobs)
tail -f output/job_stdio/job_wf43_j15_r1_a1.*
Default Log Location
By default, logs are stored in the output directory:
output/
└── job_stdio/
    ├── job_wf<workflow_id>_j<job_id>_r<run_id>_a<attempt_id>.o
    └── job_wf<workflow_id>_j<job_id>_r<run_id>_a<attempt_id>.e
The output directory can be configured via the run/submit CLI options.
View Logs for Failed Jobs
Quickly find logs for failed jobs:
# Get failed job IDs
torc jobs list <workflow_id> --status failed
# Then view each job's logs
torc reports results <workflow_id> --job-id <failed_job_id>
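To collect stderr paths for every failed job in one pass, you can combine the two commands above. This is a sketch: it assumes the JSON output of `torc jobs list` exposes `.jobs[].id` and that `torc reports results` returns an array of records with the `job_stderr` field shown earlier; adjust the `jq` filters to the actual shapes on your installation.
# Assumptions: .jobs[].id and .[].job_stderr field names
for job_id in $(torc -f json jobs list "$WORKFLOW_ID" --status failed | jq -r '.jobs[].id'); do
    torc -f json reports results "$WORKFLOW_ID" --job-id "$job_id" | jq -r '.[].job_stderr'
done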
View Logs in TUI or Dashboard
You can also view job logs interactively:
- TUI — Run `torc tui` and select a job to view its stdout/stderr in the interface. See Terminal UI.
- Dashboard — The web dashboard displays job logs when you click on a job. See Web Dashboard.
See Also
- Working with Logs — Log configuration and management
- Debug a Failed Job — Full debugging workflow
How to Debug a Failed Job
Systematically diagnose why a job failed.
Step 1: Identify the Failed Job
torc jobs list <workflow_id> --status failed
Note the job ID and name.
Step 2: Check the Exit Code
torc results get <workflow_id> --job-id <job_id>
Common exit codes:
| Code | Meaning |
|---|---|
| 1 | General error |
| 2 | Misuse of shell command |
| 126 | Permission denied |
| 127 | Command not found |
| 137 | Killed (SIGKILL) — often OOM |
| 139 | Segmentation fault |
| 143 | Terminated (SIGTERM) |
Step 3: Read the Logs
# Get log paths
torc reports results <workflow_id> --job-id <job_id>
# View stderr (usually contains error messages)
cat output/job_stdio/job_wf43_j15_r1_a1.e
# View stdout
cat output/job_stdio/job_wf43_j15_r1_a1.o
Step 4: Check Resource Usage
Did the job exceed its resource limits?
torc reports check-resource-utilization <workflow_id>
Look for:
- Memory exceeded — Job was likely OOM-killed (exit code 137)
- Runtime exceeded — Job was terminated for running too long
Step 5: Reproduce Locally
Get the exact command that was run:
torc jobs get <job_id>
Try running it manually to see the error:
# Copy the command from the output and run it
python process.py --input data.csv
Common Fixes
| Problem | Solution |
|---|---|
| OOM killed | Increase memory in resource requirements |
| File not found | Verify input files exist, check dependencies |
| Permission denied | Check file permissions, execution bits |
| Timeout | Increase runtime in resource requirements |
Step 6: Fix and Retry
After fixing the issue:
# Reinitialize to reset failed jobs
torc workflows reset-status --failed --reinitialize <workflow_id>
# Run again locally
torc workflows run <workflow_id>
# Or submit to Slurm
torc submit-slurm <workflow_id>
See Also
- View Job Logs — Finding log files
- Check Resource Utilization — Resource analysis
- Debugging Workflows — Comprehensive debugging guide
How to Check Resource Utilization
Compare actual resource usage against specified requirements to identify jobs that exceeded their limits.
Quick Start
torc reports check-resource-utilization <workflow_id>
Example output:
⚠ Found 2 resource over-utilization violations:
Job ID | Job Name | Resource | Specified | Peak Used | Over-Utilization
-------|-------------|----------|-----------|-----------|------------------
15 | train_model | Memory | 8.00 GB | 10.50 GB | +31.3%
15 | train_model | Runtime | 2h 0m 0s | 2h 45m 0s | +37.5%
Show All Jobs
Include jobs that stayed within limits:
torc reports check-resource-utilization <workflow_id> --all
Check a Specific Run
For workflows that have been reinitialized multiple times:
torc reports check-resource-utilization <workflow_id> --run-id 2
Adjusting Requirements
When jobs exceed their limits, update your workflow specification with a buffer:
resource_requirements:
- name: training
memory: 12g # 10.5 GB peak + 15% buffer
runtime: PT3H # 2h 45m actual + buffer
Guidelines:
- Memory: Add 10-20% above peak usage
- Runtime: Add 15-30% above actual duration
- CPU: Round up to next core count
See Also
- Resource Monitoring — Enable and configure monitoring
- Resource Requirements Reference — Specification format
How to View Resource Utilization Plots
Generate interactive visualizations of CPU and memory usage over time.
Prerequisites
Enable time series monitoring in your workflow specification:
resource_monitor:
granularity: "time_series"
sample_interval_seconds: 2
This creates a SQLite database with resource samples at regular intervals.
Generate Plots
After your workflow completes, generate plots from the collected data:
torc plot-resources output/resource_utilization/resource_metrics_*.db -o plots/
This creates:
- Individual job plots — CPU, memory, and process count over time for each job
- Overview plots — Comparison across all jobs
- Summary dashboard — Bar charts of peak and average usage
Plot Specific Jobs
Generate plots for only certain jobs:
torc plot-resources output/resource_utilization/resource_metrics_*.db \
-o plots/ \
--job-ids 15,16
View the Plots
Open the generated HTML files in your browser:
open plots/job_15_resources.html
Query Data Directly
For custom analysis, query the SQLite database:
sqlite3 -table output/resource_utilization/resource_metrics_1_1.db
-- View samples for a specific job
SELECT timestamp, cpu_percent, memory_bytes
FROM job_resource_samples
WHERE job_id = 1
ORDER BY timestamp;
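Other aggregate queries follow the same pattern. For example, to report the peak memory observed for each job from the same table (column names as shown above), you can pass the SQL directly to sqlite3:
sqlite3 -table output/resource_utilization/resource_metrics_1_1.db \
  "SELECT job_id, MAX(memory_bytes) AS peak_memory_bytes FROM job_resource_samples GROUP BY job_id;"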
See Also
- Resource Monitoring — Configuration options
- Resource Monitoring Database — Database schema reference
How to Parameterize Jobs with Files
Process multiple input files by combining parameterization with file path templating.
Basic Pattern
Use a parameter to generate jobs for each file:
name: process_files
jobs:
- name: process_{dataset}
command: python process.py --input data/{dataset}.csv --output results/{dataset}.json
parameters:
dataset: "[train, test, validation]"
This creates 3 jobs:
- `process_train` → processes `data/train.csv`
- `process_test` → processes `data/test.csv`
- `process_validation` → processes `data/validation.csv`
With File Dependencies
Combine parameterization with explicit file definitions for dependency tracking:
name: file_pipeline
files:
- name: raw_{dataset}
path: data/{dataset}.csv
- name: processed_{dataset}
path: results/{dataset}.json
jobs:
- name: process_{dataset}
command: python process.py -i ${files.input.raw_{dataset}} -o ${files.output.processed_{dataset}}
parameters:
dataset: "[train, test, validation]"
- name: aggregate
command: python aggregate.py --input results/ --output summary.json
depends_on:
- process_{dataset}
parameters:
dataset: "[train, test, validation]"
The aggregate job automatically waits for all process_* jobs to complete.
Processing Numbered Files
Use range syntax for numbered file sequences:
jobs:
- name: convert_{i:03d}
command: ffmpeg -i video_{i:03d}.mp4 audio_{i:03d}.mp3
parameters:
i: "1:100"
Creates jobs for video_001.mp4 through video_100.mp4.
Multi-Dimensional Sweeps
Combine multiple parameters for Cartesian product expansion:
jobs:
- name: analyze_{region}_{year}
command: python analyze.py --region {region} --year {year} --output results/{region}_{year}.json
parameters:
region: "[north, south, east, west]"
year: "2020:2024"
Creates 20 jobs (4 regions × 5 years).
See Also
- Simple Parameterization — Basic parameter tutorial
- Advanced Parameterization — Multi-dimensional sweeps
- Job Parameterization Reference — Complete syntax
Tutorials
Hands-on tutorials to learn Torc by example.
- Many Independent Jobs - Running large batches of independent jobs
- Diamond Workflow - Building workflows with dependencies
- Simple Parameterization - Creating parameterized job sweeps
- Advanced Parameterization - Complex parameter combinations
- Multi-Stage Workflows with Barriers - Coordinating workflow stages
- User Data Dependencies - Passing data between jobs
Tutorial 1: Many Independent Jobs
This tutorial teaches you how to create and run a workflow with many independent parallel jobs using Torc's parameterization feature.
Learning Objectives
By the end of this tutorial, you will:
- Understand how to define parameterized jobs that expand into multiple instances
- Learn how Torc executes independent jobs in parallel
- Know how to monitor job execution and view results
Prerequisites
- Torc server running (see Installation)
- Basic familiarity with YAML syntax
Use Cases
This pattern is ideal for:
- Parameter sweeps: Testing different configurations
- Monte Carlo simulations: Running many independent trials
- Batch processing: Processing many files with the same logic
- Embarrassingly parallel workloads: Any task that can be split into independent units
Step 1: Start the Torc Server
First, ensure the Torc server is running:
torc-server run
By default, the server listens on port 8080, making the API URL
http://localhost:8080/torc-service/v1.
If you use a custom port, set the environment variable:
export TORC_API_URL="http://localhost:8100/torc-service/v1"
Step 2: Create the Workflow Specification
Save the following as hundred_jobs.yaml:
name: hundred_jobs_parallel
description: 100 independent jobs that can run in parallel
jobs:
- name: job_{i:03d}
command: |
echo "Running job {i}"
sleep $((RANDOM % 10 + 1))
echo "Job {i} completed"
resource_requirements: minimal
parameters:
i: "1:100"
resource_requirements:
- name: minimal
num_cpus: 1
num_gpus: 0
num_nodes: 1
memory: 1g
runtime: PT5M
Understanding the Specification
Let's break down the key elements:
- `name: job_{i:03d}`: The `{i:03d}` is a parameter placeholder. The `:03d` format specifier means "3-digit zero-padded integer", so jobs will be named `job_001`, `job_002`, ..., `job_100`.
- `parameters: i: "1:100"`: This defines a parameter `i` that ranges from 1 to 100 (inclusive). Torc will create one job for each value.
- `resource_requirements: minimal`: Each job uses the "minimal" resource profile defined below.
When Torc processes this specification, it expands the single job definition into 100 separate jobs, each with its own parameter value substituted.
Step 3: Run the Workflow
Create and run the workflow in one command:
torc run hundred_jobs.yaml
This command:
- Creates the workflow on the server
- Expands the parameterized job into 100 individual jobs
- Initializes the dependency graph (in this case, no dependencies)
- Starts executing jobs in parallel
You'll see output showing the workflow ID and progress.
Step 4: Monitor Execution
While the workflow runs, you can monitor progress:
# Check workflow status
torc workflows status <workflow_id>
# List jobs and their states
torc jobs list <workflow_id>
# Or use the interactive TUI
torc tui
Since all 100 jobs are independent (no dependencies between them), Torc will run as many in parallel as your system resources allow.
Step 5: View Results
After completion, check the results:
torc results list <workflow_id>
This shows return codes, execution times, and resource usage for each job.
How It Works
When you run this workflow, Torc:
- Expands parameters: The single job definition becomes 100 jobs (`job_001` through `job_100`)
- Executes in parallel: The job runner claims and executes jobs based on available resources
- Tracks completion: Each job's return code and metrics are recorded
The job runner respects the resource requirements you specified. With num_cpus: 1 per job, if your
machine has 8 CPUs, approximately 8 jobs will run simultaneously.
What You Learned
In this tutorial, you learned how to:
- ✅ Use parameter expansion (`parameters: i: "1:100"`) to generate multiple jobs from one definition
- ✅ Use format specifiers (`{i:03d}`) for consistent naming
- ✅ Run independent parallel jobs with `torc run`
- ✅ Monitor workflow progress and view results
Example Files
See hundred_jobs_parameterized.yaml for a ready-to-run version of this workflow.
Next Steps
- Tutorial 2: Diamond Workflow - Learn how to create job dependencies using files
- Tutorial 4: Simple Parameterization - Explore more parameter expansion options
- Multi-Stage Workflows with Barriers - Scale to thousands of jobs efficiently
Tutorial 2: Diamond Workflow with File Dependencies
This tutorial teaches you how to create workflows where job dependencies are automatically inferred from file inputs and outputs—a core concept in Torc called implicit dependencies.
Learning Objectives
By the end of this tutorial, you will:
- Understand how Torc infers job dependencies from file relationships
- Learn the "diamond" workflow pattern (fan-out and fan-in)
- Know how to use file variable substitution (`${files.input.*}` and `${files.output.*}`)
Prerequisites
- Completed Tutorial 1: Many Independent Jobs
- Torc server running
The Diamond Pattern
The "diamond" pattern is a common workflow structure where:
- One job produces multiple outputs (fan-out)
- Multiple jobs process those outputs in parallel
- A final job combines all results (fan-in)
graph TD
Input(["input.txt"]) --> Preprocess["preprocess"]
Preprocess --> Int1(["intermediate1.txt"])
Preprocess --> Int2(["intermediate2.txt"])
Int1 --> Work1["work1"]
Int2 --> Work2["work2"]
Work1 --> Result1(["result1.txt"])
Work2 --> Result2(["result2.txt"])
Result1 --> Postprocess["postprocess"]
Result2 --> Postprocess
Postprocess --> Output(["output.txt"])
style Input fill:#d4edda,stroke:#28a745,color:#155724
style Int1 fill:#d4edda,stroke:#28a745,color:#155724
style Int2 fill:#d4edda,stroke:#28a745,color:#155724
style Result1 fill:#d4edda,stroke:#28a745,color:#155724
style Result2 fill:#d4edda,stroke:#28a745,color:#155724
style Output fill:#d4edda,stroke:#28a745,color:#155724
style Preprocess fill:#4a9eff,color:#fff
style Work1 fill:#4a9eff,color:#fff
style Work2 fill:#4a9eff,color:#fff
style Postprocess fill:#4a9eff,color:#fff
Notice that we never explicitly say "work1 depends on preprocess"—Torc figures this out
automatically because work1 needs intermediate1.txt as input, and preprocess produces it as
output.
Step 1: Create the Workflow Specification
Save as diamond.yaml:
name: diamond_workflow
description: Diamond workflow demonstrating fan-out and fan-in
jobs:
- name: preprocess
command: |
cat ${files.input.input_file} |
awk '{print $1}' > ${files.output.intermediate1}
cat ${files.input.input_file} |
awk '{print $2}' > ${files.output.intermediate2}
resource_requirements: small
- name: work1
command: |
cat ${files.input.intermediate1} |
sort | uniq > ${files.output.result1}
resource_requirements: medium
- name: work2
command: |
cat ${files.input.intermediate2} |
sort | uniq > ${files.output.result2}
resource_requirements: medium
- name: postprocess
command: |
paste ${files.input.result1} ${files.input.result2} > ${files.output.final_output}
resource_requirements: small
files:
- name: input_file
path: /tmp/input.txt
- name: intermediate1
path: /tmp/intermediate1.txt
- name: intermediate2
path: /tmp/intermediate2.txt
- name: result1
path: /tmp/result1.txt
- name: result2
path: /tmp/result2.txt
- name: final_output
path: /tmp/output.txt
resource_requirements:
- name: small
num_cpus: 1
num_gpus: 0
num_nodes: 1
memory: 1g
runtime: PT10M
- name: medium
num_cpus: 4
num_gpus: 0
num_nodes: 1
memory: 4g
runtime: PT30M
Understanding File Variable Substitution
The key concept here is file variable substitution:
- `${files.input.filename}` - References a file this job reads (creates a dependency)
- `${files.output.filename}` - References a file this job writes (satisfies dependencies)
When Torc processes the workflow:
- It sees `preprocess` outputs `intermediate1` and `intermediate2`
- It sees `work1` inputs `intermediate1` → dependency created
- It sees `work2` inputs `intermediate2` → dependency created
- It sees `postprocess` inputs `result1` and `result2` → dependencies created
This is more maintainable than explicit depends_on declarations because:
- Dependencies are derived from actual data flow
- Adding a new intermediate step automatically updates dependencies
- The workflow specification documents the data flow
Step 2: Create Input Data
# Create test input file
echo -e "apple red\nbanana yellow\ncherry red\ndate brown" > /tmp/input.txt
Step 3: Create and Initialize the Workflow
# Create the workflow and capture the ID
WORKFLOW_ID=$(torc workflows create diamond.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
# Ensure the input file timestamp is current
touch /tmp/input.txt
# Initialize the workflow (builds dependency graph)
torc workflows initialize-jobs $WORKFLOW_ID
The initialize-jobs command is where Torc:
- Analyzes file input/output relationships
- Builds the dependency graph
- Marks jobs with satisfied dependencies as "ready"
Step 4: Observe Dependency Resolution
# Check job statuses
torc jobs list $WORKFLOW_ID
Expected output:
╭────┬──────────────┬─────────┬────────╮
│ ID │ Name │ Status │ ... │
├────┼──────────────┼─────────┼────────┤
│ 1 │ preprocess │ ready │ ... │
│ 2 │ work1 │ blocked │ ... │
│ 3 │ work2 │ blocked │ ... │
│ 4 │ postprocess │ blocked │ ... │
╰────┴──────────────┴─────────┴────────╯
Only preprocess is ready because:
- Its only input (`input_file`) already exists
- The others are blocked waiting for files that don't exist yet
Step 5: Run the Workflow
torc run $WORKFLOW_ID
Watch the execution unfold:
1. `preprocess` runs first - Creates `intermediate1.txt` and `intermediate2.txt`
2. `work1` and `work2` unblock - Their input files now exist
3. `work1` and `work2` run in parallel - They have no dependency on each other
4. `postprocess` unblocks - Both `result1.txt` and `result2.txt` exist
5. `postprocess` runs - Creates the final output
Step 6: Verify Results
cat /tmp/output.txt
You should see the combined, sorted, unique values from both columns of the input.
How Implicit Dependencies Work
Torc determines job order through file relationships:
| Job | Inputs | Outputs | Blocked By |
|---|---|---|---|
| preprocess | input_file | intermediate1, intermediate2 | (nothing) |
| work1 | intermediate1 | result1 | preprocess |
| work2 | intermediate2 | result2 | preprocess |
| postprocess | result1, result2 | final_output | work1, work2 |
The dependency graph is built automatically from these relationships. If you later add a validation
step between preprocess and work1, you only need to update the file references—the dependencies
adjust automatically.
What You Learned
In this tutorial, you learned:
- ✅ How to define files with a `files:` section and reference them in jobs
- ✅ How `${files.input.*}` creates implicit dependencies
- ✅ How `${files.output.*}` satisfies dependencies for downstream jobs
- ✅ How Torc automatically determines execution order from data flow
When to Use File Dependencies vs Explicit Dependencies
Use file dependencies when:
- Jobs actually read/write files
- Data flow defines the natural ordering
- You want self-documenting workflows
Use explicit depends_on when:
- Dependencies are logical, not data-based
- Jobs communicate through side effects
- You need precise control over ordering
Example Files
See the diamond workflow examples in the examples directory, available in all three formats. A Python version is also available: diamond_workflow.py
Next Steps
- Tutorial 3: User Data Dependencies - Pass JSON data between jobs without files
- Tutorial 4: Simple Parameterization - Combine file dependencies with parameter expansion
Tutorial 4: Simple Job Parameterization
This tutorial teaches you how to create parameter sweeps—generating multiple related jobs from a single job definition using Torc's parameterization feature.
Learning Objectives
By the end of this tutorial, you will:
- Understand how parameterization expands one job definition into many jobs
- Learn the different parameter formats (lists, ranges)
- Know how to use format specifiers for consistent naming
Prerequisites
- Completed Tutorial 1: Many Independent Jobs
- Torc server running
Why Parameterization?
Without parameterization, running 10 similar jobs requires 10 separate definitions. With parameterization, you write one definition and Torc expands it:
# Without parameterization: 10 separate definitions
jobs:
- name: task_1
command: ./process.sh --index 1
- name: task_2
command: ./process.sh --index 2
# ... 8 more ...
# With parameterization: 1 definition
jobs:
- name: task_{i}
command: ./process.sh --index {i}
parameters:
i: "1:10"
Step 1: Create a Simple Parameterized Workflow
Save as parallel_tasks.yaml:
name: parallel_tasks
description: Run 10 parallel tasks
jobs:
- name: task_{i}
command: |
echo "Processing task {i}"
sleep 2
echo "Task {i} complete"
parameters:
i: "1:10"
This expands to 10 jobs: task_1, task_2, ..., task_10.
Step 2: Create and Run the Workflow
WORKFLOW_ID=$(torc workflows create parallel_tasks.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
torc run $WORKFLOW_ID
All 10 jobs run in parallel since they have no dependencies.
Step 3: Verify the Expansion
torc jobs list $WORKFLOW_ID -f json | jq -r '.jobs[].name' | sort
Output:
task_1
task_10
task_2
task_3
task_4
task_5
task_6
task_7
task_8
task_9
Adding Dependencies Between Parameterized Jobs
Now let's create a workflow where parameterized jobs have dependencies:
name: process_and_aggregate
description: Process chunks then aggregate results
jobs:
- name: process_{i}
command: |
echo "Processing chunk {i}" > results/chunk_{i}.txt
parameters:
i: "1:5"
- name: aggregate
command: |
cat results/chunk_*.txt > results/final.txt
echo "Aggregated all chunks"
depends_on:
- process_{i}
parameters:
i: "1:5"
This creates:
- 5 parallel `process_*` jobs
- 1 `aggregate` job that waits for all 5 to complete
The aggregate job lists all process_{i} jobs as dependencies, so Torc expands this to depend on
all 5 process jobs.
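The remainder of this tutorial walks through a learning-rate sweep defined in learning_rate_sweep.yaml. A minimal sketch of such a spec, reconstructed from the description below, might look like the following; the commands, paths, and exact job layout are illustrative, and the actual tutorial file may differ:
name: learning_rate_sweep
description: Train and evaluate a model across several learning rates, then compare
jobs:
  - name: train_lr{lr:.4f}
    command: python train.py --learning-rate={lr} --output=models/model_lr{lr:.4f}.pt
    output_files:
      - model_lr{lr:.4f}
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"
  - name: evaluate_lr{lr:.4f}
    command: python evaluate.py --model=models/model_lr{lr:.4f}.pt --metrics=results/metrics_lr{lr:.4f}.json
    input_files:
      - model_lr{lr:.4f}
    output_files:
      - metrics_lr{lr:.4f}
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"
  - name: compare_results
    command: python compare.py --metrics-dir=results/ --output=results/comparison.json
    input_files:
      - metrics_lr{lr:.4f}
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"
files:
  - name: model_lr{lr:.4f}
    path: models/model_lr{lr:.4f}.pt
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"
  - name: metrics_lr{lr:.4f}
    path: results/metrics_lr{lr:.4f}.json
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"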
Understanding the Specification
Parameter Syntax:
- `{lr}` - Simple substitution with the parameter value
- `{lr:.4f}` - Format specifier: 4 decimal places (e.g., `0.0010` not `0.001`)
Parameter Values:
- `"[0.0001,0.0005,0.001,0.005,0.01]"` - A list of 5 specific values
File Parameterization: Notice that both jobs AND files have `parameters:`. When Torc expands:
- Each `train_lr{lr:.4f}` job gets a corresponding `model_lr{lr:.4f}` file
- The file dependencies are matched by parameter value
Dependency Flow:
- `train_lr0.0001` → outputs `model_lr0.0001` → unblocks `evaluate_lr0.0001`
- `train_lr0.0005` → outputs `model_lr0.0005` → unblocks `evaluate_lr0.0005`
- (and so on for each learning rate)
- All `evaluate_*` jobs → unblock `compare_results`
Step 2: Create and Initialize the Workflow
WORKFLOW_ID=$(torc workflows create learning_rate_sweep.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
torc workflows initialize-jobs $WORKFLOW_ID
Step 3: Verify the Expansion
# Count jobs (should be 11: 5 train + 5 evaluate + 1 compare)
torc jobs list $WORKFLOW_ID -f json | jq '.jobs | length'
List the job names:
torc jobs list $WORKFLOW_ID -f json | jq -r '.jobs[].name' | sort
Output:
compare_results
evaluate_lr0.0001
evaluate_lr0.0005
evaluate_lr0.0010
evaluate_lr0.0050
evaluate_lr0.0100
train_lr0.0001
train_lr0.0005
train_lr0.0010
train_lr0.0050
train_lr0.0100
Notice:
- One job per parameter value for `train_*` and `evaluate_*`
- Only one `compare_results` job (it has the parameter for dependencies, but doesn't expand because its name has no `{lr}`)
Step 4: Check Dependencies
torc jobs list $WORKFLOW_ID
Expected statuses:
- All `train_*` jobs: ready (no input dependencies)
- All `evaluate_*` jobs: blocked (waiting for corresponding model file)
- `compare_results`: blocked (waiting for all metrics files)
Step 5: Run the Workflow
torc run $WORKFLOW_ID
Execution flow:
- All 5 training jobs run in parallel - They have no dependencies on each other
- Each evaluation unblocks independently - When `train_lr0.0001` finishes, `evaluate_lr0.0001` can start (doesn't wait for other training jobs)
- Compare runs last - Only after all 5 evaluations complete
This is more efficient than a simple two-stage workflow because evaluations can start as soon as their specific training job completes.
Parameter Format Reference
List Format
Explicit list of values:
parameters:
lr: "[0.0001,0.0005,0.001,0.005,0.01]" # Numbers
opt: "['adam','sgd','rmsprop']" # Strings (note the quotes)
Range Format
For integer or float sequences:
parameters:
i: "1:100" # Integers 1 to 100 (inclusive)
i: "0:100:10" # Integers 0, 10, 20, ..., 100 (with step)
lr: "0.0:1.0:0.1" # Floats 0.0, 0.1, 0.2, ..., 1.0
Format Specifiers
Control how values appear in names:
| Specifier | Example Value | Result |
|---|---|---|
{i} | 5 | 5 |
{i:03d} | 5 | 005 |
{lr:.4f} | 0.001 | 0.0010 |
{lr:.2e} | 0.001 | 1.00e-03 |
How Parameterization and File Dependencies Interact
When both jobs and files are parameterized with the same parameter:
jobs:
- name: train_{i}
output_files: [model_{i}]
parameters:
i: "1:3"
- name: eval_{i}
input_files: [model_{i}]
parameters:
i: "1:3"
files:
- name: model_{i}
path: /models/model_{i}.pt
parameters:
i: "1:3"
Torc creates these relationships:
- `train_1` → `model_1` → `eval_1`
- `train_2` → `model_2` → `eval_2`
- `train_3` → `model_3` → `eval_3`
Each chain is independent—eval_2 doesn't wait for train_1.
Parameter Modes: Product vs Zip
By default, multiple parameters create a Cartesian product (all combinations). For paired
parameters, use parameter_mode: zip:
jobs:
# Default (product): 3 × 3 = 9 jobs
- name: train_{dataset}_{model}
command: python train.py --dataset={dataset} --model={model}
parameters:
dataset: "['cifar10', 'mnist', 'imagenet']"
model: "['resnet', 'vgg', 'transformer']"
# Zip mode: 3 paired jobs (cifar10+resnet, mnist+vgg, imagenet+transformer)
- name: paired_{dataset}_{model}
command: python train.py --dataset={dataset} --model={model}
parameters:
dataset: "['cifar10', 'mnist', 'imagenet']"
model: "['resnet', 'vgg', 'transformer']"
parameter_mode: zip
Use zip mode when parameters have a 1:1 correspondence (e.g., input/output file pairs, pre-determined configurations).
See Parameterization Reference for details.
What You Learned
In this tutorial, you learned:
- ✅ How to use
parameters:to expand one job definition into many - ✅ List format (
"[a,b,c]") and range format ("1:100") - ✅ Format specifiers (
{i:03d},{lr:.4f}) for consistent naming - ✅ How parameterized files create one-to-one dependencies
- ✅ The efficiency of parameter-matched dependencies (each chain runs independently)
- ✅ The difference between product (default) and zip parameter modes
Next Steps
- Tutorial 5: Advanced Parameterization - Multi-dimensional grid searches
- Multi-Stage Workflows with Barriers - Scale to thousands of parameterized jobs
Tutorial 5: Advanced Multi-Dimensional Parameterization
This tutorial teaches you how to create multi-dimensional parameter sweeps—grid searches over multiple hyperparameters that generate all combinations automatically.
Learning Objectives
By the end of this tutorial, you will:
- Understand how multiple parameters create a Cartesian product (all combinations)
- Learn to structure complex workflows with data preparation, training, and aggregation stages
- Know how to combine parameterization with explicit dependencies
- See patterns for running large grid searches on HPC systems
Prerequisites
- Completed Tutorial 4: Simple Parameterization
- Torc server running
- Understanding of file dependencies
Multi-Dimensional Parameters: Cartesian Product
When a job has multiple parameters, Torc creates the Cartesian product—every combination of values:
parameters:
lr: "[0.001,0.01]" # 2 values
bs: "[16,32]" # 2 values
This generates 2 × 2 = 4 jobs:
lr=0.001, bs=16lr=0.001, bs=32lr=0.01, bs=16lr=0.01, bs=32
With three parameters:
parameters:
lr: "[0.0001,0.001,0.01]" # 3 values
bs: "[16,32,64]" # 3 values
opt: "['adam','sgd']" # 2 values
This generates 3 × 3 × 2 = 18 jobs.
Step 1: Create the Workflow Specification
Save as grid_search.yaml:
name: hyperparameter_grid_search
description: 3D grid search over learning rate, batch size, and optimizer
jobs:
# Data preparation (runs once, no parameters)
- name: prepare_data
command: python prepare_data.py --output=/data/processed.pkl
resource_requirements: data_prep
output_files:
- training_data
# Training jobs (one per parameter combination)
- name: train_lr{lr:.4f}_bs{bs}_opt{opt}
command: |
python train.py \
--data=/data/processed.pkl \
--learning-rate={lr} \
--batch-size={bs} \
--optimizer={opt} \
--output=/models/model_lr{lr:.4f}_bs{bs}_opt{opt}.pt \
--metrics=/results/metrics_lr{lr:.4f}_bs{bs}_opt{opt}.json
resource_requirements: gpu_training
input_files:
- training_data
output_files:
- model_lr{lr:.4f}_bs{bs}_opt{opt}
- metrics_lr{lr:.4f}_bs{bs}_opt{opt}
parameters:
lr: "[0.0001,0.001,0.01]"
bs: "[16,32,64]"
opt: "['adam','sgd']"
# Aggregate results (depends on ALL training jobs via file dependencies)
- name: aggregate_results
command: |
python aggregate.py \
--input-dir=/results \
--output=/results/summary.csv
resource_requirements: minimal
input_files:
- metrics_lr{lr:.4f}_bs{bs}_opt{opt}
parameters:
lr: "[0.0001,0.001,0.01]"
bs: "[16,32,64]"
opt: "['adam','sgd']"
# Find best model (explicit dependency, no parameters)
- name: select_best_model
command: |
python select_best.py \
--summary=/results/summary.csv \
--output=/results/best_config.json
resource_requirements: minimal
depends_on:
- aggregate_results
files:
- name: training_data
path: /data/processed.pkl
- name: model_lr{lr:.4f}_bs{bs}_opt{opt}
path: /models/model_lr{lr:.4f}_bs{bs}_opt{opt}.pt
parameters:
lr: "[0.0001,0.001,0.01]"
bs: "[16,32,64]"
opt: "['adam','sgd']"
- name: metrics_lr{lr:.4f}_bs{bs}_opt{opt}
path: /results/metrics_lr{lr:.4f}_bs{bs}_opt{opt}.json
parameters:
lr: "[0.0001,0.001,0.01]"
bs: "[16,32,64]"
opt: "['adam','sgd']"
resource_requirements:
- name: data_prep
num_cpus: 8
memory: 32g
runtime: PT1H
- name: gpu_training
num_cpus: 8
num_gpus: 1
memory: 16g
runtime: PT4H
- name: minimal
num_cpus: 1
memory: 2g
runtime: PT10M
Understanding the Structure
Four-stage workflow:
prepare_data(1 job) - No parameters, runs oncetrain_*(18 jobs) - Parameterized, all depend onprepare_dataaggregate_results(1 job) - Has parameters only for file dependency matchingselect_best_model(1 job) - Explicit dependency onaggregate_results
Key insight: Why aggregate_results has parameters
The aggregate_results job won't expand into multiple jobs (its name has no {}). However, it
needs parameters: to match the parameterized input_files. This tells Torc: "this job depends on
ALL 18 metrics files."
Step 2: Create and Initialize the Workflow
WORKFLOW_ID=$(torc workflows create grid_search.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
torc workflows initialize-jobs $WORKFLOW_ID
Step 3: Verify the Expansion
Count the jobs:
torc jobs list $WORKFLOW_ID -f json | jq '.jobs | length'
Expected: 21 jobs (1 prepare + 18 training + 1 aggregate + 1 select)
List the training jobs:
torc jobs list $WORKFLOW_ID -f json | jq -r '.jobs[] | select(.name | startswith("train_")) | .name' | sort
Output (18 training jobs):
train_lr0.0001_bs16_optadam
train_lr0.0001_bs16_optsgd
train_lr0.0001_bs32_optadam
train_lr0.0001_bs32_optsgd
train_lr0.0001_bs64_optadam
train_lr0.0001_bs64_optsgd
train_lr0.0010_bs16_optadam
train_lr0.0010_bs16_optsgd
train_lr0.0010_bs32_optadam
train_lr0.0010_bs32_optsgd
train_lr0.0010_bs64_optadam
train_lr0.0010_bs64_optsgd
train_lr0.0100_bs16_optadam
train_lr0.0100_bs16_optsgd
train_lr0.0100_bs32_optadam
train_lr0.0100_bs32_optsgd
train_lr0.0100_bs64_optadam
train_lr0.0100_bs64_optsgd
Step 4: Examine the Dependency Graph
torc jobs list $WORKFLOW_ID
Initial states:
prepare_data: ready (no dependencies)- All
train_*: blocked (waiting fortraining_datafile) aggregate_results: blocked (waiting for all 18 metrics files)select_best_model: blocked (waiting foraggregate_results)
Step 5: Run the Workflow
For local execution:
torc run $WORKFLOW_ID
Execution flow:
prepare_dataruns and producestraining_data- All 18
train_*jobs unblock and run in parallel (resource-limited) aggregate_resultswaits for all training jobs, then runsselect_best_modelruns last
Step 6: Monitor Progress
# Check status summary
torc workflows status $WORKFLOW_ID
# Watch job completion in real-time
watch -n 10 'torc jobs list-by-status $WORKFLOW_ID'
# Or use the TUI
torc tui
Step 7: Retrieve Results
After completion:
# View best configuration
cat /results/best_config.json
# View summary of all runs
cat /results/summary.csv
Scaling Considerations
Job Count Growth
Multi-dimensional parameters grow exponentially:
| Dimensions | Values per Dimension | Total Jobs |
|---|---|---|
| 1 | 10 | 10 |
| 2 | 10 × 10 | 100 |
| 3 | 10 × 10 × 10 | 1,000 |
| 4 | 10 × 10 × 10 × 10 | 10,000 |
Dependency Count
Without barriers, dependencies also grow quickly. In this tutorial:
- 18 training jobs each depend on 1 file = 18 dependencies
- 1 aggregate job depends on 18 files = 18 dependencies
- Total: ~36 dependencies
For larger sweeps (1000+ jobs), consider the barrier pattern to reduce dependencies from O(n²) to O(n).
Common Patterns
Mixing Fixed and Parameterized Jobs
jobs:
# Fixed job (no parameters)
- name: setup
command: ./setup.sh
# Parameterized jobs depend on fixed job
- name: experiment_{i}
command: ./run.sh {i}
depends_on: [setup]
parameters:
i: "1:100"
Aggregating Parameterized Results
Use the file dependency pattern shown in this tutorial:
- name: aggregate
input_files:
- result_{i} # Matches all parameterized result files
parameters:
i: "1:100" # Same parameters as producer jobs
Nested Parameter Sweeps
For workflows with multiple independent sweeps:
jobs:
# Sweep 1
- name: sweep1_job_{a}
parameters:
a: "1:10"
# Sweep 2 (independent of sweep 1)
- name: sweep2_job_{b}
parameters:
b: "1:10"
What You Learned
In this tutorial, you learned:
- ✅ How multiple parameters create a Cartesian product of jobs
- ✅ How to structure multi-stage workflows (prep → train → aggregate → select)
- ✅ How to use parameters in file dependencies to collect all outputs
- ✅ How to mix parameterized and non-parameterized jobs
- ✅ Scaling considerations for large grid searches
Example Files
See these example files for hyperparameter sweep patterns:
- hyperparameter_sweep.yaml - Basic 3×3×2 grid search
- hyperparameter_sweep_shared_params.yaml - Grid search with shared parameter definitions
Next Steps
- Multi-Stage Workflows with Barriers - Essential for scaling to thousands of jobs
- Working with Slurm - Deploy grid searches on HPC clusters
- Resource Monitoring - Track resource usage across your sweep
Multi-Stage Workflows with Barriers
This tutorial teaches you how to efficiently structure workflows with multiple stages using the barrier pattern. This is essential for scaling workflows to thousands of jobs.
Learning Objectives
By the end of this tutorial, you will:
- Understand the quadratic dependency problem in multi-stage workflows
- Use barrier jobs to efficiently synchronize between stages
- Scale workflows to thousands of jobs with minimal overhead
- Know when to use barriers vs. direct dependencies
Prerequisites
- Basic understanding of Torc workflows
- Completed the Many Independent Jobs tutorial
- Completed the Simple Parameterization tutorial
The Problem: Quadratic Dependencies
Let's start with a common but inefficient pattern. Suppose you want to:
- Stage 1: Run 1000 preprocessing jobs in parallel
- Stage 2: Run 1000 analysis jobs, but only after ALL stage 1 jobs complete
- Stage 3: Run a final aggregation job
Naive Approach (DON'T DO THIS!)
name: "Inefficient Multi-Stage Workflow"
description: "This creates 1,000,000 dependencies!"
jobs:
# Stage 1: 1000 preprocessing jobs
- name: "preprocess_{i:03d}"
command: "python preprocess.py --id {i}"
parameters:
i: "0:999"
# Stage 2: Each analysis job waits for ALL preprocessing jobs
- name: "analyze_{i:03d}"
command: "python analyze.py --id {i}"
depends_on_regexes: ["^preprocess_.*"] # ⚠️ Creates 1,000,000 dependencies!
parameters:
i: "0:999"
# Stage 3: Final aggregation
- name: "final_report"
command: "python generate_report.py"
depends_on_regexes: ["^analyze_.*"] # ⚠️ Creates 1,000 more dependencies
Why This is Bad
When Torc expands this workflow:
- Each of the 1000
analyze_*jobs gets a dependency on each of the 1000preprocess_*jobs - Total dependencies: 1000 × 1000 = 1,000,000 relationships
- Workflow creation takes minutes instead of seconds
- Database becomes bloated with dependency records
- Job initialization is slow
The Solution: Barrier Jobs
A barrier job is a lightweight synchronization point that:
- Depends on all jobs from the previous stage (using a regex)
- Is depended upon by all jobs in the next stage
- Reduces dependencies from O(n²) to O(n)
flowchart LR
subgraph stage1["Stage 1"]
P1["preprocess_1"]
P2["preprocess_2"]
PN["preprocess_N"]
end
B1(["barrier"])
subgraph stage2["Stage 2"]
A1["analyze_1"]
A2["analyze_2"]
AN["analyze_N"]
end
P1 --> B1
P2 --> B1
PN --> B1
B1 --> A1
B1 --> A2
B1 --> AN
style P1 fill:#4a9eff,color:#fff
style P2 fill:#4a9eff,color:#fff
style PN fill:#4a9eff,color:#fff
style B1 fill:#28a745,color:#fff
style A1 fill:#ffc107,color:#000
style A2 fill:#ffc107,color:#000
style AN fill:#ffc107,color:#000
Instead of N×N dependencies (every stage 2 job depending on every stage 1 job), you get 2N dependencies (N into the barrier, N out of the barrier).
Efficient Approach (DO THIS!)
name: "Efficient Multi-Stage Workflow"
description: "Uses barrier pattern with only ~3000 dependencies"
jobs:
# ═══════════════════════════════════════════════════════════
# STAGE 1: Preprocessing (1000 parallel jobs)
# ═══════════════════════════════════════════════════════════
- name: "preprocess_{i:03d}"
command: "python preprocess.py --id {i} --output data/stage1_{i:03d}.json"
resource_requirements: "medium"
parameters:
i: "0:999"
# ═══════════════════════════════════════════════════════════
# BARRIER: Wait for ALL stage 1 jobs
# ═══════════════════════════════════════════════════════════
- name: "barrier_stage1_complete"
command: "echo 'Stage 1 complete: 1000 files preprocessed' && date"
resource_requirements: "tiny"
depends_on_regexes: ["^preprocess_.*"] # ✓ 1000 dependencies
# ═══════════════════════════════════════════════════════════
# STAGE 2: Analysis (1000 parallel jobs)
# ═══════════════════════════════════════════════════════════
- name: "analyze_{i:03d}"
command: "python analyze.py --input data/stage1_{i:03d}.json --output data/stage2_{i:03d}.csv"
resource_requirements: "large"
depends_on: ["barrier_stage1_complete"] # ✓ 1000 dependencies (one per job)
parameters:
i: "0:999"
# ═══════════════════════════════════════════════════════════
# BARRIER: Wait for ALL stage 2 jobs
# ═══════════════════════════════════════════════════════════
- name: "barrier_stage2_complete"
command: "echo 'Stage 2 complete: 1000 analyses finished' && date"
resource_requirements: "tiny"
depends_on_regexes: ["^analyze_.*"] # ✓ 1000 dependencies
# ═══════════════════════════════════════════════════════════
# STAGE 3: Final report (single job)
# ═══════════════════════════════════════════════════════════
- name: "final_report"
command: "python generate_report.py --output final_report.html"
resource_requirements: "medium"
depends_on: ["barrier_stage2_complete"] # ✓ 1 dependency
resource_requirements:
- name: "tiny"
num_cpus: 1
num_gpus: 0
num_nodes: 1
memory: "100m"
runtime: "PT1M"
- name: "medium"
num_cpus: 4
num_gpus: 0
num_nodes: 1
memory: "4g"
runtime: "PT30M"
- name: "large"
num_cpus: 16
num_gpus: 1
num_nodes: 1
memory: "32g"
runtime: "PT2H"
Dependency Breakdown
Without barriers:
- Stage 1 → Stage 2: 1000 × 1000 = 1,000,000 dependencies
- Stage 2 → Stage 3: 1000 = 1,000 dependencies
- Total: 1,001,000 dependencies
With barriers:
- Stage 1 → Barrier 1: 1,000 dependencies
- Barrier 1 → Stage 2: 1,000 dependencies
- Stage 2 → Barrier 2: 1,000 dependencies
- Barrier 2 → Stage 3: 1 dependency
- Total: 3,001 dependencies ← 333× improvement!
Step-by-Step: Creating Your First Barrier Workflow
Let's create a simple 2-stage workflow.
Step 1: Create the Workflow Spec
Create barrier_demo.yaml:
name: "Barrier Pattern Demo"
description: "Simple demonstration of the barrier pattern"
jobs:
# Stage 1: Generate 100 data files
- name: "generate_data_{i:02d}"
command: "echo 'Data file {i}' > output/data_{i:02d}.txt"
parameters:
i: "0:99"
# Barrier: Wait for all data generation
- name: "data_generation_complete"
command: "echo 'All 100 data files generated' && ls -l output/ | wc -l"
depends_on_regexes: ["^generate_data_.*"]
# Stage 2: Process each data file
- name: "process_data_{i:02d}"
command: "cat output/data_{i:02d}.txt | wc -w > output/processed_{i:02d}.txt"
depends_on: ["data_generation_complete"]
parameters:
i: "0:99"
# Final barrier and report
- name: "processing_complete"
command: "echo 'All 100 files processed' && cat output/processed_*.txt | awk '{sum+=$1} END {print sum}'"
depends_on_regexes: ["^process_data_.*"]
Step 2: Create the Output Directory
mkdir -p output
Step 3: Create the Workflow
torc workflows create barrier_demo.yaml
You should see output like:
Created workflow with ID: 1
- Created 100 stage 1 jobs
- Created 1 barrier job
- Created 100 stage 2 jobs
- Created 1 final barrier
Total: 202 jobs, 201 dependencies
Compare this to 10,000 dependencies without barriers!
Step 4: Run the Workflow
torc workflows run 1
Step 5: Monitor Progress
torc tui
You'll see:
- All 100
generate_data_*jobs run in parallel - Once they finish,
data_generation_completeexecutes - Then all 100
process_data_*jobs run in parallel - Finally,
processing_completeexecutes
Making Effective Barrier Jobs
1. Keep Barriers Lightweight
Barriers should be quick and cheap:
✓ GOOD - Lightweight logging
- name: "stage1_complete"
command: "echo 'Stage 1 done' && date"
resource_requirements: "tiny"
✗ BAD - Heavy computation
- name: "stage1_complete"
command: "python expensive_validation.py" # Don't do this!
resource_requirements: "large"
If you need validation, create a separate job:
# Barrier - lightweight
- name: "stage1_complete"
command: "echo 'Stage 1 done'"
resource_requirements: "tiny"
depends_on_regexes: ["^stage1_.*"]
# Validation - heavier
- name: "validate_stage1"
command: "python validate_all_outputs.py"
resource_requirements: "medium"
depends_on: ["stage1_complete"]
# Stage 2 depends on validation passing
- name: "stage2_job_{i}"
command: "python stage2.py {i}"
depends_on: ["validate_stage1"]
parameters:
i: "0:999"
2. Use Descriptive Names
Names should clearly indicate what stage completed:
✓ GOOD
- name: "barrier_preprocessing_complete"
- name: "barrier_training_complete"
- name: "all_simulations_finished"
✗ BAD
- name: "barrier1"
- name: "sync"
- name: "wait"
3. Add Useful Information
Make barriers informative:
- name: "preprocessing_complete"
command: |
echo "════════════════════════════════════════"
echo "Preprocessing Complete: $(date)"
echo "Files generated: $(ls output/stage1_*.json | wc -l)"
echo "Total size: $(du -sh output/)"
echo "Proceeding to analysis stage..."
echo "════════════════════════════════════════"
depends_on_regexes: ["^preprocess_.*"]
4. Be Careful with Regex Patterns
Ensure your regex matches exactly what you intend:
✓ GOOD - Anchored patterns
depends_on_regexes: ["^stage1_job_.*"] # Matches "stage1_job_001", "stage1_job_042"
depends_on_regexes: ["^preprocess_\\d+$"] # Matches "preprocess_0", "preprocess_999"
✗ BAD - Too broad
depends_on_regexes: ["stage1"] # Matches "my_stage1_test" (unintended!)
depends_on_regexes: [".*"] # Matches EVERYTHING (disaster!)
Test your regex before deploying:
# Python regex tester
python3 -c "import re; print(re.match(r'^stage1_job_.*', 'stage1_job_001'))"
When NOT to Use Barriers
Barriers are not always the right solution:
1. One-to-One Dependencies
When each job in stage 2 only needs its corresponding stage 1 job:
# DON'T use a barrier here
jobs:
- name: "preprocess_{i}"
command: "preprocess.py {i}"
output_files: ["data_{i}.json"]
parameters:
i: "0:99"
# Each analysis only needs its own preprocessed file
- name: "analyze_{i}"
command: "analyze.py {i}"
input_files: ["data_{i}.json"] # ✓ Automatic dependency via files
parameters:
i: "0:99"
The file dependency system already handles this efficiently!
2. Specific Dependencies in DAGs
When you have a directed acyclic graph (DAG) with specific paths:
# Diamond pattern - specific dependencies
jobs:
- name: "fetch_data"
command: "fetch.py"
- name: "process_weather"
command: "process_weather.py"
depends_on: ["fetch_data"]
- name: "process_traffic"
command: "process_traffic.py"
depends_on: ["fetch_data"]
- name: "generate_report"
command: "report.py"
depends_on: ["process_weather", "process_traffic"] # ✓ Specific dependencies
Don't force this into stages - the specific dependencies are clearer!
3. Small Workflows
For small workflows (< 100 jobs), the overhead of barriers isn't worth it:
# Only 10 jobs - barriers not needed
jobs:
- name: "job_{i}"
command: "process.py {i}"
depends_on_regexes: ["^prepare_.*"] # This is fine for 10 jobs
parameters:
i: "0:9"
Scaling to Thousands of Jobs
The barrier pattern scales beautifully. Let's compare performance:
| Stage 1 Jobs | Stage 2 Jobs | Without Barriers | With Barriers | Speedup |
|---|---|---|---|---|
| 100 | 100 | 10,000 deps (~1s) | 200 deps (<0.1s) | 10× |
| 1,000 | 1,000 | 1,000,000 deps (~45s) | 2,000 deps (~0.5s) | 90× |
| 10,000 | 10,000 | 100,000,000 deps (hours) | 20,000 deps (~5s) | 1000×+ |
As you can see, barriers become essential for large-scale workflows.
Complete Example
See multi_stage_barrier_pattern.yaml for a comprehensive example with:
- 3 distinct stages (1000 → 1000 → 100 jobs)
- Informative barrier jobs with progress logging
- Different resource requirements per stage
- Comments explaining the pattern
Summary
✓ Use barrier jobs when all jobs in one stage must complete before any job in the next stage starts
✓ Use file/data dependencies for one-to-one job relationships
✓ Use specific dependencies for DAG patterns with clear paths
✓ Keep barriers lightweight - just logging and simple checks
✓ Use descriptive names to track workflow progress
The barrier pattern is your key to scaling Torc workflows from hundreds to thousands of jobs efficiently!
Next Steps
- Try modifying the demo workflow to have 3 or more stages
- Experiment with adding validation logic to barrier jobs
- Check out Advanced Parameterization for creating complex multi-stage pipelines
- Learn about Workflow Actions for conditional execution between stages
Tutorial 3: User Data Dependencies
This tutorial teaches you how to pass structured data (JSON) between jobs using Torc's user_data feature—an alternative to file-based dependencies that stores data directly in the database.
Learning Objectives
By the end of this tutorial, you will:
- Understand what user_data is and when to use it instead of files
- Learn how to define user_data entries and reference them in jobs
- Know how to update user_data from within a job
- See how user_data creates implicit dependencies (like files)
Prerequisites
- Completed Tutorial 2: Diamond Workflow
- Torc server running
jqcommand-line tool installed (for JSON parsing)
What is User Data?
User data is Torc's mechanism for passing small, structured data between jobs without creating actual files. The data is stored in the Torc database and can be:
- JSON objects (configurations, parameters)
- Arrays
- Simple values (strings, numbers)
Like files, user_data creates implicit dependencies: a job that reads user_data will be blocked until the job that writes it completes.
User Data vs Files
| Feature | User Data | Files |
|---|---|---|
| Storage | Torc database | Filesystem |
| Size | Small (KB) | Any size |
| Format | JSON | Any format |
| Access | Via torc user-data CLI | Direct file I/O |
| Best for | Config, params, metadata | Datasets, binaries, logs |
Step 1: Create the Workflow Specification
Save as user_data_workflow.yaml:
name: config_pipeline
description: Jobs that pass configuration via user_data
jobs:
- name: generate_config
command: |
echo '{"learning_rate": 0.001, "batch_size": 32, "epochs": 10}' > /tmp/config.json
torc user-data update ${user_data.output.ml_config} \
--data "$(cat /tmp/config.json)"
resource_requirements: minimal
- name: train_model
command: |
echo "Training with config:"
torc user-data get ${user_data.input.ml_config} | jq '.data'
# In a real workflow: python train.py --config="${user_data.input.ml_config}"
resource_requirements: gpu_large
- name: evaluate_model
command: |
echo "Evaluating with config:"
torc user-data get ${user_data.input.ml_config} | jq '.data'
# In a real workflow: python evaluate.py --config="${user_data.input.ml_config}"
resource_requirements: gpu_small
user_data:
- name: ml_config
data: null # Will be populated by generate_config job
resource_requirements:
- name: minimal
num_cpus: 1
memory: 1g
runtime: PT5M
- name: gpu_small
num_cpus: 4
num_gpus: 1
memory: 16g
runtime: PT1H
- name: gpu_large
num_cpus: 8
num_gpus: 2
memory: 32g
runtime: PT4H
Understanding the Specification
Key elements:
user_data:section - Defines data entries, similar tofiles:data: null- Initial value; will be populated by a job${user_data.output.ml_config}- Job will write to this user_data (creates it)${user_data.input.ml_config}- Job reads from this user_data (creates dependency)
The dependency flow:
generate_configoutputsml_config→ runs firsttrain_modelandevaluate_modelinputml_config→ blocked until step 1 completes- After
generate_configfinishes, both become ready and can run in parallel
Step 2: Create and Initialize the Workflow
# Create the workflow
WORKFLOW_ID=$(torc workflows create user_data_workflow.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
# Initialize jobs
torc workflows initialize-jobs $WORKFLOW_ID
Step 3: Check Initial State
Before running, examine the user_data:
# Check user_data - should be null
torc user-data list $WORKFLOW_ID
Output:
╭────┬───────────┬──────┬─────────────╮
│ ID │ Name │ Data │ Workflow ID │
├────┼───────────┼──────┼─────────────┤
│ 1 │ ml_config │ null │ 1 │
╰────┴───────────┴──────┴─────────────╯
Check job statuses:
torc jobs list $WORKFLOW_ID
You should see:
generate_config: ready (no input dependencies)train_model: blocked (waiting forml_config)evaluate_model: blocked (waiting forml_config)
Step 4: Run the Workflow
torc run $WORKFLOW_ID
Step 5: Observe the Data Flow
After generate_config completes, check the updated user_data:
torc user-data list $WORKFLOW_ID -f json | jq '.[] | {name, data}'
Output:
{
"name": "ml_config",
"data": {
"learning_rate": 0.001,
"batch_size": 32,
"epochs": 10
}
}
The data is now stored in the database. At this point:
train_modelandevaluate_modelunblock- Both can read the configuration and run in parallel
Step 6: Verify Completion
After the workflow completes:
torc results list $WORKFLOW_ID
All three jobs should show return code 0.
How User Data Dependencies Work
The mechanism is identical to file dependencies:
| Syntax | Meaning | Effect |
|---|---|---|
${user_data.input.name} | Job reads this data | Creates dependency on producer |
${user_data.output.name} | Job writes this data | Satisfies dependencies |
Torc substitutes these variables with the actual user_data ID at runtime, and the torc user-data
CLI commands use that ID to read/write the data.
Accessing User Data in Your Code
From within a job, you can:
Read user_data:
# Get the full record
torc user-data get $USER_DATA_ID
# Get just the data field
torc user-data get $USER_DATA_ID | jq '.data'
# Save to a file for your application
torc user-data get $USER_DATA_ID | jq '.data' > config.json
Write user_data:
# Update with JSON data
torc user-data update $USER_DATA_ID --data '{"key": "value"}'
# Update from a file
torc user-data update $USER_DATA_ID --data "$(cat results.json)"
What You Learned
In this tutorial, you learned:
- ✅ What user_data is: structured data stored in the Torc database
- ✅ When to use it: configurations, parameters, metadata (not large files)
- ✅ How to define user_data entries with the
user_data:section - ✅ How
${user_data.input.*}and${user_data.output.*}create dependencies - ✅ How to read and write user_data from within jobs
Common Patterns
Dynamic Configuration Generation
jobs:
- name: analyze_data
command: |
# Analyze data and determine optimal parameters
OPTIMAL_LR=$(python analyze.py --find-optimal-lr)
torc user-data update ${user_data.output.optimal_params} \
--data "{\"learning_rate\": $OPTIMAL_LR}"
Collecting Results from Multiple Jobs
jobs:
- name: worker_{i}
command: |
RESULT=$(python process.py --id {i})
torc user-data update ${user_data.output.result_{i}} --data "$RESULT"
parameters:
i: "1:10"
- name: aggregate
command: |
# Collect all results
for i in $(seq 1 10); do
torc user-data get ${user_data.input.result_$i} >> all_results.json
done
python aggregate.py all_results.json
Next Steps
- Tutorial 4: Simple Parameterization - Create parameter sweeps
- Tutorial 5: Advanced Parameterization - Multi-dimensional grid searches
Monitoring & Debugging
Tools and techniques for monitoring workflows and debugging issues.
- Web Dashboard - Browser-based workflow monitoring
- Terminal UI (TUI) - Interactive terminal interface
- Workflow Reports - Generating workflow summaries
- Resource Monitoring - Tracking resource usage
- Debugging Workflows - Troubleshooting workflow issues
- Working with Logs - Accessing and analyzing logs
Web Dashboard (torc-dash)
The Torc Dashboard (torc-dash) provides a modern web-based interface for monitoring and managing
workflows, offering an intuitive alternative to the command-line interface.
Overview
torc-dash is a Rust-based web application that allows you to:
- Monitor workflows and jobs with real-time status updates
- Create and run workflows by uploading specification files (YAML, JSON, JSON5, KDL)
- Visualize workflow DAGs with interactive dependency graphs
- Debug failed jobs with integrated log file viewer
- Generate resource plots from time series monitoring data
- Manage torc-server start/stop in standalone mode
- Live event streaming via Server-Sent Events (SSE) for real-time job and compute node events
Installation
Building from Source
torc-dash is built as part of the Torc workspace:
# Build torc-dash
cargo build --release -p torc-dash
# Binary location
./target/release/torc-dash
Prerequisites
- A running
torc-server(or use--standalonemode to auto-start one) - The
torcCLI binary in your PATH (for workflow execution features)
Running the Dashboard
Quick Start (Standalone Mode)
The easiest way to get started is standalone mode, which automatically starts torc-server:
torc-dash --standalone
This will:
- Start
torc-serveron an automatically-detected free port - Start the dashboard on http://127.0.0.1:8090
- Configure the dashboard to connect to the managed server
Connecting to an Existing Server
If you already have torc-server running:
# Use default API URL (http://localhost:8080/torc-service/v1)
torc-dash
# Specify custom API URL
torc-dash --api-url http://myserver:9000/torc-service/v1
# Or use environment variable
export TORC_API_URL="http://myserver:9000/torc-service/v1"
torc-dash
Command-Line Options
Options:
-p, --port <PORT> Dashboard port [default: 8090]
--host <HOST> Dashboard host [default: 127.0.0.1]
-a, --api-url <API_URL> Torc server API URL [default: http://localhost:8080/torc-service/v1]
--torc-bin <PATH> Path to torc CLI binary [default: torc]
--torc-server-bin Path to torc-server binary [default: torc-server]
--standalone Auto-start torc-server alongside dashboard
--server-port <PORT> Server port in standalone mode (0 = auto-detect) [default: 0]
--database <PATH> Database path for standalone server
--completion-check-interval-secs <SECS> Server polling interval [default: 5]
Features
Workflows Tab
The main workflows view provides:
- Workflow list with ID, name, timestamp, user, and description
- Create Workflow button to upload new workflow specifications
- Quick actions for each workflow:
- View details and DAG visualization
- Initialize/reinitialize workflow
- Run locally or submit to scheduler
- Delete workflow
Creating Workflows
Click "Create Workflow" to open the creation dialog:
- Upload a file: Drag and drop or click to select a workflow specification file
- Supports YAML, JSON, JSON5, and KDL formats
- Or enter a file path: Specify a path on the server filesystem
- Click "Create" to register the workflow
Details Tab
Explore workflow components with interactive tables:
- Jobs: View all jobs with status, name, command, and dependencies
- Files: Input/output files with paths and timestamps
- User Data: Key-value data passed between jobs
- Results: Execution results with return codes and resource metrics
- Compute Nodes: Available compute resources
- Resource Requirements: CPU, memory, GPU specifications
- Schedulers: Slurm scheduler configurations
Features:
- Workflow selector: Filter by workflow
- Column sorting: Click headers to sort
- Row filtering: Type in filter boxes (supports
column:valuesyntax) - Auto-refresh: Toggle automatic updates
DAG Visualization
Click "View" on any workflow to see an interactive dependency graph:
- Nodes represent jobs, colored by status
- Edges show dependencies (file-based and explicit)
- Zoom, pan, and click nodes for details
- Legend shows status colors
Debugging Tab
Investigate failed jobs with the integrated debugger:
- Select a workflow
- Configure output directory (where logs are stored)
- Toggle "Show only failed jobs" to focus on problems
- Click "Generate Report" to fetch results
- Click any job row to view its log files:
- stdout: Standard output from the job
- stderr: Error output and stack traces
- Copy file paths with one click
Events Tab (SSE Live Streaming)
Monitor workflow activity in real-time using Server-Sent Events (SSE):
- Live event streaming - events appear instantly without polling
- Connection status indicator - shows Live/Reconnecting/Disconnected status
- Event types displayed:
job_started/job_completed/job_failed- Job lifecycle eventscompute_node_started/compute_node_stopped- Worker node lifecycleworkflow_started/workflow_reinitialized- Workflow initialization eventsscheduler_node_created- Slurm scheduler events
- Clear button to reset the event list
- Auto-reconnect on connection loss
Resource Plots Tab
Visualize CPU and memory usage over time:
- Enter a base directory containing resource database files
- Click "Scan for Databases" to find
.dbfiles - Select databases to plot
- Click "Generate Plots" for interactive Plotly charts
Requires workflows run with granularity: "time_series" in resource_monitor config.
Configuration Tab
Server Management
Start and stop torc-server directly from the dashboard:
- Server Port: Port to listen on (0 = auto-detect free port)
- Database Path: SQLite database file location
- Completion Check Interval: How often to check for job completions
- Log Level: Server logging verbosity
Click "Start Server" to launch, "Stop Server" to terminate.
API Configuration
- API URL: Torc server endpoint
- Test Connection: Verify connectivity
Settings are saved to browser local storage.
Common Usage Patterns
Running a Workflow
- Navigate to Workflows tab
- Click Create Workflow
- Upload your specification file
- Click Create
- Click Initialize on the new workflow
- Click Run Locally (or Submit for Slurm)
- Monitor progress in the Details tab or Events tab
Debugging a Failed Workflow
- Go to the Debugging tab
- Select the workflow
- Check "Show only failed jobs"
- Click Generate Report
- Click on a failed job row
- Review the stderr tab for error messages
- Check stdout for context
Monitoring Active Jobs
- Open Details tab
- Select "Jobs" and your workflow
- Enable Auto-refresh
- Watch job statuses update in real-time
Security Considerations
- Network Access: By default, binds to 127.0.0.1 (localhost only)
- Remote Access: Use
--host 0.0.0.0with caution; consider a reverse proxy with HTTPS - Authentication: Torc server supports htpasswd-based authentication (see Authentication)
Troubleshooting
Cannot Connect to Server
- Verify torc-server is running:
curl http://localhost:8080/torc-service/v1/workflows - Check the API URL in Configuration tab
- In standalone mode, check server output for startup errors
Workflow Creation Fails
- Ensure workflow specification is valid YAML/JSON/KDL
- Check file paths are accessible from the server
- Review browser console for error details
Resource Plots Not Showing
- Verify workflow used
granularity: "time_series"mode - Confirm
.dbfiles exist in the specified directory - Check that database files contain data
Standalone Mode Server Won't Start
- Verify
torc-serverbinary is in PATH or specify--torc-server-bin - Check if the port is already in use
- Review console output for error messages
Architecture
torc-dash is a self-contained Rust binary with:
- Axum web framework for HTTP server
- Embedded static assets (HTML, CSS, JavaScript)
- API proxy to forward requests to torc-server
- CLI integration for workflow operations
The frontend uses vanilla JavaScript with:
- Cytoscape.js for DAG visualization
- Plotly.js for resource charts
- Custom components for tables and forms
Next Steps
- Dashboard Deployment Tutorial - Detailed deployment scenarios
- Authentication - Secure your deployment
- Server Deployment - Production server configuration
Terminal User Interface (TUI)
The Torc TUI provides a full-featured terminal interface for managing workflows, designed for HPC users working in terminal-over-SSH environments.
Quick Start
# Option 1: Connect to an existing server
torc-server run & # Start server in background
torc tui # Launch the TUI
# Option 2: Standalone mode (auto-starts server)
torc tui --standalone
# Option 3: Start TUI without server (manual connection)
torc tui # Shows warning, use 'S' to start server
Standalone Mode
Use --standalone for single-machine development or testing:
# Basic standalone mode
torc tui --standalone
# Custom port
torc tui --standalone --port 8090
# Custom database location
torc tui --standalone --database /path/to/workflows.db
In standalone mode, the TUI automatically starts a torc-server process with the specified
configuration.
Features
- Workflow Management: Create, initialize, run, submit, cancel, reset, and delete workflows
- Job Management: View details, logs, cancel, terminate, or retry jobs
- Real-time Monitoring: Auto-refresh, manual refresh, color-coded status
- Live Event Streaming: Server-Sent Events (SSE) for real-time job and compute node events
- Server Management: Start/stop torc-server directly from the TUI
- File Viewing: Preview workflow files with search and navigation
- DAG Visualization: Text-based dependency graph
Interface Overview
When the TUI starts, you'll see:
┌─ Torc Management Console ────────────────────────────────────────┐
│ ?: help | n: new | i: init | I: reinit | R: reset | x: run ... │
└──────────────────────────────────────────────────────────────────┘
┌─ Server ─────────────────────────────────────────────────────────┐
│ http://localhost:8080/torc-service/v1 S: start | K: stop | O: output │
└──────────────────────────────────────────────────────────────────┘
┌─ User Filter ────────────────────────────────────────────────────┐
│ Current: yourname (press 'w' to change, 'a' for all users) │
└──────────────────────────────────────────────────────────────────┘
┌─ Workflows [FOCUSED] ────────────────────────────────────────────┐
│ >> 1 | my-workflow | yourname | Example workflow │
│ 2 | data-pipeline | yourname | Data processing pipeline │
└──────────────────────────────────────────────────────────────────┘
Basic Navigation
| Key | Action |
|---|---|
↑ / ↓ | Move up/down in the current table |
← / → | Switch focus between Workflows and Details panes |
Tab | Switch between detail tabs (Jobs → Files → Events → Results → DAG) |
Enter | Load details for selected workflow |
q | Quit (or close popup/dialog) |
? | Show help popup with all keybindings |
Workflow Actions
Select a workflow and use these keys:
| Key | Action | Description |
|---|---|---|
n | New | Create workflow from spec file |
i | Initialize | Set up job dependencies, mark ready jobs |
I | Re-initialize | Reset and re-initialize (prompts if output files exist) |
R | Reset | Reset all job statuses |
x | Run | Run workflow locally (shows real-time output) |
s | Submit | Submit to HPC scheduler (Slurm) |
C | Cancel | Cancel running workflow |
d | Delete | Delete workflow (destructive!) |
All destructive actions show a confirmation dialog.
Handling Existing Output Files
When initializing or re-initializing a workflow, if existing output files are detected, the TUI will
show a confirmation dialog listing the files that will be deleted. Press y to proceed with
--force or n to cancel.
Job Management
Navigate to the Jobs tab (→ then Tab if needed) to manage individual jobs:
| Key | Action |
|---|---|
Enter | View job details |
l | View job logs (stdout/stderr) |
c | Cancel job |
t | Terminate job |
y | Retry failed job |
f | Filter jobs by column |
Job Status Colors
- Green: Completed
- Yellow: Running
- Red: Failed
- Magenta: Canceled/Terminated
- Blue: Pending/Scheduled
- Cyan: Ready
- Gray: Blocked
Log Viewer
Press l on a job to view its logs:
| Key | Action |
|---|---|
Tab | Switch between stdout and stderr |
↑ / ↓ | Scroll one line |
PgUp / PgDn | Scroll 20 lines |
g / G | Jump to top / bottom |
/ | Start search |
n / N | Next / previous search match |
q | Close log viewer |
File Viewer
Navigate to the Files tab and press Enter on a file to view its contents. The file viewer
supports:
- Files up to 1MB
- Binary files show a hex dump preview
- Same navigation keys as the log viewer
Events Tab (SSE Live Streaming)
The Events tab shows real-time events from the server using Server-Sent Events (SSE). When you select a workflow and navigate to the Events tab, the TUI establishes a live connection to receive events as they occur.
Event types displayed:
job_started/job_completed/job_failed- Job lifecycle eventscompute_node_started/compute_node_stopped- Worker node lifecycleworkflow_started/workflow_reinitialized- Workflow initialization eventsscheduler_node_created- Slurm scheduler events
Features:
- Events appear instantly without polling
- Newest events shown at the top
- Filter by Event Type or Data content using
f - Title shows "[SSE Live]" indicator when connected
Server Management
The TUI can start and manage a torc-server instance:
| Key | Action |
|---|---|
S | Start torc-server |
K | Stop/Kill server |
O | Show server output |
The server status indicator in the connection bar shows:
●(green): Server is running (managed by TUI)○(yellow): Server was started but has stopped- No indicator: External server (not managed by TUI)
Connection Settings
| Key | Action |
|---|---|
u | Change server URL |
w | Change user filter |
a | Toggle show all users |
Auto-Refresh
Press A to toggle auto-refresh (30-second interval). When enabled, the workflow list and details
refresh automatically.
Configuration
The TUI respects Torc's layered configuration system:
- Interactive changes in TUI (press
uto change server URL) - Environment variables (
TORC_CLIENT__API_URL) - Local config file (
./torc.toml) - User config file (
~/.config/torc/config.toml) - System config file (
/etc/torc/config.toml) - Default values
Troubleshooting
"Could not connect to server"
- Ensure the Torc server is running:
torc-server run - Check the server URL: press
uto update if needed - Verify network connectivity
"No log content available"
Logs may not be available if:
- The job hasn't run yet
- You're on a different machine than where jobs ran
- The output directory is in a different location
Screen rendering issues
- Ensure your terminal supports UTF-8 and 256 colors
- Try resizing your terminal window
- Press
rto force a refresh
TUI vs Web Dashboard
| Feature | TUI (torc tui) | Web (torc-dash) |
|---|---|---|
| Environment | Terminal/SSH | Web browser |
| Startup | Instant | ~2 seconds |
| Dependencies | None (single binary) | None (single binary) |
| Workflow actions | Yes | Yes |
| Job actions | Yes | Yes |
| Log viewing | Yes | Yes |
| DAG visualization | Text-based | Interactive graph |
| Resource plots | Planned | Yes |
Choose the TUI for: SSH sessions, HPC environments, quick operations, low-bandwidth connections.
Choose torc-dash for: Rich visualizations, resource plots, team dashboards.
How to Generate Workflow Reports
This guide shows how to check workflow completion status and generate summary reports.
Check if a Workflow is Complete
Before generating reports, verify that your workflow has finished:
torc workflows is-complete <workflow_id>
If you omit the workflow ID, you'll be prompted to select from your workflows:
torc workflows is-complete
Example output:
Workflow 42 completion status:
Is Complete: true
Is Canceled: false
Needs Completion Script: false
For JSON output:
torc -f json workflows is-complete <workflow_id>
Generate a Workflow Summary
Once a workflow is complete, generate a summary report:
torc reports summary <workflow_id>
If you omit the workflow ID, you'll be prompted to select from your workflows:
torc reports summary
Example output:
Workflow Summary
================
Workflow ID: 42
Name: data_processing_pipeline
User: jsmith
Job Status (total: 100):
Completed: 95 ✓
Failed: 5 ✗
Total Execution Time: 2h 30m 15s
Walltime: 3h 15m 42s
If all jobs succeeded:
Workflow Summary
================
Workflow ID: 42
Name: simulation_run
User: jsmith
Job Status (total: 50):
Completed: 50 ✓
Total Execution Time: 45m 30s
✓ All jobs completed successfully!
Only non-zero status counts are displayed.
Continuous Monitoring
This command can be very convenient, but be mindful of your workflow size (number of jobs) and network load if you are using a shared server.
watch -n 10 torc reports summary <workflow_id>
JSON Output
This is useful for scripts:
torc -f json reports summary <workflow_id>
{
"workflow_id": 42,
"workflow_name": "data_processing_pipeline",
"workflow_user": "jsmith",
"total_jobs": 100,
"jobs_by_status": {
"uninitialized": 0,
"blocked": 0,
"ready": 0,
"pending": 0,
"running": 0,
"completed": 95,
"failed": 5,
"canceled": 0,
"terminated": 0,
"disabled": 0
},
"total_exec_time_minutes": 150.25,
"total_exec_time_formatted": "2h 30m 15s",
"walltime_seconds": 11742.0,
"walltime_formatted": "3h 15m 42s"
}
Use in Scripts
Combine these commands in automation scripts:
#!/bin/bash
WORKFLOW_ID=$1
# Check completion status
if torc -f json workflows is-complete "$WORKFLOW_ID" | jq -e '.is_complete' > /dev/null; then
echo "Workflow complete, generating summary..."
torc -f json reports summary "$WORKFLOW_ID" > "summary_${WORKFLOW_ID}.json"
else
echo "Workflow not yet complete"
exit 1
fi
Check Resource Utilization
After a workflow completes, check if any jobs exceeded their resource limits:
torc reports check-resource-utilization <workflow_id>
Example output when jobs stayed within limits:
Resource Utilization Report for Workflow 42
===========================================
All 50 jobs completed within resource limits.
Example output when jobs exceeded limits:
Resource Utilization Report for Workflow 42
===========================================
Jobs exceeding resource limits:
Job ID Name Memory Limit Peak Memory Status
------ ------------- ------------ ----------- ------
123 train_model_1 16g 18.2g EXCEEDED
124 train_model_2 16g 17.8g EXCEEDED
Recommendation: Increase memory allocation for affected jobs.
This helps identify jobs that may have been killed due to out-of-memory conditions or that are at risk of failure in future runs.
Related Commands
torc workflows status <id>- View current job status countstorc results list <id>- List individual job resultstorc reports check-resource-utilization <id>- Check for resource violationstorc reports results <id>- Generate detailed results with log file paths
Next Steps
- Resource Monitoring - Track CPU and memory usage
- Debugging Workflows - Troubleshoot failed jobs
How to Monitor Resource Usage
This guide shows how to track CPU and memory usage of your workflow jobs and identify resource requirement mismatches.
Enable Resource Monitoring
Resource monitoring is enabled by default for all workflows. To explicitly configure it, add a
resource_monitor section to your workflow specification:
name: "My Workflow"
resource_monitor:
enabled: true
granularity: "summary" # or "time_series"
sample_interval_seconds: 5
jobs:
# ... your jobs
To disable monitoring when creating a workflow:
torc workflows create my_workflow.yaml --no-resource-monitoring
View Summary Metrics
For workflows using summary mode (default), view resource metrics with:
torc results list <workflow_id>
The output includes columns for peak and average CPU and memory usage.
Check for Resource Violations
Use check-resource-utilization to identify jobs that exceeded their specified requirements:
# Check latest run
torc reports check-resource-utilization <workflow_id>
# Check a specific run
torc reports check-resource-utilization <workflow_id> --run-id <run_id>
# Show all jobs, not just violations
torc reports check-resource-utilization <workflow_id> --all
Example output:
⚠ Found 3 resource over-utilization violations:
Job ID | Job Name | Resource | Specified | Peak Used | Over-Utilization
-------|------------------|----------|-----------|-----------|------------------
15 | train_model | Memory | 8.00 GB | 10.50 GB | +31.3%
15 | train_model | Runtime | 2h 0m 0s | 2h 45m 0s | +37.5%
16 | large_preprocess | CPU | 800% | 950.5% | +18.8%
Adjust Resource Requirements
After identifying violations, update your workflow specification:
# Before: job used 10.5 GB but was allocated 8 GB
resource_requirements:
- name: training
memory: 8g
runtime: PT2H
# After: increased with buffer
resource_requirements:
- name: training
memory: 12g # 10.5 GB peak + 15% buffer
runtime: PT3H # 2h 45m actual + buffer
Guidelines for buffers:
- Memory: Add 10-20% above peak usage
- Runtime: Add 15-30% above actual duration
- CPU: Round up to next core count
Enable Time Series Monitoring
For detailed resource analysis over time, switch to time series mode:
resource_monitor:
granularity: "time_series"
sample_interval_seconds: 2
This creates a SQLite database with samples at regular intervals.
Generate Resource Plots
Create interactive visualizations from time series data:
# Generate all plots
torc plot-resources output/resource_utilization/resource_metrics_*.db \
-o plots/
# Generate plots for specific jobs
torc plot-resources output/resource_utilization/resource_metrics_*.db \
-o plots/ \
--job-ids 15,16
The tool generates:
- Individual job plots showing CPU, memory, and process count over time
- Overview plots comparing all jobs
- Summary dashboard with bar charts
Query Time Series Data Directly
Access the SQLite database for custom analysis:
sqlite3 -table output/resource_utilization/resource_metrics_1_1.db
-- View samples for a specific job
SELECT job_id, timestamp, cpu_percent, memory_bytes, num_processes
FROM job_resource_samples
WHERE job_id = 1
ORDER BY timestamp;
-- View job metadata
SELECT * FROM job_metadata;
Troubleshooting
No metrics recorded
- Check that monitoring wasn't disabled with
--no-resource-monitoring - Ensure jobs run long enough for at least one sample (default: 5 seconds)
Time series database not created
- Verify the output directory is writable
- Confirm
granularity: "time_series"is set in the workflow spec
Missing child process metrics
- Decrease
sample_interval_secondsto catch short-lived processes
Next Steps
- Resource Monitoring Reference - Configuration options and database schema
- Managing Resources - Define job resource requirements
Debugging Workflows
When workflows fail or produce unexpected results, Torc provides comprehensive debugging tools to help you identify and resolve issues. The primary debugging tools are:
torc results list: Prints a table of return codes for each job execution (non-zero means failure)torc reports results: Generates a detailed JSON report containing job results and all associated log file pathstorc logs analyze <output-dir>: Analyzes log files for known error patterns (see Working with Logs)- torc-dash Debug tab: Interactive web interface for visual debugging with log file viewer
Overview
Torc automatically captures return codes and multiple log files for each job execution:
- Job stdout/stderr: Output from your job commands
- Job runner logs: Internal logs from the Torc job runner
- Slurm logs: Additional logs when using Slurm scheduler (see Debugging Slurm Workflows)
The reports results command consolidates all this information into a single JSON report, making it
easy to locate and examine relevant log files for debugging.
Quick Start
View the job return codes in a table:
torc results list <workflow_id>
Results for workflow ID 2:
╭────┬────────┬───────┬────────┬─────────────┬───────────┬──────────┬────────────┬──────────────────────────┬────────╮
│ ID │ Job ID │ WF ID │ Run ID │ Return Code │ Exec Time │ Peak Mem │ Peak CPU % │ Completion Time │ Status │
├────┼────────┼───────┼────────┼─────────────┼───────────┼──────────┼────────────┼──────────────────────────┼────────┤
│ 4 │ 6 │ 2 │ 1 │ 1 │ 1.01 │ 73.8MB │ 21.9% │ 2025-11-13T13:35:43.289Z │ Done │
│ 5 │ 4 │ 2 │ 1 │ 0 │ 1.01 │ 118.1MB │ 301.3% │ 2025-11-13T13:35:43.393Z │ Done │
│ 6 │ 5 │ 2 │ 1 │ 0 │ 1.01 │ 413.6MB │ 19.9% │ 2025-11-13T13:35:43.499Z │ Done │
╰────┴────────┴───────┴────────┴─────────────┴───────────┴──────────┴────────────┴──────────────────────────┴────────╯
Total: 3 results
View only failed jobs:
torc results list <workflow_id> --failed
Results for workflow ID 2:
╭────┬────────┬───────┬────────┬─────────────┬───────────┬──────────┬────────────┬──────────────────────────┬────────╮
│ ID │ Job ID │ WF ID │ Run ID │ Return Code │ Exec Time │ Peak Mem │ Peak CPU % │ Completion Time │ Status │
├────┼────────┼───────┼────────┼─────────────┼───────────┼──────────┼────────────┼──────────────────────────┼────────┤
│ 4 │ 6 │ 2 │ 1 │ 1 │ 1.01 │ 73.8MB │ 21.9% │ 2025-11-13T13:35:43.289Z │ Done │
╰────┴────────┴───────┴────────┴─────────────┴───────────┴──────────┴────────────┴──────────────────────────┴────────╯
Generate a debugging report for a workflow:
# Generate report for a specific workflow
torc reports results <workflow_id>
# Specify custom output directory (default: "output")
torc reports results <workflow_id> --output-dir /path/to/output
# Include all workflow runs (default: only latest run)
torc reports results <workflow_id> --all-runs
# Interactive workflow selection (if workflow_id omitted)
torc reports results
The command outputs a comprehensive JSON report to stdout. Redirect it to a file for easier analysis:
torc reports results <workflow_id> > debug_report.json
Report Structure
Top-Level Fields
The JSON report includes workflow-level information:
{
"workflow_id": 123,
"workflow_name": "my_pipeline",
"workflow_user": "researcher",
"all_runs": false,
"total_results": 5,
"results": [...]
}
Fields:
workflow_id: Unique identifier for the workflowworkflow_name: Human-readable workflow nameworkflow_user: Owner of the workflowall_runs: Whether report includes all historical runs or just the latesttotal_results: Number of job results in the reportresults: Array of individual job result records
Job Result Records
Each entry in the results array contains detailed information about a single job execution:
{
"job_id": 456,
"job_name": "preprocess_data",
"status": "Done",
"run_id": 1,
"return_code": 0,
"completion_time": "2024-01-15T14:30:00.000Z",
"exec_time_minutes": 5.2,
"compute_node_id": 789,
"compute_node_type": "local",
"job_stdout": "output/job_stdio/job_456.o",
"job_stderr": "output/job_stdio/job_456.e",
"job_runner_log": "output/job_runner_hostname_123_1.log"
}
Core Fields:
job_id: Unique identifier for the jobjob_name: Human-readable job name from workflow specstatus: Job status (Done, Terminated, Failed, etc.)run_id: Workflow run number (increments on reinitialization)return_code: Exit code from job command (0 = success)completion_time: ISO 8601 timestamp when job completedexec_time_minutes: Duration of job execution in minutes
Compute Node Fields:
compute_node_id: ID of the compute node that executed the jobcompute_node_type: Type of compute node ("local" or "slurm")
Log File Paths
The report includes paths to all log files associated with each job. The specific files depend on the compute node type.
Local Runner Log Files
For jobs executed by the local job runner (compute_node_type: "local"):
{
"job_stdout": "output/job_stdio/job_456.o",
"job_stderr": "output/job_stdio/job_456.e",
"job_runner_log": "output/job_runner_hostname_123_1.log"
}
Log File Descriptions:
-
job_stdout (
output/job_stdio/job_<workflow_id>_<job_id>_<run_id>.o):- Standard output from your job command
- Contains print statements, normal program output
- Use for: Checking expected output, debugging logic errors
-
job_stderr (
output/job_stdio/job_<workflow_id>_<job_id>_<run_id>.e):- Standard error from your job command
- Contains error messages, warnings, stack traces
- Use for: Investigating crashes, exceptions, error messages
-
job_runner_log (
output/job_runner_<hostname>_<workflow_id>_<run_id>.log):- Internal Torc job runner logging
- Shows job lifecycle events, resource monitoring, process management
- Use for: Understanding Torc's job execution behavior, timing issues
Log path format conventions:
- Job stdio logs use job ID in filename
- Runner logs use hostname, workflow ID, and run ID
- All paths are relative to the specified
--output-dir
Slurm Runner Log Files
For jobs executed via Slurm scheduler (compute_node_type: "slurm"), additional log files are
available including Slurm stdout/stderr, environment logs, and dmesg logs.
See Debugging Slurm Workflows for detailed information about Slurm-specific log files and debugging tools.
Using the torc-dash Debugging Tab
The torc-dash web interface provides an interactive Debugging tab for visual debugging of workflow jobs. This is often the quickest way to investigate failed jobs without using command-line tools.
Accessing the Debugging Tab
- Start torc-dash (standalone mode recommended for quick setup):
  torc-dash --standalone
- Open your browser to http://localhost:8090
- Select a workflow from the dropdown in the sidebar
- Click the Debugging tab in the navigation
Features
Job Results Report
The Debugging tab provides a report generator with the following options:
- Output Directory: Specify where job logs are stored (default: output). This must match the directory used during workflow execution.
- Include all runs: Check this to see results from all workflow runs, not just the latest. Useful for comparing job behavior across reinitializations.
- Show only failed jobs: Filter to display only jobs with non-zero return codes. This is checked by default to help you focus on problematic jobs.
Click Generate Report to fetch job results from the server.
Job Results Table
After generating a report, the Debugging tab displays an interactive table showing:
- Job ID: Unique identifier for the job
- Job Name: Human-readable name from the workflow spec
- Status: Job completion status (Done, Terminated, etc.)
- Return Code: Exit code (0 = success, non-zero = failure)
- Execution Time: Duration in minutes
- Run ID: Which workflow run the result is from
Click any row to select a job and view its log files.
Log File Viewer
When you select a job from the table, the Log File Viewer displays:
- stdout tab: Standard output from the job command
  - Shows print statements and normal program output
  - Useful for checking expected behavior and debugging logic
- stderr tab: Standard error from the job command
  - Shows error messages, warnings, and stack traces
  - Primary location for investigating crashes and exceptions
Each tab includes:
- Copy Path button: Copy the full file path to clipboard
- File path display: Shows where the log file is located
- Scrollable content viewer: Dark-themed viewer for easy reading
Quick Debugging Workflow with torc-dash
- Open torc-dash and select your workflow from the sidebar
- Go to the Debugging tab
- Ensure "Show only failed jobs" is checked
- Click Generate Report
- Click on a failed job in the results table
- Review the stderr tab for error messages
- Check the stdout tab for context about what the job was doing
When to Use torc-dash vs CLI
Use torc-dash Debugging tab when:
- You want a visual, interactive debugging experience
- You need to quickly scan multiple failed jobs
- You're investigating jobs and want to easily switch between stdout/stderr
- You prefer not to construct jq queries manually
Use CLI tools (torc reports results) when:
- You need to automate failure detection in CI/CD
- You want to save reports for archival or version control
- You're working on a remote server without browser access
- You need to process results programmatically
Common Debugging Workflows
Investigating Failed Jobs
When a job fails, follow these steps:
- Generate the debug report:
  torc reports results <workflow_id> > debug_report.json
- Find the failed job using jq or a similar tool:
  # Find jobs with non-zero return codes
  jq '.results[] | select(.return_code != 0)' debug_report.json
  # Find jobs with a specific status
  jq '.results[] | select(.status == "Done")' debug_report.json
- Check the job's stderr for error messages:
  # Extract the stderr path for a specific job
  STDERR_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_stderr' debug_report.json)
  # View the error output
  cat "$STDERR_PATH"
- Review the job's stdout for context:
  STDOUT_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_stdout' debug_report.json)
  cat "$STDOUT_PATH"
- Check runner logs for execution issues:
  LOG_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_runner_log' debug_report.json)
  cat "$LOG_PATH"
Searching Log Files with Grep
Torc's log messages use a structured key=value format that makes them easy to search with grep.
This is especially useful for tracing specific jobs or workflows across multiple log files.
Search for all log entries related to a specific workflow:
# Find all log lines for workflow 123
grep -r "workflow_id=123" output/
# Find all log lines for workflow 123 in job runner logs only
grep -r "workflow_id=123" output/job_runner_*.log
Search for a specific job:
# Find all log lines for job 456
grep -r "job_id=456" output/
# Find log lines for job 456 with more context (2 lines before/after)
grep -r -C 2 "job_id=456" output/
Combine workflow and job searches:
# Find log lines for job 456 in workflow 123
grep -r "workflow_id=123" output/ | grep "job_id=456"
# Alternative using extended regex
grep -rE "workflow_id=123.*job_id=456" output/
Search for specific runs or attempts:
# Find all log lines for run 2 of workflow 123
grep -r "workflow_id=123" output/ | grep "run_id=2"
# Find retry attempts for a specific job
grep -r "job_id=456" output/ | grep "attempt_id="
# Find entries for a specific compute node
grep -r "compute_node_id=789" output/
Common log message patterns to search for:
# Find job start events
grep -r "Job started workflow_id=" output/
# Find job completion events
grep -r "Job completed workflow_id=" output/
# Find failed jobs
grep -r "status=failed" output/
# Find all job process completions with return codes
grep -r "Job process completed" output/ | grep "return_code="
Tip: Redirect grep output to a file for easier analysis of large result sets:
grep -r "workflow_id=123" output/ > workflow_123_logs.txt
Example: Complete Debugging Session
# 1. Generate report
torc reports results 123 > report.json
# 2. Check overall success/failure counts
echo "Total jobs: $(jq '.total_results' report.json)"
echo "Failed jobs: $(jq '[.results[] | select(.return_code != 0)] | length' report.json)"
# 3. List all failed jobs with their names
jq -r '.results[] | select(.return_code != 0) | "\(.job_id): \(.job_name) (exit code: \(.return_code))"' report.json
# Output:
# 456: process_batch_2 (exit code: 1)
# 789: validate_results (exit code: 2)
# 4. Examine stderr for first failure
jq -r '.results[] | select(.job_id == 456) | .job_stderr' report.json | xargs cat
# Output might show:
# FileNotFoundError: [Errno 2] No such file or directory: 'input/batch_2.csv'
# 5. Check if job dependencies completed successfully
# (The missing file might be an output from a previous job)
jq -r '.results[] | select(.job_name == "generate_batch_2") | "\(.status) (exit code: \(.return_code))"' report.json
Debugging Across Multiple Runs
When a workflow has been reinitialized multiple times, compare runs to identify regressions:
# Generate report with all historical runs
torc reports results <workflow_id> --all-runs > full_history.json
# Compare return codes across runs for a specific job
jq -r '.results[] | select(.job_name == "flaky_job") | "Run \(.run_id): exit code \(.return_code)"' full_history.json
# Output:
# Run 1: exit code 0
# Run 2: exit code 1
# Run 3: exit code 0
# Run 4: exit code 1
# Extract stderr paths for failed runs
jq -r '.results[] | select(.job_name == "flaky_job" and .return_code != 0) | "Run \(.run_id): \(.job_stderr)"' full_history.json
Log File Missing Warnings
The reports results command automatically checks for log file existence and prints warnings to
stderr if files are missing:
Warning: job stdout log file does not exist for job 456: output/job_stdio/job_456.o
Warning: job runner log file does not exist for job 456: output/job_runner_host1_123_1.log
Common causes of missing log files:
- Wrong output directory: Ensure --output-dir matches the directory used during workflow execution
- Logs not yet written: Job may still be running or failed to start
- Logs cleaned up: Files may have been manually deleted
- Path mismatch: Output directory moved or renamed after execution
Solution: Verify the output directory and ensure it matches what was passed to torc run or
torc slurm schedule-nodes.
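To verify which log paths from a report actually exist on disk, a small shell loop like the following can help (debug_report.json is an illustrative saved report; paths containing spaces would need a while-read loop instead):
# Print any stdout/stderr/runner log paths from the report that are missing on disk
for p in $(jq -r '.results[] | .job_stdout, .job_stderr, .job_runner_log' debug_report.json); do
    [ -f "$p" ] || echo "missing: $p"
done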
Output Directory Management
The --output-dir parameter must match the directory used during workflow execution:
Local Runner
# Execute workflow with specific output directory
torc run <workflow_id> /path/to/my_output
# Generate report using the same directory
torc reports results <workflow_id> --output-dir /path/to/my_output
Slurm Scheduler
# Submit jobs to Slurm with output directory
torc slurm schedule-nodes <workflow_id> --output-dir /path/to/my_output
# Generate report using the same directory
torc reports results <workflow_id> --output-dir /path/to/my_output
Default behavior: If --output-dir is not specified, both the runner and reports command
default to ./output.
Best Practices
- Generate reports after each run: Create a debug report immediately after workflow execution for easier troubleshooting
- Archive reports with logs: Store the JSON report alongside log files for future reference
  torc reports results "$WF_ID" > "output/report_${WF_ID}_$(date +%Y%m%d_%H%M%S).json"
- Use version control: Commit debug reports for important workflow runs to track changes over time
- Automate failure detection: Use the report in CI/CD pipelines to automatically detect and report failures (see the sketch after this list)
- Check warnings: Pay attention to warnings about missing log files - they often indicate configuration issues
- Combine with resource monitoring: Use reports results for log files and reports check-resource-utilization for performance issues
  # Check if a job failed due to resource constraints
  torc reports check-resource-utilization "$WF_ID"
  torc reports results "$WF_ID" > report.json
- Filter large reports: For workflows with many jobs, filter the report to focus on relevant jobs
  # Only include failed jobs in the filtered report
  jq '{workflow_id, workflow_name, results: [.results[] | select(.return_code != 0)]}' report.json
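A minimal sketch of the automated failure detection mentioned in the list above, suitable for a CI/CD step (WF_ID and the report path are illustrative; adjust to your pipeline):
#!/usr/bin/env bash
set -euo pipefail
WF_ID=123   # illustrative workflow ID, typically injected by the pipeline
torc reports results "$WF_ID" > report.json
FAILED=$(jq '[.results[] | select(.return_code != 0)] | length' report.json)
if [ "$FAILED" -gt 0 ]; then
    echo "Detected $FAILED failed job(s) in workflow $WF_ID"
    jq -r '.results[] | select(.return_code != 0) | "\(.job_id): \(.job_name) (exit \(.return_code))"' report.json
    exit 1
fi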
Troubleshooting Common Issues
"Output directory does not exist" Error
Cause: The specified --output-dir path doesn't exist.
Solution: Verify the directory exists and the path is correct:
ls -ld output/ # Check if directory exists
torc reports results <workflow_id> --output-dir "$(pwd)/output"
Empty Results Array
Cause: No job results exist for the workflow (jobs not yet executed or initialized).
Solution: Check workflow status and ensure jobs have been completed:
torc workflows status <workflow_id>
torc results list <workflow_id> # Verify results exist
All Log Paths Show Warnings
Cause: Output directory mismatch between execution and report generation.
Solution: Verify the output directory used during execution:
# Check where logs actually are
find . -name "job_*.o" -o -name "job_runner_*.log"
# Use correct output directory in report
torc reports results <workflow_id> --output-dir <correct_path>
Related Commands
- torc results list: View a summary of job results in table format
- torc workflows status: Check overall workflow status
- torc reports results: Generate a debug report with all log file paths
- torc reports check-resource-utilization: Analyze resource usage and find over-utilized jobs
- torc jobs list: View all jobs and their current status
- torc-dash: Launch the web interface with the interactive Debugging tab
- torc tui: Launch the terminal UI for workflow monitoring
See Also
- Working with Logs — Bundling and analyzing logs
- Debugging Slurm Workflows — Slurm-specific debugging tools
Working with Logs
Torc provides tools for bundling and analyzing workflow logs. These are useful for:
- Sharing logs with colleagues for help debugging
- Archiving completed workflow logs for later reference
- Scanning for errors across all log files at once
Log File Overview
Torc generates several types of log files during workflow execution:
| Log Type | Path Pattern | Contents |
|---|---|---|
| Job stdout | output/job_stdio/job_wf<id>_j<job>_r<run>.o | Standard output from job commands |
| Job stderr | output/job_stdio/job_wf<id>_j<job>_r<run>.e | Error output, stack traces |
| Job runner | output/job_runner_*.log | Torc job runner internal logs |
| Slurm stdout | output/slurm_output_wf<id>_sl<slurm_id>.o | Slurm job allocation output |
| Slurm stderr | output/slurm_output_wf<id>_sl<slurm_id>.e | Slurm-specific errors |
| Slurm env | output/slurm_env_*.log | Slurm environment variables |
| dmesg | output/dmesg_slurm_*.log | Kernel messages (on failure) |
For detailed information about log file contents, see Debugging Workflows and Debugging Slurm Workflows.
Bundling Logs
The torc logs bundle command packages all logs for a workflow into a compressed tarball:
# Bundle all logs for a workflow
torc logs bundle <workflow_id>
# Specify custom output directory (where logs are located)
torc logs bundle <workflow_id> --output-dir /path/to/output
# Save bundle to a specific directory
torc logs bundle <workflow_id> --bundle-dir /path/to/bundles
This creates a wf<id>.tar.gz file containing:
- All job stdout/stderr files (job_wf*_j*_r*.o/e)
- Job runner logs (job_runner_*.log)
- Slurm output files (slurm_output_wf*_sl*.o/e)
- Slurm environment logs (slurm_env_wf*_sl*.log)
- dmesg logs (dmesg_slurm_wf*_sl*.log)
- Bundle metadata (workflow info, collection timestamp)
Example: Sharing Logs
# Bundle workflow logs
torc logs bundle 123 --bundle-dir ./bundles
# Share the bundle
ls ./bundles/
# wf123.tar.gz
# Recipient can extract and analyze
tar -xzf wf123.tar.gz
torc logs analyze wf123/
Analyzing Logs
The torc logs analyze command scans log files for known error patterns:
# Analyze a log bundle tarball
torc logs analyze wf123.tar.gz
# Analyze a log directory directly (auto-detects workflow if only one present)
torc logs analyze output/
# Analyze a directory with multiple workflows (specify which one)
torc logs analyze output/ --workflow-id 123
Detected Error Patterns
The analyzer scans for common failure patterns including:
Memory Errors:
- Out of memory, OOM kills
- std::bad_alloc (C++)
- MemoryError (Python)
Slurm Errors:
- Time limit exceeded
- Node failures
- Preemption
GPU/CUDA Errors:
- CUDA out of memory
- GPU memory exceeded
Crashes:
- Segmentation faults
- Bus errors
- Signal kills
Python Errors:
- Tracebacks
- Import errors
File System Errors:
- No space left on device
- Permission denied
Network Errors:
- Connection refused/timed out
Example Output
Log Analysis Results
====================
Analyzing: output/
Files with detected errors:
output/job_stdio/job_wf123_j456_r1.e
Line 42: MemoryError: Unable to allocate 8.00 GiB
Severity: critical
Type: Python Memory Error
output/slurm_output_wf123_sl789.e
Line 15: slurmstepd: error: Detected 1 oom-kill event(s)
Severity: critical
Type: Out of Memory (OOM) Kill
Summary:
Total files scanned: 24
Files with errors: 2
Error types found: MemoryError, OOM Kill
Excluding Files
Environment variable files (slurm_env_*.log) are automatically excluded from error analysis since
they contain configuration data, not error logs.
Workflow: Bundle and Share
A common pattern when asking for help:
# 1. Bundle the workflow logs
torc logs bundle <workflow_id>
# 2. Analyze locally first to understand the issue
torc logs analyze wf<id>.tar.gz
# 3. Share the bundle with your colleague/support
# They can extract and analyze:
tar -xzf wf<id>.tar.gz
torc logs analyze wf<id>/
Related Commands
- torc reports results: Generate a JSON report with all log file paths
- torc results list: View a summary table of job return codes
- torc slurm parse-logs: Parse Slurm logs for error patterns (Slurm-specific)
- torc slurm sacct: Collect Slurm accounting data
See Also
- Debugging Workflows — General debugging workflow and log file details
- Debugging Slurm Workflows — Slurm-specific debugging tools
Reference
Comprehensive reference documentation.
- CLI Reference - Command-line interface documentation
- Job Parameterization - Parameter syntax and options
- Resource Requirements - Specifying job resources
- Environment Variables - Configuration via environment
- Resource Monitoring Database - Metrics database schema
- Configuration - Configuration file reference
- OpenAPI Specification - REST API documentation
CLI Reference
This documentation is automatically generated from the CLI help text.
To regenerate, run:
cargo run --bin generate-cli-docs --features "client,tui,plot_resources"
Command-Line Help for torc
This document contains the help content for the torc command-line program.
torc
Torc workflow orchestration system
Usage: torc [OPTIONS] <COMMAND>
Subcommands:
- run — Run a workflow locally (create from spec file or run existing workflow by ID)
- submit — Submit a workflow to scheduler (create from spec file or submit existing workflow by ID)
- submit-slurm — Submit a workflow to Slurm with auto-generated schedulers
- watch — Watch a workflow and automatically recover from failures
- workflows — Workflow management commands
- compute-nodes — Compute node management commands
- files — File management commands
- jobs — Job management commands
- job-dependencies — Job dependency and relationship queries
- resource-requirements — Resource requirements management commands
- events — Event management commands
- results — Result management commands
- user-data — User data management commands
- slurm — Slurm scheduler commands
- remote — Remote worker execution commands (SSH-based distributed execution)
- scheduled-compute-nodes — Scheduled compute node management commands
- hpc — HPC system profiles and partition information
- reports — Generate reports and analytics
- config — Manage configuration files and settings
- tui — Interactive terminal UI for managing workflows
- plot-resources — Generate interactive HTML plots from resource monitoring data
- completions — Generate shell completions
Options:
- --log-level <LOG_LEVEL> — Log level (error, warn, info, debug, trace)
- -f, --format <FORMAT> — Output format (table or json)
  Default value: table
- --url <URL> — URL of torc server
- --username <USERNAME> — Username for basic authentication
- --password <PASSWORD> — Password for basic authentication (will prompt if username provided but password not)
torc run
Run a workflow locally (create from spec file or run existing workflow by ID)
Usage: torc run [OPTIONS] <WORKFLOW_SPEC_OR_ID>
Arguments:
- <WORKFLOW_SPEC_OR_ID> — Path to workflow spec file (JSON/JSON5/YAML) or workflow ID
Options:
- --max-parallel-jobs <MAX_PARALLEL_JOBS> — Maximum number of parallel jobs to run concurrently
- --num-cpus <NUM_CPUS> — Number of CPUs available
- --memory-gb <MEMORY_GB> — Memory in GB
- --num-gpus <NUM_GPUS> — Number of GPUs available
- -p, --poll-interval <POLL_INTERVAL> — Job completion poll interval in seconds
- -o, --output-dir <OUTPUT_DIR> — Output directory for jobs
- --skip-checks — Skip validation checks (e.g., scheduler node requirements). Use with caution
  Default value: false
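For example (spec file, workflow ID, and paths are illustrative):
# Run a workflow directly from a spec file with a custom output directory
torc run my_workflow.yaml --output-dir /path/to/my_output
# Run an existing workflow by ID, capping parallelism
torc run 123 --max-parallel-jobs 8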
torc submit
Submit a workflow to scheduler (create from spec file or submit existing workflow by ID)
Requires the workflow to have an on_workflow_start action with schedule_nodes. For Slurm workflows without pre-configured schedulers, use submit-slurm instead.
Usage: torc submit [OPTIONS] <WORKFLOW_SPEC_OR_ID>
Arguments:
- <WORKFLOW_SPEC_OR_ID> — Path to workflow spec file (JSON/JSON5/YAML) or workflow ID
Options:
- -i, --ignore-missing-data — Ignore missing data (defaults to false)
  Default value: false
- --skip-checks — Skip validation checks (e.g., scheduler node requirements). Use with caution
  Default value: false
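For example (spec file and workflow ID are illustrative):
# Create and submit from a spec file (the spec must define an on_workflow_start schedule_nodes action)
torc submit my_workflow.yaml
# Submit an existing workflow by ID
torc submit 123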
torc submit-slurm
Submit a workflow to Slurm with auto-generated schedulers
Automatically generates Slurm schedulers based on job resource requirements and HPC profile.
WARNING: This command uses heuristics to generate schedulers and workflow actions. For complex workflows with unusual dependency patterns, the generated configuration may not be optimal and could waste allocation time.
RECOMMENDED: Preview the generated configuration first with:
torc slurm generate --account
Review the schedulers and actions to ensure they are appropriate for your workflow before submitting. You can save the output and submit manually:
torc slurm generate --account
Usage: torc submit-slurm [OPTIONS] --account <ACCOUNT> <WORKFLOW_SPEC>
Arguments:
- <WORKFLOW_SPEC> — Path to workflow spec file (JSON/JSON5/YAML/KDL)
Options:
- --account <ACCOUNT> — Slurm account to use for allocations
- --hpc-profile <HPC_PROFILE> — HPC profile to use (auto-detected if not specified)
- --single-allocation — Bundle all nodes into a single Slurm allocation per scheduler
  By default, creates one Slurm allocation per node (N×1 mode), which allows jobs to start as nodes become available and provides better fault tolerance.
  With this flag, creates one large allocation with all nodes (1×N mode), which requires all nodes to be available simultaneously but uses a single sbatch.
- -i, --ignore-missing-data — Ignore missing data (defaults to false)
  Default value: false
- --skip-checks — Skip validation checks (e.g., scheduler node requirements). Use with caution
  Default value: false
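For example (account name and spec file are illustrative):
# Submit with one allocation per node (default N×1 mode)
torc submit-slurm --account my_account my_workflow.yaml
# Bundle all nodes into a single allocation per scheduler (1×N mode)
torc submit-slurm --account my_account --single-allocation my_workflow.yaml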
torc watch
Watch a workflow and automatically recover from failures.
Monitors a workflow until completion. With --recover, automatically diagnoses failures, adjusts
resource requirements, and resubmits jobs.
Usage: torc watch [OPTIONS] <WORKFLOW_ID>
Usage Modes
- Basic monitoring (no recovery):
  torc watch 123
  Reports failures and exits. Use for manual intervention or AI-assisted recovery.
- With automatic recovery (--recover):
  torc watch 123 --recover
  Automatically diagnoses OOM/timeout failures, adjusts resources, and retries. Runs until all jobs complete or max retries are exceeded.
- With auto-scheduling (--auto-schedule):
  torc watch 123 --auto-schedule
  Automatically submits new Slurm allocations when retry jobs are waiting. Essential for workflows using failure handlers that create retry jobs.
Arguments
- <WORKFLOW_ID> — Workflow ID to watch
Options
Polling:
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds (default: 60)
- -o, --output-dir <OUTPUT_DIR> — Output directory for job files (default: output)
- -s, --show-job-counts — Show job counts by status during polling. WARNING: Can cause high server load for large workflows.
Recovery:
- -r, --recover — Enable automatic failure recovery
- -m, --max-retries <MAX_RETRIES> — Maximum number of recovery attempts (default: 3)
- --memory-multiplier <MEMORY_MULTIPLIER> — Memory multiplier for OOM failures (default: 1.5)
- --runtime-multiplier <RUNTIME_MULTIPLIER> — Runtime multiplier for timeout failures (default: 1.5)
- --retry-unknown — Also retry jobs with unknown failure causes (not just OOM or timeout)
- --recovery-hook <RECOVERY_HOOK> — Custom recovery script for unknown failures. The workflow ID is passed as an argument and via the TORC_WORKFLOW_ID environment variable.
Auto-scheduling:
- --auto-schedule — Automatically schedule new compute nodes when needed
- --auto-schedule-threshold <N> — Minimum retry jobs before auto-scheduling when schedulers exist (default: 5)
- --auto-schedule-cooldown <SECONDS> — Cooldown between auto-schedule attempts (default: 1800 / 30 min)
- --auto-schedule-stranded-timeout <SECONDS> — Schedule stranded jobs after this timeout even if below threshold (default: 7200 / 2 hrs). Set to 0 to disable.
Auto-Scheduling Behavior
When --auto-schedule is enabled:
- No schedulers available: Immediately submits new allocations if ready jobs exist.
- Threshold exceeded: If retry jobs (attempt_id > 1) exceed --auto-schedule-threshold while schedulers are running, submits additional allocations after cooldown.
- Stranded jobs: If retry jobs are below the threshold but have been waiting longer than --auto-schedule-stranded-timeout, schedules anyway to prevent indefinite waiting.
Examples
# Basic: watch until completion, report failures
torc watch 123
# Recovery: automatically fix OOM/timeout failures
torc watch 123 --recover
# Recovery with aggressive resource increases
torc watch 123 --recover --memory-multiplier 2.0 --runtime-multiplier 2.0
# Recovery including unknown failures (transient errors)
torc watch 123 --recover --retry-unknown
# Auto-schedule: ensure retry jobs get scheduled
torc watch 123 --auto-schedule
# Full production setup: recovery + auto-scheduling
torc watch 123 --recover --auto-schedule
# Custom auto-schedule settings
torc watch 123 --auto-schedule \
--auto-schedule-threshold 10 \
--auto-schedule-cooldown 3600 \
--auto-schedule-stranded-timeout 14400
See Also
- torc recover — One-shot recovery (no continuous monitoring)
- Automatic Failure Recovery — Detailed guide
torc recover
Recover a Slurm workflow from failures (one-shot).
Diagnoses job failures (OOM, timeout), adjusts resource requirements, and resubmits jobs. Use after
a workflow has completed with failures. For continuous monitoring, use torc watch --recover
instead.
Usage: torc recover [OPTIONS] <WORKFLOW_ID>
Arguments
- <WORKFLOW_ID> — Workflow ID to recover
Options
- -o, --output-dir <OUTPUT_DIR> — Output directory for job files (default: output)
- --memory-multiplier <MEMORY_MULTIPLIER> — Memory multiplier for OOM failures (default: 1.5)
- --runtime-multiplier <RUNTIME_MULTIPLIER> — Runtime multiplier for timeout failures (default: 1.4)
- --retry-unknown — Also retry jobs with unknown failure causes
- --recovery-hook <RECOVERY_HOOK> — Custom recovery script for unknown failures
- --dry-run — Show what would be done without making any changes
When to Use
Use torc recover for:
- One-shot recovery after a workflow has completed with failures
- Manual investigation before retrying (use --dry-run first)
- Workflows where you want to inspect failures before retrying
Use torc watch --recover instead for:
- Continuous monitoring of long-running workflows
- Fully automated recovery without manual intervention
- Production workflows that should self-heal
Examples
# Basic recovery
torc recover 123
# Dry run to preview changes without modifying anything
torc recover 123 --dry-run
# Custom resource multipliers
torc recover 123 --memory-multiplier 2.0 --runtime-multiplier 1.5
# Also retry unknown failures (not just OOM/timeout)
torc recover 123 --retry-unknown
# With custom recovery hook for domain-specific fixes
torc recover 123 --recovery-hook 'bash fix-cluster.sh'
See Also
- torc watch --recover — Continuous monitoring with automatic recovery
- Automatic Failure Recovery — Detailed guide
torc workflows
Workflow management commands
Usage: torc workflows <COMMAND>
Subcommands:
- create — Create a workflow from a specification file (supports JSON, JSON5, YAML, and KDL formats)
- create-slurm — Create a workflow with auto-generated Slurm schedulers
- new — Create a new empty workflow
- list — List workflows
- get — Get a specific workflow by ID
- update — Update an existing workflow
- cancel — Cancel a workflow and all associated Slurm jobs
- delete — Delete one or more workflows
- archive — Archive or unarchive one or more workflows
- submit — Submit a workflow: initialize if needed and schedule nodes for on_workflow_start actions. This command requires the workflow to have an on_workflow_start action with schedule_nodes
- run — Run a workflow locally on the current node
- initialize — Initialize a workflow, including all job statuses
- reinitialize — Reinitialize a workflow. This will reinitialize all jobs with a status of canceled, submitting, pending, or terminated. Jobs with a status of done will also be reinitialized if an input_file or user_data record has changed
- status — Get workflow status
- reset-status — Reset workflow and job status
- execution-plan — Show the execution plan for a workflow specification or existing workflow
- list-actions — List workflow actions and their statuses (useful for debugging action triggers)
- is-complete — Check if a workflow is complete
- export — Export a workflow to a portable JSON file
- import — Import a workflow from an exported JSON file
- sync-status — Synchronize job statuses with Slurm (detect and fail orphaned jobs)
torc workflows create
Create a workflow from a specification file (supports JSON, JSON5, YAML, and KDL formats)
Usage: torc workflows create [OPTIONS] --user <USER> <FILE>
Arguments:
- <FILE> — Path to specification file containing WorkflowSpec
  Supported formats:
  - JSON (.json): Standard JSON format
  - JSON5 (.json5): JSON with comments and trailing commas
  - YAML (.yaml, .yml): Human-readable YAML format
  - KDL (.kdl): KDL document format
  Format is auto-detected from the file extension, with fallback parsing attempted
Options:
- -u, --user <USER> — User that owns the workflow (defaults to USER environment variable)
- --no-resource-monitoring — Disable resource monitoring (default: enabled with summary granularity and 5s sample rate)
  Default value: false
- --skip-checks — Skip validation checks (e.g., scheduler node requirements). Use with caution
  Default value: false
- --dry-run — Validate the workflow specification without creating it (dry-run mode). Returns a summary of what would be created, including job count after parameter expansion
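For example (file name is illustrative):
# Validate the specification without creating anything
torc workflows create my_workflow.yaml --user "$USER" --dry-run
# Create the workflow once the dry run looks correct
torc workflows create my_workflow.yaml --user "$USER"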
torc workflows create-slurm
Create a workflow with auto-generated Slurm schedulers
Automatically generates Slurm schedulers based on job resource requirements and HPC profile. For Slurm workflows without pre-configured schedulers.
Usage: torc workflows create-slurm [OPTIONS] --account <ACCOUNT> --user <USER> <FILE>
Arguments:
- <FILE> — Path to specification file containing WorkflowSpec
Options:
- --account <ACCOUNT> — Slurm account to use for allocations
- --hpc-profile <HPC_PROFILE> — HPC profile to use (auto-detected if not specified)
- --single-allocation — Bundle all nodes into a single Slurm allocation per scheduler
  By default, creates one Slurm allocation per node (N×1 mode), which allows jobs to start as nodes become available and provides better fault tolerance.
  With this flag, creates one large allocation with all nodes (1×N mode), which requires all nodes to be available simultaneously but uses a single sbatch.
- -u, --user <USER> — User that owns the workflow (defaults to USER environment variable)
- --no-resource-monitoring — Disable resource monitoring (default: enabled with summary granularity and 5s sample rate)
  Default value: false
- --skip-checks — Skip validation checks (e.g., scheduler node requirements). Use with caution
  Default value: false
- --dry-run — Validate the workflow specification without creating it (dry-run mode)
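For example (account and file name are illustrative):
torc workflows create-slurm --account my_account --user "$USER" my_workflow.yaml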
torc workflows new
Create a new empty workflow
Usage: torc workflows new [OPTIONS] --name <NAME> --user <USER>
Options:
- -n, --name <NAME> — Name of the workflow
- -d, --description <DESCRIPTION> — Description of the workflow
- -u, --user <USER> — User that owns the workflow (defaults to USER environment variable)
torc workflows list
List workflows
Usage: torc workflows list [OPTIONS]
Options:
- -u, --user <USER> — User to filter by (defaults to USER environment variable)
- --all-users — List workflows for all users (overrides --user)
- -l, --limit <LIMIT> — Maximum number of workflows to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
- --archived-only — Show only archived workflows
  Default value: false
- --include-archived — Include both archived and non-archived workflows
  Default value: false
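For example:
# List up to 20 of your workflows
torc workflows list --limit 20
# Include archived workflows from all users, as JSON for scripting
torc -f json workflows list --all-users --include-archived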
torc workflows get
Get a specific workflow by ID
Usage: torc workflows get [OPTIONS] [ID]
Arguments:
- <ID> — ID of the workflow to get (optional - will prompt if not provided)
Options:
- -u, --user <USER> — User to filter by (defaults to USER environment variable)
torc workflows update
Update an existing workflow
Usage: torc workflows update [OPTIONS] [ID]
Arguments:
- <ID> — ID of the workflow to update (optional - will prompt if not provided)
Options:
- -n, --name <NAME> — Name of the workflow
- -d, --description <DESCRIPTION> — Description of the workflow
- --owner-user <OWNER_USER> — User that owns the workflow
torc workflows cancel
Cancel a workflow and all associated Slurm jobs
Usage: torc workflows cancel [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to cancel (optional - will prompt if not provided)
torc workflows delete
Delete one or more workflows
Usage: torc workflows delete [OPTIONS] [IDS]...
Arguments:
- <IDS> — IDs of workflows to remove (optional - will prompt if not provided)
Options:
- --no-prompts — Skip confirmation prompt
- --force — Force deletion even if workflow belongs to a different user
torc workflows archive
Archive or unarchive one or more workflows
Usage: torc workflows archive <IS_ARCHIVED> [WORKFLOW_IDS]...
Arguments:
- <IS_ARCHIVED> — Set to true to archive, false to unarchive
- <WORKFLOW_IDS> — IDs of workflows to archive/unarchive (if empty, will prompt for selection)
torc workflows submit
Submit a workflow: initialize if needed and schedule nodes for on_workflow_start actions. This command requires the workflow to have an on_workflow_start action with schedule_nodes.
Usage: torc workflows submit [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to submit (optional - will prompt if not provided)
Options:
- --force — If false, fail the operation if missing data is present (defaults to false)
  Default value: false
torc workflows run
Run a workflow locally on the current node
Usage: torc workflows run [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to run (optional - will prompt if not provided)
Options:
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds for checking job completion
  Default value: 5.0
- --max-parallel-jobs <MAX_PARALLEL_JOBS> — Maximum number of parallel jobs to run (defaults to available CPUs)
- --output-dir <OUTPUT_DIR> — Output directory for job logs and results
  Default value: output
torc workflows initialize
Initialize a workflow, including all job statuses
Usage: torc workflows initialize [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to start (optional - will prompt if not provided)
Options:
- --force — If false, fail the operation if missing data is present (defaults to false)
  Default value: false
- --no-prompts — Skip confirmation prompt
- --dry-run — Perform a dry run without making changes
torc workflows reinitialize
Reinitialize a workflow. This will reinitialize all jobs with a status of canceled, submitting, pending, or terminated. Jobs with a status of done will also be reinitialized if an input_file or user_data record has changed.
Usage: torc workflows reinitialize [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to reinitialize (optional - will prompt if not provided)
Options:
- --force — If false, fail the operation if missing data is present (defaults to false)
  Default value: false
- --dry-run — Perform a dry run without making changes
torc workflows status
Get workflow status
Usage: torc workflows status [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to get status for (optional - will prompt if not provided)
Options:
- -u, --user <USER> — User to filter by (defaults to USER environment variable)
torc workflows reset-status
Reset workflow and job status
Usage: torc workflows reset-status [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to reset status for (optional - will prompt if not provided)
Options:
- --failed-only — Only reset failed jobs
  Default value: false
- -r, --reinitialize — Reinitialize the workflow after resetting status
  Default value: false
- --force — Force reset even if there are active jobs (ignores running/pending jobs check)
  Default value: false
- --no-prompts — Skip confirmation prompt
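For example (workflow ID is illustrative):
# Reset only failed jobs and reinitialize them so they can run again
torc workflows reset-status 123 --failed-only --reinitialize
# Skip the confirmation prompt in scripts
torc workflows reset-status 123 --failed-only --no-prompts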
torc workflows execution-plan
Show the execution plan for a workflow specification or existing workflow
Usage: torc workflows execution-plan <SPEC_OR_ID>
Arguments:
- <SPEC_OR_ID> — Path to specification file OR workflow ID
torc workflows list-actions
List workflow actions and their statuses (useful for debugging action triggers)
Usage: torc workflows list-actions [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to show actions for (optional - will prompt if not provided)
Options:
- -u, --user <USER> — User to filter by when selecting workflow interactively (defaults to USER environment variable)
torc workflows is-complete
Check if a workflow is complete
Usage: torc workflows is-complete [ID]
Arguments:
- <ID> — ID of the workflow to check (optional - will prompt if not provided)
torc workflows export
Export a workflow to a portable JSON file
Creates a self-contained export that can be imported into the same or different torc-server instance. All entity IDs are preserved in the export and remapped during import.
Usage: torc workflows export [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to export (optional - will prompt if not provided)
Options:
- -o, --output <OUTPUT> — Output file path (default: stdout)
- --include-results — Include job results in export
- --include-events — Include events (workflow history) in export
Examples:
# Export workflow to stdout
torc workflows export 123
# Export to a file
torc workflows export 123 -o workflow.json
# Include job results in export
torc workflows export 123 --include-results -o backup.json
# Export with all optional data
torc workflows export 123 --include-results --include-events -o complete.json
torc workflows import
Import a workflow from an exported JSON file
Imports a workflow that was previously exported. All entity IDs are remapped to new IDs assigned by the server. By default, all job statuses are reset to uninitialized for a fresh start.
Usage: torc workflows import [OPTIONS] <FILE>
Arguments:
- <FILE> — Path to the exported workflow JSON file (use '-' for stdin)
Options:
- --name <NAME> — Override the workflow name
- --skip-results — Skip importing results even if present in export
- --skip-events — Skip importing events even if present in export
Examples:
# Import a workflow (resets job statuses by default)
torc workflows import workflow.json
# Import from stdin
cat workflow.json | torc workflows import -
# Import with a different name
torc workflows import workflow.json --name 'my-copy'
# Skip importing results even if present in file
torc workflows import workflow.json --skip-results
torc workflows sync-status
Synchronize job statuses with Slurm (detect and fail orphaned jobs)
This command detects jobs that are stuck in "running" status because their Slurm allocation terminated unexpectedly (e.g., due to timeout, node failure, or admin intervention). It marks these orphaned jobs as failed so the workflow can be recovered or restarted.
Use this when:
- torc recover reports "there are active Slurm allocations" but squeue shows none
- Jobs appear stuck in "running" status after a Slurm allocation ended
- You want to clean up workflow state before running torc recover
Usage: torc workflows sync-status [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow to sync (optional - will prompt if not provided)
Options:
- --dry-run — Preview changes without applying them
Examples:
# Preview what would be cleaned up
torc workflows sync-status 123 --dry-run
# Clean up orphaned jobs
torc workflows sync-status 123
# Get JSON output for scripting
torc -f json workflows sync-status 123
torc compute-nodes
Compute node management commands
Usage: torc compute-nodes <COMMAND>
Subcommands:
- get — Get a specific compute node by ID
- list — List compute nodes for a workflow
torc compute-nodes get
Get a specific compute node by ID
Usage: torc compute-nodes get <ID>
Arguments:
- <ID> — ID of the compute node
torc compute-nodes list
List compute nodes for a workflow
Usage: torc compute-nodes list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List compute nodes for this workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of compute nodes to return
  Default value: 10000
- -o, --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- -s, --sort-by <SORT_BY> — Field to sort by
- -r, --reverse-sort — Reverse sort order
  Default value: false
- --scheduled-compute-node <SCHEDULED_COMPUTE_NODE> — Filter by scheduled compute node ID
torc files
File management commands
Usage: torc files <COMMAND>
Subcommands:
- create — Create a new file
- list — List files
- get — Get a specific file by ID
- update — Update an existing file
- delete — Delete a file
- list-required-existing — List required existing files for a workflow
torc files create
Create a new file
Usage: torc files create --name <NAME> --path <PATH> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Create the file in this workflow
Options:
- -n, --name <NAME> — Name of the file
- -p, --path <PATH> — Path of the file
torc files list
List files
Usage: torc files list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List files for this workflow (optional - will prompt if not provided)
Options:
- --produced-by-job-id <PRODUCED_BY_JOB_ID> — Filter by job ID that produced the files
- -l, --limit <LIMIT> — Maximum number of files to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
torc files get
Get a specific file by ID
Usage: torc files get <ID>
Arguments:
- <ID> — ID of the file to get
torc files update
Update an existing file
Usage: torc files update [OPTIONS] <ID>
Arguments:
- <ID> — ID of the file to update
Options:
- -n, --name <NAME> — Name of the file
- -p, --path <PATH> — Path of the file
torc files delete
Delete a file
Usage: torc files delete <ID>
Arguments:
- <ID> — ID of the file to remove
torc files list-required-existing
List required existing files for a workflow
Usage: torc files list-required-existing [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List required existing files for this workflow (optional - will prompt if not provided)
torc jobs
Job management commands
Usage: torc jobs <COMMAND>
Subcommands:
- create — Create a new job
- create-from-file — Create multiple jobs from a text file containing one command per line
- list — List jobs
- get — Get a specific job by ID
- update — Update an existing job
- delete — Delete one or more jobs
- delete-all — Delete all jobs for a workflow
- list-resource-requirements — List jobs with their resource requirements
torc jobs create
Create a new job
Usage: torc jobs create [OPTIONS] --name <NAME> --command <COMMAND> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Create the job in this workflow
Options:
- -n, --name <NAME> — Name of the job
- -c, --command <COMMAND> — Command to execute
- -r, --resource-requirements-id <RESOURCE_REQUIREMENTS_ID> — Resource requirements ID for this job
- -b, --blocking-job-ids <BLOCKING_JOB_IDS> — Job IDs that block this job
- -i, --input-file-ids <INPUT_FILE_IDS> — Input files needed by this job
- -o, --output-file-ids <OUTPUT_FILE_IDS> — Output files produced by this job
torc jobs create-from-file
Create multiple jobs from a text file containing one command per line
This command reads a text file where each line contains a job command. Lines starting with '#' are treated as comments and ignored. Empty lines are also ignored.
Jobs will be named sequentially as job1, job2, job3, etc., starting from the current job count + 1 to avoid naming conflicts.
All jobs created will share the same resource requirements, which are automatically created and assigned.
Example: torc jobs create-from-file 123 batch_jobs.txt --cpus-per-job 4 --memory-per-job 8g
Usage: torc jobs create-from-file [OPTIONS] <WORKFLOW_ID> <FILE>
Arguments:
- <WORKFLOW_ID> — Workflow ID to create jobs for
- <FILE> — Path to text file containing job commands (one per line)
  File format: one command per line; lines starting with '#' are comments (ignored); empty lines are ignored.
  Example file content:
  # Data processing jobs
  python process.py --batch 1
  python process.py --batch 2
  python process.py --batch 3
Options:
- --cpus-per-job <CPUS_PER_JOB> — Number of CPUs per job
  Default value: 1
- --memory-per-job <MEMORY_PER_JOB> — Memory per job (e.g., "1m", "2g", "16g")
  Default value: 1m
- --runtime-per-job <RUNTIME_PER_JOB> — Runtime per job (ISO 8601 duration format)
  Examples: P0DT1M = 1 minute, P0DT30M = 30 minutes, P0DT2H = 2 hours, P1DT0H = 1 day
  Default value: P0DT1M
torc jobs list
List jobs
Usage: torc jobs list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List jobs for this workflow (optional - will prompt if not provided)
Options:
- -s, --status <STATUS> — Filter by job status
- --upstream-job-id <UPSTREAM_JOB_ID> — Filter by upstream job ID (jobs that depend on this job)
- -l, --limit <LIMIT> — Maximum number of jobs to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
- --include-relationships — Include job relationships (depends_on_job_ids, input/output file/user_data IDs); slower but more complete
torc jobs get
Get a specific job by ID
Usage: torc jobs get <ID>
Arguments:
- <ID> — ID of the job to get
torc jobs update
Update an existing job
Usage: torc jobs update [OPTIONS] <ID>
Arguments:
- <ID> — ID of the job to update
Options:
- -n, --name <NAME> — Name of the job
- -c, --command <COMMAND> — Command to execute
torc jobs delete
Delete one or more jobs
Usage: torc jobs delete [IDS]...
Arguments:
- <IDS> — IDs of the jobs to remove
torc jobs delete-all
Delete all jobs for a workflow
Usage: torc jobs delete-all [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to delete all jobs from (optional - will prompt if not provided)
torc jobs list-resource-requirements
List jobs with their resource requirements
Usage: torc jobs list-resource-requirements [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to list jobs from (optional - will prompt if not provided)
Options:
- -j, --job-id <JOB_ID> — Filter by specific job ID
torc job-dependencies
Job dependency and relationship queries
Usage: torc job-dependencies <COMMAND>
Subcommands:
- job-job — List job-to-job dependencies for a workflow
- job-file — List job-file relationships for a workflow
- job-user-data — List job-user_data relationships for a workflow
torc job-dependencies job-job
List job-to-job dependencies for a workflow
Usage: torc job-dependencies job-job [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of dependencies to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
torc job-dependencies job-file
List job-file relationships for a workflow
Usage: torc job-dependencies job-file [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of relationships to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
torc job-dependencies job-user-data
List job-user_data relationships for a workflow
Usage: torc job-dependencies job-user-data [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — ID of the workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of relationships to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
torc resource-requirements
Resource requirements management commands
Usage: torc resource-requirements <COMMAND>
Subcommands:
- create — Create new resource requirements
- list — List resource requirements
- get — Get a specific resource requirement by ID
- update — Update existing resource requirements
- delete — Delete resource requirements
torc resource-requirements create
Create new resource requirements
Usage: torc resource-requirements create [OPTIONS] --name <NAME> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Create resource requirements in this workflow
Options:
- -n, --name <NAME> — Name of the resource requirements
- --num-cpus <NUM_CPUS> — Number of CPUs required
  Default value: 1
- --num-gpus <NUM_GPUS> — Number of GPUs required
  Default value: 0
- --num-nodes <NUM_NODES> — Number of nodes required
  Default value: 1
- -m, --memory <MEMORY> — Amount of memory required (e.g., "20g")
  Default value: 1m
- -r, --runtime <RUNTIME> — Maximum runtime in ISO 8601 duration format (e.g., "P0DT1H")
  Default value: P0DT1M
torc resource-requirements list
List resource requirements
Usage: torc resource-requirements list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List resource requirements for this workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of resource requirements to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
torc resource-requirements get
Get a specific resource requirement by ID
Usage: torc resource-requirements get <ID>
Arguments:
- <ID> — ID of the resource requirement to get
torc resource-requirements update
Update existing resource requirements
Usage: torc resource-requirements update [OPTIONS] <ID>
Arguments:
- <ID> — ID of the resource requirement to update
Options:
- -n, --name <NAME> — Name of the resource requirements
- --num-cpus <NUM_CPUS> — Number of CPUs required
- --num-gpus <NUM_GPUS> — Number of GPUs required
- --num-nodes <NUM_NODES> — Number of nodes required
- --memory <MEMORY> — Amount of memory required (e.g., "20g")
- --runtime <RUNTIME> — Maximum runtime (e.g., "1h", "30m")
torc resource-requirements delete
Delete resource requirements
Usage: torc resource-requirements delete <ID>
Arguments:
- <ID> — ID of the resource requirement to remove
torc events
Event management commands
Usage: torc events <COMMAND>
Subcommands:
- create — Create a new event
- list — List events for a workflow
- monitor — Monitor events for a workflow in real-time
- get-latest-event — Get the latest event for a workflow
- delete — Delete an event
torc events create
Create a new event
Usage: torc events create --data <DATA> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Create the event in this workflow
Options:
- -d, --data <DATA> — JSON data for the event
torc events list
List events for a workflow
Usage: torc events list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List events for this workflow (optional - will prompt if not provided)
Options:
- -c, --category <CATEGORY> — Filter events by category
- -l, --limit <LIMIT> — Maximum number of events to return
  Default value: 10000
- -o, --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- -s, --sort-by <SORT_BY> — Field to sort by
- -r, --reverse-sort — Reverse sort order
  Default value: false
torc events monitor
Monitor events for a workflow in real-time
Usage: torc events monitor [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Monitor events for this workflow (optional - will prompt if not provided)
Options:
- -d, --duration <DURATION> — Duration to monitor in minutes (default: infinite)
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds
  Default value: 60
- -c, --category <CATEGORY> — Filter events by category
torc events get-latest-event
Get the latest event for a workflow
Usage: torc events get-latest-event [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Get the latest event for this workflow (optional - will prompt if not provided)
torc events delete
Delete an event
Usage: torc events delete <ID>
Arguments:
- <ID> — ID of the event to remove
torc results
Result management commands
Usage: torc results <COMMAND>
Subcommands:
- list — List results
- get — Get a specific result by ID
- delete — Delete a result
torc results list
List results
Usage: torc results list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List results for this workflow (optional - will prompt if not provided). By default, only lists results for the latest run of the workflow
Options:
- -j, --job-id <JOB_ID> — List results for this job
- -r, --run-id <RUN_ID> — List results for this run_id
- --return-code <RETURN_CODE> — Filter by return code
- --failed — Show only failed jobs (non-zero return code)
- -s, --status <STATUS> — Filter by job status (uninitialized, blocked, canceled, terminated, done, ready, scheduled, running, pending, disabled)
- -l, --limit <LIMIT> — Maximum number of results to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
- --all-runs — Show all historical results (default: false, only shows current results)
- --compute-node <COMPUTE_NODE> — Filter by compute node ID
torc results get
Get a specific result by ID
Usage: torc results get <ID>
Arguments:
- <ID> — ID of the result to get
torc results delete
Delete a result
Usage: torc results delete <ID>
Arguments:
- <ID> — ID of the result to remove
torc user-data
User data management commands
Usage: torc user-data <COMMAND>
Subcommands:
- create — Create a new user data record
- list — List user data records
- get — Get a specific user data record
- update — Update a user data record
- delete — Delete a user data record
- delete-all — Delete all user data records for a workflow
- list-missing — List missing user data for a workflow
torc user-data create
Create a new user data record
Usage: torc user-data create [OPTIONS] --name <NAME> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -n, --name <NAME> — Name of the data object
- -d, --data <DATA> — JSON data content
- --ephemeral — Whether the data is ephemeral (cleared between runs)
- --consumer-job-id <CONSUMER_JOB_ID> — Consumer job ID (optional)
- --producer-job-id <PRODUCER_JOB_ID> — Producer job ID (optional)
torc user-data list
List user data records
Usage: torc user-data list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (if not provided, will be selected interactively)
Options:
- -l, --limit <LIMIT> — Maximum number of records to return
  Default value: 50
- -o, --offset <OFFSET> — Number of records to skip
  Default value: 0
- --sort-by <SORT_BY> — Field to sort by
- --reverse-sort — Reverse sort order
- --name <NAME> — Filter by name
- --is-ephemeral <IS_EPHEMERAL> — Filter by ephemeral status
  Possible values: true, false
- --consumer-job-id <CONSUMER_JOB_ID> — Filter by consumer job ID
- --producer-job-id <PRODUCER_JOB_ID> — Filter by producer job ID
torc user-data get
Get a specific user data record
Usage: torc user-data get <ID>
Arguments:
- <ID> — User data record ID
torc user-data update
Update a user data record
Usage: torc user-data update [OPTIONS] <ID>
Arguments:
- <ID> — User data record ID
Options:
- -n, --name <NAME> — New name for the data object
- -d, --data <DATA> — New JSON data content
- --ephemeral <EPHEMERAL> — Update ephemeral status
  Possible values: true, false
torc user-data delete
Delete a user data record
Usage: torc user-data delete <ID>
Arguments:
- <ID> — User data record ID
torc user-data delete-all
Delete all user data records for a workflow
Usage: torc user-data delete-all <WORKFLOW_ID>
Arguments:
- <WORKFLOW_ID> — Workflow ID
torc user-data list-missing
List missing user data for a workflow
Usage: torc user-data list-missing <WORKFLOW_ID>
Arguments:
- <WORKFLOW_ID> — Workflow ID
torc slurm
Slurm scheduler commands
Usage: torc slurm <COMMAND>
Subcommands:
- create — Add a Slurm config to the database
- update — Modify a Slurm config in the database
- list — Show the current Slurm configs in the database
- get — Get a specific Slurm config by ID
- delete — Delete a Slurm config by ID
- schedule-nodes — Schedule compute nodes using Slurm
- parse-logs — Parse Slurm log files for known error messages
- sacct — Call sacct for scheduled compute nodes and display summary
- generate — Generate Slurm schedulers for a workflow based on job resource requirements
- regenerate — Regenerate Slurm schedulers for an existing workflow based on pending jobs
torc slurm create
Add a Slurm config to the database
Usage: torc slurm create [OPTIONS] --name <NAME> --account <ACCOUNT> [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -n, --name <NAME> — Name of config
- -a, --account <ACCOUNT> — HPC account
- -g, --gres <GRES> — Request nodes that have at least this number of GPUs. Ex: 'gpu:2'
- -m, --mem <MEM> — Request nodes that have at least this amount of memory. Ex: '180G'
- -N, --nodes <NODES> — Number of nodes to use for each job
  Default value: 1
- -p, --partition <PARTITION> — HPC partition. Default is determined by the scheduler
- -q, --qos <QOS> — Controls priority of the jobs
  Default value: normal
- -t, --tmp <TMP> — Request nodes that have at least this amount of storage scratch space
- -W, --walltime <WALLTIME> — Slurm job walltime
  Default value: 04:00:00
- -e, --extra <EXTRA> — Add extra Slurm parameters, for example --extra='--reservation=my-reservation'
torc slurm update
Modify a Slurm config in the database
Usage: torc slurm update [OPTIONS] <SCHEDULER_ID>
Arguments:
- <SCHEDULER_ID>
Options:
- -N, --name <NAME> — Name of config
- -a, --account <ACCOUNT> — HPC account
- -g, --gres <GRES> — Request nodes that have at least this number of GPUs. Ex: 'gpu:2'
- -m, --mem <MEM> — Request nodes that have at least this amount of memory. Ex: '180G'
- -n, --nodes <NODES> — Number of nodes to use for each job
- -p, --partition <PARTITION> — HPC partition
- -q, --qos <QOS> — Controls priority of the jobs
- -t, --tmp <TMP> — Request nodes that have at least this amount of storage scratch space
- --walltime <WALLTIME> — Slurm job walltime
- -e, --extra <EXTRA> — Add extra Slurm parameters
torc slurm list
Show the current Slurm configs in the database
Usage: torc slurm list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -l, --limit <LIMIT> — Maximum number of configs to return
  Default value: 10000
- --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
torc slurm get
Get a specific Slurm config by ID
Usage: torc slurm get <ID>
Arguments:
- <ID> — ID of the Slurm config to get
torc slurm delete
Delete a Slurm config by ID
Usage: torc slurm delete <ID>
Arguments:
- <ID> — ID of the Slurm config to delete
torc slurm schedule-nodes
Schedule compute nodes using Slurm
Usage: torc slurm schedule-nodes [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -j, --job-prefix <JOB_PREFIX> — Job prefix for the Slurm job names
  Default value: worker
- --keep-submission-scripts — Keep submission scripts after job submission
  Default value: false
- -m, --max-parallel-jobs <MAX_PARALLEL_JOBS> — Maximum number of parallel jobs
- -n, --num-hpc-jobs <NUM_HPC_JOBS> — Number of HPC jobs to submit
  Default value: 1
- -o, --output <OUTPUT> — Output directory for job output files
  Default value: output
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds
  Default value: 60
- --scheduler-config-id <SCHEDULER_CONFIG_ID> — Scheduler config ID
- --start-one-worker-per-node — Start one worker per node
  Default value: false
torc slurm parse-logs
Parse Slurm log files for known error messages
Usage: torc slurm parse-logs [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -o, --output-dir <OUTPUT_DIR> — Output directory containing Slurm log files
  Default value: output
- --errors-only — Only show errors (skip warnings)
  Default value: false
torc slurm sacct
Call sacct for scheduled compute nodes and display summary
Usage: torc slurm sacct [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- -o, --output-dir <OUTPUT_DIR> — Output directory for sacct JSON files (only used with --save-json)
  Default value: output
- --save-json — Save full JSON output to files in addition to displaying summary
  Default value: false
torc slurm generate
Generate Slurm schedulers for a workflow based on job resource requirements
Usage: torc slurm generate [OPTIONS] --account <ACCOUNT> <WORKFLOW_FILE>
Arguments:
- <WORKFLOW_FILE> — Path to workflow specification file (YAML, JSON, JSON5, or KDL)
Options:
- --account <ACCOUNT> — Slurm account to use
- --profile <PROFILE> — HPC profile to use (if not specified, tries to detect current system)
- -o, --output <OUTPUT> — Output file path (if not specified, prints to stdout)
- --single-allocation — Bundle all nodes into a single Slurm allocation per scheduler
  By default, creates one Slurm allocation per node (N×1 mode), which allows jobs to start as nodes become available and provides better fault tolerance. With this flag, creates one large allocation with all nodes (1×N mode), which requires all nodes to be available simultaneously but uses a single sbatch.
- --group-by <GROUP_BY> — Strategy for grouping jobs into schedulers
  Possible values: resource-requirements (default), partition
- --walltime-strategy <STRATEGY> — Strategy for determining Slurm job walltime
  Possible values: max-job-runtime (default), max-partition-time
  max-job-runtime: Uses the maximum job runtime multiplied by --walltime-multiplier. This typically results in shorter walltime requests, improving queue priority.
  max-partition-time: Uses the partition's maximum allowed walltime. More conservative but may negatively impact queue scheduling.
- --walltime-multiplier <MULTIPLIER> — Multiplier for job runtime when using --walltime-strategy=max-job-runtime
  Default value: 1.5
  The maximum job runtime is multiplied by this value to provide a safety margin. For example, 1.5 means requesting 50% more time than the longest job estimate.
- --no-actions — Don't add workflow actions for scheduling nodes
- --overwrite — Overwrite existing schedulers in the workflow
- --dry-run — Show what would be generated without writing to output
torc slurm regenerate
Regenerate Slurm schedulers for an existing workflow based on pending jobs
Analyzes jobs that are uninitialized, ready, or blocked and generates new Slurm schedulers to run them. Uses existing scheduler configurations as defaults for account, partition, and other settings.
This is useful for recovery after job failures: update job resources, reset failed jobs, then regenerate schedulers to submit new allocations.
Usage: torc slurm regenerate [OPTIONS] <WORKFLOW_ID>
Arguments:
- <WORKFLOW_ID> — Workflow ID
Options:
- --account <ACCOUNT> — Slurm account to use (defaults to account from existing schedulers)
- --profile <PROFILE> — HPC profile to use (if not specified, tries to detect current system)
- --single-allocation — Bundle all nodes into a single Slurm allocation per scheduler
- --submit — Submit the generated allocations immediately
- -o, --output-dir <OUTPUT_DIR> — Output directory for job output files (used when submitting)
  Default value: output
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds (used when submitting)
  Default value: 60
- --group-by <GROUP_BY> — Strategy for grouping jobs into schedulers
  Possible values: resource-requirements (default), partition
- --walltime-strategy <STRATEGY> — Strategy for determining Slurm job walltime
  Possible values: max-job-runtime (default), max-partition-time
- --walltime-multiplier <MULTIPLIER> — Multiplier for job runtime when using --walltime-strategy=max-job-runtime
  Default value: 1.5
- --dry-run — Show what would be created without making changes
- --include-job-ids <JOB_IDS> — Include specific job IDs in planning regardless of their status (useful for recovery dry-run to include failed jobs)
torc remote
Remote worker execution commands (SSH-based distributed execution)
Usage: torc remote <COMMAND>
Subcommands:
- add-workers — Add one or more remote workers to a workflow
- add-workers-from-file — Add remote workers to a workflow from a file
- remove-worker — Remove a remote worker from a workflow
- list-workers — List remote workers stored in the database for a workflow
- run — Run workers on remote machines via SSH
- status — Check status of remote workers
- stop — Stop workers on remote machines
- collect-logs — Collect logs from remote workers
- delete-logs — Delete logs from remote workers
torc remote add-workers
Add one or more remote workers to a workflow
Workers are stored in the database and used by subsequent commands. Format: [user@]hostname[:port]
Usage: torc remote add-workers <WORKFLOW_ID> <WORKERS>...
Arguments:
- <WORKFLOW_ID> — Workflow ID
- <WORKERS> — Worker addresses (format: [user@]hostname[:port])
torc remote add-workers-from-file
Add remote workers to a workflow from a file
Each line in the file should be a worker address. Lines starting with # are comments.
Usage: torc remote add-workers-from-file <WORKER_FILE> [WORKFLOW_ID]
Arguments:
- <WORKER_FILE> — Path to worker file listing remote machines
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
torc remote remove-worker
Remove a remote worker from a workflow
Usage: torc remote remove-worker <WORKER> [WORKFLOW_ID]
Arguments:
- <WORKER> — Worker address to remove
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
torc remote list-workers
List remote workers stored in the database for a workflow
Usage: torc remote list-workers [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
torc remote run
Run workers on remote machines via SSH
SSH into each stored worker and start a torc worker process. Workers run detached (via nohup) and survive SSH disconnection. Use add-workers first, or provide --workers to add and run in one step.
Usage: torc remote run [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to run (optional - will prompt if not provided)
Options:
- -w, --workers <WORKERS> — Path to worker file (optional - adds workers before running)
- -o, --output-dir <OUTPUT_DIR> — Output directory on remote machines (relative to home)
  Default value: torc_output
- --max-parallel-ssh <MAX_PARALLEL_SSH> — Maximum parallel SSH connections
  Default value: 10
- -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds for workers
  Default value: 5.0
- --max-parallel-jobs <MAX_PARALLEL_JOBS> — Maximum number of parallel jobs per worker
- --num-cpus <NUM_CPUS> — Number of CPUs per worker (auto-detect if not specified)
- --memory-gb <MEMORY_GB> — Memory in GB per worker (auto-detect if not specified)
- --num-gpus <NUM_GPUS> — Number of GPUs per worker (auto-detect if not specified)
- --skip-version-check — Skip version check (not recommended)
  Default value: false
torc remote status
Check status of remote workers
Usage: torc remote status [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
Options:
- --output-dir <OUTPUT_DIR> — Remote output directory (must match what was used in run)
  Default value: torc_output
- --max-parallel-ssh <MAX_PARALLEL_SSH> — Maximum parallel SSH connections
  Default value: 10
torc remote stop
Stop workers on remote machines
Usage: torc remote stop [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
Options:
- --output-dir <OUTPUT_DIR> — Remote output directory (must match what was used in run)
  Default value: torc_output
- --max-parallel-ssh <MAX_PARALLEL_SSH> — Maximum parallel SSH connections
  Default value: 10
- --force — Force kill (SIGKILL instead of SIGTERM)
  Default value: false
torc remote collect-logs
Collect logs from remote workers
Usage: torc remote collect-logs [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
Options:
- -l, --local-output-dir <LOCAL_OUTPUT_DIR> — Local directory to save collected logs
  Default value: remote_logs
- --remote-output-dir <REMOTE_OUTPUT_DIR> — Remote output directory (must match what was used in run)
  Default value: torc_output
- --max-parallel-ssh <MAX_PARALLEL_SSH> — Maximum parallel SSH connections
  Default value: 10
- --delete — Delete remote logs after successful collection
  Default value: false
torc remote delete-logs
Delete logs from remote workers
Removes the output directory from all remote workers. Use collect-logs --delete to safely collect before deleting.
Usage: torc remote delete-logs [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID (optional - will prompt if not provided)
Options:
- --remote-output-dir <REMOTE_OUTPUT_DIR> — Remote output directory to delete (must match what was used in run)
  Default value: torc_output
- --max-parallel-ssh <MAX_PARALLEL_SSH> — Maximum parallel SSH connections
  Default value: 10
torc scheduled-compute-nodes
Scheduled compute node management commands
Usage: torc scheduled-compute-nodes <COMMAND>
Subcommands:
- get — Get a scheduled compute node by ID
- list — List scheduled compute nodes for a workflow
- list-jobs — List jobs that ran under a scheduled compute node
torc scheduled-compute-nodes get
Get a scheduled compute node by ID
Usage: torc scheduled-compute-nodes get <ID>
Arguments:
- <ID> — ID of the scheduled compute node
torc scheduled-compute-nodes list
List scheduled compute nodes for a workflow
Usage: torc scheduled-compute-nodes list [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — List scheduled compute nodes for this workflow (optional - will prompt if not provided)
Options:
- -l, --limit <LIMIT> — Maximum number of scheduled compute nodes to return
  Default value: 10000
- -o, --offset <OFFSET> — Offset for pagination (0-based)
  Default value: 0
- -s, --sort-by <SORT_BY> — Field to sort by
- -r, --reverse-sort — Reverse sort order
  Default value: false
- --scheduler-id <SCHEDULER_ID> — Filter by scheduler ID
- --scheduler-config-id <SCHEDULER_CONFIG_ID> — Filter by scheduler config ID
- --status <STATUS> — Filter by status
torc scheduled-compute-nodes list-jobs
List jobs that ran under a scheduled compute node
Usage: torc scheduled-compute-nodes list-jobs <ID>
Arguments:
- <ID> — ID of the scheduled compute node
torc hpc
HPC system profiles and partition information
Usage: torc hpc <COMMAND>
Subcommands:
- list — List known HPC system profiles
- detect — Detect the current HPC system
- show — Show details of an HPC profile
- partitions — Show partitions for an HPC profile
- match — Find partitions matching resource requirements
torc hpc list
List known HPC system profiles
Usage: torc hpc list
torc hpc detect
Detect the current HPC system
Usage: torc hpc detect
torc hpc show
Show details of an HPC profile
Usage: torc hpc show <NAME>
Arguments:
- <NAME> — Profile name (e.g., "kestrel")
torc hpc partitions
Show partitions for an HPC profile
Usage: torc hpc partitions [OPTIONS] [NAME]
Arguments:
- <NAME> — Profile name (e.g., "kestrel"). If not specified, tries to detect current system
Options:
- --gpu — Filter to GPU partitions only
- --cpu — Filter to CPU-only partitions
- --shared — Filter to shared partitions
torc hpc match
Find partitions matching resource requirements
Usage: torc hpc match [OPTIONS]
Options:
- --cpus <CPUS> — Number of CPUs required
  Default value: 1
- --memory <MEMORY> — Memory required (e.g., "100g", "512m", or MB as number)
  Default value: 1g
- --walltime <WALLTIME> — Wall time required (e.g., "4:00:00", "2-00:00:00")
  Default value: 1:00:00
- --gpus <GPUS> — Number of GPUs required
- --profile <PROFILE> — Profile name (if not specified, tries to detect current system)
torc reports
Generate reports and analytics
Usage: torc reports <COMMAND>
Subcommands:
- check-resource-utilization — Check resource utilization and report jobs that exceeded their specified requirements
- results — Generate a comprehensive JSON report of job results including all log file paths
- summary — Generate a summary of workflow results (requires workflow to be complete)
torc reports check-resource-utilization
Check resource utilization and report jobs that exceeded their specified requirements
Usage: torc reports check-resource-utilization [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to analyze (optional - will prompt if not provided)
Options:
- -r, --run-id <RUN_ID> — Run ID to analyze (optional - analyzes latest run if not provided)
- -a, --all — Show all jobs (default: only show jobs that exceeded requirements)
- --include-failed — Include failed and terminated jobs in the analysis (for recovery diagnostics)
torc reports results
Generate a comprehensive JSON report of job results including all log file paths
Usage: torc reports results [OPTIONS] [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to analyze (optional - will prompt if not provided)
Options:
- -o, --output-dir <OUTPUT_DIR> — Output directory (where job logs are stored, passed in torc run and torc submit)
  Default value: output
- --all-runs — Include all runs for each job (default: only latest run)
torc reports summary
Generate a summary of workflow results (requires workflow to be complete)
Usage: torc reports summary [WORKFLOW_ID]
Arguments:
- <WORKFLOW_ID> — Workflow ID to summarize (optional - will prompt if not provided)
torc config
Manage configuration files and settings
Usage: torc config <COMMAND>
Subcommands:
- show — Show the effective configuration (merged from all sources)
- paths — Show configuration file paths
- init — Initialize a configuration file with defaults
- validate — Validate the current configuration
torc config show
Show the effective configuration (merged from all sources)
Usage: torc config show [OPTIONS]
Options:
- -f, --format <FORMAT> — Output format (toml or json)
  Default value: toml
torc config paths
Show configuration file paths
Usage: torc config paths
torc config init
Initialize a configuration file with defaults
Usage: torc config init [OPTIONS]
Options:
- --system — Create system-wide config (/etc/torc/config.toml)
- --user — Create user config (~/.config/torc/config.toml)
- --local — Create project-local config (./torc.toml)
- -f, --force — Force overwrite if file exists
torc config validate
Validate the current configuration
Usage: torc config validate
torc tui
Interactive terminal UI for managing workflows
Usage: torc tui [OPTIONS]
Options:
- --standalone — Start in standalone mode: automatically start a torc-server
- --port <PORT> — Port for the server in standalone mode
  Default value: 8080
- --database <DATABASE> — Database path for standalone mode
torc plot-resources
Generate interactive HTML plots from resource monitoring data
Usage: torc plot-resources [OPTIONS] <DB_PATHS>...
Arguments:
- <DB_PATHS> — Path to the resource metrics database file(s)
Options:
- -o, --output-dir <OUTPUT_DIR> — Output directory for generated plots (default: current directory)
  Default value: .
- -j, --job-ids <JOB_IDS> — Only plot specific job IDs (comma-separated)
- -p, --prefix <PREFIX> — Prefix for output filenames
  Default value: resource_plot
- -f, --format <FORMAT> — Output format: html or json
  Default value: html
torc completions
Generate shell completions
Usage: torc completions <SHELL>
Arguments:
- <SHELL> — The shell to generate completions for
  Possible values: bash, elvish, fish, powershell, zsh
This document was generated automatically by clap-markdown.
Workflow Specification Reference
This page documents all data models used in workflow specification files. Workflow specs can be written in YAML, JSON, JSON5, or KDL formats.
WorkflowSpec
The top-level container for a complete workflow definition.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the workflow |
user | string | current user | User who owns this workflow |
description | string | none | Description of the workflow |
parameters | map<string, string> | none | Shared parameters that can be used by jobs and files via use_parameters |
jobs | [JobSpec] | required | Jobs that make up this workflow |
files | [FileSpec] | none | Files associated with this workflow |
user_data | [UserDataSpec] | none | User data associated with this workflow |
resource_requirements | [ResourceRequirementsSpec] | none | Resource requirements available for this workflow |
failure_handlers | [FailureHandlerSpec] | none | Failure handlers available for this workflow |
slurm_schedulers | [SlurmSchedulerSpec] | none | Slurm schedulers available for this workflow |
slurm_defaults | SlurmDefaultsSpec | none | Default Slurm parameters to apply to all schedulers |
resource_monitor | ResourceMonitorConfig | none | Resource monitoring configuration |
actions | [WorkflowActionSpec] | none | Actions to execute based on workflow/job state transitions |
use_pending_failed | boolean | false | Use PendingFailed status for failed jobs (enables AI-assisted recovery) |
compute_node_expiration_buffer_seconds | integer | none | Shut down compute nodes this many seconds before expiration |
compute_node_wait_for_new_jobs_seconds | integer | none | Compute nodes wait for new jobs this long before exiting |
compute_node_ignore_workflow_completion | boolean | false | Compute nodes hold allocations even after workflow completes |
compute_node_wait_for_healthy_database_minutes | integer | none | Compute nodes wait this many minutes for database recovery |
jobs_sort_method | ClaimJobsSortMethod | none | Method for sorting jobs when claiming them |
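For orientation, here is a minimal sketch of a WorkflowSpec in YAML. Every field name comes from the table above; the workflow name, job names, and commands are purely illustrative:
name: example_workflow
description: Minimal two-step workflow
resource_requirements:
  - name: small
    num_cpus: 1
    memory: 2g
    runtime: PT30M
jobs:
  - name: step1
    command: bash step1.sh
    resource_requirements: small
  - name: step2
    command: bash step2.sh
    depends_on:
      - step1
    resource_requirements: small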
JobSpec
Defines a single computational task within a workflow.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the job |
command | string | required | Command to execute for this job |
invocation_script | string | none | Optional script for job invocation |
resource_requirements | string | none | Name of a ResourceRequirementsSpec to use |
failure_handler | string | none | Name of a FailureHandlerSpec to use |
scheduler | string | none | Name of the scheduler to use for this job |
cancel_on_blocking_job_failure | boolean | false | Cancel this job if a blocking job fails |
supports_termination | boolean | false | Whether this job supports graceful termination |
depends_on | [string] | none | Job names that must complete before this job runs (exact matches) |
depends_on_regexes | [string] | none | Regex patterns for job dependencies |
input_files | [string] | none | File names this job reads (exact matches) |
input_file_regexes | [string] | none | Regex patterns for input files |
output_files | [string] | none | File names this job produces (exact matches) |
output_file_regexes | [string] | none | Regex patterns for output files |
input_user_data | [string] | none | User data names this job reads (exact matches) |
input_user_data_regexes | [string] | none | Regex patterns for input user data |
output_user_data | [string] | none | User data names this job produces (exact matches) |
output_user_data_regexes | [string] | none | Regex patterns for output user data |
parameters | map<string, string> | none | Local parameters for generating multiple jobs |
parameter_mode | string | "product" | How to combine parameters: "product" (Cartesian) or "zip" |
use_parameters | [string] | none | Workflow parameter names to use for this job |
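As a sketch of how these fields combine (the job name, command, file regex, and the "medium" resource requirements name are illustrative), a JobSpec can mix file-based and explicit dependencies:
jobs:
  - name: aggregate
    command: python aggregate.py --out summary.json
    resource_requirements: medium
    input_file_regexes:
      - "result_.*"
    depends_on:
      - preprocess
    cancel_on_blocking_job_failure: true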
FileSpec
Defines input/output file artifacts that establish implicit job dependencies.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the file (used for referencing in jobs) |
path | string | required | File system path |
parameters | map<string, string> | none | Parameters for generating multiple files |
parameter_mode | string | "product" | How to combine parameters: "product" (Cartesian) or "zip" |
use_parameters | [string] | none | Workflow parameter names to use for this file |
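For example, a parameterized FileSpec (names and paths illustrative) expands into one file entry per parameter value, which jobs can then reference by name:
files:
  - name: result_{i}
    path: results/result_{i}.json
    parameters:
      i: "1:10"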
UserDataSpec
Arbitrary JSON data that can establish dependencies between jobs.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | none | Name of the user data (used for referencing in jobs) |
data | JSON | none | The data content as a JSON value |
is_ephemeral | boolean | false | Whether the user data is ephemeral |
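A small sketch (name and data values illustrative) of a UserDataSpec that jobs could consume through input_user_data:
user_data:
  - name: run_config
    is_ephemeral: false
    data:
      seed: 42
      tag: baseline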
ResourceRequirementsSpec
Defines compute resource requirements for jobs.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of this resource configuration (referenced by jobs) |
num_cpus | integer | required | Number of CPUs required |
memory | string | required | Memory requirement (e.g., "1m", "2g", "512k") |
num_gpus | integer | 0 | Number of GPUs required |
num_nodes | integer | 1 | Number of nodes required |
runtime | string | "PT1H" | Runtime limit in ISO8601 duration format (e.g., "PT30M", "PT2H") |
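For example (values illustrative), a ResourceRequirementsSpec that jobs reference by name through their resource_requirements field:
resource_requirements:
  - name: gpu_large
    num_cpus: 16
    memory: 64g
    num_gpus: 2
    num_nodes: 1
    runtime: PT4H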
FailureHandlerSpec
Defines error recovery strategies for jobs.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the failure handler (referenced by jobs) |
rules | [FailureHandlerRuleSpec] | required | Rules for handling different exit codes |
FailureHandlerRuleSpec
A single rule within a failure handler for handling specific exit codes.
| Name | Type | Default | Description |
|---|---|---|---|
exit_codes | [integer] | [] | Exit codes that trigger this rule |
match_all_exit_codes | boolean | false | If true, matches any non-zero exit code |
recovery_script | string | none | Optional script to run before retrying |
max_retries | integer | 3 | Maximum number of retry attempts |
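As a sketch (the handler name, exit code, and recovery script path are illustrative), a FailureHandlerSpec with two rules might look like this; a job opts in by setting failure_handler: retry_handler:
failure_handlers:
  - name: retry_handler
    rules:
      - exit_codes: [137]
        recovery_script: ./increase_memory.sh
        max_retries: 2
      - match_all_exit_codes: true
        max_retries: 1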
SlurmSchedulerSpec
Defines a Slurm HPC job scheduler configuration.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | none | Name of the scheduler (used for referencing) |
account | string | required | Slurm account |
partition | string | none | Slurm partition name |
nodes | integer | 1 | Number of nodes to allocate |
walltime | string | "01:00:00" | Wall time limit |
mem | string | none | Memory specification |
gres | string | none | Generic resources (e.g., GPUs) |
qos | string | none | Quality of service |
ntasks_per_node | integer | none | Number of tasks per node |
tmp | string | none | Temporary storage specification |
extra | string | none | Additional Slurm parameters |
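A sketch of a SlurmSchedulerSpec; the account, partition, and other values are placeholders for your site's settings:
slurm_schedulers:
  - name: gpu_nodes
    account: my_account
    partition: gpu
    nodes: 2
    walltime: "08:00:00"
    gres: "gpu:2"
    qos: normal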
SlurmDefaultsSpec
Workflow-level default parameters applied to all Slurm schedulers. This is a map of parameter names to values.
Any valid sbatch long option can be specified (without the leading --), except for parameters
managed by torc: partition, nodes, walltime, time, mem, gres, name, job-name.
The account parameter is allowed as a workflow-level default.
Example:
slurm_defaults:
qos: "high"
constraint: "cpu"
mail-user: "user@example.com"
mail-type: "END,FAIL"
WorkflowActionSpec
Defines conditional actions triggered by workflow or job state changes.
| Name | Type | Default | Description |
|---|---|---|---|
trigger_type | string | required | When to trigger: "on_workflow_start", "on_workflow_complete", "on_jobs_ready", "on_jobs_complete" |
action_type | string | required | What to do: "run_commands", "schedule_nodes" |
jobs | [string] | none | For job triggers: exact job names to match |
job_name_regexes | [string] | none | For job triggers: regex patterns to match job names |
commands | [string] | none | For run_commands: commands to execute |
scheduler | string | none | For schedule_nodes: scheduler name |
scheduler_type | string | none | For schedule_nodes: scheduler type ("slurm", "local") |
num_allocations | integer | none | For schedule_nodes: number of node allocations |
start_one_worker_per_node | boolean | none | For schedule_nodes: start one worker per allocated node |
max_parallel_jobs | integer | none | For schedule_nodes: maximum parallel jobs |
persistent | boolean | false | Whether the action persists and can be claimed by multiple workers |
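For example (the scheduler name and regex are illustrative), a WorkflowActionSpec that schedules Slurm nodes when matching jobs become ready:
actions:
  - trigger_type: on_jobs_ready
    action_type: schedule_nodes
    job_name_regexes:
      - "train_.*"
    scheduler: gpu_nodes
    scheduler_type: slurm
    num_allocations: 4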
ResourceMonitorConfig
Configuration for resource usage monitoring.
| Name | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable resource monitoring |
granularity | MonitorGranularity | "Summary" | Level of detail for metrics collection |
sample_interval_seconds | integer | 5 | Sampling interval in seconds |
generate_plots | boolean | false | Generate resource usage plots |
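For example, enabling time-series monitoring with a 10-second sample interval (values illustrative):
resource_monitor:
  enabled: true
  granularity: TimeSeries
  sample_interval_seconds: 10
  generate_plots: true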
MonitorGranularity
Enum specifying the level of detail for resource monitoring.
| Value | Description |
|---|---|
Summary | Collect summary statistics only |
TimeSeries | Collect detailed time series data |
ClaimJobsSortMethod
Enum specifying how jobs are sorted when being claimed by workers.
| Value | Description |
|---|---|
none | No sorting (default) |
gpus_runtime_memory | Sort by GPUs, then runtime, then memory |
gpus_memory_runtime | Sort by GPUs, then memory, then runtime |
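This is set at the workflow level through the jobs_sort_method field, for example:
jobs_sort_method: gpus_runtime_memory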
Parameter Formats
Parameters support several formats for generating multiple jobs or files:
| Format | Example | Description |
|---|---|---|
| Integer range | "1:100" | Inclusive range from 1 to 100 |
| Integer range with step | "0:100:10" | Range with step size |
| Float range | "0.0:1.0:0.1" | Float range with step |
| Integer list | "[1,5,10,100]" | Explicit list of integers |
| Float list | "[0.1,0.5,0.9]" | Explicit list of floats |
| String list | "['adam','sgd','rmsprop']" | Explicit list of strings |
Template substitution in strings:
- Basic: {param_name} - Replace with parameter value
- Formatted integer: {i:03d} - Zero-padded (001, 042, 100)
- Formatted float: {lr:.4f} - Precision (0.0010, 0.1000)
See the Job Parameterization reference for more details.
Job Parameterization
Parameterization allows creating multiple jobs/files from a single specification by expanding parameter ranges.
Parameter Formats
Integer Ranges
parameters:
i: "1:10" # Expands to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
i: "0:100:10" # Expands to [0, 10, 20, 30, ..., 90, 100] (with step)
Float Ranges
parameters:
lr: "0.0001:0.01:10" # 10 values from 0.0001 to 0.01 (log scale)
alpha: "0.0:1.0:0.1" # [0.0, 0.1, 0.2, ..., 0.9, 1.0]
Lists (Integer)
parameters:
batch_size: "[16,32,64,128]"
Lists (Float)
parameters:
threshold: "[0.1,0.5,0.9]"
Lists (String)
parameters:
optimizer: "['adam','sgd','rmsprop']"
dataset: "['train','test','validation']"
Template Substitution
Use parameter values in job/file specifications with {param_name} syntax:
Basic Substitution
jobs:
- name: job_{i}
command: python train.py --run={i}
parameters:
i: "1:5"
Expands to:
jobs:
- name: job_1
command: python train.py --run=1
- name: job_2
command: python train.py --run=2
# ... etc
Format Specifiers
Zero-padded integers:
jobs:
- name: job_{i:03d}
command: echo {i}
parameters:
i: "1:100"
Expands to: job_001, job_002, ..., job_100
Float precision:
jobs:
- name: train_lr{lr:.4f}
command: python train.py --lr={lr}
parameters:
lr: "[0.0001,0.001,0.01]"
Expands to: train_lr0.0001, train_lr0.0010, train_lr0.0100
Multiple decimals:
files:
- name: result_{threshold:.2f}
path: /results/threshold_{threshold:.2f}.csv
parameters:
threshold: "0.1:1.0:0.1"
Expands to: result_0.10, result_0.20, ..., result_1.00
Multi-Dimensional Parameterization
Use multiple parameters to create Cartesian products:
Example: Hyperparameter Sweep
jobs:
- name: train_lr{lr:.4f}_bs{batch_size}
command: |
python train.py \
--learning-rate={lr} \
--batch-size={batch_size}
parameters:
lr: "[0.0001,0.001,0.01]"
batch_size: "[16,32,64]"
This expands to 3 × 3 = 9 jobs:
- train_lr0.0001_bs16
- train_lr0.0001_bs32
- train_lr0.0001_bs64
- train_lr0.0010_bs16
- ... (9 total)
Example: Multi-Dataset Processing
jobs:
- name: process_{dataset}_rep{rep:02d}
command: python process.py --data={dataset} --replicate={rep}
parameters:
dataset: "['train','validation','test']"
rep: "1:5"
This expands to 3 × 5 = 15 jobs.
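For example: process_train_rep01, process_validation_rep03, and process_test_rep05.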
Parameterized Dependencies
Parameters work in dependency specifications:
jobs:
# Generate data for each configuration
- name: generate_{config}
command: python generate.py --config={config}
output_files:
- data_{config}
parameters:
config: "['A','B','C']"
# Process each generated dataset
- name: process_{config}
command: python process.py --input=data_{config}.pkl
input_files:
- data_{config}
depends_on:
- generate_{config}
parameters:
config: "['A','B','C']"
This creates 6 jobs with proper dependencies:
- generate_A → process_A
- generate_B → process_B
- generate_C → process_C
Parameterized Files and User Data
Files:
files:
- name: model_{run_id:03d}
path: /models/run_{run_id:03d}.pt
parameters:
run_id: "1:100"
User Data:
user_data:
- name: config_{experiment}
data:
experiment: "{experiment}"
learning_rate: 0.001
parameters:
experiment: "['baseline','ablation','full']"
Shared (Workflow-Level) Parameters
Define parameters once at the workflow level and reuse them across multiple jobs and files using
use_parameters:
Basic Usage
name: hyperparameter_sweep
parameters:
lr: "[0.0001,0.001,0.01]"
batch_size: "[16,32,64]"
optimizer: "['adam','sgd']"
jobs:
# Training jobs - inherit parameters via use_parameters
- name: train_lr{lr:.4f}_bs{batch_size}_opt{optimizer}
command: python train.py --lr={lr} --batch-size={batch_size} --optimizer={optimizer}
use_parameters:
- lr
- batch_size
- optimizer
# Aggregate results - also uses shared parameters
- name: aggregate_results
command: python aggregate.py
depends_on:
- train_lr{lr:.4f}_bs{batch_size}_opt{optimizer}
use_parameters:
- lr
- batch_size
- optimizer
files:
- name: model_lr{lr:.4f}_bs{batch_size}_opt{optimizer}
path: /models/model_lr{lr:.4f}_bs{batch_size}_opt{optimizer}.pt
use_parameters:
- lr
- batch_size
- optimizer
Benefits
- DRY (Don't Repeat Yourself) - Define parameter ranges once, use everywhere
- Consistency - Ensures all jobs use the same parameter values
- Maintainability - Change parameters in one place, affects all uses
- Selective inheritance - Jobs can choose which parameters to use
Selective Parameter Inheritance
Jobs don't have to use all workflow parameters:
parameters:
lr: "[0.0001,0.001,0.01]"
batch_size: "[16,32,64]"
dataset: "['train','validation']"
jobs:
# Only uses lr and batch_size (9 jobs)
- name: train_lr{lr:.4f}_bs{batch_size}
command: python train.py --lr={lr} --batch-size={batch_size}
use_parameters:
- lr
- batch_size
# Only uses dataset (2 jobs)
- name: prepare_{dataset}
command: python prepare.py --dataset={dataset}
use_parameters:
- dataset
Local Parameters Override Shared
Jobs can define local parameters that take precedence over workflow-level parameters:
parameters:
lr: "[0.0001,0.001,0.01]"
jobs:
# Uses workflow parameter (3 jobs)
- name: train_lr{lr:.4f}
command: python train.py --lr={lr}
use_parameters:
- lr
# Uses local override (2 jobs instead of 3)
- name: special_lr{lr:.4f}
command: python special.py --lr={lr}
parameters:
lr: "[0.01,0.1]" # Local override - ignores workflow's lr
KDL Syntax
parameters {
lr "[0.0001,0.001,0.01]"
batch_size "[16,32,64]"
}
job "train_lr{lr:.4f}_bs{batch_size}" {
command "python train.py --lr={lr} --batch-size={batch_size}"
use_parameters "lr" "batch_size"
}
JSON5 Syntax
{
parameters: {
lr: "[0.0001,0.001,0.01]",
batch_size: "[16,32,64]"
},
jobs: [
{
name: "train_lr{lr:.4f}_bs{batch_size}",
command: "python train.py --lr={lr} --batch-size={batch_size}",
use_parameters: ["lr", "batch_size"]
}
]
}
Parameter Modes
By default, when multiple parameters are specified, Torc generates the Cartesian product of all
parameter values. You can change this behavior using parameter_mode.
Product Mode (Default)
The default mode generates all possible combinations:
jobs:
- name: job_{a}_{b}
command: echo {a} {b}
parameters:
a: "[1, 2, 3]"
b: "['x', 'y', 'z']"
# parameter_mode: product # This is the default
This creates 3 × 3 = 9 jobs: job_1_x, job_1_y, job_1_z, job_2_x, etc.
Zip Mode
Use parameter_mode: zip to pair parameters element-wise (like Python's zip() function). All
parameter lists must have the same length.
jobs:
- name: train_{dataset}_{model}
command: python train.py --dataset={dataset} --model={model}
parameters:
dataset: "['cifar10', 'mnist', 'imagenet']"
model: "['resnet', 'cnn', 'transformer']"
parameter_mode: zip
This creates 3 jobs (not 9):
- train_cifar10_resnet
- train_mnist_cnn
- train_imagenet_transformer
When to use zip mode:
- Pre-determined parameter pairings (dataset A always uses model X)
- Corresponding input/output file pairs
- Parallel arrays where position matters
Error handling: If parameter lists have different lengths in zip mode, Torc will return an error:
All parameters must have the same number of values when using 'zip' mode.
Parameter 'dataset' has 3 values, but 'model' has 2 values.
KDL Syntax
job "train_{dataset}_{model}" {
command "python train.py --dataset={dataset} --model={model}"
parameters {
dataset "['cifar10', 'mnist', 'imagenet']"
model "['resnet', 'cnn', 'transformer']"
}
parameter_mode "zip"
}
JSON5 Syntax
{
name: "train_{dataset}_{model}",
command: "python train.py --dataset={dataset} --model={model}",
parameters: {
dataset: "['cifar10', 'mnist', 'imagenet']",
model: "['resnet', 'cnn', 'transformer']"
},
parameter_mode: "zip"
}
Best Practices
- Use descriptive parameter names - lr, not x; batch_size, not b
- Format numbers consistently - Use :03d for run IDs, :.4f for learning rates
- Keep parameter counts reasonable - 3×3×3 = 27 jobs is manageable, 10×10×10 = 1000 may overwhelm the system
- Match parameter ranges across related jobs - Use the same parameter values for generator and consumer jobs
- Consider parameter dependencies - Some parameter combinations may be invalid
- Prefer shared parameters for multi-job workflows - Use use_parameters to avoid repeating definitions
- Use selective inheritance - Only inherit the parameters each job actually needs
- Use zip mode for paired parameters - When parameters have a 1:1 correspondence, use parameter_mode: zip
Resource Requirements Reference
Technical reference for job resource specifications and allocation strategies.
Resource Requirements Fields
| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Identifier to reference from jobs |
num_cpus | integer | No | Number of CPU cores |
num_gpus | integer | No | Number of GPUs |
num_nodes | integer | No | Number of compute nodes |
memory | string | No | Memory allocation (see format below) |
runtime | string | No | Maximum runtime (ISO 8601 duration) |
Example
resource_requirements:
- name: small
num_cpus: 2
num_gpus: 0
num_nodes: 1
memory: 4g
runtime: PT30M
- name: large
num_cpus: 16
num_gpus: 2
num_nodes: 1
memory: 128g
runtime: PT8H
Memory Format
String format with unit suffix:
| Suffix | Unit | Example |
|---|---|---|
k | Kilobytes | 512k |
m | Megabytes | 512m |
g | Gigabytes | 16g |
Examples:
memory: 512m # 512 MB
memory: 1g # 1 GB
memory: 16g # 16 GB
Runtime Format
ISO 8601 duration format:
| Format | Description | Example |
|---|---|---|
PTnM | Minutes | PT30M (30 minutes) |
PTnH | Hours | PT2H (2 hours) |
PnD | Days | P1D (1 day) |
PnDTnH | Days and hours | P1DT12H (1.5 days) |
Examples:
runtime: PT10M # 10 minutes
runtime: PT4H # 4 hours
runtime: P1D # 1 day
runtime: P1DT12H # 1 day, 12 hours
Job Allocation Strategies
Resource-Based Allocation (Default)
The server considers each job's resource requirements and only returns jobs that fit within available compute node resources.
Behavior:
- Considers CPU, memory, and GPU requirements
- Prevents resource over-subscription
- Enables efficient packing of heterogeneous workloads
Configuration: Run without --max-parallel-jobs:
torc run $WORKFLOW_ID
Queue-Based Allocation
The server returns the next N ready jobs regardless of resource requirements.
Behavior:
- Ignores job resource requirements
- Only limits concurrent job count
- Simpler and faster (no resource calculation)
Configuration: Run with --max-parallel-jobs:
torc run $WORKFLOW_ID --max-parallel-jobs 10
Use cases:
- Homogeneous workloads where all jobs need similar resources
- Simple task queues
- When resource tracking overhead is not wanted
Resource Tracking
When using resource-based allocation, the job runner tracks:
| Resource | Description |
|---|---|
| CPUs | Number of CPU cores in use |
| Memory | Total memory allocated to running jobs |
| GPUs | Number of GPUs in use |
| Nodes | Number of jobs running per node |
Jobs are only started when sufficient resources are available.
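For example, if each ready job requires 2 CPUs and 4g of memory, a job runner with 8 CPUs and 32 GB available starts at most four of them at once: CPUs run out first (8 / 2 = 4), well before memory (32 / 4 = 8).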
Environment Variables
When Torc executes jobs, it automatically sets several environment variables that provide context about the job and enable communication with the Torc server. These variables are available to all job commands during execution.
Variables Set During Job Execution
TORC_WORKFLOW_ID
The unique identifier of the workflow that contains this job.
- Type: Integer (provided as string)
- Example: "42"
- Use case: Jobs can use this to query workflow information or to organize output files by workflow
# Example: Create a workflow-specific output directory
mkdir -p "/data/results/workflow_${TORC_WORKFLOW_ID}"
echo "Processing data..." > "/data/results/workflow_${TORC_WORKFLOW_ID}/output.txt"
TORC_JOB_ID
The unique identifier of the currently executing job.
- Type: Integer (provided as string)
- Example: "123"
- Use case: Jobs can use this for logging, creating job-specific output files, or querying job metadata
# Example: Log job-specific information
echo "Job ${TORC_JOB_ID} started at $(date)" >> "/var/log/torc/job_${TORC_JOB_ID}.log"
TORC_API_URL
The URL of the Torc API server that the job runner is communicating with.
- Type: String (URL)
- Example: "http://localhost:8080/torc-service/v1"
- Use case: Jobs can make API calls to the Torc server to query data, create files, update user data, or perform other operations
# Example: Query workflow information from within a job
curl -s "${TORC_API_URL}/workflows/${TORC_WORKFLOW_ID}" | jq '.name'
# Example: Create a file entry in Torc
curl -X POST "${TORC_API_URL}/files" \
-H "Content-Type: application/json" \
-d "{
\"workflow_id\": ${TORC_WORKFLOW_ID},
\"name\": \"result_${TORC_JOB_ID}\",
\"path\": \"/data/results/output.txt\"
}"
TORC_JOB_NAME
The name of the currently executing job as defined in the workflow specification.
- Type: String
- Example: "train_model"
- Use case: Jobs can use this for logging or creating human-readable output file names
# Example: Log with job name
echo "[${TORC_JOB_NAME}] Processing started at $(date)"
TORC_OUTPUT_DIR
The output directory where job logs and artifacts are stored.
- Type: String (path)
- Example: "/path/to/output"
- Use case: Jobs can write additional output files to this directory alongside the standard stdout/stderr logs
# Example: Write job artifacts to output directory
cp results.json "${TORC_OUTPUT_DIR}/job_${TORC_JOB_ID}_results.json"
TORC_ATTEMPT_ID
The current attempt number for this job execution. Starts at 1 and increments with each retry when using failure handlers.
- Type: Integer (provided as string)
- Example: "1" (first attempt), "2" (first retry), etc.
- Use case: Jobs can adjust behavior based on retry attempt, or include attempt information in logs
# Example: Log attempt information
echo "Running attempt ${TORC_ATTEMPT_ID} of job ${TORC_JOB_NAME}"
# Example: Adjust behavior on retry
if [ "${TORC_ATTEMPT_ID}" -gt 1 ]; then
echo "This is a retry - using more conservative settings"
BATCH_SIZE=16
else
BATCH_SIZE=64
fi
Variables Set During Recovery Script Execution
When a job fails and has a failure handler configured, Torc may run a recovery script before retrying the job. Recovery scripts receive all the standard job environment variables plus additional context about the failure.
TORC_RETURN_CODE
The exit code from the failed job that triggered the recovery script. Only available in recovery scripts, not during normal job execution.
- Type: Integer (provided as string)
- Example: "137" (OOM killed), "1" (general error)
- Use case: Recovery scripts can inspect the exit code to determine appropriate recovery actions
# Example: Recovery script that handles different exit codes
#!/bin/bash
echo "Job ${TORC_JOB_NAME} failed with exit code ${TORC_RETURN_CODE}"
case ${TORC_RETURN_CODE} in
137)
echo "Out of memory - reducing batch size for retry"
# Modify config for next attempt
;;
139)
echo "Segmentation fault - checking for corrupted data"
# Clean up corrupted files
;;
*)
echo "Unknown error - attempting general recovery"
;;
esac
exit 0 # Exit 0 to proceed with retry, non-zero to abort
Complete Example
Here's a complete example of a job that uses the environment variables:
name: "Environment Variables Demo"
user: "demo"
jobs:
- name: "example_job"
command: |
#!/bin/bash
set -e
echo "=== Job Environment ==="
echo "Workflow ID: ${TORC_WORKFLOW_ID}"
echo "Job ID: ${TORC_JOB_ID}"
echo "Job Name: ${TORC_JOB_NAME}"
echo "Attempt: ${TORC_ATTEMPT_ID}"
echo "Output Dir: ${TORC_OUTPUT_DIR}"
echo "API URL: ${TORC_API_URL}"
# Create job-specific output directory
OUTPUT_DIR="/tmp/workflow_${TORC_WORKFLOW_ID}/job_${TORC_JOB_ID}"
mkdir -p "${OUTPUT_DIR}"
# Do some work
echo "Processing data..." > "${OUTPUT_DIR}/status.txt"
date >> "${OUTPUT_DIR}/status.txt"
echo "Job completed successfully!"
Summary Table
| Variable | Type | Available In | Description |
|---|---|---|---|
TORC_WORKFLOW_ID | Integer | Jobs, Recovery Scripts | Workflow identifier |
TORC_JOB_ID | Integer | Jobs, Recovery Scripts | Job identifier |
TORC_JOB_NAME | String | Jobs, Recovery Scripts | Job name from workflow spec |
TORC_API_URL | URL | Jobs, Recovery Scripts | Torc server API endpoint |
TORC_OUTPUT_DIR | Path | Jobs, Recovery Scripts | Output directory for logs/artifacts |
TORC_ATTEMPT_ID | Integer | Jobs, Recovery Scripts | Current attempt number (1, 2, 3...) |
TORC_RETURN_CODE | Integer | Recovery Scripts only | Exit code that triggered recovery |
Notes
- All environment variables are set as strings, even numeric values like workflow and job IDs
- The TORC_API_URL includes the full base path to the API (e.g., /torc-service/v1)
- Jobs inherit all other environment variables from the job runner process
- These variables are available in both local and Slurm-scheduled job executions
- TORC_ATTEMPT_ID starts at 1 for the first execution and increments with each retry
- TORC_RETURN_CODE is only available in recovery scripts, not during normal job execution
Resource Monitoring Reference
Technical reference for Torc's resource monitoring system.
Configuration Options
The resource_monitor section in workflow specifications accepts the following fields:
| Field | Type | Default | Description |
|---|---|---|---|
enabled | boolean | true | Enable or disable monitoring |
granularity | string | "summary" | "summary" or "time_series" |
sample_interval_seconds | integer | 5 | Seconds between resource samples |
generate_plots | boolean | false | Reserved for future use |
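For example, a minimal sketch enabling detailed collection; the sample interval is chosen for illustration:
resource_monitor:
  enabled: true
  granularity: "time_series"
  sample_interval_seconds: 10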
Granularity Modes
Summary mode ("summary"):
- Stores only peak and average values per job
- Metrics stored in the main database results table
- Minimal storage overhead
Time series mode ("time_series"):
- Stores samples at regular intervals
- Creates separate SQLite database per workflow run
- Database location:
<output_dir>/resource_utilization/resource_metrics_<hostname>_<workflow_id>_<run_id>.db
Sample Interval Guidelines
| Job Duration | Recommended Interval |
|---|---|
| < 1 hour | 1-2 seconds |
| 1-4 hours | 5 seconds (default) |
| > 4 hours | 10-30 seconds |
Time Series Database Schema
job_resource_samples Table
| Column | Type | Description |
|---|---|---|
id | INTEGER | Primary key |
job_id | INTEGER | Torc job ID |
timestamp | REAL | Unix timestamp |
cpu_percent | REAL | CPU utilization percentage |
memory_bytes | INTEGER | Memory usage in bytes |
num_processes | INTEGER | Process count including children |
job_metadata Table
| Column | Type | Description |
|---|---|---|
job_id | INTEGER | Primary key, Torc job ID |
job_name | TEXT | Human-readable job name |
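Assuming the schema above, peak memory per job can be summarized directly with the sqlite3 command-line tool; the database filename below is a placeholder following the naming pattern shown earlier:
# Peak memory (MiB) per job from a time-series database
sqlite3 output/resource_utilization/resource_metrics_node01_1_1.db \
  "SELECT m.job_name, MAX(s.memory_bytes) / 1048576.0 AS peak_mib
   FROM job_resource_samples s
   JOIN job_metadata m ON m.job_id = s.job_id
   GROUP BY s.job_id;"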
Summary Metrics in Results
When using summary mode, the following fields are added to job results:
| Field | Type | Description |
|---|---|---|
peak_cpu_percent | float | Maximum CPU percentage observed |
avg_cpu_percent | float | Average CPU percentage |
peak_memory_gb | float | Maximum memory in GB |
avg_memory_gb | float | Average memory in GB |
check-resource-utilization JSON Output
When using --format json:
{
"workflow_id": 123,
"run_id": null,
"total_results": 10,
"over_utilization_count": 3,
"violations": [
{
"job_id": 15,
"job_name": "train_model",
"resource_type": "Memory",
"specified": "8.00 GB",
"peak_used": "10.50 GB",
"over_utilization": "+31.3%"
}
]
}
| Field | Description |
|---|---|
workflow_id | Workflow being analyzed |
run_id | Specific run ID if provided, otherwise null for latest |
total_results | Total number of completed jobs analyzed |
over_utilization_count | Number of violations found |
violations | Array of violation details |
Violation Object
| Field | Description |
|---|---|
job_id | Job ID with violation |
job_name | Human-readable job name |
resource_type | "Memory", "CPU", or "Runtime" |
specified | Resource requirement from workflow spec |
peak_used | Actual peak usage observed |
over_utilization | Percentage over/under specification |
plot-resources Output Files
| File | Description |
|---|---|
resource_plot_job_<id>.html | Per-job timeline with CPU, memory, process count |
resource_plot_cpu_all_jobs.html | CPU comparison across all jobs |
resource_plot_memory_all_jobs.html | Memory comparison across all jobs |
resource_plot_summary.html | Bar chart dashboard of peak vs average |
All plots are self-contained HTML files using Plotly.js with:
- Interactive hover tooltips
- Zoom and pan controls
- Legend toggling
- Export options (PNG, SVG)
Monitored Metrics
| Metric | Unit | Description |
|---|---|---|
| CPU percentage | % | Total CPU utilization across all cores |
| Memory usage | bytes | Resident memory consumption |
| Process count | count | Number of processes in job's process tree |
Process Tree Tracking
The monitoring system automatically tracks child processes spawned by jobs. When a job creates worker processes (e.g., Python multiprocessing), all descendants are included in the aggregated metrics.
Performance Characteristics
- Single background monitoring thread regardless of job count
- Typical overhead: <1% CPU even with 1-second sampling
- Uses native OS APIs via the
sysinfocrate - Non-blocking async design
Configuration Reference
Complete reference for Torc configuration options.
Configuration Sources
Torc loads configuration from multiple sources in this order (later sources override earlier):
- Built-in defaults (lowest priority)
- System config: /etc/torc/config.toml
- User config: ~/.config/torc/config.toml (platform-dependent)
- Project config: ./torc.toml
- Environment variables: TORC_* prefix
- CLI arguments (highest priority)
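For example, assuming the user config file sets client.log_level = "info", an environment variable overrides it for a single invocation, and a CLI flag would take precedence over both:
# Overrides client.log_level from the config file for this invocation only
TORC_CLIENT__LOG_LEVEL=debug torc config show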
Configuration Commands
torc config show # Show effective configuration
torc config show --format json # Show as JSON
torc config paths # Show configuration file locations
torc config init --user # Create user config file
torc config init --local # Create project config file
torc config init --system # Create system config file
torc config validate # Validate current configuration
Client Configuration
Settings for the torc CLI.
[client] Section
| Option | Type | Default | Description |
|---|---|---|---|
api_url | string | http://localhost:8080/torc-service/v1 | Torc server API URL |
format | string | table | Output format: table or json |
log_level | string | info | Log level: error, warn, info, debug, trace |
username | string | (none) | Username for basic authentication |
[client.run] Section
Settings for torc run command.
| Option | Type | Default | Description |
|---|---|---|---|
poll_interval | float | 5.0 | Job completion poll interval (seconds) |
output_dir | path | output | Output directory for job logs |
max_parallel_jobs | int | (none) | Maximum parallel jobs (overrides resource-based) |
num_cpus | int | (none) | Available CPUs for resource-based scheduling |
memory_gb | float | (none) | Available memory (GB) for resource-based scheduling |
num_gpus | int | (none) | Available GPUs for resource-based scheduling |
Example
[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table"
log_level = "info"
username = "myuser"
[client.run]
poll_interval = 5.0
output_dir = "output"
max_parallel_jobs = 4
num_cpus = 8
memory_gb = 32.0
num_gpus = 1
[client.hpc] Section
Settings for HPC profile system (used by torc hpc and torc slurm commands).
| Option | Type | Default | Description |
|---|---|---|---|
profile_overrides | table | {} | Override settings for built-in HPC profiles |
custom_profiles | table | {} | Define custom HPC profiles |
[client.hpc.profile_overrides.<profile>] Section
Override settings for built-in profiles (e.g., kestrel).
| Option | Type | Default | Description |
|---|---|---|---|
default_account | string | (none) | Default Slurm account for this profile |
[client.hpc.custom_profiles.<name>] Section
Define a custom HPC profile.
| Option | Type | Required | Description |
|---|---|---|---|
display_name | string | No | Human-readable name |
description | string | No | Profile description |
detect_env_var | string | No | Environment variable for detection (NAME=value) |
detect_hostname | string | No | Regex pattern for hostname detection |
default_account | string | No | Default Slurm account |
partitions | array | Yes | List of partition configurations |
[[client.hpc.custom_profiles.<name>.partitions]] Section
Define partitions for a custom profile.
| Option | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Partition name |
cpus_per_node | int | Yes | CPU cores per node |
memory_mb | int | Yes | Memory per node in MB |
max_walltime_secs | int | Yes | Maximum walltime in seconds |
gpus_per_node | int | No | GPUs per node |
gpu_type | string | No | GPU model (e.g., "H100") |
shared | bool | No | Whether partition supports shared jobs |
min_nodes | int | No | Minimum required nodes |
requires_explicit_request | bool | No | Must be explicitly requested |
HPC Example
[client.hpc.profile_overrides.kestrel]
default_account = "my_default_account"
[client.hpc.custom_profiles.mycluster]
display_name = "My Research Cluster"
description = "Internal research HPC system"
detect_env_var = "MY_CLUSTER=research"
default_account = "default_project"
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "compute"
cpus_per_node = 64
memory_mb = 256000
max_walltime_secs = 172800
shared = false
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 128000
max_walltime_secs = 86400
gpus_per_node = 4
gpu_type = "A100"
shared = false
Server Configuration
Settings for torc-server.
[server] Section
| Option | Type | Default | Description |
|---|---|---|---|
log_level | string | info | Log level |
https | bool | false | Enable HTTPS |
url | string | localhost | Hostname/IP to bind to |
port | int | 8080 | Port to listen on |
threads | int | 1 | Number of worker threads |
database | string | (none) | SQLite database path (falls back to DATABASE_URL env) |
auth_file | string | (none) | Path to htpasswd file |
require_auth | bool | false | Require authentication for all requests |
enforce_access_control | bool | false | Enforce access control based on workflow ownership |
admin_users | string array | [] | Users to add to the admin group |
completion_check_interval_secs | float | 30.0 | Background job processing interval |
[server.logging] Section
| Option | Type | Default | Description |
|---|---|---|---|
log_dir | path | (none) | Directory for log files (enables file logging) |
json_logs | bool | false | Use JSON format for log files |
Example
[server]
url = "0.0.0.0"
port = 8080
threads = 4
database = "/var/lib/torc/torc.db"
auth_file = "/etc/torc/htpasswd"
require_auth = true
enforce_access_control = true
admin_users = ["alice", "bob"]
completion_check_interval_secs = 30.0
log_level = "info"
https = false
[server.logging]
log_dir = "/var/log/torc"
json_logs = false
Dashboard Configuration
Settings for torc-dash.
[dash] Section
| Option | Type | Default | Description |
|---|---|---|---|
host | string | 127.0.0.1 | Hostname/IP to bind to |
port | int | 8090 | Port to listen on |
api_url | string | http://localhost:8080/torc-service/v1 | Torc server API URL |
torc_bin | string | torc | Path to torc CLI binary |
torc_server_bin | string | torc-server | Path to torc-server binary |
standalone | bool | false | Auto-start torc-server |
server_port | int | 0 | Server port for standalone mode (0 = auto) |
database | string | (none) | Database path for standalone mode |
completion_check_interval_secs | int | 5 | Completion check interval (standalone mode) |
Example
[dash]
host = "0.0.0.0"
port = 8090
api_url = "http://localhost:8080/torc-service/v1"
torc_bin = "/usr/local/bin/torc"
torc_server_bin = "/usr/local/bin/torc-server"
standalone = true
server_port = 0
completion_check_interval_secs = 5
Environment Variables
Environment variables use double underscore (__) to separate nested keys.
Client Variables
| Variable | Maps To |
|---|---|
TORC_CLIENT__API_URL | client.api_url |
TORC_CLIENT__FORMAT | client.format |
TORC_CLIENT__LOG_LEVEL | client.log_level |
TORC_CLIENT__USERNAME | client.username |
TORC_CLIENT__RUN__POLL_INTERVAL | client.run.poll_interval |
TORC_CLIENT__RUN__OUTPUT_DIR | client.run.output_dir |
TORC_CLIENT__RUN__MAX_PARALLEL_JOBS | client.run.max_parallel_jobs |
TORC_CLIENT__RUN__NUM_CPUS | client.run.num_cpus |
TORC_CLIENT__RUN__MEMORY_GB | client.run.memory_gb |
TORC_CLIENT__RUN__NUM_GPUS | client.run.num_gpus |
Server Variables
| Variable | Maps To |
|---|---|
TORC_SERVER__URL | server.url |
TORC_SERVER__PORT | server.port |
TORC_SERVER__THREADS | server.threads |
TORC_SERVER__DATABASE | server.database |
TORC_SERVER__AUTH_FILE | server.auth_file |
TORC_SERVER__REQUIRE_AUTH | server.require_auth |
TORC_SERVER__ENFORCE_ACCESS_CONTROL | server.enforce_access_control |
TORC_SERVER__LOG_LEVEL | server.log_level |
TORC_SERVER__COMPLETION_CHECK_INTERVAL_SECS | server.completion_check_interval_secs |
TORC_SERVER__LOGGING__LOG_DIR | server.logging.log_dir |
TORC_SERVER__LOGGING__JSON_LOGS | server.logging.json_logs |
Dashboard Variables
| Variable | Maps To |
|---|---|
TORC_DASH__HOST | dash.host |
TORC_DASH__PORT | dash.port |
TORC_DASH__API_URL | dash.api_url |
TORC_DASH__STANDALONE | dash.standalone |
Legacy Variables
These environment variables are still supported directly by clap:
| Variable | Component | Description |
|---|---|---|
TORC_API_URL | Client | Server API URL (CLI only) |
TORC_PASSWORD | Client | Authentication password (CLI only) |
TORC_AUTH_FILE | Server | htpasswd file path |
TORC_LOG_DIR | Server | Log directory |
TORC_COMPLETION_CHECK_INTERVAL_SECS | Server | Completion check interval |
TORC_ADMIN_USERS | Server | Comma-separated list of admin usernames |
DATABASE_URL | Server | SQLite database URL |
RUST_LOG | All | Log level filter |
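For example, these can be exported in a shell session to point the CLI at a different server and raise log verbosity without touching any config file; the hostname is a placeholder:
export TORC_API_URL="http://hpc-login1:8080/torc-service/v1"
export RUST_LOG=debug
torc run $WORKFLOW_ID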
Complete Example
# ~/.config/torc/config.toml
[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table"
log_level = "info"
username = "developer"
[client.run]
poll_interval = 5.0
output_dir = "output"
num_cpus = 8
memory_gb = 32.0
num_gpus = 1
[server]
log_level = "info"
https = false
url = "localhost"
port = 8080
threads = 4
database = "/var/lib/torc/torc.db"
auth_file = "/etc/torc/htpasswd"
require_auth = true
enforce_access_control = true
admin_users = ["alice", "bob"]
completion_check_interval_secs = 30.0
[server.logging]
log_dir = "/var/log/torc"
json_logs = false
[dash]
host = "127.0.0.1"
port = 8090
api_url = "http://localhost:8080/torc-service/v1"
torc_bin = "torc"
torc_server_bin = "torc-server"
standalone = false
server_port = 0
completion_check_interval_secs = 5
See Also
OpenAPI Specification
The Torc server implements an HTTP API defined in api/openapi.yaml. All endpoints are prefixed with
/torc-service/v1.
Core Endpoints
Workflows
Create Workflow
# curl
curl -X POST http://localhost:8080/torc-service/v1/workflows \
-H "Content-Type: application/json" \
-d '{
"name": "test_workflow",
"user": "alice",
"description": "Test workflow"
}' | jq '.'
# nushell
http post http://localhost:8080/torc-service/v1/workflows {
name: "test_workflow"
user: "alice"
description: "Test workflow"
}
Response:
{
"id": 1,
"name": "test_workflow",
"user": "alice",
"description": "Test workflow",
"timestamp": 1699000000.0
}
List Workflows
# curl with jq
curl "http://localhost:8080/torc-service/v1/workflows?offset=0&limit=10" | jq '.workflows'
# nushell (native JSON parsing)
http get "http://localhost:8080/torc-service/v1/workflows?offset=0&limit=10" | get workflows
Get Workflow
# curl
curl http://localhost:8080/torc-service/v1/workflows/1 | jq '.'
# nushell
http get http://localhost:8080/torc-service/v1/workflows/1
Initialize Jobs
# curl
curl -X POST http://localhost:8080/torc-service/v1/workflows/1/initialize_jobs \
-H "Content-Type: application/json" \
-d '{"reinitialize": false, "ignore_missing_data": false}' | jq '.'
# nushell
http post http://localhost:8080/torc-service/v1/workflows/1/initialize_jobs {
reinitialize: false
ignore_missing_data: false
}
Jobs
Create Job
# curl
curl -X POST http://localhost:8080/torc-service/v1/jobs \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"name": "job1",
"command": "echo hello",
"resource_requirements_id": 1,
"input_file_ids": [],
"output_file_ids": [],
"depends_on_job_ids": []
}' | jq '.'
List Jobs
# curl - filter by status
curl "http://localhost:8080/torc-service/v1/jobs?workflow_id=1&status=ready" \
| jq '.jobs[] | {name, status, id}'
# nushell - filter and format
http get "http://localhost:8080/torc-service/v1/jobs?workflow_id=1"
| get jobs
| where status == "ready"
| select name status id
Update Job Status
# curl
curl -X POST http://localhost:8080/torc-service/v1/jobs/1/manage_status_change \
-H "Content-Type: application/json" \
-d '{"target_status": "running"}' | jq '.'
Files
Create File
# curl
curl -X POST http://localhost:8080/torc-service/v1/files \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"name": "input_data",
"path": "/data/input.csv"
}' | jq '.'
List Files
curl "http://localhost:8080/torc-service/v1/files?workflow_id=1" | jq '.files'
User Data
Create User Data
curl -X POST http://localhost:8080/torc-service/v1/user_data \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"name": "config",
"data": {"learning_rate": 0.001, "batch_size": 32}
}' | jq '.'
Update User Data
curl -X PUT http://localhost:8080/torc-service/v1/user_data/1 \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"name": "config",
"data": {"learning_rate": 0.01, "batch_size": 64}
}' | jq '.'
Resource Requirements
Create Resource Requirements
curl -X POST http://localhost:8080/torc-service/v1/resource_requirements \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"name": "gpu_large",
"num_cpus": 16,
"num_gpus": 4,
"num_nodes": 1,
"memory": "128g",
"runtime": "PT8H"
}' | jq '.'
Memory Format: String with suffix: 1m (MB), 2g (GB), 512k (KB)
Runtime Format: ISO 8601 duration: PT30M (30 minutes), PT2H (2 hours), P1DT12H (1.5 days)
Compute Nodes
Create Compute Node
curl -X POST http://localhost:8080/torc-service/v1/compute_nodes \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"hostname": "compute-01",
"num_cpus": 32,
"memory": "256g",
"num_gpus": 8,
"is_active": true
}' | jq '.'
List Active Compute Nodes
curl "http://localhost:8080/torc-service/v1/compute_nodes?workflow_id=1&is_active=true" \
| jq '.compute_nodes[] | {hostname, num_cpus, num_gpus}'
Results
Create Result
curl -X POST http://localhost:8080/torc-service/v1/results \
-H "Content-Type: application/json" \
-d '{
"workflow_id": 1,
"job_id": 1,
"exit_code": 0,
"stdout": "Job completed successfully",
"stderr": ""
}' | jq '.'
Events
List Events
curl "http://localhost:8080/torc-service/v1/events?workflow_id=1&limit=20" \
| jq '.events[] | {timestamp, data}'
Advanced Endpoints
Prepare Next Jobs for Submission (Job Runner)
curl -X POST "http://localhost:8080/torc-service/v1/workflows/1/claim_next_jobs?num_jobs=5" \
-H "Content-Type: application/json" \
-d '{}' | jq '.jobs'
Process Changed Job Inputs (Reinitialization)
curl -X POST "http://localhost:8080/torc-service/v1/workflows/1/process_changed_job_inputs?dry_run=true" \
-H "Content-Type: application/json" \
-d '{}' | jq '.reinitialized_jobs'
HPC & Slurm
Running Torc workflows on HPC clusters with Slurm.
- Submit a Workflow to Slurm - Running workflows on HPC clusters
- Slurm Workflows - Configuring Slurm-based workflows
- Debugging Slurm Workflows - Troubleshooting Slurm issues
- Working with Slurm - Slurm integration details
- HPC Profiles - Pre-configured HPC settings
- HPC Profiles Reference - Available profiles
- HPC Deployment - Deploying Torc on HPC systems
- Custom HPC Profile - Creating custom profiles
How to Submit a Workflow to Slurm
Submit a workflow specification to a Slurm-based HPC system with automatic scheduler generation.
Quick Start
torc submit-slurm --account <your-account> workflow.yaml
Torc will:
- Detect your HPC system (e.g., NREL Kestrel, Eagle)
- Match job requirements to appropriate partitions
- Generate Slurm scheduler configurations
- Submit everything for execution
Preview Before Submitting
Always preview the generated configuration first:
torc slurm generate --account <your-account> workflow.yaml
This shows the Slurm schedulers and workflow actions that would be created without submitting.
Requirements
Your workflow must define resource requirements for jobs:
name: my_workflow
resource_requirements:
- name: standard
num_cpus: 4
memory: 8g
runtime: PT1H
jobs:
- name: process_data
command: python process.py
resource_requirements: standard
Options
# See all options
torc submit-slurm --help
See Also
- Slurm Workflows — Full Slurm integration guide
- HPC Profiles — Available HPC system configurations
Slurm Workflows
This document explains how Torc simplifies running workflows on Slurm-based HPC systems. The key insight is that you don't need to understand Slurm schedulers or workflow actions to run workflows on HPC systems—Torc handles this automatically.
The Simple Approach
Running a workflow on Slurm requires just two things:
- Define your jobs with resource requirements
- Submit with submit-slurm
That's it. Torc will analyze your workflow, generate appropriate Slurm configurations, and submit everything for execution.
⚠️ Important: The submit-slurm command uses heuristics to auto-generate Slurm schedulers and workflow actions. For complex workflows with unusual dependency patterns, the generated configuration may not be optimal, resulting in poor allocation timing. Always preview the configuration first using torc slurm generate (see Previewing Generated Configuration) before submitting production workflows.
Example Workflow
Here's a complete workflow specification that runs on Slurm:
name: data_analysis_pipeline
description: Analyze experimental data with preprocessing, training, and evaluation
resource_requirements:
- name: light
num_cpus: 4
memory: 8g
runtime: PT30M
- name: compute
num_cpus: 32
memory: 64g
runtime: PT2H
- name: gpu
num_cpus: 16
num_gpus: 2
memory: 128g
runtime: PT4H
jobs:
- name: preprocess
command: python preprocess.py --input data/ --output processed/
resource_requirements: light
- name: train_model
command: python train.py --data processed/ --output model/
resource_requirements: gpu
depends_on: [preprocess]
- name: evaluate
command: python evaluate.py --model model/ --output results/
resource_requirements: compute
depends_on: [train_model]
- name: generate_report
command: python report.py --results results/
resource_requirements: light
depends_on: [evaluate]
Submitting the Workflow
torc submit-slurm --account myproject workflow.yaml
Torc will:
- Detect which HPC system you're on (e.g., NREL Kestrel)
- Match each job's requirements to appropriate partitions
- Generate Slurm scheduler configurations
- Create workflow actions that stage resource allocation based on dependencies
- Submit the workflow for execution
How It Works
When you use submit-slurm, Torc performs intelligent analysis of your workflow:
1. Per-Job Scheduler Generation
Each job gets its own Slurm scheduler configuration based on its resource requirements. This means:
- Jobs are matched to the most appropriate partition
- Memory, CPU, and GPU requirements are correctly specified
- Walltime is set to the partition's maximum (explained below)
2. Staged Resource Allocation
Torc analyzes job dependencies and creates staged workflow actions:
- Jobs without dependencies trigger on_workflow_start — resources are allocated immediately
- Jobs with dependencies trigger on_jobs_ready — resources are allocated only when the job becomes ready to run
This prevents wasting allocation time on resources that aren't needed yet. For example, in the workflow above:
- preprocess resources are allocated at workflow start
- train_model resources are allocated when preprocess completes
- evaluate resources are allocated when train_model completes
- generate_report resources are allocated when evaluate completes
3. Walltime Calculation
By default, Torc sets the walltime to 1.5× your longest job's runtime (capped at the partition's maximum). This provides headroom for jobs that run slightly longer than expected.
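For example, if the longest job's runtime is PT4H, the default 1.5× multiplier yields a 6-hour walltime request (06:00:00), unless that exceeds the partition's maximum.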
You can customize this behavior:
- --walltime-strategy max-job-runtime (default): Uses longest job runtime × multiplier
- --walltime-strategy max-partition-time: Uses the partition's maximum walltime
- --walltime-multiplier 2.0: Change the safety multiplier (default: 1.5)
See Walltime Strategy Options for details.
4. HPC Profile Knowledge
Torc includes built-in knowledge of HPC systems like NREL Kestrel, including:
- Available partitions and their resource limits
- GPU configurations
- Memory and CPU specifications
- Special requirements (e.g., minimum node counts for high-bandwidth partitions)
Using an unsupported HPC? Please request built-in support so everyone benefits. You can also create a custom profile for immediate use.
Resource Requirements Specification
Resource requirements are the key to the simplified workflow. Define them once and reference them from jobs:
resource_requirements:
- name: small
num_cpus: 4
num_gpus: 0
num_nodes: 1
memory: 8g
runtime: PT1H
- name: gpu_training
num_cpus: 32
num_gpus: 4
num_nodes: 1
memory: 256g
runtime: PT8H
Fields
| Field | Description | Example |
|---|---|---|
name | Reference name for jobs | "compute" |
num_cpus | CPU cores required | 32 |
num_gpus | GPUs required (0 if none) | 2 |
num_nodes | Nodes required | 1 |
memory | Memory with unit suffix | "64g", "512m" |
runtime | ISO8601 duration | "PT2H", "PT30M" |
Runtime Format
Use ISO8601 duration format:
- PT30M — 30 minutes
- PT2H — 2 hours
- PT1H30M — 1 hour 30 minutes
- P1D — 1 day
- P2DT4H — 2 days 4 hours
Job Dependencies
Define dependencies explicitly or implicitly through file/data relationships:
Explicit Dependencies
jobs:
- name: step1
command: ./step1.sh
resource_requirements: small
- name: step2
command: ./step2.sh
resource_requirements: small
depends_on: [step1]
- name: step3
command: ./step3.sh
resource_requirements: small
depends_on: [step1, step2] # Waits for both
Implicit Dependencies (via Files)
files:
- name: raw_data
path: /data/raw.csv
- name: processed_data
path: /data/processed.csv
jobs:
- name: process
command: python process.py
input_files: [raw_data]
output_files: [processed_data]
resource_requirements: compute
- name: analyze
command: python analyze.py
input_files: [processed_data] # Creates implicit dependency on 'process'
resource_requirements: compute
Previewing Generated Configuration
Recommended Practice: Always preview the generated configuration before submitting to Slurm, especially for complex workflows. This allows you to verify that schedulers and actions are appropriate for your workflow structure.
Viewing the Execution Plan
Before generating schedulers, visualize how your workflow will execute in stages:
torc workflows execution-plan workflow.yaml
This shows the execution stages, which jobs run at each stage, and (if schedulers are defined) when Slurm allocations are requested. See Visualizing Workflow Structure for detailed examples.
Generating Slurm Configuration
Preview what Torc will generate:
torc slurm generate --account myproject --profile kestrel workflow.yaml
This outputs the complete workflow with generated schedulers and actions; an example of the generated output is shown after the option descriptions below.
Scheduler Grouping Options
By default, Torc creates one scheduler per unique resource_requirements name. This means if
you have three jobs with three different resource requirement definitions (e.g., cpu, memory,
mixed), you get three schedulers—even if all three would fit on the same partition.
The --group-by option controls how jobs are grouped into schedulers:
# Default: one scheduler per resource_requirements name
torc slurm generate --account myproject workflow.yaml
torc slurm generate --account myproject --group-by resource-requirements workflow.yaml
# Result: 3 schedulers (cpu_scheduler, memory_scheduler, mixed_scheduler)
# Group by partition: one scheduler per partition
torc slurm generate --account myproject --group-by partition workflow.yaml
# Result: 1 scheduler (short_scheduler) if all jobs fit on the "short" partition
When to use --group-by partition:
- Your workflow has many small resource requirement definitions that all fit on the same partition
- You want to minimize Slurm queue overhead by reducing the number of allocations
- Jobs have similar characteristics and can share nodes efficiently
When to use --group-by resource-requirements (default):
- Jobs have significantly different resource profiles that benefit from separate allocations
- You want fine-grained control over which jobs share resources
- You're debugging and want clear separation between job types
When grouping by partition, the scheduler uses the maximum resource values from all grouped requirements (max memory, max CPUs, max runtime, etc.) to ensure all jobs can run.
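For example, grouping a requirement of 4 CPUs and 8g of memory with one of 16 CPUs and 4g produces a single scheduler sized for 16 CPUs and 8g, with the walltime taken from the longer of the two runtimes.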
Walltime Strategy Options
The --walltime-strategy option controls how Torc calculates the walltime for generated schedulers:
# Default: use max job runtime with a safety multiplier (1.5x)
torc slurm generate --account myproject workflow.yaml
torc slurm generate --account myproject --walltime-strategy max-job-runtime workflow.yaml
# Use the partition's maximum allowed walltime
torc slurm generate --account myproject --walltime-strategy max-partition-time workflow.yaml
Walltime strategies:
| Strategy | Description |
|---|---|
max-job-runtime | Uses the longest job's runtime × multiplier (default: 1.5x). Capped at partition max. |
max-partition-time | Uses the partition's maximum walltime. More conservative but may impact queue scheduling. |
Customizing the multiplier:
The --walltime-multiplier option (default: 1.5) provides a safety margin when using
max-job-runtime:
# Use 2x the max job runtime for extra buffer
torc slurm generate --account myproject --walltime-multiplier 2.0 workflow.yaml
# Use exact job runtime (no buffer - use with caution)
torc slurm generate --account myproject --walltime-multiplier 1.0 workflow.yaml
When to use max-job-runtime (default):
- You want better queue scheduling (shorter walltime requests often get prioritized)
- Your job runtime estimates are reasonably accurate
- You prefer the Torc runner to exit early rather than holding idle allocations
When to use max-partition-time:
- Your job runtimes are highly variable or unpredictable
- You consistently underestimate job runtimes
- Queue priority is not a concern
The generated output looks similar to this:
name: data_analysis_pipeline
# ... original content ...
jobs:
- name: preprocess
command: python preprocess.py --input data/ --output processed/
resource_requirements: light
scheduler: preprocess_scheduler
# ... more jobs ...
slurm_schedulers:
- name: preprocess_scheduler
account: myproject
mem: 8g
nodes: 1
walltime: "04:00:00"
- name: train_model_scheduler
account: myproject
mem: 128g
nodes: 1
gres: "gpu:2"
walltime: "04:00:00"
# ... more schedulers ...
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: preprocess_scheduler
scheduler_type: slurm
num_allocations: 1
- trigger_type: on_jobs_ready
action_type: schedule_nodes
jobs: [train_model]
scheduler: train_model_scheduler
scheduler_type: slurm
num_allocations: 1
# ... more actions ...
Save the output to inspect or modify before submission:
torc slurm generate --account myproject workflow.yaml -o workflow_with_schedulers.yaml
Torc Server Considerations
The Torc server must be accessible to compute nodes. Options include:
- Shared server (Recommended): A team member allocates a dedicated server in the HPC environment
- Login node: Suitable for small workflows with few, long-running jobs
For large workflows with many short jobs, a dedicated server prevents overloading login nodes.
Best Practices
1. Focus on Resource Requirements
Spend time accurately defining resource requirements. Torc handles the rest:
resource_requirements:
# Be specific about what each job type needs
- name: io_heavy
num_cpus: 4
memory: 32g # High memory for data loading
runtime: PT1H
- name: compute_heavy
num_cpus: 64
memory: 16g # Less memory, more CPU
runtime: PT4H
2. Use Meaningful Names
Name resource requirements by their purpose, not by partition:
# Good - describes the workload
resource_requirements:
- name: data_preprocessing
- name: model_training
- name: inference
# Avoid - ties you to specific infrastructure
resource_requirements:
- name: short_partition
- name: gpu_h100
3. Group Similar Jobs
Jobs with similar requirements can share resource requirement definitions:
resource_requirements:
- name: quick_task
num_cpus: 2
memory: 4g
runtime: PT15M
jobs:
- name: validate_input
command: ./validate.sh
resource_requirements: quick_task
- name: check_output
command: ./check.sh
resource_requirements: quick_task
depends_on: [main_process]
4. Test Locally First
Validate your workflow logic locally before submitting to HPC:
# Run locally (without Slurm)
torc run workflow.yaml
# Then submit to HPC
torc submit-slurm --account myproject workflow.yaml
Limitations and Caveats
The auto-generation in submit-slurm uses heuristics that work well for common workflow patterns
but may not be optimal for all cases:
When Auto-Generation Works Well
- Linear pipelines: A → B → C → D
- Fan-out patterns: One job unblocks many (e.g., preprocess → 100 work jobs)
- Fan-in patterns: Many jobs unblock one (e.g., 100 work jobs → postprocess)
- Simple DAGs: Clear dependency structures with distinct resource tiers
When to Use Manual Configuration
Consider using torc slurm generate to preview and manually adjust, or define schedulers manually,
when:
- Complex dependency graphs: Multiple interleaved dependency patterns
- Shared schedulers: You want multiple jobs to share the same Slurm allocation
- Custom timing: Specific requirements for when allocations should be requested
- Resource optimization: Fine-tuning to minimize allocation waste
- Multi-node jobs: Jobs requiring coordination across multiple nodes
What Could Go Wrong
Without previewing, auto-generation might:
- Request allocations too early: Wasting queue time waiting for dependencies
- Request allocations too late: Adding latency to job startup
- Create suboptimal scheduler groupings: Not sharing allocations when beneficial
- Miss optimization opportunities: Not recognizing patterns that could share resources
Best Practice: For production workflows, always run torc slurm generate first, review the
output, and submit the reviewed configuration with torc submit.
Advanced: Manual Scheduler Configuration
For advanced users who need fine-grained control, you can define schedulers and actions manually. See Working with Slurm for details.
Common reasons for manual configuration:
- Non-standard partition requirements
- Custom Slurm directives (e.g., --constraint)
- Multi-node jobs with specific topology requirements
- Reusing allocations across multiple jobs for efficiency
Troubleshooting
"No partition found for job"
Your resource requirements exceed what's available. Check:
- Memory doesn't exceed partition limits
- Runtime doesn't exceed partition walltime
- GPU count is available on GPU partitions
Use torc hpc partitions <profile> to see available resources.
Jobs Not Starting
Ensure the Torc server is accessible from compute nodes:
# From a compute node
curl $TORC_API_URL/health
Wrong Partition Selected
Use torc hpc match to see which partitions match your requirements:
torc hpc match kestrel --cpus 32 --memory 64g --walltime 2h --gpus 2
See Also
- Visualizing Workflow Structure — Execution plans and DAG visualization
- HPC Profiles — Detailed HPC profile usage
- Working with Slurm — Advanced Slurm configuration
- Resource Requirements Reference — Complete specification
- Workflow Actions — Understanding actions
Debugging Slurm Workflows
When running workflows on Slurm clusters, Torc provides additional debugging tools specifically designed for Slurm environments. This guide covers Slurm-specific debugging techniques and tools.
For general debugging concepts and tools that apply to all workflows, see Debugging Workflows.
Overview
Slurm workflows generate additional log files beyond the standard job logs:
- Slurm stdout/stderr: Output from Slurm's perspective (job allocation, environment setup)
- Slurm environment logs: All SLURM environment variables captured at job runner startup
- dmesg logs: Kernel message buffer captured when the Slurm job runner exits
These logs help diagnose issues specific to the cluster environment, such as resource allocation failures, node problems, and system-level errors.
Slurm Log File Structure
For jobs executed via Slurm scheduler (compute_node_type: "slurm"), the debug report includes
these additional log paths:
{
"job_stdout": "output/job_stdio/job_wf1_j456_r1.o",
"job_stderr": "output/job_stdio/job_wf1_j456_r1.e",
"job_runner_log": "output/job_runner_slurm_wf1_sl12345_n0_pid67890.log",
"slurm_stdout": "output/slurm_output_wf1_sl12345.o",
"slurm_stderr": "output/slurm_output_wf1_sl12345.e",
"slurm_env_log": "output/slurm_env_wf1_sl12345_n0_pid67890.log",
"dmesg_log": "output/dmesg_slurm_wf1_sl12345_n0_pid67890.log"
}
All Slurm log files include the workflow ID (wf<id>) prefix, making it easy to identify and
collect logs for a specific workflow.
Log File Descriptions
- slurm_stdout (output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.o):
  - Standard output from Slurm's perspective
  - Includes Slurm environment setup, job allocation info
  - Use for: Debugging Slurm job submission issues
- slurm_stderr (output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.e):
  - Standard error from Slurm's perspective
  - Contains Slurm-specific errors (allocation failures, node issues)
  - Use for: Investigating Slurm scheduler problems
- job_runner_log (output/job_runner_slurm_wf<id>_sl<slurm_job_id>_n<node>_pid<pid>.log):
  - Log output from the Torc Slurm job runner process
  - Contains job execution details, status updates, and runner-level errors
  - Use for: Debugging job runner issues, understanding job execution flow
- slurm_env_log (output/slurm_env_wf<id>_sl<slurm_job_id>_n<node_id>_pid<task_pid>.log):
  - All SLURM environment variables captured at job runner startup
  - Contains job allocation details, resource limits, node assignments
  - Use for: Verifying Slurm job configuration, debugging resource allocation issues
- dmesg_log (output/dmesg_slurm_wf<id>_sl<slurm_job_id>_n<node_id>_pid<task_pid>.log):
  - Kernel message buffer captured when the Slurm job runner exits (only on failure)
  - Contains system-level events: OOM killer activity, hardware errors, kernel panics
  - Use for: Investigating job failures caused by system-level issues (e.g., out-of-memory kills, hardware failures)
Note: All Slurm log files include the workflow ID, Slurm job ID, node ID, and task PID in the filename for easy filtering and correlation with Slurm's own logs.
Parsing Slurm Log Files for Errors
The torc slurm parse-logs command scans Slurm stdout/stderr log files for known error patterns and
correlates them with affected Torc jobs:
# Parse logs for a specific workflow
torc slurm parse-logs <workflow_id>
# Specify custom output directory
torc slurm parse-logs <workflow_id> --output-dir /path/to/output
# Output as JSON for programmatic processing
torc slurm parse-logs <workflow_id> --format json
Detected Error Patterns
The command detects common Slurm failure patterns including:
Memory Errors:
- out of memory, oom-kill, cannot allocate memory
- memory cgroup out of memory, Exceeded job memory limit
- task/cgroup: .*: Killed
- std::bad_alloc (C++), MemoryError (Python)
Slurm-Specific Errors:
- slurmstepd: error:, srun: error:
- DUE TO TIME LIMIT, DUE TO PREEMPTION
- NODE_FAIL, FAILED, CANCELLED
- Exceeded.*step.*limit
GPU/CUDA Errors:
- CUDA out of memory, CUDA error, GPU memory.*exceeded
Signal/Crash Errors:
- Segmentation fault, SIGSEGV
- Bus error, SIGBUS
- killed by signal, core dumped
Python Errors:
- Traceback (most recent call last)
- ModuleNotFoundError, ImportError
File System Errors:
- No space left on device, Disk quota exceeded
- Read-only file system, Permission denied
Network Errors:
- Connection refused, Connection timed out, Network is unreachable
Example Output
Table format:
Slurm Log Analysis Results
==========================
Found 2 error(s) in log files:
╭─────────────────────────────┬──────────────┬──────┬─────────────────────────────┬──────────┬──────────────────────────────╮
│ File │ Slurm Job ID │ Line │ Pattern │ Severity │ Affected Torc Jobs │
├─────────────────────────────┼──────────────┼──────┼─────────────────────────────┼──────────┼──────────────────────────────┤
│ slurm_output_sl12345.e │ 12345 │ 42 │ Out of Memory (OOM) Kill │ critical │ process_data (ID: 456) │
│ slurm_output_sl12346.e │ 12346 │ 15 │ CUDA out of memory │ error │ train_model (ID: 789) │
╰─────────────────────────────┴──────────────┴──────┴─────────────────────────────┴──────────┴──────────────────────────────╯
Viewing Slurm Accounting Data
The torc slurm sacct command displays a summary of Slurm job accounting data for all scheduled
compute nodes in a workflow:
# Display sacct summary table for a workflow
torc slurm sacct <workflow_id>
# Also save full JSON files for detailed analysis
torc slurm sacct <workflow_id> --save-json --output-dir /path/to/output
# Output as JSON for programmatic processing
torc slurm sacct <workflow_id> --format json
Summary Table Fields
The command displays a summary table with key metrics:
- Slurm Job: The Slurm job ID
- Job Step: Name of the job step (e.g., "worker_1", "batch")
- State: Job state (COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, etc.)
- Exit Code: Exit code of the job step
- Elapsed: Wall clock time for the job step
- Max RSS: Maximum resident set size (memory usage)
- CPU Time: Total CPU time consumed
- Nodes: Compute nodes used
Example Output
Slurm Accounting Summary for Workflow 123
╭────────────┬───────────┬───────────┬───────────┬─────────┬─────────┬──────────┬─────────╮
│ Slurm Job │ Job Step │ State │ Exit Code │ Elapsed │ Max RSS │ CPU Time │ Nodes │
├────────────┼───────────┼───────────┼───────────┼─────────┼─────────┼──────────┼─────────┤
│ 12345 │ worker_1 │ COMPLETED │ 0 │ 2h 15m │ 4.5GB │ 4h 30m │ node01 │
│ 12345 │ batch │ COMPLETED │ 0 │ 2h 16m │ 128.0MB │ 1m 30s │ node01 │
│ 12346 │ worker_1 │ FAILED │ 1 │ 45m 30s │ 8.2GB │ 1h 30m │ node02 │
╰────────────┴───────────┴───────────┴───────────┴─────────┴─────────┴──────────┴─────────╯
Total: 3 job steps
Saving Full JSON Output
Use --save-json to save full sacct JSON output to files for detailed analysis:
torc slurm sacct 123 --save-json --output-dir output
# Creates: output/sacct_12345.json, output/sacct_12346.json, etc.
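To scan every saved file in one pass, a small shell loop works. This sketch reuses the .jobs[].steps[].tres.requested path shown later in this guide; the exact JSON layout depends on your Slurm version:
# Print requested TRES for every step in each saved sacct file
for f in output/sacct_*.json; do
  echo "== $f =="
  jq '.jobs[].steps[].tres.requested' "$f"
done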
Viewing Slurm Logs in torc-dash
The torc-dash web interface provides two ways to view Slurm logs:
Debugging Tab - Slurm Log Analysis
The Debugging tab includes a "Slurm Log Analysis" section:
- Navigate to the Debugging tab
- Find the Slurm Log Analysis section
- Enter the output directory path (default: output)
- Click Analyze Slurm Logs
The results show all detected errors with their Slurm job IDs, line numbers, error patterns, severity levels, and affected Torc jobs.
Debugging Tab - Slurm Accounting Data
The Debugging tab also includes a "Slurm Accounting Data" section:
- Navigate to the Debugging tab
- Find the Slurm Accounting Data section
- Click Collect sacct Data
This displays a summary table showing job state, exit codes, elapsed time, memory usage (Max RSS), CPU time, and nodes for all Slurm job steps. The table helps quickly identify failed jobs and resource usage patterns.
Scheduled Nodes Tab - View Slurm Logs
You can view individual Slurm job logs directly from the Details view:
- Select a workflow
- Go to the Details tab
- Switch to the Scheduled Nodes sub-tab
- Find a Slurm scheduled node in the table
- Click the View Logs button in the Logs column
This opens a modal with tabs for viewing the Slurm job's stdout and stderr files.
Viewing Slurm Logs in the TUI
The torc tui terminal interface also supports Slurm log viewing:
- Launch the TUI: torc tui
- Select a workflow and press Enter to load details
- Press Tab to switch to the Scheduled Nodes tab
- Navigate to a Slurm scheduled node using arrow keys
- Press l to view the Slurm job's logs
The log viewer shows:
- stdout tab: Slurm job standard output (slurm_output_wf<id>_sl<slurm_job_id>.o)
- stderr tab: Slurm job standard error (slurm_output_wf<id>_sl<slurm_job_id>.e)
Use Tab to switch between stdout/stderr, arrow keys to scroll, / to search, and q to close.
Debugging Slurm Job Failures
When a Slurm job fails, follow this debugging workflow:
1. Parse logs for known errors:
   torc slurm parse-logs <workflow_id>
2. If OOM or resource issues are detected, collect sacct data:
   torc slurm sacct <workflow_id>
   cat output/sacct_<slurm_job_id>.json | jq '.jobs[].steps[].tres.requested'
3. View the specific Slurm log files:
   - Use torc-dash: Details → Scheduled Nodes → View Logs
   - Or use the TUI: Scheduled Nodes tab → press l
   - Or directly: cat output/slurm_output_wf<workflow_id>_sl<slurm_job_id>.e
4. Check the job's own stderr for application errors:
   torc reports results <workflow_id> > report.json
   jq -r '.results[] | select(.return_code != 0) | .job_stderr' report.json | xargs cat
5. Review dmesg logs for system-level issues:
   cat output/dmesg_slurm_wf<workflow_id>_sl<slurm_job_id>_*.log
Orphaned Jobs and Status Synchronization
When a Slurm allocation terminates unexpectedly (e.g., due to timeout, node failure, or admin intervention), jobs may become "orphaned" - stuck in "running" status in Torc's database even though no process is actually executing them.
Detecting Orphaned Jobs
Common signs of orphaned jobs:
- Jobs remain in "running" status long after the Slurm allocation ended
- torc recover reports "there are active Slurm allocations" but squeue shows none
- Workflow appears stuck but no Slurm jobs are actually running
Synchronizing Status with Slurm
The torc workflows sync-status command detects and fixes orphaned jobs by checking the actual
Slurm state:
# Preview what would be cleaned up (recommended first)
torc workflows sync-status <workflow_id> --dry-run
# Clean up orphaned jobs
torc workflows sync-status <workflow_id>
# Get JSON output for scripting
torc -f json workflows sync-status <workflow_id>
This command:
- Checks each "active" scheduled compute node against squeue
- If Slurm reports the job is no longer running, marks associated Torc jobs as failed
- Updates scheduled compute node status to "complete"
- Also handles "pending" allocations that were cancelled before starting
Example Output
Synchronizing job statuses for workflow 42...
Cleaned up orphaned jobs:
- 3 job(s) from terminated Slurm allocations
- 1 pending allocation(s) that no longer exist in Slurm
Affected jobs:
- Job 107 (train_model_7): Allocation terminated (Slurm job 12345)
- Job 112 (train_model_12): Allocation terminated (Slurm job 12345)
- Job 123 (train_model_23): Allocation terminated (Slurm job 12345)
Total: 3 job(s) marked as failed
You can now run `torc recover 42` to retry failed jobs.
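The JSON output is convenient for scripting. As one illustration, a periodic dry-run check from a login-node shell (the interval and dry-run-only pattern are illustrative, not required):
# Report orphaned jobs every 10 minutes without changing anything
while true; do
  torc workflows sync-status <workflow_id> --dry-run
  sleep 600
done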
Automatic Cleanup in Recovery
The torc recover command automatically performs orphan detection as its first step, so you
typically don't need to run sync-status manually before recovery. However, sync-status is useful
when:
- You want to clean up orphaned jobs without triggering a full recovery
- You want to preview what recover would clean up (using --dry-run)
- You're debugging why recover reports active allocations
Common Slurm Issues and Solutions
Out of Memory (OOM) Kills
Symptoms:
- torc slurm parse-logs shows "Out of Memory (OOM) Kill"
- Job exits with signal 9 (SIGKILL)
- dmesg log shows "oom-kill" entries
Solutions:
- Increase memory request in job specification
- Check torc slurm sacct output for actual memory usage (Max RSS); see the sketch after this list
- Consider splitting the job into smaller chunks
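In practice, the quickest loop is to confirm the actual peak usage and then let recovery raise the limit. A minimal sketch (the 2.0 multiplier is an illustrative choice; the default is 1.5, as described under Automatic Failure Recovery):
# Confirm Max RSS for the failed step, then preview and apply a larger memory multiplier
torc slurm sacct <workflow_id>
torc recover <workflow_id> --dry-run --memory-multiplier 2.0
torc recover <workflow_id> --memory-multiplier 2.0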
Time Limit Exceeded
Symptoms:
- torc slurm parse-logs shows "DUE TO TIME LIMIT"
- Job state in sacct shows "TIMEOUT"
Solutions:
- Increase runtime in job specification
- Check if job is stuck (review stdout for progress)
- Consider optimizing the job or splitting into phases
Node Failures
Symptoms:
- torc slurm parse-logs shows "NODE_FAIL"
- Job may have completed partially
Solutions:
- Reinitialize workflow to retry failed jobs
- Check cluster status with sinfo
- Review dmesg logs for hardware issues
GPU/CUDA Errors
Symptoms:
- torc slurm parse-logs shows "CUDA out of memory" or "CUDA error"
Solutions:
- Reduce batch size or model size
- Check GPU memory with nvidia-smi in the job script (see the sketch after this list)
- Ensure the correct CUDA version is loaded
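A low-effort safeguard is to log GPU memory at the start of the job script so failures can be correlated with device state (the flags below are standard nvidia-smi options):
# Record GPU name plus total and used memory before the workload starts
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv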
Related Commands
- torc slurm parse-logs: Parse Slurm logs for known error patterns
- torc slurm sacct: Collect Slurm accounting data for workflow jobs
- torc workflows sync-status: Detect and fix orphaned jobs from terminated Slurm allocations
- torc reports results: Generate debug report with all log file paths
- torc results list: View summary of job results in table format
- torc-dash: Launch web interface with Slurm log viewing
- torc tui: Launch terminal UI with Slurm log viewing
See Also
- Debugging Workflows — General debugging tools and workflows
- Working with Logs — Bundling and analyzing logs
Working with Slurm
This guide covers advanced Slurm configuration for users who need fine-grained control over their HPC workflows.
For most users: See Slurm Workflows for the recommended approach using
torc submit-slurm. You don't need to manually configure schedulers or actions—Torc handles this automatically.
When to Use Manual Configuration
Manual Slurm configuration is useful when you need:
- Custom Slurm directives (e.g., --constraint, --exclusive)
- Multi-node jobs with specific topology requirements
- Shared allocations across multiple jobs for efficiency
- Non-standard partition configurations
- Fine-tuned control over allocation timing
Torc Server Requirements
The Torc server must be accessible from compute nodes:
- External server (Recommended): A team member allocates a shared server in the HPC environment. This is recommended if your operations team provides this capability.
- Login node: Suitable for small workflows. The server runs single-threaded by default. If you have many thousands of short jobs, check with your operations team about resource limits.
Manual Scheduler Configuration
Defining Slurm Schedulers
Define schedulers in your workflow specification:
slurm_schedulers:
- name: standard
account: my_project
nodes: 1
walltime: "12:00:00"
partition: compute
mem: 64G
- name: gpu_nodes
account: my_project
nodes: 1
walltime: "08:00:00"
partition: gpu
gres: "gpu:4"
mem: 256G
Scheduler Fields
| Field | Description | Required |
|---|---|---|
name | Scheduler identifier | Yes |
account | Slurm account/allocation | Yes |
nodes | Number of nodes | Yes |
walltime | Time limit (HH:MM:SS or D-HH:MM:SS) | Yes |
partition | Slurm partition | No |
mem | Memory per node | No |
gres | Generic resources (e.g., GPUs) | No |
qos | Quality of Service | No |
ntasks_per_node | Tasks per node | No |
tmp | Temporary disk space | No |
extra | Additional sbatch arguments | No |
Defining Workflow Actions
Actions trigger scheduler allocations:
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: standard
scheduler_type: slurm
num_allocations: 1
- trigger_type: on_jobs_ready
action_type: schedule_nodes
jobs: [train_model]
scheduler: gpu_nodes
scheduler_type: slurm
num_allocations: 2
Action Trigger Types
| Trigger | Description |
|---|---|
on_workflow_start | Fires when workflow is submitted |
on_jobs_ready | Fires when specified jobs become ready |
on_jobs_complete | Fires when specified jobs complete |
on_workflow_complete | Fires when all jobs complete |
Assigning Jobs to Schedulers
Reference schedulers in job definitions:
jobs:
- name: preprocess
command: ./preprocess.sh
scheduler: standard
- name: train
command: python train.py
scheduler: gpu_nodes
depends_on: [preprocess]
Scheduling Strategies
Strategy 1: Many Single-Node Allocations
Submit multiple Slurm jobs, each with its own Torc worker:
slurm_schedulers:
- name: work_scheduler
account: my_account
nodes: 1
walltime: "04:00:00"
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: work_scheduler
scheduler_type: slurm
num_allocations: 10
When to use:
- Jobs have diverse resource requirements
- Want independent time limits per job
- Cluster has low queue wait times
Benefits:
- Maximum scheduling flexibility
- Independent time limits per allocation
- Fault isolation
Drawbacks:
- More Slurm queue overhead
- Multiple jobs to schedule
Strategy 2: Multi-Node Allocation, One Worker Per Node
Launch multiple workers within a single allocation:
slurm_schedulers:
- name: work_scheduler
account: my_account
nodes: 10
walltime: "04:00:00"
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: work_scheduler
scheduler_type: slurm
num_allocations: 1
start_one_worker_per_node: true
When to use:
- Many jobs with similar requirements
- Want faster queue scheduling (larger jobs often prioritized)
Benefits:
- Single queue wait
- Often prioritized by Slurm scheduler
Drawbacks:
- Shared time limit for all workers
- Less flexibility
Strategy 3: Single Worker Per Allocation
One Torc worker handles all nodes:
slurm_schedulers:
- name: work_scheduler
account: my_account
nodes: 10
walltime: "04:00:00"
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: work_scheduler
scheduler_type: slurm
num_allocations: 1
When to use:
- Your application manages node coordination
- Need full control over compute resources
Staged Allocations
For pipelines with distinct phases, stage allocations to avoid wasted resources:
slurm_schedulers:
- name: preprocess_sched
account: my_project
nodes: 2
walltime: "01:00:00"
- name: compute_sched
account: my_project
nodes: 20
walltime: "08:00:00"
- name: postprocess_sched
account: my_project
nodes: 1
walltime: "00:30:00"
actions:
# Preprocessing starts immediately
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: preprocess_sched
scheduler_type: slurm
num_allocations: 1
# Compute nodes allocated when compute jobs are ready
- trigger_type: on_jobs_ready
action_type: schedule_nodes
jobs: [compute_step]
scheduler: compute_sched
scheduler_type: slurm
num_allocations: 1
start_one_worker_per_node: true
# Postprocessing allocated when those jobs are ready
- trigger_type: on_jobs_ready
action_type: schedule_nodes
jobs: [postprocess]
scheduler: postprocess_sched
scheduler_type: slurm
num_allocations: 1
Note: The torc submit-slurm command handles this automatically by analyzing job dependencies.
Custom Slurm Directives
Use the extra field for additional sbatch arguments:
slurm_schedulers:
- name: exclusive_nodes
account: my_project
nodes: 4
walltime: "04:00:00"
extra: "--exclusive --constraint=skylake"
Submitting Workflows
With Manual Configuration
# Submit workflow with pre-defined schedulers and actions
torc submit workflow.yaml
Scheduling Additional Nodes
Add more allocations to a running workflow:
torc slurm schedule-nodes -n 5 $WORKFLOW_ID
Debugging
Check Slurm Job Status
squeue --me
View Torc Worker Logs
Workers log to the Slurm output file. Check:
cat slurm-<jobid>.out
Verify Server Connectivity
From a compute node:
curl $TORC_API_URL/health
See Also
- Slurm Workflows — Simplified workflow approach
- HPC Profiles — Automatic partition matching
- Workflow Actions — Action system details
- Debugging Slurm Workflows — Troubleshooting guide
Working with HPC Profiles
HPC (High-Performance Computing) profiles provide pre-configured knowledge about specific HPC systems, including their partitions, resource limits, and optimal settings. Torc uses this information to automatically match job requirements to appropriate partitions.
Overview
HPC profiles contain:
- Partition definitions: Available queues with their resource limits (CPUs, memory, walltime, GPUs)
- Detection rules: How to identify when you're on a specific HPC system
- Default settings: Account names and other system-specific defaults
Built-in profiles are available for systems like NREL's Kestrel. You can also define custom profiles for private clusters.
Listing Available Profiles
View all known HPC profiles:
torc hpc list
Example output:
Known HPC profiles:
╭─────────┬──────────────┬────────────┬──────────╮
│ Name │ Display Name │ Partitions │ Detected │
├─────────┼──────────────┼────────────┼──────────┤
│ kestrel │ NREL Kestrel │ 15 │ ✓ │
╰─────────┴──────────────┴────────────┴──────────╯
The "Detected" column shows if Torc recognizes you're currently on that system.
Detecting the Current System
Torc can automatically detect which HPC system you're on:
torc hpc detect
Detection works through environment variables. For example, NREL Kestrel is detected when
NREL_CLUSTER=kestrel is set.
Viewing Profile Details
See detailed information about a specific profile:
torc hpc show kestrel
This displays:
- Profile name and description
- Detection method
- Default account (if configured)
- Number of partitions
Viewing Available Partitions
List all partitions for a profile:
torc hpc partitions kestrel
Example output:
Partitions for kestrel:
╭──────────┬─────────────┬───────────┬─────────────────┬─────────────────╮
│ Name │ CPUs/Node │ Mem/Node │ Max Walltime │ GPUs │
├──────────┼─────────────┼───────────┼─────────────────┼─────────────────┤
│ debug │ 104 │ 240 GB │ 1h │ - │
│ short │ 104 │ 240 GB │ 4h │ - │
│ standard │ 104 │ 240 GB │ 48h │ - │
│ gpu-h100 │ 2 │ 240 GB │ 48h │ 4 (H100) │
│ ... │ ... │ ... │ ... │ ... │
╰──────────┴─────────────┴───────────┴─────────────────┴─────────────────╯
Finding Matching Partitions
Find partitions that can satisfy specific resource requirements:
torc hpc match kestrel --cpus 32 --memory 64g --walltime 2h
Options:
- --cpus <N>: Required CPU cores
- --memory <SIZE>: Required memory (e.g., 64g, 512m)
- --walltime <DURATION>: Required walltime (e.g., 2h, 4:00:00)
- --gpus <N>: Required GPUs (optional)
This is useful for understanding which partitions your jobs will be assigned to.
Custom HPC Profiles
If your HPC system doesn't have a built-in profile, you have two options:
Request Built-in Support (Recommended)
If your HPC is widely used, please open an issue requesting built-in support. Include:
- Your HPC system name and organization
- Partition names with resource limits (CPUs, memory, walltime, GPUs)
- Detection method (environment variable or hostname pattern)
Built-in profiles benefit everyone using that system and are maintained by the Torc team.
If you need to use your HPC immediately or have a private cluster, you can define a custom profile in your configuration file. See the Custom HPC Profile Tutorial for a complete walkthrough.
Quick Example
Define custom profiles in your configuration file:
# ~/.config/torc/config.toml
[client.hpc.custom_profiles.mycluster]
display_name = "My Research Cluster"
description = "Internal research HPC system"
detect_env_var = "MY_CLUSTER=research"
default_account = "default_project"
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "compute"
cpus_per_node = 64
memory_mb = 256000
max_walltime_secs = 172800
shared = false
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 128000
max_walltime_secs = 86400
gpus_per_node = 4
gpu_type = "A100"
shared = false
See Configuration Reference for full configuration options.
Using Profiles with Slurm Workflows
HPC profiles are used by Slurm-related commands to automatically generate scheduler configurations. See Working with Slurm for details on:
- torc submit-slurm - Submit workflows with auto-generated schedulers
- torc workflows create-slurm - Create workflows with auto-generated schedulers
See Also
- Working with Slurm
- Custom HPC Profile Tutorial
- HPC Profiles Reference
- Configuration Reference
- Resource Requirements Reference
HPC Profiles Reference
Complete reference for HPC profile system and CLI commands.
Overview
HPC profiles contain pre-configured knowledge about High-Performance Computing systems, enabling automatic Slurm scheduler generation based on job resource requirements.
CLI Commands
torc hpc list
List all available HPC profiles.
torc hpc list [OPTIONS]
Options:
| Option | Description |
|---|---|
-f, --format <FORMAT> | Output format: table or json |
Output columns:
- Name: Profile identifier used in commands
- Display Name: Human-readable name
- Partitions: Number of configured partitions
- Detected: Whether current system matches this profile
torc hpc detect
Detect the current HPC system.
torc hpc detect [OPTIONS]
Options:
| Option | Description |
|---|---|
-f, --format <FORMAT> | Output format: table or json |
Returns the detected profile name, or indicates no match.
torc hpc show
Display detailed information about an HPC profile.
torc hpc show <PROFILE> [OPTIONS]
Arguments:
| Argument | Description |
|---|---|
<PROFILE> | Profile name (e.g., kestrel) |
Options:
| Option | Description |
|---|---|
-f, --format <FORMAT> | Output format: table or json |
torc hpc partitions
List partitions for an HPC profile.
torc hpc partitions <PROFILE> [OPTIONS]
Arguments:
| Argument | Description |
|---|---|
<PROFILE> | Profile name (e.g., kestrel) |
Options:
| Option | Description |
|---|---|
-f, --format <FORMAT> | Output format: table or json |
Output columns:
- Name: Partition name
- CPUs/Node: CPU cores per node
- Mem/Node: Memory per node
- Max Walltime: Maximum job duration
- GPUs: GPU count and type (if applicable)
- Shared: Whether partition supports shared jobs
- Notes: Special requirements or features
torc hpc match
Find partitions matching resource requirements.
torc hpc match <PROFILE> [OPTIONS]
Arguments:
| Argument | Description |
|---|---|
<PROFILE> | Profile name (e.g., kestrel) |
Options:
| Option | Description |
|---|---|
--cpus <N> | Required CPU cores |
--memory <SIZE> | Required memory (e.g., 64g, 512m) |
--walltime <DURATION> | Required walltime (e.g., 2h, 4:00:00) |
--gpus <N> | Required GPUs |
-f, --format <FORMAT> | Output format: table or json |
Memory format: <number><unit> where unit is k, m, g, or t (case-insensitive).
Walltime formats:
- HH:MM:SS (e.g., 04:00:00)
- <N>h (e.g., 4h)
- <N>m (e.g., 30m)
- <N>s (e.g., 3600s)
torc hpc generate
Generate an HPC profile configuration from the current Slurm cluster.
torc hpc generate [OPTIONS]
Options:
| Option | Description |
|---|---|
--name <NAME> | Profile name (defaults to cluster name or hostname) |
--display-name <NAME> | Human-readable display name |
-o, --output <FILE> | Output file path (prints to stdout if not specified) |
--skip-stdby | Skip standby partitions (names ending in -stdby) |
How it works:
- Queries sinfo to get partition names, CPUs, memory, time limits, and GRES
- Queries scontrol show partition for each partition to get additional details
- Parses GRES strings to extract GPU count and type
- Generates hostname-based detection pattern from current hostname
- Outputs TOML configuration ready to add to your config file
Example:
# Generate profile from current cluster
torc hpc generate
# Output:
# [client.hpc.custom_profiles.mycluster]
# display_name = "Mycluster"
# detect_hostname = ".*\\.mycluster\\.edu"
#
# [[client.hpc.custom_profiles.mycluster.partitions]]
# name = "compute"
# cpus_per_node = 64
# memory_mb = 256000
# max_walltime_secs = 172800
# ...
Fields extracted automatically:
- Partition name, CPUs per node, memory (MB), max walltime (seconds)
- GPU count and type from GRES (e.g., gpu:a100:4)
- Shared node support from the OverSubscribe setting
Fields that may need manual adjustment:
- requires_explicit_request: Defaults to false; set to true for partitions that shouldn't be auto-selected
- description: Not available from Slurm; add human-readable descriptions
- gpu_memory_gb: Not available from Slurm; add if known
torc slurm generate
Generate Slurm schedulers for a workflow based on job resource requirements.
torc slurm generate [OPTIONS] --account <ACCOUNT> <WORKFLOW_FILE>
Arguments:
| Argument | Description |
|---|---|
<WORKFLOW_FILE> | Path to workflow specification file (YAML, JSON, or JSON5) |
Options:
| Option | Description |
|---|---|
--account <ACCOUNT> | Slurm account to use (required) |
--profile <PROFILE> | HPC profile to use (auto-detected if not specified) |
-o, --output <FILE> | Output file path (prints to stdout if not specified) |
--no-actions | Don't add workflow actions for scheduling nodes |
--force | Overwrite existing schedulers in the workflow |
Generated artifacts:
- Slurm schedulers: One for each unique resource requirement
- Job scheduler assignments: Each job linked to appropriate scheduler
- Workflow actions: on_workflow_start / schedule_nodes actions (unless --no-actions)
Scheduler naming: <resource_requirement_name>_scheduler
Built-in Profiles
NREL Kestrel
Profile name: kestrel
Detection: Environment variable NREL_CLUSTER=kestrel
Partitions:
| Partition | CPUs | Memory | Max Walltime | GPUs | Notes |
|---|---|---|---|---|---|
debug | 104 | 240 GB | 1h | - | Quick testing |
short | 104 | 240 GB | 4h | - | Short jobs |
standard | 104 | 240 GB | 48h | - | General workloads |
long | 104 | 240 GB | 240h | - | Extended jobs |
medmem | 104 | 480 GB | 48h | - | Medium memory |
bigmem | 104 | 2048 GB | 48h | - | High memory |
shared | 104 | 240 GB | 48h | - | Shared node access |
hbw | 104 | 240 GB | 48h | - | High-bandwidth memory, min 10 nodes |
nvme | 104 | 240 GB | 48h | - | NVMe local storage |
gpu-h100 | 2 | 240 GB | 48h | 4x H100 | GPU compute |
Node specifications:
- Standard nodes: 104 cores (2x Intel Xeon Sapphire Rapids), 240 GB RAM
- GPU nodes: 4x NVIDIA H100 80GB HBM3, 128 cores, 2 TB RAM
Configuration
Custom Profiles
Don't see your HPC? Please request built-in support so everyone benefits. See the Custom HPC Profile Tutorial for creating a profile while you wait.
Define custom profiles in your Torc configuration file:
# ~/.config/torc/config.toml
[client.hpc.custom_profiles.mycluster]
display_name = "My Cluster"
description = "Description of the cluster"
detect_env_var = "CLUSTER_NAME=mycluster"
detect_hostname = ".*\\.mycluster\\.org"
default_account = "myproject"
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "compute"
cpus_per_node = 64
memory_mb = 256000
max_walltime_secs = 172800
shared = false
[[client.hpc.custom_profiles.mycluster.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 128000
max_walltime_secs = 86400
gpus_per_node = 4
gpu_type = "A100"
shared = false
Profile Override
Override settings for built-in profiles:
[client.hpc.profile_overrides.kestrel]
default_account = "my_default_account"
Configuration Options
[client.hpc] Section:
| Option | Type | Description |
|---|---|---|
profile_overrides | table | Override settings for built-in profiles |
custom_profiles | table | Define custom HPC profiles |
Profile override options:
| Option | Type | Description |
|---|---|---|
default_account | string | Default Slurm account for this profile |
Custom profile options:
| Option | Type | Required | Description |
|---|---|---|---|
display_name | string | No | Human-readable name |
description | string | No | Profile description |
detect_env_var | string | No | Environment variable for detection (NAME=value) |
detect_hostname | string | No | Regex pattern for hostname detection |
default_account | string | No | Default Slurm account |
partitions | array | Yes | List of partition configurations |
Partition options:
| Option | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Partition name |
cpus_per_node | int | Yes | CPU cores per node |
memory_mb | int | Yes | Memory per node in MB |
max_walltime_secs | int | Yes | Maximum walltime in seconds |
gpus_per_node | int | No | GPUs per node |
gpu_type | string | No | GPU model (e.g., "H100") |
shared | bool | No | Whether partition supports shared jobs |
min_nodes | int | No | Minimum required nodes |
requires_explicit_request | bool | No | Must be explicitly requested |
Resource Matching Algorithm
When generating schedulers, Torc uses this algorithm to match resource requirements to partitions:
1. Filter by resources: Partitions must satisfy:
   - CPUs >= required CPUs
   - Memory >= required memory
   - GPUs >= required GPUs (if specified)
   - Max walltime >= required runtime
2. Exclude debug partitions: Unless no other partition matches
3. Prefer best fit:
   - Partitions that exactly match resource needs
   - Non-shared partitions over shared
   - Shorter max walltime over longer
4. Handle special requirements:
   - GPU jobs only match GPU partitions
   - Respect the requires_explicit_request flag
   - Honor min_nodes constraints
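You can sanity-check these rules against a profile with torc hpc match. Using the built-in kestrel partitions listed above as an example:
# debug (1h max walltime) fails the filter for a 2-hour request; short, standard,
# long, and several others pass. Scheduler generation would then prefer short
# (non-shared, shortest adequate walltime).
torc hpc match kestrel --cpus 16 --memory 32g --walltime 2h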
Generated Scheduler Format
Example generated Slurm scheduler:
slurm_schedulers:
- name: medium_scheduler
account: myproject
nodes: 1
mem: 64g
walltime: 04:00:00
gres: null
partition: null # Let Slurm choose based on resources
Corresponding workflow action:
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: medium_scheduler
scheduler_type: slurm
num_allocations: 1
Runtime Format Parsing
Resource requirements use ISO 8601 duration format for runtime:
| Format | Example | Meaning |
|---|---|---|
PTnH | PT4H | 4 hours |
PTnM | PT30M | 30 minutes |
PTnS | PT3600S | 3600 seconds |
PTnHnM | PT2H30M | 2 hours 30 minutes |
PnDTnH | P1DT12H | 1 day 12 hours |
Generated walltime uses HH:MM:SS format (e.g., 04:00:00).
See Also
- Working with HPC Profiles
- Custom HPC Profile Tutorial
- Working with Slurm
- Resource Requirements
- Configuration Reference
HPC Deployment Reference
Configuration guide for deploying Torc on High-Performance Computing systems.
Overview
Running Torc on HPC systems requires special configuration to ensure:
- Compute nodes can reach the torc-server running on a login node
- The database is stored on a filesystem accessible to all nodes
- Network paths use the correct hostnames for the HPC interconnect
Server Configuration on Login Nodes
Hostname Requirements
On most HPC systems, login nodes have multiple network interfaces:
- External hostname: Used for SSH access from outside (e.g., kl3.hpc.nrel.gov)
- Internal hostname: Used by compute nodes via the high-speed interconnect (e.g., kl3.hsn.cm.kestrel.hpc.nrel.gov)
When running torc-server on a login node, you must use the internal hostname so compute nodes
can connect.
NREL Kestrel Example
On NREL's Kestrel system, login nodes use the High-Speed Network (HSN) for internal communication:
| Login Node | External Hostname | Internal Hostname (for -u flag) |
|---|---|---|
| kl1 | kl1.hpc.nrel.gov | kl1.hsn.cm.kestrel.hpc.nrel.gov |
| kl2 | kl2.hpc.nrel.gov | kl2.hsn.cm.kestrel.hpc.nrel.gov |
| kl3 | kl3.hpc.nrel.gov | kl3.hsn.cm.kestrel.hpc.nrel.gov |
Starting the server:
# On login node kl3, use the internal hostname
torc-server run \
--database /scratch/$USER/torc.db \
-u kl3.hsn.cm.kestrel.hpc.nrel.gov \
--port 8085
Connecting clients:
# Set the API URL using the internal hostname
export TORC_API_URL="http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"
# Now torc commands will use this URL
torc workflows list
Finding the Internal Hostname
If you're unsure of your system's internal hostname, try these approaches:
# Check all network interfaces
hostname -A
# Look for hostnames in the hosts file
grep $(hostname -s) /etc/hosts
# Check Slurm configuration for the control machine
scontrol show config | grep ControlMachine
Consult your HPC system's documentation or support team for the correct internal hostname format.
Database Placement
The SQLite database must be on a filesystem accessible to both:
- The login node running torc-server
- All compute nodes running jobs
Recommended Locations
| Filesystem | Pros | Cons |
|---|---|---|
Scratch (/scratch/$USER/) | Fast, shared, high capacity | May be purged periodically |
Project (/projects/) | Persistent, shared | May have quotas |
Home (~) | Persistent | Often slow, limited space |
Best practice: Use scratch for active workflows, backup completed workflows to project storage.
# Create a dedicated directory
mkdir -p /scratch/$USER/torc
# Start server with scratch database
torc-server run \
--database /scratch/$USER/torc/workflows.db \
-u $(hostname -s).hsn.cm.kestrel.hpc.nrel.gov \
--port 8085
Database Backup
For long-running workflows, periodically backup the database:
# SQLite backup (safe while server is running)
sqlite3 /scratch/$USER/torc.db ".backup /projects/$USER/torc_backup.db"
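To automate this, a periodic backup from a login-node crontab is one option (the schedule and paths here are illustrative):
# Illustrative crontab entry: hourly backup of the active database to project storage
0 * * * * sqlite3 /scratch/myuser/torc.db ".backup /projects/myproject/torc_backup.db"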
Port Selection
Login nodes are shared resources. To avoid conflicts:
- Use a non-default port: Choose a port in the range 8000-9999
- Check for conflicts: lsof -i :8085
- Consider using your UID: --port $((8000 + UID % 1000))
# Use a unique port based on your user ID
MY_PORT=$((8000 + $(id -u) % 1000))
torc-server run \
--database /scratch/$USER/torc.db \
-u kl3.hsn.cm.kestrel.hpc.nrel.gov \
--port $MY_PORT
Running in tmux/screen
Always run torc-server in a terminal multiplexer to prevent loss on disconnect:
# Start a tmux session
tmux new -s torc
# Start the server
torc-server run \
--database /scratch/$USER/torc.db \
-u kl3.hsn.cm.kestrel.hpc.nrel.gov \
--port 8085
# Detach with Ctrl+b, then d
# Reattach later with: tmux attach -t torc
Complete Configuration Example
Server Configuration File
Create ~/.config/torc/config.toml:
[server]
# Use internal hostname for compute node access
url = "kl3.hsn.cm.kestrel.hpc.nrel.gov"
port = 8085
database = "/scratch/myuser/torc/workflows.db"
threads = 4
completion_check_interval_secs = 30.0
log_level = "info"
[server.logging]
log_dir = "/scratch/myuser/torc/logs"
Client Configuration File
Create ~/.config/torc/config.toml (or add to existing):
[client]
# Match the server's internal hostname and port
api_url = "http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"
format = "table"
[client.run]
output_dir = "/scratch/myuser/torc/output"
Environment Variables
Alternatively, set environment variables in your shell profile:
# Add to ~/.bashrc or ~/.bash_profile
export TORC_API_URL="http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"
export TORC_CLIENT__RUN__OUTPUT_DIR="/scratch/$USER/torc/output"
Slurm Job Runner Configuration
When submitting workflows to Slurm, the job runners on compute nodes need to reach the server. The
TORC_API_URL is automatically passed to Slurm jobs.
Verify connectivity from a compute node:
# Submit an interactive job
salloc -N 1 -t 00:10:00
# Test connectivity to the server
curl -s "$TORC_API_URL/workflows" | head
# Exit the allocation
exit
Troubleshooting
"Connection refused" from compute nodes
1. Verify the server is using the internal hostname:
   torc-server run -u <internal-hostname> --port 8085
2. Check the server is listening on all interfaces:
   netstat -tlnp | grep 8085
3. Verify no firewall blocks the port:
   # From a compute node
   nc -zv <internal-hostname> 8085
Database locked errors
SQLite may report locking issues on network filesystems:
- Ensure only one torc-server instance is running
- Use a local scratch filesystem rather than NFS home directories
- Consider increasing completion_check_interval_secs to reduce database contention
Server stops when SSH disconnects
Always use tmux or screen (see above). If the server dies unexpectedly:
# Check if the server is still running
pgrep -f torc-server
# Check server logs
tail -100 /scratch/$USER/torc/logs/torc-server*.log
See Also
Creating a Custom HPC Profile
This tutorial walks you through creating a custom HPC profile for a cluster that Torc doesn't have built-in support for.
Before You Start
Request Built-in Support First!
If your HPC system is widely used, consider requesting that Torc developers add it as a built-in profile. This benefits everyone using that system.
Open an issue at github.com/NREL/torc/issues with:
- Your HPC system name and organization
- Partition names and their resource limits (CPUs, memory, walltime, GPUs)
- How to detect the system (environment variable or hostname pattern)
- Any special requirements (minimum nodes, exclusive partitions, etc.)
Built-in profiles are maintained by the Torc team and stay up-to-date as systems change.
When to Create a Custom Profile
Create a custom profile when:
- Your HPC isn't supported and you need to use it immediately
- You have a private or internal cluster
- You want to test profile configurations before submitting upstream
Quick Start: Auto-Generate from Slurm
If you're on a Slurm cluster, you can automatically generate a profile from the cluster configuration:
# Generate profile from current Slurm cluster
torc hpc generate
# Specify a custom name
torc hpc generate --name mycluster --display-name "My Research Cluster"
# Skip standby/preemptible partitions
torc hpc generate --skip-stdby
# Save to a file
torc hpc generate --skip-stdby -o mycluster-profile.toml
This queries sinfo and scontrol to extract:
- Partition names, CPUs, memory, and time limits
- GPU configuration from GRES
- Node sharing settings
- Hostname-based detection pattern
The generated profile can be added directly to your config file. You may want to review and adjust:
- requires_explicit_request: Set to true for partitions that shouldn't be auto-selected
- description: Add human-readable descriptions for each partition
After generation, skip to Step 4: Verify the Profile.
Manual Profile Creation
If automatic generation isn't available or you need more control, follow these steps.
Step 1: Gather Partition Information
Collect information about your HPC's partitions. On most Slurm systems:
# List all partitions
sinfo -s
# Get detailed partition info
sinfo -o "%P %c %m %l %G"
For this tutorial, let's say your cluster "ResearchCluster" has these partitions:
| Partition | CPUs/Node | Memory | Max Walltime | GPUs |
|---|---|---|---|---|
batch | 48 | 192 GB | 72 hours | - |
short | 48 | 192 GB | 4 hours | - |
gpu | 32 | 256 GB | 48 hours | 4x A100 |
himem | 48 | 1024 GB | 48 hours | - |
Step 2: Identify Detection Method
Determine how Torc can detect when you're on this system. Common methods:
Environment variable (most common):
echo $CLUSTER_NAME # e.g., "research"
echo $SLURM_CLUSTER # e.g., "researchcluster"
Hostname pattern:
hostname # e.g., "login01.research.edu"
For this tutorial, we'll use the environment variable CLUSTER_NAME=research.
Step 3: Create the Configuration File
Create or edit your Torc configuration file:
# Linux
mkdir -p ~/.config/torc
nano ~/.config/torc/config.toml
# macOS
mkdir -p ~/Library/Application\ Support/torc
nano ~/Library/Application\ Support/torc/config.toml
Add your custom profile:
# Custom HPC Profile for ResearchCluster
[client.hpc.custom_profiles.research]
display_name = "Research Cluster"
description = "University Research HPC System"
detect_env_var = "CLUSTER_NAME=research"
default_account = "my_project"
# Batch partition - general purpose
[[client.hpc.custom_profiles.research.partitions]]
name = "batch"
cpus_per_node = 48
memory_mb = 192000 # 192 GB in MB
max_walltime_secs = 259200 # 72 hours in seconds
shared = false
# Short partition - quick jobs
[[client.hpc.custom_profiles.research.partitions]]
name = "short"
cpus_per_node = 48
memory_mb = 192000
max_walltime_secs = 14400 # 4 hours
shared = true # Allows sharing nodes
# GPU partition
[[client.hpc.custom_profiles.research.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 256000 # 256 GB
max_walltime_secs = 172800 # 48 hours
gpus_per_node = 4
gpu_type = "A100"
shared = false
# High memory partition
[[client.hpc.custom_profiles.research.partitions]]
name = "himem"
cpus_per_node = 48
memory_mb = 1048576 # 1024 GB (1 TB)
max_walltime_secs = 172800 # 48 hours
shared = false
Step 4: Verify the Profile
Check that Torc recognizes your profile:
# List all profiles
torc hpc list
You should see your custom profile:
Known HPC profiles:
╭──────────┬──────────────────┬────────────┬──────────╮
│ Name │ Display Name │ Partitions │ Detected │
├──────────┼──────────────────┼────────────┼──────────┤
│ kestrel │ NREL Kestrel │ 15 │ │
│ research │ Research Cluster │ 4 │ ✓ │
╰──────────┴──────────────────┴────────────┴──────────╯
View the partitions:
torc hpc partitions research
Partitions for research:
╭─────────┬───────────┬───────────┬─────────────┬──────────╮
│ Name │ CPUs/Node │ Mem/Node │ Max Walltime│ GPUs │
├─────────┼───────────┼───────────┼─────────────┼──────────┤
│ batch │ 48 │ 192 GB │ 72h │ - │
│ short │ 48 │ 192 GB │ 4h │ - │
│ gpu │ 32 │ 256 GB │ 48h │ 4 (A100) │
│ himem │ 48 │ 1024 GB │ 48h │ - │
╰─────────┴───────────┴───────────┴─────────────┴──────────╯
Step 5: Test Partition Matching
Verify that Torc correctly matches resource requirements to partitions:
# Should match 'short' partition
torc hpc match research --cpus 8 --memory 16g --walltime 2h
# Should match 'gpu' partition
torc hpc match research --cpus 16 --memory 64g --walltime 8h --gpus 2
# Should match 'himem' partition
torc hpc match research --cpus 24 --memory 512g --walltime 24h
Step 6: Test Scheduler Generation
Create a test workflow to verify scheduler generation:
# test_workflow.yaml
name: profile_test
description: Test custom HPC profile
resource_requirements:
- name: standard
num_cpus: 16
memory: 64g
runtime: PT2H
- name: gpu_compute
num_cpus: 16
num_gpus: 2
memory: 128g
runtime: PT8H
jobs:
- name: preprocess
command: echo "preprocessing"
resource_requirements: standard
- name: train
command: echo "training"
resource_requirements: gpu_compute
depends_on: [preprocess]
Generate schedulers:
torc slurm generate --account my_project --profile research test_workflow.yaml
You should see the generated workflow with appropriate schedulers for each partition.
Step 7: Use Your Profile
Now you can submit workflows using your custom profile:
# Auto-detect the profile (if on the cluster)
torc submit-slurm --account my_project workflow.yaml
# Or explicitly specify the profile
torc submit-slurm --account my_project --hpc-profile research workflow.yaml
Advanced Configuration
Hostname-Based Detection
If your cluster doesn't set a unique environment variable, use hostname detection:
[client.hpc.custom_profiles.research]
display_name = "Research Cluster"
detect_hostname = ".*\\.research\\.edu" # Regex pattern
Minimum Node Requirements
Some partitions require a minimum number of nodes:
[[client.hpc.custom_profiles.research.partitions]]
name = "large_scale"
cpus_per_node = 128
memory_mb = 512000
max_walltime_secs = 172800
min_nodes = 16 # Must request at least 16 nodes
Explicit Request Partitions
Some partitions shouldn't be auto-selected:
[[client.hpc.custom_profiles.research.partitions]]
name = "priority"
cpus_per_node = 48
memory_mb = 192000
max_walltime_secs = 86400
requires_explicit_request = true # Only used when explicitly requested
Troubleshooting
Profile Not Detected
If torc hpc detect doesn't find your profile:
1. Check the environment variable or hostname:
   echo $CLUSTER_NAME
   hostname
2. Verify the detection pattern in your config matches exactly
3. Test with explicit profile specification:
   torc hpc show research
No Partition Found for Job
If torc slurm generate can't find a matching partition:
1. Check whether any partition satisfies all requirements:
   torc hpc match research --cpus 32 --memory 128g --walltime 8h
2. Verify memory is specified in MB in the config (not GB); a quick conversion check follows this list
3. Verify walltime is in seconds (not hours)
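A quick shell check of the unit conversions used in the config (illustrative arithmetic only):
# 192 GB expressed in MB, matching the batch partition example above
echo $(( 192 * 1000 ))    # 192000
# 72 hours expressed in seconds
echo $(( 72 * 3600 ))     # 259200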
Configuration File Location
Torc looks for config files in these locations:
- Linux: ~/.config/torc/config.toml
- macOS: ~/Library/Application Support/torc/config.toml
- Windows: %APPDATA%\torc\config.toml
You can also use the TORC_CONFIG environment variable to specify a custom path.
Contributing Your Profile
If your HPC is used by others, please contribute it upstream:
- Fork the Torc repository
- Add your profile to src/client/hpc_profiles.rs
- Add tests for your profile
- Submit a pull request
Or simply open an issue with your partition information and we'll add it for you.
See Also
- Working with HPC Profiles - General HPC profile usage
- HPC Profiles Reference - Complete configuration options
- Slurm Workflows - Simplified Slurm approach
Remote Workers
Running Torc jobs on remote machines.
- Setting Up Remote Workers - Configuring remote worker nodes
Remote Workers
Run workflows across multiple machines via SSH without requiring an HPC scheduler.
Overview
Torc supports three execution modes:
- Local (torc run) - Jobs run on the current machine
- HPC (torc submit-slurm) - Jobs run on Slurm-allocated nodes
- Remote Workers (torc remote run) - Jobs run on SSH-accessible machines
Remote workers are ideal for:
- Ad-hoc clusters of workstations or cloud VMs
- Environments without a scheduler
- Testing distributed workflows before HPC deployment
Worker File Format
Create a text file listing remote machines:
# Lines starting with # are comments
# Format: [user@]hostname[:port]
# Simple hostname
worker1.example.com
# With username
alice@worker2.example.com
# With custom SSH port
admin@192.168.1.10:2222
# IPv4 address
10.0.0.5
# IPv6 address (must be in brackets for port specification)
[2001:db8::1]
[::1]:2222
Each host can only appear once. Duplicate hosts will cause an error.
Worker Management
Workers are stored in the database and persist across command invocations. This means you only need to specify workers once, and subsequent commands can reference them by workflow ID.
Add Workers
torc remote add-workers <workflow-id> <worker>...
Add one or more workers directly on the command line:
torc remote add-workers 42 worker1.example.com alice@worker2.example.com admin@192.168.1.10:2222
Add Workers from File
torc remote add-workers-from-file <worker-file> [workflow-id]
Example:
torc remote add-workers-from-file workers.txt 42
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
List Workers
torc remote list-workers [workflow-id]
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Remove a Worker
torc remote remove-worker <worker> [workflow-id]
Example:
torc remote remove-worker worker1.example.com 42
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Commands
Start Workers
torc remote run [workflow-id] [options]
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Workers are fetched from the database. If you want to add workers from a file at the same time:
torc remote run <workflow-id> --workers <worker-file> [options]
Options:
| Option | Default | Description |
|---|---|---|
--workers | none | Worker file to add before starting |
-o, --output-dir | torc_output | Output directory on remote machines |
--max-parallel-ssh | 10 | Maximum parallel SSH connections |
-p, --poll-interval | 5.0 | How often workers poll for jobs (seconds) |
--max-parallel-jobs | auto | Maximum parallel jobs per worker |
--num-cpus | auto | CPUs per worker (auto-detected if not specified) |
--memory-gb | auto | Memory per worker (auto-detected if not specified) |
--num-gpus | auto | GPUs per worker (auto-detected if not specified) |
--skip-version-check | false | Skip version verification (not recommended) |
Example:
# First time: add workers and start
torc remote run 42 --workers workers.txt \
--output-dir /data/torc_output \
--poll-interval 10
# Subsequent runs: workers already in database
torc remote run 42 --output-dir /data/torc_output
Check Status
torc remote status [workflow-id] [options]
Shows which workers are still running. Workers are fetched from the database. If workflow-id is
omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
--max-parallel-ssh | 10 | Maximum parallel SSH connections |
Stop Workers
torc remote stop [workflow-id] [options]
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
--force | false | Send SIGKILL instead of SIGTERM |
--max-parallel-ssh | 10 | Maximum parallel SSH connections |
Collect Logs
torc remote collect-logs [workflow-id] [options]
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
-l, --local-output-dir | remote_logs | Local directory for collected logs |
--remote-output-dir | torc_output | Remote output directory |
--delete | false | Delete remote logs after successful collection |
--max-parallel-ssh | 10 | Maximum parallel SSH connections |
Example with deletion:
# Collect logs and clean up remote workers
torc remote collect-logs 42 --delete
Delete Logs
torc remote delete-logs [workflow-id] [options]
Delete the output directory from all remote workers without collecting logs first. Use
collect-logs --delete if you want to save logs before deleting.
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
--remote-output-dir | torc_output | Remote output directory |
--max-parallel-ssh | 10 | Maximum parallel SSH connections |
Typical Workflow
1. Create a workflow:
   torc workflows create my_workflow.yaml
2. Add workers:
   # From the command line
   torc remote add-workers 42 worker1.example.com worker2.example.com
   # Or from a file
   torc remote add-workers-from-file workers.txt 42
3. Start workers:
   torc remote run 42
4. Monitor status:
   torc remote status 42
5. Collect logs when complete:
   torc remote collect-logs 42 -l ./logs
Or combine steps 2 and 3:
torc remote run 42 --workers workers.txt
How It Works
- Version Check: Verifies all remote machines have the same torc version
- Worker Start: Uses nohup to start detached workers that survive SSH disconnection
- Job Execution: Each worker polls the server for available jobs and executes them locally
- Completion: Workers exit when the workflow is complete or canceled
The server coordinates job distribution. Multiple workers can safely poll the same workflow without double-allocating jobs.
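Because job claiming is coordinated by the server, you can usually add capacity while a workflow is still active. A minimal sketch using the commands above (assumes the workflow has not finished and the new host passes the version check):
# Register an additional worker and start it against the same workflow
torc remote add-workers 42 worker3.example.com
torc remote run 42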
SSH Configuration
Workers connect using SSH with these options:
- ConnectTimeout=30 - 30 second connection timeout
- BatchMode=yes - No password prompts (requires key-based auth)
- StrictHostKeyChecking=accept-new - Accept new host keys automatically
For custom SSH configuration, use ~/.ssh/config on the local machine:
Host worker1
HostName worker1.example.com
User alice
Port 2222
IdentityFile ~/.ssh/worker_key
Then reference the alias in your worker file:
worker1
worker2
worker3
Resource Monitoring
If your workflow has resource monitoring enabled, each worker collects utilization data:
name: my_workflow
resource_monitor_config:
enabled: true
granularity: time_series
sample_interval_seconds: 5
The collect-logs command retrieves these databases along with job logs.
Troubleshooting
No Workers Configured
No workers configured for workflow 42. Use 'torc remote add-workers' or '--workers' flag.
Add workers to the workflow using torc remote add-workers or the --workers flag on run.
Version Mismatch
Error: Version check failed on 2 worker(s):
worker1: Version mismatch: local=0.7.0, worker1=0.6.5
worker2: Version mismatch: local=0.7.0, worker2=0.6.5
Install the same torc version on all machines, or use --skip-version-check (not recommended for
production).
SSH Connection Failed
Error: SSH connectivity check failed for 1 worker(s):
worker1: SSH connection failed to worker1: Permission denied (publickey)
Verify SSH key-based authentication works:
ssh worker1.example.com true
Worker Died Immediately
[FAILED] worker1: Process died immediately. Last log:
Error: connection refused...
The worker couldn't connect to the server. Check:
- Server is accessible from the remote machine
- Firewall allows connections on the server port
- The --url points to the correct server address
Workers Not Claiming Jobs
If workers start but don't claim jobs:
- Check the workflow is initialized: torc workflows status <id>
- Check jobs are ready: torc jobs list <id>
- Check resource requirements match available resources
Comparison with Slurm
| Feature | Remote Workers | Slurm |
|---|---|---|
| Scheduler required | No | Yes |
| Resource allocation | Manual (worker file) | Automatic |
| Fault tolerance | Limited | Full (job requeue) |
| Walltime limits | No | Yes |
| Priority/queuing | No | Yes |
| Best for | Ad-hoc clusters, testing | Production HPC |
Security Considerations
- Workers authenticate to the torc server (if authentication is enabled)
- SSH keys should be properly secured
- Workers run with the permissions of the SSH user on each machine
- The torc server URL is passed to workers and visible in process lists
Fault Tolerance & Recovery
Handling failures and recovering workflows automatically.
- Automatic Failure Recovery - Automatic retry and resource adjustment
- Configurable Failure Handlers - Per-job retry logic based on exit codes
- AI-Assisted Recovery - Intelligent error classification with AI agents
- Job Checkpointing - Saving and restoring job state
Automatic Failure Recovery
This guide explains how to use Torc's automatic recovery features to handle workflow failures without manual intervention.
Overview
Torc provides automatic failure recovery through two commands:
- torc recover - One-shot recovery for Slurm workflows
- torc watch --recover - Continuous monitoring with automatic recovery
When jobs fail, the system:
- Diagnoses the failure cause (OOM, timeout, or unknown)
- Applies heuristics to adjust resource requirements
- Resets failed jobs and submits new Slurm allocations
- (watch only) Resumes monitoring until completion or max retries
This deterministic approach handles the majority of HPC failures without human intervention.
Why Deterministic Recovery?
Most HPC job failures fall into predictable categories:
| Failure Type | Frequency | Solution |
|---|---|---|
| Out of Memory | ~60% | Increase memory allocation |
| Timeout | ~25% | Increase runtime limit |
| Transient errors | ~10% | Simple retry |
| Code bugs | ~5% | Manual intervention |
For 85-90% of failures, the solution is mechanical: increase resources and retry. This doesn't require AI judgment—simple heuristics work well.
Recovery Architecture
flowchart LR
A[torc watch<br/>polling] --> B{Workflow<br/>complete?}
B -->|No| A
B -->|Yes, with failures| C[Diagnose failures<br/>check resources]
C --> D[Apply heuristics<br/>adjust resources]
D --> E[Submit new<br/>allocations]
E --> A
B -->|Yes, success| F[Exit 0]
style A fill:#4a9eff,color:#fff
style B fill:#6c757d,color:#fff
style C fill:#ffc107,color:#000
style D fill:#ffc107,color:#000
style E fill:#28a745,color:#fff
style F fill:#28a745,color:#fff
Failure Detection
Torc tracks resource usage during job execution:
- Memory usage (RSS and peak)
- CPU utilization
- Execution time
This data is analyzed to determine failure causes:
OOM Detection:
- Peak memory exceeds specified limit
- Exit code 137 (SIGKILL from OOM killer)
- Flag: likely_oom: true
Timeout Detection:
- Execution time within 10% of runtime limit
- Job was killed (not graceful exit)
- Flag: likely_timeout: true
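These flags surface in the job results, so they can be checked directly. A hedged sketch, assuming the debug report exposes likely_oom on each result along with a job name field (only .results[] and the flag names are taken from this guide; the other field names are assumptions):
# List jobs the diagnostics flagged as probable OOM kills (.job_name is assumed)
torc reports results <workflow_id> > report.json
jq -r '.results[] | select(.likely_oom == true) | .job_name' report.json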
Recovery Heuristics
| Failure Type | Detection | Default Action |
|---|---|---|
| Out of Memory | Peak memory > limit, exit code 137 | Increase memory by 1.5x |
| Timeout | Execution time near limit | Increase runtime by 1.5x |
| Unknown | Other exit codes | Skip (likely bug) |
Note: By default, jobs with unknown failure causes are not retried, since they likely have script or data bugs that won't be fixed by retrying. Use
--retry-unknown to also retry these jobs (e.g., to handle transient errors like network issues).
The torc recover Command
For one-shot recovery when a workflow has failed:
# Preview what would be done (recommended first step)
torc recover 42 --dry-run
# Execute the recovery
torc recover 42
This command:
- Detects and cleans up orphaned jobs from terminated Slurm allocations
- Checks that the workflow is complete and no workers are active
- Diagnoses failure causes (OOM, timeout, etc.)
- Adjusts resource requirements based on heuristics
- Runs optional recovery hook for custom logic
- Resets failed jobs and regenerates Slurm schedulers
- Submits new allocations
Note: Step 1 (orphan cleanup) handles the case where Slurm terminated an allocation unexpectedly, leaving jobs stuck in "running" status. This is done automatically before checking preconditions.
Options
torc recover <workflow_id> \
--memory-multiplier 1.5 \ # Memory increase factor for OOM (default: 1.5)
--runtime-multiplier 1.4 \ # Runtime increase factor for timeout (default: 1.4)
--retry-unknown \ # Also retry jobs with unknown failure causes
--recovery-hook "bash fix.sh" \ # Custom script for unknown failures
--dry-run # Preview without making changes
Example Output
Diagnosing failures...
Applying recovery heuristics...
Job 107 (train_model): OOM detected, increasing memory 8g -> 12g
Applied fixes: 1 OOM, 0 timeout
Resetting 1 job(s) for retry...
Reset 1 job(s)
Reinitializing workflow...
Regenerating Slurm schedulers...
Submitted Slurm allocation with 1 job
Recovery complete for workflow 42
- 1 job(s) had memory increased
Reset 1 job(s). Slurm schedulers regenerated and submitted.
The torc watch --recover Command
The torc watch command can automatically recover from common failures:
torc watch 42 --recover
This will:
- Poll the workflow until completion
- On failure, diagnose the cause (OOM, timeout, etc.)
- Adjust resource requirements based on heuristics
- Reset failed jobs and submit new Slurm allocations
- Resume monitoring
- Repeat until success or max retries exceeded
Options
torc watch <workflow_id> \
-r \ # Enable automatic recovery (--recover)
-m 3 \ # Maximum recovery attempts (--max-retries)
--memory-multiplier 1.5 \ # Memory increase factor for OOM
--runtime-multiplier 1.5 \ # Runtime increase factor for timeout
--retry-unknown \ # Also retry jobs with unknown failures
--recovery-hook "bash fix.sh" \ # Custom recovery script
-p 60 \ # Seconds between status checks (--poll-interval)
-o output \ # Directory for job output files (--output-dir)
-s \ # Display job counts during polling (--show-job-counts)
--auto-schedule \ # Automatically schedule nodes for stranded jobs
--auto-schedule-threshold 5 \ # Min retry jobs before scheduling (default: 5)
--auto-schedule-cooldown 1800 \ # Seconds between auto-schedule attempts (default: 1800)
--auto-schedule-stranded-timeout 7200 # Schedule stranded jobs after this time (default: 7200)
Custom Recovery Hooks
For failures that torc can't handle automatically (not OOM or timeout), you can provide a custom
recovery script using --recovery-hook. This is useful for domain-specific recovery logic, such as
adjusting Apache Spark cluster sizes or fixing configuration issues.
torc watch 42 --recover --recovery-hook "bash fix-spark-cluster.sh"
The hook receives the workflow ID in two ways:
- As an argument: bash fix-spark-cluster.sh 42
- As an environment variable: TORC_WORKFLOW_ID=42
Your script can use torc CLI commands to query and modify the workflow:
#!/bin/bash
# fix-spark-cluster.sh - Example recovery hook for Spark jobs
WORKFLOW_ID=$1 # or use $TORC_WORKFLOW_ID
# Find failed jobs
FAILED_JOBS=$(torc jobs list $WORKFLOW_ID --status failed -f json | jq -r '.[].id')
for JOB_ID in $FAILED_JOBS; do
# Get current resource requirements
JOB_INFO=$(torc jobs get $JOB_ID -f json)
RR_ID=$(echo "$JOB_INFO" | jq -r '.resource_requirements_id')
# Check if this is a Spark job that needs more nodes
# (your logic here - parse logs, check error messages, etc.)
# Update resource requirements
torc resource-requirements update $RR_ID --num-nodes 16
echo "Updated job $JOB_ID to use 16 nodes"
done
When a recovery hook is provided:
- Jobs with unknown failures are automatically included for retry
- The hook runs before reset-status is called
- If the hook fails (non-zero exit), auto-recovery stops with an error
- After the hook succeeds, failed jobs are reset and retried
Auto-Scheduling for Failure Handlers
When using failure handlers that create retry jobs, the originally planned
compute capacity may not be sufficient. The --auto-schedule option enables automatic scheduling of
additional Slurm nodes when:
- No schedulers available: If there are ready jobs but no active or pending Slurm allocations, new schedulers are immediately regenerated and submitted.
- Retry jobs accumulating: If there are active schedulers but retry jobs (jobs with attempt_id > 1) are accumulating beyond the threshold, additional schedulers are submitted after the cooldown period.
This is particularly useful for workflows with failure handlers that retry failed jobs, ensuring those retries get scheduled without manual intervention.
Example: Failure Handler with Auto-Scheduling
# Submit a workflow with failure handlers
torc submit-slurm --account my_project workflow.yaml
# Watch with auto-scheduling enabled (uses defaults)
torc watch $WORKFLOW_ID --auto-schedule
With default settings:
- If all Slurm allocations complete but retry jobs remain, new allocations are submitted
- If 5+ retry jobs accumulate while allocations are running, additional capacity is scheduled
- After scheduling, the system waits 30 minutes before considering another auto-schedule
- If fewer than 5 retry jobs are waiting for 2 hours, they're scheduled anyway (stranded timeout)
Choosing the Right Command
| Use Case | Command |
|---|---|
| One-shot recovery after failure | torc recover |
| Continuous monitoring | torc watch -r |
| Preview what recovery would do | torc recover --dry-run |
| Production long-running workflows | torc watch -r |
| Manual investigation, then retry | torc recover |
Complete Workflow Example
1. Submit a Workflow
torc submit-slurm --account myproject workflow.yaml
Output:
Created workflow 42 with 100 jobs
Submitted to Slurm with 10 allocations
2. Start Watching with Auto-Recovery
torc watch 42 --recover --max-retries 3 --show-job-counts
Note: The --show-job-counts flag is optional. Without it, the command polls silently until completion, which reduces server load for large workflows.
Output:
Watching workflow 42 (poll interval: 60s, recover enabled, max retries: 3, job counts enabled)
completed=0, running=10, pending=0, failed=0, blocked=90
completed=25, running=10, pending=0, failed=0, blocked=65
...
completed=95, running=0, pending=0, failed=5, blocked=0
Workflow 42 is complete
Workflow completed with failures:
- Failed: 5
- Canceled: 0
- Terminated: 0
- Completed: 95
Attempting automatic recovery (attempt 1/3)
Diagnosing failures...
Applying recovery heuristics...
Job 107 (train_model_7): OOM detected, increasing memory 8g -> 12g
Job 112 (train_model_12): OOM detected, increasing memory 8g -> 12g
Job 123 (train_model_23): OOM detected, increasing memory 8g -> 12g
Job 131 (train_model_31): OOM detected, increasing memory 8g -> 12g
Job 145 (train_model_45): OOM detected, increasing memory 8g -> 12g
Applied fixes: 5 OOM, 0 timeout
Resetting failed jobs...
Regenerating Slurm schedulers and submitting...
Recovery initiated. Resuming monitoring...
Watching workflow 42 (poll interval: 60s, recover enabled, max retries: 3, job counts enabled)
completed=95, running=5, pending=0, failed=0, blocked=0
...
Workflow 42 is complete
Workflow completed successfully (100 jobs)
3. If No Recoverable Jobs Found
If all failures are from unknown causes (not OOM or timeout):
Applying recovery heuristics...
2 job(s) with unknown failure cause (skipped, use --retry-unknown to include)
No recoverable jobs found. 2 job(s) failed with unknown causes.
Use --retry-unknown to retry jobs with unknown failure causes.
Or use the Torc MCP server with your AI assistant to investigate.
This prevents wasting allocation time on jobs that likely have script or data bugs.
4. If Max Retries Exceeded
If failures persist after max retries:
Max retries (3) exceeded. Manual intervention required.
Use the Torc MCP server with your AI assistant to investigate.
At this point, you can use the MCP server with an AI assistant to investigate the root cause.
Log Files
All torc watch output is logged to both the terminal and a log file:
<output-dir>/watch_<hostname>_<workflow_id>.log
For example: output/watch_myhost_42.log
This ensures you have a complete record of the watch session even if your terminal disconnects.
When to Use Manual Recovery
Automatic recovery works well for resource-related failures, but some situations require manual intervention:
Use Manual Recovery When:
- Jobs keep failing after max retries
  - The heuristics aren't solving the problem
  - Need to investigate root cause
- Unknown failure modes
  - Exit codes that don't indicate OOM/timeout
  - Application-specific errors
- Code bugs
  - Jobs fail consistently with same error
  - No resource issue detected
- Cost optimization
  - Want to analyze actual usage before increasing
  - Need to decide whether the job is worth more resources
MCP Server for Manual Recovery
The Torc MCP server provides tools for AI-assisted investigation:
| Tool | Purpose |
|---|---|
| get_workflow_status | Get overall workflow status |
| list_failed_jobs | List failed jobs with error info |
| get_job_logs | Read stdout/stderr logs |
| check_resource_utilization | Detailed resource analysis |
| update_job_resources | Manually adjust resources |
| resubmit_workflow | Regenerate Slurm schedulers |
Best Practices
1. Start with Conservative Resources
Set initial resource requests lower and let auto-recovery increase them:
- Jobs that succeed keep their original allocation
- Only failing jobs get increased resources
- Avoids wasting HPC resources on over-provisioned jobs
2. Set Reasonable Max Retries
--max-retries 3 # Good for most workflows
Too many retries can waste allocation time on jobs that will never succeed.
3. Use Appropriate Multipliers
For memory-bound jobs:
--memory-multiplier 2.0 # Double on OOM
For time-sensitive jobs where you want larger increases:
--runtime-multiplier 2.0 # Double runtime on timeout
4. Run in tmux or screen
Always run torc watch inside tmux or screen for long-running workflows. HPC workflows can run
for hours or days, and you don't want to lose your monitoring session if:
- Your SSH connection drops
- Your laptop goes to sleep
- You need to disconnect and reconnect later
Using tmux (recommended):
# Start a new tmux session
tmux new -s torc-watch
# Run the watch command
torc watch 42 --recover --poll-interval 300 --show-job-counts
# Detach from session: press Ctrl+b, then d
# Reattach later: tmux attach -t torc-watch
Using screen:
screen -S torc-watch
torc watch 42 --recover --poll-interval 300 --show-job-counts
# Detach: Ctrl+a, then d
# Reattach: screen -r torc-watch
5. Check Resource Utilization Afterward
After completion, review actual usage:
torc reports check-resource-utilization 42
This helps tune future job specifications.
Troubleshooting
Jobs Stuck in "Running" Status
If jobs appear stuck in "running" status after a Slurm allocation ended:
- This usually means the allocation was terminated unexpectedly (timeout, node failure, etc.)
- The torc recover command automatically handles this as its first step
- To manually clean up without triggering recovery, use: torc workflows sync-status <workflow_id>
- To preview what would be cleaned up: torc workflows sync-status <workflow_id> --dry-run
See Debugging Slurm Workflows for more details.
Jobs Keep Failing After Recovery
If jobs fail repeatedly with the same error:
- Check if the error is resource-related (OOM/timeout)
- Review job logs: torc jobs logs <job_id>
- Check if there's a code bug
- Use MCP server with AI assistant to investigate
No Slurm Schedulers Generated
If slurm regenerate fails:
- Ensure the workflow was created with the --account option
- Check that the HPC profile is detected: torc hpc detect
- Specify the profile explicitly: --profile kestrel
Resource Limits Too High
If jobs are requesting more resources than partitions allow:
- Check partition limits: torc hpc partitions <profile>
- Use smaller multipliers
- Consider splitting jobs into smaller pieces
Comparison: Automatic vs Manual Recovery
| Feature | Automatic | Manual/AI-Assisted |
|---|---|---|
| Human involvement | None | Interactive |
| Speed | Fast | Depends on human |
| Handles OOM/timeout | Yes | Yes |
| Handles unknown errors | Retry only | Full investigation |
| Cost optimization | Basic | Can be sophisticated |
| Use case | Production workflows | Debugging, optimization |
Implementation Details
The Watch Command Flow
- Poll the is_workflow_complete API
- Print status updates
- On completion, check for failures
- If failures and recover enabled:
  - Run torc reports check-resource-utilization --include-failed
  - Parse results for likely_oom and likely_timeout flags
  - Update resource requirements via API
  - Run torc workflows reset-status --failed-only --reinitialize
  - Run torc slurm regenerate --submit
  - Increment retry counter
  - Resume polling
- Exit 0 on success, exit 1 on max retries exceeded
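Pictured as code, that flow is roughly the loop below. The workflow_is_complete and has_failures helpers are stand-ins for Torc's internal API calls; only the quoted CLI commands come from the steps above, and their exact arguments may differ.

# Rough sketch of the torc watch --recover loop described above; not the real
# implementation. The two helper functions stand in for Torc API calls.
import subprocess
import time

def workflow_is_complete(workflow_id: int) -> bool:
    raise NotImplementedError("stand-in for the is_workflow_complete API")

def has_failures(workflow_id: int) -> bool:
    raise NotImplementedError("stand-in for checking failed-job counts")

def run(cmd: str) -> None:
    subprocess.run(cmd, shell=True, check=True)

def watch_with_recover(workflow_id: int, max_retries: int = 3, poll: int = 60) -> int:
    retries = 0
    while True:
        while not workflow_is_complete(workflow_id):
            time.sleep(poll)                      # poll until the workflow completes
        if not has_failures(workflow_id):
            return 0                              # success
        if retries >= max_retries:
            return 1                              # max retries exceeded
        run(f"torc reports check-resource-utilization {workflow_id} --include-failed")
        # ...parse likely_oom / likely_timeout flags and adjust resources here...
        run(f"torc workflows reset-status {workflow_id} --failed-only --reinitialize")
        run("torc slurm regenerate --submit")
        retries += 1                              # then resume polling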
The Regenerate Command Flow
- Query jobs with status uninitialized/ready/blocked
- Group by resource requirements
- For each group:
  - Find best partition using HPC profile
  - Calculate jobs per node
  - Determine number of allocations needed
  - Create scheduler config
  - Update jobs with new scheduler reference
- Submit allocations via sbatch
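To make the per-group arithmetic concrete, a much simplified version of the grouping and allocation count could look like the sketch below; the 104-CPU node size and the field names are assumptions, not Torc's actual scheduler logic.

# Simplified illustration of the regenerate flow: group pending jobs by their
# resource requirements, then estimate how many allocations each group needs.
from collections import defaultdict
from math import ceil

def plan_allocations(jobs, node_cpus=104):
    """jobs: list of dicts with 'status' and 'num_cpus' (illustrative fields)."""
    pending = [j for j in jobs if j["status"] in ("uninitialized", "ready", "blocked")]
    groups = defaultdict(list)
    for job in pending:
        groups[job["num_cpus"]].append(job)        # group by resource requirements
    plan = {}
    for num_cpus, members in groups.items():
        jobs_per_node = max(1, node_cpus // num_cpus)        # jobs that fit per node
        plan[num_cpus] = ceil(len(members) / jobs_per_node)  # allocations needed
    return plan

print(plan_allocations([{"status": "ready", "num_cpus": 8} for _ in range(100)]))
# {8: 8} with the assumed 104-CPU nodes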
See Also
- Configurable Failure Handlers - Per-job retry with exit-code-specific recovery
- Resource Monitoring - Understanding resource tracking
Tutorial: Configurable Failure Handlers
This tutorial shows how to configure automatic job retry based on specific exit codes, with optional recovery scripts.
Learning Objectives
By the end of this tutorial, you will:
- Understand failure handlers and rules
- Configure exit-code-specific recovery
- Write recovery scripts
- Monitor retry attempts
Prerequisites
- Torc installed with the client feature
- A running Torc server
What Are Failure Handlers?
Failure handlers provide per-job automatic retry logic based on exit codes. Unlike
torc watch --recover which applies workflow-level recovery heuristics, failure handlers let you
define:
- Exit codes to match: Which return codes trigger recovery
- Recovery scripts: Optional scripts to run before retry
- Max retries: How many times to retry per exit code
Quick Start
Try the runnable demo:
torc run examples/yaml/failure_handler_demo.yaml
This workflow includes jobs that randomly fail with different exit codes, demonstrating how failure handlers automatically retry them.
Basic Example
Create a workflow with a failure handler:
failure_handlers:
- name: job_recovery
rules:
- exit_codes: [10, 11, 12]
recovery_script: ./recovery.sh
max_retries: 3
jobs:
- name: process_data
command: python process.py
failure_handler: job_recovery
How It Works
When a job with a failure handler fails:
- JobRunner checks the exit code against handler rules
- If a matching rule is found and attempt_id < max_retries:
  - Run recovery_script (if defined) with environment variables
  - If recovery succeeds: job is reset to Ready with incremented attempt_id
  - If recovery fails: job is marked as Failed
- If no match or max retries exceeded: job is marked as Failed
flowchart TD
fail["Job Fails<br/>(exit code 10)"]
match{"Matching rule<br/>for exit code?"}
retry{"attempt_id<br/>< max_retries?"}
recovery["Run recovery script"]
success{"Recovery<br/>succeeded?"}
reset["Reset to Ready<br/>attempt_id += 1"]
failed["Mark Failed"]
fail --> match
match -->|Yes| retry
match -->|No| failed
retry -->|Yes| recovery
retry -->|No| failed
recovery --> success
success -->|Yes| reset
success -->|No| failed
style fail fill:#dc3545,color:#fff
style reset fill:#28a745,color:#fff
style failed fill:#6c757d,color:#fff
style recovery fill:#ffc107,color:#000
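In code form, the rule matching in the flowchart is essentially "first matching rule wins". The sketch below mirrors the spec fields (exit_codes, match_all_exit_codes, max_retries) but is not the JobRunner's actual implementation:

# Minimal sketch of failure-handler rule matching; illustrative only.

def find_matching_rule(rules, exit_code):
    """Rules are evaluated in order; the first match wins."""
    for rule in rules:
        if rule.get("match_all_exit_codes") or exit_code in rule.get("exit_codes", []):
            return rule
    return None

def decide(rules, exit_code, attempt_id):
    rule = find_matching_rule(rules, exit_code)
    if rule is None:
        return "failed"   # or pending_failed; see "What Happens Without a Matching Handler"
    if attempt_id >= rule["max_retries"]:
        return "failed"   # max retries exceeded
    # A real run would execute rule.get("recovery_script") here before retrying.
    return "retry"

rules = [{"exit_codes": [10, 11, 12], "recovery_script": "./recovery.sh", "max_retries": 3}]
print(decide(rules, 10, attempt_id=1))  # retry
print(decide(rules, 1, attempt_id=1))   # failed (no matching rule)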
Environment Variables
Recovery scripts receive the following environment variables:
| Variable | Description |
|---|---|
| TORC_WORKFLOW_ID | Workflow ID |
| TORC_JOB_ID | Job ID |
| TORC_JOB_NAME | Job name |
| TORC_API_URL | Server URL |
| TORC_OUTPUT_DIR | Output directory |
| TORC_ATTEMPT_ID | Current attempt (1, 2, 3...) |
| TORC_RETURN_CODE | Exit code that triggered recovery |
Jobs also receive these environment variables during execution:
| Variable | Description |
|---|---|
| TORC_WORKFLOW_ID | Workflow ID |
| TORC_JOB_ID | Job ID |
| TORC_JOB_NAME | Job name |
| TORC_API_URL | Server URL |
| TORC_OUTPUT_DIR | Output directory |
| TORC_ATTEMPT_ID | Current attempt (starts at 1) |
Writing Recovery Scripts
Example recovery script:
#!/bin/bash
# recovery.sh - Run before retrying a failed job
echo "=== Recovery Script ==="
echo "Job: $TORC_JOB_NAME (attempt $TORC_ATTEMPT_ID)"
echo "Failed with exit code: $TORC_RETURN_CODE"
# Log the recovery attempt
LOG_FILE="$TORC_OUTPUT_DIR/recovery.log"
echo "$(date): Recovery for $TORC_JOB_NAME (exit $TORC_RETURN_CODE)" >> "$LOG_FILE"
# Take action based on exit code
case $TORC_RETURN_CODE in
10)
echo "Handling convergence error - adjusting parameters"
# Modify config files, adjust parameters, etc.
;;
11)
echo "Handling resource error - cleaning up"
# Free resources, clean temp files, etc.
;;
12)
echo "Handling transient error - no action needed"
;;
esac
exit 0 # Zero = proceed with retry
Make sure your script is executable:
chmod +x recovery.sh
Multiple Rules
Handle different exit codes with different strategies:
failure_handlers:
- name: comprehensive_recovery
rules:
# Convergence errors: adjust parameters
- exit_codes: [10]
recovery_script: ./adjust_params.sh
max_retries: 3
# Resource errors: clean up and retry
- exit_codes: [11]
recovery_script: ./cleanup.sh
max_retries: 2
# Transient errors: simple retry
- exit_codes: [12]
max_retries: 3
# Exit code 1 is NOT listed - jobs with exit 1 won't retry
Shared Handlers
Multiple jobs can share a failure handler:
failure_handlers:
- name: simulation_recovery
rules:
- exit_codes: [10, 11, 12]
recovery_script: ./recovery.sh
max_retries: 3
jobs:
- name: simulation_1
command: python simulate.py --config config1.yaml
failure_handler: simulation_recovery
- name: simulation_2
command: python simulate.py --config config2.yaml
failure_handler: simulation_recovery
Simple Retry for Any Failure
For jobs that are simply flaky and need retrying on any failure, use match_all_exit_codes:
failure_handlers:
- name: simple_retry
rules:
- match_all_exit_codes: true
max_retries: 3
jobs:
- name: flaky_job
command: ./flaky_script.sh
failure_handler: simple_retry
This retries the job up to 3 times on any non-zero exit code, without running a recovery script.
You can also combine match_all_exit_codes with specific exit code rules. Rules are evaluated in
order, so put specific rules first:
failure_handlers:
- name: mixed_recovery
rules:
# Specific handling for known error codes
- exit_codes: [10]
recovery_script: ./fix_convergence.sh
max_retries: 3
# Catch-all for any other failures
- match_all_exit_codes: true
max_retries: 2
Log Files
Each attempt gets separate log files, preserving history across retries:
output/job_stdio/job_wf1_j42_r1_a1.o # Attempt 1 stdout
output/job_stdio/job_wf1_j42_r1_a1.e # Attempt 1 stderr
output/job_stdio/job_wf1_j42_r1_a2.o # Attempt 2 stdout
output/job_stdio/job_wf1_j42_r1_a2.e # Attempt 2 stderr
The a{N} suffix indicates the attempt number.
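Because each attempt keeps its own files, you can collect a job's retry history with a simple glob. The pattern below just follows the example filenames above (with the run number wildcarded) and is illustrative only:

# Gather per-attempt stderr files for one job, following the naming shown above.
from pathlib import Path

def attempt_stderr_files(output_dir: str, workflow_id: int, job_id: int):
    pattern = f"job_wf{workflow_id}_j{job_id}_*_a*.e"
    return sorted(Path(output_dir, "job_stdio").glob(pattern))

for path in attempt_stderr_files("output", 1, 42):
    print(path)  # e.g. output/job_stdio/job_wf1_j42_r1_a1.e, ..._a2.e, ...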
Comparison with torc watch --recover
| Feature | Failure Handlers | torc watch --recover |
|---|---|---|
| Scope | Per-job, exit-code-specific | Workflow-wide |
| Triggers | Specific exit codes | OOM, timeout detection |
| Recovery | Custom scripts | Resource adjustment |
| Timing | Immediate (during run) | After workflow completes |
| Configuration | In workflow spec | Command-line options |
Use both together for comprehensive recovery:
- Failure handlers for immediate, exit-code-specific retry
- torc watch --recover for workflow-level resource adjustments
Slurm Integration with --auto-schedule
When using failure handlers with Slurm workflows, retried jobs need compute nodes to run on. The original Slurm allocations may have already completed or may not have enough capacity for the retries.
Use torc watch --auto-schedule to automatically submit new Slurm allocations when retry jobs are
waiting:
# Submit a workflow with failure handlers
torc submit-slurm --account my_project workflow.yaml
# Watch with auto-scheduling enabled (uses defaults)
torc watch $WORKFLOW_ID --auto-schedule
How It Works
- No schedulers available: If there are ready jobs but no active or pending Slurm allocations, new schedulers are immediately regenerated and submitted.
- Retry jobs accumulating: If there are active schedulers but retry jobs (jobs with attempt_id > 1) exceed the threshold, additional schedulers are submitted after the cooldown period.
- Stranded jobs: If retry jobs exist but are below the threshold and have been waiting longer than the stranded timeout, schedulers are submitted anyway. This prevents jobs from being stranded indefinitely when not enough failures occur to reach the threshold.
Options
| Option | Default | Description |
|---|---|---|
| --auto-schedule | false | Enable automatic scheduling for stranded jobs |
| --auto-schedule-threshold | 5 | Minimum retry jobs before scheduling (when active) |
| --auto-schedule-cooldown | 1800 | Seconds between auto-schedule attempts (30 min) |
| --auto-schedule-stranded-timeout | 7200 | Seconds before scheduling stranded jobs anyway (2 hrs) |
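Put together, the three conditions and the options above combine roughly as in this sketch. The real decision is made inside torc watch; whether the stranded-timeout check also honors the cooldown is an assumption here.

# Rough sketch of the --auto-schedule decision; inputs are simplified and would
# come from the Torc server in the real command.
import time

def should_schedule(ready_jobs, retry_jobs, active_schedulers,
                    last_schedule_time, retry_waiting_since,
                    threshold=5, cooldown=1800, stranded_timeout=7200):
    now = time.time()
    if ready_jobs and not active_schedulers:
        return True                                    # no schedulers available
    if now - last_schedule_time < cooldown:
        return False                                   # respect the cooldown
    if retry_jobs >= threshold:
        return True                                    # retry jobs accumulating
    if retry_jobs and now - retry_waiting_since >= stranded_timeout:
        return True                                    # stranded jobs waited too long
    return False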
Example Scenario
Workflow starts with 100 jobs across 10 Slurm allocations
Jobs 15, 23, 47 fail with exit code 10 → failure handler retries them
→ Jobs reset to Ready with attempt_id=2
All 10 allocations finish (97 jobs completed)
torc watch detects:
- No active schedulers
- 3 ready jobs (all retries)
→ Auto-schedules new allocation
New allocation starts, runs the 3 retry jobs
→ Job 15 succeeds
→ Jobs 23, 47 fail again → retry with attempt_id=3
Process continues until all jobs succeed or max_retries exceeded
Complete Example
Here's a complete workflow specification with failure handlers. See
examples/yaml/failure_handler_simulation.yaml for the full runnable example.
name: failure_handler_simulation
description: Simulation sweep with automatic failure recovery
failure_handlers:
- name: simulation_recovery
rules:
# Convergence issues: run recovery script, retry up to 3 times
- exit_codes: [10]
recovery_script: examples/scripts/recovery_demo.sh
max_retries: 3
# Resource issues: run recovery script, retry up to 2 times
- exit_codes: [11]
recovery_script: examples/scripts/recovery_demo.sh
max_retries: 2
# Transient errors: simple retry, no recovery script
- exit_codes: [12]
max_retries: 3
# Note: exit code 1 is intentionally NOT included (unrecoverable)
jobs:
# Parameterized jobs that may fail with different error codes
- name: simulate_m{model}_s{scenario}
command: bash examples/scripts/failure_demo_job.sh --fail-rate 0.7
failure_handler: simulation_recovery
parameters:
model: "0:2"
scenario: "0:2"
# Runs after all simulations complete
- name: aggregate_results
command: echo "All simulations completed successfully!"
depends_on:
- simulate_m{model}_s{scenario}
parameters:
model: "0:2"
scenario: "0:2"
Run this example with:
torc run examples/yaml/failure_handler_simulation.yaml
Tips and Best Practices
1. Use Exit Codes Consistently
Define meaningful exit codes in your scripts:
# simulate.py
import sys
# ConvergenceError, ResourceError, and TransientError are application-specific
# exceptions assumed to be defined elsewhere in your code.
try:
    # Simulation code
    pass
except ConvergenceError:
    sys.exit(10)  # Will trigger recovery
except ResourceError:
    sys.exit(11)  # Will trigger recovery
except TransientError:
    sys.exit(12)  # Will trigger recovery
except Exception as e:
    print(f"Unrecoverable error: {e}")
    sys.exit(1)  # Will NOT trigger recovery
2. Keep Recovery Scripts Simple
Recovery scripts should be quick and focused:
# Good: Simple, focused recovery
#!/bin/bash
echo "Cleaning up before retry..."
rm -f /tmp/lockfile
exit 0
# Avoid: Complex logic that might fail
3. Set Reasonable Max Retries
rules:
- exit_codes: [10]
max_retries: 3 # Good for transient errors
- exit_codes: [1]
max_retries: 1 # Generic errors - don't retry many times
4. Log Recovery Actions
Your recovery scripts should log what they're doing:
#!/bin/bash
LOG_FILE="$TORC_OUTPUT_DIR/recovery_${TORC_JOB_ID}.log"
echo "$(date): Recovery attempt $TORC_ATTEMPT_ID for exit code $TORC_RETURN_CODE" >> "$LOG_FILE"
Troubleshooting
Recovery Script Not Running
- Check that the script is executable: chmod +x script.sh
- Verify the script path is correct (relative to where torc runs)
- Check that the exit code matches a rule
Job Keeps Failing After Max Retries
- Check the job logs for all attempts
- Review the recovery script output
- Consider increasing max_retries or fixing the underlying issue
Environment Variables Not Set
Ensure you're accessing the variables correctly:
# Correct
echo $TORC_JOB_ID
# Also correct
echo ${TORC_JOB_ID}
Runnable Examples
The following examples are included with Torc and can be run directly:
- Simple demo: examples/yaml/failure_handler_demo.yaml
- Parameterized simulation: examples/yaml/failure_handler_simulation.yaml
These examples use:
- examples/scripts/failure_demo_job.sh - A bash script that randomly fails with different exit codes
- examples/scripts/recovery_demo.sh - A recovery script that logs actions
What Happens Without a Matching Handler
When a job fails with an exit code that doesn't match any failure handler rule, the job enters
pending_failed status instead of failed. This provides an opportunity for intelligent recovery:
flowchart TD
FAIL["Job fails<br/>(exit code 1)"]
MATCH{"Failure handler<br/>rule matches?"}
RETRY["Retry via<br/>failure handler"]
PENDING["Status: pending_failed<br/>Awaiting classification"]
AI["AI agent or user<br/>classifies error"]
TRANSIENT["Retry<br/>(transient error)"]
PERMANENT["Fail<br/>(permanent error)"]
FAIL --> MATCH
MATCH -->|Yes| RETRY
MATCH -->|No| PENDING
PENDING --> AI
AI -->|Transient| TRANSIENT
AI -->|Permanent| PERMANENT
style FAIL fill:#dc3545,color:#fff
style RETRY fill:#28a745,color:#fff
style PENDING fill:#ffc107,color:#000
style AI fill:#4a9eff,color:#fff
style TRANSIENT fill:#28a745,color:#fff
style PERMANENT fill:#6c757d,color:#fff
Benefits of pending_failed
- No immediate cascade: Downstream jobs stay blocked instead of being canceled
- Time to analyze: Errors can be reviewed before deciding retry vs fail
- AI-assisted recovery: MCP tools allow AI agents to classify errors intelligently
Handling pending_failed Jobs
Option 1: Manual reset
# Reset all pending_failed jobs (along with failed/canceled/terminated)
torc workflows reset-status $WORKFLOW_ID --failed-only
Option 2: AI-assisted classification
Use an AI agent with the torc MCP server:
- list_pending_failed_jobs - See jobs with their stderr
- classify_and_resolve_failures - Apply retry/fail decisions
See AI-Assisted Recovery for details.
Option 3: Catch-all failure handler
To prevent pending_failed status entirely, add a catch-all rule:
failure_handlers:
- name: comprehensive_recovery
rules:
# Specific handling for known codes
- exit_codes: [10, 11]
recovery_script: ./recover.sh
max_retries: 3
# Catch-all for any other failures
- match_all_exit_codes: true
max_retries: 1
With match_all_exit_codes: true, all failures are handled by the failure handler and will never
reach pending_failed status.
Summary
Failure handlers provide fine-grained control over job retry behavior:
- Define rules for specific exit codes
- Run recovery scripts before retry
- Limit retries to prevent infinite loops
- Share handlers across multiple jobs
- Unmatched failures enter pending_failed for AI-assisted or manual classification
Use failure handlers for immediate, exit-code-specific recovery, and combine with
torc watch --recover for comprehensive workflow resilience.
Tutorial: AI-Assisted Failure Recovery
🧪 EXPERIMENTAL: This feature is new and not yet well-tested. The API and behavior may change based on user feedback.
This tutorial shows how to use AI agents (Claude Code, GitHub Copilot, or custom MCP clients) to intelligently classify and recover from workflow failures that can't be handled by rule-based mechanisms.
Learning Objectives
By the end of this tutorial, you will:
- Understand the pending_failed job status
- Configure workflows for AI-assisted recovery
- Use the torc MCP server with AI agents
- Classify transient vs permanent errors
Prerequisites
- Torc installed with the client feature
- A running Torc server
- An MCP-compatible AI agent (Claude Code, GitHub Copilot, or custom)
- torc-mcp-server installed
The Problem
Traditional recovery mechanisms have limitations:
| Mechanism | Limitation |
|---|---|
| Failure handlers | Require predefined exit codes |
| torc watch --recover | Only handles OOM and timeout patterns |
| --retry-unknown | Blindly retries everything |
Some failures require intelligent classification:
- Transient errors: Network timeouts, node failures, service outages - should retry
- Permanent errors: Code bugs, missing files, invalid inputs - should fail
AI agents can analyze error logs and make these distinctions.
How It Works
flowchart TD
JOB["Job exits with<br/>non-zero code"]
HANDLER{"Failure handler<br/>matches?"}
RETRY["Retry via<br/>failure handler"]
PENDING["Status: pending_failed<br/>Awaiting classification"]
WATCH["torc watch/recover<br/>+ AI agent"]
CLASSIFY["AI analyzes stderr"]
TRANSIENT["Transient error<br/>→ Retry"]
PERMANENT["Permanent error<br/>→ Fail"]
JOB --> HANDLER
HANDLER -->|Yes| RETRY
HANDLER -->|No| PENDING
PENDING --> WATCH
WATCH --> CLASSIFY
CLASSIFY --> TRANSIENT
CLASSIFY --> PERMANENT
style JOB fill:#dc3545,color:#fff
style PENDING fill:#ffc107,color:#000
style CLASSIFY fill:#4a9eff,color:#fff
style TRANSIENT fill:#28a745,color:#fff
style PERMANENT fill:#6c757d,color:#fff
When a job fails without a matching failure handler rule, it enters the pending_failed status
instead of failed. This prevents immediate downstream job cancellation and gives the AI agent time
to classify the error.
Quick Start
Option A: Automatic AI Agent Invocation (Recommended)
Use the --ai-recovery flag to automatically invoke the Claude CLI for classification:
# One-shot recovery with AI classification
torc recover 123 --ai-recovery
# Continuous monitoring with AI classification
torc watch 123 --ai-recovery
# Specify a different AI agent
torc recover 123 --ai-recovery --ai-agent claude # Default
torc recover 123 --ai-recovery --ai-agent copilot # GitHub Copilot
When --ai-recovery is enabled:
- Torc detects jobs in pending_failed status
- Automatically invokes the AI agent CLI with the torc MCP server
- AI agent analyzes stderr and classifies each job as transient (retry) or permanent (fail)
- Classifications are applied via MCP tools
- Recovery continues with the newly classified jobs
Requirements:
- Claude: Claude Code CLI installed (claude command available)
- GitHub Copilot: GitHub CLI with Copilot installed (gh copilot command available)
- Torc MCP server configured in your AI agent's MCP settings
Option B: Manual AI Agent Invocation
If you prefer manual control, configure your AI agent and invoke it yourself.
1. Start the MCP Server
torc-mcp-server --url http://localhost:8080/torc-service/v1
2. Configure Your AI Agent
Add the torc MCP server to your agent's configuration:
Claude Code (~/.claude/mcp_servers.json):
{
"mcpServers": {
"torc": {
"command": "torc-mcp-server",
"args": ["--url", "http://localhost:8080/torc-service/v1"]
}
}
}
GitHub Copilot (.github/copilot/mcp-config.json or global config):
{
"mcpServers": {
"torc": {
"command": "torc-mcp-server",
"args": ["--url", "http://localhost:8080/torc-service/v1"]
}
}
}
3. Run a Workflow
torc run my_workflow.yaml
4. Monitor with AI Recovery
When jobs fail, use your AI agent to:
- List pending failures:
  Agent: Use list_pending_failed_jobs with workflow_id=123
- Analyze the errors:
  Agent: The stderr shows "Connection refused to storage.example.com:443". This is a transient network error - the storage server was temporarily down.
- Classify and resolve:
  Agent: Use classify_and_resolve_failures to retry these jobs
MCP Tools
The torc MCP server provides these tools for AI-assisted recovery:
list_pending_failed_jobs
Lists jobs with pending_failed status, including their stderr output.
Input:
{
"workflow_id": 123
}
Output:
{
"workflow_id": 123,
"pending_failed_count": 2,
"pending_failed_jobs": [
{
"job_id": 456,
"name": "process_data",
"return_code": 1,
"stderr_tail": "ConnectionError: Connection refused..."
}
],
"guidance": "Analyze the stderr output to classify each failure..."
}
classify_and_resolve_failures
Applies classifications to pending_failed jobs.
Input:
{
"workflow_id": 123,
"classifications": [
{
"job_id": 456,
"action": "retry",
"reason": "Transient network error - storage server was down"
},
{
"job_id": 789,
"action": "fail",
"reason": "SyntaxError in user code - requires fix"
}
],
"dry_run": true
}
Actions:
- retry: Reset to ready status with a bumped attempt_id
- fail: Set to failed status (triggers downstream cancellation)
Optional resource adjustments:
{
"job_id": 456,
"action": "retry",
"memory": "16g",
"runtime": "PT4H",
"reason": "OOM detected in stderr, increasing memory"
}
Error Classification Guide
Transient Errors (Should Retry)
| Error Pattern | Category |
|---|---|
| Connection refused, Connection timed out | Network |
| NCCL timeout, GPU communication error | GPU/HPC |
| EIO, Input/output error | Hardware |
| Slurm: node failure, PREEMPTED | HPC scheduling |
| Service Unavailable, 503 | External service |
Permanent Errors (Should Fail)
| Error Pattern | Category |
|---|---|
| SyntaxError, IndentationError | Code bug |
| ModuleNotFoundError, ImportError | Missing dependency |
| FileNotFoundError (for input files) | Missing data |
| IndexError, KeyError | Logic error |
| PermissionDenied (consistent) | Configuration issue |
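The two tables can be read as a rough rule of thumb. A hand-rolled classifier over stderr (which an AI agent performs far more flexibly) might look like this sketch:

# Rule-of-thumb classifier based on the error patterns tabulated above.
# An AI agent does this classification more flexibly; this is only a sketch.
import re

TRANSIENT = [r"Connection refused", r"Connection timed out", r"NCCL timeout",
             r"Input/output error", r"node failure", r"PREEMPTED",
             r"Service Unavailable", r"\b503\b"]
PERMANENT = [r"SyntaxError", r"IndentationError", r"ModuleNotFoundError",
             r"ImportError", r"FileNotFoundError", r"IndexError", r"KeyError",
             r"PermissionDenied"]

def classify(stderr_tail: str) -> str:
    if any(re.search(p, stderr_tail) for p in PERMANENT):
        return "fail"     # permanent: needs a code or data fix
    if any(re.search(p, stderr_tail) for p in TRANSIENT):
        return "retry"    # transient: safe to retry
    return "unknown"      # leave for human or AI review

print(classify("ConnectionError: Connection refused to storage.internal:443"))  # retry
print(classify("ValueError: Column 'timestamp' has invalid format"))            # unknown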
Integration with Existing Recovery
AI-assisted recovery works alongside other mechanisms:
failure_handlers:
- name: known_errors
rules:
# Known recoverable exit codes handled immediately
- exit_codes: [10, 11]
recovery_script: ./recover.sh
max_retries: 3
# Unknown errors go to pending_failed for AI classification
When a job fails with an exit code not covered by the failure handler, it becomes pending_failed
instead of failed.
Using with torc recover
The torc recover command can work with AI agents:
# Check for pending_failed jobs
torc jobs list $WORKFLOW_ID --status pending_failed
# Use AI agent to classify and resolve
# (via MCP tools list_pending_failed_jobs and classify_and_resolve_failures)
# After AI classification, reinitialize and continue
torc workflows reinitialize $WORKFLOW_ID
torc workflows run $WORKFLOW_ID
Custom MCP Servers
You can create domain-specific MCP servers that the AI agent uses alongside torc:
flowchart LR
subgraph agent["AI Agent"]
LLM["Claude/Copilot"]
end
subgraph mcp["MCP Servers"]
TORC["torc-mcp-server"]
CUSTOM["custom-mcp-server"]
end
subgraph systems["Systems"]
API["Torc API"]
DOMAIN["Domain APIs"]
end
LLM --> TORC
LLM --> CUSTOM
TORC --> API
CUSTOM --> DOMAIN
style LLM fill:#4a9eff,color:#fff
style TORC fill:#28a745,color:#fff
style CUSTOM fill:#ffc107,color:#000
For example, a custom MCP server could:
- Check external service status (was the storage server down?)
- Look up known error patterns in a database
- Query infrastructure monitoring systems
Workflow Reset
If AI classification doesn't happen (e.g., no agent is running), jobs stay in pending_failed
status indefinitely. To reset them:
# Reset all failed/pending_failed jobs
torc workflows reset-status $WORKFLOW_ID --failed-only
Example Session
Here's an example conversation with an AI agent:
User: My workflow 123 has some failures. Can you help?
Agent: Let me check what's happening.
[Uses get_workflow_status tool]
The workflow has 3 jobs in pending_failed status.
Agent: [Uses list_pending_failed_jobs tool]
I found 3 jobs awaiting classification:
Job 456 (process_batch_1): stderr shows
"ConnectionError: Connection refused to storage.internal:443"
Job 457 (process_batch_2): stderr shows the same connection error
Job 458 (validate_data): stderr shows
"ValueError: Column 'timestamp' has invalid format"
Agent: Analysis:
- Jobs 456, 457: Transient network error. The storage server was
temporarily unreachable. These should be retried.
- Job 458: Permanent error. There's a data format issue that
requires fixing the input data.
Agent: [Uses classify_and_resolve_failures with dry_run=true]
Here's my proposed classification:
- Job 456: retry (transient network error)
- Job 457: retry (transient network error)
- Job 458: fail (invalid input data)
Should I apply these classifications?
User: Yes, go ahead.
Agent: [Uses classify_and_resolve_failures with dry_run=false]
Done! Jobs 456 and 457 are now ready to retry.
Job 458 has been marked as failed.
Note: You'll need to fix the timestamp format in your input data
before rerunning job 458.
Summary
AI-assisted recovery provides intelligent failure classification:
- pending_failed status: Defers classification for AI analysis
- MCP tools: list_pending_failed_jobs, classify_and_resolve_failures
- Error patterns: Transient vs permanent classification
- Integration: Works with failure handlers and torc recover
When running jobs on HPC systems like Slurm, your job may be terminated when the allocated wall-time expires. Torc supports graceful termination, allowing jobs to save checkpoints before exiting. This guide explains how to configure Slurm and your jobs to handle wall-time timeouts gracefully.
Overview
When Slurm is about to reach wall-time, it can be configured to send a SIGTERM signal to the Torc worker process. Torc then:
- Sends SIGTERM to jobs with supports_termination: true
- Sends SIGKILL to jobs with supports_termination: false (or unset)
- Waits for all processes to exit
- Reports job status as terminated to the server
Jobs that support termination can catch SIGTERM and perform cleanup operations like saving checkpoints, flushing buffers, or releasing resources.
Enabling Graceful Termination
Configuring Slurm to Send a Signal Before Timeout
By default, Slurm does not send any signal before the job's end time. When the wall-time limit
is reached, Slurm immediately terminates all processes. To receive a warning signal before timeout,
you must explicitly configure it using the --signal option in the extra field of your Slurm
scheduler specification:
slurm_schedulers:
- name: gpu_scheduler
account: my_project
partition: gpu
nodes: 1
walltime: "04:00:00"
extra: "--signal=B:TERM@300" # Send SIGTERM to batch script 300 seconds before timeout
The --signal option format is [B:]<sig_num>[@sig_time]:
- The B: prefix sends the signal only to the batch shell (by default, all job steps are signaled but not the batch shell itself)
- sig_num is the signal name or number (e.g., TERM, USR1, 10)
- sig_time is the number of seconds before the time limit to send the signal (default: 60 if not specified)
Note: Due to Slurm's event handling resolution, the signal may be sent up to 60 seconds earlier than specified.
To enable graceful termination for a job, set supports_termination: true in your job
specification:
Configuring a Torc job to be terminated gracefully
jobs:
- name: training_job
command: python train.py --checkpoint-dir /scratch/checkpoints
supports_termination: true
resource_requirements:
num_cpus: 4
memory: 16g
runtime: PT2H
Writing a Job That Handles SIGTERM
Your job script must catch SIGTERM and save its state. Here's a Python example:
import signal
import sys
import pickle

# Global state
checkpoint_path = "/scratch/checkpoints/model.pkl"
model_state = None

def save_checkpoint():
    """Save current model state to disk."""
    print("Saving checkpoint...")
    with open(checkpoint_path, 'wb') as f:
        pickle.dump(model_state, f)
    print(f"Checkpoint saved to {checkpoint_path}")

def handle_sigterm(signum, frame):
    """Handle SIGTERM by saving checkpoint and exiting."""
    print("Received SIGTERM - saving checkpoint before exit")
    save_checkpoint()
    sys.exit(0)  # Exit cleanly after saving

# Register the signal handler
signal.signal(signal.SIGTERM, handle_sigterm)

# Main training loop
def train():
    global model_state
    for epoch in range(1000):
        # Training logic here...
        model_state = {"epoch": epoch, "weights": [...]}
        # Optionally save periodic checkpoints
        if epoch % 100 == 0:
            save_checkpoint()

if __name__ == "__main__":
    train()
Bash Script Example
For shell scripts, use trap to catch SIGTERM:
#!/bin/bash
CHECKPOINT_FILE="/scratch/checkpoints/progress.txt"
# Function to save checkpoint
save_checkpoint() {
echo "Saving checkpoint at iteration $ITERATION"
echo "$ITERATION" > "$CHECKPOINT_FILE"
}
# Trap SIGTERM and save checkpoint
trap 'save_checkpoint; exit 0' SIGTERM
# Load checkpoint if exists
if [ -f "$CHECKPOINT_FILE" ]; then
ITERATION=$(cat "$CHECKPOINT_FILE")
echo "Resuming from iteration $ITERATION"
else
ITERATION=0
fi
# Main loop
while [ $ITERATION -lt 1000 ]; do
# Do work...
ITERATION=$((ITERATION + 1))
sleep 1
done
Complete Workflow Example
name: ml_training_workflow
user: researcher
jobs:
- name: preprocess
command: python preprocess.py
supports_termination: false # Quick job, no checkpointing needed
- name: train_model
command: python train.py --checkpoint-dir /scratch/checkpoints
supports_termination: true # Long job, needs checkpointing
depends_on:
- preprocess
resource_requirements:
num_cpus: 8
memory: 32g
num_gpus: 1
runtime: PT4H
- name: evaluate
command: python evaluate.py
supports_termination: true
depends_on:
- train_model
slurm_schedulers:
- name: gpu_scheduler
account: my_project
partition: gpu
nodes: 1
walltime: "04:00:00"
extra: "--signal=B:TERM@300" # Send SIGTERM to batch script 300 seconds before timeout
actions:
- trigger_type: on_workflow_start
action_type: schedule_nodes
scheduler: gpu_scheduler
scheduler_type: slurm
num_allocations: 1
Restarting After Termination
When a job is terminated due to wall-time, it will have status terminated. To continue the
workflow:
- Re-submit the workflow to allocate new compute time: torc workflows submit $WORKFLOW_ID
- Reinitialize terminated jobs to make them ready again: torc workflows reinitialize $WORKFLOW_ID
Your job script should detect existing checkpoints and resume from where it left off.
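For completeness, the resume side of the Python example above might look like this sketch (same checkpoint path and pickle format as that example; adapt it to your own job):

# Sketch of resuming from the checkpoint written by the earlier example.
import os
import pickle

checkpoint_path = "/scratch/checkpoints/model.pkl"

def load_checkpoint():
    """Return the saved state, or None to start from scratch."""
    if not os.path.exists(checkpoint_path):
        return None
    with open(checkpoint_path, "rb") as f:
        return pickle.load(f)

state = load_checkpoint()
start_epoch = state["epoch"] + 1 if state else 0
print(f"Resuming training at epoch {start_epoch}")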
Best Practices
1. Verify Checkpoint Integrity
Add validation to ensure checkpoints are complete:
import os  # needed for the atomic rename below

def save_checkpoint():
    temp_path = checkpoint_path + ".tmp"
    with open(temp_path, 'wb') as f:
        pickle.dump(model_state, f)
    # Atomic rename ensures complete checkpoint
    os.rename(temp_path, checkpoint_path)
2. Handle Multiple Termination Signals
Some systems send multiple signals. Ensure your handler is idempotent:
checkpoint_saved = False

def handle_sigterm(signum, frame):
    global checkpoint_saved
    if not checkpoint_saved:
        save_checkpoint()
        checkpoint_saved = True
    sys.exit(0)
3. Test Locally
Test your SIGTERM handling locally before running on the cluster:
# Start your job
python train.py &
PID=$!
# Wait a bit, then send SIGTERM
sleep 10
kill -TERM $PID
# Verify checkpoint was saved
ls -la /scratch/checkpoints/
Troubleshooting
Job Killed Without Checkpointing
Symptoms: Job status is terminated but no checkpoint was saved.
Causes:
- supports_termination not set to true
- Signal handler not registered before training started
- Checkpoint save took longer than the buffer time
Solutions:
- Verify supports_termination: true in the job spec
- Register signal handlers early in your script
Checkpoint File Corrupted
Symptoms: Job fails to load checkpoint on restart.
Causes:
- Job was killed during checkpoint write
- Disk space exhausted
Solutions:
- Use atomic file operations (write to temp, then rename)
- Check available disk space before checkpointing
- Implement checkpoint validation on load
Job Doesn't Receive SIGTERM
Symptoms: Job runs until hard kill with no graceful shutdown.
Causes:
- Job running in a subprocess that doesn't forward signals
- Container or wrapper script intercepting signals
Solutions:
- Use exec in wrapper scripts to replace the shell
- Configure signal forwarding in containers
- Run the job directly without wrapper scripts
See Also
- Working with Slurm - General Slurm configuration
- Managing Resources - Resource requirements configuration
- Debugging Workflows - Troubleshooting workflow issues
- Slurm sbatch --signal option - Customize which signal is sent and when before wall-time timeout
Administration & Security
Server deployment, configuration, and security.
- Server Deployment - Deploying the Torc server
- Configuration Files - Server configuration
- Authentication - User authentication setup
- Access Groups - Managing user permissions
- Team-Based Access Control Tutorial - Setting up team access
- Security Reference - Security best practices
Server Deployment
This guide covers deploying and operating the Torc server in production environments, including logging configuration, daemonization, and service management.
Server Subcommands
The torc-server binary has two main subcommands:
torc-server run
Use torc-server run for:
- HPC login nodes - Run the server in a tmux session while your jobs are running.
- Development and testing - Run the server interactively in a terminal
- Manual startup - When you want to control when the server starts and stops
- Custom deployment - Integration with external process managers (e.g., supervisord, custom scripts)
- Debugging - Running with verbose logging to troubleshoot issues
# Basic usage
torc-server run
# With options
torc-server run --port 8080 --database ./torc.db --log-level debug
torc-server run --completion-check-interval-secs 5
torc-server service
Use torc-server service for:
- Production deployment - Install as a system service that starts on boot
- Reliability - Automatic restart on failure
- Managed lifecycle - Standard start/stop/status commands
- Platform integration - Uses systemd (Linux), launchd (macOS), or Windows Services
# Install and start as a user service
torc-server service install --user
torc-server service start --user
# Or as a system service (requires root)
sudo torc-server service install
sudo torc-server service start
Which to choose?
- For HPC login nodes/development/testing: Use torc-server run
- For production servers/standalone computers: Use torc-server service install
Quick Start
User Service (Development)
For development, install as a user service (no root required):
# Install with automatic defaults (logs to ~/.torc/logs, db at ~/.torc/torc.db)
torc-server service install --user
# Start the service
torc-server service start --user
System Service (Production)
For production deployment, install as a system service:
# Install with automatic defaults (logs to /var/log/torc, db at /var/lib/torc/torc.db)
sudo torc-server service install
# Start the service
sudo torc-server service start
The service will automatically start on boot and restart on failure. Logs are automatically configured to rotate when they reach 10 MiB (keeping 5 files max). See the Service Management section for customization options.
Logging System
Torc-server uses the tracing ecosystem for structured, high-performance logging with automatic
size-based file rotation.
Console Logging (Default)
By default, logs are written to stdout/stderr only:
torc-server run --log-level info
File Logging with Size-Based Rotation
Enable file logging by specifying a log directory:
torc-server run --log-dir /var/log/torc
This will:
- Write logs to both console and file
- Automatically rotate when log file reaches 10 MiB
- Keep up to 5 rotated log files (torc-server.log, torc-server.log.1, ..., torc-server.log.5)
- Oldest files are automatically deleted when limit is exceeded
JSON Format Logs
For structured log aggregation (e.g., ELK stack, Splunk):
torc-server run --log-dir /var/log/torc --json-logs
This writes JSON-formatted logs to the file while keeping human-readable logs on console.
Log Levels
Control verbosity with the --log-level flag or RUST_LOG environment variable:
# Available levels: error, warn, info, debug, trace
torc-server run --log-level debug --log-dir /var/log/torc
# Or using environment variable
RUST_LOG=debug torc-server run --log-dir /var/log/torc
Environment Variables
- TORC_LOG_DIR: Default log directory
- RUST_LOG: Default log level
Example:
export TORC_LOG_DIR=/var/log/torc
export RUST_LOG=info
torc-server run
Daemonization (Unix/Linux Only)
Run torc-server as a background daemon:
torc-server run --daemon --log-dir /var/log/torc
Important:
- Daemonization is only available on Unix/Linux systems
- When running as a daemon, you must use --log-dir since console output is lost
- The daemon creates a PID file (default: /var/run/torc-server.pid)
Custom PID File Location
torc-server run --daemon --pid-file /var/run/torc/server.pid --log-dir /var/log/torc
Stopping a Daemon
# Find the PID
cat /var/run/torc-server.pid
# Kill the process
kill $(cat /var/run/torc-server.pid)
# Or forcefully
kill -9 $(cat /var/run/torc-server.pid)
Complete Example: Production Deployment
#!/bin/bash
# Production deployment script
# Create required directories
sudo mkdir -p /var/log/torc
sudo mkdir -p /var/run/torc
sudo mkdir -p /var/lib/torc
# Set permissions (adjust as needed)
sudo chown -R torc:torc /var/log/torc
sudo chown -R torc:torc /var/run/torc
sudo chown -R torc:torc /var/lib/torc
# Start server as daemon
torc-server run \
--daemon \
--log-dir /var/log/torc \
--log-level info \
--json-logs \
--pid-file /var/run/torc/server.pid \
--database /var/lib/torc/torc.db \
--url 0.0.0.0 \
--port 8080 \
--threads 8 \
--auth-file /etc/torc/htpasswd \
--require-auth
Service Management (Recommended for Production)
Automatic Installation
The easiest way to install torc-server as a service is using the built-in service management commands.
User Service (No Root Required)
Install as a user service that runs under your user account (recommended for development):
# Install with defaults (logs to ~/.torc/logs, database at ~/.torc/torc.db)
torc-server service install --user
# Or customize the configuration
torc-server service install --user \
--log-dir ~/custom/logs \
--database ~/custom/torc.db \
--url 0.0.0.0 \
--port 8080 \
--threads 4
# Start the user service
torc-server service start --user
# Check status
torc-server service status --user
# Stop the service
torc-server service stop --user
# Uninstall the service
torc-server service uninstall --user
User Service Defaults:
- Log directory: ~/.torc/logs
- Database: ~/.torc/torc.db
- Listen address: 0.0.0.0:8080
- Worker threads: 4
System Service (Requires Root)
Install as a system-wide service (recommended for production):
# Install with defaults
sudo torc-server service install
# Or customize the configuration
sudo torc-server service install \
--log-dir /var/log/torc \
--database /var/lib/torc/torc.db \
--url 0.0.0.0 \
--port 8080 \
--threads 8 \
--auth-file /etc/torc/htpasswd \
--require-auth \
--json-logs
# Start the system service
sudo torc-server service start
# Check status
torc-server service status
# Stop the service
sudo torc-server service stop
# Uninstall the service
sudo torc-server service uninstall
System Service Defaults:
- Log directory: /var/log/torc
- Database: /var/lib/torc/torc.db
- Listen address: 0.0.0.0:8080
- Worker threads: 4
This automatically creates the appropriate service configuration for your platform:
- Linux: systemd service (user: ~/.config/systemd/user/, system: /etc/systemd/system/)
- macOS: launchd service (user: ~/Library/LaunchAgents/, system: /Library/LaunchDaemons/)
- Windows: Windows Service
Manual Systemd Service (Linux)
Alternatively, you can manually create a systemd service:
# /etc/systemd/system/torc-server.service
[Unit]
Description=Torc Workflow Orchestration Server
After=network.target
[Service]
Type=simple
User=torc
Group=torc
WorkingDirectory=/var/lib/torc
Environment="RUST_LOG=info"
Environment="TORC_LOG_DIR=/var/log/torc"
ExecStart=/usr/local/bin/torc-server run \
--log-dir /var/log/torc \
--json-logs \
--database /var/lib/torc/torc.db \
--url 0.0.0.0 \
--port 8080 \
--threads 8 \
--auth-file /etc/torc/htpasswd \
--require-auth
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
Then:
sudo systemctl daemon-reload
sudo systemctl enable torc-server
sudo systemctl start torc-server
sudo systemctl status torc-server
# View logs
journalctl -u torc-server -f
Log Rotation Strategy
The server uses automatic size-based rotation with the following defaults:
- Max file size: 10 MiB per file
- Max files: 5 rotated files (plus the current log file)
- Total disk usage: Maximum of ~50 MiB for all log files
When the current log file reaches 10 MiB, it is automatically rotated:
- torc-server.log → torc-server.log.1
- torc-server.log.1 → torc-server.log.2
- And so on...
- Oldest file (torc-server.log.5) is deleted
This ensures predictable disk usage without external tools like logrotate.
Timing Instrumentation
For advanced performance monitoring, enable timing instrumentation:
TORC_TIMING_ENABLED=true torc-server run --log-dir /var/log/torc
This adds detailed timing information for all instrumented functions. Note that timing instrumentation works with both console and file logging.
Troubleshooting
Daemon won't start
-
Check permissions on log directory:
ls -la /var/log/torc -
Check if PID file directory exists:
ls -la /var/run/ -
Try running in foreground first:
torc-server run --log-dir /var/log/torc
No log files created
- Verify
--log-diris specified - Check directory permissions
- Check disk space:
df -h
Logs not rotating
Log rotation happens automatically when a log file reaches 10 MiB. If you need to verify rotation is working:
- Check the log directory for numbered files (e.g.,
torc-server.log.1) - Monitor disk usage - it should never exceed ~50 MiB for all log files
- For testing, you can generate large amounts of logs with
--log-level trace
Using Configuration Files
This guide shows how to set up and use configuration files for Torc components.
Quick Start
Create a user configuration file:
torc config init --user
Edit the file at ~/.config/torc/config.toml to set your defaults.
Configuration File Locations
| Location | Purpose | Priority |
|---|---|---|
| /etc/torc/config.toml | System-wide defaults | 1 (lowest) |
| ~/.config/torc/config.toml | User preferences | 2 |
| ./torc.toml | Project-specific | 3 |
| Environment variables | Runtime overrides | 4 |
| CLI arguments | Explicit overrides | 5 (highest) |
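The priorities in the table behave like a layered merge, with later layers overriding earlier ones. Conceptually (the values here are just examples):

# Illustrative only: configuration precedence as a left-to-right dict merge.
system = {"api_url": "http://localhost:8080/torc-service/v1", "format": "table"}
user = {"format": "json"}                                  # ~/.config/torc/config.toml
project = {}                                               # ./torc.toml
env = {"api_url": "http://server:8080/torc-service/v1"}    # TORC_CLIENT__API_URL
cli = {}                                                   # --url / --format flags

effective = {**system, **user, **project, **env, **cli}
print(effective)
# {'api_url': 'http://server:8080/torc-service/v1', 'format': 'json'}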
Available Commands
# Show configuration file locations
torc config paths
# Show effective (merged) configuration
torc config show
# Show as JSON
torc config show --format json
# Create configuration file
torc config init --user # User config
torc config init --local # Project config
torc config init --system # System config (requires root)
# Validate configuration
torc config validate
Client Configuration
Common client settings:
[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table" # or "json"
log_level = "info"
username = "myuser"
[client.run]
poll_interval = 5.0
output_dir = "output"
max_parallel_jobs = 4
num_cpus = 8
memory_gb = 32.0
num_gpus = 1
Server Configuration
For torc-server:
[server]
url = "0.0.0.0"
port = 8080
threads = 4
database = "/path/to/torc.db"
auth_file = "/path/to/htpasswd"
require_auth = false
completion_check_interval_secs = 30.0
log_level = "info"
https = false
[server.logging]
log_dir = "/var/log/torc"
json_logs = false
Dashboard Configuration
For torc-dash:
[dash]
host = "127.0.0.1"
port = 8090
api_url = "http://localhost:8080/torc-service/v1"
torc_bin = "torc"
torc_server_bin = "torc-server"
standalone = false
server_port = 0
completion_check_interval_secs = 5
Environment Variables
Use environment variables for runtime configuration. Use double underscore (__) to separate nested
keys:
# Client settings
export TORC_CLIENT__API_URL="http://server:8080/torc-service/v1"
export TORC_CLIENT__FORMAT="json"
# Server settings
export TORC_SERVER__PORT="9999"
export TORC_SERVER__THREADS="8"
# Dashboard settings
export TORC_DASH__PORT="8090"
Overriding with CLI Arguments
CLI arguments always take precedence:
# Uses config file for api_url, but CLI for format
torc --format json workflows list
# CLI url overrides config file
torc --url http://other:8080/torc-service/v1 workflows list
Common Patterns
Development Environment
# ~/.config/torc/config.toml
[client]
api_url = "http://localhost:8080/torc-service/v1"
log_level = "debug"
[client.run]
poll_interval = 2.0
Team Shared Server
# ~/.config/torc/config.toml
[client]
api_url = "http://torc.internal.company.com:8080/torc-service/v1"
username = "developer"
CI/CD Pipeline
#!/bin/bash
export TORC_CLIENT__API_URL="${CI_TORC_SERVER}"
export TORC_CLIENT__FORMAT="json"
torc run workflow.yaml
result=$(torc workflows status $WORKFLOW_ID | jq -r '.status')
HPC Cluster
# Project-local torc.toml
[client]
api_url = "http://login-node:8080/torc-service/v1"
[client.run]
num_cpus = 64
memory_gb = 256.0
num_gpus = 8
output_dir = "/scratch/user/workflow_output"
Troubleshooting
Configuration not applied?
- Check which files are loaded: torc config validate
- View effective config: torc config show
- Verify file permissions and syntax
Environment variable not working?
Use double underscore for nesting: TORC_CLIENT__API_URL (not TORC_CLIENT_API_URL)
Invalid configuration?
Run validation: torc config validate
Authentication
Torc supports HTTP Basic authentication to secure access to your workflow orchestration server. This guide explains how to set up and use authentication.
Overview
Torc's authentication system provides:
- Multi-user support via htpasswd files
- Bcrypt password hashing for secure credential storage
- Backward compatibility - authentication is optional by default
- Flexible deployment - can require authentication or allow mixed access
- CLI and environment variable support for credentials
Server-Side Setup
1. Create User Accounts
Use the torc-htpasswd utility to manage user accounts:
# Add a user (will prompt for password)
torc-htpasswd add --file /path/to/htpasswd username
# Add a user with password on command line
torc-htpasswd add --file /path/to/htpasswd --password mypassword username
# Add a user with custom bcrypt cost (higher = more secure but slower)
torc-htpasswd add --file /path/to/htpasswd --cost 14 username
# Generate a password hash for remote registration (see below)
torc-htpasswd hash username
# List all users
torc-htpasswd list --file /path/to/htpasswd
# Verify a password
torc-htpasswd verify --file /path/to/htpasswd username
# Remove a user
torc-htpasswd remove --file /path/to/htpasswd username
The htpasswd file format is simple:
# Torc htpasswd file
# Format: username:bcrypt_hash
alice:$2b$12$abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOP
bob:$2b$12$zyxwvutsrqponmlkjihgfedcba0987654321ZYXWVUTSRQPONMLK
2. Start Server with Authentication
# Optional authentication (backward compatible mode)
torc-server run --auth-file /path/to/htpasswd
# Required authentication (all requests must authenticate)
torc-server run --auth-file /path/to/htpasswd --require-auth
# With access control enforcement and admin users
torc-server run --auth-file /path/to/htpasswd --require-auth \
--enforce-access-control --admin-user alice --admin-user bob
# Can also use environment variable
export TORC_AUTH_FILE=/path/to/htpasswd
export TORC_ADMIN_USERS=alice,bob
torc-server run
Authentication Modes:
- No --auth-file: Authentication disabled, all requests allowed (default)
- --auth-file only: Authentication optional - authenticated requests are logged, unauthenticated requests allowed
- --auth-file --require-auth: Authentication required - unauthenticated requests are rejected
Access Control:
- --enforce-access-control: Users can only access workflows they own or have group access to
- --admin-user: Adds users to the admin group (can specify multiple times)
3. Server Logs
The server logs authentication events:
INFO torc_server: Loading htpasswd file from: /path/to/htpasswd
INFO torc_server: Loaded 3 users from htpasswd file
INFO torc_server: Authentication is REQUIRED for all requests
...
DEBUG torc::server::auth: User 'alice' authenticated successfully
WARN torc::server::auth: Authentication failed for user 'bob'
WARN torc::server::auth: Authentication required but no credentials provided
Client-Side Usage
Using Command-Line Flags
# Provide credentials via flags
torc --username alice --password mypassword workflows list
# Username via flag, password will be prompted
torc --username alice workflows list
Password: ****
# All commands support authentication
torc --username alice --password mypassword workflows create workflow.yaml
Using Environment Variables
# Set credentials in environment
export TORC_PASSWORD=mypassword
# Run commands without flags
torc workflows list
torc jobs list my-workflow-id
Mixed Approach
# Username from env, password prompted
torc workflows list
Password: ****
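To keep the password out of your shell history and the process list, you can also read it into TORC_PASSWORD interactively before running commands (the same pattern used in the tutorials later in this guide):
# Prompt for the password without echoing it, then export it for subsequent commands
read -s TORC_PASSWORD && export TORC_PASSWORD
torc workflows list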
Security Best Practices
1. Use HTTPS in Production
Basic authentication sends base64-encoded credentials (easily decoded). Always use HTTPS when authentication is enabled:
# Start server with HTTPS
torc-server run --https --auth-file /path/to/htpasswd --require-auth
# Client connects via HTTPS
torc --url https://torc.example.com/torc-service/v1 --username alice workflows list
2. Secure Credential Storage
Do:
- Store htpasswd files with restrictive permissions: chmod 600 /path/to/htpasswd
- Use environment variables for passwords in scripts
- Use password prompting for interactive sessions
- Rotate passwords periodically
Don't:
- Commit htpasswd files to version control
- Share htpasswd files between environments
- Pass passwords as command-line arguments in production (visible in process list)
- Use weak passwords or low bcrypt costs
3. Bcrypt Cost Factor
The cost factor determines password hashing strength:
- Cost 4-8: Fast but weaker (testing only)
- Cost 10-12: Balanced (default: 12)
- Cost 13-15: Strong (production systems)
- Cost 16+: Very strong (high-security environments)
# Use higher cost for production
torc-htpasswd add --file prod_htpasswd --cost 14 alice
4. Audit Logging
Monitor authentication events in server logs:
# Run server with debug logging for auth events
torc-server run --log-level debug --auth-file /path/to/htpasswd
# Or use RUST_LOG for granular control
RUST_LOG=torc::server::auth=debug torc-server run --auth-file /path/to/htpasswd
Common Workflows
Development Environment
# 1. Create test user
torc-htpasswd add --file dev_htpasswd --password devpass developer
# 2. Start server (auth optional)
torc-server run --auth-file dev_htpasswd --database dev.db
# 3. Use client without auth (still works)
torc workflows list
# 4. Or with auth
torc --username developer --password devpass workflows list
Production Deployment
# 1. Create production users with strong passwords and high cost
torc-htpasswd add --file /etc/torc/htpasswd --cost 14 alice
torc-htpasswd add --file /etc/torc/htpasswd --cost 14 bob
# 2. Secure the file
chmod 600 /etc/torc/htpasswd
chown torc-server:torc-server /etc/torc/htpasswd
# 3. Start server with required auth, access control, and HTTPS
torc-server run \
--https \
--auth-file /etc/torc/htpasswd \
--require-auth \
--enforce-access-control \
--admin-user alice \
--database /var/lib/torc/production.db
# 4. Clients must authenticate
torc --url https://torc.example.com/torc-service/v1 --prompt-password workflows list
Password: ****
CI/CD Pipeline
# Store credentials as CI secrets
# TORC_PASSWORD=<secure-password>
# Use in pipeline
export TORC_PASSWORD="${TORC_PASSWORD}"
export TORC_API_URL=https://torc.example.com/torc-service/v1
# Run workflow
WORKFLOW_ID=$(torc workflows create pipeline.yaml --format json | jq -r '.id')
torc workflows start "${WORKFLOW_ID}"
Remote User Registration (HPC Environments)
When users cannot directly access the server (e.g., HPC users connecting to a server they don't have
login access to), use the hash command to generate credentials:
User (on HPC):
# Generate password hash (username defaults to $USER)
torc-htpasswd hash
Password for 'alice':
Hashing password (cost=12)...
alice:$2b$12$xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Send the line above to your server administrator.
# Or specify a different username
torc-htpasswd hash myusername
The hash output (alice:$2b$12$...) can be safely sent to the server administrator via email,
Slack, or any other channel - the bcrypt hash cannot be used to recover the original password.
Administrator (on server):
# Append the user's hash line to the htpasswd file
echo "alice:\$2b\$12\$xxxxx..." >> /etc/torc/htpasswd
# Or manually edit the file and paste the line
vim /etc/torc/htpasswd
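As a quick sanity check, the administrator can confirm the new entry is present using the list subcommand shown earlier:
# Confirm the user now appears in the htpasswd file
torc-htpasswd list --file /etc/torc/htpasswd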
Notes:
- The password is entered on the user's machine and never transmitted in plaintext
- The bcrypt hash is safe to transmit - it can only verify passwords, not recover them
- Users can customize the cost factor with --cost if needed
- For scripting, use the --password flag (though less secure)
Migrating from No Auth to Required Auth
# 1. Start: No authentication
torc-server run --database prod.db
# 2. Add authentication file (optional mode)
torc-server run --auth-file /etc/torc/htpasswd --database prod.db
# 3. Monitor logs, ensure clients are authenticating
# Look for "User 'X' authenticated successfully" messages
# 4. Once all clients authenticate, enable required auth
torc-server run --auth-file /etc/torc/htpasswd --require-auth --database prod.db
Troubleshooting
"Authentication required but no credentials provided"
Cause: Server has --require-auth but client didn't send credentials.
Solution:
# Add username and password
torc --username alice --password mypass workflows list
"Authentication failed for user 'alice'"
Cause: Wrong password or user doesn't exist in htpasswd file.
Solutions:
# 1. Verify user exists
torc-htpasswd list --file /path/to/htpasswd
# 2. Verify password
torc-htpasswd verify --file /path/to/htpasswd alice
# 3. Reset password
torc-htpasswd add --file /path/to/htpasswd alice
"No credentials provided, allowing anonymous access"
Cause: Server has --auth-file but not --require-auth, and client didn't authenticate.
Solution: This is normal in optional auth mode. To require auth:
torc-server run --auth-file /path/to/htpasswd --require-auth
Password Prompting in Non-Interactive Sessions
Problem: Scripts or CI/CD fail waiting for password prompt.
Solutions:
# Use environment variable
export TORC_PASSWORD=mypassword
torc --username alice workflows list
# Or pass as flag (less secure - visible in process list)
torc --username alice --password mypassword workflows list
Advanced Topics
Multiple Environments
Maintain separate htpasswd files per environment:
# Development
torc-htpasswd add --file ~/.torc/dev_htpasswd --password devpass developer
# Staging
torc-htpasswd add --file /etc/torc/staging_htpasswd --cost 12 alice
# Production
torc-htpasswd add --file /etc/torc/prod_htpasswd --cost 14 alice
Programmatic Access
When using Torc's Rust, Python, or Julia clients programmatically:
Rust:
use torc::client::apis::configuration::Configuration;

let mut config = Configuration::new();
config.base_path = "http://localhost:8080/torc-service/v1".to_string();
config.basic_auth = Some(("alice".to_string(), Some("password".to_string())));
Python:
from torc import Configuration, ApiClient
config = Configuration(
host="http://localhost:8080/torc-service/v1",
username="alice",
password="password"
)
Julia:
using Torc
using Base64
import OpenAPI
client = OpenAPI.Clients.Client(
"http://localhost:8080/torc-service/v1";
headers = Dict("Authorization" => "Basic " * base64encode("alice:password"))
)
api = Torc.APIClient.DefaultApi(client)
Load Balancer Considerations
When running multiple Torc servers behind a load balancer:
- Share the same htpasswd file across all servers (via NFS, S3, etc.)
- Or use a configuration management tool to sync htpasswd files
- Monitor for htpasswd file changes and reload if needed
Access Groups
Torc supports team-based access control through access groups. This allows teams to share access to workflows while restricting access from other teams.
Overview
Access groups provide:
- Team-based access control - Share workflows with specific groups
- Role-based membership - Users can be members or admins of groups
- Multiple group membership - Users can belong to multiple groups
- Workflow sharing - Associate workflows with one or more groups
Concepts
Admin Group
The admin group is a special system group that controls who can create and manage access groups. Admin group membership is managed via server configuration, not through the CLI.
- Only members of the admin group can create, delete, or modify access groups
- The admin group is created automatically on server startup
- Admin users are specified via the --admin-user CLI flag or the admin_users config option
- The admin group cannot be deleted or have its membership modified via the API
Access Groups
An access group is a named collection of users who share access to workflows. Groups have:
- Name - A unique identifier for the group
- Description - Optional description of the group's purpose
Memberships
Users are added to groups as members. Each membership has:
- User name - The username being added
- Role - Either "member" or "admin" (for future use)
Workflow Access
Workflows can be associated with multiple groups. When a workflow is associated with a group, all members of that group can access the workflow.
Access Rules
Access to a workflow is granted if any of these conditions are met:
- Ownership - The user created the workflow
- Group membership - The user belongs to a group that has access to the workflow
CLI Commands
Group Management
Note: Creating, deleting, and modifying access groups requires admin access. Only users who are members of the admin group can perform these operations.
# Create a new group (admin only)
torc access-groups create "data-science" --description "Data science team"
# List all groups
torc access-groups list
# Get a specific group
torc access-groups get 1
# Delete a group (admin only)
torc access-groups delete 1
Membership Management
Note: Adding and removing users from groups requires admin access or group admin role.
# Add a user to a group (admin or group admin only)
torc access-groups add-user 1 alice --role member
# List members of a group
torc access-groups list-members 1
# Remove a user from a group (admin or group admin only)
torc access-groups remove-user 1 alice
# List groups a user belongs to
torc access-groups list-user-groups alice
Workflow Access
Note: Adding and removing workflows from groups requires workflow ownership or admin access.
# Add a workflow to a group (owner or admin only)
torc access-groups add-workflow 42 1
# List groups that have access to a workflow
torc access-groups list-workflow-groups 42
# Remove a workflow from a group (owner or admin only)
torc access-groups remove-workflow 42 1
Common Workflows
Setting Up a Team
As an admin user:
# 1. Create the team group (requires admin access)
torc access-groups create "ml-team" --description "Machine learning team"
# Output: Successfully created access group:
# ID: 1
# Name: ml-team
# Description: Machine learning team
# 2. Add team members (requires admin access)
torc access-groups add-user 1 alice
torc access-groups add-user 1 bob
Sharing a Workflow with a Team
# 1. Create a workflow (using any method)
torc workflows create examples/sample_workflow.yaml
# Output: Created workflow 42
# 2. Add the workflow to the team's group
torc access-groups add-workflow 42 1
# Now all members of ml-team (group 1) can access workflow 42
Multi-Team Access
A workflow can be shared with multiple teams:
# Share with data science team (group 1)
torc access-groups add-workflow 42 1
# Also share with DevOps team (group 2)
torc access-groups add-workflow 42 2
# Both teams can now access the workflow
Checking Group Membership
# List all members in a group
torc access-groups list-members 1
# List all groups a user belongs to
torc access-groups list-user-groups alice
# List all groups with access to a workflow
torc access-groups list-workflow-groups 42
JSON Output
All commands support JSON output format for scripting:
# List groups in JSON format
torc access-groups list --format json
# Get group details in JSON
torc access-groups get 1 --format json
Database Schema
Access groups use three tables:
access_group
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| name | TEXT | Unique group name |
| description | TEXT | Optional description |
| is_system | INTEGER | 1 if system group (cannot be deleted), 0 if not |
| created_at | TEXT | Timestamp of creation |
user_group_membership
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| user_name | TEXT | Username of the member |
| group_id | INTEGER | Foreign key to access_group |
| role | TEXT | Role in the group (member/admin) |
| created_at | TEXT | Timestamp of membership creation |
workflow_access_group
| Column | Type | Description |
|---|---|---|
| workflow_id | INTEGER | Foreign key to workflow |
| group_id | INTEGER | Foreign key to access_group |
| created_at | TEXT | Timestamp of association |
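For troubleshooting, these tables can be inspected directly. The following is a minimal sketch, assuming read access to the server's SQLite database file; it lists users who have group-based access to workflow 42 by joining the three tables above:
# Example only: list users with group-based access to workflow 42
sqlite3 /var/lib/torc/torc.db \
  "SELECT DISTINCT m.user_name, g.name AS group_name
   FROM workflow_access_group w
   JOIN access_group g ON g.id = w.group_id
   JOIN user_group_membership m ON m.group_id = w.group_id
   WHERE w.workflow_id = 42;"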
Enabling Access Control Enforcement
By default, access groups are not enforced - all authenticated users can access all workflows. To
enable enforcement, start the server with the --enforce-access-control flag:
torc-server run --enforce-access-control --auth-file /path/to/htpasswd
When enforcement is enabled:
- Users can only access workflows they own or have group access to
- Anonymous access is denied
- API requests to inaccessible workflows return a 403 Forbidden error
- Only admin group members can create and manage access groups
The enforcement setting can also be configured in the torc configuration file:
[server]
enforce_access_control = true
Configuring Admin Users
Admin users have permission to create, delete, and modify access groups. Configure admin users via:
CLI Flag
torc-server run --admin-user alice --admin-user bob --enforce-access-control
Environment Variable
export TORC_ADMIN_USERS="alice,bob"
torc-server run --enforce-access-control
Configuration File
[server]
admin_users = ["alice", "bob"]
enforce_access_control = true
On server startup, the admin group is automatically created or updated to include the configured users. The admin group is a system group that cannot be deleted or modified via the API.
Future Enhancements
- Group admin role - Users with the "admin" role in a group can manage that group's membership
Tutorial: Team-Based Access Control with Access Groups
This tutorial walks you through setting up team-based access control so that workflows can be shared within teams while remaining private from other users.
Learning Objectives
By the end of this tutorial, you will:
- Understand how Torc's access control system works
- Set up authentication with htpasswd files
- Create access groups and add team members
- Share workflows with specific teams
- Enable access control enforcement on the server
Prerequisites
- Torc server and CLI installed
- Basic familiarity with the command line
- Administrative access to start/restart the server
Scenario
You're setting up Torc for an organization with two teams:
- ML Team: Alice and Bob work on machine learning workflows
- Data Team: Carol and Dave work on data processing workflows
Each team should only be able to see and manage their own workflows, but some workflows may need to be shared between teams.
Step 1: Create an htpasswd File
First, create an htpasswd file with user credentials. Torc uses bcrypt-hashed passwords for security.
# Create the htpasswd directory
mkdir -p /etc/torc
# Add users using torc-htpasswd utility
torc-htpasswd add --file /etc/torc/htpasswd alice
# Enter password when prompted
torc-htpasswd add --file /etc/torc/htpasswd bob
torc-htpasswd add --file /etc/torc/htpasswd carol
torc-htpasswd add --file /etc/torc/htpasswd dave
Verify the file was created:
cat /etc/torc/htpasswd
Expected output (hashes will differ):
alice:$2b$12$LQv3c1yqBWVHxkd0LHAkCOYz6TtxMQJqhN82lfIptSLnyJqRZaJ/K
bob:$2b$12$...
carol:$2b$12$...
dave:$2b$12$...
Step 2: Start the Server with Authentication and Access Control
Start the server with authentication required, access control enforcement enabled, and Alice as an admin user:
torc-server run \
--database /var/lib/torc/torc.db \
--auth-file /etc/torc/htpasswd \
--require-auth \
--enforce-access-control \
--admin-user alice
You should see:
INFO Starting torc-server version=0.8.0 (abc1234)
INFO Loaded 4 users from htpasswd file
INFO Authentication is REQUIRED for all requests
INFO Access control is ENABLED - users can only access their own workflows and workflows shared via access groups
INFO Admin users configured: ["alice"]
INFO Listening on localhost:8080
Note: The --admin-user flag specifies users who can create and manage access groups. Only
admin users can create, delete, or modify groups.
Step 3: Configure CLI Authentication
Set up credentials for each user. In a new terminal:
# Set the API URL
export TORC_API_URL="http://localhost:8080/torc-service/v1"
# Set credentials for Alice
read -s TORC_PASSWORD && export TORC_PASSWORD
# It will wait for your password input without displaying it.
# Verify connection
torc ping
Expected output:
{ "status": "ok" }
Step 4: Create Access Groups
As Alice (who is an admin user), create the two team groups:
# Create the ML team group (requires admin access)
torc access-groups create "ml-team" --description "Machine Learning Team"
Output:
Successfully created access group:
ID: 1
Name: ml-team
Description: Machine Learning Team
# Create the Data team group
torc access-groups create "data-team" --description "Data Processing Team"
Output:
Successfully created access group:
ID: 2
Name: data-team
Description: Data Processing Team
List the groups to verify:
torc access-groups list
Output:
╭────┬────────────┬─────────────────────────╮
│ ID │ Name │ Description │
├────┼────────────┼─────────────────────────┤
│ 1 │ ml-team │ Machine Learning Team │
│ 2 │ data-team │ Data Processing Team │
╰────┴────────────┴─────────────────────────╯
Step 5: Add Team Members
Add users to their respective teams:
# Add Alice and Bob to the ML team
torc access-groups add-user 1 alice
torc access-groups add-user 1 bob
# Add Carol and Dave to the Data team
torc access-groups add-user 2 carol
torc access-groups add-user 2 dave
Verify team membership:
# List ML team members
torc access-groups list-members 1
Output:
╭───────────┬────────╮
│ User Name │ Role │
├───────────┼────────┤
│ alice │ member │
│ bob │ member │
╰───────────┴────────╯
# Check which groups Alice belongs to
torc access-groups list-user-groups alice
Output:
╭────┬─────────┬───────────────────────╮
│ ID │ Name │ Description │
├────┼─────────┼───────────────────────┤
│ 1 │ ml-team │ Machine Learning Team │
╰────┴─────────┴───────────────────────╯
Step 6: Create Workflows as Different Users
Now let's create workflows and see how access control works.
As Alice (ML Team)
export TORC_PASSWORD="alice_password"
# Create a workflow
cat > /tmp/ml_training.yaml << 'EOF'
name: ml_training_workflow
description: Train a machine learning model
jobs:
  - name: train_model
    command: echo "Training model..."
    resource_requirements: small
resource_requirements:
  - name: small
    num_cpus: 1
    memory: 1g
    runtime: PT10M
EOF
WORKFLOW_ID=$(torc workflows create /tmp/ml_training.yaml -f json | jq -r '.id')
echo "Alice created workflow: $WORKFLOW_ID"
As Carol (Data Team)
export TORC_PASSWORD="carol_password"
# Create a different workflow
cat > /tmp/data_pipeline.yaml << 'EOF'
name: data_pipeline
description: Process incoming data
jobs:
  - name: process_data
    command: echo "Processing data..."
    resource_requirements: small
resource_requirements:
  - name: small
    num_cpus: 1
    memory: 1g
    runtime: PT10M
EOF
torc workflows create /tmp/data_pipeline.yaml
Step 7: Observe Access Control in Action
Carol Cannot Access Alice's Workflow
Still as Carol, try to access Alice's workflow:
# Try to get Alice's workflow (assuming ID 1)
torc workflows get 1
Output:
{
"error": "Forbidden",
"message": "User 'carol' does not have access to workflow 1"
}
Carol Can Only See Her Own Workflows
torc workflows list
Output:
╭────┬───────────────┬─────────────────────────┬───────╮
│ ID │ Name │ Description │ User │
├────┼───────────────┼─────────────────────────┼───────┤
│ 2 │ data_pipeline │ Process incoming data │ carol │
╰────┴───────────────┴─────────────────────────┴───────╯
Carol only sees her own workflow, not Alice's.
Step 8: Share a Workflow with Another Team
Sometimes workflows need to be shared between teams. Alice can share her workflow with the Data team.
As Alice, Share the Workflow
export TORC_PASSWORD="alice_password"
# Share workflow 1 with the data team (group 2)
torc access-groups add-workflow 1 2
echo "Shared workflow 1 with data-team"
Verify the Sharing
# List groups that have access to workflow 1
torc access-groups list-workflow-groups 1
Output:
╭────┬────────────┬─────────────────────────╮
│ ID │ Name │ Description │
├────┼────────────┼─────────────────────────┤
│ 2 │ data-team │ Data Processing Team │
╰────┴────────────┴─────────────────────────╯
Carol Can Now Access the Shared Workflow
export TORC_PASSWORD="carol_password"
# Now Carol can access the workflow
torc workflows get 1
Output:
╭────────────────────────────────────────┬────────────────────────────╮
│ Field │ Value │
├────────────────────────────────────────┼────────────────────────────┤
│ ID │ 1 │
│ Name │ ml_training_workflow │
│ User │ alice │
│ Description │ Train a machine learning │
│ │ model │
╰────────────────────────────────────────┴────────────────────────────╯
Carol can now see and interact with Alice's workflow because she's a member of the data-team, which has been granted access.
Step 9: Revoke Access
If you need to remove access:
export TORC_PASSWORD="alice_password"
# Remove the data team's access to workflow 1
torc access-groups remove-workflow 1 2
echo "Revoked data-team access to workflow 1"
Now Carol can no longer access the workflow.
Access Control Summary
Here's how access is determined:
Can user access workflow?
├── Is user the workflow owner? → YES → ALLOWED
├── Is user in a group with access to this workflow? → YES → ALLOWED
└── Otherwise → DENIED
Access Rules
- Ownership: Users always have access to workflows they created
- Group Membership: Users have access to workflows shared with any group they belong to
- No Inheritance: Access is explicit—being in one group doesn't grant access to another group's workflows
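You can verify these rules for a specific user or workflow with the same commands used earlier in this tutorial:
# Which groups does the user belong to?
torc access-groups list-user-groups carol
# Which groups have access to the workflow?
torc access-groups list-workflow-groups 1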
Configuration Reference
Server Flags
| Flag | Description |
|---|---|
| --auth-file | Path to htpasswd file |
| --require-auth | Require authentication for all requests |
| --enforce-access-control | Enable access control enforcement |
| --admin-user | Add user to admin group (can be repeated) |
Configuration File
You can also configure these in config.toml:
[server]
auth_file = "/etc/torc/htpasswd"
require_auth = true
enforce_access_control = true
admin_users = ["alice", "bob"]
Troubleshooting
"Anonymous access not allowed"
This error appears when:
- No credentials are provided
- --require-auth is enabled
Solution: Set the TORC_PASSWORD environment variable.
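For example, using the same pattern as in Step 3:
# Prompt for the password without echoing it, then export it
read -s TORC_PASSWORD && export TORC_PASSWORD
torc workflows list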
"User is not a system administrator"
This error appears when trying to create, delete, or modify access groups without admin privileges.
Solution: Either:
- Add the user to the admin group in the server configuration using --admin-user or admin_users in config.toml
- Use an account that is already an admin
"User does not have access to workflow"
This error appears when:
- The user is not the workflow owner
- The user is not in any group with access to the workflow
- --enforce-access-control is enabled
Solution: Either the workflow owner needs to share it with a group the user belongs to, or add the user to an appropriate group.
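For example (IDs and usernames are placeholders):
# Option 1: the owner shares the workflow with a group the user belongs to
torc access-groups add-workflow <workflow_id> <group_id>
# Option 2: an admin adds the user to a group that already has access
torc access-groups add-user <group_id> <username> --role member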
Authentication Working but Access Control Not Enforced
Check that --enforce-access-control flag is set when starting the server.
What You Learned
In this tutorial, you learned:
- How to create an htpasswd file with user credentials
- How to start the server with authentication and access control
- How to create and manage access groups
- How to add users to groups
- How to share workflows with teams
- How access control decisions are made
Next Steps
- Learn about Configuration Files to set up persistent configuration
- Explore Server Deployment for production setups
- See the Access Groups Reference for all available commands
Security Reference
This document describes Torc's security features, threat model, and best practices for secure deployments.
Authentication & Authorization
HTTP Basic Authentication
Torc uses HTTP Basic authentication with bcrypt password hashing.
Security Properties:
- ✅ Industry-standard authentication method
- ✅ Bcrypt hashing with configurable work factor (cost 4-31)
- ✅ No plaintext password storage
- ✅ Salt automatically generated per password
- ⚠️ Credentials sent base64-encoded (requires HTTPS)
Architecture:
Client Request
↓
[Authorization: Basic base64(username:password)]
↓
Server Middleware
↓
Extract credentials → Verify against htpasswd file (bcrypt)
↓
Success: Add username to request context
Failure: Return None authorization (logged)
↓
API Handler (receives authorization context)
Authentication Modes
| Mode | Configuration | Behavior |
|---|---|---|
| Disabled | No --auth-file | All requests allowed, no authentication |
| Optional | --auth-file only | Valid credentials logged, invalid/missing allowed |
| Required | --auth-file --require-auth | Invalid/missing credentials rejected |
Recommendation: Use Required mode in production.
Transport Security
HTTPS/TLS
When to use HTTPS:
- ✅ Always when authentication is enabled
- ✅ When transmitting sensitive workflow data
- ✅ Over untrusted networks (internet, shared networks)
- ✅ Compliance requirements (PCI-DSS, HIPAA, etc.)
Configuration:
# Server
torc-server run --https --auth-file /etc/torc/htpasswd
# Client
torc --url https://torc.example.com/torc-service/v1 workflows list
TLS Version: Torc uses the system's OpenSSL/native-tls library. Ensure:
- TLS 1.2 minimum (TLS 1.3 preferred)
- Strong cipher suites enabled
- Valid certificates from trusted CA
Network Security
Deployment Patterns:
Pattern 1: Internal Network Only
[Torc Clients] ←→ [Torc Server]
(Trusted internal network)
- May use HTTP if network is truly isolated
- Still recommend HTTPS for defense in depth
Pattern 2: Load Balancer with TLS Termination
[Torc Clients] ←HTTPS→ [Load Balancer] ←HTTP→ [Torc Server]
(Internet) (Internal trusted network)
- TLS terminates at load balancer
- Internal traffic may use HTTP
- Ensure load balancer validates certificates
Pattern 3: End-to-End TLS
[Torc Clients] ←HTTPS→ [Torc Server]
(Internet or untrusted network)
- Most secure pattern
- TLS all the way to Torc server
- Required for compliance scenarios
Internal Network Quick Setup
For internal, trusted networks where the primary goal is preventing accidental access rather than defending against malicious attacks, use this optimized configuration:
# 1. Create htpasswd with lower cost factor (faster auth)
torc-htpasswd add --cost 8 /etc/torc/htpasswd alice
torc-htpasswd add --cost 8 /etc/torc/htpasswd bob
# 2. Run server with auth and access control
torc-server run \
--auth-file /etc/torc/htpasswd \
--require-auth \
--enforce-access-control \
--credential-cache-ttl-secs 60
This provides:
- User isolation (users only see their own workflows)
- Fast CLI response (~10ms first auth, <1ms cached)
- Protection against accidental modifications
- Simple username/password authentication
For higher security requirements (internet-facing, compliance, sensitive data), use the default cost factor (12) and HTTPS.
Credential Management
Password Requirements
Recommendations:
- Minimum 12 characters
- Mix of uppercase, lowercase, numbers, symbols
- No dictionary words or common patterns
- Unique per user and environment
Bcrypt Cost Factor:
| Cost | Approx Time | Use Case |
|---|---|---|
| 4 | ~1ms | Testing only |
| 8 | ~10ms | Internal trusted networks |
| 10 | ~50ms | Low-security internal |
| 12 | ~250ms | Default, general use, internet-facing |
| 14 | ~1s | High security, sensitive data |
| 15+ | > 2s | Maximum security, infrequent logins |
Cost Selection Criteria:
- Higher cost = more CPU, slower login
- Balance security vs. user experience
- Consider attack surface (internet-facing vs. internal)
- For internal networks with trusted users, cost 8 provides reasonable security with fast response
Creating passwords with a specific cost:
# Default cost (12) - good for most deployments
torc-htpasswd add /etc/torc/htpasswd alice
# Lower cost (8) - faster, suitable for internal networks
torc-htpasswd add --cost 8 /etc/torc/htpasswd alice
# Higher cost (14) - slower, for high-security environments
torc-htpasswd add --cost 14 /etc/torc/htpasswd alice
Credential Caching
To improve CLI responsiveness, the server caches successful authentications for a configurable duration (default: 60 seconds). This avoids repeated bcrypt verification for the same credentials.
Configuration:
# CLI option
torc-server run --credential-cache-ttl-secs 60
# Environment variable
export TORC_CREDENTIAL_CACHE_TTL_SECS=60
# Config file (torc.toml)
[server]
credential_cache_ttl_secs = 60
Performance impact:
| Scenario | Response Time |
|---|---|
| First auth (cost 12) | ~250ms |
| First auth (cost 8) | ~10ms |
| Cached auth (any cost) | < 1ms |
Security notes:
- Passwords are never stored in plaintext in the cache
- Cache keys are SHA-256 hashes of credentials
- Cache entries auto-expire after TTL
- Failed authentications are never cached
- Set --credential-cache-ttl-secs 0 to disable caching (see the example below)
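For deployments where even short-lived caching is undesirable, the cache can be switched off entirely using the flags documented above:
# Disable credential caching; every request performs a full bcrypt verification
torc-server run --credential-cache-ttl-secs 0 --auth-file /etc/torc/htpasswd --require-auth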
Htpasswd File Security
File Permissions:
# Restrict to server process owner only
chmod 600 /etc/torc/htpasswd
chown torc-server:torc-server /etc/torc/htpasswd
Storage Best Practices:
- ❌ Never commit to version control
- ❌ Never share between environments
- ✅ Store in secure configuration management (Ansible Vault, HashiCorp Vault)
- ✅ Backup with encryption
- ✅ Rotate regularly (quarterly recommended)
File Format Security:
# Comments allowed
username:$2b$12$hash...
- Only bcrypt hashes accepted ($2a$, $2b$, or $2y$)
- No plaintext passwords
- No MD5, SHA-1, or weak hashes
Client Credential Storage
Best Practices:
| Method | Security | Use Case |
|---|---|---|
| Environment variables | ⭐⭐⭐ | Scripts, automation, CI/CD |
| Password prompt | ⭐⭐⭐⭐⭐ | Interactive sessions |
| Config files | ⭐ | Not recommended |
| Command-line args | ⚠️ | Visible in process list, avoid |
Examples:
# Good: Environment variables
read -s TORC_PASSWORD && export TORC_PASSWORD
torc workflows list
# Good: Password prompt
torc --username alice workflows list
Password: ****
# Acceptable: CI/CD with secrets
TORC_PASSWORD=${{ secrets.TORC_PASSWORD }} torc workflows create
# Bad: Command-line argument (visible in `ps`)
torc --password mypassword workflows list
Threat Model
Threats Mitigated
| Threat | Mitigation | Effectiveness |
|---|---|---|
| Unauthorized API access | Required authentication | ✅ High |
| Credential stuffing | Bcrypt work factor, rate limiting | ✅ Medium-High |
| Password cracking | Bcrypt (cost ≥12) | ✅ High |
| Man-in-the-middle | HTTPS/TLS | ✅ High |
| Credential theft (database) | No plaintext storage, bcrypt | ✅ High |
Threats Not Mitigated
| Threat | Impact | Recommendation |
|---|---|---|
| DDoS attacks | High | Use rate limiting, firewalls, CDN |
| SQL injection | Medium | Use parameterized queries (Torc does) |
| Insider threats | High | Audit logging, least privilege |
| Compromised client | High | Network segmentation, monitoring |
| Side-channel attacks | Low | Constant-time operations (bcrypt does) |
Attack Scenarios
Scenario 1: Compromised htpasswd file
Impact: Attacker has password hashes
Risk: Medium - Bcrypt makes cracking difficult
Mitigation:
- Immediately revoke all user accounts
- Generate new htpasswd file with fresh passwords
- Investigate how file was compromised
- Increase bcrypt cost if needed
Scenario 2: Leaked credentials in logs
Impact: Credentials in plaintext in logs
Risk: High
Prevention:
- Never log passwords
- Sanitize logs before sharing
- Restrict log access
Response:
- Rotate affected credentials immediately
- Audit all log access
- Review code for password logging
Scenario 3: Network eavesdropping (HTTP)
Impact: Credentials intercepted in transit
Risk: Critical over untrusted networks
Prevention:
- Always use HTTPS when authentication is enabled
- Especially critical for internet-facing deployments
Response:
- Enable HTTPS immediately
- Rotate all credentials (assume compromised)
- Review access logs for suspicious activity
Audit & Monitoring
Authentication Events
Server logs authentication events:
# Successful authentication
DEBUG torc::server::auth: User 'alice' authenticated successfully
# Failed authentication (wrong password)
WARN torc::server::auth: Authentication failed for user 'alice'
# Missing credentials when required
WARN torc::server::auth: Authentication required but no credentials provided
# No authentication configured
DEBUG torc::server::auth: No authentication configured, allowing request
Recommended Monitoring
Metrics to track:
- Failed authentication attempts (per user, total)
- Successful authentications (per user)
- Requests without credentials (when auth enabled)
- Unusual access patterns (time, volume, endpoints)
Alerting thresholds:
- 5+ failed attempts from same user in 5 minutes
- 100+ failed attempts total in 1 hour
- Authentication from unexpected IP ranges
- Access during unusual hours (if applicable)
Log aggregation:
# Collect auth events
grep "torc::server::auth" /var/log/torc-server.log
# Count failed attempts per user
grep "Authentication failed" /var/log/torc-server.log | \
awk '{print $(NF)}' | sort | uniq -c
# Monitor in real-time
tail -f /var/log/torc-server.log | grep "WARN.*auth"
Compliance Considerations
GDPR / Privacy
User data in htpasswd:
- Usernames may be personal data (email addresses)
- Password hashes are not personal data (irreversible)
Recommendations:
- Allow users to request account deletion
- Don't use email addresses as usernames (use aliases)
- Document data retention policies
PCI-DSS / SOC2
Requirements that apply:
- Transport encryption: Use HTTPS
- Access control: Enable required authentication
- Password complexity: Enforce strong passwords
- Audit logging: Enable and monitor auth logs
- Regular reviews: Audit user accounts quarterly
Configuration:
# PCI-DSS compliant setup
torc-server run \
--https \
--auth-file /etc/torc/htpasswd \
--require-auth \
--log-level info
Security Checklist
Server Deployment
- HTTPS enabled in production
- Strong TLS configuration (TLS 1.2+, strong ciphers)
- Valid certificate from trusted CA
- Required authentication enabled (--require-auth)
- Htpasswd file permissions: chmod 600
- Htpasswd file owned by server process user
- Bcrypt cost ≥ 12 (≥14 for high-security)
- Strong passwords enforced
- Audit logging enabled
- Log rotation configured
- Firewall rules limit access
- Server runs as non-root user
- Regular security updates applied
Client Usage
- HTTPS URLs used when auth enabled
- Credentials stored in environment variables (not command-line)
- Passwords not logged
- Passwords not committed to version control
- Password prompting used for interactive sessions
- CI/CD secrets used for automation
- Regular password rotation
Operational
- User accounts reviewed quarterly
- Inactive accounts disabled/removed
- Failed login attempts monitored
- Access logs reviewed regularly
- Incident response plan documented
- Backup htpasswd files encrypted
- Disaster recovery tested
Access Control
Admin Group
Torc uses a special admin group to control who can manage access groups. Key properties:
- Config-driven: Admin users are configured via the --admin-user flag or the admin_users config option
- Auto-created: The admin group is created/synced automatically on server startup
- System group: Cannot be deleted or have membership modified via the API
- Required for management: Only admin members can create, delete, or modify access groups
Configuration:
# Via CLI
torc-server run --admin-user alice --admin-user bob --enforce-access-control
# Via config file
[server]
admin_users = ["alice", "bob"]
enforce_access_control = true
Access Group Permissions
| Operation | Required Permission |
|---|---|
| Create group | System admin |
| Delete group | System admin (non-system groups only) |
| Add user to group | System admin or group admin |
| Remove user from group | System admin or group admin |
| Add workflow to group | Workflow owner or group admin |
| Remove workflow from group | Workflow owner or group admin |
| List groups | Any authenticated user |
| Get group details | Any authenticated user |
Future Enhancements
Planned security features:
- Token-based authentication: JWT/OAuth2 support
- API keys: Long-lived tokens for automation
- LDAP/Active Directory integration: Enterprise SSO
- Rate limiting: Prevent brute force attacks
- 2FA/MFA support: Multi-factor authentication
- Session management: Token expiration, refresh
- Audit trail: Detailed access logging
Resources
- OWASP Authentication Cheat Sheet
- bcrypt Wikipedia
- HTTP Basic Authentication RFC 7617
- NIST Password Guidelines
Tools & Integrations
Additional tools and third-party integrations.
- Workflow Wizard - Interactive workflow creation
- Configuration Files Tutorial - Working with config files
- Dashboard Deployment - Deploying the web dashboard
- Configuring AI Assistants - Setting up AI integration
- AI-Assisted Workflow Management - Using AI for workflow management
- Map Python Functions Across Workers - Python integration
- Filtering CLI Output with Nushell - Advanced CLI usage
- Shell Completions - Tab completion setup
Creating Workflows with the Dashboard Wizard
This tutorial walks you through creating a workflow using the interactive wizard in the Torc dashboard. The wizard provides a guided, step-by-step interface for building workflows without writing YAML or JSON files.
Learning Objectives
By the end of this tutorial, you will:
- Create a multi-job workflow using the dashboard wizard
- Define job dependencies visually
- Configure Slurm schedulers for HPC execution
- Set up workflow actions to automatically schedule nodes
- Understand how the wizard generates workflow specifications
Prerequisites
- Torc dashboard running (see Dashboard Deployment)
- Basic understanding of Torc workflows
Overview
The workflow wizard guides you through five steps:
- Basics - Workflow name and description
- Jobs - Define computational tasks
- Schedulers - Configure Slurm schedulers (optional)
- Actions - Set up automatic node scheduling (optional)
- Review - Preview and create the workflow
Step 1: Open the Create Workflow Modal
- Open the Torc dashboard in your browser
- Click the Create Workflow button in the top-right corner
- Select the Wizard tab at the top of the modal
You'll see the wizard interface with step indicators showing your progress.
Step 2: Configure Basics
Enter the basic workflow information:
- Workflow Name (required): A unique identifier for your workflow (e.g., data-pipeline)
- Description (optional): A brief description of what the workflow does
Click Next to proceed.
Step 3: Add Jobs
This is where you define the computational tasks in your workflow.
Adding Your First Job
- Click + Add Job
- Fill in the job details:
- Job Name: A unique name (e.g., preprocess)
- Command: The shell command to execute (e.g., python preprocess.py)
Setting Dependencies
The Blocked By field lets you specify which jobs must complete before this job can run:
- Click the Blocked By dropdown
- Select one or more jobs that must complete first
- Hold Ctrl/Cmd to select multiple jobs
Configuring Resources
Choose a resource preset or customize:
- Small: 1 CPU, 1GB memory
- Medium: 8 CPUs, 50GB memory
- GPU: 1 CPU, 10GB memory, 1 GPU
- Custom: Specify exact requirements
Example: Three-Job Pipeline
Let's create a simple pipeline:
Job 1: preprocess
- Name: preprocess
- Command: echo "Preprocessing..." && sleep 5
- Blocked By: (none - this runs first)
- Resources: Small
Job 2: analyze
- Name: analyze
- Command: echo "Analyzing..." && sleep 10
- Blocked By: preprocess
- Resources: Medium
Job 3: report
- Name: report
- Command: echo "Generating report..." && sleep 3
- Blocked By: analyze
- Resources: Small
Click Next when all jobs are configured.
Step 4: Configure Schedulers (Optional)
If you're running on an HPC system with Slurm, you can define scheduler configurations here. Skip this step for local execution.
Adding a Scheduler
- Click + Add Scheduler
- Fill in the required fields:
- Scheduler Name: A reference name (e.g., compute_scheduler)
- Account: Your Slurm account name
- Configure optional settings:
- Nodes: Number of nodes to request
- Wall Time: Maximum runtime (HH:MM:SS format)
- Partition: Slurm partition name
- QoS: Quality of service level
- GRES: GPU resources (e.g., gpu:2)
- Memory: Memory per node (e.g., 64G)
- Temp Storage: Local scratch space
- Extra Slurm Options: Additional sbatch flags
Example: Basic Compute Scheduler
- Scheduler Name: compute
- Account: my_project
- Nodes: 1
- Wall Time: 02:00:00
- Partition: standard
Assigning Jobs to Schedulers
After defining schedulers, you can assign jobs to them:
- Go back to the Jobs step (click Back)
- In each job card, find the Scheduler dropdown
- Select the scheduler to use for that job
Jobs without a scheduler assigned will run locally.
Click Next when scheduler configuration is complete.
Step 5: Configure Actions (Optional)
Actions automatically schedule Slurm nodes when certain events occur. This is useful for dynamic resource allocation.
Trigger Types
- When workflow starts: Schedule nodes immediately when the workflow begins
- When jobs become ready: Schedule nodes when specific jobs are ready to run
- When jobs complete: Schedule nodes after specific jobs finish
Adding an Action
- Click + Add Action
- Select the Trigger type
- Select the Scheduler to use
- For job-based triggers, select which Jobs trigger the action
- Set the Number of Allocations (how many Slurm jobs to submit)
Example: Stage-Based Scheduling
For a workflow with setup, compute, and finalize stages:
Action 1: Setup Stage
- Trigger: When workflow starts
- Scheduler: setup_scheduler
- Allocations: 1
Action 2: Compute Stage
- Trigger: When jobs become ready
- Jobs: compute_job1, compute_job2, compute_job3
- Scheduler: compute_scheduler
- Allocations: 3
Action 3: Finalize Stage
- Trigger: When jobs become ready
- Jobs: finalize
- Scheduler: finalize_scheduler
- Allocations: 1
Click Next to proceed to review.
Step 6: Review and Create
The review step shows the generated workflow specification in JSON format. This is exactly what will be submitted to the server.
Reviewing the Spec
Examine the generated specification to verify:
- All jobs are included with correct names and commands
- Dependencies (depends_on) match your intended workflow structure
- Resource requirements are correctly assigned
- Schedulers have the right configuration
- Actions trigger on the expected events
Creating the Workflow
- Review the Options below the wizard:
- Initialize workflow after creation: Builds the dependency graph (recommended)
- Run workflow immediately: Starts execution right away
- Click Create to submit the workflow
If successful, you'll see a success notification and the workflow will appear in your workflow list.
Example: Complete Diamond Workflow
Here's how to create a diamond-pattern workflow using the wizard:
preprocess
/ \
work1 work2
\ /
postprocess
Jobs Configuration
| Job | Command | Blocked By | Resources |
|---|---|---|---|
| preprocess | ./preprocess.sh | (none) | Small |
| work1 | ./work1.sh | preprocess | Medium |
| work2 | ./work2.sh | preprocess | Medium |
| postprocess | ./postprocess.sh | work1, work2 | Small |
Generated Spec Preview
The wizard generates a spec like this:
{
"name": "diamond-workflow",
"description": "Fan-out and fan-in example",
"jobs": [
{
"name": "preprocess",
"command": "./preprocess.sh",
"resource_requirements": "res_1cpu_1g"
},
{
"name": "work1",
"command": "./work1.sh",
"depends_on": ["preprocess"],
"resource_requirements": "res_8cpu_50g"
},
{
"name": "work2",
"command": "./work2.sh",
"depends_on": ["preprocess"],
"resource_requirements": "res_8cpu_50g"
},
{
"name": "postprocess",
"command": "./postprocess.sh",
"depends_on": ["work1", "work2"],
"resource_requirements": "res_1cpu_1g"
}
],
"resource_requirements": [
{"name": "res_1cpu_1g", "num_cpus": 1, "memory": "1g", "num_gpus": 0, "num_nodes": 1, "runtime": "PT1H"},
{"name": "res_8cpu_50g", "num_cpus": 8, "memory": "50g", "num_gpus": 0, "num_nodes": 1, "runtime": "PT1H"}
]
}
Using Parameterized Jobs
The wizard supports job parameterization for creating multiple similar jobs:
- In a job card, find the Parameters field
- Enter parameters in the format:
param_name: "value_spec"
Parameter Formats
- Range: i: "1:10" creates jobs for i=1,2,3,...,10
- Range with step: i: "0:100:10" creates jobs for i=0,10,20,...,100
- List: dataset: "['train', 'test', 'validation']"
Example: Parameterized Processing
- Job Name: process_{i}
- Command: python process.py --index {i}
- Parameters: i: "1:5"
This creates 5 jobs: process_1 through process_5.
Tips and Best Practices
Job Naming
- Use descriptive, unique names
- Avoid spaces and special characters
- For parameterized jobs, include the parameter in the name (e.g., job_{i})
Dependencies
- Keep dependency chains as short as possible
- Use the fan-out/fan-in pattern for parallelism
- Avoid circular dependencies (the server will reject them)
Schedulers
- Create separate schedulers for different resource needs
- Use descriptive names that indicate the scheduler's purpose
- Set realistic wall times to avoid queue priority penalties
Actions
- Use on_workflow_start for initial resource allocation
- Use on_jobs_ready for just-in-time scheduling
- Match allocations to the number of parallel jobs
What You Learned
In this tutorial, you learned:
- How to navigate the five-step workflow wizard
- How to create jobs with commands, dependencies, and resources
- How to configure Slurm schedulers for HPC execution
- How to set up actions for automatic node scheduling
- How the wizard generates workflow specifications
Next Steps
- Diamond Workflow - Learn about file-based implicit dependencies
- Simple Parameterization - Create parameter sweeps programmatically
- Working with Slurm - Advanced Slurm configuration
Configuration Files Tutorial
This tutorial walks you through setting up Torc configuration files to customize your workflows without specifying options on every command.
What You'll Learn
- How to create a configuration file
- Configuration file locations and priority
- Using environment variables for configuration
- Common configuration patterns
Prerequisites
- Torc CLI installed
- Basic familiarity with TOML format
Step 1: Check Current Configuration
First, let's see what configuration Torc is using:
torc config paths
Output:
Configuration file paths (in priority order):
System: /etc/torc/config.toml (not found)
User: ~/.config/torc/config.toml (not found)
Local: torc.toml (not found)
Environment variables (highest priority):
Use double underscore (__) to separate nested keys:
TORC_CLIENT__API_URL, TORC_CLIENT__FORMAT, TORC_SERVER__PORT, etc.
No configuration files found. Run 'torc config init --user' to create one.
View the effective configuration (defaults):
torc config show
Step 2: Create a User Configuration File
Create a configuration file in your home directory that applies to all your Torc usage:
torc config init --user
This creates ~/.config/torc/config.toml (Linux/macOS) or the equivalent on your platform.
Step 3: Edit the Configuration
Open the configuration file in your editor:
# Linux/macOS
$EDITOR ~/.config/torc/config.toml
# Or find the path
torc config paths
Here's a typical user configuration:
[client]
# Connect to your team's Torc server
api_url = "http://torc-server.internal:8080/torc-service/v1"
# Default to JSON output for scripting
format = "json"
# Enable debug logging
log_level = "debug"
# Username for authentication
username = "alice"
[client.run]
# Default poll interval for local runs
poll_interval = 10.0
# Default output directory
output_dir = "workflow_output"
# Resource limits for local execution
num_cpus = 8
memory_gb = 32.0
num_gpus = 1
Step 4: Validate Your Configuration
After editing, validate the configuration:
torc config validate
Output:
Validating configuration...
Loading configuration from:
- /home/alice/.config/torc/config.toml
Configuration is valid.
Key settings:
client.api_url = http://torc-server.internal:8080/torc-service/v1
client.format = json
server.port = 8080
dash.port = 8090
Step 5: Create a Project-Local Configuration
For project-specific settings, create a torc.toml in your project directory:
cd ~/myproject
torc config init --local
Edit torc.toml:
[client]
# Project-specific server (overrides user config)
api_url = "http://localhost:8080/torc-service/v1"
[client.run]
# Project-specific output directory
output_dir = "results"
# This project needs more memory
memory_gb = 64.0
Step 6: Understanding Priority
Configuration sources are loaded in this order (later sources override earlier):
- Built-in defaults (lowest priority)
- System config (/etc/torc/config.toml)
- User config (~/.config/torc/config.toml)
- Project-local config (./torc.toml)
- Environment variables (TORC_*)
- CLI arguments (highest priority)
Example: If you have api_url set in your user config but run:
torc --url http://other-server:8080/torc-service/v1 workflows list
The CLI argument takes precedence.
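To see which value wins, you can check the effective configuration after setting an override (the staging URL below is just a placeholder):
# The environment variable overrides the config files for this shell session
export TORC_CLIENT__API_URL="http://staging:8080/torc-service/v1"
torc config show | grep api_url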
Step 7: Using Environment Variables
Environment variables are useful for CI/CD pipelines and temporary overrides.
Use double underscore (__) to separate nested keys:
# Override client.api_url
export TORC_CLIENT__API_URL="http://ci-server:8080/torc-service/v1"
# Override client.format
export TORC_CLIENT__FORMAT="json"
# Override server.port
export TORC_SERVER__PORT="9999"
# Verify
torc config show | grep api_url
Step 8: Server Configuration
If you're running torc-server, you can configure it too:
[server]
# Bind to all interfaces
url = "0.0.0.0"
port = 8080
# Use 4 worker threads
threads = 4
# Database location
database = "/var/lib/torc/torc.db"
# Authentication
auth_file = "/etc/torc/htpasswd"
require_auth = true
# Background job processing interval
completion_check_interval_secs = 30.0
# Log level
log_level = "info"
[server.logging]
# Enable file logging
log_dir = "/var/log/torc"
json_logs = true
Step 9: Dashboard Configuration
Configure torc-dash:
[dash]
# Bind address
host = "0.0.0.0"
port = 8090
# API server to connect to
api_url = "http://localhost:8080/torc-service/v1"
# Standalone mode settings
standalone = false
Common Configuration Patterns
Development Setup
# ~/.config/torc/config.toml
[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table"
log_level = "debug"
[client.run]
poll_interval = 2.0
output_dir = "output"
Production Server
# /etc/torc/config.toml
[server]
url = "0.0.0.0"
port = 8080
threads = 8
database = "/var/lib/torc/production.db"
require_auth = true
auth_file = "/etc/torc/htpasswd"
completion_check_interval_secs = 30.0
log_level = "info"
[server.logging]
log_dir = "/var/log/torc"
json_logs = true
CI/CD Pipeline
# In CI script
export TORC_CLIENT__API_URL="${CI_TORC_SERVER_URL}"
export TORC_CLIENT__FORMAT="json"
torc run workflow.yaml
Troubleshooting
Configuration Not Loading
Check which files are being loaded:
torc config validate
Environment Variables Not Working
Remember to use double underscore (__) for nesting:
# Correct
TORC_CLIENT__API_URL=http://...
# Wrong (single underscore)
TORC_CLIENT_API_URL=http://...
View Effective Configuration
See the merged result of all configuration sources:
torc config show
Next Steps
- See the Configuration Reference for all available options
- Learn about Server Deployment for production setups
- Set up Authentication for secure access
Dashboard Deployment Tutorial
This tutorial covers three common deployment scenarios for the Torc web dashboard (torc-dash).
Each scenario addresses different environments and use cases.
Prefer the terminal? If you work primarily in SSH sessions or terminal environments, consider using the Terminal UI (TUI) instead. The TUI provides the same workflow and job management capabilities without requiring a web browser or SSH tunnels.
Overview of Deployment Scenarios
| Scenario | Environment | Use Case |
|---|---|---|
| 1. Standalone | Local computer | Single-computer workflows, development, testing |
| 2. All-in-One Login Node | HPC login node | Small HPC workflows (< 100 jobs) |
| 3. Shared Server | HPC login node + dedicated server | Large-scale multi-user HPC workflows |
Prerequisites
Before starting, ensure you have:
- Built Torc binaries (see Installation): cargo build --release --workspace
- Added binaries to PATH: export PATH="$PATH:/path/to/torc/target/release"
- Initialized the database (if not using standalone mode): sqlx database setup
Scenario 1: Local Development (Standalone Mode)
Best for: Single-computer workflows on your laptop or workstation. Also ideal for development, testing, and learning Torc.
This is the simplest setup - everything runs on one machine with a single command. Use this when you want to run workflows entirely on your local computer without HPC resources.
Architecture
flowchart TB
subgraph computer["Your Computer"]
browser["Browser"]
dash["torc-dash<br/>(web UI)"]
server["torc-server<br/>(managed)"]
cli["torc CLI"]
db[("SQLite DB")]
browser --> dash
dash -->|"HTTP API"| server
dash -->|"executes"| cli
cli -->|"HTTP API"| server
server --> db
end
Setup
Step 1: Start the dashboard in standalone mode
torc-dash --standalone
This single command:
- Automatically starts torc-server on a free port
- Starts the dashboard on http://127.0.0.1:8090
- Configures the dashboard to connect to the managed server
Step 2: Open your browser
Navigate to http://localhost:8090
Step 3: Create and run a workflow
- Click Create Workflow
- Upload a workflow specification file (YAML, JSON, or KDL)
- Click Create
- Click Initialize on the new workflow
- Click Run Locally to execute
Configuration Options
# Custom dashboard port
torc-dash --standalone --port 8080
# Specify database location
torc-dash --standalone --database /path/to/my.db
# Faster job completion detection
torc-dash --standalone --completion-check-interval-secs 2
# Specify binary paths (if not in PATH)
torc-dash --standalone \
--torc-bin /path/to/torc \
--torc-server-bin /path/to/torc-server
Stopping
Press Ctrl+C in the terminal. This stops both the dashboard and the managed server.
Scenario 2: All-in-One Login Node
Best for: Small HPC workflows (fewer than 100 jobs) where you want the complete Torc stack running on the login node, with jobs submitted to Slurm.
This is the simplest HPC setup - everything runs on the login node. It's ideal for individual users running small HPC workflows without needing a dedicated server infrastructure.
Important: Login nodes are shared resources. The torc-dash and torc-server applications consume minimal resources when workflows are small (e.g., less than 100 jobs). If you run these applications on bigger workflows, especially with faster job completion interval checks, you may impact other users.
Architecture
flowchart TB
subgraph local["Your Local Machine"]
browser["Browser"]
end
subgraph login["Login Node"]
dash["torc-dash<br/>(port 8090)"]
server["torc-server<br/>(port 8080)"]
cli["torc CLI"]
db[("SQLite DB")]
slurm["sbatch/squeue"]
dash -->|"HTTP API"| server
dash -->|"executes"| cli
cli -->|"HTTP API"| server
server --> db
cli --> slurm
end
subgraph compute["Compute Nodes (Slurm)"]
runner1["torc-slurm-job-runner<br/>(job 1)"]
runner2["torc-slurm-job-runner<br/>(job 2)"]
runnerN["torc-slurm-job-runner<br/>(job N)"]
runner1 -->|"HTTP API"| server
runner2 -->|"HTTP API"| server
runnerN -->|"HTTP API"| server
end
browser -->|"SSH tunnel"| dash
slurm --> compute
Setup
Step 1: Start torc-server on the login node
# Start server
torc-server run \
--port 8080 \
--database $SCRATCH/torc.db \
--completion-check-interval-secs 60
Or as a background process:
nohup torc-server run \
--port 8080 \
--database $SCRATCH/torc.db \
> $SCRATCH/torc-server.log 2>&1 &
Step 2: Start torc-dash on the same login node
# Set API URL to local server
export TORC_API_URL="http://localhost:8080/torc-service/v1"
# Start dashboard
torc-dash --port 8090
Or in the background:
nohup torc-dash --port 8090 > $SCRATCH/torc-dash.log 2>&1 &
Step 3: Access via SSH tunnel
From your local machine:
ssh -L 8090:localhost:8090 user@login-node
Important: Use `localhost` in the tunnel command, not the login node's hostname. This works because torc-dash binds to 127.0.0.1 by default.
Open http://localhost:8090 in your browser.
Submitting to Slurm
Via Dashboard:
- Create a workflow with Slurm scheduler configuration
- Click Initialize
- Click Submit (not "Run Locally")
Via CLI:
export TORC_API_URL="http://localhost:8080/torc-service/v1"
# Create workflow with Slurm actions
torc workflows create my_slurm_workflow.yaml
# Submit to Slurm
torc submit <workflow_id>
Monitoring Slurm Jobs
The dashboard shows job status updates as Slurm jobs progress:
- Go to Details tab
- Select Jobs
- Enable Auto-refresh
- Watch status change from `pending` → `running` → `completed`
You can also monitor via:
- Events tab for state transitions
- Debugging tab for job logs after completion
Scenario 3: Shared Server on HPC
Best for: Large-scale multi-user HPC environments where a central torc-server runs
persistently on a dedicated server, and multiple users access it via torc-dash from login nodes.
This is the most scalable setup, suitable for production deployments with many concurrent users and large workflows.
Architecture
flowchart TB
subgraph local["Your Local Machine"]
browser["Browser"]
end
subgraph login["Login Node"]
dash["torc-dash<br/>(port 8090)"]
cli["torc CLI"]
dash -->|"executes"| cli
end
subgraph shared["Shared Server"]
server["torc-server<br/>(port 8080)"]
db[("SQLite DB")]
server --> db
end
browser -->|"SSH tunnel"| dash
dash -->|"HTTP API"| server
cli -->|"HTTP API"| server
Setup
Step 1: Start torc-server on the shared server
On the shared server (e.g., a dedicated service node):
# Start server with production settings
torc-server run \
--port 8080 \
--database /shared/storage/torc.db \
--completion-check-interval-secs 60
For production, consider running as a systemd service:
torc-server service install --user \
--port 8080 \
--database /shared/storage/torc.db
Step 2: Start torc-dash on a login node
SSH to the login node and start the dashboard:
# Connect to the shared server
export TORC_API_URL="http://shared-server:8080/torc-service/v1"
# Start dashboard (accessible only from login node by default)
torc-dash --port 8090
Step 3: Access the dashboard via SSH tunnel
From your local machine, create an SSH tunnel:
ssh -L 8090:localhost:8090 user@login-node
Important: Use `localhost` in the tunnel command, not the login node's hostname. The tunnel forwards your local port to `localhost:8090` as seen from the login node, which matches where torc-dash binds (127.0.0.1:8090).
Then open http://localhost:8090 in your local browser.
Using the CLI
Users can also interact with the shared server via CLI:
# Set the API URL
export TORC_API_URL="http://shared-server:8080/torc-service/v1"
# Create and run workflows
torc workflows create my_workflow.yaml
torc workflows run <workflow_id>
Authentication
For multi-user environments, enable authentication:
# Create htpasswd file with users
torc-htpasswd create /path/to/htpasswd
torc-htpasswd add /path/to/htpasswd alice
torc-htpasswd add /path/to/htpasswd bob
# Start server with authentication
torc-server run \
--port 8080 \
--auth-file /path/to/htpasswd \
--require-auth
See Authentication for details.
Comparison Summary
| Feature | Standalone | All-in-One Login Node | Shared Server |
|---|---|---|---|
| Setup complexity | Low | Medium | Medium-High |
| Multi-user support | No | Single user | Yes |
| Slurm integration | No | Yes | Yes |
| Database location | Local | Login node | Shared storage |
| Persistence | Session only | Depends on setup | Persistent |
| Best for | Single-computer workflows | Small HPC workflows (< 100 jobs) | Large-scale production |
Troubleshooting
Cannot connect to server
# Check if server is running
curl http://localhost:8080/torc-service/v1/workflows
# Check server logs
tail -f torc-server.log
SSH tunnel not working
# Verify tunnel is established
lsof -i :8090
# Check for port conflicts
netstat -tuln | grep 8090
Slurm jobs not starting
# Check Slurm queue
squeue --me
# Check Slurm job logs
cat output/slurm_output_*.e
Dashboard shows "Disconnected"
- Verify API URL in Configuration tab
- Check network connectivity to server
- Ensure server is running and accessible
Next Steps
- Web Dashboard Guide - Complete feature reference
- Working with Slurm - Detailed Slurm configuration
- Server Deployment - Production server setup
- Authentication - Securing your deployment
Configuring AI Assistants
Complete guide for configuring AI assistants (Claude Code, GitHub Copilot) to work with Torc.
Overview
Torc provides an MCP (Model Context Protocol) server that enables AI assistants to interact with
workflows. The torc-mcp-server binary acts as a bridge between AI assistants and the Torc HTTP
API.
Available Tools
The AI assistant has access to these Torc operations:
| Tool | Description |
|---|---|
| `get_workflow_status` | Get workflow info with job counts by status |
| `get_job_details` | Get detailed job info including resource requirements |
| `get_job_logs` | Read stdout/stderr from job log files |
| `list_failed_jobs` | List all failed jobs in a workflow |
| `list_jobs_by_status` | Filter jobs by status |
| `check_resource_utilization` | Analyze resource usage and detect OOM/timeout issues |
| `update_job_resources` | Modify job resource requirements |
| `restart_jobs` | Reset and restart failed jobs |
| `resubmit_workflow` | Regenerate Slurm schedulers and submit new allocations |
| `cancel_jobs` | Cancel specific jobs |
| `create_workflow_from_spec` | Create a workflow from JSON specification |
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `TORC_API_URL` | Torc server URL | http://localhost:8080/torc-service/v1 |
| `TORC_OUTPUT_DIR` | Directory containing job logs | output |
| `TORC_PASSWORD` | Password for authentication (optional) | — |
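Before configuring an assistant, it can help to confirm that the values you plan to pass actually work. The short Python sketch below is illustrative only (not part of Torc); it reads the same environment variables and queries the workflows endpoint used later in the troubleshooting section:

```python
# Quick sanity check of the environment the MCP server will use (illustrative only).
import os
import urllib.request

api_url = os.environ.get("TORC_API_URL", "http://localhost:8080/torc-service/v1")
output_dir = os.environ.get("TORC_OUTPUT_DIR", "output")

# The /workflows endpoint is the same one used in the troubleshooting examples.
with urllib.request.urlopen(f"{api_url}/workflows") as resp:
    print(f"Torc server reachable at {api_url} (HTTP {resp.status})")
print(f"Job logs expected under: {os.path.abspath(output_dir)}")
```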
Claude Code Configuration
Configuration Scopes
Claude Code supports MCP configuration at three scopes:
| Scope | File | Use Case |
|---|---|---|
| Project | .mcp.json in project root | Team-shared configuration (commit to git) |
| Local | .mcp.json with --scope local | Personal project settings (gitignored) |
| User | ~/.claude.json | Cross-project personal tools |
CLI Commands
# Add the Torc MCP server
claude mcp add torc \
--scope project \
-e TORC_API_URL=http://localhost:8080/torc-service/v1 \
-e TORC_OUTPUT_DIR=/path/to/your/output \
-- /path/to/torc-mcp-server
# List configured MCP servers
claude mcp list
# Get details about the torc server
claude mcp get torc
# Remove the MCP server
claude mcp remove torc
Manual Configuration
Create or edit .mcp.json in your project root:
{
"mcpServers": {
"torc": {
"command": "/path/to/torc-mcp-server",
"env": {
"TORC_API_URL": "http://localhost:8080/torc-service/v1",
"TORC_OUTPUT_DIR": "/path/to/your/output"
}
}
}
}
Environment Variable Expansion
You can use environment variable expansion in .mcp.json:
{
"mcpServers": {
"torc": {
"command": "/path/to/torc-mcp-server",
"env": {
"TORC_API_URL": "${TORC_API_URL:-http://localhost:8080/torc-service/v1}",
"TORC_OUTPUT_DIR": "${TORC_OUTPUT_DIR:-./output}"
}
}
}
}
VS Code + GitHub Copilot Configuration
Prerequisites
- VS Code 1.99 or later
- GitHub Copilot extension installed
- GitHub Copilot subscription (Business, Enterprise, Pro, or Pro+)
Configuration
Create .vscode/mcp.json in your project root:
{
"servers": {
"torc": {
"command": "/path/to/torc-mcp-server",
"env": {
"TORC_API_URL": "http://localhost:8080/torc-service/v1",
"TORC_OUTPUT_DIR": "./output"
}
}
}
}
Verify Setup
- Open the Command Palette (`Ctrl+Shift+P` / `Cmd+Shift+P`)
- Run "MCP: List Servers"
- Verify "torc" appears in the list
Usage
In Copilot Chat, use Agent Mode (@workspace or the agent icon) to access MCP tools.
VS Code Remote SSH for HPC
For users running Torc on HPC clusters, VS Code's Remote SSH extension allows you to use Copilot Chat with the MCP server running directly on the cluster.
Architecture
┌─────────────────────┐ ┌─────────────────────────────────────┐
│ Local Machine │ SSH │ HPC Cluster │
│ │◄───────►│ │
│ VS Code │ │ torc-mcp-server ◄──► torc-server │
│ (Copilot Chat) │ │ ▲ │
│ │ │ │ │
└─────────────────────┘ │ .vscode/mcp.json │
└─────────────────────────────────────┘
The MCP server runs on the HPC, communicates with the Torc server on the HPC, and VS Code proxies requests through SSH. No ports need to be exposed to your local machine.
Step 1: Build torc-mcp-server on the HPC
# On the HPC (via SSH or login node)
cd /path/to/torc
cargo build --release -p torc-mcp-server
Step 2: Configure MCP in your project
Create .vscode/mcp.json in your project directory on the HPC:
{
"servers": {
"torc": {
"command": "/path/on/hpc/torc/target/release/torc-mcp-server",
"env": {
"TORC_API_URL": "http://localhost:8080/torc-service/v1",
"TORC_OUTPUT_DIR": "./output"
}
}
}
}
Important: MCP servers configured in workspace settings (`.vscode/mcp.json`) run on the remote host, not your local machine.
Step 3: Connect and use
- Install the Remote - SSH extension
- Connect to the HPC: `Remote-SSH: Connect to Host...`
- Open your project folder on the HPC
- Open Copilot Chat and use Agent Mode
HPC-Specific Tips
- Module systems: If your HPC uses modules, you may need to set `PATH` in the env to include required dependencies
- Shared filesystems: Place `.vscode/mcp.json` in a project directory on a shared filesystem accessible from compute nodes
- Firewalls: The MCP server only needs to reach the Torc server on the HPC's internal network
How It Works
Torc uses the Model Context Protocol (MCP), an open standard for connecting AI assistants to
external tools. The torc-mcp-server binary:
- Receives tool calls from the AI assistant via stdio
- Translates them to Torc HTTP API calls
- Returns results in a format the assistant can understand
The server is stateless—it simply proxies requests to your running Torc server. All workflow state remains in Torc's database.
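To make the bridging idea concrete, here is a deliberately simplified Python sketch. It is not the real MCP protocol (MCP uses JSON-RPC framing and tool schemas), and the `list_workflows` tool name and one-JSON-object-per-line framing are invented for illustration; the HTTP endpoint is the workflows listing used elsewhere in this guide:

```python
# Minimal "stdio in, HTTP out" bridge sketch. Real MCP servers speak JSON-RPC
# with a declared tool schema; this only illustrates the translation step.
import json
import os
import sys
import urllib.request

API_URL = os.environ.get("TORC_API_URL", "http://localhost:8080/torc-service/v1")

def list_workflows() -> object:
    """Translate a tool call into a Torc HTTP API request."""
    with urllib.request.urlopen(f"{API_URL}/workflows") as resp:
        return json.load(resp)

# One JSON object per line on stdin, e.g. {"tool": "list_workflows"}
for line in sys.stdin:
    call = json.loads(line)
    if call.get("tool") == "list_workflows":
        result = list_workflows()
    else:
        result = {"error": f"unknown tool: {call.get('tool')}"}
    print(json.dumps(result), flush=True)
```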
Security Considerations
- The MCP server has full access to your Torc server
- Consider using authentication if your Torc server is exposed
- The server can modify workflows (restart, cancel, update resources)
- Review proposed actions before they execute
Troubleshooting
Claude doesn't see the tools
- Verify the MCP server is configured: `claude mcp list`
- Check the config file is valid JSON: `cat .mcp.json | jq .`
- Check that the path to `torc-mcp-server` is correct and the binary exists
- Start a new Claude Code session (MCP servers are loaded at startup)
"Failed to connect to server"
- Ensure your Torc server is running
- Check that `TORC_API_URL` is correct
- Verify network connectivity
"Permission denied" or "Authentication failed"
- Set `TORC_PASSWORD` if your server requires auth
- Check that the credentials are correct
Logs not found
- Ensure `TORC_OUTPUT_DIR` points to your job output directory
- Check that jobs have actually run (logs are created at runtime)
See Also
Tutorial: AI-Assisted Workflow Management
This tutorial shows how to use AI assistants to manage Torc workflows using natural language.
What You'll Learn
- Set up an AI assistant to work with Torc
- Create and manage workflows through conversation
- Debug failures by asking questions
Prerequisites
- Torc installed and server running
- Claude Code or VS Code with GitHub Copilot
What Can AI Assistants Do?
With Torc's AI integration, you can manage workflows using natural language:
| Task | Example |
|---|---|
| Create workflows | "Create a workflow with 10 parallel jobs" |
| Check status | "What's the status of workflow 42?" |
| Debug failures | "Why did a job in workflow 5 fail?" |
| Fix problems | "Restart the failed jobs with doubled memory" |
| Investigate | "Check if any jobs exceeded their memory limits" |
Quick Setup: Claude Code
Step 1: Add the Torc MCP server to your project:
claude mcp add torc \
--scope project \
-e TORC_API_URL=http://localhost:8080/torc-service/v1 \
-e TORC_OUTPUT_DIR=./output \
-- /path/to/torc-mcp-server
Step 2: Verify it's configured:
claude mcp list
Step 3: Start a new Claude Code session and try:
"What workflows exist?"
Quick Setup: VS Code + Copilot
Step 1: Create .vscode/mcp.json in your project:
{
"servers": {
"torc": {
"command": "/path/to/torc-mcp-server",
"env": {
"TORC_API_URL": "http://localhost:8080/torc-service/v1",
"TORC_OUTPUT_DIR": "./output"
}
}
}
}
Step 2: In Copilot Chat, use Agent Mode and try:
"What workflows exist?"
Example: Creating a Workflow
> Create a workflow with 3 stages: a preprocess script, 10 work jobs that each take an index, and a postprocess job. The work jobs each require
10 CPUs and 20 GB of memory. The other jobs have trivial resource requirements.
⏺ I'll create a workflow spec file with the 3-stage fan-out/fan-in pattern you described.
⏺ Created the workflow spec at ./workflow.json with:
Structure:
- Stage 1: preprocess - runs first, outputs preprocess_output.txt
- Stage 2: work_0 through work_9 - 10 parallel jobs, each reads preprocess output
- Stage 3: postprocess - waits for all work jobs to complete (fan-in pattern)
Resource Requirements:
| Name | CPUs | Memory | Runtime |
|-------|------|--------|---------|
| small | 1 | 1g | 1 hour |
| large | 10 | 20g | 2 hours |
Next steps:
1. Edit the spec to replace placeholder commands with your actual scripts
2. Run locally: torc run workflow.json
3. Or submit to Slurm: torc submit workflow.json (requires adding an account)
You can create much more complex workflows. Output is not shown, but this prompt will create a workflow with job dependencies defined implicitly through job-file relationships:
Create a workflow with four stages of jobs. There are 3 initial jobs each dependent on an input file. Each of those jobs fans out into 10 work jobs through
output files. Each of those jobs creates a file. In the third stage there is a script that aggregates each of the 10 jobs into one file. The final
postprocess job aggregates those three files into one. I will run it with Slurm: account=my_account. The work jobs require 10 cpus and 30 GB of
memory. The other jobs have trivial resource requirements.
Example: Adjusting jobs' resource requirements
Suppose that jobs fail because they use too much CPU or memory.
> Check the resource utilization for workflow 20
⏺ torc - check_resource_utilization (MCP)(workflow_id: 20)
⎿ {
"over_utilization_count": 1,
"run_id": null,
… +17 lines (ctrl+o to expand)
⏺ Resource Utilization for Workflow 20
Summary: 3 job results analyzed, 1 resource violation found
CPU Over-Utilization
| Job ID | Job Name | Specified | Peak Used | Over By |
|--------|---------------|----------------|-----------|---------|
| 264 | cpu_heavy_job | 3 cores (300%) | 602.5% | +100.8% |
The cpu_heavy_job used approximately 6 CPU cores at peak, but was only allocated 3 cores.
Would you like me to update the resource requirements for this job? I can increase the CPU allocation to 7 cores to provide headroom for future runs.
> yes
⏺ torc - update_job_resources (MCP)(job_id: 264, num_cpus: 7)
⎿ {
"job_id": 264,
"resource_requirements_id": 45,
… +8 lines (ctrl+o to expand)
⏺ Done. Updated cpu_heavy_job (job 264) resources:
| Setting | Value |
|---------|------------|
| CPUs | 7 (was 3) |
| Memory | 1g |
| Runtime | 15 minutes |
The job is now configured with enough CPU capacity for future runs.
Example: Recovering a failed workflow that timed out
The workflow in tests/workflows/timeout_auto_recovery_test is used to test torc's auto-recovery of
walltime timeouts. In this example, the workflow was left in a failed state.
Here is a conversation in Claude Code after running the workflow:
> analyze the logs for workflow 6
Workflow 6 Log Analysis Summary
Workflow: timeout_auto_recovery_test - Test workflow for automatic timeout recovery
Job Results
| Job | Name | Status | Runtime | Return Code |
|-----|------------|---------------|---------|--------------|
| 31 | (fast job) | ✅ Completed | 1.0 min | 0 |
| 32 | job_slow | ⛔ Terminated | 5.9 min | -9 (SIGKILL) |
Terminated Job Details (job_slow)
The job was killed by signal 9 (SIGKILL) after running for ~6 minutes, which matches the expected behavior for a timeout test.
stdout log shows the job ran through minute 6 before being killed:
Slow job starting at Sat Dec 27 15:40:59 MST 2025
This job will run for 10 minutes (exceeds 5 min runtime and 8 min walltime)
It should be killed by Slurm and recovered by the watcher
Minute 1 of 10...
Minute 2 of 10...
...
Minute 6 of 10...
stderr log: Empty (no errors written)
Conclusion
The timeout auto-recovery test is working as expected:
- The slow job exceeded its 5-minute runtime limit
- It was killed by the system (SIGKILL, return code -9) at ~6 minutes
- Peak memory usage was ~4.1 MB (no memory issues)
Example: Debugging a failed workflow with over-consumption of memory
The workflow in tests/workflows/oom_auto_recovery_test is used to test torc's diagnosis of memory
over-consumption errors.
Here is a conversation in Claude Code after running the workflow:
Analyze the logs for workflow 16
● torc - analyze_workflow_logs (MCP)(workflow_id: 16) ⎿ { "error_count": 0, "errors_by_type": [], … +7 lines (ctrl+o to expand)
● torc - get_workflow_status (MCP)(workflow_id: 16) ⎿ { "description": "Test workflow with an OOM", "job_counts_by_status": { … +8 lines (ctrl+o to expand)
● The workflow has 3 failed jobs out of 10 total, but the log analyzer didn't find error patterns. Let me dig deeper:
● torc - list_failed_jobs (MCP)(workflow_id: 16) ⎿ { "failed_job_count": 3, "failed_jobs": [ … +18 lines (ctrl+o to expand)
● torc - check_resource_utilization (MCP)(workflow_id: 16, include_failed: true) ⎿ { "failed_jobs": [ { … +137 lines (ctrl+o to expand)
● Workflow 16 Analysis Summary
Workflow: oom_test - "Test workflow with an OOM" Status: 7 completed, 3 failed (10 total jobs)
Root Cause: Out of Memory (OOM)
All 3 failed jobs were killed due to OOM (return code 137):
| Job ID | Name | Peak Memory | Configured | Over-Utilization |
|---|---|---|---|---|
| 103 | work_1 | N/A (killed early) | 10g | SIGKILL 137 |
| 109 | work_7 | 29.91 GB | 10g | +199.1% |
| 110 | work_8 | 26.89 GB | 10g | +168.9% |
Critical Warning: More Failures Expected
9 jobs exceeded their 10 GB memory limit - including 6 jobs that happened to complete but used ~30 GB each:
- work_2 through work_6, work_9, work_10 all peaked at ~30.01 GB (+200.1% over limit)
- These completed due to available system memory but are also misconfigured
Recommendation
- Increase memory allocation to 32g for all work jobs
- Use automated recovery - I can run a dry-run of the recovery tool to preview fixes
Would you like me to run recover_workflow with dry_run=true to preview the recovery actions?
yes
● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true) ⎿ { "dry_run": true, "memory_multiplier": 1.5, … +9 lines (ctrl+o to expand)
● Recovery Preview (Dry Run)
Proposed Actions:
- 3 failed jobs would have memory increased (1.5x multiplier: 10g → 15g)
- Failed jobs would be reset to ready status
- Slurm schedulers would be regenerated
⚠️ Important Consideration
The default 1.5x memory multiplier would increase memory from 10g → 15g, but the jobs actually peaked at ~30 GB. This means:
- 15g may still cause OOM failures
- A 3.5x multiplier (10g → 35g) would be safer based on actual usage
Would you like me to proceed with:
- Default recovery (1.5x = 15g) - may require another recovery cycle
- Custom recovery with a higher memory multiplier (e.g., 3.5x = 35g)
2
● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true, memory_multiplier: 3.5) ⎿ { "dry_run": true, "memory_multiplier": 3.5, … +9 lines (ctrl+o to expand)
● Recovery Preview (3.5x Memory Multiplier)
Proposed Actions:
- 3 failed jobs would have memory increased: 10g → 35g
- Failed jobs would be reset to ready status
- Slurm schedulers would be regenerated
This should safely cover the ~30 GB peak memory usage observed.
Would you like me to proceed with these recovery actions?
yes
● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: false, memory_multiplier: 3.5) ⎿ { "dry_run": false, "memory_multiplier": 3.5, … +7 lines (ctrl+o to expand)
● Recovery Complete ✓
Actions taken:
- 3 jobs had memory increased from 10g → 35g
- Failed jobs reset to ready status
- Slurm schedulers regenerated and submitted
● torc - get_workflow_status (MCP)(workflow_id: 16) ⎿ { "description": "Test workflow with an OOM", "job_counts_by_status": { … +8 lines (ctrl+o to expand)
● Current Status:
| Status | Count |
|---|---|
| Completed | 7 |
| Ready | 3 |
The 3 previously failed jobs (work_1, work_7, work_8) are now ready to run with 35g memory. Slurm allocations have been submitted and will execute these jobs automatically.
What You Learned
- How to configure AI assistants with Torc (one command for Claude Code, one file for VS Code)
- How to create workflows through conversation
- How to debug and fix failures using natural language
Next Steps
- Configuring AI Assistants — Full configuration options, all tools, troubleshooting
- Automatic Failure Recovery — Use `torc watch` for automated recovery
- Configuration Files — Set up Torc configuration
Map a Python function to compute nodes
This tutorial will teach you how to build a workflow from Python functions instead of CLI executables and run it on an HPC with Slurm.
Pre-requisites
This tutorial requires installation of the python package torc-client. Until the latest version is
published at pypi.org, you must clone this repository and install the package in a virtual environment.
Use Python 3.11 or later.
git clone https://github.com/NREL/torc
cd torc/python_client
python -m venv .venv
source .venv/bin/activate
pip install -e .
Workflow Description
Let's suppose that your code is in a module called simulation.py and looks something like this:
def run(job_name: str, input_params: dict) -> dict:
"""Runs one simulation on a set of input parameters.
Parameters
----------
job_name : str
    Name of the job.
input_params : dict
    Input parameters for the simulation.

Returns
-------
dict
    Result of the simulation.
"""
return {
"inputs": input_params,
"result": 5,
"output_data_path": f"/projects/my-project/{job_name}",
}
def postprocess(results: list[dict]) -> dict:
"""Collects the results of the workers and performs post-processing.
Parameters
----------
results : list[dict]
Results from each simulation
Returns
-------
dict
Final result
"""
total = 0
paths = []
for result in results:
assert "result" in result
assert "output_data_path" in result
total += result["result"]
paths.append(result["output_data_path"])
return {"total": total, "output_data_paths": paths}
You need to run this function on hundreds of sets of input parameters and want torc to help you scale this work on an HPC.
The recommended approach is to use torc's Python API, as shown below. The goal is to mimic the behavior of Python's concurrent.futures.ProcessPoolExecutor.map as closely as possible.
Similar functionality is also available with Dask.
Resource Constraints
- Each function call needs 4 CPUs and 20 GiB of memory.
- The function call takes 1 hour to run.
Compute nodes with 92 GiB of memory are easiest to acquire but would only be able to run 4 jobs at a time. The 180 GiB nodes are fewer in number but would use fewer AUs because they can run 8 jobs at a time.
Torc Overview
Here is what torc does to solve this problem:
- User creates a workflow in Python.
- User passes a callable function as well as a list of all input parameters that need to be mapped to the function.
- For each set of input parameters torc creates a record in the `user_data` table in the database, creates a job with a relationship to that record as an input, and creates a placeholder for data to be created by that job.
- When the function completes, torc stores any returned data in the database.
- When all workers complete torc collects all result data from the database into a list and passes that to the postprocess function. It also stores any returned data from that function into the database.
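Conceptually, the "import the user's function and call it" step looks like the sketch below. This is not torc's actual runner code, and exactly how torc passes the job name and stored parameters to your function is internal to torc; the sketch simply follows the signatures used in this tutorial and assumes simulation.py is importable:

```python
# Conceptual sketch of the map step: import the user's module dynamically and
# call the target function with parameters read from the database.
import importlib

def call_mapped_function(module_name: str, func_name: str, job_name: str, params: dict):
    module = importlib.import_module(module_name)  # e.g. simulation.py
    func = getattr(module, func_name)              # e.g. the run() function above
    return func(job_name, params)                  # returned data goes back to the database

result = call_mapped_function(
    "simulation", "run", "work_1", {"input1": 1, "input2": 2, "input3": 3}
)
print(result)
```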
Build the workflow
- Write a script to create the workflow. Note that you need to correct the API URL and the Slurm `account`.
import getpass
import os
from torc import make_api, map_function_to_jobs, setup_logging
from torc.openapi_client import (
DefaultApi,
ResourceRequirementsModel,
SlurmSchedulerModel,
WorkflowModel,
)
TORC_API_URL = os.getenv("TORC_API_URL", "http://localhost:8080/torc-service/v1")
def create_workflow(api: DefaultApi) -> WorkflowModel:
"""Create the workflow"""
workflow = WorkflowModel(
user=getpass.getuser(),
name="map_function_workflow",
description="Example workflow that maps a function across workers",
)
return api.create_workflow(workflow)
def build_workflow(api: DefaultApi, workflow: WorkflowModel):
"""Creates a workflow with implicit job dependencies declared through files."""
workflow_id = workflow.id
assert workflow_id is not None
params = [
{"input1": 1, "input2": 2, "input3": 3},
{"input1": 4, "input2": 5, "input3": 6},
{"input1": 7, "input2": 8, "input3": 9},
]
assert workflow.id is not None
rr = api.create_resource_requirements(
ResourceRequirementsModel(
workflow_id=workflow_id,
name="medium",
num_cpus=4,
memory="20g",
runtime="P0DT1H",
),
)
api.create_slurm_scheduler(
SlurmSchedulerModel(
workflow_id=workflow_id,
name="short",
account="my_account",
mem="180224",
walltime="04:00:00",
nodes=1,
),
)
jobs = map_function_to_jobs(
api,
workflow_id,
"simulation",
"run",
params,
resource_requirements_id=rr.id,
# Note that this is optional.
postprocess_func="postprocess",
)
print(f"Created workflow with ID {workflow_id} {len(jobs)} jobs.")
def main():
setup_logging()
api = make_api(TORC_API_URL)
workflow = create_workflow(api)
try:
build_workflow(api, workflow)
except Exception:
api.delete_workflow(workflow.id)
raise
if __name__ == "__main__":
main()
Requirements:
- Your run function should raise an exception if there is a failure. If that happens, torc will record a non-zero return code for the job.
- If you want torc to store result data in the database, return it from your run function. Note: be careful about how much result data you return. If you are using a custom database for one workflow, store as much as you want. If you are using a shared server, ensure that you are following its administrator's policies. You could consider storing large data in files and only storing file paths in the database.
- If you choose to define a postprocess function and want torc to store the final data in the database, return it from that function.
- The `params` must be serializable in JSON format because they will be stored in the database. Basic types like numbers and strings, and lists and dictionaries of those, will work fine. If you need to store complex, custom types, consider these options:
  - Define data models with Pydantic. You can use their existing serialization/de-serialization methods or define custom methods (see the sketch after this list).
  - Pickle your data and store the result as a string. Your run function would need to understand how to de-serialize it. Note that this has portability limitations. (Please contact the developers if you would like to see this happen automatically.)
- Torc must be able to import simulation.py from Python. Here are some options:
  - Put the script in the current directory.
  - Install it in the environment.
  - Specify its parent directory like this: `map_function_to_jobs(..., module_directory="my_module")`
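A minimal sketch of the Pydantic option, assuming Pydantic v2 and an invented `SimulationParams` model; `map_function_to_jobs` and its arguments are the ones from the script above:

```python
# Hypothetical Pydantic model describing one simulation's inputs.
from pydantic import BaseModel

class SimulationParams(BaseModel):
    input1: int
    input2: int
    input3: int

models = [
    SimulationParams(input1=1, input2=2, input3=3),
    SimulationParams(input1=4, input2=5, input3=6),
]

# model_dump(mode="json") produces plain, JSON-serializable dicts, which is
# what torc stores in the user_data table.
params = [m.model_dump(mode="json") for m in models]

# Pass `params` to map_function_to_jobs exactly as in the script above.
# Inside run(), rebuild the typed object with SimulationParams(**input_params).
```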
- Create the workflow.
python examples/python/map_function_across_workers.py
Created workflow 342 with 4 jobs.
- Run the workflow.
$ torc run 342
- View the result data overall or by job (if your run and postprocess functions return something). Note that listing all user-data will return input parameters.
$ torc -f json user-data list 342
Other jobs
You could add "normal" jobs to the workflow as well. For example, you might have preprocessing and
post-processing work to do. You can add those jobs through the API. You could also add multiple
rounds of mapped functions. map_function_to_jobs provides a depends_on_job_ids parameter to
specify ordering. You could also define job-job relationships through files or user-data as
discussed elsewhere in this documentation.
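For example, continuing the `build_workflow` function above, a hypothetical second round of mapped work could be ordered after the first round. In this sketch, `refine` and `more_params` are invented names, and it is assumed that the job objects returned by `map_function_to_jobs` expose an `id` field like the other models in this tutorial:

```python
# Sketch: order a second round of mapped functions after the first.
first_round = map_function_to_jobs(
    api, workflow_id, "simulation", "run", params,
    resource_requirements_id=rr.id,
)
second_round = map_function_to_jobs(
    api, workflow_id, "simulation", "refine", more_params,  # hypothetical second stage
    resource_requirements_id=rr.id,
    depends_on_job_ids=[job.id for job in first_round],     # run after the first round
)
```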
Tutorial 11: Filtering CLI Output with Nushell
This tutorial teaches you how to filter and analyze Torc CLI output using Nushell, a modern shell with powerful structured data capabilities.
Learning Objectives
By the end of this tutorial, you will:
- Understand why Nushell is useful for filtering Torc output
- Know how to filter jobs by status, name, and other fields
- Be able to analyze results and find failures quickly
- Create complex queries combining multiple conditions
Prerequisites
- Torc CLI installed and configured
- A workflow with jobs (ideally one with various statuses)
Why Nushell?
Torc's CLI can output JSON with the -f json flag. While tools like jq can process JSON, Nushell
offers a more readable, SQL-like syntax that's easier to learn and use interactively.
Compare filtering failed jobs:
# jq (cryptic syntax)
torc jobs list 123 -f json | jq '.jobs[] | select(.status == "failed")'
# Nushell (readable, SQL-like)
torc jobs list 123 -f json | from json | get jobs | where status == "failed"
Nushell is:
- Cross-platform: Works on Linux, macOS, and Windows
- Readable: Uses intuitive commands like `where`, `select`, `sort-by`
- Interactive: Tab completion and helpful error messages
- Powerful: Built-in support for JSON, YAML, CSV, and more
Installing Nushell
Install Nushell from nushell.sh/book/installation:
# macOS
brew install nushell
# Windows
winget install nushell
# Linux (various methods available)
cargo install nu
After installation, run nu to start a Nushell session. You can use Nushell interactively or run
individual commands with nu -c "command".
Basic Filtering
Setup: Get JSON Output
All examples assume you have a workflow ID. Replace $WORKFLOW_ID with your actual ID:
# In Nushell, set your workflow ID
let wf = 123
List All Jobs
torc jobs list $wf -f json | from json | get jobs
This parses the JSON and extracts the jobs array into a table.
Filter by Status
Find all failed jobs:
torc jobs list $wf -f json | from json | get jobs | where status == "failed"
Find jobs that are ready or running:
torc jobs list $wf -f json | from json | get jobs | where status in ["ready", "running"]
Filter by Name Pattern
Find jobs with "train" in the name:
torc jobs list $wf -f json | from json | get jobs | where name =~ "train"
The =~ operator performs substring/regex matching.
Combine Conditions
Find failed jobs with "process" in the name:
torc jobs list $wf -f json | from json | get jobs | where status == "failed" and name =~ "process"
Find jobs that failed or were canceled:
torc jobs list $wf -f json | from json | get jobs | where status == "failed" or status == "canceled"
Selecting and Formatting Output
Select Specific Columns
Show only name and status:
torc jobs list $wf -f json | from json | get jobs | select name status
Sort Results
Sort by name:
torc jobs list $wf -f json | from json | get jobs | sort-by name
Sort failed jobs by ID (descending):
torc jobs list $wf -f json | from json | get jobs | where status == "failed" | sort-by id -r
Count Results
Count jobs by status:
torc jobs list $wf -f json | from json | get jobs | group-by status | transpose status jobs | each { |row| { status: $row.status, count: ($row.jobs | length) } }
Or more simply, count failed jobs:
torc jobs list $wf -f json | from json | get jobs | where status == "failed" | length
Analyzing Results
Find Jobs with Non-Zero Return Codes
torc results list $wf -f json | from json | get results | where return_code != 0
Find Results with Specific Errors
torc results list $wf -f json | from json | get results | where return_code != 0 | select job_id return_code
Join Jobs with Results
Get job names for failed results:
let jobs = (torc jobs list $wf -f json | from json | get jobs)
let results = (torc results list $wf -f json | from json | get results | where return_code != 0)
$results | each { |r|
let job = ($jobs | where id == $r.job_id | first)
{ name: $job.name, return_code: $r.return_code, job_id: $r.job_id }
}
Working with User Data
List User Data Entries
torc user-data list $wf -f json | from json | get user_data
Filter by Key
Find user data with a specific key:
torc user-data list $wf -f json | from json | get user_data | where key =~ "config"
Parse JSON Values
User data values are JSON strings. Parse and filter them:
torc user-data list $wf -f json | from json | get user_data | each { |ud|
{ key: $ud.key, value: ($ud.value | from json) }
}
Practical Examples
Example 1: Debug Failed Jobs
Find failed jobs and get their result details:
# Get failed job IDs
let failed_ids = (torc jobs list $wf -f json | from json | get jobs | where status == "failed" | get id)
# Show results for those jobs
torc results list $wf -f json | from json | get results | where job_id in $failed_ids | select job_id return_code
Example 2: Find Stuck Jobs
Find jobs that have been running for a long time (status is "running"):
torc jobs list $wf -f json | from json | get jobs | where status == "running" | select id name
Example 3: Parameter Sweep Analysis
For a parameterized workflow, find which parameter values failed:
torc jobs list $wf -f json | from json | get jobs | where status == "failed" and name =~ "lr" | get name
Example 4: Export to CSV
Export failed jobs to CSV for further analysis:
torc jobs list $wf -f json | from json | get jobs | where status == "failed" | to csv | save failed_jobs.csv
Quick Reference
| Operation | Nushell Command |
|---|---|
| Parse JSON | from json |
| Get field | get jobs |
| Filter rows | where status == "failed" |
| Select columns | select name status id |
| Sort | sort-by name |
| Sort descending | sort-by id -r |
| Count | length |
| Substring match | where name =~ "pattern" |
| Multiple conditions | where status == "failed" and name =~ "x" |
| In list | where status in ["ready", "running"] |
| Group by | group-by status |
| Save to file | save output.json |
| Convert to CSV | to csv |
Tips
- Use `nu` interactively: Start a Nushell session to explore data step by step
- Tab completion: Nushell provides completions for commands and field names
- Pipeline debugging: Add `| first 5` to see a sample before processing all data
- Save queries: Create shell aliases or scripts for common filters
What You Learned
In this tutorial, you learned:
- Why Nushell is a great tool for filtering Torc CLI output
- How to filter jobs by status and name patterns
- How to analyze results and find failures
- How to work with user data
- Practical examples for debugging workflows
Next Steps
- Nushell Documentation - Learn more about Nushell's capabilities
- Torc CLI Reference - Full list of CLI commands and their JSON output
Shell Completions
Torc provides shell completion scripts to make working with the CLI faster and more convenient. Completions help you discover commands, avoid typos, and speed up your workflow.
Overview
Shell completions provide:
- Command completion - Tab-complete `torc` subcommands and options
- Flag completion - Tab-complete command-line flags and their values
- Multi-shell support - Bash, Zsh, Fish, Elvish, and PowerShell
- Automatic updates - Completions are generated from the CLI structure
Generating Completions
Use the torc completions command to generate completion scripts for your shell:
# See available shells
torc completions --help
# Generate for a specific shell
torc completions bash
torc completions zsh
torc completions fish
torc completions elvish
torc completions powershell
Installation
Bash
User installation
# Create completions directory if it doesn't exist
mkdir -p ~/.local/share/bash-completion/completions
# Generate and install completions
torc completions bash > ~/.local/share/bash-completion/completions/torc
# Source the completion file in your current shell
source ~/.local/share/bash-completion/completions/torc
Verify installation:
# Restart your shell or source the completion file
source ~/.local/share/bash-completion/completions/torc
# Test completions
torc wor<TAB> # Should complete to "workflows"
torc workflows <TAB> # Should show workflow subcommands
Zsh
Option 1: User installation (recommended)
# Create completions directory in your home directory
mkdir -p ~/.zfunc
# Add to fpath in your ~/.zshrc if not already present
echo 'fpath=(~/.zfunc $fpath)' >> ~/.zshrc
echo 'autoload -Uz compinit && compinit' >> ~/.zshrc
# Generate and install completions
torc completions zsh > ~/.zfunc/_torc
# Restart shell or source ~/.zshrc
source ~/.zshrc
Option 2: Using custom location
# Generate to a custom location
mkdir -p ~/my-completions
torc completions zsh > ~/my-completions/_torc
# Add to ~/.zshrc
echo 'fpath=(~/my-completions $fpath)' >> ~/.zshrc
echo 'autoload -Uz compinit && compinit' >> ~/.zshrc
# Restart shell
exec zsh
Troubleshooting Zsh completions:
If completions aren't working, try rebuilding the completion cache:
# Remove completion cache
rm -f ~/.zcompdump
# Restart shell
exec zsh
Fish
# Fish automatically loads completions from ~/.config/fish/completions/
mkdir -p ~/.config/fish/completions
# Generate and install completions
torc completions fish > ~/.config/fish/completions/torc.fish
# Fish will automatically load the completions
# Test immediately (no shell restart needed)
torc wor<TAB>
Elvish
# Create completions directory
mkdir -p ~/.elvish/lib
# Generate completions
torc completions elvish > ~/.elvish/lib/torc.elv
# Add to your ~/.elvish/rc.elv
echo 'use torc' >> ~/.elvish/rc.elv
# Restart shell
PowerShell
Windows PowerShell / PowerShell Core:
# Create profile directory if it doesn't exist
New-Item -ItemType Directory -Force -Path (Split-Path -Parent $PROFILE)
# Generate completions to a file
torc completions powershell > $HOME\.config\torc_completions.ps1
# Add to your PowerShell profile
Add-Content -Path $PROFILE -Value '. $HOME\.config\torc_completions.ps1'
# Reload profile
. $PROFILE
Alternative: Source inline
# Generate and add directly to profile
torc completions powershell | Out-File -Append -FilePath $PROFILE
# Reload profile
. $PROFILE
Using Completions
Once installed, use Tab to trigger completions:
Command Completion
# Complete subcommands
torc <TAB>
# Shows: workflows, jobs, files, events, run, submit, tui, ...
torc work<TAB>
# Completes to: torc workflows
torc workflows <TAB>
# Shows: create, list, get, delete, submit, run, ...
Flag Completion
# Complete flags
torc --<TAB>
# Shows: --url, --username, --password, --format, --log-level, --help
torc workflows list --<TAB>
# Shows available flags for the list command
# Complete flag values (where applicable)
torc workflows list --format <TAB>
# Shows: table, json
Workflow ID Completion
# Some shells support dynamic completion
torc workflows get <TAB>
# May show available workflow IDs
Examples
Here are some common completion patterns:
# Discover available commands
torc <TAB><TAB>
# Complete command names
torc w<TAB> # workflows
torc wo<TAB> # workflows
torc j<TAB> # jobs
# Navigate subcommands
torc workflows <TAB> # create, list, get, delete, ...
torc jobs <TAB> # list, get, update, ...
# Complete flags
torc --u<TAB> # --url, --username
torc --url <type-url>
torc --format <TAB> # table, json
# Complex commands
torc workflows create --<TAB>
# Shows all available flags for the create command
Updating Completions
When you update Torc to a new version, regenerate the completion scripts to get the latest commands and flags:
# Bash
torc completions bash > ~/.local/share/bash-completion/completions/torc
source ~/.local/share/bash-completion/completions/torc
# Zsh
torc completions zsh > ~/.zfunc/_torc
rm -f ~/.zcompdump && exec zsh
# Fish
torc completions fish > ~/.config/fish/completions/torc.fish
# Fish reloads automatically
# PowerShell
torc completions powershell > $HOME\.config\torc_completions.ps1
. $PROFILE
Automation
You can automate completion installation in your dotfiles or setup scripts:
Bash Setup Script
#!/bin/bash
# install-torc-completions.sh
COMPLETION_DIR="$HOME/.local/share/bash-completion/completions"
mkdir -p "$COMPLETION_DIR"
if command -v torc &> /dev/null; then
torc completions bash > "$COMPLETION_DIR/torc"
echo "Torc completions installed for Bash"
echo "Run: source $COMPLETION_DIR/torc"
else
echo "Error: torc command not found"
exit 1
fi
Zsh Setup Script
#!/bin/zsh
# install-torc-completions.zsh
COMPLETION_DIR="$HOME/.zfunc"
mkdir -p "$COMPLETION_DIR"
if command -v torc &> /dev/null; then
torc completions zsh > "$COMPLETION_DIR/_torc"
# Add fpath to .zshrc if not already present
if ! grep -q "fpath=(.*\.zfunc" ~/.zshrc; then
echo 'fpath=(~/.zfunc $fpath)' >> ~/.zshrc
echo 'autoload -Uz compinit && compinit' >> ~/.zshrc
fi
echo "Torc completions installed for Zsh"
echo "Run: exec zsh"
else
echo "Error: torc command not found"
exit 1
fi
Post-Installation Check
#!/bin/bash
# verify-completions.sh
# Test if completions are working
if complete -p torc &> /dev/null; then
echo "✓ Torc completions are installed"
else
echo "✗ Torc completions are not installed"
echo "Run: torc completions bash > ~/.local/share/bash-completion/completions/torc"
fi
Troubleshooting
Completions Not Working
Problem: Tab completion doesn't show torc commands.
Solutions:
- Verify torc is in your PATH:
  which torc  # Should show path to torc binary
- Check if the completion file exists:
  # Bash
  ls -l ~/.local/share/bash-completion/completions/torc
  # Zsh
  ls -l ~/.zfunc/_torc
  # Fish
  ls -l ~/.config/fish/completions/torc.fish
- Verify the completion is loaded:
  # Bash
  complete -p torc
  # Zsh
  which _torc
- Reload your shell or source the completion file:
  # Bash
  source ~/.local/share/bash-completion/completions/torc
  # Zsh
  exec zsh
  # Fish (automatic)
Outdated Completions
Problem: New commands or flags don't show in completions.
Solution: Regenerate the completion file after updating Torc:
# Bash
torc completions bash > ~/.local/share/bash-completion/completions/torc
source ~/.local/share/bash-completion/completions/torc
# Zsh
torc completions zsh > ~/.zfunc/_torc
rm ~/.zcompdump && exec zsh
# Fish
torc completions fish > ~/.config/fish/completions/torc.fish
Permission Denied
Problem: Cannot write to system completion directory.
Solution: Use user-level completion directory or sudo:
# Use user directory (recommended)
mkdir -p ~/.local/share/bash-completion/completions
torc completions bash > ~/.local/share/bash-completion/completions/torc
# Or use sudo for system-wide
torc completions bash | sudo tee /etc/bash_completion.d/torc > /dev/null
Zsh "command not found: compdef"
Problem: Zsh completion system not initialized.
Solution: Add to your ~/.zshrc:
autoload -Uz compinit && compinit
PowerShell Execution Policy
Problem: Cannot run completion script due to execution policy.
Solution: Adjust execution policy:
# Check current policy
Get-ExecutionPolicy
# Set policy to allow local scripts
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Shell-Specific Features
Bash
- Case-insensitive completion (if configured in `.inputrc`)
- Partial matching support
- Menu completion available
Zsh
- Advanced completion with descriptions
- Correction suggestions
- Menu selection
- Color support for completions
Fish
- Rich descriptions for each option
- Real-time syntax highlighting
- Automatic paging for long completion lists
- Fuzzy matching support
PowerShell
- IntelliSense-style completions
- Parameter descriptions
- Type-aware completions
Best Practices
- Keep completions updated: Regenerate after each Torc update
- Use version control: Include completion installation in dotfiles
- Automate installation: Add to setup scripts for new machines
- Test after updates: Verify completions work after shell or Torc updates
- Document in team wikis: Help teammates set up completions
Additional Resources
- Bash Completion Documentation
- Zsh Completion System
- Fish Completion Tutorial
- PowerShell Tab Completion
Design & Architecture
Internal design documentation for developers.
- Client Architecture - Torc client design
- Workflow Actions - Workflow action system
- Web Dashboard Design - Dashboard architecture
- Server API Handler - Server implementation
- Central Database - Database schema and design
- Workflow Recovery Design - Recovery system architecture
- Failure Handler Design - Failure handler implementation
- AI-Assisted Recovery Design - AI-assisted error classification
- Workflow Graph - Dependency graph implementation
- Interface Architecture - Interface design patterns
Client
Torc provides client libraries in multiple languages for workflow management.
Rust Client (Primary)
The Rust client provides both CLI and library interfaces:
Workflow Creation
- Parse workflow specification files (JSON, JSON5, YAML, KDL)
- Expand parameterized job/file specifications
- Create all workflow components atomically via API calls
- Handle name-to-ID resolution for dependencies
Workflow Manager
- Start/restart/reinitialize workflow execution
- Track file changes and update database
- Detect changed user_data inputs
- Validate workflow state before initialization
API Integration
- Auto-generated client from OpenAPI spec
- Pagination support for large result sets
- Retry logic and error handling
Client Modes
The Rust client operates in multiple modes:
- CLI Mode - Command-line interface for interactive use
- Library Mode - Programmatic API for integration with other tools
- Specification Parser - Reads and expands workflow specifications
- API Client - HTTP client for communicating with the server
Python Client
The Python client (torc package) provides programmatic workflow management for Python users:
- OpenAPI-generated client for full API access
- `make_api()` helper for easy server connection
- `map_function_to_jobs()` for mapping Python functions across parameters
- Integration with Python data science and ML pipelines
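A compressed sketch of that flow, following the map-function tutorial earlier in this book (the module name, function, and parameters below are placeholders):

```python
import getpass

from torc import make_api, map_function_to_jobs
from torc.openapi_client import WorkflowModel

api = make_api("http://localhost:8080/torc-service/v1")
workflow = api.create_workflow(
    WorkflowModel(user=getpass.getuser(), name="example", description="demo")
)
params = [{"x": 1}, {"x": 2}, {"x": 3}]
# Map my_module.run across the parameter sets; see the tutorial for the full version.
jobs = map_function_to_jobs(api, workflow.id, "my_module", "run", params)
```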
See Creating Workflows for usage examples.
Julia Client
The Julia client (Torc.jl package) provides programmatic workflow management for Julia users:
- OpenAPI-generated client for full API access
- `make_api()` helper for easy server connection
- `send_api_command()` wrapper with error handling
- `add_jobs()` for batch job creation
- `map_function_to_jobs()` for mapping Julia functions across parameters
See Creating Workflows for usage examples.
Workflow Actions
Workflow actions enable automatic execution of commands and resource allocation in response to workflow lifecycle events. Actions provide hooks for setup, teardown, monitoring, and dynamic resource management throughout workflow execution.
Overview
Actions consist of three components:
- Trigger - The condition that activates the action
- Action Type - The operation to perform
- Configuration - Parameters specific to the action
actions:
- trigger_type: "on_workflow_start"
action_type: "run_commands"
commands:
- "mkdir -p output logs"
- "echo 'Workflow started' > logs/status.txt"
Trigger Types
Workflow Lifecycle Triggers
on_workflow_start
Executes once when the workflow is initialized.
When it fires: During initialize_jobs after jobs are transitioned from uninitialized to
ready/blocked states.
Typical use cases:
- Scheduling Slurm allocations
- Creating directory structures
- Copying initial data
- trigger_type: "on_workflow_start"
action_type: "run_commands"
commands:
- "mkdir -p output checkpoints temp"
- "echo 'Workflow started at $(date)' > workflow.log"
on_workflow_complete
Executes once when all jobs reach terminal states (completed, failed, or canceled).
When it fires: After the last job completes, as detected by the job runner.
Typical use cases:
- Archiving final results
- Uploading to remote storage
- Cleanup of temporary files
- Generating summary reports
- trigger_type: "on_workflow_complete"
action_type: "run_commands"
commands:
- "tar -czf results.tar.gz output/"
- "aws s3 cp results.tar.gz s3://bucket/results/"
- "rm -rf temp/"
Job-Based Triggers
on_jobs_ready
Executes when all specified jobs transition to the "ready" state.
When it fires: When the last specified job becomes ready to execute (all dependencies satisfied).
Typical use cases:
- Scheduling Slurm allocations
- Starting phase-specific monitoring
- Pre-computation setup
- Notifications before expensive operations
- trigger_type: "on_jobs_ready"
action_type: "schedule_nodes"
jobs: ["train_model_001", "train_model_002", "train_model_003"]
scheduler: "gpu_cluster"
scheduler_type: "slurm"
num_allocations: 2
Important: The action triggers only when all matching jobs are ready, not individually as each job becomes ready.
on_jobs_complete
Executes when all specified jobs reach terminal states (completed, failed, or canceled).
When it fires: When the last specified job finishes execution.
Typical use cases:
- Scheduling Slurm allocations
- Cleaning up intermediate files
- Archiving phase results
- Freeing storage space
- Phase-specific reporting
- trigger_type: "on_jobs_complete"
action_type: "run_commands"
jobs: ["preprocess_1", "preprocess_2", "preprocess_3"]
commands:
- "echo 'Preprocessing phase complete' >> workflow.log"
- "rm -rf raw_data/"
Worker Lifecycle Triggers
Worker lifecycle triggers are persistent by default, meaning they execute once per worker (job runner), not once per workflow.
on_worker_start
Executes when each worker (job runner) starts.
When it fires: After a job runner starts and checks for workflow actions, before claiming any jobs.
Typical use cases:
- Worker-specific initialization
- Setting up worker-local logging
- Copying data to compute node local storage
- Initializing worker-specific resources
- Recording worker startup metrics
- trigger_type: "on_worker_start"
action_type: "run_commands"
persistent: true # Each worker executes this
commands:
- "echo 'Worker started on $(hostname) at $(date)' >> worker.log"
- "mkdir -p worker_temp"
on_worker_complete
Executes when each worker completes (exits the main loop).
When it fires: After a worker finishes processing jobs and before it shuts down.
Typical use cases:
- Worker-specific cleanup
- Uploading worker-specific logs
- Recording worker completion metrics
- Cleaning up worker-local resources
- trigger_type: "on_worker_complete"
action_type: "run_commands"
persistent: true # Each worker executes this
commands:
- "echo 'Worker completed on $(hostname) at $(date)' >> worker.log"
- "rm -rf worker_temp"
Job Selection
For on_jobs_ready and on_jobs_complete triggers, specify which jobs to monitor.
Exact Job Names
- trigger_type: "on_jobs_complete"
action_type: "run_commands"
jobs: ["job1", "job2", "job3"]
commands:
- "echo 'Specific jobs complete'"
Regular Expressions
- trigger_type: "on_jobs_ready"
action_type: "schedule_nodes"
job_name_regexes: ["train_model_[0-9]+", "eval_.*"]
scheduler: "gpu_cluster"
scheduler_type: "slurm"
num_allocations: 2
Common regex patterns:
"train_.*"- All jobs starting with "train_""model_[0-9]+"- Jobs like "model_1", "model_2"".*_stage1"- All jobs ending with "_stage1""job_(a|b|c)"- Jobs "job_a", "job_b", or "job_c"
Combining Selection Methods
You can use both together - the action triggers when all matching jobs meet the condition:
jobs: ["critical_job"]
job_name_regexes: ["batch_.*"]
# Triggers when "critical_job" AND all "batch_*" jobs are ready/complete
Action Types
run_commands
Execute shell commands sequentially on a compute node.
Configuration:
- trigger_type: "on_workflow_complete"
action_type: "run_commands"
commands:
- "tar -czf results.tar.gz output/"
- "aws s3 cp results.tar.gz s3://bucket/"
Execution details:
- Commands run in the workflow's output directory
- Commands execute sequentially (one after another)
- If a command fails, the action fails (but workflow continues)
- Commands run on compute nodes, not the submission node
- Uses the shell environment of the job runner process
schedule_nodes
Dynamically allocate compute resources from a Slurm scheduler.
Configuration:
- trigger_type: "on_jobs_ready"
action_type: "schedule_nodes"
jobs: ["train_model_1", "train_model_2"]
scheduler: "gpu_cluster"
scheduler_type: "slurm"
num_allocations: 2
start_one_worker_per_node: true
max_parallel_jobs: 8
Parameters:
- `scheduler` (required) - Name of the Slurm scheduler configuration (must exist in `slurm_schedulers`)
- `scheduler_type` (required) - Must be "slurm"
- `num_allocations` (required) - Number of Slurm allocation requests to submit
- `start_one_worker_per_node` (optional) - Start one job runner per node (default: false)
- `max_parallel_jobs` (optional) - Maximum concurrent jobs per runner
Use cases:
- Just-in-time resource allocation
- Cost optimization (allocate only when needed)
- Separating workflow phases with different resource requirements
Complete Examples
Refer to this example
Execution Model
Action Claiming and Execution
- Atomic Claiming: Actions are claimed atomically by workers to prevent duplicate execution
- Non-Persistent Actions: Execute once per workflow (first worker to claim executes)
- Persistent Actions: Can be claimed and executed by multiple workers
- Trigger Counting: Job-based triggers increment a counter as jobs transition; action becomes pending when count reaches threshold
- Immediate Availability: Worker lifecycle actions are immediately available after workflow initialization
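The trigger-counting rule can be pictured with a short sketch (conceptual only; the actual bookkeeping is done in the server's database, not in worker memory):

```python
# Conceptual illustration of trigger counting for an on_jobs_complete action.
class ActionTrigger:
    def __init__(self, watched_jobs: set[str]):
        self.watched = watched_jobs  # jobs selected by the action
        self.count = 0               # incremented as watched jobs reach terminal states

    def on_job_terminal(self, job_name: str) -> None:
        if job_name in self.watched:
            self.count += 1

    @property
    def pending(self) -> bool:
        # The action becomes claimable once every watched job has finished.
        return self.count >= len(self.watched)

trigger = ActionTrigger({"preprocess_1", "preprocess_2", "preprocess_3"})
for name in ("preprocess_1", "preprocess_2", "preprocess_3"):
    trigger.on_job_terminal(name)
assert trigger.pending
```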
Action Lifecycle
[Workflow Created]
↓
[initialize_jobs called]
↓
├─→ on_workflow_start actions become pending
├─→ on_worker_start actions become pending (persistent)
├─→ on_worker_complete actions become pending (persistent)
└─→ on_jobs_ready actions wait for job transitions
↓
[Worker Claims and Executes Actions]
↓
[Jobs Execute]
↓
[Jobs Complete]
↓
├─→ on_jobs_complete actions become pending when all specified jobs complete
└─→ on_workflow_complete actions become pending when all jobs complete
↓
[Workers Exit]
↓
[on_worker_complete actions execute per worker]
Important Characteristics
- No Rollback: Failed actions don't affect workflow execution
- Compute Node Execution: Actions run on compute nodes via job runners
- One-Time Triggers: Non-persistent actions trigger once when conditions are first met
- No Inter-Action Dependencies: Actions don't depend on other actions
- Concurrent Workers: Multiple workers can execute different actions simultaneously
Workflow Reinitialization
When a workflow is reinitialized (e.g., after resetting failed jobs), actions are reset to allow them to trigger again:
- Executed flags are cleared: All actions can be claimed and executed again
- Trigger counts are recalculated: For `on_jobs_ready` and `on_jobs_complete` actions, the trigger count is set based on current job states
Example scenario:
- job1 and job2 are independent jobs
- postprocess_job depends on both job1 and job2
- An `on_jobs_ready` action triggers when postprocess_job becomes ready
After first run completes:
- job1 fails, job2 succeeds
- User resets failed jobs and reinitializes
- job2 is already Completed, so it counts toward the trigger count
- When job1 completes in the second run, postprocess_job becomes ready
- The action triggers again because the trigger count reaches the required threshold
This ensures actions properly re-trigger after workflow reinitialization, even when some jobs remain in their completed state.
Limitations
- No Action Dependencies: Actions cannot depend on other actions completing
- No Conditional Execution: Actions cannot have conditional logic (use multiple actions with different job selections instead)
- No Action Retries: Failed actions are not automatically retried
- Single Action Type: Each action has one action_type (cannot combine run_commands and schedule_nodes)
- No Dynamic Job Selection: Job names/patterns are fixed at action creation time
For complex workflows requiring these features, consider:
- Using job dependencies to order operations
- Creating separate jobs for conditional logic
- Implementing retry logic within command scripts
- Creating multiple actions for different scenarios
Web Dashboard (torc-dash)
The torc-dash application is a web gateway that provides a browser-based UI for managing Torc
workflows. It bridges a web frontend with the torc ecosystem by proxying API requests and executing
CLI commands.
Architecture
flowchart LR
Browser["Browser<br/>(Web UI)"] <--> Dashboard["torc-dash<br/>(Gateway)"]
Dashboard <--> Server["torc-server<br/>(API)"]
Dashboard --> CLI["torc CLI<br/>(subprocess)"]
style Browser fill:#17a2b8,color:#fff
style Dashboard fill:#4a9eff,color:#fff
style Server fill:#28a745,color:#fff
style CLI fill:#4a9eff,color:#fff
The dashboard acts as a gateway layer that:
- Serves embedded static assets - HTML, CSS, and JavaScript bundled into the binary
- Proxies API requests - Forwards /torc-service/* requests to a remote torc-server
- Executes CLI commands - Runs torc CLI as subprocesses for complex operations
- Manages server lifecycle - Optionally spawns and manages a torc-server instance
Core Components
Embedded Static Assets
Uses the rust_embed crate to bundle all files from the static/ directory directly into the
binary at compile time:
#[derive(Embed)]
#[folder = "static/"]
struct Assets;
This enables single-binary deployment with no external file dependencies.
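A minimal sketch of how an embedded asset can be looked up and served, assuming the Assets type above plus the mime_guess crate; the handler shape is illustrative rather than the dashboard's actual routing code.

use axum::{
    body::Body,
    http::{header, StatusCode},
    response::Response,
};

// Look up a path in the embedded bundle and answer with a guessed MIME type.
fn serve_embedded(path: &str) -> Response {
    match Assets::get(path) {
        Some(file) => Response::builder()
            .header(
                header::CONTENT_TYPE,
                mime_guess::from_path(path).first_or_octet_stream().as_ref(),
            )
            .body(Body::from(file.data.into_owned()))
            .unwrap(),
        None => Response::builder()
            .status(StatusCode::NOT_FOUND)
            .body(Body::empty())
            .unwrap(),
    }
}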
Application State
Shared state across all request handlers:
struct AppState {
    api_url: String,                      // Remote torc-server URL
    client: reqwest::Client,              // HTTP client for proxying
    torc_bin: String,                     // Path to torc CLI binary
    torc_server_bin: String,              // Path to torc-server binary
    managed_server: Mutex<ManagedServer>, // Optional embedded server state
}
Standalone Mode
When launched with --standalone, torc-dash automatically spawns a torc-server subprocess:
- Starts torc-server with configurable port (0 for auto-detection)
- Reads TORC_SERVER_PORT=<port> from stdout to discover actual port
- Configures API URL to point to the managed server
- Tracks process ID for lifecycle management
This enables single-command deployment for local development or simple production setups.
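For local development that typically looks like a single command (only the --standalone flag described above is shown; other options are omitted):

torc-dash --standalone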
Request Routing
Static File Routes
| Route | Handler | Purpose |
|---|---|---|
| / | index_handler | Serves index.html |
| /static/* | static_handler | Serves embedded assets with MIME types |
API Proxy
All /torc-service/* requests are transparently proxied to the remote torc-server:
Browser: GET /torc-service/v1/workflows
↓
torc-dash: forwards to http://localhost:8080/torc-service/v1/workflows
↓
torc-server: responds with workflow list
↓
torc-dash: returns response to browser
The proxy preserves HTTP methods (GET, POST, PUT, PATCH, DELETE), headers, and request bodies.
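A minimal sketch of the forwarding step, assuming the method, headers, and body have already been pulled out of the incoming request; the names and error handling are simplified and not the dashboard's actual handler.

use reqwest::{header::HeaderMap, Client, Method};

// Replay the browser's request against the upstream torc-server and return
// the upstream status code and body bytes.
async fn forward(
    client: &Client,
    api_url: &str,        // e.g. the remote torc-server base URL
    path_and_query: &str, // e.g. /torc-service/v1/workflows
    method: Method,
    headers: HeaderMap,
    body: Vec<u8>,
) -> Result<(u16, Vec<u8>), reqwest::Error> {
    let resp = client
        .request(method, format!("{api_url}{path_and_query}"))
        .headers(headers)
        .body(body)
        .send()
        .await?;
    let status = resp.status().as_u16();
    let bytes = resp.bytes().await?.to_vec();
    Ok((status, bytes))
}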
CLI Command Endpoints
These endpoints execute the torc CLI as subprocesses, enabling operations that require local file
access or complex orchestration:
| Endpoint | CLI Command | Purpose |
|---|---|---|
| POST /api/cli/create | torc workflows create | Create workflow from spec file |
| POST /api/cli/run | torc workflows run | Run workflow locally |
| POST /api/cli/submit | torc workflows submit | Submit to scheduler |
| POST /api/cli/initialize | torc workflows initialize | Initialize job dependencies |
| POST /api/cli/delete | torc workflows delete | Delete workflow |
| POST /api/cli/reinitialize | torc workflows reinitialize | Reinitialize workflow |
| POST /api/cli/reset-status | torc workflows reset-status | Reset job statuses |
| GET /api/cli/run-stream | torc workflows run | SSE streaming execution |
Server Management Endpoints
| Endpoint | Purpose |
|---|---|
| POST /api/server/start | Start a managed torc-server |
| POST /api/server/stop | Stop the managed server |
| GET /api/server/status | Check server running status |
Utility Endpoints
| Endpoint | Purpose |
|---|---|
| POST /api/cli/read-file | Read local file contents |
| POST /api/cli/plot-resources | Generate resource plots from DB |
| POST /api/cli/list-resource-dbs | Find resource database files |
Key Features
Streaming Workflow Execution
The /api/cli/run-stream endpoint uses Server-Sent Events (SSE) to provide real-time feedback:
Event: start
Data: Running workflow abc123
Event: stdout
Data: Job job_1 started
Event: status
Data: Jobs: 3 running, 7 completed (total: 10)
Event: stdout
Data: Job job_1 completed
Event: end
Data: success
Event: exit_code
Data: 0
The stream includes:
- stdout/stderr from the torc CLI process
- Periodic status updates fetched from the API every 3 seconds
- Exit code when the process completes
CLI Execution Pattern
All CLI commands follow a consistent execution pattern:
async fn run_torc_command(torc_bin: &str, args: &[&str], api_url: &str) -> CliResponse {
    Command::new(torc_bin)
        .args(args)
        .env("TORC_API_URL", api_url) // Pass server URL to CLI
        .output()
        .await
}
Returns structured JSON:
{
"success": true,
"stdout": "Workflow created: abc123",
"stderr": "",
"exit_code": 0
}
Configuration Merging
Configuration is merged from multiple sources (highest to lowest priority):
- CLI arguments - Command-line flags
- Environment variables - TORC_API_URL, TORC_BIN, etc.
- Configuration file - TorcConfig from ~/.torc.toml or similar
Design Rationale
Why Proxy Instead of Direct API Access?
- CORS avoidance - Browser same-origin policy doesn't apply to server-side requests
- Authentication layer - Can add authentication/authorization without modifying torc-server
- Request transformation - Can modify requests/responses as needed
- Logging and monitoring - Centralized request logging
Why CLI Delegation?
Complex operations like workflow creation are delegated to the existing torc CLI rather than
reimplementing:
- Code reuse - Leverages tested CLI implementation
- Local file access - CLI can read workflow specs from the filesystem
- Consistent behavior - Same behavior as command-line usage
- Maintenance - Single implementation to maintain
Why Standalone Mode?
- Single-binary deployment - One command starts everything needed
- Development convenience - Quick local testing without separate server
- Port auto-detection - Avoids port conflicts with port 0 support
Server API Handler
The server is a Rust async web service built with Tokio and uses:
- Multi-threaded Tokio runtime for concurrent request handling
- Modular API structure with separate modules per resource type (workflows.rs, jobs.rs, files.rs, etc.)
- OpenAPI-generated types for consistent API contracts
- Database-level locking (BEGIN IMMEDIATE TRANSACTION) for critical sections
Key Endpoints
The server implements these key endpoints:
- POST /workflows - Create new workflows
- POST /workflows/{id}/initialize_jobs - Build dependency graph and mark jobs ready
- POST /workflows/{id}/claim_next_jobs - Thread-safe job allocation to workers
- POST /jobs/{id}/manage_status_change - Update job status with cascade effects
- POST /workflows/{id}/process_changed_job_inputs - Detect changed inputs and reset jobs
Thread Safety
The claim_next_jobs endpoint uses database-level write locks so that the same job cannot be allocated to more than one worker. This is critical for maintaining consistency in distributed execution.
API Organization
Each resource type (workflows, jobs, files, events, etc.) has its own module in
server/src/bin/server/api/, keeping the codebase organized and maintainable. The main routing
logic delegates to these specialized modules.
Central Database
The SQLite database is the heart of Torc's coordination model. All workflow state lives in the database, enabling:
- Stateless clients and workers - All state persists in the database
- Multiple concurrent workers - Workers coordinate through database locks
- Fault tolerance - Workers can crash and restart; state is preserved
- Workflow resumption - Workflows can be stopped and restarted without losing progress
Core Database Tables
- workflow - Top-level workflow records with name, user, description
- workflow_status - Workflow execution state (run_id, status)
- job - Individual computational tasks with commands and status
- job_internal - Internal job data (input hashes for change detection)
- job_depends_on - Explicit and implicit job dependencies
- file - File artifacts with paths and modification times
- user_data - JSON data artifacts for passing information between jobs
- job_input_file, job_output_file - Job-file relationships
- job_input_user_data, job_output_user_data - Job-user_data relationships
- resource_requirements - CPU, memory, GPU, runtime specifications
- compute_node - Available compute resources
- scheduled_compute_node - Compute nodes allocated to workflows
- local_scheduler, slurm_scheduler - Execution environment configurations
- result - Job execution results (exit code, stdout, stderr)
- event - Audit log of workflow events
Foreign Key Cascades
The schema uses foreign key constraints with cascading deletes. Deleting a workflow automatically removes all associated jobs, files, events, and other related records, ensuring referential integrity.
Concurrency Model
SQLite uses database-level locking with BEGIN IMMEDIATE TRANSACTION to prevent race conditions in
critical sections, particularly during job allocation when multiple workers request jobs
simultaneously.
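A minimal sketch of that pattern with sqlx, assuming a raw SQLite connection; the real claim logic selects ready jobs and marks them pending inside the locked section.

use sqlx::SqliteConnection;

// BEGIN IMMEDIATE acquires the write lock up front, serializing workers
// that try to claim jobs at the same time.
async fn claim_jobs_exclusively(conn: &mut SqliteConnection) -> Result<(), sqlx::Error> {
    sqlx::query("BEGIN IMMEDIATE TRANSACTION").execute(&mut *conn).await?;
    // ... select ready jobs and assign them to the requesting worker ...
    sqlx::query("COMMIT").execute(&mut *conn).await?;
    Ok(())
}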
Workflow Recovery
Torc provides mechanisms for recovering workflows when Slurm allocations are preempted or fail
before completing all jobs. The torc slurm regenerate command creates new schedulers and
allocations for pending jobs.
The Recovery Problem
When running workflows on Slurm, allocations can fail or be preempted before all jobs complete. This leaves workflows in a partial state with:
- Ready/uninitialized jobs - Jobs that were waiting to run but never got scheduled
- Blocked jobs - Jobs whose dependencies haven't completed yet
- Orphaned running jobs - Jobs still marked as "running" in the database even though their Slurm allocation has terminated
Simply creating new Slurm schedulers and submitting allocations isn't enough because:
- Orphaned jobs block recovery: Jobs stuck in "running" status prevent the workflow from being considered complete, blocking recovery precondition checks
- Duplicate allocations: If the workflow had on_workflow_start actions to schedule nodes, those actions would fire again when the workflow is reinitialized, creating duplicate allocations
- Missing allocations for blocked jobs: Blocked jobs will eventually become ready, but there's no mechanism to schedule new allocations for them
Orphan Detection
Before recovery can proceed, orphaned jobs must be detected and their status corrected. This is
handled by the orphan detection module (src/client/commands/orphan_detection.rs).
How It Works
The orphan detection logic checks for three types of orphaned resources:
- Active allocations with terminated Slurm jobs: ScheduledComputeNodes marked as "active" in the database, but whose Slurm job is no longer running (verified via squeue)
- Pending allocations that disappeared: ScheduledComputeNodes marked as "pending" whose Slurm job no longer exists (cancelled or failed before starting)
- Running jobs with no active compute nodes: Jobs marked as "running" but with no active compute nodes to process them (fallback for non-Slurm cases)
flowchart TD
A[Start Orphan Detection] --> B[List active ScheduledComputeNodes]
B --> C{For each Slurm allocation}
C --> D[Check squeue for job status]
D --> E{Job still running?}
E -->|Yes| C
E -->|No| F[Find jobs on this allocation]
F --> G[Mark jobs as failed]
G --> H[Update ScheduledComputeNode to complete]
H --> C
C --> I[List pending ScheduledComputeNodes]
I --> J{For each pending allocation}
J --> K[Check squeue for job status]
K --> L{Job exists?}
L -->|Yes| J
L -->|No| M[Update ScheduledComputeNode to complete]
M --> J
J --> N[Check for running jobs with no active nodes]
N --> O[Mark orphaned jobs as failed]
O --> P[Done]
style A fill:#4a9eff,color:#fff
style B fill:#4a9eff,color:#fff
style C fill:#6c757d,color:#fff
style D fill:#4a9eff,color:#fff
style E fill:#6c757d,color:#fff
style F fill:#4a9eff,color:#fff
style G fill:#dc3545,color:#fff
style H fill:#4a9eff,color:#fff
style I fill:#4a9eff,color:#fff
style J fill:#6c757d,color:#fff
style K fill:#4a9eff,color:#fff
style L fill:#6c757d,color:#fff
style M fill:#4a9eff,color:#fff
style N fill:#4a9eff,color:#fff
style O fill:#dc3545,color:#fff
style P fill:#28a745,color:#fff
Integration Points
Orphan detection is integrated into two commands:
- torc recover: Runs orphan detection automatically as the first step before checking preconditions. This ensures that orphaned jobs don't block recovery.
- torc workflows sync-status: Standalone command to run orphan detection without triggering a full recovery. Useful for debugging or when you want to clean up orphaned jobs without submitting new allocations.
The torc watch Command
The torc watch command also performs orphan detection during its polling loop. When it detects
that no valid Slurm allocations exist (via a quick squeue check), it runs the full orphan
detection logic to clean up any orphaned jobs before checking if the workflow can make progress.
Recovery Actions
The recovery system uses ephemeral recovery actions to solve these problems.
How It Works
When torc slurm regenerate runs:
flowchart TD
A[torc slurm regenerate] --> B[Fetch pending jobs]
B --> C{Has pending jobs?}
C -->|No| D[Exit - nothing to do]
C -->|Yes| E[Build WorkflowGraph from pending jobs]
E --> F[Mark existing schedule_nodes actions as executed]
F --> G[Group jobs using scheduler_groups]
G --> H[Create schedulers for each group]
H --> I[Update jobs with scheduler assignments]
I --> J[Create on_jobs_ready recovery actions for deferred groups]
J --> K{Submit allocations?}
K -->|Yes| L[Submit Slurm allocations]
K -->|No| M[Done]
L --> M
style A fill:#4a9eff,color:#fff
style B fill:#4a9eff,color:#fff
style C fill:#6c757d,color:#fff
style D fill:#6c757d,color:#fff
style E fill:#4a9eff,color:#fff
style F fill:#4a9eff,color:#fff
style G fill:#4a9eff,color:#fff
style H fill:#4a9eff,color:#fff
style I fill:#4a9eff,color:#fff
style J fill:#ffc107,color:#000
style K fill:#6c757d,color:#fff
style L fill:#ffc107,color:#000
style M fill:#28a745,color:#fff
Step 1: Mark Existing Actions as Executed
All existing schedule_nodes actions are marked as executed using the claim_action API. This
prevents them from firing again and creating duplicate allocations:
sequenceDiagram
participant R as regenerate
participant S as Server
participant A as workflow_action table
R->>S: get_workflow_actions(workflow_id)
S-->>R: [action1, action2, ...]
loop For each schedule_nodes action
R->>S: claim_action(action_id)
S->>A: UPDATE executed=1, executed_at=NOW()
end
Step 2: Group Jobs Using WorkflowGraph
The system builds a WorkflowGraph from pending jobs and uses scheduler_groups() to group them by
(resource_requirements, has_dependencies). This aligns with the behavior of
torc workflows create-slurm:
- Jobs without dependencies: Can be scheduled immediately with on_workflow_start
- Jobs with dependencies (deferred): Need on_jobs_ready recovery actions to schedule when they become ready
flowchart TD
subgraph pending["Pending Jobs"]
A[Job A: no deps, rr=default]
B[Job B: no deps, rr=default]
C[Job C: depends on A, rr=default]
D[Job D: no deps, rr=gpu]
end
subgraph groups["Scheduler Groups"]
G1[Group 1: default, no deps<br/>Jobs: A, B]
G2[Group 2: default, has deps<br/>Jobs: C]
G3[Group 3: gpu, no deps<br/>Jobs: D]
end
A --> G1
B --> G1
C --> G2
D --> G3
style A fill:#4a9eff,color:#fff
style B fill:#4a9eff,color:#fff
style C fill:#ffc107,color:#000
style D fill:#17a2b8,color:#fff
style G1 fill:#28a745,color:#fff
style G2 fill:#28a745,color:#fff
style G3 fill:#28a745,color:#fff
Step 3: Create Recovery Actions for Deferred Groups
For groups with has_dependencies = true, the system creates on_jobs_ready recovery actions.
These actions:
- Have is_recovery = true to mark them as ephemeral
- Use a _deferred suffix in the scheduler name
- Trigger when the blocked jobs become ready
- Schedule additional Slurm allocations for those jobs
flowchart LR
subgraph workflow["Original Workflow"]
A[Job A: blocked] --> C[Job C: blocked]
B[Job B: blocked] --> C
end
subgraph actions["Recovery Actions"]
RA["on_jobs_ready: schedule_nodes<br/>job_ids: (A, B)<br/>is_recovery: true"]
RC["on_jobs_ready: schedule_nodes<br/>job_ids: (C)<br/>is_recovery: true"]
end
style A fill:#6c757d,color:#fff
style B fill:#6c757d,color:#fff
style C fill:#6c757d,color:#fff
style RA fill:#ffc107,color:#000
style RC fill:#ffc107,color:#000
Recovery Action Lifecycle
Recovery actions are ephemeral - they exist only during the recovery period:
stateDiagram-v2
[*] --> Created: regenerate creates action
Created --> Executed: Jobs become ready, action triggers
Executed --> Deleted: Workflow reinitialized
Created --> Deleted: Workflow reinitialized
classDef created fill:#ffc107,color:#000
classDef executed fill:#28a745,color:#fff
classDef deleted fill:#6c757d,color:#fff
class Created created
class Executed executed
class Deleted deleted
When a workflow is reinitialized (e.g., after resetting jobs), all recovery actions are deleted and original actions are reset to their initial state. This ensures a clean slate for the next run.
Database Schema
Recovery actions are tracked using the is_recovery column in the workflow_action table:
| Column | Type | Description |
|---|---|---|
| is_recovery | INTEGER | 0 = normal action, 1 = recovery action |
Behavior Differences
| Operation | Normal Actions | Recovery Actions |
|---|---|---|
| On reset_actions_for_reinitialize | Reset executed to 0 | Deleted entirely |
| Created by | Workflow spec | torc slurm regenerate |
| Purpose | Configured behavior | Temporary recovery |
Usage
# Regenerate schedulers for pending jobs
torc slurm regenerate <workflow_id> --account <account>
# With automatic submission
torc slurm regenerate <workflow_id> --account <account> --submit
# Using a specific HPC profile
torc slurm regenerate <workflow_id> --account <account> --profile kestrel
Implementation Details
The recovery logic is implemented in:
- src/client/commands/orphan_detection.rs: Shared orphan detection logic used by recover, watch, and workflows sync-status
- src/client/commands/recover.rs: Main recovery command implementation
- src/client/commands/slurm.rs: handle_regenerate function for Slurm scheduler regeneration
- src/client/workflow_graph.rs: WorkflowGraph::from_jobs() and scheduler_groups() methods
- src/server/api/workflow_actions.rs: reset_actions_for_reinitialize function
- migrations/20251225000000_add_is_recovery_to_workflow_action.up.sql: Schema migration
Key implementation notes:
- WorkflowGraph construction: A WorkflowGraph is built from pending jobs using from_jobs(), which reconstructs the dependency structure from depends_on_job_ids
- Scheduler grouping: Jobs are grouped using scheduler_groups() by (resource_requirements, has_dependencies), matching create-slurm behavior
- Deferred schedulers: Groups with dependencies get a _deferred suffix in the scheduler name
- Allocation calculation: Number of allocations is based on job count and resources per node
- Recovery actions: Only deferred groups (jobs with dependencies) get on_jobs_ready recovery actions
Failure Handler Design
This document describes the internal architecture and implementation of failure handlers in Torc. For a user-focused tutorial, see Configurable Failure Handlers.
Overview
Failure handlers provide per-job automatic retry logic based on exit codes. They allow workflows to recover from transient failures without manual intervention or workflow-level recovery heuristics.
flowchart LR
subgraph workflow["Workflow Specification"]
FH["failure_handlers:<br/>- name: handler1<br/> rules: [...]"]
JOB["jobs:<br/>- name: my_job<br/> failure_handler: handler1"]
end
subgraph server["Server"]
DB[(Database)]
API["REST API"]
end
subgraph client["Job Runner"]
RUNNER["JobRunner"]
RECOVERY["Recovery Logic"]
end
FH --> DB
JOB --> DB
RUNNER --> API
API --> DB
RUNNER --> RECOVERY
style FH fill:#4a9eff,color:#fff
style JOB fill:#4a9eff,color:#fff
style DB fill:#ffc107,color:#000
style API fill:#28a745,color:#fff
style RUNNER fill:#17a2b8,color:#fff
style RECOVERY fill:#dc3545,color:#fff
Problem Statement
When jobs fail, workflows traditionally have two recovery options:
- Manual intervention: User investigates and restarts failed jobs
- Workflow-level recovery: torc watch --recover applies heuristics based on detected failure patterns (OOM, timeout, etc.)
Neither approach handles application-specific failures where:
- The job itself knows why it failed (via exit code)
- A specific recovery action can fix the issue
- Immediate retry is appropriate
Failure handlers solve this by allowing jobs to define exit-code-specific retry behavior with optional recovery scripts.
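At the specification level this looks roughly like the sketch below. The field names mirror the failure_handlers/rules structure shown in the diagram below and the rule struct later in this section; the job command and recovery script path are placeholders, and the exact schema may differ.

failure_handlers:
  - name: handler1
    rules:
      # Retry these application-specific exit codes, after a cleanup script
      - exit_codes: [10, 11]
        max_retries: 3
        recovery_script: ./cleanup_scratch.sh   # placeholder path
      # Catch-all for any other non-zero exit code
      - match_all_exit_codes: true
        max_retries: 1

jobs:
  - name: my_job
    command: python simulate.py   # placeholder command
    failure_handler: handler1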
Architecture
Component Interaction
sequenceDiagram
participant WS as Workflow Spec
participant API as Server API
participant DB as Database
participant JR as JobRunner
participant RS as Recovery Script
participant JOB as Job Process
Note over WS,DB: Workflow Creation
WS->>API: Create workflow with failure_handlers
API->>DB: INSERT failure_handler
API->>DB: INSERT job with failure_handler_id
Note over JR,JOB: Job Execution
JR->>API: Claim job
JR->>JOB: Execute command
JOB-->>JR: Exit code (e.g., 10)
Note over JR,API: Failure Recovery
JR->>API: GET failure_handler
API->>DB: SELECT rules
DB-->>API: Rules JSON
API-->>JR: FailureHandlerModel
JR->>JR: Match exit code to rule
JR->>API: POST retry_job (reserves retry)
alt Recovery Script Defined
JR->>RS: Execute with env vars
RS-->>JR: Exit code
end
JR->>JR: Job returns to Ready queue
Data Model
erDiagram
WORKFLOW ||--o{ JOB : contains
WORKFLOW ||--o{ FAILURE_HANDLER : contains
FAILURE_HANDLER ||--o{ JOB : "referenced by"
WORKFLOW {
int id PK
string name
int status_id FK
}
FAILURE_HANDLER {
int id PK
int workflow_id FK
string name
string rules "JSON array"
}
JOB {
int id PK
int workflow_id FK
string name
string command
int status
int failure_handler_id FK "nullable"
int attempt_id "starts at 1"
}
Rule Matching
Failure handler rules are stored as a JSON array. When a job fails, rules are evaluated in a specific order to find a match.
Rule Structure
pub struct FailureHandlerRule {
    pub exit_codes: Vec<i32>,            // Specific codes to match
    pub match_all_exit_codes: bool,      // Catch-all flag
    pub recovery_script: Option<String>,
    pub max_retries: i32,                // Default: 3
}
Matching Priority
Rules are evaluated with specific matches taking priority over catch-all rules:
flowchart TD
START["Job fails with exit code X"]
SPECIFIC{"Find rule where<br/>exit_codes contains X?"}
CATCHALL{"Find rule where<br/>match_all_exit_codes = true?"}
FOUND["Rule matched"]
NONE["No matching rule<br/>Job marked Failed"]
START --> SPECIFIC
SPECIFIC -->|Found| FOUND
SPECIFIC -->|Not found| CATCHALL
CATCHALL -->|Found| FOUND
CATCHALL -->|Not found| NONE
style START fill:#dc3545,color:#fff
style SPECIFIC fill:#4a9eff,color:#fff
style CATCHALL fill:#ffc107,color:#000
style FOUND fill:#28a745,color:#fff
style NONE fill:#6c757d,color:#fff
This ensures that specific exit code handlers always take precedence, regardless of rule order in the JSON array.
Implementation (job_runner.rs):
let matching_rule = rules
    .iter()
    .find(|rule| rule.exit_codes.contains(&(exit_code as i32)))
    .or_else(|| rules.iter().find(|rule| rule.match_all_exit_codes));
Recovery Flow
The recovery process is designed to be atomic and safe:
flowchart TD
subgraph JobRunner["JobRunner (Client)"]
FAIL["Job fails"]
FETCH["Fetch failure handler"]
MATCH["Match rule to exit code"]
CHECK{"attempt_id<br/>< max_retries?"}
RESERVE["POST /jobs/{id}/retry/{run_id}<br/>Reserves retry slot"]
SCRIPT{"Recovery<br/>script defined?"}
RUN["Execute recovery script"]
DONE["Job queued for retry"]
FAILED["Mark job as Failed"]
end
subgraph Server["Server (API)"]
VALIDATE["Validate run_id matches"]
STATUS["Check job status"]
MAX["Validate max_retries"]
UPDATE["UPDATE job<br/>status=Ready<br/>attempt_id += 1"]
EVENT["INSERT event record"]
COMMIT["COMMIT transaction"]
end
FAIL --> FETCH
FETCH --> MATCH
MATCH --> CHECK
CHECK -->|Yes| RESERVE
CHECK -->|No| FAILED
RESERVE --> VALIDATE
VALIDATE --> STATUS
STATUS --> MAX
MAX --> UPDATE
UPDATE --> EVENT
EVENT --> COMMIT
COMMIT --> SCRIPT
SCRIPT -->|Yes| RUN
SCRIPT -->|No| DONE
RUN -->|Success or Failure| DONE
style FAIL fill:#dc3545,color:#fff
style RESERVE fill:#4a9eff,color:#fff
style RUN fill:#ffc107,color:#000
style DONE fill:#28a745,color:#fff
style FAILED fill:#6c757d,color:#fff
style UPDATE fill:#17a2b8,color:#fff
style COMMIT fill:#17a2b8,color:#fff
Key Design Decisions
- Retry reservation before recovery script: The retry_job API is called before the recovery script runs. This ensures:
  - The retry slot is reserved atomically
  - Recovery scripts don't run for retries that won't happen
  - External resources modified by recovery scripts are consistent
- Recovery script failure is non-fatal: If the recovery script fails, the job is still retried. This prevents recovery script bugs from blocking legitimate retries.
- Transaction isolation: The retry_job API uses BEGIN IMMEDIATE to prevent race conditions where multiple processes might try to retry the same job.
API Endpoints
GET /failure_handlers/{id}
Fetches a failure handler by ID.
Response:
{
"id": 1,
"workflow_id": 42,
"name": "simulation_recovery",
"rules": "[{\"exit_codes\":[10,11],\"max_retries\":3}]"
}
POST /jobs/{id}/retry/{run_id}?max_retries=N
Retries a failed job by resetting its status to Ready.
Query Parameters:
- max_retries (required): Maximum retries allowed by the matching rule
Validations:
- Job must exist
- run_id must match workflow's current run
- Job status must be Running, Failed, or Terminated
- attempt_id must be less than max_retries
Transaction Safety:
BEGIN IMMEDIATE; -- Acquire write lock
SELECT j.*, ws.run_id as workflow_run_id
FROM job j
JOIN workflow w ON j.workflow_id = w.id
JOIN workflow_status ws ON w.status_id = ws.id
WHERE j.id = ?;
-- Validate conditions...
UPDATE job SET status = 2, attempt_id = ? WHERE id = ?;
INSERT INTO event (workflow_id, timestamp, data) VALUES (?, ?, ?);
COMMIT;
Response:
{
"id": 123,
"workflow_id": 42,
"name": "my_job",
"status": "ready",
"attempt_id": 2
}
Recovery Script Execution
Recovery scripts run in a subprocess with environment variables providing context:
flowchart LR
subgraph env["Environment Variables"]
WID["TORC_WORKFLOW_ID"]
JID["TORC_JOB_ID"]
JN["TORC_JOB_NAME"]
URL["TORC_API_URL"]
OUT["TORC_OUTPUT_DIR"]
AID["TORC_ATTEMPT_ID"]
RC["TORC_RETURN_CODE"]
end
subgraph script["Recovery Script"]
SHELL["bash -c<br/>(or cmd /C on Windows)"]
CODE["User script code"]
end
env --> SHELL
SHELL --> CODE
style WID fill:#4a9eff,color:#fff
style JID fill:#4a9eff,color:#fff
style JN fill:#4a9eff,color:#fff
style URL fill:#4a9eff,color:#fff
style OUT fill:#4a9eff,color:#fff
style AID fill:#ffc107,color:#000
style RC fill:#dc3545,color:#fff
style SHELL fill:#6c757d,color:#fff
style CODE fill:#28a745,color:#fff
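A minimal sketch of the subprocess launch, using the same tokio Command style as the CLI execution pattern earlier; only a subset of the environment variables from the diagram is shown, and the function shape is illustrative.

use tokio::process::Command;

// Run the rule's recovery script through a shell with job context in the
// environment. A non-zero exit is logged but does not cancel the retry.
async fn run_recovery_script(
    script: &str,
    workflow_id: i64,
    job_id: i64,
    job_name: &str,
    attempt_id: i64,
    return_code: i32,
) -> std::io::Result<std::process::ExitStatus> {
    Command::new("bash")
        .arg("-c")
        .arg(script)
        .env("TORC_WORKFLOW_ID", workflow_id.to_string())
        .env("TORC_JOB_ID", job_id.to_string())
        .env("TORC_JOB_NAME", job_name)
        .env("TORC_ATTEMPT_ID", attempt_id.to_string())
        .env("TORC_RETURN_CODE", return_code.to_string())
        .status()
        .await
}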
Log File Naming
Each job attempt produces separate log files to preserve history:
output/job_stdio/
├── job_wf{W}_j{J}_r{R}_a1.o # Attempt 1 stdout
├── job_wf{W}_j{J}_r{R}_a1.e # Attempt 1 stderr
├── job_wf{W}_j{J}_r{R}_a2.o # Attempt 2 stdout
├── job_wf{W}_j{J}_r{R}_a2.e # Attempt 2 stderr
└── ...
Where:
- W = workflow_id
- J = job_id
- R = run_id
- aN = attempt number
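The naming scheme amounts to a simple format string; a small illustrative helper (not the actual Torc function) might look like this:

use std::path::PathBuf;

// Each attempt gets its own stdout (.o) and stderr (.e) file, so logs from
// earlier attempts are never overwritten.
fn attempt_log_paths(
    output_dir: &str,
    workflow_id: i64,
    job_id: i64,
    run_id: i64,
    attempt: i64,
) -> (PathBuf, PathBuf) {
    let stem = format!("job_wf{workflow_id}_j{job_id}_r{run_id}_a{attempt}");
    let dir = PathBuf::from(output_dir).join("job_stdio");
    (dir.join(format!("{stem}.o")), dir.join(format!("{stem}.e")))
}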
Database Schema
failure_handler Table
CREATE TABLE failure_handler (
id INTEGER PRIMARY KEY AUTOINCREMENT,
workflow_id INTEGER NOT NULL REFERENCES workflow(id) ON DELETE CASCADE,
name TEXT NOT NULL,
rules TEXT NOT NULL, -- JSON array of FailureHandlerRule
UNIQUE(workflow_id, name)
);
job Table (relevant columns)
ALTER TABLE job ADD COLUMN failure_handler_id INTEGER
REFERENCES failure_handler(id) ON DELETE SET NULL;
ALTER TABLE job ADD COLUMN attempt_id INTEGER NOT NULL DEFAULT 1;
Slurm Integration
When a job is retried, it returns to the Ready queue and will be picked up by any available compute node. For Slurm workflows, this may require additional allocations if existing nodes have terminated.
flowchart TD
RETRY["Job retried<br/>(status = Ready)"]
CHECK{"Compute nodes<br/>available?"}
RUN["Job runs on<br/>existing allocation"]
SCHEDULE["Auto-schedule triggers<br/>new Slurm allocation"]
WAIT["Job waits for<br/>allocation to start"]
EXEC["Job executes"]
RETRY --> CHECK
CHECK -->|Yes| RUN
CHECK -->|No| SCHEDULE
SCHEDULE --> WAIT
WAIT --> EXEC
RUN --> EXEC
style RETRY fill:#28a745,color:#fff
style CHECK fill:#6c757d,color:#fff
style RUN fill:#4a9eff,color:#fff
style SCHEDULE fill:#ffc107,color:#000
style WAIT fill:#17a2b8,color:#fff
style EXEC fill:#28a745,color:#fff
If auto_schedule_on_ready_jobs actions are configured, new Slurm allocations will be created
automatically when retried jobs become ready. See Workflow Actions for
details.
Implementation Files
| File | Purpose |
|---|---|
| src/client/job_runner.rs | try_recover_job(), rule matching |
| src/client/utils.rs | shell_command() cross-platform shell |
| src/server/api/jobs.rs | retry_job() API endpoint |
| src/server/api/failure_handlers.rs | CRUD operations for failure handlers |
| src/client/workflow_spec.rs | Parsing failure handlers from specs |
| migrations/20260110*.sql | Database schema for failure handlers |
Comparison with Workflow Recovery
| Aspect | Failure Handlers | Workflow Recovery (torc watch) |
|---|---|---|
| Scope | Per-job | Workflow-wide |
| Trigger | Specific exit codes | OOM detection, timeout patterns |
| Timing | Immediate (during job run) | After job completion |
| Recovery Action | Custom scripts | Resource adjustment, resubmission |
| Configuration | In workflow spec | Command-line flags |
| State | Preserved (same workflow run) | May start new run |
| Slurm | Reuses or auto-schedules nodes | Creates new schedulers |
Recommendation: Use both mechanisms together:
- Failure handlers for immediate, exit-code-specific recovery
- torc watch --recover for workflow-level resource adjustments and allocation recovery
Recovery Outcome and pending_failed Status
When try_recover_job is called, it returns a RecoveryOutcome enum that determines the final job
status:
pub enum RecoveryOutcome {
    /// Job was successfully scheduled for retry
    Retried,
    /// No failure handler defined - use PendingFailed status
    NoHandler,
    /// Failure handler exists but no rule matched - use PendingFailed status
    NoMatchingRule,
    /// Max retries exceeded - use Failed status
    MaxRetriesExceeded,
    /// API call or other error - use Failed status
    Error(String),
}
Status Assignment Flow
flowchart TD
FAIL["Job fails"]
TRY["try_recover_job()"]
RETRIED{"Outcome?"}
READY["Status: ready<br/>attempt_id += 1"]
PENDING["Status: pending_failed"]
FAILED["Status: failed"]
FAIL --> TRY
TRY --> RETRIED
RETRIED -->|Retried| READY
RETRIED -->|NoHandler / NoMatchingRule| PENDING
RETRIED -->|MaxRetriesExceeded / Error| FAILED
style FAIL fill:#dc3545,color:#fff
style READY fill:#28a745,color:#fff
style PENDING fill:#ffc107,color:#000
style FAILED fill:#6c757d,color:#fff
pending_failed Status (value 10)
The pending_failed status is a new job state that indicates:
- The job failed with a non-zero exit code
- No failure handler rule matched the exit code
- The job is awaiting classification (retry or fail)
Key properties:
- Not terminal: Workflow is not considered complete while jobs are pending_failed
- Downstream blocked: Dependent jobs remain in blocked status (not canceled)
- Resettable: reset-status --failed-only includes pending_failed jobs
Integration with AI-Assisted Recovery
Jobs in pending_failed status can be classified by an AI agent using MCP tools:
sequenceDiagram
participant JR as JobRunner
participant API as Torc API
participant MCP as torc-mcp-server
participant AI as AI Agent
JR->>API: complete_job(status=pending_failed)
Note over JR,API: Job awaiting classification
AI->>MCP: list_pending_failed_jobs(workflow_id)
MCP->>API: GET /jobs?status=pending_failed
API-->>MCP: Jobs with stderr content
MCP-->>AI: Pending jobs + stderr
AI->>AI: Analyze error patterns
AI->>MCP: classify_and_resolve_failures(classifications)
alt action = retry
MCP->>API: PUT /jobs/{id} status=ready
Note over API: Triggers re-execution
else action = fail
MCP->>API: PUT /jobs/{id} status=failed
Note over API: Triggers downstream cancellation
end
See AI-Assisted Recovery Design for full details.
AI-Assisted Recovery Design
🧪 EXPERIMENTAL: This feature is new and not yet well-tested. The API and behavior may change based on user feedback.
This document describes the architecture and implementation of AI-assisted failure recovery in Torc. For a user-focused tutorial, see AI-Assisted Failure Recovery.
Overview
AI-assisted recovery enables intelligent classification of job failures that can't be handled by
rule-based mechanisms (failure handlers, OOM/timeout detection). It introduces a new job status
(pending_failed) that defers the fail/retry decision to an AI agent.
flowchart TD
subgraph traditional["Traditional Recovery"]
FAIL1["Job fails"]
HANDLER{"Failure handler?"}
OOM{"OOM/timeout?"}
FAILED1["Status: failed"]
RETRY1["Retry"]
end
subgraph ai["AI-Assisted Recovery"]
FAIL2["Job fails"]
PENDING["Status: pending_failed"]
AGENT["AI agent classifies"]
FAILED2["Status: failed"]
RETRY2["Retry"]
end
FAIL1 --> HANDLER
HANDLER -->|Match| RETRY1
HANDLER -->|No match| OOM
OOM -->|Yes| RETRY1
OOM -->|No| FAILED1
FAIL2 --> PENDING
PENDING --> AGENT
AGENT -->|Permanent| FAILED2
AGENT -->|Transient| RETRY2
style FAIL1 fill:#dc3545,color:#fff
style FAIL2 fill:#dc3545,color:#fff
style PENDING fill:#ffc107,color:#000
style AGENT fill:#4a9eff,color:#fff
style FAILED1 fill:#6c757d,color:#fff
style FAILED2 fill:#6c757d,color:#fff
style RETRY1 fill:#28a745,color:#fff
style RETRY2 fill:#28a745,color:#fff
Problem Statement
Current recovery mechanisms have blind spots:
- Failure handlers: Require predefined exit codes. Many failures use generic exit code 1.
- OOM/timeout detection: Only handles resource exhaustion patterns.
- --retry-unknown: Blindly retries all failures, wasting compute on unfixable bugs.
Real-world failures often require contextual analysis:
| Error | Analysis Required | Decision |
|---|---|---|
| Connection refused to storage.internal:443 | Was the storage server down? | Retry if transient |
| NCCL timeout after 1800 seconds | Is this a node failure or code bug? | Retry if node issue |
| SyntaxError: invalid syntax | Is the code broken? | Fail - needs fix |
| FileNotFoundError: input.csv | Missing input or wrong path? | Depends on context |
AI agents can analyze stderr, correlate with external systems, and make informed decisions.
Architecture
Component Overview
flowchart LR
subgraph client["Torc Client"]
RUNNER["JobRunner"]
WATCH["torc watch"]
RECOVER["torc recover"]
end
subgraph server["Torc Server"]
API["REST API"]
DB[(SQLite)]
end
subgraph mcp["MCP Layer"]
MCPSRV["torc-mcp-server"]
CUSTOM["Custom MCP servers"]
end
subgraph agent["AI Agent"]
LLM["Claude/Copilot/Custom"]
end
RUNNER --> API
WATCH --> RECOVER
RECOVER --> API
API --> DB
MCPSRV --> API
LLM --> MCPSRV
LLM --> CUSTOM
style RUNNER fill:#17a2b8,color:#fff
style WATCH fill:#17a2b8,color:#fff
style RECOVER fill:#17a2b8,color:#fff
style API fill:#28a745,color:#fff
style DB fill:#ffc107,color:#000
style MCPSRV fill:#4a9eff,color:#fff
style LLM fill:#dc3545,color:#fff
Data Flow
sequenceDiagram
participant JR as JobRunner
participant API as Torc API
participant DB as Database
participant MCP as torc-mcp-server
participant AI as AI Agent
Note over JR,DB: Job Failure
JR->>JR: Job exits with code 1
JR->>JR: No failure handler match
JR->>API: complete_job(status=pending_failed)
API->>DB: UPDATE job SET status=10
Note over AI,DB: AI Classification
AI->>MCP: list_pending_failed_jobs(workflow_id)
MCP->>API: GET /jobs?status=pending_failed
API->>DB: SELECT * FROM job WHERE status=10
DB-->>API: Jobs with pending_failed
API-->>MCP: Job list
MCP->>MCP: Read stderr files
MCP-->>AI: Jobs + stderr content
AI->>AI: Analyze errors
AI->>MCP: classify_and_resolve_failures(classifications)
alt action = retry
MCP->>API: PUT /jobs/{id} status=ready
API->>DB: UPDATE job SET status=2, attempt_id+=1
else action = fail
MCP->>API: PUT /jobs/{id} status=failed
API->>DB: UPDATE job SET status=6
Note over API,DB: Triggers downstream cancellation
end
Job Status: pending_failed
Status Values
| Value | Name | Description |
|---|---|---|
| 0 | uninitialized | Not yet initialized |
| 1 | blocked | Waiting on dependencies |
| 2 | ready | Ready to run |
| 3 | pending | Claimed by worker |
| 4 | running | Currently executing |
| 5 | completed | Finished successfully |
| 6 | failed | Failed (terminal) |
| 7 | canceled | Canceled by user |
| 8 | terminated | Killed by signal |
| 9 | disabled | Skipped |
| 10 | pending_failed | Awaiting AI classification |
Status Transitions
stateDiagram-v2
[*] --> uninitialized
uninitialized --> blocked : initialize
uninitialized --> ready : no dependencies
blocked --> ready : dependencies met
ready --> pending : claimed
pending --> running : started
running --> completed : exit 0
running --> failed : handler match + max retries
running --> pending_failed : no handler match
running --> ready : failure handler match
running --> terminated : signal
state "pending_failed" as pending_failed
pending_failed --> failed : AI classifies permanent
pending_failed --> ready : AI classifies transient
pending_failed --> uninitialized : reset-status
failed --> [*]
completed --> [*]
canceled --> [*]
terminated --> [*]
Workflow Completion Semantics
A workflow with pending_failed jobs is not complete:
fn is_workflow_complete(workflow_id: i64) -> bool {
    // Jobs in these statuses are "complete"
    let complete_statuses = [
        JobStatus::Completed,
        JobStatus::Failed,
        JobStatus::Canceled,
        JobStatus::Terminated,
        JobStatus::Disabled,
    ];
    // pending_failed is NOT in this list,
    // so workflows with pending_failed jobs are incomplete
    !jobs.iter().any(|j| !complete_statuses.contains(&j.status))
}
This ensures:
- torc watch continues monitoring
- Downstream jobs remain blocked (not canceled)
- The workflow doesn't appear "done" prematurely
Recovery Outcome Enum
The try_recover_job function returns detailed outcomes:
pub enum RecoveryOutcome {
    /// Job was successfully scheduled for retry
    Retried,
    /// No failure handler defined - use PendingFailed status
    NoHandler,
    /// Failure handler exists but no rule matched - use PendingFailed status
    NoMatchingRule,
    /// Max retries exceeded - use Failed status
    MaxRetriesExceeded,
    /// API call or other error - use Failed status
    Error(String),
}
Usage in handle_job_completion:
match self.try_recover_job(job_id, ...) {
    RecoveryOutcome::Retried => {
        // Job queued for retry, clean up
        return;
    }
    RecoveryOutcome::NoHandler | RecoveryOutcome::NoMatchingRule => {
        // Check if workflow has use_pending_failed enabled
        if self.workflow.use_pending_failed.unwrap_or(false) {
            // Use pending_failed for AI classification
            final_result.status = JobStatus::PendingFailed;
        } else {
            // Use failed status (default behavior)
            // (status already Failed)
        }
    }
    RecoveryOutcome::MaxRetriesExceeded | RecoveryOutcome::Error(_) => {
        // Use failed - no recovery possible
        // (status already Failed)
    }
}
Enabling AI-Assisted Recovery
AI-assisted recovery is opt-in per workflow using the use_pending_failed flag. By default,
jobs that fail without a matching failure handler get the Failed status.
Workflow Specification
Add use_pending_failed: true to your workflow spec to enable:
name: ml_training
use_pending_failed: true # Enable AI-assisted recovery
jobs:
- name: train_model
command: python train.py
Without this flag (or with use_pending_failed: false), jobs use the traditional behavior:
- Failure handler match → retry
- No failure handler → Failed status
- Max retries exceeded → Failed status
With use_pending_failed: true:
- Failure handler match → retry
- No failure handler → PendingFailed status (awaiting AI classification)
- Max retries exceeded → Failed status
Why Opt-In?
The default behavior prioritizes predictability and backward compatibility:
- Existing workflows continue to work - no breaking changes
- Clear failure semantics - jobs either retry or fail immediately
- No external dependencies - doesn't require AI agent integration
Opt-in when you want:
- Intelligent classification of ambiguous failures
- Human/AI review before retry decisions
- Reduced compute waste from blind retries
MCP Server Tools
list_pending_failed_jobs
Lists jobs awaiting classification with their stderr content.
Implementation:
pub fn list_pending_failed_jobs(
    config: &Configuration,
    workflow_id: i64,
    output_dir: &Path,
) -> Result<CallToolResult, McpError> {
    // 1. Query jobs with pending_failed status
    let jobs = paginate_jobs(
        config,
        workflow_id,
        JobListParams::new().with_status(JobStatus::PendingFailed),
    );

    // 2. For each job, fetch result and read stderr tail
    for job in &jobs {
        let result = get_latest_result(job.id);
        let stderr_path = get_job_stderr_path(output_dir, ...);
        let stderr_tail = read_last_n_lines(&stderr_path, 50);
        // Include in response
    }

    // 3. Return structured response with guidance
}
classify_and_resolve_failures
Applies AI classifications to jobs.
Classification struct:
pub struct FailureClassification {
    pub job_id: i64,
    pub action: String,            // "retry" or "fail"
    pub memory: Option<String>,    // Optional resource adjustment
    pub runtime: Option<String>,
    pub reason: Option<String>,    // For audit trail
}
Implementation:
pub fn classify_and_resolve_failures(
    config: &Configuration,
    workflow_id: i64,
    classifications: Vec<FailureClassification>,
    dry_run: bool,
) -> Result<CallToolResult, McpError> {
    // 0. Validate workflow has use_pending_failed enabled
    let workflow = get_workflow(config, workflow_id)?;
    if !workflow.use_pending_failed.unwrap_or(false) {
        return Err(invalid_params(
            "Workflow does not have use_pending_failed enabled"
        ));
    }

    for classification in &classifications {
        // 1. Validate job is in pending_failed status
        // 2. Apply resource adjustments if specified
        // 3. Set status based on action:
        //    - "retry": status = ready, attempt_id += 1
        //    - "fail": status = failed (triggers cascade)
    }
}
Validation:
The tool validates that the workflow has use_pending_failed: true before allowing any
classifications. This prevents accidental modification of workflows that don't opt into AI-assisted
recovery.
Integration with reset-status
The reset-status --failed-only command also resets pending_failed jobs:
-- reset_failed_jobs_only query
SELECT id, status FROM job
WHERE workflow_id = $1
AND status IN (
$failed_status,
$canceled_status,
$terminated_status,
$pending_failed_status -- Added
)
This allows users to reset pending_failed jobs without AI classification if desired.
Error Classification Patterns
The AI agent should recognize common patterns:
Transient Errors
const TRANSIENT_PATTERNS: &[&str] = &[
    // Network
    "Connection refused",
    "Connection timed out",
    "Network is unreachable",
    "DNS resolution failed",
    "Service Unavailable",
    // GPU/HPC
    "NCCL timeout",
    "GPU communication error",
    "CUDA out of memory", // Could be transient if memory is shared
    // Hardware
    "EIO",
    "Input/output error",
    // Slurm
    "PREEMPTED",
    "NODE_FAIL",
    "TIMEOUT", // Slurm walltime, not job timeout
];
Permanent Errors
const PERMANENT_PATTERNS: &[&str] = &[
    // Python
    "SyntaxError",
    "IndentationError",
    "ModuleNotFoundError",
    "ImportError",
    "NameError",
    "TypeError",
    "ValueError",
    // General
    "FileNotFoundError", // For input files
    "PermissionDenied",
    "AssertionError",
    "IndexError",
    "KeyError",
];
These patterns are guidance for AI agents, not hard-coded rules. The AI can use context to
override (e.g., FileNotFoundError for a file that should be created by an upstream job might be
transient if the upstream job is being retried).
Slurm Integration
When pending_failed jobs are classified as "retry", they return to ready status. For Slurm
workflows:
- If active allocations exist, jobs may run immediately
- If no allocations, torc watch --auto-schedule will create new ones
- Manual recovery: torc slurm regenerate --submit
Design Decisions
Why a New Status vs. a Flag?
Alternative considered: Add needs_classification: bool flag to jobs.
Decision: New status is cleaner because:
- Status is already used for state machine transitions
- is_workflow_complete naturally excludes pending_failed
- No schema changes to existing status column
- Clearer semantics in logs and UI
Why Defer to AI vs. Built-in Heuristics?
Alternative considered: Build pattern matching into Torc directly.
Decision: AI-assisted approach because:
- Error patterns are domain-specific and evolving
- AI can use context (multiple errors, timing, external systems)
- Users can customize via custom MCP servers
- Avoids bloating Torc with error classification logic
Why Not Block on AI Response?
Alternative considered: Job runner waits for AI classification.
Decision: Asynchronous classification because:
- AI inference adds latency (seconds to minutes)
- AI service may be unavailable
- Human oversight is valuable for production workflows
- Jobs can accumulate for batch classification
CLI Integration
The torc recover and torc watch commands support automatic AI agent invocation:
Command-Line Options
| Option | Default | Description |
|---|---|---|
| --ai-recovery | false | Enable AI-assisted classification |
| --ai-agent | claude | AI agent CLI to invoke (currently only claude) |
Invocation Flow
When --ai-recovery is enabled:
pub fn invoke_ai_agent(workflow_id: i64, agent: &str, output_dir: &Path) -> Result<(), String> {
    // 1. Check if agent CLI is available (e.g., `which claude`)
    // 2. Build prompt with workflow context
    // 3. Spawn agent with --print flag for non-interactive mode
    // 4. Capture and log output
    // 5. Return success/failure
}
The prompt instructs the AI agent to:
- Call list_pending_failed_jobs to get jobs with stderr
- Analyze each job's error to classify as transient or permanent
- Call classify_and_resolve_failures with classifications
Agent Requirements
For the claude agent:
- Claude Code CLI must be installed (claude command in PATH)
- Torc MCP server must be configured in ~/.claude/mcp_servers.json
- The --print flag is used for non-interactive execution
Implementation Files
| File | Purpose |
|---|---|
| src/models.rs | JobStatus::PendingFailed enum variant |
| src/client/job_runner.rs | RecoveryOutcome enum, status assignment |
| src/client/commands/recover.rs | invoke_ai_agent function, CLI integration |
| src/server/api/jobs.rs | reset_failed_jobs_only includes pending_failed |
| torc-mcp-server/src/tools.rs | MCP tool implementations |
| torc-mcp-server/src/server.rs | MCP server handlers |
Future Enhancements
- Confidence thresholds: AI classifies with confidence score; low confidence escalates to user
- Learning from outcomes: Track whether AI classifications led to successful retries
- Batch scheduling optimization: AI recommends optimal Slurm allocations for retry jobs
- Custom MCP server examples: Templates for domain-specific error classification
Workflow Graph
The WorkflowGraph module provides a directed acyclic graph (DAG) representation of workflow jobs
and their dependencies. It serves as the core data structure for dependency analysis, scheduler
planning, and execution visualization.
Purpose
The graph abstraction addresses several key challenges:
- Unified Representation: Works with both workflow specifications (at creation time) and database models (at runtime), providing a consistent interface for graph algorithms
- Dependency Analysis: Enables topological sorting, level computation, and critical path detection
- Scheduler Planning: Groups jobs by resource requirements and dependency status for efficient scheduler generation
- Sub-workflow Detection: Identifies connected components that can be scheduled independently
Data Structures
JobNode
Represents a single job (or parameterized job template) in the graph:
pub struct JobNode {
    pub name: String,                          // Job name (may contain {param} placeholders)
    pub resource_requirements: Option<String>, // Resource requirements name
    pub instance_count: usize,                 // 1 for normal jobs, N for parameterized
    pub name_pattern: String,                  // Regex pattern matching all instances
    pub scheduler: Option<String>,             // Assigned scheduler
    pub command: String,                       // Command to execute
}
WorkflowGraph
The main graph structure with bidirectional edges for efficient traversal:
pub struct WorkflowGraph {
    nodes: HashMap<String, JobNode>,               // Jobs indexed by name
    depends_on: HashMap<String, HashSet<String>>,  // Forward edges (blockers)
    depended_by: HashMap<String, HashSet<String>>, // Reverse edges (dependents)
    levels: Option<Vec<Vec<String>>>,              // Cached topological levels
    components: Option<Vec<WorkflowComponent>>,    // Cached connected components
}
SchedulerGroup
Groups jobs that share scheduling characteristics:
pub struct SchedulerGroup {
    pub resource_requirements: String,  // Common RR name
    pub has_dependencies: bool,         // Whether jobs have blockers
    pub job_count: usize,               // Total instances across jobs
    pub job_name_patterns: Vec<String>, // Regex patterns for matching
    pub job_names: Vec<String>,         // Job names in this group
}
Construction Methods
From Workflow Specification
WorkflowGraph::from_spec(&spec) -> Result<Self, Error>
Builds the graph at workflow creation time:
- Creates nodes for each job specification
- Resolves explicit dependencies (depends_on)
- Resolves regex dependencies (depends_on_regexes)
- Computes implicit dependencies from input/output files and user data
From Database Models
WorkflowGraph::from_jobs(jobs, resource_requirements) -> Result<Self, Error>
Builds the graph from fetched database records (used for recovery and visualization):
- Creates nodes from JobModel records
- Resolves dependencies via depends_on_job_ids (if available)
- Falls back to computing dependencies from file relationships
Key Operations
Topological Levels
Groups jobs by dependency depth for parallel execution planning:
- Level 0: Jobs with no dependencies (can start immediately)
- Level N: Jobs whose dependencies are all in levels < N
Used for execution ordering and TUI visualization.
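The level computation is a layered variant of Kahn's algorithm over the depends_on map; the sketch below is illustrative rather than the actual WorkflowGraph method.

use std::collections::{HashMap, HashSet};

// Peel off, level by level, every job whose blockers have already been placed.
fn topological_levels(
    nodes: &HashSet<String>,
    depends_on: &HashMap<String, HashSet<String>>,
) -> Vec<Vec<String>> {
    let mut remaining: HashSet<String> = nodes.clone();
    let mut placed: HashSet<String> = HashSet::new();
    let mut levels = Vec::new();
    while !remaining.is_empty() {
        let level: Vec<String> = remaining
            .iter()
            .filter(|j| {
                depends_on
                    .get(*j)
                    .map_or(true, |deps| deps.iter().all(|d| placed.contains(d)))
            })
            .cloned()
            .collect();
        if level.is_empty() {
            break; // cycle guard; never reached for a valid DAG
        }
        for j in &level {
            remaining.remove(j);
        }
        placed.extend(level.iter().cloned());
        levels.push(level);
    }
    levels
}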
Connected Components
Identifies independent sub-workflows using BFS traversal:
- Each component can be scheduled independently
- Enables parallel execution of unrelated job pipelines
- Useful for large workflows with multiple independent processing chains
Scheduler Groups
Groups jobs by (resource_requirements, has_dependencies) for scheduler generation:
- Jobs without dependencies: Submitted at workflow start
- Jobs with dependencies: Submitted on-demand when jobs become ready
- Enables the shared generate_scheduler_plan() function used by both torc slurm generate and torc slurm regenerate
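A minimal sketch of the grouping key, assuming jobs have already been reduced to (name, resource requirements, has-dependencies) tuples; the real method works on JobNode values, but the idea is the same.

use std::collections::HashMap;

// Jobs sharing the same resource requirements and the same "has blockers"
// flag land in one scheduler group.
fn group_jobs(jobs: &[(String, String, bool)]) -> HashMap<(String, bool), Vec<&str>> {
    let mut groups: HashMap<(String, bool), Vec<&str>> = HashMap::new();
    for (name, rr, has_deps) in jobs {
        groups
            .entry((rr.clone(), *has_deps))
            .or_default()
            .push(name.as_str());
    }
    groups
}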
Critical Path
Finds the longest path through the graph (by instance count):
- Identifies bottleneck jobs that limit parallelism
- Used for estimating minimum execution time
- Helps prioritize optimization efforts
Integration Points
Scheduler Plan Generation
The SchedulerPlan module uses WorkflowGraph::scheduler_groups() to generate Slurm schedulers:
let graph = WorkflowGraph::from_spec(&spec)?;
let groups = graph.scheduler_groups();
let plan = generate_scheduler_plan(&graph, &resource_requirements, &profile, ...);
Execution Plan Visualization
The execution plan display uses WorkflowGraph::from_jobs() for runtime visualization:
let graph = WorkflowGraph::from_jobs(&jobs, &resource_requirements)?;
let levels = graph.topological_levels()?;
// Render DAG visualization in TUI
Recovery Scenarios
The regenerate command uses the graph to determine scheduler groupings for failed workflows:
let graph = WorkflowGraph::from_jobs(&jobs, &resource_requirements)?;
let plan = generate_scheduler_plan(&graph, ...);
// Apply plan to recreate schedulers and actions
Design Decisions
Bidirectional Edges
The graph maintains both depends_on and depended_by maps for O(1) traversal in either direction.
This is critical for:
- Finding what becomes ready when a job completes
- Computing connected components efficiently
- Building subgraphs for partial analysis
Lazy Computation with Caching
Topological levels and connected components are computed on-demand and cached. This avoids unnecessary computation for simple queries while ensuring efficient repeated access.
Parameterized Job Handling
Parameterized jobs are represented as single nodes with instance_count > 1. The name_pattern
field provides a regex for matching expanded instances, enabling scheduler grouping without full
expansion.
Interface Architecture
This document describes the design decisions behind Torc's multi-interface architecture and the patterns used to expose functionality consistently across all user-facing interfaces.
Problem Statement
Torc needs to serve diverse user workflows:
- Automation scripts need programmatic access via CLI and API clients
- Interactive monitoring requires real-time updates via TUI and web dashboard
- AI assistants need structured tool access via the MCP protocol
- External integrations require language-agnostic HTTP APIs
Each interface has different requirements for output format, error handling, and interaction patterns, yet all must provide consistent access to core functionality.
Design Goals
- Consistency: All interfaces expose the same core operations with consistent semantics
- Single Source of Truth: Business logic lives in the server; clients are thin wrappers
- Interface-Appropriate UX: Each interface adapts presentation to its context
- Maintainability: Adding features should require minimal interface-specific code
- Discoverability: Users should easily find available operations in each interface
Solution Overview
The architecture follows a layered approach where the CLI serves as the foundation for Rust-based interfaces, while external clients communicate directly with the HTTP API.
graph TD
subgraph ui["User Interfaces"]
CLI["CLI<br/>(torc)"]
TUI["TUI<br/>(torc tui)"]
DASH_BE["Dashboard Backend<br/>(torc-dash)"]
DASH_FE["Dashboard Frontend<br/>(JavaScript)"]
MCP["MCP Server<br/>(torc-mcp-server)"]
PY["Python Client"]
JL["Julia Client"]
end
subgraph lib["Rust Client Library"]
API["src/client/apis/<br/>Generated OpenAPI client"]
CMD["src/client/commands/<br/>CLI command handlers"]
end
subgraph server["Server"]
HTTP["HTTP API<br/>(torc-server)"]
DB[(SQLite Database)]
end
CLI --> CMD
CMD --> API
TUI --> API
TUI --> HTTP
DASH_BE --> API
MCP --> API
DASH_FE --> HTTP
PY --> HTTP
JL --> HTTP
API --> HTTP
HTTP --> DB
style CLI fill:#4a9eff,color:#fff
style TUI fill:#17a2b8,color:#fff
style DASH_BE fill:#17a2b8,color:#fff
style DASH_FE fill:#17a2b8,color:#fff
style MCP fill:#6f42c1,color:#fff
style PY fill:#ffc107,color:#000
style JL fill:#ffc107,color:#000
style API fill:#4a9eff,color:#fff
style CMD fill:#4a9eff,color:#fff
style HTTP fill:#28a745,color:#fff
style DB fill:#28a745,color:#fff
Key architectural decisions:
- CLI as foundation: The CLI (src/client/commands/) provides the command implementations that other Rust interfaces can reuse.
- Shared Rust client library: The TUI, Dashboard backend, and MCP server all use the generated Rust API client (src/client/apis/) as the CLI does. This client library makes HTTP requests to the torc-server.
- Direct HTTP access: The TUI, Dashboard JavaScript frontend, Python client, and Julia client also communicate directly with the HTTP API for certain operations.
Interface Implementations
CLI (Command Line Interface)
Location: src/client/commands/
Design Pattern: Subcommand dispatch with format-aware output
The CLI uses clap for argument parsing with a hierarchical command structure:
torc
├── workflows
│ ├── create
│ ├── list
│ ├── run
│ └── ...
├── jobs
│ ├── list
│ ├── get
│ └── ...
└── ...
Key Design Decisions:
- Dual Output Formats: Every list/get command supports --format table (human-readable) and --format json (machine-parseable). This enables both interactive use and scripting.
- Pagination Built-In: All list commands include --offset and --limit flags, mirroring the API's pagination model directly.
- Environment Variable Fallbacks: Common parameters like --url fall back to environment variables (TORC_API_URL), reducing repetition in scripts.
- Consistent Error Output: Errors write to stderr with context, while successful output goes to stdout, enabling clean piping.
Implementation Pattern:
pub fn handle_list(config: &Configuration, format: &str) {
    match list_items(config) {
        Ok(items) => match format {
            "json" => println!("{}", serde_json::to_string_pretty(&items).unwrap()),
            _ => display_table(&items),
        },
        Err(e) => eprintln!("Error: {}", e),
    }
}
TUI (Terminal User Interface)
Location: src/tui/
Design Pattern: Component-based reactive UI with polling updates
Key Design Decisions:
- Separation of Concerns:
  - app.rs: Application state and business logic
  - ui.rs: Rendering logic using ratatui
  - api.rs: API client with anyhow::Result error handling
  - components.rs: Reusable UI widgets (dialogs, lists)
- Blocking API Client: Unlike the async server, the TUI uses reqwest::blocking to simplify the event loop. API calls happen on the main thread between render cycles.
- Modal Dialogs: Confirmation dialogs for destructive actions (delete, cancel) prevent accidental data loss in the fast-paced terminal environment.
- Vim-Style Navigation: Keyboard shortcuts follow vim conventions (j/k for navigation, Enter for selection) for power users.
State Management:
pub struct App {
    pub workflows: Vec<WorkflowModel>,
    pub selected_workflow: Option<usize>,
    pub detail_view: DetailViewType,
    pub confirmation_dialog: Option<ConfirmationDialog>,
}

impl App {
    pub fn handle_key_event(&mut self, key: KeyEvent) -> AppAction {
        if self.confirmation_dialog.is_some() {
            return self.handle_dialog_key(key);
        }
        // Normal key handling
    }
}
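The examples above cover state and key handling but not the loop that drives them. The sketch below is a hypothetical version of such a polling loop (ui::render, the refresh interval, and the exact App methods are assumptions, not Torc's actual code): draw a frame, wait briefly for input via crossterm's event::poll, and make blocking API calls between render cycles.

use std::time::{Duration, Instant};
use crossterm::event::{self, Event};
use ratatui::{backend::Backend, Terminal};

fn run<B: Backend>(app: &mut App, terminal: &mut Terminal<B>) -> anyhow::Result<()> {
    let refresh_interval = Duration::from_secs(2); // assumed polling cadence
    let mut last_refresh = Instant::now();
    loop {
        // Render the current application state.
        terminal.draw(|frame| ui::render(frame, app))?;

        // Wait up to 250 ms for a key event so the loop stays responsive.
        if event::poll(Duration::from_millis(250))? {
            if let Event::Key(key) = event::read()? {
                if matches!(app.handle_key_event(key), AppAction::Quit) {
                    return Ok(());
                }
            }
        }

        // Blocking API calls happen here, between render cycles.
        if last_refresh.elapsed() >= refresh_interval {
            app.refresh_data();
            last_refresh = Instant::now();
        }
    }
}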
Dashboard (Web UI)
Location: torc-dash/src/
Design Pattern: Axum web server using the Rust client library
Key Design Decisions:
- Shared Client Library: The dashboard uses the same Rust API client (src/client/apis/) as the CLI, TUI, and MCP server, ensuring consistent behavior across all Rust-based interfaces.
- Embedded Assets: Static files (HTML, CSS, JS) are embedded at compile time using rust-embed, producing a single binary for deployment.
- Server-Sent Events: Real-time updates use SSE for workflow status changes, avoiding the complexity of WebSocket state management (see the sketch below).
- Separate Binary: The dashboard runs as torc-dash, not as part of the main torc binary, allowing independent deployment and scaling.
API Integration Pattern:
async fn list_workflows(
    State(state): State<AppState>,
) -> Result<Json<Vec<WorkflowModel>>, StatusCode> {
    let workflows = default_api::list_workflows(&state.config, None, None, None, None)
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
    Ok(Json(workflows.items))
}
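The Server-Sent Events decision above can be sketched as an additional Axum handler. This is a hypothetical example, not the dashboard's actual endpoint: it assumes the generated models implement Serialize, polls the API every couple of seconds, and pushes the result to the browser as a named SSE event.

use std::{convert::Infallible, time::Duration};
use axum::{extract::State, response::sse::{Event, KeepAlive, Sse}};
use futures::stream::{self, Stream};

// Hypothetical SSE endpoint for live workflow status updates.
async fn workflow_events(
    State(state): State<AppState>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let stream = stream::unfold(state, |state| async move {
        tokio::time::sleep(Duration::from_secs(2)).await;
        // Blocking client call; a production handler would likely offload
        // this with tokio::task::spawn_blocking.
        let payload = match default_api::list_workflows(&state.config, None, None, None, None) {
            Ok(page) => serde_json::to_string(&page.items).unwrap_or_default(),
            Err(e) => format!(r#"{{"error":"{}"}}"#, e),
        };
        Some((Ok(Event::default().event("status").data(payload)), state))
    });
    Sse::new(stream).keep_alive(KeepAlive::default())
}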
MCP Server (AI Assistant Interface)
Location: torc-mcp-server/src/
Design Pattern: Tool-based RPC with structured outputs
Key Design Decisions:
- Blocking Client, Async Transport: The MCP server creates a blocking reqwest client before spawning the tokio runtime. This avoids nested-runtime issues when the MCP transport is async but the Torc client expects blocking calls.
- Structured JSON Responses: Tool outputs are JSON objects with consistent fields, making them easy for AI models to parse and reason about.
- Error as Content: Errors are returned as structured content (not transport failures), giving AI assistants context to retry or explain failures.
- Operation Scoping: Tools are scoped to common high-level operations (list workflows, get status, run workflow) rather than exposing every API endpoint.
Tool Implementation:
pub fn list_workflows(config: &Configuration) -> Result<CallToolResult, McpError> {
    let workflows = default_api::list_workflows(config, None, None, None, None)
        .map_err(|e| McpError::internal_error(
            format!("Failed to list workflows: {}", e),
            None,
        ))?;

    let summary: Vec<_> = workflows.items.iter().map(|w| json!({
        "id": w.id,
        "name": w.name,
        "status": format!("{:?}", w.status),
    })).collect();

    Ok(CallToolResult::success(vec![
        Content::text(serde_json::to_string_pretty(&summary).unwrap()),
    ]))
}
Python/Julia API Clients
Location: Generated in python_client/ and julia_client/
Design Pattern: OpenAPI-generated clients with language-idiomatic wrappers
Key Design Decisions:
- Generated Code: Core API clients are generated from api/openapi.yaml using openapi-generator. This ensures type safety and automatic updates when the API changes.
- No Manual Edits: Generated files in openapi_client/ directories should never be manually edited. Customizations go in wrapper modules.
- Synchronous and Async: The Python client supports both sync and async usage patterns via the generated client's configuration.
Regeneration Workflow:
cd api
bash make_api_clients.sh # Regenerates both Python and Julia clients
Alternatives Considered
GraphQL Instead of REST
Rejected because:
- REST's simplicity matches Torc's CRUD-heavy operations
- OpenAPI provides excellent code generation for multiple languages
- Pagination and filtering are well-handled by query parameters
- GraphQL's flexibility isn't needed for the fixed data model
Single Unified Binary
Rejected because:
- The dashboard has significant web dependencies (static assets, tower middleware)
- Separate binaries allow independent deployment
- Feature flags still provide unified builds when desired
gRPC for Internal Communication
Rejected because:
- HTTP/JSON is more debuggable and accessible
- Browser-based dashboard would need a proxy anyway
- Python/Julia clients benefit from REST's simplicity
Implementation Guidelines
When adding a new feature that should be exposed to users:
- Start with the API: Define the endpoint in api/openapi.yaml with proper schemas
- Implement server-side: Add handler in src/server/api/
- Regenerate clients: Run api/make_api_clients.sh
- Add CLI command: Create handler in src/client/commands/
- Update TUI if applicable: Add to relevant view in src/tui/
- Update Dashboard if applicable: Add route in torc-dash/src/
- Add MCP tool if user-facing: Add tool function in torc-mcp-server/src/
- Document in all interfaces: Update reference docs for each affected interface
Summary
Torc's multi-interface architecture achieves consistency through:
- Centralized business logic in the server
- Generated API clients from a single OpenAPI spec
- Interface-specific adapters that translate between user expectations and API semantics
- Consistent data models shared across all implementations
Contributing
Contributions to Torc are welcome! This guide will help you get started.
Development Setup
- Fork and clone the repository:
git clone https://github.com/your-username/torc.git
cd torc
- Install Rust and dependencies:
Make sure you have Rust 1.70 or later installed:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- Install SQLx CLI:
cargo install sqlx-cli --no-default-features --features sqlite
- Set up the database:
# Create .env file
echo "DATABASE_URL=sqlite:torc.db" > .env
# Run migrations
sqlx migrate run
- Build and test:
cargo build
cargo test
Making Changes
Code Style
Run formatting and linting before committing:
# Format code
cargo fmt
# Run clippy
cargo clippy --all --all-targets --all-features -- -D warnings
# Run all checks
cargo fmt --check && cargo clippy --all --all-targets --all-features -- -D warnings
Adding Tests
All new functionality should include tests:
# Run specific test
cargo test test_name -- --nocapture
# Run with logging
RUST_LOG=debug cargo test -- --nocapture
Database Migrations
If you need to modify the database schema:
# Create new migration
sqlx migrate add <migration_name>
# Edit the generated SQL file in migrations/
# Run migration
sqlx migrate run
# To revert
sqlx migrate revert
Submitting Changes
- Create a feature branch:
git checkout -b feature/my-new-feature
- Make your changes and commit:
git add .
git commit -m "Add feature: description"
- Ensure all tests pass:
cargo test
cargo fmt --check
cargo clippy --all-targets --all-features -- -D warnings
- Push to your fork:
git push origin feature/my-new-feature
- Open a Pull Request:
Go to the original repository and open a pull request with:
- Clear description of changes
- Reference to any related issues
- Test results
Pull Request Guidelines
- Keep PRs focused - One feature or fix per PR
- Add tests - All new code should be tested
- Update documentation - Update README.md, DOCUMENTATION.md, or inline docs as needed
- Follow style guidelines - Run cargo fmt and cargo clippy
- Write clear commit messages - Describe what and why, not just how
Areas for Contribution
High Priority
- Performance optimizations for large workflows
- Additional job runner implementations (Kubernetes, etc.)
- Improved error messages and logging
- Documentation improvements
Features
- Workflow visualization tools
- Job retry policies and error handling
- Workflow templates and libraries
- Integration with external systems
Testing
- Additional integration tests
- Performance benchmarks
- Stress testing with large workflows
Code of Conduct
Be respectful and constructive in all interactions. We're all here to make Torc better.
Questions?
- Open an issue for bugs or feature requests
- Start a discussion for questions or ideas
- Check existing issues and discussions first
License
By contributing, you agree that your contributions will be licensed under the BSD 3-Clause License.
Rust Developer Style Guide
This guide establishes coding standards, conventions, and workflows for Rust developers contributing to Torc. Following these guidelines ensures consistency across the codebase and streamlines the review process.
Pre-commit Hooks
Always rely on the pre-commit hooks provided. The repository uses cargo-husky to install Git
hooks automatically. Before each commit, the following checks run:
cargo fmt -- --check # Rust formatting
cargo clippy --all --all-targets --all-features -- -D warnings
dprint check # Markdown formatting
If any check fails, the commit is blocked. Fix the issues before committing.
Installing Pre-commit Hooks
Hooks are installed automatically when you run cargo build for the first time. If you need to
reinstall them manually:
cargo install cargo-husky
cargo build # Triggers hook installation
Code Formatting
Rust Formatting (rustfmt)
All Rust code must pass cargo fmt --check. Run cargo fmt before committing to auto-format your
code.
Key conventions enforced:
- 4-space indentation
- Max line width of 100 characters
- Consistent brace placement
- Sorted imports
Clippy Compliance
All code must compile without clippy warnings when run with -D warnings:
cargo clippy --all --all-targets --all-features -- -D warnings
Common clippy lints to watch for:
- clippy::unwrap_used - Prefer expect() with descriptive messages or proper error handling
- clippy::clone_on_copy - Avoid cloning Copy types
- clippy::needless_return - Omit unnecessary return keywords
- clippy::redundant_closure - Use method references where possible
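As a hypothetical before/after covering three of these lints (the Job type is a stand-in for a generated model, not real Torc code):

// Stand-in type for illustration; in Torc this would be a generated model.
struct Job {
    name: String,
}

fn format_name(job: &Job) -> String {
    job.name.clone()
}

// Before: trips clippy::unwrap_used, clippy::needless_return,
// and clippy::redundant_closure.
fn summary_before(jobs: &[Job]) -> String {
    let first = jobs.first().unwrap();
    let names: Vec<String> = jobs.iter().map(|j| format_name(j)).collect();
    return format!("{} jobs, starting with {}", names.len(), first.name);
}

// After: descriptive expect(), a function reference, and no trailing return.
fn summary_after(jobs: &[Job]) -> String {
    let first = jobs.first().expect("job list should not be empty");
    let names: Vec<String> = jobs.iter().map(format_name).collect();
    format!("{} jobs, starting with {}", names.len(), first.name)
}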
Markdown Formatting (dprint)
All Markdown files in docs/ must comply with dprint formatting:
dprint check # Verify formatting
dprint fmt # Auto-format
Critical requirement: Maximum line length of 100 characters for all Markdown files.
The dprint.json configuration enforces:
{
"lineWidth": 100,
"markdown": {
"lineWidth": 100,
"textWrap": "always"
}
}
Documentation Standards
All features must be documented in Markdown in the docs/ directory following the
Diataxis framework:
Diataxis Categories
| Category | Location | Purpose |
|---|---|---|
| Tutorials | docs/src/tutorials/ | Learning-oriented, step-by-step lessons |
| How-To Guides | docs/src/how-to/ | Task-oriented, problem-solving guides |
| Explanation | docs/src/explanation/ | Understanding-oriented, conceptual discussions |
| Reference | docs/src/reference/ | Information-oriented, technical descriptions |
Design Documentation
Significant design choices must be documented in docs/src/explanation/design/. Each design
document should cover:
- Problem Statement: What problem does this solve?
- Design Goals: What are the requirements and constraints?
- Solution Overview: High-level architecture description
- Implementation Details: Key technical decisions and trade-offs
- Alternatives Considered: What other approaches were evaluated?
Existing design documents include:
- server.md - API handler design and request processing
- database.md - SQLite schema and concurrency model
- dashboard.md - Web dashboard architecture
- recovery.md - Workflow recovery mechanisms
- workflow-graph.md - Dependency graph implementation
Documentation Workflow
- Write documentation alongside code changes
- Add new pages to docs/src/SUMMARY.md
- Run dprint fmt to ensure formatting compliance
- Build and preview with mdbook serve docs/
Testing with rstest
All code must include tests using the rstest library for fixtures and parameterized testing.
Test Organization
tests/
├── common.rs # Shared test utilities and fixtures
├── test_full_workflows.rs # Integration tests
├── test_job_runner.rs # Job runner tests
└── scripts/ # Helper scripts for tests
Common Patterns
Fixture Pattern:
use rstest::rstest;
use serial_test::serial;

mod common;
use common::{start_server, ServerProcess};

#[rstest]
#[serial]
fn test_workflow_creation(start_server: &ServerProcess) {
    let config = &start_server.config;
    // Test code using the server fixture
}
Parameterized Tests:
#[rstest]
#[case(0, "immediate")]
#[case(60, "one_minute")]
#[case(3600, "one_hour")]
#[serial]
fn test_timeout_handling(#[case] timeout_secs: u64, #[case] description: &str) {
    // Test runs once for each case
}
Shared Test Utilities (tests/common.rs):
pub struct ServerProcess {
    pub config: Configuration,
    child: std::process::Child,
}

impl Drop for ServerProcess {
    fn drop(&mut self) {
        // Automatic cleanup on test completion
        let _ = self.child.kill();
    }
}

pub fn start_server() -> ServerProcess {
    let port = find_available_port();
    // Start server and wait for readiness
}
Test Guidelines
- Use #[serial] for integration tests that share resources (ports, database)
- Use descriptive expect() messages instead of .unwrap()
- Clean up resources using the Drop trait or explicit cleanup functions
- Test error conditions, not just happy paths
- Keep tests focused - one behavior per test function
HTTP API Changes
Changes to the HTTP API require updating the OpenAPI specification and regenerating client libraries.
Workflow
- Modify the OpenAPI spec:
  # Edit the specification
  vim api/openapi.yaml
- Regenerate API clients:
  cd api
  bash make_api_clients.sh
  This regenerates:
  - Python client: python_client/src/torc/openapi_client/
  - Julia client: julia_client/Torc/src/api/
- Update Rust client code: The Rust client in src/client/apis/ is generated separately. After modifying the OpenAPI spec, regenerate it with the OpenAPI generator for Rust.
- Test all clients:
  # Rust
  cargo test
  # Python
  cd python_client && pytest
  # Julia
  julia --project=julia_client/Torc -e "import Pkg; Pkg.test()"
OpenAPI Conventions
- Use descriptive operationId values (e.g., create_workflow, list_jobs)
- Include comprehensive request/response schemas
- Document all parameters with descriptions
- Use appropriate HTTP status codes (200, 400, 404, 500)
Feature Implementation Across Interfaces
When implementing a user-facing feature, ensure it is exposed through the appropriate interfaces. The following table shows where features should be implemented:
| Interface | Location | Primary Use Case |
|---|---|---|
| CLI | src/client/commands/ | Command-line automation, scripting |
| HTTP API | api/openapi.yaml | Python/Julia integration, external tools |
| Dashboard | torc-dash/src/ | Web-based monitoring and management |
| TUI | src/tui/ | Interactive terminal monitoring |
| MCP Server | torc-mcp-server/src/ | AI assistant integration |
CLI Implementation
Commands are implemented using clap with subcommand enums:
// In src/client/commands/<feature>.rs
#[derive(Subcommand, Debug, Clone)]
pub enum FeatureCommands {
    /// Create a new resource
    Create {
        /// Name of the resource
        #[arg(short, long)]
        name: String,
    },
    /// List all resources
    List {
        #[arg(long, default_value = "table")]
        format: String,
    },
}

pub fn handle_feature_commands(
    config: &Configuration,
    command: &FeatureCommands,
    format: &str,
) {
    match command {
        FeatureCommands::Create { name } => handle_create(config, name, format),
        FeatureCommands::List { format: fmt } => handle_list(config, fmt),
    }
}
CLI Conventions:
- Support both --format table and --format json output
- Use tabled for table formatting, with #[tabled(rename = "...")] for column headers
- Include pagination support via --offset and --limit flags (see the sketch after this list)
- Provide helpful error messages with context
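A hypothetical argument struct tying these conventions together might look like the following (field names, defaults, and the environment fallback are illustrative, not the exact definitions in src/client/commands/; clap's env attribute requires the crate's env feature):

use clap::Args;

// Illustrative only; not the actual Torc argument definitions.
#[derive(Args, Debug, Clone)]
pub struct CommonListArgs {
    /// Server URL; falls back to the TORC_API_URL environment variable.
    #[arg(long, env = "TORC_API_URL")]
    pub url: Option<String>,

    /// Output format: "table" (human-readable) or "json" (machine-parseable).
    #[arg(long, default_value = "table")]
    pub format: String,

    /// Pagination offset, mirroring the API's pagination model.
    #[arg(long, default_value_t = 0)]
    pub offset: u32,

    /// Maximum number of items to return.
    #[arg(long, default_value_t = 100)]
    pub limit: u32,
}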
HTTP API (Python/Julia)
After updating api/openapi.yaml, the Python and Julia clients are auto-generated. Ensure:
- All new endpoints have proper request/response schemas
- Query parameters are documented
- Error responses are specified
- Run make_api_clients.sh to regenerate clients
Dashboard (torc-dash)
The dashboard is an Axum-based web server with embedded static assets:
// In torc-dash/src/main.rs
async fn handle_feature_list(
    State(state): State<AppState>,
) -> Result<Json<Vec<Feature>>, StatusCode> {
    // Proxy request to Torc API server
    let features = state.client
        .get(&format!("{}/features", state.api_url))
        .send()
        .await
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?
        .json()
        .await
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
    Ok(Json(features))
}
Dashboard Conventions:
- Proxy API requests to the Torc server
- Use Axum extractors for request handling
- Return JSON for API endpoints
- Serve static files for the frontend
TUI (Terminal User Interface)
The TUI uses ratatui with a component-based architecture:
// In src/tui/app.rs
pub struct App {
    pub workflows: Vec<WorkflowModel>,
    pub selected_workflow: Option<usize>,
    pub detail_view: DetailViewType,
}

impl App {
    pub fn handle_key_event(&mut self, key: KeyEvent) -> AppAction {
        match key.code {
            KeyCode::Enter => self.select_current(),
            KeyCode::Char('r') => self.refresh_data(),
            KeyCode::Char('q') => AppAction::Quit,
            _ => AppAction::None,
        }
    }
}
TUI Conventions:
- Use anyhow::Result for error handling
- Separate state (app.rs), rendering (ui.rs), and API calls (api.rs)
- Support keyboard navigation with vim-style bindings
- Display confirmation dialogs for destructive actions
MCP Server (AI Assistant)
The MCP server exposes tools for AI assistants:
// In torc-mcp-server/src/main.rs
pub fn get_workflow_status(
    config: &Configuration,
    workflow_id: i64,
) -> Result<CallToolResult, McpError> {
    let workflow = default_api::get_workflow(config, workflow_id)
        .map_err(|e| internal_error(format!("Failed to get workflow: {}", e)))?;

    let result = serde_json::json!({
        "workflow_id": workflow.id,
        "name": workflow.name,
        "status": workflow.status,
    });

    Ok(CallToolResult::success(vec![
        rmcp::model::Content::text(serde_json::to_string_pretty(&result).unwrap_or_default()),
    ]))
}
MCP Conventions:
- Return structured JSON for tool results
- Use descriptive error messages via McpError
- Support common workflow operations (list, status, run, cancel)
- Keep tool descriptions clear for AI consumption
Error Handling Strategy
Application Code (CLI, TUI, binaries)
Use anyhow::Result for flexible error handling:
use anyhow::{Context, Result};

pub fn run_workflow(path: &Path) -> Result<()> {
    let spec = load_spec(path)
        .context("Failed to load workflow specification")?;
    create_workflow(&spec)
        .context("Failed to create workflow")?;
    Ok(())
}
Library Code
Use typed errors with thiserror:
use thiserror::Error;

#[derive(Error, Debug)]
pub enum WorkflowError {
    #[error("Job {job_id} not found in workflow {workflow_id}")]
    JobNotFound { job_id: i64, workflow_id: i64 },

    #[error("Invalid status transition from {from} to {to}")]
    InvalidTransition { from: String, to: String },

    #[error("API error: {0}")]
    ApiError(#[from] reqwest::Error),
}
Test Code
Use .expect() with descriptive messages:
let workflow = create_workflow(&spec)
    .expect("Test workflow creation should succeed");

let job = get_job(config, job_id)
    .expect("Job should exist after creation");
Common Patterns
Configuration Priority
CLI arguments override environment variables, which override config files:
let api_url = cli_args.url
    .or_else(|| env::var("TORC_API_URL").ok())
    .or_else(|| config.client.as_ref()?.api_url.clone())
    .unwrap_or_else(|| "http://localhost:8080/torc-service/v1".to_string());
Table Display
Use the tabled crate for CLI table output:
use tabled::{Table, Tabled};

#[derive(Tabled)]
struct JobRow {
    #[tabled(rename = "ID")]
    id: i64,
    #[tabled(rename = "Name")]
    name: String,
    #[tabled(rename = "Status")]
    status: String,
}

fn display_jobs(jobs: &[JobModel]) {
    let rows: Vec<JobRow> = jobs.iter().map(|j| JobRow {
        id: j.id.unwrap_or(0),
        name: j.name.clone(),
        status: format!("{:?}", j.status),
    }).collect();
    println!("{}", Table::new(rows));
}
Feature Flags
Use Cargo features to conditionally compile components:
# In Cargo.toml
[features]
default = ["client"]
client = ["dep:reqwest", "dep:clap"]
server = ["dep:sqlx", "dep:axum"]
tui = ["client", "dep:ratatui"]

// In code
#[cfg(feature = "client")]
pub mod client;
#[cfg(feature = "server")]
pub mod server;
Async Runtime
Create blocking clients before spawning the async runtime to avoid nested runtime issues:
fn main() -> Result<()> {
    // Create blocking client BEFORE the async runtime
    let client = reqwest::blocking::Client::new();
    let server = MyServer::new(client);

    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .build()?;

    runtime.block_on(async_main(server))
}
Logging
Use tracing for structured logging:
use tracing::{debug, info, warn, error, instrument};

#[instrument(skip(config))]
pub fn process_job(config: &Configuration, job_id: i64) -> Result<()> {
    info!(job_id, "Processing job");
    match run_job(job_id) {
        Ok(result) => {
            debug!(job_id, ?result, "Job completed successfully");
            Ok(())
        }
        Err(e) => {
            error!(job_id, error = %e, "Job failed");
            Err(e)
        }
    }
}
Enable debug logging with:
RUST_LOG=debug cargo run
RUST_LOG=torc=debug,sqlx=warn cargo run # Fine-grained control
Summary Checklist
Before submitting a pull request, verify:
- cargo fmt --check passes
- cargo clippy --all --all-targets --all-features -- -D warnings passes
- dprint check passes (for Markdown changes)
- All tests pass with cargo test
- New features have tests using rstest
- Documentation added in the appropriate Diataxis category
- Design decisions documented in docs/src/explanation/design/ if applicable
- API changes reflected in api/openapi.yaml
- Client libraries regenerated with api/make_api_clients.sh
- Feature exposed through appropriate interfaces (CLI, API, TUI, etc.)