
Workflow Specification Reference

This page documents all data models used in workflow specification files. Workflow specs can be written in YAML, JSON, JSON5, or KDL formats.

WorkflowSpec

The top-level container for a complete workflow definition.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `name` | string | required | Name of the workflow |
| `user` | string | current user | User who owns this workflow |
| `description` | string | none | Description of the workflow |
| `project` | string | none | Project name or identifier for grouping workflows |
| `metadata` | string | none | Arbitrary metadata as a JSON string |
| `parameters` | `map<string, string>` | none | Shared parameters that can be used by jobs and files via `use_parameters` |
| `jobs` | [JobSpec] | required | Jobs that make up this workflow |
| `files` | [FileSpec] | none | Files associated with this workflow |
| `user_data` | [UserDataSpec] | none | User data associated with this workflow |
| `resource_requirements` | [ResourceRequirementsSpec] | none | Resource requirements available for this workflow |
| `failure_handlers` | [FailureHandlerSpec] | none | Failure handlers available for this workflow |
| `slurm_schedulers` | [SlurmSchedulerSpec] | none | Slurm schedulers available for this workflow |
| `slurm_defaults` | SlurmDefaultsSpec | none | Default Slurm parameters to apply to all schedulers |
| `resource_monitor` | ResourceMonitorConfig | none | Resource monitoring configuration |
| `actions` | [WorkflowActionSpec] | none | Actions to execute based on workflow/job state transitions |
| `use_pending_failed` | boolean | false | Use PendingFailed status for failed jobs (enables AI-assisted recovery) |
| `execution_config` | ExecutionConfig | none | Execution mode and termination settings |
| `compute_node_wait_for_new_jobs_seconds` | integer | none | Compute nodes wait for new jobs this long before exiting |
| `compute_node_ignore_workflow_completion` | boolean | false | Compute nodes hold allocations even after the workflow completes |
| `compute_node_wait_for_healthy_database_minutes` | integer | none | Compute nodes wait this many minutes for database recovery |
| `jobs_sort_method` | ClaimJobsSortMethod | none | Method for sorting jobs when claiming them |
| `enable_ro_crate` | boolean | false | Enable automatic RO-Crate provenance tracking |

Examples with project and metadata

The project and metadata fields are useful for organizing and categorizing workflows. For more detailed guidance on organizing workflows, see Organizing and Managing Workflows.

YAML example:

```yaml
name: "ml_training_workflow"
project: "customer-churn-prediction"
metadata: '{"environment":"staging","version":"1.0.0","team":"ml-engineering"}'
description: "Train and evaluate churn prediction model"
jobs:
  - name: "preprocess"
    command: "python preprocess.py"
  - name: "train"
    command: "python train.py"
    depends_on: ["preprocess"]
```

JSON example:

```json
{
  "name": "data_pipeline",
  "project": "analytics-platform",
  "metadata": "{\"cost_center\":\"eng-data\",\"priority\":\"high\"}",
  "description": "Daily data processing pipeline",
  "jobs": [
    {
      "name": "extract",
      "command": "python extract.py"
    }
  ]
}
```

JobSpec

Defines a single computational task within a workflow.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `name` | string | required | Name of the job |
| `command` | string | required | Command to execute for this job |
| `invocation_script` | string | none | Optional script for job invocation |
| `resource_requirements` | string | none | Name of a ResourceRequirementsSpec to use |
| `failure_handler` | string | none | Name of a FailureHandlerSpec to use |
| `scheduler` | string | none | Name of the scheduler to use for this job |
| `cancel_on_blocking_job_failure` | boolean | false | Cancel this job if a blocking job fails |
| `depends_on` | [string] | none | Job names that must complete before this job runs (exact matches) |
| `depends_on_regexes` | [string] | none | Regex patterns for job dependencies |
| `input_files` | [string] | none | File names this job reads (exact matches) |
| `input_file_regexes` | [string] | none | Regex patterns for input files |
| `output_files` | [string] | none | File names this job produces (exact matches) |
| `output_file_regexes` | [string] | none | Regex patterns for output files |
| `input_user_data` | [string] | none | User data names this job reads (exact matches) |
| `input_user_data_regexes` | [string] | none | Regex patterns for input user data |
| `output_user_data` | [string] | none | User data names this job produces (exact matches) |
| `output_user_data_regexes` | [string] | none | Regex patterns for output user data |
| `parameters` | `map<string, string>` | none | Local parameters for generating multiple jobs |
| `parameter_mode` | string | "product" | How to combine parameters: "product" (Cartesian) or "zip" |
| `use_parameters` | [string] | none | Workflow parameter names to use for this job |
| `stdio` | StdioConfig | none | Per-job override for stdout/stderr capture (overrides workflow-level) |
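As a sketch of how these fields combine, the following hypothetical job (job names, the command, and the `gpu_large` requirement name are illustrative) declares an explicit dependency and references a named resource requirement:

```yaml
jobs:
  - name: "train"
    command: "python train.py"
    depends_on: ["preprocess"]          # runs only after preprocess completes
    resource_requirements: "gpu_large"  # must match a ResourceRequirementsSpec name
    cancel_on_blocking_job_failure: true
```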

FileSpec

Defines input/output file artifacts that establish implicit job dependencies.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `name` | string | required | Name of the file (used for referencing in jobs) |
| `path` | string | required | File system path |
| `parameters` | `map<string, string>` | none | Parameters for generating multiple files |
| `parameter_mode` | string | "product" | How to combine parameters: "product" (Cartesian) or "zip" |
| `use_parameters` | [string] | none | Workflow parameter names to use for this file |
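A minimal sketch of the implicit dependency mechanism (names and paths are illustrative): a job that lists a file in `output_files` is ordered before any job that lists the same file in `input_files`, with no explicit `depends_on` needed.

```yaml
files:
  - name: "raw_data"
    path: "data/raw.parquet"

jobs:
  - name: "extract"
    command: "python extract.py"
    output_files: ["raw_data"]   # producer
  - name: "transform"
    command: "python transform.py"
    input_files: ["raw_data"]    # consumer; implicitly depends on extract
```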

UserDataSpec

Arbitrary JSON data that can establish dependencies between jobs.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `name` | string | none | Name of the user data (used for referencing in jobs) |
| `data` | JSON | none | The data content as a JSON value |
| `is_ephemeral` | boolean | false | Whether the user data is ephemeral |
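For illustration, user data can carry structured configuration between jobs just as files do; this hypothetical example (names and values are illustrative) makes `train` depend on the data being produced:

```yaml
user_data:
  - name: "model_config"
    data:
      learning_rate: 0.001
      batch_size: 32

jobs:
  - name: "train"
    command: "python train.py"
    input_user_data: ["model_config"]
```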

ResourceRequirementsSpec

Defines compute resource requirements for jobs.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `name` | string | required | Name of this resource configuration (referenced by jobs) |
| `num_cpus` | integer | required | Number of CPUs required |
| `memory` | string | required | Memory requirement (e.g., "1m", "2g", "512k") |
| `num_gpus` | integer | 0 | Number of GPUs required |
| `num_nodes` | integer | 1 | Number of nodes per job (`srun --nodes`); allocation size is set via Slurm scheduler config |
| `runtime` | string | "PT1H" | Runtime limit in ISO 8601 duration format (e.g., "PT30M", "PT2H") |
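A hypothetical resource configuration referenced by name from a job (the `medium` name and the values are illustrative):

```yaml
resource_requirements:
  - name: "medium"
    num_cpus: 8
    memory: "16g"
    num_gpus: 1
    runtime: "PT2H"   # ISO 8601 duration: two hours

jobs:
  - name: "train"
    command: "python train.py"
    resource_requirements: "medium"
```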

FailureHandlerSpec

Defines error recovery strategies for jobs.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `name` | string | required | Name of the failure handler (referenced by jobs) |
| `rules` | [FailureHandlerRuleSpec] | required | Rules for handling different exit codes |

FailureHandlerRuleSpec

A single rule within a failure handler for handling specific exit codes.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `exit_codes` | [integer] | [] | Exit codes that trigger this rule |
| `match_all_exit_codes` | boolean | false | If true, matches any non-zero exit code |
| `recovery_script` | string | none | Optional script to run before retrying |
| `max_retries` | integer | 3 | Maximum number of retry attempts |
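A sketch of a handler with one rule for a specific exit code and a catch-all rule (the handler name, script path, and retry counts are illustrative):

```yaml
failure_handlers:
  - name: "retry_oom"
    rules:
      - exit_codes: [137]                     # OOM-killed jobs
        recovery_script: "scripts/cleanup.sh" # hypothetical path, run before retrying
        max_retries: 2
      - match_all_exit_codes: true            # any other non-zero exit code
        max_retries: 1

jobs:
  - name: "train"
    command: "python train.py"
    failure_handler: "retry_oom"
```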

SlurmSchedulerSpec

Defines a Slurm HPC job scheduler configuration.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `name` | string | none | Name of the scheduler (used for referencing) |
| `account` | string | required | Slurm account |
| `partition` | string | none | Slurm partition name |
| `nodes` | integer | 1 | Number of nodes to allocate |
| `walltime` | string | "01:00:00" | Wall time limit |
| `mem` | string | none | Memory specification |
| `gres` | string | none | Generic resources (e.g., GPUs) |
| `qos` | string | none | Quality of service |
| `ntasks_per_node` | integer | none | Number of tasks per node |
| `tmp` | string | none | Temporary storage specification |
| `extra` | string | none | Additional Slurm parameters |
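A hypothetical scheduler configuration (the account, partition, and other values are placeholders to adapt to your cluster):

```yaml
slurm_schedulers:
  - name: "gpu_nodes"
    account: "my_account"    # replace with a real Slurm account
    partition: "gpu"
    nodes: 2
    walltime: "04:00:00"
    gres: "gpu:2"
    qos: "normal"
```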

ExecutionConfig

Controls how jobs are executed and terminated. Fields are grouped by which execution mode they apply to. Setting a field that doesn't match the effective mode produces a validation error at workflow creation time.

Shared fields (both modes)

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `mode` | string | "auto" | Execution mode: "direct", "slurm", or "auto" |
| `sigkill_headroom_seconds` | integer | 60 | Seconds before end_time for SIGKILL or `srun --time` |
| `timeout_exit_code` | integer | 152 | Exit code for timed-out jobs (matches Slurm TIMEOUT) |
| `staggered_start` | boolean | true | Stagger job runner startup to mitigate thundering herd |
| `stdio` | StdioConfig | see below | Workflow-level default for stdout/stderr capture |

Direct mode fields

These fields only apply when the effective mode is direct. Setting them with mode: slurm (or mode: auto with slurm_schedulers) produces a validation error.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `limit_resources` | boolean | true | Monitor memory/CPU and kill jobs that exceed limits |
| `termination_signal` | string | "SIGTERM" | Signal to send before SIGKILL for graceful shutdown |
| `sigterm_lead_seconds` | integer | 30 | Seconds before SIGKILL to send the termination signal |
| `oom_exit_code` | integer | 137 | Exit code for OOM-killed jobs (128 + SIGKILL) |

Slurm mode fields

These fields only apply when the effective mode is slurm. Setting them with mode: direct (or mode: auto without slurm_schedulers) produces a validation error.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `srun_termination_signal` | string | none | Signal spec for `srun --signal=<value>` |
| `enable_cpu_bind` | boolean | false | Allow Slurm CPU binding (`--cpu-bind`) |

StdioConfig

Controls how stdout and stderr are captured for job processes.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `mode` | StdioMode | "separate" | How to capture stdout/stderr |
| `delete_on_success` | boolean | false | Delete captured files when a job completes successfully |

StdioMode

| Value | Description |
|-------|-------------|
| `separate` | Separate stdout (`.o`) and stderr (`.e`) files per job (default) |
| `combined` | Combine stdout and stderr into a single `.log` file per job |
| `no_stdout` | Discard stdout (`/dev/null`); capture stderr only |
| `no_stderr` | Discard stderr (`/dev/null`); capture stdout only |
| `none` | Discard both stdout and stderr |

Per-job overrides can be set via the stdio field on individual JobSpec entries, which takes precedence over the workflow-level setting.

Stdio Examples

Combine stdout and stderr into a single file, and delete it on success:

```yaml
execution_config:
  stdio:
    mode: combined
    delete_on_success: true
```

Suppress stdout for most jobs, but keep separate files for a specific job:

```yaml
execution_config:
  stdio:
    mode: no_stdout

jobs:
  - name: preprocess
    command: python preprocess.py
  - name: train
    command: python train.py
    stdio:
      mode: separate
```

Execution Modes

| Mode | Description |
|------|-------------|
| `direct` | Torc manages job execution directly. Use outside Slurm or when `srun` is unreliable |
| `slurm` | Jobs are wrapped with `srun`; Slurm manages resource limits and termination |
| `auto` | Uses `slurm` if `SLURM_JOB_ID` is set, otherwise `direct` (default) |

Direct Mode Example

```yaml
execution_config:
  mode: direct
  limit_resources: true
  termination_signal: SIGTERM
  sigterm_lead_seconds: 30
  sigkill_headroom_seconds: 60
  timeout_exit_code: 152
  oom_exit_code: 137
```

Slurm Mode Example

```yaml
execution_config:
  mode: slurm
  srun_termination_signal: "TERM@120"
  sigkill_headroom_seconds: 180
  enable_cpu_bind: false
```

Termination Timeline (Direct Mode)

With sigkill_headroom_seconds=60 and sigterm_lead_seconds=30:

  1. end_time - 90s: Send SIGTERM (or configured termination_signal)
  2. end_time - 60s: Send SIGKILL to remaining jobs, set exit code to timeout_exit_code
  3. end_time: Job runner exits

Slurm Mode Headroom

In Slurm mode, sigkill_headroom_seconds controls srun --time. The step time limit is set to remaining_time - sigkill_headroom_seconds, allowing the job runner to detect completion before the allocation expires.
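To make the arithmetic concrete, a sketch assuming one hour remains in the allocation (the numbers are illustrative):

```yaml
execution_config:
  mode: slurm
  sigkill_headroom_seconds: 180
# With 3600 s remaining in the allocation, the srun step time limit becomes
# 3600 - 180 = 3420 s (57 min), leaving 180 s for the job runner to record
# results before the allocation expires.
```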

SlurmDefaultsSpec

Workflow-level default parameters applied to all Slurm schedulers. This is a map of parameter names to values.

Any valid sbatch long option can be specified (without the leading --), except for parameters managed by torc: partition, nodes, walltime, time, mem, gres, name, job-name.

The account parameter is allowed as a workflow-level default.

Example:

```yaml
slurm_defaults:
  qos: "high"
  constraint: "cpu"
  mail-user: "user@example.com"
  mail-type: "END,FAIL"
```

WorkflowActionSpec

Defines conditional actions triggered by workflow or job state changes.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `trigger_type` | string | required | When to trigger: "on_workflow_start", "on_workflow_complete", "on_jobs_ready", "on_jobs_complete" |
| `action_type` | string | required | What to do: "run_commands", "schedule_nodes" |
| `jobs` | [string] | none | For job triggers: exact job names to match |
| `job_name_regexes` | [string] | none | For job triggers: regex patterns to match job names |
| `commands` | [string] | none | For run_commands: commands to execute |
| `scheduler` | string | none | For schedule_nodes: scheduler name |
| `scheduler_type` | string | none | For schedule_nodes: scheduler type ("slurm", "local") |
| `num_allocations` | integer | none | For schedule_nodes: number of node allocations |
| `start_one_worker_per_node` | boolean | false | For schedule_nodes: launch one worker per node (direct mode only) |
| `max_parallel_jobs` | integer | none | For schedule_nodes: maximum parallel jobs |
| `persistent` | boolean | false | Whether the action persists and can be claimed by multiple workers |
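A sketch combining both action types (job and scheduler names are hypothetical and must match entries defined elsewhere in the spec):

```yaml
actions:
  - trigger_type: "on_jobs_complete"
    jobs: ["preprocess"]
    action_type: "schedule_nodes"
    scheduler: "gpu_nodes"      # must name a scheduler defined in slurm_schedulers
    scheduler_type: "slurm"
    num_allocations: 1
  - trigger_type: "on_workflow_complete"
    action_type: "run_commands"
    commands: ["python notify.py"]   # illustrative command
```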

ResourceMonitorConfig

Configuration for resource usage monitoring.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `enabled` | boolean | false | Enable resource monitoring |
| `granularity` | MonitorGranularity | "Summary" | Level of detail for metrics collection |
| `sample_interval_seconds` | integer | 10 | Sampling interval in seconds |
| `generate_plots` | boolean | false | Generate resource usage plots |
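For example, a configuration that collects detailed time series and renders plots (the sampling interval is illustrative):

```yaml
resource_monitor:
  enabled: true
  granularity: "TimeSeries"
  sample_interval_seconds: 5
  generate_plots: true
```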

MonitorGranularity

Enum specifying the level of detail for resource monitoring.

| Value | Description |
|-------|-------------|
| `Summary` | Collect summary statistics only |
| `TimeSeries` | Collect detailed time series data |

ClaimJobsSortMethod

Enum specifying how jobs are sorted when being claimed by workers.

| Value | Description |
|-------|-------------|
| `none` | No sorting (default) |
| `gpus_runtime_memory` | Sort by GPUs, then runtime, then memory |
| `gpus_memory_runtime` | Sort by GPUs, then memory, then runtime |
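The sort method is set at the workflow level, for example:

```yaml
jobs_sort_method: gpus_runtime_memory
```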

Parameter Formats

Parameters support several formats for generating multiple jobs or files:

| Format | Example | Description |
|--------|---------|-------------|
| Integer range | `"1:100"` | Inclusive range from 1 to 100 |
| Integer range with step | `"0:100:10"` | Range with step size |
| Float range | `"0.0:1.0:0.1"` | Float range with step |
| Integer list | `"[1,5,10,100]"` | Explicit list of integers |
| Float list | `"[0.1,0.5,0.9]"` | Explicit list of floats |
| String list | `"['adam','sgd','rmsprop']"` | Explicit list of strings |

Template substitution in strings:

- Basic: `{param_name}` - Replace with parameter value
- Formatted integer: `{i:03d}` - Zero-padded (001, 042, 100)
- Formatted float: `{lr:.4f}` - Precision (0.0010, 0.1000)
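Putting the formats and template substitution together, a hypothetical parameter sweep (parameter names and values are illustrative); assuming the default "product" mode, this sketch would expand to 3 × 2 = 6 jobs:

```yaml
parameters:
  lr: "[0.001,0.01,0.1]"       # float list
  opt: "['adam','sgd']"        # string list

jobs:
  - name: "train_{opt}_{lr:.4f}"
    command: "python train.py --lr {lr} --optimizer {opt}"
    use_parameters: ["lr", "opt"]
```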

See the Job Parameterization reference for more details.