# Workflow Specification Reference
This page documents all data models used in workflow specification files. Workflow specs can be written in YAML, JSON, JSON5, or KDL formats.
## WorkflowSpec
The top-level container for a complete workflow definition.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the workflow |
user | string | current user | User who owns this workflow |
description | string | none | Description of the workflow |
project | string | none | Project name or identifier for grouping workflows |
metadata | string | none | Arbitrary metadata as JSON string |
parameters | map<string, string> | none | Shared parameters that can be used by jobs and files via use_parameters |
env | map<string, string> | none | Environment variables exported for every job in the workflow |
jobs | [JobSpec] | required | Jobs that make up this workflow |
files | [FileSpec] | none | Files associated with this workflow |
user_data | [UserDataSpec] | none | User data associated with this workflow |
resource_requirements | [ResourceRequirementsSpec] | none | Resource requirements available for this workflow |
failure_handlers | [FailureHandlerSpec] | none | Failure handlers available for this workflow |
slurm_schedulers | [SlurmSchedulerSpec] | none | Slurm schedulers available for this workflow |
slurm_defaults | SlurmDefaultsSpec | none | Default Slurm parameters to apply to all schedulers |
resource_monitor | ResourceMonitorConfig | none | Resource monitoring configuration |
actions | [WorkflowActionSpec] | none | Actions to execute based on workflow/job state transitions |
use_pending_failed | boolean | false | Use PendingFailed status for failed jobs (enables AI-assisted recovery) |
execution_config | ExecutionConfig | none | Execution mode and termination settings |
compute_node_wait_for_new_jobs_seconds | integer | none | Compute nodes wait for new jobs this long before exiting |
compute_node_ignore_workflow_completion | boolean | false | Compute nodes hold allocations even after workflow completes |
compute_node_wait_for_healthy_database_minutes | integer | none | Compute nodes wait this many minutes for database recovery |
enable_ro_crate | boolean | false | Enable automatic RO-Crate provenance tracking |
### Examples with project, metadata, and shared environment
The `project` and `metadata` fields are useful for organizing and categorizing workflows. For more
detailed guidance on organizing workflows, see Organizing and Managing Workflows.
Use the top-level `env` map to export environment variables for every job in the workflow. Jobs can
also define their own `env` map to add variables or override workflow-level values with the same
key.
YAML example:
```yaml
name: "ml_training_workflow"
project: "customer-churn-prediction"
metadata: '{"environment":"staging","version":"1.0.0","team":"ml-engineering"}'
description: "Train and evaluate churn prediction model"
env:
  DATA_ROOT: "/scratch/churn"
  LOG_LEVEL: "info"
jobs:
  - name: "preprocess"
    command: "python preprocess.py"
  - name: "train"
    command: "python train.py"
    env:
      LOG_LEVEL: "debug"
    depends_on: ["preprocess"]
```
JSON example:
```json
{
  "name": "data_pipeline",
  "project": "analytics-platform",
  "metadata": "{\"cost_center\":\"eng-data\",\"priority\":\"high\"}",
  "description": "Daily data processing pipeline",
  "env": {
    "DATA_ROOT": "/scratch/analytics",
    "LOG_LEVEL": "info"
  },
  "jobs": [
    {
      "name": "extract",
      "command": "python extract.py",
      "env": {
        "LOG_LEVEL": "debug"
      }
    }
  ]
}
```
## JobSpec
Defines a single computational task within a workflow.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the job |
command | string | required | Command to execute for this job |
priority | integer | 0 | Scheduling priority; higher values are claimed before lower values |
invocation_script | string | none | Optional script for job invocation |
env | map<string, string> | none | Environment variables exported for this job |
resource_requirements | string | none | Name of a ResourceRequirementsSpec to use |
failure_handler | string | none | Name of a FailureHandlerSpec to use |
scheduler | string | none | Name of the scheduler to use for this job |
cancel_on_blocking_job_failure | boolean | false | Cancel this job if a blocking job fails |
depends_on | [string] | none | Job names that must complete before this job runs (exact matches) |
depends_on_regexes | [string] | none | Regex patterns for job dependencies |
input_files | [string] | none | File names this job reads (exact matches) |
input_file_regexes | [string] | none | Regex patterns for input files |
output_files | [string] | none | File names this job produces (exact matches) |
output_file_regexes | [string] | none | Regex patterns for output files |
input_user_data | [string] | none | User data names this job reads (exact matches) |
input_user_data_regexes | [string] | none | Regex patterns for input user data |
output_user_data | [string] | none | User data names this job produces (exact matches) |
output_user_data_regexes | [string] | none | Regex patterns for output user data |
parameters | map<string, string> | none | Local parameters for generating multiple jobs |
parameter_mode | string | "product" | How to combine parameters: "product" (Cartesian) or "zip" |
use_parameters | [string] | none | Workflow parameter names to use for this job |
stdio | StdioConfig | none | Per-job override for stdout/stderr capture (overrides workflow-level) |
When both workflow-level and job-level `env` maps are present, Torc merges them when the job is
created and stores the effective environment on the job. Job-level values win on key conflicts. Torc
exports these variables before running the job's `invocation_script`, if any.
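The `parameters` and `depends_on` fields can be combined to generate several jobs from one entry. A minimal sketch, assuming a separate `preprocess` job exists; script names and parameter values are illustrative:

```yaml
jobs:
  - name: "train_lr{lr}_s{seed}"
    command: "python train.py --lr={lr} --seed={seed}"
    parameters:
      lr: "[0.01,0.1]"
      seed: "1:3"
    parameter_mode: "product"    # Cartesian product: 2 x 3 = 6 jobs
    depends_on: ["preprocess"]   # assumes a job named "preprocess" is defined
```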
## FileSpec
Defines input/output file artifacts that establish implicit job dependencies.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the file (used for referencing in jobs) |
path | string | required | File system path |
parameters | map<string, string> | none | Parameters for generating multiple files |
parameter_mode | string | "product" | How to combine parameters: "product" (Cartesian) or "zip" |
use_parameters | [string] | none | Workflow parameter names to use for this file |
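To make the implicit dependency concrete: a job that lists a file in `output_files` runs before any job that lists the same file in `input_files`. A minimal sketch (file names, paths, and commands are illustrative):

```yaml
files:
  - name: "clean_data"
    path: "data/clean.parquet"
jobs:
  - name: "clean"
    command: "python clean.py"
    output_files: ["clean_data"]
  - name: "train"
    command: "python train.py"
    input_files: ["clean_data"]   # implicit dependency on "clean"
```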
## UserDataSpec
Arbitrary JSON data that can establish dependencies between jobs.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | none | Name of the user data (used for referencing in jobs) |
data | JSON | none | The data content as a JSON value |
is_ephemeral | boolean | false | Whether the user data is ephemeral |
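User data can link jobs the same way files do, via `input_user_data` and `output_user_data`. A minimal sketch (the name and payload are hypothetical):

```yaml
user_data:
  - name: "model_config"
    data:
      learning_rate: 0.01
      layers: [128, 64]
jobs:
  - name: "train"
    command: "python train.py"
    input_user_data: ["model_config"]
```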
## ResourceRequirementsSpec
Defines compute resource requirements for jobs.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of this resource configuration (referenced by jobs) |
num_cpus | integer | required | Number of CPUs required |
memory | string | required | Memory requirement (e.g., "1m", "2g", "512k") |
num_gpus | integer | 0 | Number of GPUs required |
num_nodes | integer | 1 | Number of nodes per job (srun --nodes); allocation size is set via Slurm scheduler config |
runtime | string | "PT1H" | Runtime limit in ISO8601 duration format (e.g., "PT30M", "PT2H") |
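A resource requirement is defined once and referenced by name from jobs. An illustrative sketch (the name and sizes are hypothetical):

```yaml
resource_requirements:
  - name: "gpu_medium"
    num_cpus: 8
    memory: "16g"
    num_gpus: 1
    runtime: "PT4H"        # ISO8601 duration: 4 hours
jobs:
  - name: "train"
    command: "python train.py"
    resource_requirements: "gpu_medium"
```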
## FailureHandlerSpec
Defines error recovery strategies for jobs.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | required | Name of the failure handler (referenced by jobs) |
rules | [FailureHandlerRuleSpec] | required | Rules for handling different exit codes |
## FailureHandlerRuleSpec
A single rule within a failure handler for handling specific exit codes.
| Name | Type | Default | Description |
|---|---|---|---|
exit_codes | [integer] | [] | Exit codes that trigger this rule |
match_all_exit_codes | boolean | false | If true, matches any non-zero exit code |
recovery_script | string | none | Optional script to run before retrying |
max_retries | integer | 3 | Maximum number of retry attempts |
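Putting the two specs together, a handler with one rule per failure class might look like this. The names, script, and retry counts are illustrative; exit codes 137 and 152 match the default OOM and timeout exit codes documented under ExecutionConfig:

```yaml
failure_handlers:
  - name: "retry_transient"
    rules:
      - exit_codes: [137, 152]        # OOM- and timeout-killed jobs
        recovery_script: "./cleanup.sh"
        max_retries: 2
      - match_all_exit_codes: true    # any other non-zero exit code
        max_retries: 1
jobs:
  - name: "flaky_step"
    command: "./run.sh"
    failure_handler: "retry_transient"
```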
## SlurmSchedulerSpec
Defines a Slurm HPC job scheduler configuration.
| Name | Type | Default | Description |
|---|---|---|---|
name | string | none | Name of the scheduler (used for referencing) |
account | string | required | Slurm account |
partition | string | none | Slurm partition name |
nodes | integer | 1 | Number of nodes to allocate |
walltime | string | "01:00:00" | Wall time limit |
mem | string | none | Memory specification |
gres | string | none | Generic resources (e.g., GPUs) |
qos | string | none | Quality of service |
ntasks_per_node | integer | none | Number of tasks per node |
tmp | string | none | Temporary storage specification |
extra | string | none | Additional Slurm parameters |
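A scheduler entry that jobs can reference via their `scheduler` field might look like this (the account, partition, and sizes are hypothetical):

```yaml
slurm_schedulers:
  - name: "gpu_queue"
    account: "myproject"
    partition: "gpu"
    nodes: 2
    walltime: "04:00:00"
    gres: "gpu:2"
    qos: "normal"
```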
## ExecutionConfig
Controls how jobs are executed and terminated. Fields are grouped by which execution mode they apply to. Setting a field that doesn't match the effective mode produces a validation error at workflow creation time.
### Shared fields (both modes)
| Name | Type | Default | Description |
|---|---|---|---|
mode | string | "direct" | Execution mode: "direct", "slurm", or "auto" |
sigkill_headroom_seconds | integer | 60 | Seconds before end_time for SIGKILL or srun --time |
timeout_exit_code | integer | 152 | Exit code for timed-out jobs (matches Slurm TIMEOUT) |
staggered_start | boolean | true | Stagger job runner startup to mitigate thundering herd |
stdio | StdioConfig | see below | Workflow-level default for stdout/stderr capture |
### Direct mode fields
These fields only apply when the effective mode is direct. Setting them with `mode: slurm`
produces a validation error. When `mode: auto`, validation checks the effective mode based on
whether Slurm schedulers are present in the spec.
| Name | Type | Default | Description |
|---|---|---|---|
limit_resources | boolean | true | Monitor memory/CPU and kill jobs that exceed limits |
termination_signal | string | "SIGTERM" | Signal to send before SIGKILL for graceful shutdown |
sigterm_lead_seconds | integer | 30 | Seconds before SIGKILL to send the termination signal |
oom_exit_code | integer | 137 | Exit code for OOM-killed jobs (128 + SIGKILL) |
### Slurm mode fields
These fields only apply when the effective mode is slurm. Setting them with `mode: direct`
produces a validation error. When `mode: auto`, validation checks the effective mode based on
whether Slurm schedulers are present in the spec.
| Name | Type | Default | Description |
|---|---|---|---|
srun_termination_signal | string | none | Signal spec for srun --signal=<value> |
enable_cpu_bind | boolean | false | Allow Slurm CPU binding (--cpu-bind) |
### Worker-per-node Slurm launch field
This field applies only when `execution_config.mode: direct` is combined with a `schedule_nodes`
action that sets `start_one_worker_per_node: true`. In that mode, Torc launches one
`torc-slurm-job-runner` per allocated node with an outer `srun`, and this field is passed to that
launcher command.
| Name | Type | Default | Description |
|---|---|---|---|
srun_mpi | string | none | MPI mode for the outer job-runner launch: srun --mpi=<value> ... |
## StdioConfig
Controls how stdout and stderr are captured for job processes.
| Name | Type | Default | Description |
|---|---|---|---|
mode | StdioMode | "separate" | How to capture stdout/stderr |
delete_on_success | boolean | false | Delete captured files when a job completes successfully |
### StdioMode
| Value | Description |
|---|---|
separate | Separate stdout (.o) and stderr (.e) files per job (default) |
combined | Combine stdout and stderr into a single .log file per job |
no_stdout | Discard stdout (/dev/null); capture stderr only |
no_stderr | Discard stderr (/dev/null); capture stdout only |
none | Discard both stdout and stderr |
Per-job overrides can be set via the `stdio` field on individual JobSpec entries, which
takes precedence over the workflow-level setting.
### Stdio Examples
Combine stdout and stderr into a single file, and delete it on success:
```yaml
execution_config:
  stdio:
    mode: combined
    delete_on_success: true
```
Suppress stdout for most jobs, but keep separate files for a specific job:
```yaml
execution_config:
  stdio:
    mode: no_stdout
jobs:
  - name: preprocess
    command: python preprocess.py
  - name: train
    command: python train.py
    stdio:
      mode: separate
```
## Execution Modes
| Mode | Description |
|---|---|
direct | Torc manages job execution directly (default). Works everywhere |
slurm | Jobs wrapped with srun. Slurm manages resource limits and termination |
auto | Selects slurm if SLURM_JOB_ID is set, otherwise direct |
Warning: `auto` will silently select `slurm` mode when running inside a Slurm allocation. Prefer setting the mode explicitly to avoid unexpected behavior.
### Direct Mode Example
```yaml
execution_config:
  mode: direct
  limit_resources: true
  termination_signal: SIGTERM
  sigterm_lead_seconds: 30
  sigkill_headroom_seconds: 60
  timeout_exit_code: 152
  oom_exit_code: 137
```
### Direct Mode Worker-Per-Node Example
```yaml
execution_config:
  mode: direct
  srun_mpi: "none"
actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: multi_node
    scheduler_type: slurm
    start_one_worker_per_node: true
```
This launches the job runners with:
```shell
srun --ntasks-per-node=1 --mpi=none torc-slurm-job-runner ...
```
### Slurm Mode Example
```yaml
execution_config:
  mode: slurm
  srun_termination_signal: "TERM@120"
  sigkill_headroom_seconds: 180
  enable_cpu_bind: false
```
### Termination Timeline (Direct Mode)
With `sigkill_headroom_seconds=60` and `sigterm_lead_seconds=30`:
- `end_time - 90s`: Send SIGTERM (or the configured `termination_signal`)
- `end_time - 60s`: Send SIGKILL to remaining jobs, set exit code to `timeout_exit_code`
- `end_time`: Job runner exits
### Slurm Mode Headroom
In Slurm mode, `sigkill_headroom_seconds` controls `srun --time`. The step time limit is set to
`remaining_time - sigkill_headroom_seconds`, allowing the job runner to detect completion before the
allocation expires. For example, with 60 minutes remaining in the allocation and
`sigkill_headroom_seconds: 180`, the step runs with a 57-minute time limit.
## SlurmDefaultsSpec
Workflow-level default parameters applied to all Slurm schedulers. This is a map of parameter names to values.
Any valid `sbatch` long option can be specified (without the leading `--`), except for parameters
managed by torc: `partition`, `nodes`, `walltime`, `time`, `mem`, `gres`, `name`, `job-name`.
The `account` parameter is allowed as a workflow-level default.
Example:
```yaml
slurm_defaults:
  qos: "high"
  constraint: "cpu"
  mail-user: "user@example.com"
  mail-type: "END,FAIL"
```
## WorkflowActionSpec
Defines conditional actions triggered by workflow or job state changes.
| Name | Type | Default | Description |
|---|---|---|---|
trigger_type | string | required | When to trigger: "on_workflow_start", "on_workflow_complete", "on_jobs_ready", "on_jobs_complete" |
action_type | string | required | What to do: "run_commands", "schedule_nodes" |
jobs | [string] | none | For job triggers: exact job names to match |
job_name_regexes | [string] | none | For job triggers: regex patterns to match job names |
commands | [string] | none | For run_commands: commands to execute |
scheduler | string | none | For schedule_nodes: scheduler name |
scheduler_type | string | none | For schedule_nodes: scheduler type ("slurm", "local") |
num_allocations | integer | none | For schedule_nodes: number of node allocations |
start_one_worker_per_node | boolean | false | For schedule_nodes: launch one worker per node (direct mode only) |
max_parallel_jobs | integer | none | For schedule_nodes: maximum parallel jobs |
persistent | boolean | false | Whether the action persists and can be claimed by multiple workers |
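A sketch combining both action types (the scheduler and command names are hypothetical):

```yaml
actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: "gpu_queue"       # assumes a SlurmSchedulerSpec with this name
    scheduler_type: slurm
    num_allocations: 2
  - trigger_type: on_workflow_complete
    action_type: run_commands
    commands: ["./notify.sh done"]
```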
## ResourceMonitorConfig
Configuration for resource usage monitoring.
| Name | Type | Default | Description |
|---|---|---|---|
sample_interval_seconds | integer | 10 | Shared sampling interval in seconds |
flush_interval_seconds | integer | 300 | Flush interval for buffered time-series samples |
generate_plots | boolean | false | Generate resource usage plots |
jobs | JobMonitorConfig | none | Per-job monitoring config |
compute_node | ComputeNodeMonitorConfig | none | Overall compute-node monitoring |
Example with per-job summary monitoring and overall compute-node time-series monitoring:
```yaml
resource_monitor:
  sample_interval_seconds: 5
  flush_interval_seconds: 300
  jobs:
    enabled: true
    granularity: summary
  compute_node:
    enabled: true
    granularity: time_series
    cpu: true
    memory: true
```
`jobs.granularity` controls per-job monitoring. `compute_node.granularity` independently controls
overall compute-node monitoring, so a workflow can store per-job summaries while collecting
compute-node time series. Both scopes use the parent `sample_interval_seconds`; there is no
separate sampling interval per scope. When time-series monitoring is enabled, samples are flushed to
SQLite every `flush_interval_seconds`, which trades lower commit overhead against a bounded amount
of buffered data that could be lost in a crash. The current compute-node monitor records CPU and
memory; GPU monitoring is expected to use the same nested `compute_node` configuration in a future
release.
For backwards compatibility, top-level `enabled` and `granularity` fields are still accepted and
apply to job monitoring when `jobs` is omitted. New workflow specs should use `jobs`.
## JobMonitorConfig
Configuration for per-job CPU and memory monitoring.
| Name | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable job monitoring |
granularity | MonitorGranularity | "Summary" | Summary or time-series collection |
## ComputeNodeMonitorConfig
Configuration for opt-in overall compute-node CPU and memory monitoring.
| Name | Type | Default | Description |
|---|---|---|---|
enabled | boolean | false | Enable compute-node monitoring |
granularity | MonitorGranularity | "Summary" | Summary or time-series collection |
cpu | boolean | true | Record overall CPU utilization |
memory | boolean | true | Record overall memory usage |
## MonitorGranularity
Enum specifying the level of detail for resource monitoring.
| Value | Description |
|---|---|
Summary | Collect summary statistics only |
TimeSeries | Collect detailed time series data |
## Job Priority
Use `priority` when some ready jobs should be claimed before others.
- Higher values are preferred over lower values
- The default is `0`
- `claim_next_jobs` uses a stable tie-breaker for jobs with the same priority
- `claim_jobs_based_on_resources` prefers GPU jobs first within the same priority
- Priority affects both `claim_next_jobs` and `claim_jobs_based_on_resources`
Example:
```yaml
jobs:
  - name: urgent_step
    command: ./run_urgent.sh
    priority: 100
  - name: background_step
    command: ./run_background.sh
    priority: 10
```
## Parameter Formats
Parameters support several formats for generating multiple jobs or files:
| Format | Example | Description |
|---|---|---|
| Integer range | "1:100" | Inclusive range from 1 to 100 |
| Integer range with step | "0:100:10" | Range with step size |
| Float range | "0.0:1.0:0.1" | Float range with step |
| Integer list | "[1,5,10,100]" | Explicit list of integers |
| Float list | "[0.1,0.5,0.9]" | Explicit list of floats |
| String list | "['adam','sgd','rmsprop']" | Explicit list of strings |
Template substitution in strings:
- Basic: `{param_name}` - Replace with parameter value
- Formatted integer: `{i:03d}` - Zero-padded (001, 042, 100)
- Formatted float: `{lr:.4f}` - Precision (0.0010, 0.1000)
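To make the range and template semantics concrete, here is a Python sketch of how these formats could expand. This illustrates the documented semantics only; it is not Torc's implementation, and the `expand` helper and the grid values are hypothetical:

```python
import itertools

def expand(spec: str):
    """Illustrative expansion of a parameter spec string into a list of values."""
    def coerce(s):
        s = s.strip()
        if s.startswith("'") and s.endswith("'"):
            return s[1:-1]              # quoted string literal
        try:
            return int(s)
        except ValueError:
            return float(s)

    if spec.startswith("["):            # explicit list: "[1,5,10]" or "['adam','sgd']"
        return [coerce(s) for s in spec[1:-1].split(",")]
    parts = spec.split(":")             # range: "start:stop" or "start:stop:step"
    if any("." in p for p in parts):
        start, stop, step = (float(p) for p in parts)
        count = int(round((stop - start) / step)) + 1
        return [round(start + i * step, 10) for i in range(count)]
    nums = [int(p) for p in parts]
    start, stop = nums[0], nums[1]
    step = nums[2] if len(nums) == 3 else 1
    return list(range(start, stop + 1, step))   # ranges are inclusive

# "product" mode combines parameters as a Cartesian product;
# template fields like {seed:03d} zero-pad the substituted value.
grid = {"lr": expand("[0.01,0.1]"), "seed": expand("1:3")}
names = ["train_lr{0}_s{1:03d}".format(lr, seed)
         for lr, seed in itertools.product(grid["lr"], grid["seed"])]
# 2 lr values x 3 seeds = 6 generated names, e.g. "train_lr0.01_s001"
```

"zip" mode would pair the i-th value of each parameter instead of taking the full product.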
See the Job Parameterization reference for more details.