Tutorial: AI-Assisted Workflow Management

This tutorial shows how to use AI assistants to manage Torc workflows using natural language.

What You'll Learn

  • Set up an AI assistant to work with Torc
  • Create and manage workflows through conversation
  • Debug failures by asking questions

Prerequisites

  • Torc installed and server running
  • Claude Code or VS Code with GitHub Copilot

What Can AI Assistants Do?

With Torc's AI integration, you can manage workflows using natural language:

| Task               | Example                                                   |
|--------------------|-----------------------------------------------------------|
| Create workflows   | "Create a workflow with 10 parallel jobs"                 |
| Check status       | "What's the status of workflow 42?"                       |
| Debug failures     | "Why did a job in workflow 5 fail?"                       |
| Fix problems       | "Restart the failed jobs with doubled memory"             |
| Investigate        | "Check if any jobs exceeded their memory limits"          |
| Optimize resources | "Analyze usage and suggest better resource groupings"     |
| Plan allocations   | "Should I use one big allocation or many small ones?"     |

Quick Setup: Claude Code

Step 1: Add the Torc MCP server to your project:

claude mcp add torc \
  --scope project \
  -e TORC_API_URL=http://localhost:8080/torc-service/v1 \
  -e TORC_OUTPUT_DIR=./output \
  -- /path/to/torc-mcp-server

Step 2: Verify it's configured:

claude mcp list

Step 3: Start a new Claude Code session and try:

"What workflows exist?"


Quick Setup: VS Code + Copilot

Step 1: Create .vscode/mcp.json in your project:

{
  "servers": {
    "torc": {
      "command": "/path/to/torc-mcp-server",
      "env": {
        "TORC_API_URL": "http://localhost:8080/torc-service/v1",
        "TORC_OUTPUT_DIR": "./output"
      }
    }
  }
}

Step 2: In Copilot Chat, use Agent Mode and try:

"What workflows exist?"


Example: Creating a Workflow

> Create a workflow with 3 stages: a preprocess script, 10 work jobs that each take an index, and a postprocess job. The work jobs each require 10 CPUs and 20 GB of memory. The other jobs have trivial resource requirements.

⏺ I'll create a workflow spec file with the 3-stage fan-out/fan-in pattern you described.

⏺ Created the workflow spec at ./workflow.json with:

  Structure:
  - Stage 1: preprocess - runs first, outputs preprocess_output.txt
  - Stage 2: work_0 through work_9 - 10 parallel jobs, each reads preprocess output
  - Stage 3: postprocess - waits for all work jobs to complete (fan-in pattern)

  Resource Requirements:
  | Name  | CPUs | Memory | Runtime |
  |-------|------|--------|---------|
  | small | 1    | 1g     | 1 hour  |
  | large | 10   | 20g    | 2 hours |

  Next steps:
  1. Edit the spec to replace placeholder commands with your actual scripts
  2. Run locally: torc run workflow.json
  3. Or submit to Slurm: torc submit workflow.json (requires adding an account)
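
If you prefer to build a spec like this yourself, or want to see roughly what the assistant produced, a short script can generate the same fan-out/fan-in structure. This is only a sketch: the field names below (jobs, resource_requirements, input_files, output_files, num_cpus, memory, runtime) are illustrative assumptions rather than Torc's confirmed schema, so compare the result against the ./workflow.json the assistant actually wrote.

import json

# Sketch of a 3-stage fan-out/fan-in spec like the one described above.
# NOTE: field names are assumptions for illustration, not Torc's confirmed schema.
resources = [
    {"name": "small", "num_cpus": 1, "memory": "1g", "runtime": "PT1H"},
    {"name": "large", "num_cpus": 10, "memory": "20g", "runtime": "PT2H"},
]

jobs = [{
    "name": "preprocess",
    "command": "bash preprocess.sh",              # placeholder command
    "resource_requirements": "small",
    "output_files": ["preprocess_output.txt"],
}]
for i in range(10):
    jobs.append({
        "name": f"work_{i}",
        "command": f"bash work.sh {i}",           # each work job takes an index
        "resource_requirements": "large",
        "input_files": ["preprocess_output.txt"], # fan-out: every work job reads the preprocess output
        "output_files": [f"work_{i}.out"],
    })
jobs.append({
    "name": "postprocess",
    "command": "bash postprocess.sh",             # placeholder command
    "resource_requirements": "small",
    "input_files": [f"work_{i}.out" for i in range(10)],  # fan-in: runs after all work jobs
})

with open("workflow.json", "w") as f:
    json.dump({"name": "fan_out_fan_in", "resource_requirements": resources, "jobs": jobs},
              f, indent=2)

As with the assistant-generated spec, replace the placeholder commands with your real scripts before running torc run workflow.json or torc submit workflow.json.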

You can create much more complex workflows. The output is not shown here, but the following prompt creates a workflow whose job dependencies are defined implicitly through job-file relationships:

> Create a workflow with four stages of jobs. There are 3 initial jobs, each dependent on an input file. Each of those jobs fans out into 10 work jobs through output files, and each work job creates a file. In the third stage, a script aggregates each group of 10 work-job outputs into one file. The final postprocess job aggregates those three files into one. I will run it with Slurm: account=my_account. The work jobs require 10 CPUs and 30 GB of memory. The other jobs have trivial resource requirements.

Example: Adjusting jobs' resource requirements

Suppose that jobs fail because they use too much CPU or memory.

> Check the resource utilization for workflow 20

⏺ torc - check_resource_utilization (MCP)(workflow_id: 20)
  ⎿  {
       "over_utilization_count": 1,
       "run_id": null,
     … +17 lines (ctrl+o to expand)

⏺ Resource Utilization for Workflow 20

  Summary: 3 job results analyzed, 1 resource violation found

  CPU Over-Utilization

  | Job ID | Job Name      | Specified      | Peak Used | Over By |
  |--------|---------------|----------------|-----------|---------|
  | 264    | cpu_heavy_job | 3 cores (300%) | 602.5%    | +100.8% |

  The cpu_heavy_job used approximately 6 CPU cores at peak, but was only allocated 3 cores.

  Would you like me to update the resource requirements for this job? I can increase the CPU allocation to 7 cores to provide headroom for future runs.

> yes

⏺ torc - update_job_resources (MCP)(job_id: 264, num_cpus: 7)
  ⎿  {
       "job_id": 264,
       "resource_requirements_id": 45,
     … +8 lines (ctrl+o to expand)

⏺ Done. Updated cpu_heavy_job (job 264) resources:

  | Setting | Value      |
  |---------|------------|
  | CPUs    | 7 (was 3)  |
  | Memory  | 1g         |
  | Runtime | 15 minutes |

  The job is now configured with enough CPU capacity for future runs.
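
The sizing in this exchange is easy to reproduce by hand. Below is a minimal sketch of the arithmetic behind the "Over By" column and the 7-core suggestion, using the figures reported above; the round-up rule at the end is one reasonable choice, not necessarily what the assistant used.

import math

specified_cpus = 3           # allocation in the resource requirements (300%)
peak_cpu_percent = 602.5     # peak utilization reported for cpu_heavy_job

# "Over By" is the peak relative to the specified allocation.
over_by = (peak_cpu_percent - specified_cpus * 100) / (specified_cpus * 100)
print(f"over-utilization: {over_by:+.1%}")          # +100.8%

# One simple sizing rule: round the observed peak up to the next whole core,
# which turns the ~6-core peak into a 7-core allocation with some headroom.
recommended_cpus = math.ceil(peak_cpu_percent / 100)
print(f"recommended num_cpus: {recommended_cpus}")  # 7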

Example: Regrouping resources based on actual usage

When jobs are initially assigned to broad resource groups, actual usage may reveal that a finer grouping would be more efficient. The AI can analyze per-job resource measurements and propose new groups.

In this example, 6 data-processing jobs are split across 2 resource groups (standard and high_mem), but actual memory usage falls into 3 distinct clusters.

> Analyze resource usage for workflow 42 and suggest better groupings.

⏺ torc - analyze_resource_usage (MCP)(workflow_id: 42)
  ⎿  {
       "workflow_id": 42,
       "total_jobs": 6,
       "total_jobs_with_results": 6,
       "resource_groups": [ ...
     … +80 lines (ctrl+o to expand)

⏺ Resource Usage Analysis for Workflow 42

  Current grouping: 2 groups (standard, high_mem), both configured with 2g memory.

  Actual peak memory by job:

  | Job              | Current RR | Peak Memory |
  |------------------|------------|-------------|
  | process_chunk_0  | standard   | 210 MB      |
  | process_chunk_1  | standard   | 525 MB      |
  | process_chunk_2  | standard   | 210 MB      |
  | process_chunk_3  | high_mem   | 1050 MB     |
  | process_chunk_4  | high_mem   | 1049 MB     |
  | process_chunk_5  | high_mem   | 524 MB      |

  I can see 3 natural clusters:
  - ~200 MB: process_chunk_0, process_chunk_2
  - ~500 MB: process_chunk_1, process_chunk_5
  - ~1000 MB: process_chunk_3, process_chunk_4

  The current 2-group split doesn't match these clusters — chunk_1 (500 MB) is
  in the "standard" group alongside 200 MB jobs, and chunk_5 (500 MB) is in
  "high_mem" alongside 1000 MB jobs.

  Would you like me to propose a 3-group regrouping with appropriate memory
  allocations?

> yes

⏺ torc - regroup_job_resources (MCP)(
    workflow_id: 42,
    dry_run: true,
    groups: [
      { name: "small", memory: "300m", num_cpus: 1, runtime: "PT5M",
        job_ids: [301, 303] },
      { name: "medium", memory: "700m", num_cpus: 1, runtime: "PT5M",
        job_ids: [302, 306] },
      { name: "large", memory: "1300m", num_cpus: 1, runtime: "PT5M",
        job_ids: [304, 305] }
    ])
  ⎿  {
       "workflow_id": 42,
       "dry_run": true,
       "groups": [ ...
     … +40 lines (ctrl+o to expand)

⏺ Regrouping Preview (Dry Run)

  | Group  | Memory | Jobs                             |
  |--------|--------|----------------------------------|
  | small  | 300m   | process_chunk_0, process_chunk_2 |
  | medium | 700m   | process_chunk_1, process_chunk_5 |
  | large  | 1300m  | process_chunk_3, process_chunk_4 |

  Each group has ~30% headroom above peak usage.
  New RR records will be created; existing RRs are not modified.

  Would you like me to apply this regrouping?

> yes

⏺ torc - regroup_job_resources (MCP)(workflow_id: 42, dry_run: false, groups: [...])
  ⎿  {
       "success": true,
       "total_jobs_updated": 6,
     … +20 lines (ctrl+o to expand)

⏺ Regrouping Complete ✓

  Created 3 new resource requirement groups and reassigned all 6 jobs:
  - small (300m): 2 jobs
  - medium (700m): 2 jobs
  - large (1300m): 2 jobs

  On the next run, each job will request only the memory it actually needs.
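
Before accepting a regrouping like this, it can be worth sanity-checking the proposed limits against the measured peaks. A quick sketch using the numbers from the exchange above:

# Proposed memory limit and worst observed peak per group, from the analysis above.
groups = {
    "small":  {"limit_mb": 300,  "peak_mb": 210},
    "medium": {"limit_mb": 700,  "peak_mb": 525},
    "large":  {"limit_mb": 1300, "peak_mb": 1050},
}

for name, g in groups.items():
    headroom = g["limit_mb"] / g["peak_mb"] - 1
    print(f'{name}: {g["limit_mb"]}m limit, {headroom:.0%} headroom over a {g["peak_mb"]} MB peak')

# Prints roughly 43%, 33%, and 24%, in line with the "~30% headroom" quoted above,
# with the most margin on the smallest group.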

Example: Planning Slurm allocation strategy

When submitting a workflow to Slurm, the AI can analyze the cluster state and recommend whether to use a single large allocation or many small ones.

> I'm about to submit this workflow to Slurm. Should I use one big
> allocation or many small ones? My account is "climate_sim".

⏺ torc - plan_allocations (MCP)(spec_json: {...}, account: "climate_sim")
  ⎿  { "recommendations": [...], "cluster_state": [...], ... }

⏺ Allocation Strategy Analysis

  Workflow: 167 parallel work jobs, 1 preprocess, 1 postprocess
  Cluster: short partition — 42 idle, 180 allocated, 12 pending jobs

  Recommendation: Single large allocation (1 x 167 nodes)

  | Strategy     | Est. Start | Est. Completion |
  |--------------|------------|-----------------|
  | Single large | ~30 min    | ~4h 30min       |
  | Many small   | ~5 min     | ~6h 30min       |

  Reason: Slurm's backfill scheduler gives the large allocation a
  reserved slot. Many small allocations would start faster initially,
  but fair-share degradation delays the last jobs by ~2 hours.

  Your DAG has max_parallelism=167, so all work jobs can run
  simultaneously — a single allocation fully utilizes all nodes.

> Go ahead and generate it with the single allocation.

⏺ Running: torc slurm generate --account climate_sim --single-allocation workflow.yaml
  ⎿  Generated scheduler plan with 1 allocation x 167 nodes.

⏺ Done. Submit with: torc submit workflow.yaml

For background on the tradeoffs, see Allocation Strategies.


Example: Recovering a failed workflow that timed out

The workflow in tests/workflows/timeout_auto_recovery_test is used to test Torc's automatic recovery from walltime timeouts. In this example, the workflow was left in a failed state.

Here is a conversation in Claude Code after running the workflow:

> analyze the logs for workflow 6

Workflow 6 Log Analysis Summary

  Workflow: timeout_auto_recovery_test - Test workflow for automatic timeout recovery

  Job Results

  | Job | Name       | Status        | Runtime | Return Code  |
  |-----|------------|---------------|---------|--------------|
  | 31  | (fast job) | ✅ Completed  | 1.0 min | 0            |
  | 32  | job_slow   | ⛔ Terminated | 5.9 min | -9 (SIGKILL) |

  Terminated Job Details (job_slow)

  The job was killed by signal 9 (SIGKILL) after running for ~6 minutes, which matches the expected behavior for a timeout test.

  stdout log shows the job ran through minute 6 before being killed:
  Slow job starting at Sat Dec 27 15:40:59 MST 2025
  This job will run for 10 minutes (exceeds 5 min runtime and 8 min walltime)
  It should be killed by Slurm and recovered by the watcher
  Minute 1 of 10...
  Minute 2 of 10...
  ...
  Minute 6 of 10...

  stderr log: Empty (no errors written)

  Conclusion

  The timeout auto-recovery test is working as expected:
  - The slow job exceeded its 5-minute runtime limit
  - It was killed by the system (SIGKILL, return code -9) at ~6 minutes
  - Peak memory usage was ~4.1 MB (no memory issues)

Example: Debugging a failed workflow with over-consumption of memory

The workflow in tests/workflows/oom_auto_recovery_test is used to test Torc's diagnosis of memory over-consumption errors.

Here is a conversation in Claude Code after running the workflow:

> Analyze the logs for workflow 16

● torc - analyze_workflow_logs (MCP)(workflow_id: 16)
  ⎿  { "error_count": 0, "errors_by_type": [], … +7 lines (ctrl+o to expand)

● torc - get_workflow_status (MCP)(workflow_id: 16)
  ⎿  { "description": "Test workflow with an OOM", "job_counts_by_status": { … +8 lines (ctrl+o to expand)

● The workflow has 3 failed jobs out of 10 total, but the log analyzer didn't find error patterns. Let me dig deeper:

● torc - list_failed_jobs (MCP)(workflow_id: 16)
  ⎿  { "failed_job_count": 3, "failed_jobs": [ … +18 lines (ctrl+o to expand)

● torc - check_resource_utilization (MCP)(workflow_id: 16, include_failed: true)
  ⎿  { "failed_jobs": [ { … +137 lines (ctrl+o to expand)

● Workflow 16 Analysis Summary

Workflow: oom_test - "Test workflow with an OOM"
Status: 7 completed, 3 failed (10 total jobs)

Root Cause: Out of Memory (OOM)

All 3 failed jobs were killed due to OOM (return code 137):

| Job ID | Name   | Peak Memory        | Configured | Over-Utilization |
|--------|--------|--------------------|------------|------------------|
| 103    | work_1 | N/A (killed early) | 10g        | SIGKILL 137      |
| 109    | work_7 | 29.91 GB           | 10g        | +199.1%          |
| 110    | work_8 | 26.89 GB           | 10g        | +168.9%          |

Critical Warning: More Failures Expected

9 jobs exceeded their 10 GB memory limit - including 6 jobs that happened to complete but used ~30 GB each:

  • work_2 through work_6, work_9, work_10 all peaked at ~30.01 GB (+200.1% over limit)
  • These completed due to available system memory but are also misconfigured

Recommendation

  1. Increase memory allocation to 32g for all work jobs
  2. Use automated recovery - I can run a dry-run of the recovery tool to preview fixes

Would you like me to run recover_workflow with dry_run=true to preview the recovery actions?

> yes

● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true)
  ⎿  { "dry_run": true, "memory_multiplier": 1.5, … +9 lines (ctrl+o to expand)

● Recovery Preview (Dry Run)

Proposed Actions:

  • 3 failed jobs would have memory increased (1.5x multiplier: 10g → 15g)
  • Failed jobs would be reset to ready status
  • Slurm schedulers would be regenerated

⚠️ Important Consideration

The default 1.5x memory multiplier would increase memory from 10g → 15g, but the jobs actually peaked at ~30 GB. This means:

  • 15g may still cause OOM failures
  • A 3.5x multiplier (10g → 35g) would be safer based on actual usage

Would you like me to proceed with:

  1. Default recovery (1.5x = 15g) - may require another recovery cycle
  2. Custom recovery with a higher memory multiplier (e.g., 3.5x = 35g)

> 2

● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true, memory_multiplier: 3.5)
  ⎿  { "dry_run": true, "memory_multiplier": 3.5, … +9 lines (ctrl+o to expand)

● Recovery Preview (3.5x Memory Multiplier)

Proposed Actions:

  • 3 failed jobs would have memory increased: 10g → 35g
  • Failed jobs would be reset to ready status
  • Slurm schedulers would be regenerated

This should safely cover the ~30 GB peak memory usage observed.

Would you like me to proceed with these recovery actions?

> yes

● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: false, memory_multiplier: 3.5)
  ⎿  { "dry_run": false, "memory_multiplier": 3.5, … +7 lines (ctrl+o to expand)

● Recovery Complete ✓

Actions taken:

  • 3 jobs had memory increased from 10g → 35g
  • Failed jobs reset to ready status
  • Slurm schedulers regenerated and submitted

● torc - get_workflow_status (MCP)(workflow_id: 16)
  ⎿  { "description": "Test workflow with an OOM", "job_counts_by_status": { … +8 lines (ctrl+o to expand)

● Current Status:

| Status    | Count |
|-----------|-------|
| Completed | 7     |
| Ready     | 3     |

The 3 previously failed jobs (work_1, work_7, work_8) are now ready to run with 35g memory. Slurm allocations have been submitted and will execute these jobs automatically.
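
The multiplier discussion above generalizes to a simple rule: the multiplier has to at least cover the worst observed peak divided by the configured limit, plus whatever headroom you want on top. A minimal sketch with this workflow's numbers:

configured_gb = 10.0      # original per-job memory limit
worst_peak_gb = 30.01     # highest peak observed across the work jobs (see analysis above)

minimum_multiplier = worst_peak_gb / configured_gb   # ~3.0; anything lower risks another OOM
chosen_multiplier = 3.5                              # the value applied in the recovery above
new_limit_gb = configured_gb * chosen_multiplier     # 35 GB
headroom = new_limit_gb / worst_peak_gb - 1          # ~17% above the worst observed peak

print(f"minimum multiplier: {minimum_multiplier:.2f}")
print(f"{chosen_multiplier}x -> {new_limit_gb:.0f} GB limit ({headroom:.0%} headroom)")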


What You Learned

  • How to configure AI assistants with Torc (one command for Claude Code, one file for VS Code)
  • How to create workflows through conversation
  • How to debug and fix failures using natural language

Next Steps