Tutorial: AI-Assisted Workflow Management

This tutorial shows how to use AI assistants to manage Torc workflows through natural language.

What You'll Learn

  • Set up an AI assistant to work with Torc
  • Create and manage workflows through conversation
  • Debug failures by asking questions

Prerequisites

  • Torc installed and server running
  • Claude Code or VS Code with GitHub Copilot

What Can AI Assistants Do?

With Torc's AI integration, you can manage workflows using natural language:

| Task             | Example                                          |
|------------------|--------------------------------------------------|
| Create workflows | "Create a workflow with 10 parallel jobs"        |
| Check status     | "What's the status of workflow 42?"              |
| Debug failures   | "Why did a job in workflow 5 fail?"              |
| Fix problems     | "Restart the failed jobs with doubled memory"    |
| Investigate      | "Check if any jobs exceeded their memory limits" |

Quick Setup: Claude Code

Step 1: Add the Torc MCP server to your project:

claude mcp add torc \
  --scope project \
  -e TORC_API_URL=http://localhost:8080/torc-service/v1 \
  -e TORC_OUTPUT_DIR=./output \
  -- /path/to/torc-mcp-server

Step 2: Verify it's configured:

claude mcp list

Step 3: Start a new Claude Code session and try:

"What workflows exist?"


Quick Setup: VS Code + Copilot

Step 1: Create .vscode/mcp.json in your project:

{
  "servers": {
    "torc": {
      "command": "/path/to/torc-mcp-server",
      "env": {
        "TORC_API_URL": "http://localhost:8080/torc-service/v1",
        "TORC_OUTPUT_DIR": "./output"
      }
    }
  }
}

Step 2: In Copilot Chat, use Agent Mode and try:

"What workflows exist?"


Example: Creating a Workflow

> Create a workflow with 3 stages: a preprocess script, 10 work jobs that each take an index, and a postprocess job. The work jobs each require 10 CPUs and 20 GB of memory. The other jobs have trivial resource requirements.

⏺ I'll create a workflow spec file with the 3-stage fan-out/fan-in pattern you described.

⏺ Created the workflow spec at ./workflow.json with:

  Structure:
  - Stage 1: preprocess - runs first, outputs preprocess_output.txt
  - Stage 2: work_0 through work_9 - 10 parallel jobs, each reads preprocess output
  - Stage 3: postprocess - waits for all work jobs to complete (fan-in pattern)

  Resource Requirements:
  | Name  | CPUs | Memory | Runtime |
  |-------|------|--------|---------|
  | small | 1    | 1g     | 1 hour  |
  | large | 10   | 20g    | 2 hours |

  Next steps:
  1. Edit the spec to replace placeholder commands with your actual scripts
  2. Run locally: torc run workflow.json
  3. Or submit to Slurm: torc submit workflow.json (requires adding an account)
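
For orientation, a spec matching that description might look roughly like the sketch below. Treat it as an illustrative sketch only: the field names (files, jobs, input_files, output_files, and so on), the memory and runtime string formats, and the commands are assumptions, not Torc's exact schema, so compare it with the workflow.json the assistant actually generates before reusing anything.

{
  "resource_requirements": [
    {"name": "small", "num_cpus": 1, "memory": "1g", "runtime": "P0DT1H"},
    {"name": "large", "num_cpus": 10, "memory": "20g", "runtime": "P0DT2H"}
  ],
  "files": [
    {"name": "preprocess_output", "path": "preprocess_output.txt"}
  ],
  "jobs": [
    {
      "name": "preprocess",
      "command": "bash preprocess.sh",
      "resource_requirements": "small",
      "output_files": ["preprocess_output"]
    },
    {
      "name": "work_0",
      "command": "bash work.sh 0",
      "resource_requirements": "large",
      "input_files": ["preprocess_output"]
    },
    {
      "name": "postprocess",
      "command": "bash postprocess.sh",
      "resource_requirements": "small"
    }
  ]
}

work_1 through work_9 are omitted for brevity; they repeat the work_0 entry with their own index. Notice that no dependency edges are written down: the work jobs run after preprocess because they read the file it writes, and the postprocess fan-in would be expressed the same way, by listing each work job's output file among its inputs.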

You can create much more complex workflows. The output is not shown here, but the following prompt creates a workflow whose job dependencies are defined implicitly through job-file relationships:

> Create a workflow with four stages of jobs. There are 3 initial jobs each dependent on an input file. Each of those jobs fans out into 10 work jobs through output files. Each of those jobs creates a file. In the third stage there is a script that aggregates each of the 10 jobs into one file. The final postprocess job aggregates those three files into one. I will run it with Slurm: account=my_account. The work jobs require 10 cpus and 30 GB of memory. The other jobs have trivial resource requirements.
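
The phrase "defined implicitly through job-file relationships" refers to the same mechanism as in the sketch above: a job that declares a file as an output runs before any job that declares that file as an input, so the four stages order themselves without explicit dependency lists. Using the same illustrative (assumed, not authoritative) field names, with job and script names made up for illustration, one such link between an initial job and one of its work jobs could look like:

{
  "files": [
    {"name": "initial_0_out", "path": "initial_0_out.txt"}
  ],
  "jobs": [
    {"name": "initial_0", "command": "bash initial.sh 0", "output_files": ["initial_0_out"]},
    {"name": "work_0_0", "command": "bash work.sh 0 0", "input_files": ["initial_0_out"]}
  ]
}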

Example: Adjusting jobs' resource requirements

Suppose jobs are failing, or running poorly, because they use more CPU or memory than they requested.

> Check the resource utilization for workflow 20

⏺ torc - check_resource_utilization (MCP)(workflow_id: 20)
  ⎿  {
       "over_utilization_count": 1,
       "run_id": null,
     … +17 lines (ctrl+o to expand)

⏺ Resource Utilization for Workflow 20

  Summary: 3 job results analyzed, 1 resource violation found

  CPU Over-Utilization

  | Job ID | Job Name      | Specified      | Peak Used | Over By |
  |--------|---------------|----------------|-----------|---------|
  | 264    | cpu_heavy_job | 3 cores (300%) | 602.5%    | +100.8% |

  The cpu_heavy_job used approximately 6 CPU cores at peak, but was only allocated 3 cores.

  Would you like me to update the resource requirements for this job? I can increase the CPU allocation to 7 cores to provide headroom for future runs.

> yes

⏺ torc - update_job_resources (MCP)(job_id: 264, num_cpus: 7)
  ⎿  {
       "job_id": 264,
       "resource_requirements_id": 45,
     … +8 lines (ctrl+o to expand)

⏺ Done. Updated cpu_heavy_job (job 264) resources:

  | Setting | Value      |
  |---------|------------|
  | CPUs    | 7 (was 3)  |
  | Memory  | 1g         |
  | Runtime | 15 minutes |

  The job is now configured with enough CPU capacity for future runs.

Example: Recovering a failed workflow that timed out

The workflow in tests/workflows/timeout_auto_recovery_test is used to test torc's auto-recovery of walltime timeouts. In this example, the workflow was left in a failed state.

Here is a conversation in Claude Code after running the workflow:

> analyze the logs for workflow 6

Workflow 6 Log Analysis Summary

  Workflow: timeout_auto_recovery_test - Test workflow for automatic timeout recovery

  Job Results

  | Job | Name       | Status        | Runtime | Return Code  |
  |-----|------------|---------------|---------|--------------|
  | 31  | (fast job) | ✅ Completed  | 1.0 min | 0            |
  | 32  | job_slow   | ⛔ Terminated | 5.9 min | -9 (SIGKILL) |

  Terminated Job Details (job_slow)

  The job was killed by signal 9 (SIGKILL) after running for ~6 minutes, which matches the expected behavior for a timeout test.

  stdout log shows the job ran through minute 6 before being killed:
  Slow job starting at Sat Dec 27 15:40:59 MST 2025
  This job will run for 10 minutes (exceeds 5 min runtime and 8 min walltime)
  It should be killed by Slurm and recovered by the watcher
  Minute 1 of 10...
  Minute 2 of 10...
  ...
  Minute 6 of 10...

  stderr log: Empty (no errors written)

  Conclusion

  The timeout auto-recovery test is working as expected:
  - The slow job exceeded its 5-minute runtime limit
  - It was killed by the system (SIGKILL, return code -9) at ~6 minutes
  - Peak memory usage was ~4.1 MB (no memory issues)

Example: Debugging a failed workflow that over-consumed memory

The workflow in tests/workflows/oom_auto_recovery_test is used to test torc's diagnosis of memory over-consumption errors.

Here is a conversation in Claude Code after running the workflow:

> Analyze the logs for workflow 16

● torc - analyze_workflow_logs (MCP)(workflow_id: 16)
  ⎿  {
       "error_count": 0,
       "errors_by_type": [],
     … +7 lines (ctrl+o to expand)

● torc - get_workflow_status (MCP)(workflow_id: 16)
  ⎿  {
       "description": "Test workflow with an OOM",
       "job_counts_by_status": {
     … +8 lines (ctrl+o to expand)

● The workflow has 3 failed jobs out of 10 total, but the log analyzer didn't find error patterns. Let me dig deeper:

● torc - list_failed_jobs (MCP)(workflow_id: 16)
  ⎿  {
       "failed_job_count": 3,
       "failed_jobs": [
     … +18 lines (ctrl+o to expand)

● torc - check_resource_utilization (MCP)(workflow_id: 16, include_failed: true)
  ⎿  {
       "failed_jobs": [
         {
     … +137 lines (ctrl+o to expand)

● Workflow 16 Analysis Summary

Workflow: oom_test - "Test workflow with an OOM"
Status: 7 completed, 3 failed (10 total jobs)

Root Cause: Out of Memory (OOM)

All 3 failed jobs were killed due to OOM (return code 137):

| Job ID | Name   | Peak Memory        | Configured | Over-Utilization |
|--------|--------|--------------------|------------|------------------|
| 103    | work_1 | N/A (killed early) | 10g        | SIGKILL 137      |
| 109    | work_7 | 29.91 GB           | 10g        | +199.1%          |
| 110    | work_8 | 26.89 GB           | 10g        | +168.9%          |

Critical Warning: More Failures Expected

9 jobs exceeded their 10 GB memory limit - including 6 jobs that happened to complete but used ~30 GB each:

  • work_2 through work_6, work_9, work_10 all peaked at ~30.01 GB (+200.1% over limit)
  • These completed due to available system memory but are also misconfigured

Recommendation

  1. Increase memory allocation to 32g for all work jobs
  2. Use automated recovery - I can run a dry-run of the recovery tool to preview fixes

Would you like me to run recover_workflow with dry_run=true to preview the recovery actions?

> yes

● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true)
  ⎿  {
       "dry_run": true,
       "memory_multiplier": 1.5,
     … +9 lines (ctrl+o to expand)

● Recovery Preview (Dry Run)

Proposed Actions:

  • 3 failed jobs would have memory increased (1.5x multiplier: 10g → 15g)
  • Failed jobs would be reset to ready status
  • Slurm schedulers would be regenerated

⚠️ Important Consideration

The default 1.5x memory multiplier would increase memory from 10g → 15g, but the jobs actually peaked at ~30 GB. This means:

  • 15g may still cause OOM failures
  • A 3.5x multiplier (10g → 35g) would be safer based on actual usage

Would you like me to proceed with:

  1. Default recovery (1.5x = 15g) - may require another recovery cycle
  2. Custom recovery with a higher memory multiplier (e.g., 3.5x = 35g)

> 2

● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true, memory_multiplier: 3.5)
  ⎿  {
       "dry_run": true,
       "memory_multiplier": 3.5,
     … +9 lines (ctrl+o to expand)

● Recovery Preview (3.5x Memory Multiplier)

Proposed Actions:

  • 3 failed jobs would have memory increased: 10g → 35g
  • Failed jobs would be reset to ready status
  • Slurm schedulers would be regenerated

This should safely cover the ~30 GB peak memory usage observed.

Would you like me to proceed with these recovery actions?

> yes

● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: false, memory_multiplier: 3.5)
  ⎿  {
       "dry_run": false,
       "memory_multiplier": 3.5,
     … +7 lines (ctrl+o to expand)

● Recovery Complete ✓

Actions taken:

  • 3 jobs had memory increased from 10g → 35g
  • Failed jobs reset to ready status
  • Slurm schedulers regenerated and submitted

● torc - get_workflow_status (MCP)(workflow_id: 16)
  ⎿  {
       "description": "Test workflow with an OOM",
       "job_counts_by_status": {
     … +8 lines (ctrl+o to expand)

● Current Status:

| Status    | Count |
|-----------|-------|
| Completed | 7     |
| Ready     | 3     |

The 3 previously failed jobs (work_1, work_7, work_8) are now ready to run with 35g memory. Slurm allocations have been submitted and will execute these jobs automatically.


What You Learned

  • How to configure AI assistants with Torc (one command for Claude Code, one file for VS Code)
  • How to create workflows through conversation
  • How to debug and fix failures using natural language

Next Steps