Tutorial: AI-Assisted Workflow Management
This tutorial shows how to manage Torc workflows through natural-language conversation with an AI assistant.
What You'll Learn
- Set up an AI assistant to work with Torc
- Create and manage workflows through conversation
- Debug failures by asking questions
Prerequisites
- Torc installed and server running
- Claude Code or VS Code with GitHub Copilot
What Can AI Assistants Do?
With Torc's AI integration, you can manage workflows using natural language:
| Task | Example |
|---|---|
| Create workflows | "Create a workflow with 10 parallel jobs" |
| Check status | "What's the status of workflow 42?" |
| Debug failures | "Why did a job in workflow 5 fail?" |
| Fix problems | "Restart the failed jobs with doubled memory" |
| Investigate | "Check if any jobs exceeded their memory limits" |
Quick Setup: Claude Code
Step 1: Add the Torc MCP server to your project:
claude mcp add torc \
--scope project \
-e TORC_API_URL=http://localhost:8080/torc-service/v1 \
-e TORC_OUTPUT_DIR=./output \
-- /path/to/torc-mcp-server
Step 2: Verify it's configured:
claude mcp list
Step 3: Start a new Claude Code session and try:
"What workflows exist?"
Quick Setup: VS Code + Copilot
Step 1: Create .vscode/mcp.json in your project:
{
  "servers": {
    "torc": {
      "command": "/path/to/torc-mcp-server",
      "env": {
        "TORC_API_URL": "http://localhost:8080/torc-service/v1",
        "TORC_OUTPUT_DIR": "./output"
      }
    }
  }
}
Step 2: In Copilot Chat, use Agent Mode and try:
"What workflows exist?"
Example: Creating a Workflow
> Create a workflow with 3 stages: a preprocess script, 10 work jobs that each take an index, and a postprocess job. The work jobs each require 10 CPUs and 20 GB of memory. The other jobs have trivial resource requirements.
⏺ I'll create a workflow spec file with the 3-stage fan-out/fan-in pattern you described.
⏺ Created the workflow spec at ./workflow.json with:
Structure:
- Stage 1: preprocess - runs first, outputs preprocess_output.txt
- Stage 2: work_0 through work_9 - 10 parallel jobs, each reads preprocess output
- Stage 3: postprocess - waits for all work jobs to complete (fan-in pattern)
Resource Requirements:
| Name | CPUs | Memory | Runtime |
|-------|------|--------|---------|
| small | 1 | 1g | 1 hour |
| large | 10 | 20g | 2 hours |
Next steps:
1. Edit the spec to replace placeholder commands with your actual scripts
2. Run locally: torc run workflow.json
3. Or submit to Slurm: torc submit workflow.json (requires adding an account)
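The spec itself is a plain JSON file you can inspect and edit before running. For orientation only, an abridged sketch of the kind of structure to expect is shown below; the field names are illustrative assumptions rather than Torc's exact schema, so treat the workflow.json the assistant generated and Torc's own documentation as the source of truth:

```json
{
  "jobs": [
    { "name": "preprocess",  "command": "bash preprocess.sh",  "resource_requirements": "small" },
    { "name": "work_0",      "command": "bash work.sh 0",      "resource_requirements": "large" },
    { "name": "postprocess", "command": "bash postprocess.sh", "resource_requirements": "small" }
  ],
  "resource_requirements": [
    { "name": "small", "num_cpus": 1,  "memory": "1g" },
    { "name": "large", "num_cpus": 10, "memory": "20g" }
  ]
}
```

The real file also contains work_1 through work_9, the runtime limits from the table above, and the dependency information that produces the fan-out/fan-in ordering.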
You can create much more complex workflows. The output is not shown here, but the following prompt creates a workflow whose job dependencies are defined implicitly through job-file relationships (see the sketch after the prompt):
> Create a workflow with four stages of jobs. There are 3 initial jobs, each dependent on an input file. Each of those jobs fans out into 10 work jobs through output files. Each of those jobs creates a file. In the third stage, a script aggregates each group of 10 output files into one file. The final postprocess job aggregates those three files into one. I will run it with Slurm: account=my_account. The work jobs require 10 CPUs and 30 GB of memory. The other jobs have trivial resource requirements.
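What makes the dependencies "implicit" is that the prompt never lists them: when one job declares a file as an output and another job declares the same file as an input, Torc infers the ordering. A schematic fragment of that relationship, again using illustrative field names rather than Torc's exact schema:

```json
{
  "files": [
    { "name": "stage1_out_0", "path": "output/stage1_0.json" }
  ],
  "jobs": [
    { "name": "initial_0", "command": "bash stage1.sh 0", "output_files": ["stage1_out_0"] },
    { "name": "work_0_0",  "command": "bash work.sh 0 0", "input_files":  ["stage1_out_0"] }
  ]
}
```

Because work_0_0 consumes a file that initial_0 produces, it cannot start until initial_0 completes; repeating that pattern across all four stages yields the full dependency graph without naming a single edge explicitly.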
Example: Adjusting jobs' resource requirements
Suppose that jobs fail because they use too much CPU or memory.
> Check the resource utilization for workflow 20
⏺ torc - check_resource_utilization (MCP)(workflow_id: 20)
⎿ {
"over_utilization_count": 1,
"run_id": null,
… +17 lines (ctrl+o to expand)
⏺ Resource Utilization for Workflow 20
Summary: 3 job results analyzed, 1 resource violation found
CPU Over-Utilization
| Job ID | Job Name | Specified | Peak Used | Over By |
|--------|---------------|----------------|-----------|---------|
| 264 | cpu_heavy_job | 3 cores (300%) | 602.5% | +100.8% |
The cpu_heavy_job used approximately 6 CPU cores at peak, but was only allocated 3 cores.
Would you like me to update the resource requirements for this job? I can increase the CPU allocation to 7 cores to provide headroom for future runs.
> yes
⏺ torc - update_job_resources (MCP)(job_id: 264, num_cpus: 7)
⎿ {
"job_id": 264,
"resource_requirements_id": 45,
… +8 lines (ctrl+o to expand)
⏺ Done. Updated cpu_heavy_job (job 264) resources:
| Setting | Value |
|---------|------------|
| CPUs | 7 (was 3) |
| Memory | 1g |
| Runtime | 15 minutes |
The job is now configured with enough CPU capacity for future runs.
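A note on reading the utilization report: the "Over By" column is relative to the allocation, not to a single core. A peak of 602.5% means roughly 6 cores in use, which is about 2x the 3 cores that were specified, hence +100.8%. A quick check of the arithmetic (plain shell, nothing Torc-specific):

```sh
awk 'BEGIN {
  peak_pct  = 602.5   # peak CPU utilization reported for cpu_heavy_job (percent of one core)
  allocated = 3       # cores specified in its resource requirements
  printf "peak cores in use: %.1f\n", peak_pct / 100
  printf "over-utilization:  %.1f%%\n", (peak_pct / (allocated * 100) - 1) * 100
}'
```

Rounding the ~6-core peak up and adding a core of headroom is what leads to the 7-core suggestion.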
Example: Recovering a failed workflow that timed out
The workflow in tests/workflows/timeout_auto_recovery_test is used to test Torc's automatic recovery from walltime timeouts. In this example, the workflow was left in a failed state.
Here is a conversation in Claude Code after running the workflow:
> analyze the logs for workflow 6
Workflow 6 Log Analysis Summary
Workflow: timeout_auto_recovery_test - Test workflow for automatic timeout recovery
Job Results
| Job | Name | Status | Runtime | Return Code |
|-----|------------|---------------|---------|--------------|
| 31 | (fast job) | ✅ Completed | 1.0 min | 0 |
| 32 | job_slow | ⛔ Terminated | 5.9 min | -9 (SIGKILL) |
Terminated Job Details (job_slow)
The job was killed by signal 9 (SIGKILL) after running for ~6 minutes, which matches the expected behavior for a timeout test.
stdout log shows the job ran through minute 6 before being killed:
Slow job starting at Sat Dec 27 15:40:59 MST 2025
This job will run for 10 minutes (exceeds 5 min runtime and 8 min walltime)
It should be killed by Slurm and recovered by the watcher
Minute 1 of 10...
Minute 2 of 10...
...
Minute 6 of 10...
stderr log: Empty (no errors written)
Conclusion
The timeout auto-recovery test is working as expected:
- The slow job exceeded its 5-minute runtime limit
- It was killed by the system (SIGKILL, return code -9) at ~6 minutes
- Peak memory usage was ~4.1 MB (no memory issues)
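A note on the return codes in these reports: a process killed by SIGKILL can show up either as -9 (the raw negative signal number, as above) or as 137 (the common shell convention of 128 + signal number, which appears in the next example). Both mean the same thing, as a quick shell experiment shows:

```sh
sh -c 'kill -9 $$'   # the child process kills itself with SIGKILL
echo $?              # prints 137, i.e. 128 + 9
```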
Example: Debugging a failed workflow that ran out of memory
The workflow in tests/workflows/oom_auto_recovery_test is used to test Torc's diagnosis of memory over-consumption errors.
Here is a conversation in Claude Code after running the workflow:
> Analyze the logs for workflow 16
● torc - analyze_workflow_logs (MCP)(workflow_id: 16)
⎿ {
"error_count": 0,
"errors_by_type": [],
… +7 lines (ctrl+o to expand)
● torc - get_workflow_status (MCP)(workflow_id: 16)
⎿ {
"description": "Test workflow with an OOM",
"job_counts_by_status": {
… +8 lines (ctrl+o to expand)
● The workflow has 3 failed jobs out of 10 total, but the log analyzer didn't find error patterns. Let me dig deeper:
● torc - list_failed_jobs (MCP)(workflow_id: 16)
⎿ {
"failed_job_count": 3,
"failed_jobs": [
… +18 lines (ctrl+o to expand)
● torc - check_resource_utilization (MCP)(workflow_id: 16, include_failed: true)
⎿ {
"failed_jobs": [
{
… +137 lines (ctrl+o to expand)
● Workflow 16 Analysis Summary
Workflow: oom_test - "Test workflow with an OOM"
Status: 7 completed, 3 failed (10 total jobs)
Root Cause: Out of Memory (OOM)
All 3 failed jobs were killed due to OOM (return code 137):
| Job ID | Name | Peak Memory | Configured | Over-Utilization |
|---|---|---|---|---|
| 103 | work_1 | N/A (killed early) | 10g | SIGKILL 137 |
| 109 | work_7 | 29.91 GB | 10g | +199.1% |
| 110 | work_8 | 26.89 GB | 10g | +168.9% |
Critical Warning: More Failures Expected
9 jobs exceeded their 10 GB memory limit - including 7 jobs that happened to complete but used ~30 GB each:
- work_2 through work_6, work_9, work_10 all peaked at ~30.01 GB (+200.1% over limit)
- These completed due to available system memory but are also misconfigured
Recommendation
- Increase memory allocation to 32g for all work jobs
- Use automated recovery - I can run a dry-run of the recovery tool to preview fixes
Would you like me to run recover_workflow with dry_run=true to preview the recovery actions?
> yes
● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true)
⎿ {
"dry_run": true,
"memory_multiplier": 1.5,
… +9 lines (ctrl+o to expand)
● Recovery Preview (Dry Run)
Proposed Actions:
- 3 failed jobs would have memory increased (1.5x multiplier: 10g → 15g)
- Failed jobs would be reset to ready status
- Slurm schedulers would be regenerated
⚠️ Important Consideration
The default 1.5x memory multiplier would increase memory from 10g → 15g, but the jobs actually peaked at ~30 GB. This means:
- 15g may still cause OOM failures
- A 3.5x multiplier (10g → 35g) would be safer based on actual usage
Would you like me to proceed with:
- Default recovery (1.5x = 15g) - may require another recovery cycle
- Custom recovery with a higher memory multiplier (e.g., 3.5x = 35g)
> 2
● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true, memory_multiplier: 3.5)
⎿ {
"dry_run": true,
"memory_multiplier": 3.5,
… +9 lines (ctrl+o to expand)
● Recovery Preview (3.5x Memory Multiplier)
Proposed Actions:
- 3 failed jobs would have memory increased: 10g → 35g
- Failed jobs would be reset to ready status
- Slurm schedulers would be regenerated
This should safely cover the ~30 GB peak memory usage observed.
Would you like me to proceed with these recovery actions?
> yes
● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: false, memory_multiplier: 3.5)
⎿ {
"dry_run": false,
"memory_multiplier": 3.5,
… +7 lines (ctrl+o to expand)
● Recovery Complete ✓
Actions taken:
- 3 jobs had memory increased from 10g → 35g
- Failed jobs reset to ready status
- Slurm schedulers regenerated and submitted
● torc - get_workflow_status (MCP)(workflow_id: 16)
⎿ {
"description": "Test workflow with an OOM",
"job_counts_by_status": {
… +8 lines (ctrl+o to expand)
● Current Status:
| Status | Count |
|---|---|
| Completed | 7 |
| Ready | 3 |
The 3 previously failed jobs (work_1, work_7, work_8) are now ready to run with 35g memory. Slurm allocations have been submitted and will execute these jobs automatically.
What You Learned
- How to configure AI assistants with Torc (one command for Claude Code, one file for VS Code)
- How to create workflows through conversation
- How to debug and fix failures using natural language
Next Steps
- Configuring AI Assistants — Full configuration options, all tools, troubleshooting
- Automatic Failure Recovery — Use torc watch for automated recovery
- Configuration Files — Set up Torc configuration