Tutorial: AI-Assisted Failure Recovery
🧪 EXPERIMENTAL: This feature is new and not yet well-tested. The API and behavior may change based on user feedback.
This tutorial shows how to use AI agents (Claude Code, GitHub Copilot, or custom MCP clients) to intelligently classify and recover from workflow failures that can't be handled by rule-based mechanisms.
Learning Objectives
By the end of this tutorial, you will:
- Understand the `pending_failed` job status
- Configure workflows for AI-assisted recovery
- Use the torc MCP server with AI agents
- Classify transient vs permanent errors
Prerequisites
- Torc installed with the client feature
- A running Torc server
- An MCP-compatible AI agent (Claude Code, GitHub Copilot, or custom)
- `torc-mcp-server` installed
The Problem
Traditional recovery mechanisms have limitations:
| Mechanism | Limitation |
|---|---|
| Failure handlers | Require predefined exit codes |
| `torc watch --recover` | Only handles OOM and timeout patterns |
| `--retry-unknown` | Blindly retries everything |
Some failures require intelligent classification:
- Transient errors: Network timeouts, node failures, service outages - should retry
- Permanent errors: Code bugs, missing files, invalid inputs - should fail
AI agents can analyze error logs and make these distinctions.
How It Works
```mermaid
flowchart TD
    JOB["Job exits with<br/>non-zero code"]
    HANDLER{"Failure handler<br/>matches?"}
    RETRY["Retry via<br/>failure handler"]
    PENDING["Status: pending_failed<br/>Awaiting classification"]
    WATCH["torc watch/recover<br/>+ AI agent"]
    CLASSIFY["AI analyzes stderr"]
    TRANSIENT["Transient error<br/>→ Retry"]
    PERMANENT["Permanent error<br/>→ Fail"]

    JOB --> HANDLER
    HANDLER -->|Yes| RETRY
    HANDLER -->|No| PENDING
    PENDING --> WATCH
    WATCH --> CLASSIFY
    CLASSIFY --> TRANSIENT
    CLASSIFY --> PERMANENT

    style JOB fill:#dc3545,color:#fff
    style PENDING fill:#ffc107,color:#000
    style CLASSIFY fill:#4a9eff,color:#fff
    style TRANSIENT fill:#28a745,color:#fff
    style PERMANENT fill:#6c757d,color:#fff
```
When a job fails without a matching failure handler rule, it enters the `pending_failed` status
instead of `failed`. This prevents immediate downstream job cancellation and gives the AI agent time
to classify the error.
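For example, assuming a workflow with ID 123, you can list the jobs that are parked in this state:

```bash
# Show jobs that are awaiting AI classification
torc jobs list 123 --status pending_failed
```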
Quick Start
Option A: Automatic AI Agent Invocation (Recommended)
Use the `--ai-recovery` flag to automatically invoke the Claude CLI for classification:

```bash
# One-shot recovery with AI classification
torc recover 123 --ai-recovery

# Continuous monitoring with AI classification
torc watch 123 --ai-recovery

# Specify a different AI agent
torc recover 123 --ai-recovery --ai-agent claude   # Default
torc recover 123 --ai-recovery --ai-agent copilot  # GitHub Copilot
```
When `--ai-recovery` is enabled:

- Torc detects jobs in `pending_failed` status
- Automatically invokes the AI agent CLI with the torc MCP server
- The AI agent analyzes stderr and classifies each job as transient (retry) or permanent (fail)
- Classifications are applied via MCP tools
- Recovery continues with the newly classified jobs

Requirements:

- Claude: Claude Code CLI installed (`claude` command available)
- GitHub Copilot: GitHub CLI with Copilot installed (`gh copilot` command available)
- Torc MCP server configured in your AI agent's MCP settings
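If you want to confirm the agent CLIs are available before enabling `--ai-recovery`, a quick check might look like this (command names as listed above; adjust for your installation):

```bash
# Check that the AI agent CLIs referenced above are on your PATH
command -v claude >/dev/null || echo "Claude Code CLI (claude) not found"
gh copilot --help >/dev/null 2>&1 || echo "GitHub Copilot CLI (gh copilot) not found"
```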
Option B: Manual AI Agent Invocation
If you prefer manual control, configure your AI agent and invoke it yourself.
1. Start the MCP Server
```bash
torc-mcp-server --url http://localhost:8080/torc-service/v1
```
2. Configure Your AI Agent
Add the torc MCP server to your agent's configuration:
Claude Code (`~/.claude/mcp_servers.json`):

```json
{
  "mcpServers": {
    "torc": {
      "command": "torc-mcp-server",
      "args": ["--url", "http://localhost:8080/torc-service/v1"]
    }
  }
}
```
GitHub Copilot (`.github/copilot/mcp-config.json` or global config):

```json
{
  "mcpServers": {
    "torc": {
      "command": "torc-mcp-server",
      "args": ["--url", "http://localhost:8080/torc-service/v1"]
    }
  }
}
```
3. Run a Workflow
```bash
torc run my_workflow.yaml
```
4. Monitor with AI Recovery
When jobs fail, use your AI agent to:
- List pending failures:

  Agent: Use list_pending_failed_jobs with workflow_id=123

- Analyze the errors:

  Agent: The stderr shows "Connection refused to storage.example.com:443"
  This is a transient network error - the storage server was temporarily down.

- Classify and resolve:

  Agent: Use classify_and_resolve_failures to retry these jobs
MCP Tools
The torc MCP server provides these tools for AI-assisted recovery:
list_pending_failed_jobs
Lists jobs with `pending_failed` status, including their stderr output.

Input:

```json
{
  "workflow_id": 123
}
```
Output:

```json
{
  "workflow_id": 123,
  "pending_failed_count": 2,
  "pending_failed_jobs": [
    {
      "job_id": 456,
      "name": "process_data",
      "return_code": 1,
      "stderr_tail": "ConnectionError: Connection refused..."
    }
  ],
  "guidance": "Analyze the stderr output to classify each failure..."
}
```
classify_and_resolve_failures
Applies classifications to `pending_failed` jobs.

Input:

```json
{
  "workflow_id": 123,
  "classifications": [
    {
      "job_id": 456,
      "action": "retry",
      "reason": "Transient network error - storage server was down"
    },
    {
      "job_id": 789,
      "action": "fail",
      "reason": "SyntaxError in user code - requires fix"
    }
  ],
  "dry_run": true
}
```
Actions:
- `retry`: Reset to `ready` status with a bumped `attempt_id`
- `fail`: Set to `failed` status (triggers downstream cancellation)
Optional resource adjustments:

```json
{
  "job_id": 456,
  "action": "retry",
  "memory": "16g",
  "runtime": "PT4H",
  "reason": "OOM detected in stderr, increasing memory"
}
```
Error Classification Guide
Transient Errors (Should Retry)
| Error Pattern | Category |
|---|---|
| `Connection refused`, `Connection timed out` | Network |
| `NCCL timeout`, GPU communication error | GPU/HPC |
| `EIO`, `Input/output error` | Hardware |
| Slurm: node failure, `PREEMPTED` | HPC scheduling |
| `Service Unavailable`, 503 | External service |
Permanent Errors (Should Fail)
| Error Pattern | Category |
|---|---|
| `SyntaxError`, `IndentationError` | Code bug |
| `ModuleNotFoundError`, `ImportError` | Missing dependency |
| `FileNotFoundError` (for input files) | Missing data |
| `IndexError`, `KeyError` | Logic error |
| `PermissionDenied` (consistent) | Configuration issue |
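As a rough illustration of how these patterns map to the `retry`/`fail` actions, here is a minimal, hypothetical shell helper built only from the example patterns above; in practice the AI agent performs this analysis itself and can weigh context that simple pattern matching cannot:

```bash
# Hypothetical sketch: map a stderr file to a suggested action using the
# example patterns from the tables above. Not part of torc itself.
classify_stderr() {
  local stderr_file="$1"
  if grep -Eq 'Connection (refused|timed out)|NCCL timeout|Input/output error|PREEMPTED|Service Unavailable|503' "$stderr_file"; then
    echo "retry"    # transient: network, GPU/HPC, hardware, scheduling, external service
  elif grep -Eq 'SyntaxError|IndentationError|ModuleNotFoundError|ImportError|FileNotFoundError|IndexError|KeyError' "$stderr_file"; then
    echo "fail"     # permanent: code bug, missing dependency, missing data, logic error
  else
    echo "unknown"  # ambiguous - leave for the AI agent to decide
  fi
}

# Example usage (hypothetical file name):
# classify_stderr job_456.stderr
```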
Integration with Existing Recovery
AI-assisted recovery works alongside other mechanisms:
```yaml
failure_handlers:
  - name: known_errors
    rules:
      # Known recoverable exit codes handled immediately
      - exit_codes: [10, 11]
        recovery_script: ./recover.sh
        max_retries: 3
      # Unknown errors go to pending_failed for AI classification
```
When a job fails with an exit code not covered by the failure handler, it becomes `pending_failed`
instead of `failed`.
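With a handler like the one above in place, the AI agent only sees the exit codes the handler does not cover, for example by combining it with the Quick Start flag:

```bash
# Exit codes 10 and 11 are retried by the failure handler;
# anything else becomes pending_failed and is classified by the AI agent.
torc watch $WORKFLOW_ID --ai-recovery
```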
Using with torc recover
The `torc recover` command can work with AI agents:

```bash
# Check for pending_failed jobs
torc jobs list $WORKFLOW_ID --status pending_failed

# Use AI agent to classify and resolve
# (via MCP tools list_pending_failed_jobs and classify_and_resolve_failures)

# After AI classification, reinitialize and continue
torc workflows reinitialize $WORKFLOW_ID
torc workflows run $WORKFLOW_ID
```
Custom MCP Servers
You can create domain-specific MCP servers that the AI agent uses alongside torc:
```mermaid
flowchart LR
    subgraph agent["AI Agent"]
        LLM["Claude/Copilot"]
    end

    subgraph mcp["MCP Servers"]
        TORC["torc-mcp-server"]
        CUSTOM["custom-mcp-server"]
    end

    subgraph systems["Systems"]
        API["Torc API"]
        DOMAIN["Domain APIs"]
    end

    LLM --> TORC
    LLM --> CUSTOM
    TORC --> API
    CUSTOM --> DOMAIN

    style LLM fill:#4a9eff,color:#fff
    style TORC fill:#28a745,color:#fff
    style CUSTOM fill:#ffc107,color:#000
```
For example, a custom MCP server could:
- Check external service status (was the storage server down?)
- Look up known error patterns in a database
- Query infrastructure monitoring systems
Workflow Reset
If AI classification doesn't happen (e.g., no agent is running), jobs stay in `pending_failed`
status indefinitely. To reset them:

```bash
# Reset all failed/pending_failed jobs
torc workflows reset-status $WORKFLOW_ID --failed-only
```
Example Session
Here's an example conversation with an AI agent:
```text
User: My workflow 123 has some failures. Can you help?

Agent: Let me check what's happening.
[Uses get_workflow_status tool]
The workflow has 3 jobs in pending_failed status.

Agent: [Uses list_pending_failed_jobs tool]
I found 3 jobs awaiting classification:

Job 456 (process_batch_1): stderr shows
  "ConnectionError: Connection refused to storage.internal:443"
Job 457 (process_batch_2): stderr shows the same connection error
Job 458 (validate_data): stderr shows
  "ValueError: Column 'timestamp' has invalid format"

Agent: Analysis:
- Jobs 456, 457: Transient network error. The storage server was
  temporarily unreachable. These should be retried.
- Job 458: Permanent error. There's a data format issue that
  requires fixing the input data.

Agent: [Uses classify_and_resolve_failures with dry_run=true]
Here's my proposed classification:
- Job 456: retry (transient network error)
- Job 457: retry (transient network error)
- Job 458: fail (invalid input data)

Should I apply these classifications?

User: Yes, go ahead.

Agent: [Uses classify_and_resolve_failures with dry_run=false]
Done! Jobs 456 and 457 are now ready to retry.
Job 458 has been marked as failed.

Note: You'll need to fix the timestamp format in your input data
before rerunning job 458.
```
Summary
AI-assisted recovery provides intelligent failure classification:
- `pending_failed` status: Defers classification for AI analysis
- MCP tools: `list_pending_failed_jobs`, `classify_and_resolve_failures`
- Error patterns: Transient vs permanent classification
- Integration: Works with failure handlers and `torc recover`