# How to Debug a Failed Job
Systematically diagnose why a job failed.
## Step 1: Identify the Failed Job

```bash
torc jobs list <workflow_id> --status failed
```
Note the job ID and name.
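If several jobs failed, it can help to collect their IDs for use in the later steps. A minimal sketch, assuming the list output is a plain table whose first column is the job ID (adjust the `awk` field if your torc version prints a different layout):

```bash
# Collect the IDs of all failed jobs (assumes the job ID is the first
# column of the table output; adjust the field number if needed)
failed_ids=$(torc jobs list <workflow_id> --status failed | awk 'NR > 1 {print $1}')
echo "$failed_ids"
```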
## Step 2: Check the Exit Code

```bash
torc results get <workflow_id> --job-id <job_id>
```
Common exit codes:
| Code | Meaning |
|---|---|
| 1 | General error |
| 2 | Misuse of shell builtins |
| 126 | Command found but not executable (permission problem) |
| 127 | Command not found |
| 137 | Killed (SIGKILL) — often OOM |
| 139 | Segmentation fault |
| 143 | Terminated (SIGTERM) |
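Codes above 128 indicate the process was killed by a signal: the value is 128 plus the signal number. You can decode one in the shell:

```bash
# Decode a >128 exit code into its signal name (137 -> KILL, 139 -> SEGV, 143 -> TERM)
code=137
kill -l $((code - 128))
```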
## Step 3: Read the Logs

```bash
# Get log paths
torc reports results <workflow_id> --job-id <job_id>

# View stderr (usually contains error messages)
cat output/job_stdio/job_wf43_j15_r1_a1.e

# View stdout
cat output/job_stdio/job_wf43_j15_r1_a1.o
```
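Stderr files can be long. Scanning the tail and grepping for common failure keywords usually surfaces the root error quickly; the path below reuses the example log file from above:

```bash
# Most tools print their final error near the end of stderr
tail -n 50 output/job_stdio/job_wf43_j15_r1_a1.e

# Search both stdout and stderr for common failure keywords
grep -inE 'error|exception|traceback|killed|out of memory' output/job_stdio/job_wf43_j15_r1_a1.*
```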
## Step 4: Check Resource Usage

Did the job exceed its resource limits?

```bash
torc reports check-resource-utilization <workflow_id>
```
Look for:
- Memory exceeded — Job was likely OOM-killed (exit code 137)
- Runtime exceeded — Job was terminated for running too long
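To confirm a suspected OOM kill, the kernel log on the compute node records the event. This only works if you have shell access to the node that ran the job, and reading the kernel log may require elevated privileges:

```bash
# Look for OOM-killer activity in the kernel log (run on the node that executed the job)
dmesg -T | grep -iE 'out of memory|oom-kill|killed process' | tail -n 20
```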
## Step 5: Reproduce Locally

Get the exact command that was run:

```bash
torc jobs get <job_id>
```
Try running it manually to see the error:
```bash
# Copy the command from the output and run it
python process.py --input data.csv
```
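While reproducing, it is worth measuring what the command actually needs so you can set accurate resource requirements. A sketch using GNU time (the `/usr/bin/time` binary, not the shell built-in); the command is just the example from above:

```bash
# Run the failing command under GNU time and write the resource report to a file
/usr/bin/time -v -o time_report.txt python process.py --input data.csv

# Peak memory ("Maximum resident set size") is reported in kilobytes
grep -E 'Maximum resident set size|Elapsed \(wall clock\)' time_report.txt
```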
## Common Fixes
| Problem | Solution |
|---|---|
| OOM killed | Increase memory in resource requirements |
| File not found | Verify input files exist, check dependencies |
| Permission denied | Check file permissions, execution bits |
| Timeout | Increase runtime in resource requirements |
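For the file-related problems above, a quick check from the shell confirms the fix before re-running (the file names are only illustrative):

```bash
# Verify the input file exists and is readable
ls -l data.csv

# Restore the execute bit on a script that is invoked directly
chmod +x process.sh
```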
## Step 6: Fix and Retry
After fixing the issue:
```bash
# Reinitialize to reset failed jobs
torc workflows reset-status --failed --reinitialize <workflow_id>

# Run again
torc workflows run <workflow_id>

# Or submit to Slurm
torc submit-slurm <workflow_id>
```
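After resubmitting, re-run the command from Step 1 to confirm nothing is still in the failed state:

```bash
# Should list no jobs once the fix has taken effect
torc jobs list <workflow_id> --status failed
```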
## See Also
- View Job Logs — Finding log files
- Check Resource Utilization — Resource analysis
- Debugging Workflows — Comprehensive debugging guide