# How to Debug a Failed Job
Systematically diagnose why a job failed.
## Step 1: Identify the Failed Job

```bash
torc jobs list <workflow_id> --status failed
```
Note the job ID and name.
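If several jobs failed, it can help to collect their IDs for use in the later steps. A minimal sketch, assuming the list output is a plain table whose first column is the job ID (adjust the `awk` field if your torc version prints a different layout):

```bash
# Collect the IDs of all failed jobs (assumes the job ID is the first
# column of the table output; adjust the field number if needed)
failed_ids=$(torc jobs list <workflow_id> --status failed | awk 'NR > 1 {print $1}')
echo "$failed_ids"
```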
## Step 2: Check the Exit Code

```bash
torc results get <workflow_id> --job-id <job_id>
```
Common exit codes:
| Code | Meaning |
|---|---|
| 1 | General error |
| 2 | Misuse of shell builtins |
| 126 | Command found but not executable (permission problem) |
| 127 | Command not found |
| 137 | Killed (SIGKILL) — often OOM |
| 139 | Segmentation fault |
| 143 | Terminated (SIGTERM) |
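Codes above 128 indicate the process was killed by a signal: the value is 128 plus the signal number. You can decode one in the shell:

```bash
# Decode a >128 exit code into its signal name (137 -> KILL, 139 -> SEGV, 143 -> TERM)
code=137
kill -l $((code - 128))
```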
## Step 3: Read the Logs

```bash
# Get log paths
torc reports results <workflow_id> --job-id <job_id>

# View stderr (usually contains error messages)
cat output/job_stdio/job_wf43_j15_r1_a1.e

# View stdout
cat output/job_stdio/job_wf43_j15_r1_a1.o
```
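Stderr files can be long. Scanning the tail and grepping for common failure keywords usually surfaces the root error quickly; the path below reuses the example log file from above:

```bash
# Most tools print their final error near the end of stderr
tail -n 50 output/job_stdio/job_wf43_j15_r1_a1.e

# Search both stdout and stderr for common failure keywords
grep -inE 'error|exception|traceback|killed|out of memory' output/job_stdio/job_wf43_j15_r1_a1.*
```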
## Step 4: Check Resource Usage

Did the job exceed its resource limits?

```bash
torc reports check-resource-utilization <workflow_id>
```
Look for:
- Memory exceeded — Job was likely OOM-killed (exit code 137)
- Runtime exceeded — Job was terminated for running too long
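To confirm a suspected OOM kill, the kernel log on the compute node records the event. This only works if you have shell access to the node that ran the job, and reading the kernel log may require elevated privileges:

```bash
# Look for OOM-killer activity in the kernel log (run on the node that executed the job)
dmesg -T | grep -iE 'out of memory|oom-kill|killed process' | tail -n 20
```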
## Step 5: Reproduce Locally

Get the exact command that was run:

```bash
torc jobs get <job_id>
```
Try running it manually to see the error:
```bash
# Copy the command from the output and run it
python process.py --input data.csv
```
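While reproducing, it is worth measuring what the command actually needs so you can set accurate resource requirements. A sketch using GNU time (the `/usr/bin/time` binary, not the shell built-in); the command is just the example from above:

```bash
# Run the failing command under GNU time and write the resource report to a file
/usr/bin/time -v -o time_report.txt python process.py --input data.csv

# Peak memory ("Maximum resident set size") is reported in kilobytes
grep -E 'Maximum resident set size|Elapsed \(wall clock\)' time_report.txt
```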
## Common Fixes
| Problem | Solution |
|---|---|
| OOM killed | Increase memory in resource requirements |
| File not found | Verify input files exist, check dependencies |
| Permission denied | Check file permissions, execution bits |
| Timeout | Increase runtime in resource requirements |
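For the file-related problems above, a quick check from the shell confirms the fix before re-running (the file names are only illustrative):

```bash
# Verify the input file exists and is readable
ls -l data.csv

# Restore the execute bit on a script that is invoked directly
chmod +x process.sh
```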
## Step 6: Fix and Retry
After fixing the issue:
```bash
# Reinitialize to reset failed jobs
torc workflows reset-status --failed --reinitialize <workflow_id>

# Run again
torc workflows run <workflow_id>

# Or submit to Slurm
torc submit-slurm <workflow_id>
```
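After resubmitting, re-run the command from Step 1 to confirm nothing is still in the failed state:

```bash
# Should list no jobs once the fix has taken effect
torc jobs list <workflow_id> --status failed
```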
## See Also
- View Job Logs — Finding log files
- Check Resource Utilization — Resource analysis
- Debugging Workflows — Comprehensive debugging guide