Working with Logs

Torc provides tools for bundling and analyzing workflow logs. These are useful for:

  • Sharing logs with colleagues for help debugging
  • Archiving completed workflow logs for later reference
  • Scanning for errors across all log files at once

Log File Overview

Torc generates several types of log files during workflow execution:

| Log Type | Path Pattern | Contents |
|---|---|---|
| Job stdout | output/job_stdio/job_wf<id>_j<job>_r<run>.o | Standard output from job commands |
| Job stderr | output/job_stdio/job_wf<id>_j<job>_r<run>.e | Error output, stack traces |
| Job runner | output/job_runner_*.log | Torc job runner internal logs |
| Slurm stdout | output/slurm_output_wf<id>_sl<slurm_id>.o | Slurm job allocation output |
| Slurm stderr | output/slurm_output_wf<id>_sl<slurm_id>.e | Slurm-specific errors |
| Slurm env | output/slurm_env_*.log | Slurm environment variables |
| dmesg | output/dmesg_slurm_*.log | Kernel messages (on failure) |

For detailed information about log file contents, see Debugging Workflows and Debugging Slurm Workflows.
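
To make the naming scheme concrete, the sketch below lists the job stdio and Slurm output files that belong to a single workflow ID, using only standard `find`. The function name `list_workflow_logs` is hypothetical and not part of Torc, and the sketch assumes the path patterns listed above. It covers only the two log types whose documented pattern embeds the workflow ID.

```shell
# Hypothetical helper (not a Torc command): list the job stdio and Slurm
# output files for one workflow, based on the documented path patterns.
list_workflow_logs() {
    local output_dir="$1"
    local wf_id="$2"
    # Job stdout/stderr:   job_wf<id>_j<job>_r<run>.o / .e
    # Slurm stdout/stderr: slurm_output_wf<id>_sl<slurm_id>.o / .e
    # The underscore after the ID keeps wf12 from matching wf123.
    find "$output_dir" -type f \
        \( -name "job_wf${wf_id}_j*_r*.[oe]" \
           -o -name "slurm_output_wf${wf_id}_sl*.[oe]" \) | sort
}

# Usage: list_workflow_logs output 123
```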

Bundling Logs

The torc logs bundle command packages all logs for a workflow into a compressed tarball:

# Bundle all logs for a workflow
torc logs bundle <workflow_id>

# Specify custom output directory (where logs are located)
torc logs bundle <workflow_id> --output-dir /path/to/output

# Save bundle to a specific directory
torc logs bundle <workflow_id> --bundle-dir /path/to/bundles

This creates a wf<id>.tar.gz file containing:

  • All job stdout/stderr files (job_wf*_j*_r*.o/e)
  • Job runner logs (job_runner_*.log)
  • Slurm output files (slurm_output_wf*_sl*.o/e)
  • Slurm environment logs (slurm_env_wf*_sl*.log)
  • dmesg logs (dmesg_slurm_wf*_sl*.log)
  • Bundle metadata (workflow info, collection timestamp)
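
Before sharing a bundle, it can be useful to check what it contains without unpacking it. The following is a minimal sketch using standard `tar`. The helper names `list_bundle` and `count_stderr_files` are hypothetical, and the sketch makes no assumption about the directory layout inside the tarball.

```shell
# Hypothetical helper: print every entry in a gzip-compressed bundle
# without extracting it.
list_bundle() {
    local bundle="$1"
    tar -tzf "$bundle"
}

# Hypothetical helper: count entries whose names end in .e
# (the stderr suffix used by job and Slurm logs).
count_stderr_files() {
    local bundle="$1"
    tar -tzf "$bundle" | grep -c '\.e$'
}

# Usage: list_bundle wf123.tar.gz
```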

Example: Sharing Logs

# Bundle workflow logs
torc logs bundle 123 --bundle-dir ./bundles

# Confirm the bundle was created, then share it
ls ./bundles/
# wf123.tar.gz

# Recipient can extract and analyze
tar -xzf wf123.tar.gz
torc logs analyze wf123/

Analyzing Logs

The torc logs analyze command scans log files for known error patterns:

# Analyze a log bundle tarball
torc logs analyze wf123.tar.gz

# Analyze a log directory directly (auto-detects workflow if only one present)
torc logs analyze output/

# Analyze a directory with multiple workflows (specify which one)
torc logs analyze output/ --workflow-id 123

Detected Error Patterns

The analyzer scans for common failure patterns including:

Memory Errors:

  • Out of memory, OOM kills
  • std::bad_alloc (C++)
  • MemoryError (Python)

Slurm Errors:

  • Time limit exceeded
  • Node failures
  • Preemption

GPU/CUDA Errors:

  • CUDA out of memory
  • GPU memory exceeded

Crashes:

  • Segmentation faults
  • Bus errors
  • Signal kills

Python Errors:

  • Tracebacks
  • Import errors

File System Errors:

  • No space left on device
  • Permission denied

Network Errors:

  • Connection refused/timed out

Example Output

Log Analysis Results
====================

Analyzing: output/

Files with detected errors:

  output/job_stdio/job_wf123_j456_r1.e
    Line 42: MemoryError: Unable to allocate 8.00 GiB
    Severity: critical
    Type: Python Memory Error

  output/slurm_output_wf123_sl789.e
    Line 15: slurmstepd: error: Detected 1 oom-kill event(s)
    Severity: critical
    Type: Out of Memory (OOM) Kill

Summary:
  Total files scanned: 24
  Files with errors: 2
  Error types found: MemoryError, OOM Kill

Excluding Files

Environment variable files (slurm_env_*.log) are automatically excluded from error analysis since they contain configuration data, not error logs.
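
The same exclusion is easy to reproduce when scanning logs by hand, since an environment dump could contain values that happen to look like error text. This is a sketch under that assumption; `list_scannable_logs` is a hypothetical name, not a Torc command.

```shell
# Hypothetical helper: list candidate files for a manual scan, skipping
# Slurm environment dumps (slurm_env_*.log) as the analyzer does.
list_scannable_logs() {
    local log_dir="$1"
    find "$log_dir" -type f ! -name 'slurm_env_*.log' | sort
}

# Usage: list_scannable_logs output/
```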

Workflow: Bundle and Share

A common pattern when asking for help:

# 1. Bundle the workflow logs
torc logs bundle <workflow_id>

# 2. Analyze locally first to understand the issue
torc logs analyze wf<id>.tar.gz

# 3. Share the bundle with your colleague/support
#    They can extract and analyze:
tar -xzf wf<id>.tar.gz
torc logs analyze wf<id>/

See Also

  • torc reports results: Generate a JSON report with all log file paths
  • torc results list: View a summary table of job return codes
  • torc slurm parse-logs: Parse Slurm logs for error patterns (Slurm-specific)
  • torc slurm sacct: Collect Slurm accounting data