Working with Logs

Torc provides tools for bundling and analyzing workflow logs. These are useful for:

  • Sharing logs with colleagues for help debugging
  • Archiving completed workflow logs for later reference
  • Scanning for errors across all log files at once

Log File Overview

Torc generates several types of log files during workflow execution:

| Log Type     | Path Pattern                                             | Contents                               |
|--------------|----------------------------------------------------------|----------------------------------------|
| Job stdout   | output/job_stdio/job_wf<id>_j<job>_r<run>_a<attempt>.o   | Standard output from job commands      |
| Job stderr   | output/job_stdio/job_wf<id>_j<job>_r<run>_a<attempt>.e   | Error output, stack traces             |
| Job combined | output/job_stdio/job_wf<id>_j<job>_r<run>_a<attempt>.log | Combined stdout+stderr (combined mode) |
| Job runner   | output/job_runner_*.log                                  | Torc job runner internal logs          |
| Slurm stdout | output/slurm_output_wf<id>_sl<slurm_id>.o                | Slurm job allocation output            |
| Slurm stderr | output/slurm_output_wf<id>_sl<slurm_id>.e                | Slurm-specific errors                  |
| Slurm env    | output/slurm_env_*.log                                   | Slurm environment variables            |
| dmesg        | output/dmesg_slurm_*.log                                 | Kernel messages (on failure)           |

Note: The file extensions depend on the stdio configuration. In separate mode (default), jobs produce .o and .e files. In combined mode, a single .log file is created. Modes like no_stdout, no_stderr, or none suppress some or all output files. If delete_on_success is enabled, files are removed when a job completes with exit code 0.
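The mode-to-extension mapping above can be sketched as a small helper. This is purely illustrative (the function and its behavior for no_stdout/no_stderr are assumptions inferred from the note, not torc's API):

```python
# Hypothetical helper: which job log files each stdio mode should produce.
# Mode names come from the note above; the mapping for no_stdout/no_stderr
# is an assumption (suppress only the named stream).
def expected_stdio_files(base: str, mode: str = "separate") -> list:
    if mode == "separate":
        return [base + ".o", base + ".e"]  # default: split stdout/stderr
    if mode == "combined":
        return [base + ".log"]             # single merged file
    if mode == "no_stdout":
        return [base + ".e"]               # stderr only
    if mode == "no_stderr":
        return [base + ".o"]               # stdout only
    if mode == "none":
        return []                          # all job output suppressed
    raise ValueError(f"unknown stdio mode: {mode}")

base = "output/job_stdio/job_wf123_j456_r1_a1"
print(expected_stdio_files(base))              # ['....o', '....e']
print(expected_stdio_files(base, "combined"))  # ['....log']
```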

For detailed information about log file contents, see Debugging Workflows and Debugging Slurm Workflows.

Bundling Logs

The torc logs bundle command packages all logs for a workflow into a compressed tarball:

# Bundle all logs for a workflow
torc logs bundle <workflow_id>

# Specify custom output directory (where logs are located)
torc logs bundle <workflow_id> --output-dir /path/to/output

# Save bundle to a specific directory
torc logs bundle <workflow_id> --bundle-dir /path/to/bundles

This creates a wf<id>.tar.gz file containing:

  • All job stdout/stderr files (job_wf*_j*_r*_a*.o/e)
  • Job runner logs (job_runner_*.log)
  • Slurm output files (slurm_output_wf*_sl*.o/e)
  • Slurm environment logs (slurm_env_wf*_sl*.log)
  • dmesg logs (dmesg_slurm_wf*_sl*.log)
  • Bundle metadata (workflow info, collection timestamp)
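Conceptually, the bundle is a tarball built from the glob patterns in the list above. The sketch below is not torc's implementation, just an illustration of the collection step (the bundle_logs function and its flattened archive layout are assumptions):

```python
# Illustrative sketch of log bundling -- NOT torc's actual implementation.
# Glob patterns mirror the file list above; archive layout is an assumption.
import glob
import os
import tarfile

def bundle_logs(workflow_id, output_dir="output", bundle_path=""):
    patterns = [
        f"job_stdio/job_wf{workflow_id}_j*_r*_a*.*",  # job stdout/stderr
        "job_runner_*.log",                           # job runner logs
        f"slurm_output_wf{workflow_id}_sl*.*",        # Slurm stdout/stderr
        f"slurm_env_wf{workflow_id}_sl*.log",         # Slurm environment
        f"dmesg_slurm_wf{workflow_id}_sl*.log",       # kernel messages
    ]
    bundle_path = bundle_path or f"wf{workflow_id}.tar.gz"
    with tarfile.open(bundle_path, "w:gz") as tar:
        for pattern in patterns:
            for path in glob.glob(os.path.join(output_dir, pattern)):
                # store files flat under a wf<id>/ prefix
                arcname = f"wf{workflow_id}/{os.path.basename(path)}"
                tar.add(path, arcname=arcname)
    return bundle_path
```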

Example: Sharing Logs

# Bundle workflow logs
torc logs bundle 123 --bundle-dir ./bundles

# Share the bundle
ls ./bundles/
# wf123.tar.gz

# Recipient can extract and analyze
tar -xzf wf123.tar.gz
torc logs analyze wf123/

Analyzing Logs

The torc logs analyze command scans log files for known error patterns:

# Analyze a log bundle tarball
torc logs analyze wf123.tar.gz

# Analyze a log directory directly (auto-detects workflow if only one present)
torc logs analyze output/

# Analyze a directory with multiple workflows (specify which one)
torc logs analyze output/ --workflow-id 123
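The auto-detection described above can be pictured as extracting the wf<id> token from each log filename and requiring --workflow-id only when several distinct IDs appear. This sketch is an assumption about the behavior, not torc's actual detection code:

```python
# Illustrative sketch of workflow auto-detection in a log directory.
# The wf<id> filename convention is from the table above; the detection
# logic itself is a hypothetical reconstruction.
import re

def detect_workflow_ids(filenames):
    """Collect the distinct workflow IDs embedded in log filenames."""
    ids = set()
    for name in filenames:
        match = re.search(r"wf(\d+)", name)
        if match:
            ids.add(match.group(1))
    return ids

files = ["job_wf123_j456_r1_a1.e", "slurm_output_wf123_sl789.o"]
print(detect_workflow_ids(files))  # {'123'} -> one workflow, no flag needed
```

If the set contains more than one ID, the directory is ambiguous and a flag such as --workflow-id would be needed to disambiguate.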

Detected Error Patterns

The analyzer scans for common failure patterns including:

Memory Errors:

  • Out of memory, OOM kills
  • std::bad_alloc (C++)
  • MemoryError (Python)

Slurm Errors:

  • Time limit exceeded
  • Node failures
  • Preemption

GPU/CUDA Errors:

  • CUDA out of memory
  • GPU memory exceeded

Crashes:

  • Segmentation faults
  • Bus errors
  • Signal kills

Python Errors:

  • Tracebacks
  • Import errors

File System Errors:

  • No space left on device
  • Permission denied

Network Errors:

  • Connection refused/timed out
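
A pattern-based scanner of this kind can be sketched in a few lines. The regexes and category labels below are illustrative examples in the spirit of the categories above, not torc's actual rule set:

```python
# Minimal sketch of regex-based log scanning -- patterns and labels are
# illustrative, not torc's real analyzer rules.
import re

ERROR_PATTERNS = [
    (r"MemoryError|std::bad_alloc|[Oo]ut of memory|oom-kill", "Memory"),
    (r"DUE TO TIME LIMIT|[Pp]reempt", "Slurm"),
    (r"CUDA out of memory", "GPU/CUDA"),
    (r"[Ss]egmentation fault|[Bb]us error", "Crash"),
    (r"Traceback \(most recent call last\)|ImportError", "Python"),
    (r"No space left on device|Permission denied", "File system"),
    (r"Connection refused|Connection timed out", "Network"),
]

def scan_lines(lines):
    """Yield (line_number, category, line) for each matching log line."""
    for lineno, line in enumerate(lines, start=1):
        for pattern, category in ERROR_PATTERNS:
            if re.search(pattern, line):
                yield lineno, category, line.rstrip()
                break  # report only the first matching category per line

log = ["job started", "MemoryError: Unable to allocate 8.00 GiB"]
print(list(scan_lines(log)))
# -> [(2, 'Memory', 'MemoryError: Unable to allocate 8.00 GiB')]
```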

Example Output

Log Analysis Results
====================

Analyzing: output/

Files with detected errors:

  output/job_stdio/job_wf123_j456_r1_a1.e
    Line 42: MemoryError: Unable to allocate 8.00 GiB
    Severity: critical
    Type: Python Memory Error

  output/slurm_output_wf123_sl789.e
    Line 15: slurmstepd: error: Detected 1 oom-kill event(s)
    Severity: critical
    Type: Out of Memory (OOM) Kill

Summary:
  Total files scanned: 24
  Files with errors: 2
  Error types found: MemoryError, OOM Kill

Excluding Files

Environment variable files (slurm_env_*.log) are automatically excluded from error analysis since they contain configuration data, not error logs.
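This kind of exclusion amounts to a filename-pattern filter applied before scanning. A minimal sketch (the files_to_scan helper is hypothetical, not torc's API):

```python
# Sketch of the exclusion rule described above: env-variable logs are
# filtered out by filename pattern before error scanning. Illustrative only.
import fnmatch

EXCLUDED_PATTERNS = ["slurm_env_*.log"]

def files_to_scan(filenames):
    """Drop files matching any excluded pattern."""
    return [f for f in filenames
            if not any(fnmatch.fnmatch(f, pat) for pat in EXCLUDED_PATTERNS)]

files = ["job_wf123_j1_r1_a1.e", "slurm_env_wf123_sl789.log"]
print(files_to_scan(files))  # the env log is skipped
```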

Workflow: Bundle and Share

A common pattern when asking for help:

# 1. Bundle the workflow logs
torc logs bundle <workflow_id>

# 2. Analyze locally first to understand the issue
torc logs analyze wf<id>.tar.gz

# 3. Share the bundle with your colleague/support
#    They can extract and analyze:
tar -xzf wf<id>.tar.gz
torc logs analyze wf<id>/

See Also

  • torc reports results: Generate a JSON report with all log file paths
  • torc results list: View a summary table of job return codes
  • torc slurm parse-logs: Parse Slurm logs for error patterns (Slurm-specific)
  • torc slurm sacct: Collect Slurm accounting data