How to Check Resource Utilization

Compare actual resource usage against specified requirements to identify jobs that exceeded their limits.

Quick Start

torc reports check-resource-utilization <workflow_id>

Example output:

⚠ Found 2 resource over-utilization violations:

Job ID | Job Name    | Resource | Specified | Peak Used | Over-Utilization
-------|-------------|----------|-----------|-----------|------------------
15     | train_model | Memory   | 8.00 GB   | 10.50 GB  | +31.3%
15     | train_model | Runtime  | 2h 0m 0s  | 2h 45m 0s | +37.5%
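The Over-Utilization column in the example is consistent with a simple relative difference between peak usage and the specified limit. The sketch below reproduces that arithmetic. The formula is inferred from the example output rather than taken from torc itself, and `over_utilization_pct` is a hypothetical helper name.

```python
def over_utilization_pct(specified: float, peak: float) -> float:
    """Percentage by which peak usage exceeds the specified limit.

    Assumed formula, inferred from the example output. Both arguments
    must use the same unit (for example GB or seconds).
    """
    if specified <= 0:
        raise ValueError("specified must be positive")
    return (peak - specified) / specified * 100.0


# Values from the example table: 8.00 GB vs 10.50 GB, and 2h vs 2h 45m.
memory_pct = over_utilization_pct(8.0, 10.5)
runtime_pct = over_utilization_pct(2 * 3600, 2 * 3600 + 45 * 60)
print(memory_pct, runtime_pct)  # prints "31.25 37.5"
```

The table shows these values rounded to one decimal place (+31.3% and +37.5%).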

Show All Jobs

Include jobs that stayed within limits:

torc reports check-resource-utilization <workflow_id> --all

Check a Specific Run

For workflows that have been reinitialized multiple times, pass --run-id to inspect a specific run:

torc reports check-resource-utilization <workflow_id> --run-id 2

Automatically Correct Requirements

Use the separate correct-resources command to adjust resource requirements automatically based on measured usage:

torc workflows correct-resources <workflow_id>

This command performs two types of corrections:

Upscaling (over-utilized resources)

Analyzes completed and failed jobs to detect:

  • Memory violations — Jobs using more memory than allocated
  • CPU violations — Jobs using more CPU than allocated
  • Runtime violations — Jobs running longer than allocated time

Downsizing (under-utilized resources)

Analyzes successfully completed jobs (return code 0) to detect resources that are significantly over-allocated. A resource is downsized only when:

  • All jobs sharing that resource requirement completed successfully
  • All jobs have peak usage data for that resource type
  • The savings exceed minimum thresholds (1 GB for memory, 5 percentage points for CPU, 30 minutes for runtime)
  • No job sharing that resource requirement had a violation

Failed jobs are excluded from downsizing analysis because they may terminate early with under-reported peak usage.
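The downsizing conditions can be read as a single all-or-nothing check per resource requirement. The following is a minimal sketch for memory only, not torc's implementation. The tuple layout, the function name `can_downsize_memory`, and the way savings are measured (specified value minus peak times the multiplier) are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical record: (return_code, peak_memory_gb or None if not recorded).
JobResult = Tuple[int, Optional[float]]

MIN_MEMORY_SAVINGS_GB = 1.0  # memory threshold quoted in the conditions


def can_downsize_memory(jobs: Sequence[JobResult], specified_gb: float,
                        multiplier: float = 1.2) -> bool:
    """Return True only if every downsizing condition holds for the group."""
    if not jobs:
        return False
    # Every job sharing the requirement must have completed successfully.
    if any(return_code != 0 for return_code, _ in jobs):
        return False
    # Every job must have peak usage data for this resource type.
    if any(peak is None for _, peak in jobs):
        return False
    highest_peak = max(peak for _, peak in jobs)
    # No job may have exceeded the specified limit (a violation).
    if highest_peak > specified_gb:
        return False
    # Assumption: savings are measured against the padded new value.
    savings = specified_gb - highest_peak * multiplier
    return savings > MIN_MEMORY_SAVINGS_GB


print(can_downsize_memory([(0, 2.0), (0, 3.0)], specified_gb=16.0))
print(can_downsize_memory([(0, 2.0), (1, 3.0)], specified_gb=16.0))
```

The first call passes every check. The second is rejected because one job in the group failed, which mirrors the rule that a single failed job blocks downsizing for the whole requirement.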

The command will:

  • Calculate new requirements using actual peak usage data
  • Apply a 1.2x safety multiplier to each resource (configurable)
  • Update the workflow's resource requirements for future runs
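As a rough illustration of the calculation step, the sketch below multiplies a measured peak by the safety multiplier. It is an assumption-laden example: `corrected_requirement` is a hypothetical name, and any rounding or unit conversion that torc applies to the result is not described here and is not modeled.

```python
def corrected_requirement(peak: float, multiplier: float = 1.2) -> float:
    """New requirement = measured peak usage times the safety multiplier."""
    if peak < 0 or multiplier <= 0:
        raise ValueError("peak must be non-negative and multiplier positive")
    return peak * multiplier


# Peaks taken from the earlier example output (train_model).
peak_memory_gb = 10.5
peak_runtime_s = 2 * 3600 + 45 * 60  # 2h 45m in seconds

new_memory_gb = corrected_requirement(peak_memory_gb)
new_runtime_s = corrected_requirement(peak_runtime_s)
print(f"memory: {new_memory_gb:.1f} GB, runtime: {new_runtime_s / 60:.0f} min")
```

With the default 1.2x multiplier, a 10.5 GB peak becomes about 12.6 GB and a 2h 45m runtime becomes about 3h 18m.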

Example:

Resource Correction Summary:
  Workflow: 5
  Jobs analyzed: 3
  Resource requirements updated: 2
  Upscale:
    Memory corrections: 1
    Runtime corrections: 1
    CPU corrections: 1
  Downscale:
    Memory reductions: 2
    Runtime reductions: 2
    CPU reductions: 0

Preview Changes Without Applying

Use --dry-run to see what changes would be made:

torc workflows correct-resources <workflow_id> --dry-run

Correct Only Specific Jobs

To update only certain jobs, pass their IDs with --job-ids. The filter applies to both upscaling and downsizing:

torc workflows correct-resources <workflow_id> --job-ids 15,16,18

Disable Downsizing

To only upscale over-utilized resources without reducing over-allocated ones:

torc workflows correct-resources <workflow_id> --no-downsize

Custom Correction Multipliers

Adjust the safety margins independently (all default to 1.2x):

torc workflows correct-resources <workflow_id> \
  --memory-multiplier 1.5 \
  --cpu-multiplier 1.3 \
  --runtime-multiplier 1.4

Manual Adjustment

For more control, edit the resource requirements in your workflow specification by hand, adding a buffer above the observed peaks:

resource_requirements:
  - name: training
    memory: 12g       # 10.5 GB peak + ~14% buffer
    runtime: PT3H30M  # 2h 45m actual + ~27% buffer
    num_cpus: 7       # Enough for peak CPU usage

Guidelines:

  • Memory: Add 10-20% above peak usage
  • Runtime: Add 15-30% above actual duration
  • CPU: Round up to accommodate peak percentage (e.g., 501% CPU → 6 cores)
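The guidelines reduce to two small calculations: pad a measured peak by a percentage, and round a CPU percentage up to whole cores. A minimal sketch follows. The helper names are hypothetical and are not part of torc.

```python
import math


def with_buffer(peak: float, buffer_fraction: float) -> float:
    """Pad a measured peak, e.g. buffer_fraction=0.15 adds 15%."""
    return peak * (1.0 + buffer_fraction)


def cores_for_peak_cpu(peak_cpu_percent: float) -> int:
    """Round a peak CPU percentage up to whole cores (100% = one core)."""
    return max(1, math.ceil(peak_cpu_percent / 100.0))


print(cores_for_peak_cpu(501))             # prints 6
print(round(with_buffer(10.5, 0.15), 2))   # memory in GB with a 15% buffer
print(round(with_buffer(165, 0.20)))       # runtime in minutes with a 20% buffer
```

Rounding up rather than to the nearest integer matters here: a job that peaks at 501% CPU needs a sixth core even though it barely uses it.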

See Also