
Resource Monitoring Reference

Technical reference for Torc's resource monitoring system.

Configuration Options

The `resource_monitor` section in workflow specifications accepts the following fields:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enabled` | boolean | `true` | Enable or disable monitoring |
| `granularity` | string | `"summary"` | `"summary"` or `"time_series"` |
| `sample_interval_seconds` | integer | `10` | Seconds between resource samples |
| `generate_plots` | boolean | `false` | Reserved for future use |
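For illustration, a workflow spec fragment assuming a YAML specification — only the `resource_monitor` block and its field names come from the table above; the surrounding structure is hypothetical:

```yaml
# Hypothetical spec fragment; field names are from the table above.
resource_monitor:
  enabled: true
  granularity: time_series     # or "summary"
  sample_interval_seconds: 5
  generate_plots: false        # reserved for future use
```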

Granularity Modes

Summary mode (`"summary"`):

  • Stores only peak and average values per job
  • Metrics are stored in the main database results table
  • Minimal storage overhead

Time series mode (`"time_series"`):

  • Stores samples at regular intervals
  • Creates a separate SQLite database per workflow run
  • Database location: `<output_dir>/resource_utilization/resource_metrics_<hostname>_<workflow_id>_<run_id>.db`

Sample Interval Guidelines

| Job Duration | Recommended Interval |
|--------------|----------------------|
| < 1 hour | 1-2 seconds |
| 1-4 hours | 5 seconds (default) |
| > 4 hours | 10-30 seconds |
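To see what an interval choice implies for storage, a rough back-of-envelope sketch — the ~100 bytes/sample figure is an assumption for illustration, not a measured number:

```python
# Rough estimate of time-series volume for a given sample interval.
# The ~100 bytes/sample figure (row + SQLite overhead) is an assumption.

def samples_per_job(duration_seconds: int, interval_seconds: int) -> int:
    """Number of resource samples collected over a job's lifetime."""
    return duration_seconds // interval_seconds

# A 4-hour job sampled every 5 seconds:
n = samples_per_job(4 * 3600, 5)
print(n)                   # 2880 samples
print(n * 100 / 1024)      # ~281 KiB per job at ~100 bytes/sample
```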

Time Series Database Schema

`job_resource_samples` Table

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER | Primary key |
| `job_id` | INTEGER | Torc job ID |
| `timestamp` | REAL | Unix timestamp |
| `cpu_percent` | REAL | CPU utilization percentage |
| `memory_bytes` | INTEGER | Memory usage in bytes |
| `num_processes` | INTEGER | Process count including children |

`job_metadata` Table

| Column | Type | Description |
|--------|------|-------------|
| `job_id` | INTEGER | Primary key, Torc job ID |
| `job_name` | TEXT | Human-readable job name |
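The time-series database can be queried with any SQLite client. A minimal sketch, using an in-memory database as a stand-in for the per-run file (table layouts match the schema above; the inserted rows are toy values):

```python
import sqlite3

# In-memory stand-in for <output_dir>/resource_utilization/...db
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_resource_samples (
    id INTEGER PRIMARY KEY,
    job_id INTEGER,
    timestamp REAL,
    cpu_percent REAL,
    memory_bytes INTEGER,
    num_processes INTEGER
);
CREATE TABLE job_metadata (
    job_id INTEGER PRIMARY KEY,
    job_name TEXT
);
""")
conn.execute("INSERT INTO job_metadata VALUES (15, 'train_model')")
conn.executemany(
    "INSERT INTO job_resource_samples"
    " (job_id, timestamp, cpu_percent, memory_bytes, num_processes)"
    " VALUES (?, ?, ?, ?, ?)",
    [(15, 1700000000.0, 80.0, 8_000_000_000, 3),
     (15, 1700000010.0, 95.0, 10_500_000_000, 4)],
)

# Peak memory per job, joined with the human-readable name.
row = conn.execute("""
    SELECT m.job_name, MAX(s.memory_bytes)
    FROM job_resource_samples s
    JOIN job_metadata m USING (job_id)
    GROUP BY s.job_id
""").fetchone()
print(row)  # ('train_model', 10500000000)
```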

Summary Metrics in Results

When using summary mode, the following fields are added to job results:

| Field | Type | Description |
|-------|------|-------------|
| `peak_cpu_percent` | float | Maximum CPU percentage observed |
| `avg_cpu_percent` | float | Average CPU percentage |
| `peak_memory_gb` | float | Maximum memory in GB |
| `avg_memory_gb` | float | Average memory in GB |

`check-resource-utilization` JSON Output

When using `--format json`:

```json
{
  "workflow_id": 123,
  "run_id": null,
  "total_results": 10,
  "over_utilization_count": 3,
  "violations": [
    {
      "job_id": 15,
      "job_name": "train_model",
      "resource_type": "Memory",
      "specified": "8.00 GB",
      "peak_used": "10.50 GB",
      "over_utilization": "+31.3%"
    }
  ]
}
```

| Field | Description |
|-------|-------------|
| `workflow_id` | Workflow being analyzed |
| `run_id` | Specific run ID if provided, otherwise null for latest |
| `total_results` | Total number of completed jobs analyzed |
| `over_utilization_count` | Number of violations found |
| `violations` | Array of violation details |

Violation Object

| Field | Description |
|-------|-------------|
| `job_id` | Job ID with violation |
| `job_name` | Human-readable job name |
| `resource_type` | "Memory", "CPU", or "Runtime" |
| `specified` | Resource requirement from workflow spec |
| `peak_used` | Actual peak usage observed |
| `over_utilization` | Percentage over/under specification |
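A sketch of consuming this output in a script, using the example payload from this page — the CI-gate logic is illustrative, not part of Torc:

```python
import json

# Example payload from the documentation above.
report = json.loads("""
{
  "workflow_id": 123,
  "run_id": null,
  "total_results": 10,
  "over_utilization_count": 3,
  "violations": [
    {"job_id": 15, "job_name": "train_model", "resource_type": "Memory",
     "specified": "8.00 GB", "peak_used": "10.50 GB",
     "over_utilization": "+31.3%"}
  ]
}
""")

# List memory violations and derive an exit code for a CI gate.
memory_violations = [v for v in report["violations"]
                     if v["resource_type"] == "Memory"]
for v in memory_violations:
    print(f"{v['job_name']} (job {v['job_id']}): "
          f"{v['peak_used']} used vs {v['specified']} specified")
exit_code = 1 if report["over_utilization_count"] > 0 else 0
```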

`correct-resources` JSON Output

When using `torc -f json workflows correct-resources`:

```json
{
  "status": "success",
  "workflow_id": 123,
  "dry_run": false,
  "no_downsize": false,
  "memory_multiplier": 1.2,
  "cpu_multiplier": 1.2,
  "runtime_multiplier": 1.2,
  "resource_requirements_updated": 2,
  "jobs_analyzed": 5,
  "memory_corrections": 1,
  "runtime_corrections": 1,
  "cpu_corrections": 1,
  "downsize_memory_corrections": 2,
  "downsize_runtime_corrections": 2,
  "downsize_cpu_corrections": 0,
  "adjustments": [
    {
      "resource_requirements_id": 10,
      "direction": "upscale",
      "job_ids": [15],
      "job_names": ["train_model"],
      "memory_adjusted": true,
      "original_memory": "8g",
      "new_memory": "13g",
      "max_peak_memory_bytes": 10500000000
    },
    {
      "resource_requirements_id": 11,
      "direction": "downscale",
      "job_ids": [20, 21],
      "job_names": ["preprocess_a", "preprocess_b"],
      "memory_adjusted": true,
      "original_memory": "32g",
      "new_memory": "3g",
      "max_peak_memory_bytes": 2147483648,
      "runtime_adjusted": true,
      "original_runtime": "PT4H",
      "new_runtime": "PT12M"
    }
  ]
}
```

Top-Level Fields

| Field | Description |
|-------|-------------|
| `memory_multiplier` | Memory safety multiplier used |
| `cpu_multiplier` | CPU safety multiplier used |
| `runtime_multiplier` | Runtime safety multiplier used |
| `resource_requirements_updated` | Number of resource requirements changed |
| `jobs_analyzed` | Number of jobs with violations analyzed |
| `memory_corrections` | Jobs affected by memory upscaling |
| `runtime_corrections` | Jobs affected by runtime upscaling |
| `cpu_corrections` | Jobs affected by CPU upscaling |
| `downsize_memory_corrections` | Jobs affected by memory downsizing |
| `downsize_runtime_corrections` | Jobs affected by runtime downsizing |
| `downsize_cpu_corrections` | Jobs affected by CPU downsizing |
| `adjustments` | Array of per-resource-requirement adjustment details |

Adjustment Object

| Field | Description |
|-------|-------------|
| `resource_requirements_id` | ID of the resource requirement being adjusted |
| `direction` | "upscale" or "downscale" |
| `job_ids` | Job IDs sharing this resource requirement |
| `job_names` | Human-readable job names |
| `memory_adjusted` | Whether memory was changed |
| `original_memory` | Previous memory setting (if adjusted) |
| `new_memory` | New memory setting (if adjusted) |
| `max_peak_memory_bytes` | Maximum peak memory observed across jobs |
| `runtime_adjusted` | Whether runtime was changed |
| `original_runtime` | Previous runtime setting (if adjusted) |
| `new_runtime` | New runtime setting (if adjusted) |
| `cpu_adjusted` | Whether CPU count was changed (omitted when false) |
| `original_cpus` | Previous CPU count (if adjusted) |
| `new_cpus` | New CPU count (if adjusted) |
| `max_peak_cpu_percent` | Maximum peak CPU percentage observed across jobs |
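A sketch of summarizing the `adjustments` array, using a trimmed version of the example output above (only a few fields kept; the report format is illustrative):

```python
# Trimmed adjustments from the example output above.
result = {
    "adjustments": [
        {"resource_requirements_id": 10, "direction": "upscale",
         "job_names": ["train_model"],
         "original_memory": "8g", "new_memory": "13g"},
        {"resource_requirements_id": 11, "direction": "downscale",
         "job_names": ["preprocess_a", "preprocess_b"],
         "original_memory": "32g", "new_memory": "3g"},
    ]
}

# One report line per adjusted resource requirement.
lines = [
    f"{a['direction']:>9} req {a['resource_requirements_id']}: "
    f"{a['original_memory']} -> {a['new_memory']} "
    f"({', '.join(a['job_names'])})"
    for a in result["adjustments"]
]
print("\n".join(lines))
```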

`plot-resources` Output Files

| File | Description |
|------|-------------|
| `resource_plot_job_<id>.html` | Per-job timeline with CPU, memory, process count |
| `resource_plot_cpu_all_jobs.html` | CPU comparison across all jobs |
| `resource_plot_memory_all_jobs.html` | Memory comparison across all jobs |
| `resource_plot_summary.html` | Bar chart dashboard of peak vs average |

All plots are self-contained HTML files using Plotly.js with:

  • Interactive hover tooltips
  • Zoom and pan controls
  • Legend toggling
  • Export options (PNG, SVG)

Monitored Metrics

| Metric | Unit | Description |
|--------|------|-------------|
| CPU percentage | % | Total CPU utilization across all cores |
| Memory usage | bytes | Resident memory consumption |
| Process count | count | Number of processes in the job's process tree |

Process Tree Tracking

The monitoring system automatically tracks child processes spawned by jobs. When a job creates worker processes (e.g., Python multiprocessing), all descendants are included in the aggregated metrics.
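The aggregation idea can be sketched as follows. The parent-to-children map and per-process metrics here are hypothetical stand-ins for what an OS query (e.g. via the `sysinfo` crate) would return; this is not Torc's implementation:

```python
def descendants(pid, children):
    """All PIDs in the tree rooted at pid, including pid itself."""
    stack, seen = [pid], []
    while stack:
        p = stack.pop()
        seen.append(p)
        stack.extend(children.get(p, []))
    return seen

# Toy process tree: job PID 100 spawned workers 101 and 102;
# worker 102 spawned 103 (e.g. Python multiprocessing).
children = {100: [101, 102], 102: [103]}
memory = {100: 500, 101: 200, 102: 200, 103: 100}  # toy byte counts

tree = descendants(100, children)
print(len(tree), sum(memory[p] for p in tree))  # 4 1000
```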

Slurm Accounting Stats

When running inside a Slurm allocation, Torc calls `sacct` after each job step completes and stores the results in the `slurm_stats` table. These complement the `sysinfo`-based metrics above with Slurm-native cgroup measurements.

Fields

| Field | `sacct` source | Description |
|-------|----------------|-------------|
| `max_rss_bytes` | MaxRSS | Peak resident-set size (from cgroups) |
| `max_vm_size_bytes` | MaxVMSize | Peak virtual memory size |
| `max_disk_read_bytes` | MaxDiskRead | Peak disk read bytes |
| `max_disk_write_bytes` | MaxDiskWrite | Peak disk write bytes |
| `ave_cpu_seconds` | AveCPU | Average CPU time in seconds |
| `node_list` | NodeList | Nodes used by the job step |

Additional identifying fields stored per record: `workflow_id`, `job_id`, `run_id`, `attempt_id`, `slurm_job_id`.

Fields are null when:

  • The job ran locally (no `SLURM_JOB_ID` in the environment)
  • `sacct` is not available on the node
  • The step was not found in the Slurm accounting database at collection time

Viewing Stats

```shell
torc slurm stats <workflow_id>
torc slurm stats <workflow_id> --job-id <job_id>
torc -f json slurm stats <workflow_id>
```

Performance Characteristics

  • Single background monitoring thread regardless of job count
  • Typical overhead: <1% CPU even with 1-second sampling
  • Uses native OS APIs via the sysinfo crate
  • Non-blocking async design