Resource Monitoring Reference

Technical reference for Torc's resource monitoring system.

Configuration Options

The resource_monitor section has one shared sampling interval and separate nested scopes for jobs and compute nodes:

resource_monitor:
  sample_interval_seconds: 5
  flush_interval_seconds: 300
  jobs:
    enabled: true
    granularity: summary
  compute_node:
    enabled: true
    granularity: time_series

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| sample_interval_seconds | integer | 10 | Seconds between all resource samples |
| flush_interval_seconds | integer | 300 | Seconds between batched SQLite flushes for time-series samples |
| generate_plots | boolean | false | Emit HTML plots after the job runner exits |
| jobs | JobMonitorConfig | none | Per-job CPU and memory monitoring |
| compute_node | ComputeNodeMonitorConfig | none | Overall compute-node CPU/memory monitoring |

For backwards compatibility, top-level enabled and granularity fields are still accepted and apply to job monitoring when jobs is omitted:

resource_monitor:
  enabled: true
  granularity: time_series
  sample_interval_seconds: 5
  flush_interval_seconds: 300

New workflow specs should use the explicit jobs block.

flush_interval_seconds only affects time-series persistence. Samples are still collected every sample_interval_seconds, but they are buffered in memory and written to SQLite in batches to reduce transaction overhead. With the example settings above (a sample every 5 seconds, a flush every 300 seconds), up to 60 samples per monitored job or node accumulate in memory between flushes.

Job Monitoring

resource_monitor.jobs controls per-job CPU and memory monitoring. Summary mode stores peak and average values on job results. Time-series mode also stores per-sample values in a resource metrics database.

Compute Node Monitoring

To opt in to overall compute-node CPU and memory monitoring, add a nested compute_node block. The compute-node monitor supports granularity: "summary" and granularity: "time_series". Summary mode stores peak and average values for the runner lifetime on the compute node record. Time-series mode also stores per-sample values. The current compute-node monitor records CPU and memory; GPU monitoring is reserved for a future extension.

Granularity Modes

Summary mode ("summary"):

  • Stores only peak and average values per job
  • Metrics stored in the main database results table
  • Minimal storage overhead

Time series mode ("time_series"):

  • Stores samples at regular intervals
  • Creates separate SQLite database per workflow run
  • Database location: <output_dir>/resource_utilization/resource_metrics_<hostname>_<workflow_id>_<run_id>.db
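
These per-run databases are ordinary SQLite files and can be opened with any SQLite client. A minimal Python sketch that lists them and their tables, assuming the workflow's output directory is output/ (the hostname and IDs in each filename vary by run):

import glob
import sqlite3
# One metrics database is written per compute node, workflow, and run.
# "output" is an assumed output directory; adjust to your workflow.
for path in glob.glob("output/resource_utilization/resource_metrics_*.db"):
    conn = sqlite3.connect(path)
    tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    print(path, [name for (name,) in tables])
    conn.close()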

Sample Interval Guidelines

| Job Duration | Recommended Interval |
| --- | --- |
| < 1 hour | 1-2 seconds |
| 1-4 hours | 5 seconds (default) |
| > 4 hours | 10-30 seconds |

Time Series Database Schema

job_resource_samples Table

| Column | Type | Description |
| --- | --- | --- |
| id | INTEGER | Primary key |
| job_id | INTEGER | Torc job ID |
| timestamp | REAL | Unix timestamp |
| cpu_percent | REAL | CPU utilization percentage |
| memory_bytes | INTEGER | Memory usage in bytes |
| num_processes | INTEGER | Process count including children |

job_metadata Table

| Column | Type | Description |
| --- | --- | --- |
| job_id | INTEGER | Primary key, Torc job ID |
| job_name | TEXT | Human-readable job name |
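
For ad-hoc analysis the two job tables can be joined directly. A short sketch using Python's built-in sqlite3 module, with a hypothetical database filename standing in for the pattern documented above:

import sqlite3
# Peak memory and average CPU per job, joining samples with job names.
conn = sqlite3.connect("resource_metrics_node1_123_1.db")  # hypothetical filename
query = """
    SELECT m.job_name,
           MAX(s.memory_bytes) AS peak_memory_bytes,
           AVG(s.cpu_percent) AS avg_cpu_percent
    FROM job_resource_samples s
    JOIN job_metadata m ON m.job_id = s.job_id
    GROUP BY s.job_id
"""
for job_name, peak_mem, avg_cpu in conn.execute(query):
    print(f"{job_name}: peak {peak_mem / 1e9:.2f} GB, avg CPU {avg_cpu:.1f}%")
conn.close()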

system_resource_samples Table

This table is always created in the resource metrics database, but rows are only written when resource_monitor.compute_node is enabled with granularity set to "time_series". If compute-node monitoring is disabled or summary-only, the table remains empty.

| Column | Type | Description |
| --- | --- | --- |
| timestamp | INTEGER | Unix timestamp |
| cpu_percent | REAL | Overall CPU utilization |
| memory_bytes | INTEGER | Used system memory in bytes |
| total_memory_bytes | INTEGER | Total system memory in bytes |
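
Memory utilization as a percentage is not stored directly, but it is easy to derive from these columns. A sketch, again using a hypothetical filename:

import sqlite3
conn = sqlite3.connect("resource_metrics_node1_123_1.db")  # hypothetical filename
rows = conn.execute(
    "SELECT timestamp, cpu_percent, memory_bytes, total_memory_bytes "
    "FROM system_resource_samples ORDER BY timestamp"
)
for ts, cpu, mem, total in rows:
    # Derive utilization from used and total system memory.
    print(f"{ts}: cpu {cpu:.1f}%, mem {100 * mem / total:.1f}%")
conn.close()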

system_resource_summary Table

This table is always created in the resource metrics database, but a row is only written when compute-node time-series monitoring is enabled. Summary-only compute-node monitoring stores these values on the compute node record instead, leaving this table empty.

| Column | Type | Description |
| --- | --- | --- |
| sample_count | INTEGER | Number of system samples |
| peak_cpu_percent | REAL | Peak overall CPU utilization |
| avg_cpu_percent | REAL | Average CPU utilization |
| peak_memory_bytes | INTEGER | Peak used system memory |
| avg_memory_bytes | INTEGER | Average used system memory |

Compute Node Summary Fields

When resource_monitor.compute_node.enabled is true, Torc stores overall summary metrics on the compute node record:

| Field | Description |
| --- | --- |
| sample_count | Number of system samples |
| peak_cpu_percent | Peak overall CPU utilization |
| avg_cpu_percent | Average CPU utilization |
| peak_memory_bytes | Peak used system memory |
| avg_memory_bytes | Average used system memory |

These fields are shown by torc compute-nodes get, torc compute-nodes list, the TUI compute nodes view, and the dashboard compute nodes table.

Summary Metrics in Results

When using summary mode, the following fields are added to job results:

| Field | Type | Description |
| --- | --- | --- |
| peak_cpu_percent | float | Maximum CPU percentage observed |
| avg_cpu_percent | float | Average CPU percentage |
| peak_memory_gb | float | Maximum memory in GB |
| avg_memory_gb | float | Average memory in GB |

check-resource-utilization JSON Output

When using --format json:

{
  "workflow_id": 123,
  "run_id": null,
  "total_results": 10,
  "over_utilization_count": 3,
  "violations": [
    {
      "job_id": 15,
      "job_name": "train_model",
      "resource_type": "Memory",
      "specified": "8.00 GB",
      "peak_used": "10.50 GB",
      "over_utilization": "+31.3%"
    }
  ]
}

| Field | Description |
| --- | --- |
| workflow_id | Workflow being analyzed |
| run_id | Specific run ID if provided, otherwise null for latest |
| total_results | Total number of completed jobs analyzed |
| over_utilization_count | Number of violations found |
| violations | Array of violation details |

Violation Object

| Field | Description |
| --- | --- |
| job_id | Job ID with violation |
| job_name | Human-readable job name |
| resource_type | "Memory", "CPU", or "Runtime" |
| specified | Resource requirement from workflow spec |
| peak_used | Actual peak usage observed |
| over_utilization | Percentage over/under specification |
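
Because the report is plain JSON, it pipes cleanly into scripts. A sketch that reads the report from stdin (e.g. piped from the CLI invocation documented above) and exits non-zero when any violation is present:

import json
import sys
# Expects the check-resource-utilization JSON report on stdin.
report = json.load(sys.stdin)
for v in report["violations"]:
    print(f"{v['job_name']} (job {v['job_id']}): {v['resource_type']} "
          f"{v['peak_used']} vs {v['specified']} ({v['over_utilization']})")
sys.exit(1 if report["over_utilization_count"] > 0 else 0)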

correct-resources JSON Output

When using torc -f json workflows correct-resources:

{
  "status": "success",
  "workflow_id": 123,
  "dry_run": false,
  "no_downsize": false,
  "memory_multiplier": 1.2,
  "cpu_multiplier": 1.2,
  "runtime_multiplier": 1.2,
  "resource_requirements_updated": 2,
  "jobs_analyzed": 5,
  "memory_corrections": 1,
  "runtime_corrections": 1,
  "cpu_corrections": 1,
  "downsize_memory_corrections": 2,
  "downsize_runtime_corrections": 2,
  "downsize_cpu_corrections": 0,
  "adjustments": [
    {
      "resource_requirements_id": 10,
      "direction": "upscale",
      "job_ids": [15],
      "job_names": ["train_model"],
      "memory_adjusted": true,
      "original_memory": "8g",
      "new_memory": "13g",
      "max_peak_memory_bytes": 10500000000
    },
    {
      "resource_requirements_id": 11,
      "direction": "downscale",
      "job_ids": [20, 21],
      "job_names": ["preprocess_a", "preprocess_b"],
      "memory_adjusted": true,
      "original_memory": "32g",
      "new_memory": "3g",
      "max_peak_memory_bytes": 2147483648,
      "runtime_adjusted": true,
      "original_runtime": "PT4H",
      "new_runtime": "PT12M"
    }
  ]
}

Top-Level Fields

| Field | Description |
| --- | --- |
| memory_multiplier | Memory safety multiplier used |
| cpu_multiplier | CPU safety multiplier used |
| runtime_multiplier | Runtime safety multiplier used |
| resource_requirements_updated | Number of resource requirements changed |
| jobs_analyzed | Number of jobs with violations analyzed |
| memory_corrections | Jobs affected by memory upscaling |
| runtime_corrections | Jobs affected by runtime upscaling |
| cpu_corrections | Jobs affected by CPU upscaling |
| downsize_memory_corrections | Jobs affected by memory downsizing |
| downsize_runtime_corrections | Jobs affected by runtime downsizing |
| downsize_cpu_corrections | Jobs affected by CPU downsizing |
| adjustments | Array of per-resource-requirement adjustment details |

Adjustment Object

| Field | Description |
| --- | --- |
| resource_requirements_id | ID of the resource requirement being adjusted |
| direction | "upscale" or "downscale" |
| job_ids | Job IDs sharing this resource requirement |
| job_names | Human-readable job names |
| memory_adjusted | Whether memory was changed |
| original_memory | Previous memory setting (if adjusted) |
| new_memory | New memory setting (if adjusted) |
| max_peak_memory_bytes | Maximum peak memory observed across jobs |
| runtime_adjusted | Whether runtime was changed |
| original_runtime | Previous runtime setting (if adjusted) |
| new_runtime | New runtime setting (if adjusted) |
| cpu_adjusted | Whether CPU count was changed (omitted when false) |
| original_cpus | Previous CPU count (if adjusted) |
| new_cpus | New CPU count (if adjusted) |
| max_peak_cpu_percent | Maximum peak CPU percentage observed across jobs |
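
A sketch that summarizes this report from stdin, reading the conditional fields with .get() since they are omitted or only present when the corresponding value was adjusted:

import json
import sys
# Expects the correct-resources JSON report on stdin.
report = json.load(sys.stdin)
for adj in report["adjustments"]:
    jobs = ", ".join(adj["job_names"])
    changes = []
    if adj.get("memory_adjusted"):
        changes.append(f"memory {adj['original_memory']} -> {adj['new_memory']}")
    if adj.get("runtime_adjusted"):
        changes.append(f"runtime {adj['original_runtime']} -> {adj['new_runtime']}")
    if adj.get("cpu_adjusted"):
        changes.append(f"cpus {adj['original_cpus']} -> {adj['new_cpus']}")
    print(f"{adj['direction']} [{jobs}]: {'; '.join(changes)}")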

plot-resources Output Files

| File | Description |
| --- | --- |
| resource_plot_job_<id>.html | Per-job timeline with CPU, memory, process count |
| resource_plot_cpu_all_jobs.html | CPU comparison across all jobs |
| resource_plot_memory_all_jobs.html | Memory comparison across all jobs |
| resource_plot_summary.html | Bar chart dashboard of peak vs average |
| resource_plot_system_timeline.html | Overall system CPU and memory over time |
| resource_plot_system_summary.html | Overall system peak and average values |

All plots are self-contained HTML files using Plotly.js with:

  • Interactive hover tooltips
  • Zoom and pan controls
  • Legend toggling
  • Export options (PNG, SVG)

Monitored Metrics

| Metric | Unit | Description |
| --- | --- | --- |
| CPU percentage | % | Total CPU utilization across all cores |
| Memory usage | bytes | Resident memory consumption |
| Process count | count | Number of processes in job's process tree |

Process Tree Tracking

The monitoring system automatically tracks child processes spawned by jobs. When a job creates worker processes (e.g., Python multiprocessing), all descendants are included in the aggregated metrics.
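
For example, a job whose script resembles the following sketch would have its four workers reflected in num_processes and in the aggregated CPU and memory figures:

import multiprocessing

def busy(n):
    # Each worker is a descendant of the job process, so the monitor folds
    # its CPU and memory into the job's aggregated metrics.
    return sum(range(n))

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        pool.map(busy, [10_000_000] * 4)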

Slurm Accounting Stats

When running inside a Slurm allocation, Torc calls sacct after each job step completes and stores the results in the slurm_stats table. These complement the sysinfo-based metrics above with Slurm-native cgroup measurements.
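
A roughly equivalent manual query, using the sacct field names listed in the table below and a placeholder Slurm job ID, would be:

sacct -j <slurm_job_id> --format=MaxRSS,MaxVMSize,MaxDiskRead,MaxDiskWrite,AveCPU,NodeList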

Fields

| Field | sacct source | Description |
| --- | --- | --- |
| max_rss_bytes | MaxRSS | Peak resident-set size (from cgroups) |
| max_vm_size_bytes | MaxVMSize | Peak virtual memory size |
| max_disk_read_bytes | MaxDiskRead | Peak disk read bytes |
| max_disk_write_bytes | MaxDiskWrite | Peak disk write bytes |
| ave_cpu_seconds | AveCPU | Average CPU time in seconds |
| node_list | NodeList | Nodes used by the job step |

Additional identifying fields stored per record: workflow_id, job_id, run_id, attempt_id, slurm_job_id.

Fields are null when:

  • The job ran locally (no SLURM_JOB_ID in the environment)
  • sacct is not available on the node
  • The step was not found in the Slurm accounting database at collection time

Viewing Stats

torc slurm stats <workflow_id>
torc slurm stats <workflow_id> --job-id <job_id>
torc -f json slurm stats <workflow_id>

Performance Characteristics

  • Single background monitoring thread regardless of job count
  • Typical overhead: <1% CPU even with 1-second sampling
  • Uses native OS APIs via the sysinfo crate
  • Non-blocking async design