Creating a Custom HPC Profile

This tutorial walks you through creating a custom HPC profile for a cluster that Torc doesn't have built-in support for.

Before You Start

Request Built-in Support First!

If your HPC system is widely used, consider requesting that Torc developers add it as a built-in profile. This benefits everyone using that system.

Open an issue at github.com/NatLabRockies/torc/issues with:

  • Your HPC system name and organization
  • Partition names and their resource limits (CPUs, memory, walltime, GPUs)
  • How to detect the system (environment variable or hostname pattern)
  • Any special requirements (minimum nodes, exclusive partitions, etc.)

Built-in profiles are maintained by the Torc team and stay up-to-date as systems change.

Dynamic Slurm Support (The Easiest Way)

Before creating a custom profile, try using Torc's Dynamic Slurm Support. Torc can automatically query your cluster to discover its partitions and resource limits.

If you are on a Slurm system, you can use Torc immediately without any configuration:

  1. Auto-detection: Torc automatically falls back to dynamic Slurm detection if no other profile matches.
  2. Explicit use: You can force dynamic detection by using --hpc-profile slurm in any command.

Verify it works on your system:

# Show partitions detected from your Slurm cluster
torc hpc partitions slurm

If the detected partitions look correct, you don't need to create a custom profile! You can jump straight to Step 7: Use Your Profile, using slurm as the profile name.

When to Create a Custom Profile

Create a custom profile when:

  • Your HPC isn't supported and you need to use it immediately
  • You have a private or internal cluster
  • You want to test profile configurations before submitting upstream

Quick Start: Auto-Generate from Slurm

If you're on a Slurm cluster, you can automatically generate a profile from the cluster configuration:

# Generate profile from current Slurm cluster
torc hpc generate

# Specify a custom name
torc hpc generate --name mycluster --display-name "My Research Cluster"

# Skip standby/preemptible partitions
torc hpc generate --skip-stdby

# Save to a file
torc hpc generate --skip-stdby -o mycluster-profile.toml

This queries sinfo and scontrol to extract:

  • Partition names, CPUs, memory, and time limits
  • GPU configuration from GRES
  • Node sharing settings
  • Hostname-based detection pattern

The generated profile can be added directly to your config file. You may want to review and adjust:

  • requires_explicit_request: Set to true for partitions that shouldn't be auto-selected
  • description: Add human-readable descriptions for each partition

After generation, skip to Step 4: Verify the Profile.
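To make the generation step concrete, here is a rough sketch (not Torc's actual implementation) of how partition records could be parsed from `sinfo -o "%P %c %m %l %G"` output, the same format used in Step 1 below:

```python
# Illustrative sketch: parse `sinfo -o "%P %c %m %l %G"` output into
# partition dicts. Torc's real generator also consults `scontrol`.
def parse_sinfo(text):
    partitions = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        name, cpus, mem_mb, timelimit, gres = line.split()
        partitions.append({
            "name": name.rstrip("*"),   # "*" marks Slurm's default partition
            "cpus_per_node": int(cpus),
            "memory_mb": int(mem_mb),   # %m reports memory in MB
            "max_walltime": timelimit,  # e.g. "3-00:00:00"
            "gres": gres,               # e.g. "gpu:a100:4" or "(null)"
        })
    return partitions

sample = """PARTITION CPUS MEMORY TIMELIMIT GRES
batch* 48 192000 3-00:00:00 (null)
gpu 32 256000 2-00:00:00 gpu:a100:4
"""
for p in parse_sinfo(sample):
    print(p["name"], p["cpus_per_node"], p["memory_mb"])
```

The sample output lines are hypothetical; run the `sinfo` command on your own cluster to see the real format.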

Manual Profile Creation

If automatic generation isn't available or you need more control, follow these steps.

Step 1: Gather Partition Information

Collect information about your HPC's partitions. On most Slurm systems:

# List all partitions
sinfo -s

# Get detailed partition info
sinfo -o "%P %c %m %l %G"

For this tutorial, let's say your cluster "ResearchCluster" has these partitions:

Partition   CPUs/Node   Memory    Max Walltime   GPUs
batch       48          192 GB    72 hours       -
short       48          192 GB    4 hours        -
gpu         32          256 GB    48 hours       4x A100
himem       48          1024 GB   48 hours       -

Step 2: Identify Detection Method

Determine how Torc can detect when you're on this system. Common methods:

Environment variable (most common):

echo $CLUSTER_NAME    # e.g., "research"
echo $SLURM_CLUSTER   # e.g., "researchcluster"

Hostname pattern:

hostname              # e.g., "login01.research.edu"

For this tutorial, we'll use the environment variable CLUSTER_NAME=research.
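Conceptually, detection is just a comparison of an environment variable (or hostname) against the configured pattern. A minimal sketch of the idea, not Torc's actual code:

```python
import os
import re
import socket

def detect_profile(env_var="CLUSTER_NAME", expected="research",
                   hostname_pattern=None):
    """Return True if this machine looks like the target cluster."""
    # Environment-variable detection (detect_env_var = "CLUSTER_NAME=research")
    if os.environ.get(env_var) == expected:
        return True
    # Hostname-pattern detection (detect_hostname = ".*\\.research\\.edu")
    if hostname_pattern and re.fullmatch(hostname_pattern, socket.gethostname()):
        return True
    return False

os.environ["CLUSTER_NAME"] = "research"  # simulate being on the cluster
print(detect_profile())                  # True
```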

Step 3: Create the Configuration File

Create or edit your Torc configuration file:

# Linux
mkdir -p ~/.config/torc
nano ~/.config/torc/config.toml

# macOS
mkdir -p ~/Library/Application\ Support/torc
nano ~/Library/Application\ Support/torc/config.toml

Add your custom profile:

# Custom HPC Profile for ResearchCluster
[client.hpc.custom_profiles.research]
display_name = "Research Cluster"
description = "University Research HPC System"
detect_env_var = "CLUSTER_NAME=research"
default_account = "my_project"

# Batch partition - general purpose
[[client.hpc.custom_profiles.research.partitions]]
name = "batch"
cpus_per_node = 48
memory_mb = 192000        # 192 GB in MB
max_walltime_secs = 259200  # 72 hours in seconds
shared = false

# Short partition - quick jobs
[[client.hpc.custom_profiles.research.partitions]]
name = "short"
cpus_per_node = 48
memory_mb = 192000
max_walltime_secs = 14400   # 4 hours
shared = true               # Allows sharing nodes

# GPU partition
[[client.hpc.custom_profiles.research.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 256000          # 256 GB
max_walltime_secs = 172800  # 48 hours
gpus_per_node = 4
gpu_type = "A100"
shared = false

# High memory partition
[[client.hpc.custom_profiles.research.partitions]]
name = "himem"
cpus_per_node = 48
memory_mb = 1048576         # 1024 GB (1 TB)
max_walltime_secs = 172800  # 48 hours
shared = false

Step 4: Verify the Profile

Check that Torc recognizes your profile:

# List all profiles
torc hpc list

You should see your custom profile:

Known HPC profiles:

╭──────────┬──────────────────┬────────────┬──────────╮
│ Name     │ Display Name     │ Partitions │ Detected │
├──────────┼──────────────────┼────────────┼──────────┤
│ kestrel  │ NLR Kestrel      │ 15         │          │
│ research │ Research Cluster │ 4          │ ✓        │
╰──────────┴──────────────────┴────────────┴──────────╯

View the partitions:

torc hpc partitions research
Partitions for research:

╭─────────┬───────────┬───────────┬─────────────┬──────────╮
│ Name    │ CPUs/Node │ Mem/Node  │ Max Walltime│ GPUs     │
├─────────┼───────────┼───────────┼─────────────┼──────────┤
│ batch   │ 48        │ 192 GB    │ 72h         │ -        │
│ short   │ 48        │ 192 GB    │ 4h          │ -        │
│ gpu     │ 32        │ 256 GB    │ 48h         │ 4 (A100) │
│ himem   │ 48        │ 1024 GB   │ 48h         │ -        │
╰─────────┴───────────┴───────────┴─────────────┴──────────╯

Step 5: Test Partition Matching

Verify that Torc correctly matches resource requirements to partitions:

# Should match 'short' partition
torc hpc match research --cpus 8 --memory 16g --walltime 02:00:00

# Should match 'gpu' partition
torc hpc match research --cpus 16 --memory 64g --walltime 08:00:00 --gpus 2

# Should match 'himem' partition
torc hpc match research --cpus 24 --memory 512g --walltime 24:00:00
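Under the hood, matching is a constraint check: a partition qualifies if it meets or exceeds every requested resource. The sketch below picks the qualifying partition with the tightest walltime limit, which reproduces the three matches above; Torc's actual selection policy may differ:

```python
# Partitions from the ResearchCluster example (memory in MB, walltime in seconds)
PARTITIONS = [
    {"name": "batch", "cpus": 48, "mem_mb": 192000,  "wall_s": 259200, "gpus": 0},
    {"name": "short", "cpus": 48, "mem_mb": 192000,  "wall_s": 14400,  "gpus": 0},
    {"name": "gpu",   "cpus": 32, "mem_mb": 256000,  "wall_s": 172800, "gpus": 4},
    {"name": "himem", "cpus": 48, "mem_mb": 1048576, "wall_s": 172800, "gpus": 0},
]

def match(cpus, mem_mb, wall_s, gpus=0):
    candidates = [
        p for p in PARTITIONS
        if p["cpus"] >= cpus and p["mem_mb"] >= mem_mb
        and p["wall_s"] >= wall_s and p["gpus"] >= gpus
    ]
    # Prefer the partition with the smallest walltime limit that still fits
    return min(candidates, key=lambda p: p["wall_s"])["name"] if candidates else None

print(match(8, 16000, 7200))            # short
print(match(16, 64000, 28800, gpus=2))  # gpu
print(match(24, 512000, 86400))         # himem
```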

Step 6: Test Scheduler Generation

Create a test workflow to verify scheduler generation:

# test_workflow.yaml
name: profile_test
description: Test custom HPC profile

resource_requirements:
  - name: standard
    num_cpus: 16
    memory: 64g
    runtime: PT2H

  - name: gpu_compute
    num_cpus: 16
    num_gpus: 2
    memory: 128g
    runtime: PT8H

jobs:
  - name: preprocess
    command: echo "preprocessing"
    resource_requirements: standard

  - name: train
    command: echo "training"
    resource_requirements: gpu_compute
    depends_on: [preprocess]
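The runtime fields use ISO 8601 durations (PT2H = 2 hours, PT8H = 8 hours). For reference, here is a minimal parser for the simple hour/minute/second forms used in this workflow; it is not a full ISO 8601 implementation:

```python
import re

def iso_duration_secs(s):
    """Parse simple ISO 8601 time durations like PT2H, PT30M, PT1H30M."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", s)
    if not m or not any(m.groups()):
        raise ValueError(f"unsupported duration: {s}")
    h, mi, sec = (int(g) if g else 0 for g in m.groups())
    return h * 3600 + mi * 60 + sec

print(iso_duration_secs("PT2H"))  # 7200
print(iso_duration_secs("PT8H"))  # 28800
```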

Generate schedulers:

torc slurm generate --account my_project --profile research test_workflow.yaml

You should see the generated workflow with appropriate schedulers for each partition.

Step 7: Use Your Profile

Now you can submit workflows using your custom profile:

# Auto-detect the profile (if on the cluster)
torc submit-slurm --account my_project workflow.yaml

# Or explicitly specify the profile
torc submit-slurm --account my_project --hpc-profile research workflow.yaml

Advanced Configuration

Hostname-Based Detection

If your cluster doesn't set a unique environment variable, use hostname detection:

[client.hpc.custom_profiles.research]
display_name = "Research Cluster"
detect_hostname = ".*\\.research\\.edu"  # Regex pattern
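Note the double backslashes: TOML strings require escaping, so `\\.` in the file becomes the regex `\.`. You can verify a pattern against a login-node hostname with a quick check (the hostnames below are hypothetical, and matching the whole hostname is an assumption):

```python
import re

pattern = r".*\.research\.edu"  # what Torc reads after TOML unescaping

assert re.fullmatch(pattern, "login01.research.edu")
assert not re.fullmatch(pattern, "login01.other.edu")
print("pattern OK")
```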

Minimum Node Requirements

Some partitions require a minimum number of nodes:

[[client.hpc.custom_profiles.research.partitions]]
name = "large_scale"
cpus_per_node = 128
memory_mb = 512000
max_walltime_secs = 172800
min_nodes = 16  # Must request at least 16 nodes

Explicit Request Partitions

Some partitions shouldn't be auto-selected:

[[client.hpc.custom_profiles.research.partitions]]
name = "priority"
cpus_per_node = 48
memory_mb = 192000
max_walltime_secs = 86400
requires_explicit_request = true  # Only used when explicitly requested

Troubleshooting

Profile Not Detected

If torc hpc detect doesn't find your profile:

  1. Check the environment variable or hostname:

    echo $CLUSTER_NAME
    hostname
    
  2. Verify the detection pattern in your config matches exactly

  3. Test with explicit profile specification:

    torc hpc show research
    

No Partition Found for Job

If torc slurm generate can't find a matching partition:

  1. Check if any partition satisfies all requirements:

    torc hpc match research --cpus 32 --memory 128g --walltime 08:00:00
    
  2. Verify memory is specified in MB in the config (not GB)

  3. Verify walltime is in seconds (not hours)
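These unit mistakes are easy to make. The conversions behind the values in Step 3 check out as follows; note that the examples use decimal megabytes (1 GB = 1000 MB) for most partitions but binary units (1024 GiB = 1048576 MiB) for the himem partition:

```python
# Walltime: hours -> seconds
assert 72 * 3600 == 259200   # batch: 72 hours
assert 4 * 3600 == 14400     # short: 4 hours
assert 48 * 3600 == 172800   # gpu/himem: 48 hours

# Memory: GB -> MB (decimal, as in the batch/short/gpu examples)
assert 192 * 1000 == 192000
assert 256 * 1000 == 256000

# The himem example uses binary units instead: 1024 GiB -> MiB
assert 1024 * 1024 == 1048576

print("conversions OK")
```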

Configuration File Location

Torc looks for config files in these locations:

  • Linux: ~/.config/torc/config.toml
  • macOS: ~/Library/Application Support/torc/config.toml
  • Windows: %APPDATA%\torc\config.toml

You can also use the TORC_CONFIG environment variable to specify a custom path.
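The lookup order above can be sketched as follows; that TORC_CONFIG takes precedence over the platform default is stated above, but the exact fallback behavior is an assumption:

```python
import os
import sys
from pathlib import Path

def config_path():
    # Explicit override wins
    override = os.environ.get("TORC_CONFIG")
    if override:
        return Path(override)
    # Platform defaults from the list above
    if sys.platform == "darwin":
        return Path.home() / "Library/Application Support/torc/config.toml"
    if sys.platform == "win32":
        return Path(os.environ["APPDATA"]) / "torc" / "config.toml"
    return Path.home() / ".config/torc/config.toml"

os.environ["TORC_CONFIG"] = "/tmp/custom.toml"
print(config_path())  # /tmp/custom.toml
```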

Contributing Your Profile

If your HPC is used by others, please contribute it upstream:

  1. Fork the Torc repository
  2. Add your profile as a new module in src/client/hpc/ (see kestrel.rs for an example)
  3. Add tests for your profile
  4. Submit a pull request

Or simply open an issue with your partition information and we'll add it for you.

See Also