# HPC Deployment Reference

Configuration guide for deploying Torc on High-Performance Computing systems.

## Overview
Running Torc on HPC systems requires special configuration to ensure:
- Compute nodes can reach the torc-server running on a login node
- The database is stored on a filesystem accessible to all nodes
- Network paths use the correct hostnames for the HPC interconnect
## Server Configuration on Login Nodes

### Hostname Requirements
On most HPC systems, login nodes have multiple network interfaces:
- External hostname: Used for SSH access from outside (e.g., kl3.hpc.nrel.gov)
- Internal hostname: Used by compute nodes via the high-speed interconnect (e.g., kl3.hsn.cm.kestrel.hpc.nrel.gov)
When running torc-server on a login node, you must use the internal hostname so compute nodes
can connect.
### NREL Kestrel Example
On NREL's Kestrel system, login nodes use the High-Speed Network (HSN) for internal communication:
| Login Node | External Hostname | Internal Hostname (for -u flag) |
|---|---|---|
| kl1 | kl1.hpc.nrel.gov | kl1.hsn.cm.kestrel.hpc.nrel.gov |
| kl2 | kl2.hpc.nrel.gov | kl2.hsn.cm.kestrel.hpc.nrel.gov |
| kl3 | kl3.hpc.nrel.gov | kl3.hsn.cm.kestrel.hpc.nrel.gov |
Starting the server:
```bash
# On login node kl3, use the internal hostname
torc-server run \
  --database /scratch/$USER/torc.db \
  -u kl3.hsn.cm.kestrel.hpc.nrel.gov \
  --port 8085
```
Connecting clients:
```bash
# Set the API URL using the internal hostname
export TORC_API_URL="http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"

# Now torc commands will use this URL
torc workflows list
```
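Before submitting any work, it is worth confirming that the server responds from the login node itself. A quick check, reusing the /workflows endpoint queried in the Slurm section below:

```bash
# A successful response confirms the hostname and port are correct
curl -s "$TORC_API_URL/workflows" | head
```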
### Finding the Internal Hostname
If you're unsure of your system's internal hostname, try these approaches:
```bash
# Check all network interfaces
hostname -A

# Look for hostnames in the hosts file
grep $(hostname -s) /etc/hosts

# Check Slurm configuration for the control machine
scontrol show config | grep ControlMachine
```
Consult your HPC system's documentation or support team for the correct internal hostname format.
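Once you have a candidate hostname, a quick sanity check is to confirm that it resolves and, after the server is up, that its port is reachable. A sketch; the hostname and port below are placeholders for your system's values:

```bash
# Placeholder values; substitute your system's internal hostname and chosen port
CANDIDATE=kl3.hsn.cm.kestrel.hpc.nrel.gov
PORT=8085

# Confirm the name resolves
getent hosts "$CANDIDATE"

# Once torc-server is running, confirm the port is reachable
nc -zv "$CANDIDATE" "$PORT"
```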
## Database Placement
The SQLite database must be on a filesystem accessible to both:
- The login node running torc-server
- All compute nodes running jobs
### Recommended Locations
| Filesystem | Pros | Cons |
|---|---|---|
| Scratch (/scratch/$USER/) | Fast, shared, high capacity | May be purged periodically |
| Project (/projects/) | Persistent, shared | May have quotas |
| Home (~) | Persistent | Often slow, limited space |
Best practice: use scratch for active workflows and back up completed workflows to project storage.
```bash
# Create a dedicated directory
mkdir -p /scratch/$USER/torc

# Start server with scratch database
torc-server run \
  --database /scratch/$USER/torc/workflows.db \
  -u $(hostname -s).hsn.cm.kestrel.hpc.nrel.gov \
  --port 8085
```
### Database Backup

For long-running workflows, periodically back up the database:

```bash
# SQLite backup (safe while the server is running)
sqlite3 /scratch/$USER/torc.db ".backup /projects/$USER/torc_backup.db"
```
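If you want backups to happen automatically during a long run, one option is a simple loop kept alive in the same tmux session as the server. A minimal sketch; the paths and one-hour interval are placeholders:

```bash
# Hypothetical periodic backup loop; adjust paths and interval to your project area
while true; do
    sqlite3 /scratch/$USER/torc.db \
        ".backup /projects/$USER/torc_backup_$(date +%Y%m%d%H%M).db"
    sleep 3600
done
```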
## Port Selection
Login nodes are shared resources. To avoid conflicts:
- Use a non-default port: Choose a port in the range 8000-9999
- Check for conflicts: lsof -i :8085
- Consider using your UID: --port $((8000 + UID % 1000))
```bash
# Use a unique port based on your user ID
MY_PORT=$((8000 + $(id -u) % 1000))

torc-server run \
  --database /scratch/$USER/torc.db \
  -u kl3.hsn.cm.kestrel.hpc.nrel.gov \
  --port $MY_PORT
```
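If you pick a port this way, clients must point at the same port. One way to keep the two in sync, reusing the MY_PORT variable above:

```bash
# Build the API URL from the same port the server was started with
export TORC_API_URL="http://kl3.hsn.cm.kestrel.hpc.nrel.gov:${MY_PORT}/torc-service/v1"
torc workflows list
```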
## Running in tmux/screen

Always run torc-server inside a terminal multiplexer so the server survives SSH disconnects:
```bash
# Start a tmux session
tmux new -s torc

# Start the server
torc-server run \
  --database /scratch/$USER/torc.db \
  -u kl3.hsn.cm.kestrel.hpc.nrel.gov \
  --port 8085

# Detach with Ctrl+b, then d
# Reattach later with: tmux attach -t torc
```
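Equivalently, the server can be started in a detached session in one step. A sketch using the same flags as above:

```bash
# Start the server in a detached tmux session named "torc"
tmux new -d -s torc \
  "torc-server run --database /scratch/$USER/torc.db -u kl3.hsn.cm.kestrel.hpc.nrel.gov --port 8085"

# Attach later to check on it
tmux attach -t torc
```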
## Complete Configuration Example

### Server Configuration File
Create ~/.config/torc/config.toml:
```toml
[server]
# Use internal hostname for compute node access
url = "kl3.hsn.cm.kestrel.hpc.nrel.gov"
port = 8085
database = "/scratch/myuser/torc/workflows.db"
threads = 4
completion_check_interval_secs = 30.0
log_level = "info"

[server.logging]
log_dir = "/scratch/myuser/torc/logs"
```
### Client Configuration File
Create ~/.config/torc/config.toml (or add to existing):
```toml
[client]
# Match the server's internal hostname and port
api_url = "http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"
format = "table"

[client.run]
output_dir = "/scratch/myuser/torc/output"
```
### Environment Variables
Alternatively, set environment variables in your shell profile:
```bash
# Add to ~/.bashrc or ~/.bash_profile
export TORC_API_URL="http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"
export TORC_CLIENT__RUN__OUTPUT_DIR="/scratch/$USER/torc/output"
```
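If the server runs on whichever login node you are currently on, you can derive the internal hostname rather than hardcoding it. A sketch that assumes the Kestrel HSN naming pattern shown in the table above and the default port 8085:

```bash
# Derive the HSN hostname for the current login node (Kestrel naming pattern)
TORC_HOST="$(hostname -s).hsn.cm.kestrel.hpc.nrel.gov"
export TORC_API_URL="http://${TORC_HOST}:8085/torc-service/v1"
export TORC_CLIENT__RUN__OUTPUT_DIR="/scratch/$USER/torc/output"
```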
## Slurm Job Runner Configuration
When submitting workflows to Slurm, the job runners on compute nodes need to reach the server. The
TORC_API_URL is automatically passed to Slurm jobs.
Verify connectivity from a compute node:
```bash
# Submit an interactive job
salloc -N 1 -t 00:10:00

# Test connectivity to the server
curl -s "$TORC_API_URL/workflows" | head

# Exit the allocation
exit
```
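If interactive allocations are slow to schedule, the same check can be submitted as a batch job. A minimal sketch; account and partition flags are omitted and left to your site defaults:

```bash
# Batch-mode connectivity check; output lands in slurm-<jobid>.out
sbatch -N 1 -t 00:05:00 --wrap "curl -s \"$TORC_API_URL/workflows\" | head"
```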
## Troubleshooting

### "Connection refused" from compute nodes
- Verify the server is using the internal hostname: torc-server run -u <internal-hostname> --port 8085
- Check the server is listening on all interfaces: netstat -tlnp | grep 8085
- Verify no firewall blocks the port, from a compute node: nc -zv <internal-hostname> 8085
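These checks can be combined into a single script run from a compute node. A sketch; the hostname and port are placeholders for your server's values:

```bash
# Run from a compute node; substitute your server's internal hostname and port
HOST=kl3.hsn.cm.kestrel.hpc.nrel.gov
PORT=8085

# Can this node resolve the internal hostname?
getent hosts "$HOST"

# Is the port reachable over the interconnect?
nc -zv "$HOST" "$PORT"

# Does the API answer?
curl -s "http://${HOST}:${PORT}/torc-service/v1/workflows" | head
```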
### Database locked errors
SQLite may report locking issues on network filesystems:
- Ensure only one torc-server instance is running
- Use a local scratch filesystem rather than NFS home directories
- Consider increasing completion_check_interval_secs to reduce database contention
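To rule out duplicate servers, a quick check of your own processes; the suggested interval increase is only an example:

```bash
# There should be exactly one torc-server process under your account
pgrep -u "$USER" -af torc-server

# If contention persists, raise completion_check_interval_secs in
# ~/.config/torc/config.toml (e.g., from 30.0 to 60.0) and restart the server
```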
### Server stops when SSH disconnects
Always use tmux or screen (see above). If the server dies unexpectedly:
```bash
# Check if the server is still running
pgrep -f torc-server

# Check server logs
tail -100 /scratch/$USER/torc/logs/torc-server*.log
```
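If the process is gone, make sure nothing stale is still holding the port before restarting the server in tmux. A sketch using the default port:

```bash
# Confirm the port is free before restarting (substitute your port)
lsof -i :8085

# If a stale torc-server still holds it, stop it first
pkill -u "$USER" -f torc-server
```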