# HPC Deployment Reference

Configuration guide for deploying Torc on High-Performance Computing systems.

## Overview
Running Torc on HPC systems requires special configuration to ensure:
- Compute nodes can reach the torc-server running on a login node
- The database is stored on a filesystem accessible to all nodes
- Network paths use the correct hostnames for the HPC interconnect
## Server Configuration on Login Nodes

### Hostname Requirements
On most HPC systems, login nodes have multiple network interfaces:
- External hostname: Used for SSH access from outside (e.g., kl3.hpc.nrel.gov)
- Internal hostname: Used by compute nodes via the high-speed interconnect (e.g., kl3.hsn.cm.kestrel.hpc.nrel.gov)
When running torc-server on a login node, you must use the internal hostname so compute nodes
can connect.
### NREL Kestrel Example
On NREL's Kestrel system, login nodes use the High-Speed Network (HSN) for internal communication:
| Login Node | External Hostname | Internal Hostname (for -u flag) |
|---|---|---|
| kl1 | kl1.hpc.nrel.gov | kl1.hsn.cm.kestrel.hpc.nrel.gov |
| kl2 | kl2.hpc.nrel.gov | kl2.hsn.cm.kestrel.hpc.nrel.gov |
| kl3 | kl3.hpc.nrel.gov | kl3.hsn.cm.kestrel.hpc.nrel.gov |
Starting the server:
```bash
# On login node kl3, use the internal hostname
torc-server run \
  --database /scratch/$USER/torc.db \
  -u kl3.hsn.cm.kestrel.hpc.nrel.gov \
  --port 8085
```
Connecting clients:
```bash
# Set the API URL using the internal hostname
export TORC_API_URL="http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"

# Now torc commands will use this URL
torc workflows list
```
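Before submitting any work, it is worth confirming that the server responds from the login node itself. A quick check, reusing the /workflows endpoint queried in the Slurm section below:

```bash
# A successful response confirms the hostname and port are correct
curl -s "$TORC_API_URL/workflows" | head
```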
### Finding the Internal Hostname
If you're unsure of your system's internal hostname, try these approaches:
```bash
# Check all network interfaces
hostname -A

# Look for hostnames in the hosts file
grep $(hostname -s) /etc/hosts

# Check Slurm configuration for the control machine
scontrol show config | grep ControlMachine
```
Consult your HPC system's documentation or support team for the correct internal hostname format.
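Once you have a candidate hostname, a quick sanity check is to confirm that it resolves and, after the server is up, that its port is reachable. A sketch; the hostname and port below are placeholders for your system's values:

```bash
# Placeholder values; substitute your system's internal hostname and chosen port
CANDIDATE=kl3.hsn.cm.kestrel.hpc.nrel.gov
PORT=8085

# Confirm the name resolves
getent hosts "$CANDIDATE"

# Once torc-server is running, confirm the port is reachable
nc -zv "$CANDIDATE" "$PORT"
```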
## Database Placement
The SQLite database must be on a filesystem accessible to both:
- The login node running torc-server
- All compute nodes running jobs
### Recommended Locations
| Filesystem | Pros | Cons |
|---|---|---|
| Scratch (/scratch/$USER/) | Fast, shared, high capacity | May be purged periodically |
| Project (/projects/) | Persistent, shared | May have quotas |
| Home (~) | Persistent | Often slow, limited space |
Best practice: use scratch for active workflows and back up completed workflows to project storage.
```bash
# Create a dedicated directory
mkdir -p /scratch/$USER/torc

# Start server with scratch database
torc-server run \
  --database /scratch/$USER/torc/workflows.db \
  -u $(hostname -s).hsn.cm.kestrel.hpc.nrel.gov \
  --port 8085
```
### Database Backup

For long-running workflows, periodically back up the database:

```bash
# SQLite backup (safe while the server is running)
sqlite3 /scratch/$USER/torc.db ".backup /projects/$USER/torc_backup.db"
```
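If you want backups to happen automatically during a long run, one option is a simple loop kept alive in the same tmux session as the server. A minimal sketch; the paths and one-hour interval are placeholders:

```bash
# Hypothetical periodic backup loop; adjust paths and interval to your project area
while true; do
    sqlite3 /scratch/$USER/torc.db \
        ".backup /projects/$USER/torc_backup_$(date +%Y%m%d%H%M).db"
    sleep 3600
done
```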
## Port Selection
Login nodes are shared resources. To avoid conflicts:
- Use a non-default port: Choose a port in the range 8000-9999
- Check for conflicts: lsof -i :8085
- Consider using your UID: --port $((8000 + UID % 1000))
```bash
# Use a unique port based on your user ID
MY_PORT=$((8000 + $(id -u) % 1000))

torc-server run \
  --database /scratch/$USER/torc.db \
  -u kl3.hsn.cm.kestrel.hpc.nrel.gov \
  --port $MY_PORT
```
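If you pick a port this way, clients must point at the same port. One way to keep the two in sync, reusing the MY_PORT variable above:

```bash
# Build the API URL from the same port the server was started with
export TORC_API_URL="http://kl3.hsn.cm.kestrel.hpc.nrel.gov:${MY_PORT}/torc-service/v1"
torc workflows list
```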
## Running in tmux/screen

Always run torc-server inside a terminal multiplexer so the server survives SSH disconnects:
```bash
# Start a tmux session
tmux new -s torc

# Start the server
torc-server run \
  --database /scratch/$USER/torc.db \
  -u kl3.hsn.cm.kestrel.hpc.nrel.gov \
  --port 8085

# Detach with Ctrl+b, then d
# Reattach later with: tmux attach -t torc
```
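Equivalently, the server can be started in a detached session in one step. A sketch using the same flags as above:

```bash
# Start the server in a detached tmux session named "torc"
tmux new -d -s torc \
  "torc-server run --database /scratch/$USER/torc.db -u kl3.hsn.cm.kestrel.hpc.nrel.gov --port 8085"

# Attach later to check on it
tmux attach -t torc
```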
## Complete Configuration Example

### Server Configuration File
Create ~/.config/torc/config.toml:
```toml
[server]
# Use internal hostname for compute node access
url = "kl3.hsn.cm.kestrel.hpc.nrel.gov"
port = 8085
database = "/scratch/myuser/torc/workflows.db"
threads = 4
completion_check_interval_secs = 30.0
log_level = "info"

[server.logging]
log_dir = "/scratch/myuser/torc/logs"
```
### Client Configuration File
Create ~/.config/torc/config.toml (or add to existing):
```toml
[client]
# Match the server's internal hostname and port
api_url = "http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"
format = "table"

[client.run]
output_dir = "/scratch/myuser/torc/output"
```
### Environment Variables
Alternatively, set environment variables in your shell profile:
```bash
# Add to ~/.bashrc or ~/.bash_profile
export TORC_API_URL="http://kl3.hsn.cm.kestrel.hpc.nrel.gov:8085/torc-service/v1"
export TORC_CLIENT__RUN__OUTPUT_DIR="/scratch/$USER/torc/output"
```
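If the server runs on whichever login node you are currently on, you can derive the internal hostname rather than hardcoding it. A sketch that assumes the Kestrel HSN naming pattern shown in the table above and the default port 8085:

```bash
# Derive the HSN hostname for the current login node (Kestrel naming pattern)
TORC_HOST="$(hostname -s).hsn.cm.kestrel.hpc.nrel.gov"
export TORC_API_URL="http://${TORC_HOST}:8085/torc-service/v1"
export TORC_CLIENT__RUN__OUTPUT_DIR="/scratch/$USER/torc/output"
```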
## Slurm Job Runner Configuration
When submitting workflows to Slurm, the job runners on compute nodes need to reach the server. The
TORC_API_URL is automatically passed to Slurm jobs.
Verify connectivity from a compute node:
```bash
# Submit an interactive job
salloc -N 1 -t 00:10:00

# Test connectivity to the server
curl -s "$TORC_API_URL/workflows" | head

# Exit the allocation
exit
```
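If interactive allocations are slow to schedule, the same check can be submitted as a batch job. A minimal sketch; account and partition flags are omitted and left to your site defaults:

```bash
# Batch-mode connectivity check; output lands in slurm-<jobid>.out
sbatch -N 1 -t 00:05:00 --wrap "curl -s \"$TORC_API_URL/workflows\" | head"
```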
## Troubleshooting

### "Connection refused" from compute nodes
- Verify the server is using the internal hostname: torc-server run -u <internal-hostname> --port 8085
- Check the server is listening on all interfaces: netstat -tlnp | grep 8085
- Verify no firewall blocks the port, from a compute node: nc -zv <internal-hostname> 8085
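These checks can be combined into a single script run from a compute node. A sketch; the hostname and port are placeholders for your server's values:

```bash
# Run from a compute node; substitute your server's internal hostname and port
HOST=kl3.hsn.cm.kestrel.hpc.nrel.gov
PORT=8085

# Can this node resolve the internal hostname?
getent hosts "$HOST"

# Is the port reachable over the interconnect?
nc -zv "$HOST" "$PORT"

# Does the API answer?
curl -s "http://${HOST}:${PORT}/torc-service/v1/workflows" | head
```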
### Database locked errors
SQLite may report locking issues on network filesystems:
- Ensure only one torc-server instance is running
- Use a local scratch filesystem rather than NFS home directories
- Consider increasing completion_check_interval_secs to reduce database contention
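To rule out duplicate servers, a quick check of your own processes; the suggested interval increase is only an example:

```bash
# There should be exactly one torc-server process under your account
pgrep -u "$USER" -af torc-server

# If contention persists, raise completion_check_interval_secs in
# ~/.config/torc/config.toml (e.g., from 30.0 to 60.0) and restart the server
```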
### Server stops when SSH disconnects
Always use tmux or screen (see above). If the server dies unexpectedly:
```bash
# Check if the server is still running
pgrep -f torc-server

# Check server logs
tail -100 /scratch/$USER/torc/logs/torc-server*.log
```
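If the process is gone, make sure nothing stale is still holding the port before restarting the server in tmux. A sketch using the default port:

```bash
# Confirm the port is free before restarting (substitute your port)
lsof -i :8085

# If a stale torc-server still holds it, stop it first
pkill -u "$USER" -f torc-server
```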