# Remote Workers
Run workflows across multiple machines via SSH without requiring an HPC scheduler.
## Overview
Torc supports three execution modes:
- Local (`torc run`) - Jobs run on the current machine
- HPC (`torc submit-slurm`) - Jobs run on Slurm-allocated nodes
- Remote Workers (`torc remote run`) - Jobs run on SSH-accessible machines
Remote workers are ideal for:
- Ad-hoc clusters of workstations or cloud VMs
- Environments without a scheduler
- Testing distributed workflows before HPC deployment
## Worker File Format
Create a text file listing remote machines:
```text
# Lines starting with # are comments
# Format: [user@]hostname[:port]

# Simple hostname
worker1.example.com

# With username
alice@worker2.example.com

# With custom SSH port
admin@192.168.1.10:2222

# IPv4 address
10.0.0.5

# IPv6 address (must be in brackets for port specification)
[2001:db8::1]
[::1]:2222
```
Each host can only appear once. Duplicate hosts will cause an error.
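Before adding the file, it can save a round of failures to confirm that each entry is reachable over SSH. A minimal sketch, assuming the file is named `workers.txt`:

```bash
# Quick reachability check for the plain-host entries in workers.txt.
# Entries with a custom port or a bracketed IPv6 address contain ':'
# and are skipped here; check those by hand with `ssh -p <port>`.
grep -v '^#' workers.txt | grep -v ':' | while read -r host; do
    [ -z "$host" ] && continue
    if ssh -n -o BatchMode=yes -o ConnectTimeout=30 "$host" true; then
        echo "OK      $host"
    else
        echo "FAILED  $host"
    fi
done
```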
## Worker Management
Workers are stored in the database and persist across command invocations. This means you only need to specify workers once, and subsequent commands can reference them by workflow ID.
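For example, registering the workers once and then referring to them only by workflow ID afterwards (IDs and hostnames as used elsewhere on this page):

```bash
# Register the workers once for workflow 42
torc remote add-workers 42 worker1.example.com worker2.example.com

# Later invocations only need the workflow ID
torc remote run 42
torc remote status 42
```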
### Add Workers

```bash
torc remote add-workers <workflow-id> <worker>...
```

Add one or more workers directly on the command line:

```bash
torc remote add-workers 42 worker1.example.com alice@worker2.example.com admin@192.168.1.10:2222
```
### Add Workers from File

```bash
torc remote add-workers-from-file <worker-file> [workflow-id]
```

Example:

```bash
torc remote add-workers-from-file workers.txt 42
```
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
### List Workers

```bash
torc remote list-workers [workflow-id]
```
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
### Remove a Worker

```bash
torc remote remove-worker <worker> [workflow-id]
```

Example:

```bash
torc remote remove-worker worker1.example.com 42
```
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
## Commands
### Start Workers

```bash
torc remote run [workflow-id] [options]
```
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Workers are fetched from the database. If you want to add workers from a file at the same time:
```bash
torc remote run <workflow-id> --workers <worker-file> [options]
```
Options:
| Option | Default | Description |
|---|---|---|
| `--workers` | none | Worker file to add before starting |
| `-o, --output-dir` | `torc_output` | Output directory on remote machines |
| `--max-parallel-ssh` | 10 | Maximum parallel SSH connections |
| `-p, --poll-interval` | 5.0 | How often workers poll for jobs (seconds) |
| `--max-parallel-jobs` | auto | Maximum parallel jobs per worker |
| `--num-cpus` | auto | CPUs per worker (auto-detected if not specified) |
| `--memory-gb` | auto | Memory per worker (auto-detected if not specified) |
| `--num-gpus` | auto | GPUs per worker (auto-detected if not specified) |
| `--skip-version-check` | false | Skip version verification (not recommended) |
Example:
```bash
# First time: add workers and start
torc remote run 42 --workers workers.txt \
    --output-dir /data/torc_output \
    --poll-interval 10

# Subsequent runs: workers already in database
torc remote run 42 --output-dir /data/torc_output
```
### Check Status

```bash
torc remote status [workflow-id] [options]
```
Shows which workers are still running. Workers are fetched from the database. If workflow-id is
omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
| `--max-parallel-ssh` | 10 | Maximum parallel SSH connections |
### Stop Workers

```bash
torc remote stop [workflow-id] [options]
```
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
| `--force` | false | Send SIGKILL instead of SIGTERM |
| `--max-parallel-ssh` | 10 | Maximum parallel SSH connections |
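For example, stopping the workers for workflow 42 and escalating only if they do not exit:

```bash
# Graceful stop (SIGTERM)
torc remote stop 42

# Force kill (SIGKILL) if workers are still running afterwards
torc remote stop 42 --force
```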
### Collect Logs

```bash
torc remote collect-logs [workflow-id] [options]
```
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
| `-l, --local-output-dir` | `remote_logs` | Local directory for collected logs |
| `--remote-output-dir` | `torc_output` | Remote output directory |
| `--delete` | false | Delete remote logs after successful collection |
| `--max-parallel-ssh` | 10 | Maximum parallel SSH connections |
Example with deletion:
```bash
# Collect logs and clean up remote workers
torc remote collect-logs 42 --delete
```
### Delete Logs

```bash
torc remote delete-logs [workflow-id] [options]
```
Delete the output directory from all remote workers without collecting logs first. Use `collect-logs --delete` if you want to save logs before deleting.
If workflow-id is omitted, you'll be prompted to select a workflow interactively.
Options:
| Option | Default | Description |
|---|---|---|
| `--remote-output-dir` | `torc_output` | Remote output directory |
| `--max-parallel-ssh` | 10 | Maximum parallel SSH connections |
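For example, removing a non-default output directory from every worker of workflow 42:

```bash
# Deletes /data/torc_output on each remote worker; nothing is copied back first
torc remote delete-logs 42 --remote-output-dir /data/torc_output
```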
## Typical Workflow
1. Create a workflow:

   ```bash
   torc workflows create my_workflow.yaml
   ```

2. Add workers:

   ```bash
   # From command line
   torc remote add-workers 42 worker1.example.com worker2.example.com

   # Or from file
   torc remote add-workers-from-file workers.txt 42
   ```

3. Start workers:

   ```bash
   torc remote run 42
   ```

4. Monitor status:

   ```bash
   torc remote status 42
   ```

5. Collect logs when complete:

   ```bash
   torc remote collect-logs 42 -l ./logs
   ```
Or combine steps 2 and 3:
```bash
torc remote run 42 --workers workers.txt
```
## How It Works
- Version Check: Verifies all remote machines have the same torc version
- Worker Start: Uses `nohup` to start detached workers that survive SSH disconnection
- Job Execution: Each worker polls the server for available jobs and executes them locally
- Completion: Workers exit when the workflow is complete or canceled
The server coordinates job distribution. Multiple workers can safely poll the same workflow without double-allocating jobs.
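To spot-check by hand that a detached worker on a given host is still alive, you can look for the process over SSH. This is only a manual check, not necessarily how `torc remote status` does it:

```bash
# List any running torc processes on one worker (GNU/Linux pgrep)
ssh worker1.example.com 'pgrep -af torc || echo "no torc process found"'
```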
## SSH Configuration
Workers connect using SSH with these options:
- `ConnectTimeout=30` - 30 second connection timeout
- `BatchMode=yes` - No password prompts (requires key-based auth)
- `StrictHostKeyChecking=accept-new` - Accept new host keys automatically
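To reproduce the same connection settings by hand while debugging a host, you can pass the identical options to `ssh` (the host below is just an example from the worker file):

```bash
ssh -o ConnectTimeout=30 -o BatchMode=yes -o StrictHostKeyChecking=accept-new \
    alice@worker2.example.com true
```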
For custom SSH configuration, use `~/.ssh/config` on the local machine:

```text
Host worker1
    HostName worker1.example.com
    User alice
    Port 2222
    IdentityFile ~/.ssh/worker_key
```
Then reference the alias in your worker file:
```text
worker1
worker2
worker3
```
## Resource Monitoring
If your workflow has resource monitoring enabled, each worker collects utilization data:
```yaml
name: my_workflow
resource_monitor_config:
  enabled: true
  granularity: time_series
  sample_interval_seconds: 5
```
The `collect-logs` command retrieves these databases along with job logs.
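After collection you can look for the monitoring databases among the retrieved files. This is a sketch only; the exact layout and file extension under the local output directory may differ between torc versions:

```bash
torc remote collect-logs 42 -l ./remote_logs
# Assumes monitoring data comes back as .db files; adjust the pattern
# if your torc version uses a different name or extension.
find ./remote_logs -type f -name '*.db'
```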
## Troubleshooting
### No Workers Configured

```text
No workers configured for workflow 42. Use 'torc remote add-workers' or '--workers' flag.
```
Add workers to the workflow using `torc remote add-workers` or the `--workers` flag on `run`.
### Version Mismatch

```text
Error: Version check failed on 2 worker(s):
worker1: Version mismatch: local=0.7.0, worker1=0.6.5
worker2: Version mismatch: local=0.7.0, worker2=0.6.5
```
Install the same torc version on all machines, or use `--skip-version-check` (not recommended for production).
### SSH Connection Failed

```text
Error: SSH connectivity check failed for 1 worker(s):
worker1: SSH connection failed to worker1: Permission denied (publickey)
```
Verify SSH key-based authentication works:
```bash
ssh worker1.example.com true
```
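If that fails with a public-key error, one common fix is to install your public key on the worker, assuming password authentication is still available for the initial copy (the username shown is an example):

```bash
ssh-copy-id alice@worker1.example.com
```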
### Worker Died Immediately

```text
[FAILED] worker1: Process died immediately. Last log:
Error: connection refused...
```
The worker couldn't connect to the server. Check:
- Server is accessible from the remote machine
- Firewall allows connections on the server port
- The `--url` points to the correct server address
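A quick way to test the first two points is to request the server URL from the worker over SSH. The URL below is a placeholder; substitute the address your workers are actually given:

```bash
# Replace the URL with the torc server address and port your workers use
ssh worker1.example.com \
    'curl -s --max-time 10 -o /dev/null http://torc-server.example.com:8080/ && echo reachable || echo unreachable'
```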
### Workers Not Claiming Jobs
If workers start but don't claim jobs:
- Check the workflow is initialized: `torc workflows status <id>`
- Check jobs are ready: `torc jobs list <id>`
- Check resource requirements match available resources
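For example, for workflow 42:

```bash
torc workflows status 42
torc jobs list 42
```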
## Comparison with Slurm
| Feature | Remote Workers | Slurm |
|---|---|---|
| Scheduler required | No | Yes |
| Resource allocation | Manual (worker file) | Automatic |
| Fault tolerance | Limited | Full (job requeue) |
| Walltime limits | No | Yes |
| Priority/queuing | No | Yes |
| Best for | Ad-hoc clusters, testing | Production HPC |
## Security Considerations
- Workers authenticate to the torc server (if authentication is enabled)
- SSH keys should be properly secured
- Workers run with the permissions of the SSH user on each machine
- The torc server URL is passed to workers and visible in process lists