Remote Workers

Run workflows across multiple machines via SSH without requiring an HPC scheduler.

Overview

Torc supports three execution modes:

  1. Local (torc run) - Jobs run on the current machine
  2. HPC (torc submit-slurm) - Jobs run on Slurm-allocated nodes
  3. Remote Workers (torc remote run) - Jobs run on SSH-accessible machines

Remote workers are ideal for:

  • Ad-hoc clusters of workstations or cloud VMs
  • Environments without a scheduler
  • Testing distributed workflows before HPC deployment

Worker File Format

Create a text file listing remote machines:

# Lines starting with # are comments
# Format: [user@]hostname[:port]

# Simple hostname
worker1.example.com

# With username
alice@worker2.example.com

# With custom SSH port
admin@192.168.1.10:2222

# IPv4 address
10.0.0.5

# IPv6 address (must be in brackets for port specification)
[2001:db8::1]
[::1]:2222

Each host may appear only once; duplicate entries cause an error.
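
To catch duplicates before adding a file, you can pre-check it with standard shell tools (a quick sketch; it assumes one host per line, as shown above):

# Print any host that appears more than once, ignoring comments and blank lines
grep -v '^[[:space:]]*#' workers.txt | grep -v '^[[:space:]]*$' | sort | uniq -d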

Worker Management

Workers are stored in the database and persist across command invocations. This means you only need to specify workers once, and subsequent commands can reference them by workflow ID.

Add Workers

torc remote add-workers <workflow-id> <worker>...

Add one or more workers directly on the command line:

torc remote add-workers 42 worker1.example.com alice@worker2.example.com admin@192.168.1.10:2222

Add Workers from File

torc remote add-workers-from-file <worker-file> [workflow-id]

Example:

torc remote add-workers-from-file workers.txt 42

If workflow-id is omitted, you'll be prompted to select a workflow interactively.

List Workers

torc remote list-workers [workflow-id]

If workflow-id is omitted, you'll be prompted to select a workflow interactively.

Remove a Worker

torc remote remove-worker <worker> [workflow-id]

Example:

torc remote remove-worker worker1.example.com 42

If workflow-id is omitted, you'll be prompted to select a workflow interactively.

Commands

Start Workers

torc remote run [workflow-id] [options]

If workflow-id is omitted, you'll be prompted to select a workflow interactively.

Workers are fetched from the database. If you want to add workers from a file at the same time:

torc remote run <workflow-id> --workers <worker-file> [options]

Options:

Option                 Default       Description
--workers              none          Worker file to add before starting
-o, --output-dir       torc_output   Output directory on remote machines
--max-parallel-ssh     10            Maximum parallel SSH connections
-p, --poll-interval    5.0           How often workers poll for jobs (seconds)
--max-parallel-jobs    auto          Maximum parallel jobs per worker
--num-cpus             auto          CPUs per worker (auto-detected if not specified)
--memory-gb            auto          Memory per worker (auto-detected if not specified)
--num-gpus             auto          GPUs per worker (auto-detected if not specified)
--skip-version-check   false         Skip version verification (not recommended)

Example:

# First time: add workers and start
torc remote run 42 --workers workers.txt \
    --output-dir /data/torc_output \
    --poll-interval 10

# Subsequent runs: workers already in database
torc remote run 42 --output-dir /data/torc_output
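
You can also pin worker resources explicitly instead of relying on auto-detection, using the flags from the table above:

# Override auto-detection: 16 CPUs, 64 GB of memory, and no GPUs per worker
torc remote run 42 --num-cpus 16 --memory-gb 64 --num-gpus 0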

Check Status

torc remote status [workflow-id] [options]

Shows which workers are still running. Workers are fetched from the database. If workflow-id is omitted, you'll be prompted to select a workflow interactively.

Options:

Option               Default   Description
--max-parallel-ssh   10        Maximum parallel SSH connections

Stop Workers

torc remote stop [workflow-id] [options]

If workflow-id is omitted, you'll be prompted to select a workflow interactively.

Options:

Option               Default   Description
--force              false     Send SIGKILL instead of SIGTERM
--max-parallel-ssh   10        Maximum parallel SSH connections

Collect Logs

torc remote collect-logs [workflow-id] [options]

If workflow-id is omitted, you'll be prompted to select a workflow interactively.

Options:

Option                   Default       Description
-l, --local-output-dir   remote_logs   Local directory for collected logs
--remote-output-dir      torc_output   Remote output directory
--delete                 false         Delete remote logs after successful collection
--max-parallel-ssh       10            Maximum parallel SSH connections

Example with deletion:

# Collect logs and clean up remote workers
torc remote collect-logs 42 --delete

Delete Logs

torc remote delete-logs [workflow-id] [options]

Delete the output directory from all remote workers without collecting logs first. Use collect-logs --delete if you want to save logs before deleting.

If workflow-id is omitted, you'll be prompted to select a workflow interactively.

Options:

Option                Default       Description
--remote-output-dir   torc_output   Remote output directory
--max-parallel-ssh    10            Maximum parallel SSH connections
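
Conceptually, this amounts to removing the output directory on each worker over SSH, roughly like the following (an illustrative sketch, not the exact command torc runs):

ssh worker1.example.com 'rm -rf torc_output'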

Typical Workflow

  1. Create a workflow:

    torc workflows create my_workflow.yaml
    
  2. Add workers:

    # From command line
    torc remote add-workers 42 worker1.example.com worker2.example.com
    
    # Or from file
    torc remote add-workers-from-file workers.txt 42
    
  3. Start workers:

    torc remote run 42
    
  4. Monitor status:

    torc remote status 42
    
  5. Collect logs when complete:

    torc remote collect-logs 42 -l ./logs
    

Or combine steps 2 and 3:

torc remote run 42 --workers workers.txt
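
The whole sequence can be scripted. A minimal sketch, assuming workflow ID 42, a workers.txt file, and a polling loop that watches worker status (the grep pattern is an assumption about the status output, not a documented interface; adjust it to what torc remote status actually prints):

#!/usr/bin/env bash
set -euo pipefail

# Add workers and start them in one step
torc remote run 42 --workers workers.txt

# Wait until no workers report as running; the "running" pattern is a guess
# about the status output format.
while torc remote status 42 | grep -qi running; do
    sleep 30
done

# Collect logs locally, then clean up the remote machines
torc remote collect-logs 42 -l ./logs --delete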

How It Works

  1. Version Check: Verifies that all remote machines run the same torc version as the local machine
  2. Worker Start: Uses nohup to start detached workers that survive SSH disconnection
  3. Job Execution: Each worker polls the server for available jobs and executes them locally
  4. Completion: Workers exit when the workflow is complete or canceled

The server coordinates job distribution. Multiple workers can safely poll the same workflow without double-allocating jobs.
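
The detached start in step 2 follows the standard nohup pattern over SSH, conceptually like this (the remote worker command is a placeholder; torc constructs the real invocation for you):

# Start a remote process that survives the SSH session ending.
# "<worker command>" stands in for the invocation torc generates.
ssh worker1.example.com 'nohup <worker command> > torc_output/worker.log 2>&1 < /dev/null &'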

SSH Configuration

Workers connect using SSH with these options:

  • ConnectTimeout=30 - 30 second connection timeout
  • BatchMode=yes - No password prompts (requires key-based auth)
  • StrictHostKeyChecking=accept-new - Accept new host keys automatically
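
Taken together, these options correspond to an ssh invocation like:

ssh -o ConnectTimeout=30 -o BatchMode=yes -o StrictHostKeyChecking=accept-new alice@worker2.example.com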

For custom SSH configuration, use ~/.ssh/config on the local machine:

Host worker1
    HostName worker1.example.com
    User alice
    Port 2222
    IdentityFile ~/.ssh/worker_key

Then reference the alias in your worker file:

worker1
worker2
worker3
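
Before starting a run, you can verify that every entry in the file accepts non-interactive SSH (a plain-shell sketch; it assumes entries without the :port suffix, since plain ssh does not parse that form):

grep -v '^[[:space:]]*#' workers.txt | grep -v '^[[:space:]]*$' | while read -r host; do
    # -n stops ssh from consuming the loop's stdin
    ssh -n -o BatchMode=yes -o ConnectTimeout=10 "$host" true \
        && echo "OK   $host" \
        || echo "FAIL $host"
done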

Resource Monitoring

If your workflow has resource monitoring enabled, each worker collects utilization data:

name: my_workflow
resource_monitor_config:
  enabled: true
  granularity: time_series
  sample_interval_seconds: 5

The collect-logs command retrieves these databases along with job logs.

Troubleshooting

No Workers Configured

No workers configured for workflow 42. Use 'torc remote add-workers' or '--workers' flag.

Add workers to the workflow using torc remote add-workers or the --workers flag on run.

Version Mismatch

Error: Version check failed on 2 worker(s):
  worker1: Version mismatch: local=0.7.0, worker1=0.6.5
  worker2: Version mismatch: local=0.7.0, worker2=0.6.5

Install the same torc version on all machines, or use --skip-version-check (not recommended for production).
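
To see which version each machine has installed, assuming torc is on the remote PATH and supports the conventional --version flag:

for host in worker1.example.com worker2.example.com; do
    printf '%s: ' "$host"
    ssh -n "$host" torc --version
done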

SSH Connection Failed

Error: SSH connectivity check failed for 1 worker(s):
  worker1: SSH connection failed to worker1: Permission denied (publickey)

Verify SSH key-based authentication works:

ssh worker1.example.com true
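
If that fails, rerun with verbose output to see which keys are offered and why authentication is rejected:

ssh -v worker1.example.com true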

Worker Died Immediately

[FAILED] worker1: Process died immediately. Last log:
  Error: connection refused...

The worker couldn't connect to the server. Check:

  1. Server is accessible from the remote machine
  2. Firewall allows connections on the server port
  3. The --url points to the correct server address
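
A quick reachability test from the worker's side (the server host and port below are placeholders for your actual torc server address):

ssh worker1.example.com 'curl -sS --max-time 5 http://torc-server.example.com:8080/ > /dev/null && echo reachable'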

Workers Not Claiming Jobs

If workers start but don't claim jobs:

  1. Check the workflow is initialized: torc workflows status <id>
  2. Check jobs are ready: torc jobs list <id>
  3. Check resource requirements match available resources

Comparison with Slurm

Feature               Remote Workers             Slurm
Scheduler required    No                         Yes
Resource allocation   Manual (worker file)       Automatic
Fault tolerance       Limited                    Full (job requeue)
Walltime limits       No                         Yes
Priority/queuing      No                         Yes
Best for              Ad-hoc clusters, testing   Production HPC

Security Considerations

  • Workers authenticate to the torc server (if authentication is enabled)
  • SSH keys should be properly secured
  • Workers run with the permissions of the SSH user on each machine
  • The torc server URL is passed to workers and visible in process lists