How to use GPU acceleration

RAPIDS Accelerator

The NVIDIA RAPIDS Accelerator for Apache Spark is the recommended way to accelerate Spark workloads on NVIDIA GPUs. It offloads SQL and DataFrame operations to GPUs with no code changes — supported operators run on the GPU automatically, and unsupported ones fall back to CPUs.

  1. Download the RAPIDS Accelerator jar from the RAPIDS download page. (This may have already been done by your local sparkctl administrator.)

  2. If needed, record its path when creating your settings file:

    $ sparkctl default-config \
        --rapids-jar-file /path/to/rapids-4-spark_2.13-<version>.jar \
        /path/to/spark \
        /path/to/java
    
  3. Enable RAPIDS:

    $ sparkctl configure --rapids
    

This enables GPU-aware scheduling behind the scenes and writes these settings to spark-defaults.conf:

  • Plugin jar via spark.jars, and plugin registration by setting spark.plugins to com.nvidia.spark.SQLPlugin

  • GPU-aware scheduling (spark.worker.resource.gpu.*, spark.executor.resource.gpu.*, spark.task.resource.gpu.*)

  • spark.rapids.sql.enabled

Note

spark.jars is used instead of spark.{driver,executor}.extraClassPath so the RAPIDS jar does not conflict with the classpath entries the PostgreSQL Hive metastore sets.

Verifying GPU acceleration

Not every operator is GPU-accelerated. Unsupported expressions, data types, and Python/Scala UDFs fall back to the CPU, and a query that bounces between CPU and GPU can be slower than staying on the CPU. Before assuming a query is GPU-accelerated, ask RAPIDS what it actually placed on the GPU:

spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")  # log every operator that fell back
df.explain()  # operators prefixed with "Gpu" (e.g. GpuProject) are placed on the GPU; plain "Project"/"Filter" fell back to the CPU
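When a plan is long, it can help to summarize the fallbacks rather than read the tree by eye. The following is a minimal sketch that classifies operators in a plan's text by the "Gpu" name prefix. The function name split_plan_operators and the sample plan are hypothetical, and the tree-decoration pattern is an assumption about how Spark prints physical plans, so check it against your own df.explain() output.

```python
import re

# Skip tree decoration such as "+- ", ":  ", "*(1) " and capture the operator name.
_NODE = re.compile(r"^[\s:+\-|!*()\d]*([A-Za-z]\w*)")

def split_plan_operators(plan_text):
    """Return (gpu_ops, cpu_ops): operator names with and without the "Gpu" prefix."""
    gpu_ops, cpu_ops = [], []
    for line in plan_text.splitlines():
        # Ignore blank lines and section banners like "== Physical Plan ==".
        if not line.strip() or line.lstrip().startswith("=="):
            continue
        match = _NODE.match(line)
        if match is None:
            continue
        name = match.group(1)
        (gpu_ops if name.startswith("Gpu") else cpu_ops).append(name)
    return gpu_ops, cpu_ops

# Hypothetical plan text, hand-written for illustration only.
sample_plan = """== Physical Plan ==
GpuColumnarToRow false
+- GpuProject [id#0L]
   +- GpuRowToColumnar targetsize(1073741824)
      +- *(1) Filter my_udf(id#0L)
         +- *(1) Range (0, 100, step=1, splits=8)
"""

gpu_ops, cpu_ops = split_plan_operators(sample_plan)
print("GPU:", gpu_ops)
print("CPU:", cpu_ops)
```

A non-empty CPU list between GPU operators is the "bouncing" pattern described above, and the NOT_ON_GPU log lines give the reason for each fallback.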

Tuning RAPIDS

Useful tuning knobs (set in your settings file’s spark_defaults or at runtime):

  • spark.rapids.sql.concurrentGpuTasks (sparkctl defaults to 1): how many tasks share a GPU at once. Raising it to 2 to 4 can improve throughput if GPU memory allows.

  • spark.sql.files.maxPartitionBytes — larger partitions (e.g. 512m) give the GPU more work per task, which it prefers over many tiny tasks.

  • Keep Adaptive Query Execution on (spark.sql.adaptive.enabled true, the Spark default).
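To make the partition-size knob concrete, the sketch below estimates how many scan tasks a given input produces at two spark.sql.files.maxPartitionBytes values. It is plain ceiling division. Spark's real split planning also accounts for file boundaries and per-file open cost, so treat the numbers as rough. The function name approx_scan_tasks and the 1 TiB input are illustrative assumptions.

```python
import math

def approx_scan_tasks(total_bytes, max_partition_bytes):
    """Rough number of scan tasks: input size divided by partition size, rounded up."""
    return math.ceil(total_bytes / max_partition_bytes)

MIB = 1024 ** 2
TIB = 1024 ** 4

default_tasks = approx_scan_tasks(TIB, 128 * MIB)  # 128m is the Spark default
larger_tasks = approx_scan_tasks(TIB, 512 * MIB)   # the 512m example from the list above
print(default_tasks, larger_tasks)
```

Under these assumptions, moving from 128m to 512m cuts the task count by a factor of four, so each task hands the GPU four times as much data.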

Custom GPU code (advanced)

If you call GPU libraries directly (e.g. CuPy, PyTorch, RAPIDS cuDF, or XGBoost) inside your tasks, the RAPIDS Accelerator plugin does not apply: it only accelerates Spark SQL and DataFrame operators. The section below explains how GPU scheduling works; once enabled, pin each task to its assigned GPU using the task context:

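from pyspark import TaskContext

def run_on_gpu(rows):
    import cupy  # imported inside the task, so it is resolved on the executor
    ctx = TaskContext.get()
    gpu = ctx.resources()["gpu"].addresses[0]  # first address assigned to this task
    with cupy.cuda.Device(int(gpu)):
        for row in rows:
            yield row  # replace with your GPU work; mapPartitions must return an iterable

rdd.mapPartitions(run_on_gpu).collect()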

Pin to the assigned address to keep two tasks on the same node from fighting over the same device. The spark.task.resource.gpu.amount value sparkctl writes controls how many tasks Spark will co-schedule on each GPU (for example, 0.25 allows four tasks per GPU).

Under the hood: GPU scheduling

Executor sizing

sparkctl configure --rapids enables GPU-aware scheduling automatically. When GPU scheduling is enabled and you do not set executor_cores explicitly, sparkctl follows NVIDIA’s recommended layout: one executor per GPU, with the node’s usable cores divided evenly among them. On a node with 4 GPUs and 64 cores you get 4 executors with ~15 cores each, so every GPU is used and each has a healthy pool of CPU cores to feed it (I/O, decompression, shuffle). To use all N GPUs you therefore need at least N cores in the allocation; request cores generously (e.g. Slurm --cpus-per-task or --exclusive). If CPUs or memory only allow fewer executors than there are GPUs, sparkctl logs a warning that some GPUs will sit idle.

Set executor_cores explicitly to override this. Tune the GPU assignment through your settings file:

[runtime]
enable_gpus = true
gpus_per_node = 4
executor_gpu_amount = 1
task_gpu_amount = 0.25
# executor_cores = 16   # optional; omit to auto-size one executor per GPU
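The arithmetic behind these settings can be sketched as follows. This is an illustration of the layout rule described above and of Spark's general behavior that a task is scheduled only when both a CPU slot and a GPU share are free. It is not sparkctl's actual code. The function name gpu_layout and the figure of 60 usable cores are hypothetical, since how sparkctl derives usable cores is not covered here.

```python
def gpu_layout(usable_cores, gpus_per_node, executor_gpu_amount=1,
               task_gpu_amount=0.25, task_cpus=1):
    """Estimate per-node executor layout and task concurrency for GPU scheduling."""
    # One executor per GPU when executor_gpu_amount is 1.
    executors = gpus_per_node // executor_gpu_amount
    cores_per_executor = usable_cores // executors
    # How many tasks fit into an executor's GPU share; the epsilon guards float round-off.
    gpu_slots = int(executor_gpu_amount / task_gpu_amount + 1e-9)
    cpu_slots = cores_per_executor // task_cpus
    return {
        "executors_per_node": executors,
        "cores_per_executor": cores_per_executor,
        "gpu_task_slots_per_executor": gpu_slots,
        # Concurrency is bounded by whichever resource runs out first.
        "concurrent_tasks_per_executor": min(cpu_slots, gpu_slots),
    }

# Assumed example: 4 GPUs and 60 usable cores on the node.
layout = gpu_layout(usable_cores=60, gpus_per_node=4)
print(layout)
```

Under these assumptions each executor gets 15 cores but runs at most 4 tasks at once, because task_gpu_amount = 0.25 caps the GPU share. Lowering task_gpu_amount raises that cap.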

GPU discovery and placement

sparkctl detects the number of GPUs per node from the compute environment (Slurm GPU variables such as SLURM_GPUS_ON_NODE, or nvidia-smi in a native environment). Override the count when detection is unavailable or incorrect:

$ sparkctl configure --gpus --gpus-per-node 4

sparkctl writes a GPU discovery script ($SPARK_CONF_DIR/get_gpus_resources.sh) that Spark calls on each worker. The script reports the GPUs visible to Spark as JSON ({"name": "gpu", "addresses": ["0", "1", ...]}). It prefers CUDA_VISIBLE_DEVICES when set, which becomes essential when executors run in containers where nvidia-smi may be unavailable, and falls back to nvidia-smi otherwise.
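The discovery logic is easy to picture in a few lines. The sketch below is a Python illustration of the same preference order. It is not the shell script sparkctl generates, and the function names are hypothetical. The nvidia-smi fallback is injectable so the logic can be exercised on a machine without GPUs.

```python
import json
import os
import subprocess

def query_nvidia_smi():
    """List GPU indices as strings by asking nvidia-smi (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

def gpu_resource_json(env=None, fallback=query_nvidia_smi):
    """Build the JSON document Spark expects from a resource discovery script."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES", "").strip()
    if visible:
        # Prefer the devices the environment says are visible.
        addresses = [a.strip() for a in visible.split(",") if a.strip()]
    else:
        # Otherwise fall back to whatever nvidia-smi reports.
        addresses = fallback()
    return json.dumps({"name": "gpu", "addresses": addresses})

# Example with an explicit environment, so no GPU is needed to run it.
print(gpu_resource_json({"CUDA_VISIBLE_DEVICES": "2,3"}))
```

Spark parses this output on each worker and uses the addresses when assigning GPUs to executors and tasks.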

Note

On a multi-node Slurm job every worker node must have the same number of GPUs, because sparkctl writes a single spark.worker.resource.gpu.amount for all workers. Request GPUs per node (--gpus-per-node=4) rather than as a job-wide total (--gpus=4), which Slurm can split unevenly across nodes. sparkctl configure fails fast if it detects a non-uniform distribution.

Monitor GPU usage while a job runs

sparkctl’s built-in resource monitor (--resource-monitor) only collects CPU, memory, disk, and network stats — it does not capture GPU utilization. Use NVIDIA’s tools directly.

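GPU work happens on the worker/executor nodes, so monitor there, not on the node where you launched the driver. From a login node, attach a second shell to the same Slurm allocation (SLURM_JOB_ID is only set inside the job, so from outside it substitute the job ID shown by squeue):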

$ srun --overlap --jobid=$SLURM_JOB_ID --nodes=1 --pty bash

Then use any of:

$ nvidia-smi -l 1                     # full table, refreshed every second
$ nvidia-smi dmon -s pucvmet -d 1     # scrolling per-GPU metrics; best for watching a live job
$ nvtop                               # htop-style TUI, incl. per-process GPU memory (if available)

To log the whole run to a CSV for later analysis:

$ nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory,memory.used,power.draw \
    --format=csv -l 1 > gpu_$(hostname).csv
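For the later analysis, a few lines of Python are enough to summarize such a log. The sketch below computes the mean utilization.gpu per GPU index. It assumes the header names carry units (for example "utilization.gpu [%]") and that values look like "90 %", which is how nvidia-smi typically writes this CSV. The function name and the sample rows are hypothetical, and values such as "[N/A]" are not handled.

```python
import csv
import io
from collections import defaultdict

def mean_gpu_utilization(csv_text):
    """Return {gpu_index: mean utilization.gpu in percent} from an nvidia-smi CSV log."""
    # nvidia-smi separates fields with ", ", so skip the space after each comma.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    samples = defaultdict(list)
    for row in reader:
        # Match the column by prefix because the header includes the unit.
        util_key = next(k for k in row if k.startswith("utilization.gpu"))
        value = row[util_key].replace("%", "").strip()
        samples[row["index"].strip()].append(float(value))
    return {idx: sum(vals) / len(vals) for idx, vals in samples.items()}

# Hand-written sample rows for illustration only.
sample_log = """timestamp, index, utilization.gpu [%], utilization.memory [%], memory.used [MiB], power.draw [W]
2024/01/01 12:00:00.000, 0, 90 %, 40 %, 10240 MiB, 250.00 W
2024/01/01 12:00:00.000, 1, 0 %, 0 %, 3 MiB, 60.00 W
2024/01/01 12:00:01.000, 0, 80 %, 38 %, 10240 MiB, 245.00 W
2024/01/01 12:00:01.000, 1, 10 %, 1 %, 3 MiB, 61.00 W
"""

means = mean_gpu_utilization(sample_log)
print(means)
```

To use it on a real log, pass the file contents, for example mean_gpu_utilization(open(path).read()), where path is your own CSV file.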

To watch every node at once on a multi-node cluster (node names are in conf/workers):

$ srun --overlap --jobid=$SLURM_JOB_ID --ntasks-per-node=1 nvidia-smi dmon -c 120 -d 1

Watch utilization.gpu while a query runs. Sustained high utilization means operators really are executing on the GPU; near-zero utilization while CPUs are busy means the work is falling back to the CPU — cross-check with spark.rapids.sql.explain (see above).

When are GPUs worth it?

GPUs are not a blanket speedup for Spark — they help some workloads dramatically and slow others down. Reach for GPUs when most of these hold:

  • Large data and heavy compute. Multi-GB-to-TB scans with joins, aggregations, sorts, window functions, or expand- or hash-heavy plans. The GPU’s advantage grows with data volume; small jobs are dominated by launch and transfer overhead.

  • Columnar formats. Parquet or ORC at scale, where RAPIDS can read and process columns directly on the GPU. CSV can also be read, but it is row-oriented text rather than a columnar format.

  • Operations RAPIDS supports. Standard SQL/DataFrame expressions and supported types. Check with spark.rapids.sql.explain (above) — a plan full of CPU fallbacks will not benefit.

  • ML training/inference with GPU-native libraries (XGBoost, deep learning, RAPIDS cuML).

GPUs usually do not help, and can be slower or more expensive per result, when:

  • The dataset is small or the job is short — fixed GPU overhead dominates.

  • The work is dominated by Python/Scala UDFs, complex regex, or other operators that fall back to the CPU (data must round-trip between CPU and GPU memory).

  • The job is I/O- or shuffle-network-bound rather than compute-bound.

  • Per-partition working sets exceed GPU memory, forcing spills.

Tip

Before committing a workload to GPUs, run NVIDIA’s Spark RAPIDS qualification tool against the CPU run’s event logs. It estimates the speedup (and flags unsupported operators) from a real run, which is more reliable than guessing.