# How to use GPU acceleration

## RAPIDS Accelerator

The [NVIDIA RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/) is the recommended way to accelerate Spark workloads on NVIDIA GPUs. It offloads SQL and DataFrame operations to GPUs with no code changes: supported operators run on the GPU automatically, and unsupported ones fall back to the CPU.

1. Download the RAPIDS Accelerator jar from the [RAPIDS download page](https://nvidia.github.io/spark-rapids/docs/download.html). (Your local sparkctl administrator may already have done this.)

2. If needed, record its path when creating your settings file:

   ```console
   $ sparkctl default-config \
       --rapids-jar-file /path/to/rapids-4-spark_2.13-.jar \
       /path/to/spark \
       /path/to/java
   ```

3. Enable RAPIDS:

   ```console
   $ sparkctl configure --rapids
   ```

This enables GPU-aware scheduling behind the scenes and writes these settings to `spark-defaults.conf`:

- Plugin jar via `spark.jars`, plugin registration via `spark.plugins com.nvidia.spark.SQLPlugin`
- GPU-aware scheduling (`spark.worker.resource.gpu.*`, `spark.executor.resource.gpu.*`, `spark.task.resource.gpu.*`)
- `spark.rapids.sql.enabled`

```{eval-rst}
.. note::

   ``spark.jars`` is used instead of ``spark.{driver,executor}.extraClassPath`` so the RAPIDS jar does not conflict with the classpath entries the PostgreSQL Hive metastore sets.
```

### Verifying GPU acceleration

Not every operator is GPU-accelerated. Unsupported expressions, data types, and Python/Scala UDFs fall back to the CPU, and a query that bounces between CPU and GPU can be slower than staying on the CPU.
Before assuming a query is GPU-accelerated, ask RAPIDS what it actually placed on the GPU:

```python
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")  # log every operator that fell back
df.explain()  # "Gpu"-prefixed nodes (e.g. "GpuProject") run on the GPU; plain "Project"/"Filter" fell back to the CPU
```

### Tuning RAPIDS

Useful tuning knobs (set in your settings file's `spark_defaults` or at runtime):

- `spark.rapids.sql.concurrentGpuTasks` (sparkctl defaults to `1`): how many tasks share a GPU at once. Raising it to `2`–`4` can improve throughput if GPU memory allows.
- `spark.sql.files.maxPartitionBytes`: larger partitions (e.g. `512m`) give the GPU more work per task, which it prefers over many tiny tasks.
- Keep Adaptive Query Execution on (`spark.sql.adaptive.enabled true`, the Spark default).

### Custom GPU code (advanced)

If you call GPU libraries directly (e.g. CuPy, PyTorch, RAPIDS cuDF, or XGBoost) inside your tasks, RAPIDS does not apply. The section below explains how GPU scheduling works; once it is enabled, pin each task to its assigned GPU using the task context:

```python
from pyspark import TaskContext


def run_on_gpu(rows):
    ctx = TaskContext.get()
    gpu = ctx.resources()["gpu"].addresses[0]  # address(es) assigned to this task
    import cupy  # imported inside the task so it loads on the executor

    with cupy.cuda.Device(int(gpu)):
        ...  # your GPU work here
        yield from rows  # mapPartitions must return an iterable; yield your results instead


rdd.mapPartitions(run_on_gpu).collect()
```

Pin to the assigned address to keep two tasks on the same node from fighting over the same device. The `spark.task.resource.gpu.amount` value sparkctl writes controls how many tasks Spark will co-schedule on each GPU.

## Under the hood: GPU scheduling

### Executor sizing

`sparkctl configure --rapids` enables GPU-aware scheduling automatically. When GPU scheduling is enabled and you do not set `executor_cores` explicitly, sparkctl follows NVIDIA's recommended layout: **one executor per GPU**, with the node's usable cores divided evenly among them.
On a node with 4 GPUs and 64 cores you get 4 executors with ~15 cores each, so every GPU is used and each has a healthy pool of CPU cores to feed it (I/O, decompression, shuffle).

To use all *N* GPUs you therefore need at least *N* cores in the allocation; request cores generously (e.g. Slurm `--cpus-per-task` or `--exclusive`). If CPUs or memory only allow fewer executors than there are GPUs, sparkctl logs a warning that some GPUs will sit idle. Set `executor_cores` explicitly to override the automatic sizing.

Tune the GPU assignment through your settings file:

```toml
[runtime]
enable_gpus = true
gpus_per_node = 4
executor_gpu_amount = 1
task_gpu_amount = 0.25
# executor_cores = 16  # optional; omit to auto-size one executor per GPU
```

### GPU discovery and placement

sparkctl detects the number of GPUs per node from the compute environment (Slurm GPU variables such as `SLURM_GPUS_ON_NODE`, or `nvidia-smi` in a native environment). Override the count when detection is unavailable or incorrect:

```console
$ sparkctl configure --gpus --gpus-per-node 4
```

sparkctl writes a GPU discovery script (`$SPARK_CONF_DIR/get_gpus_resources.sh`) that Spark calls on each worker. The script reports the GPUs visible to Spark as JSON (`{"name": "gpu", "addresses": ["0", "1", ...]}`). It prefers `CUDA_VISIBLE_DEVICES` when set, which is essential when executors run in containers where `nvidia-smi` may be unavailable, and falls back to `nvidia-smi` otherwise.

```{eval-rst}
.. note::

   On a multi-node Slurm job every worker node must have the **same** number of GPUs, because sparkctl writes a single ``spark.worker.resource.gpu.amount`` for all workers. Request GPUs per node (``--gpus-per-node=4``) rather than as a job-wide total (``--gpus=4``), which Slurm can split unevenly across nodes. ``sparkctl configure`` fails fast if it detects a non-uniform distribution.
```

## Monitor GPU usage while a job runs

sparkctl's built-in resource monitor (`--resource-monitor`) only collects CPU, memory, disk, and network stats; **it does not capture GPU utilization**. Use NVIDIA's tools directly.

GPU work happens on the **worker/executor nodes**, so monitor there, not on the node where you launched the driver. From a login node, attach a second shell to the same Slurm allocation (substitute the job ID if `$SLURM_JOB_ID` is not set in that shell):

```console
$ srun --overlap --jobid=$SLURM_JOB_ID --nodes=1 --pty bash
```

Then use any of:

```console
$ nvidia-smi -l 1                  # full table, refreshed every second
$ nvidia-smi dmon -s pucvmet -d 1  # scrolling per-GPU metrics; best for watching a live job
$ nvtop                            # htop-style TUI, incl. per-process GPU memory (if available)
```

To log the whole run to a CSV for later analysis:

```console
$ nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory,memory.used,power.draw \
    --format=csv -l 1 > gpu_$(hostname).csv
```

To watch every node at once on a multi-node cluster (node names are in `conf/workers`):

```console
$ srun --overlap --jobid=$SLURM_JOB_ID --ntasks-per-node=1 nvidia-smi dmon -c 120 -d 1
```

Watch `utilization.gpu` while a query runs. Sustained high utilization means operators really are executing on the GPU; near-zero utilization while CPUs are busy means the work is falling back to the CPU. Cross-check with `spark.rapids.sql.explain` (see above).

## When are GPUs worth it?

GPUs are not a blanket speedup for Spark: they help some workloads dramatically and slow others down. Reach for GPUs when most of these hold:

- **Large data and heavy compute.** Multi-GB-to-TB scans with joins, aggregations, sorts, window functions, or `expand`/`hash`-heavy plans. The GPU's advantage grows with data volume; small jobs are dominated by launch and transfer overhead.
- **Columnar formats.** Parquet/ORC at scale (and large CSV inputs, although CSV is row-oriented), where RAPIDS can read and process columns directly on the GPU.
- **Operations RAPIDS supports.** Standard SQL/DataFrame expressions and supported types. Check with `spark.rapids.sql.explain` (above); a plan full of CPU fallbacks will not benefit.
- **ML training/inference** with GPU-native libraries (XGBoost, deep learning, RAPIDS cuML).

GPUs usually do **not** help, and can be slower or more expensive per result, when:

- The dataset is small or the job is short, so fixed GPU overhead dominates.
- The work is dominated by **Python/Scala UDFs**, complex regex, or other operators that fall back to the CPU (data must round-trip between CPU and GPU memory).
- The job is I/O- or shuffle-network-bound rather than compute-bound.
- Per-partition working sets exceed GPU memory, forcing spills.

```{eval-rst}
.. tip::

   Before committing a workload to GPUs, run NVIDIA's `Spark RAPIDS qualification tool `_ against the CPU run's event logs. It estimates the speedup (and flags unsupported operators) from a real run, which is more reliable than guessing.
```
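
After a GPU run, one way to judge whether the GPUs were actually kept busy is to summarize the `gpu_$(hostname).csv` log described under "Monitor GPU usage while a job runs". The following is a minimal sketch, not part of sparkctl: the function name `summarize_gpu_log` is hypothetical, the sample rows are made up for illustration, and the parsing assumes the header and unit suffixes (`87 %`, `30512 MiB`) that `nvidia-smi --format=csv` typically writes.

```python
import csv
import io
from collections import defaultdict


def summarize_gpu_log(lines):
    """Return {gpu_index: (mean_utilization_percent, peak_memory_mib)}.

    `lines` is any iterable of CSV lines (an open file works), assumed to be
    in the layout written by `nvidia-smi --query-gpu=... --format=csv`.
    """
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader)
    # Header cells carry units, e.g. "utilization.gpu [%]"; keep only the field name.
    names = [cell.split(" [")[0].strip() for cell in header]
    idx, util, mem = (names.index(n) for n in ("index", "utilization.gpu", "memory.used"))

    utils = defaultdict(list)
    peak_mem = defaultdict(float)
    for row in reader:
        if len(row) != len(names):
            continue  # skip truncated rows, e.g. a final partial write
        try:
            u = float(row[util].split()[0])  # "87 %" -> 87.0
            m = float(row[mem].split()[0])   # "30512 MiB" -> 30512.0
        except (ValueError, IndexError):
            continue  # skip "[N/A]" or otherwise unparsable samples
        gpu = row[idx].strip()
        utils[gpu].append(u)
        peak_mem[gpu] = max(peak_mem[gpu], m)
    return {g: (sum(v) / len(v), peak_mem[g]) for g, v in utils.items()}


# Made-up sample rows in the assumed nvidia-smi CSV layout.
SAMPLE = """\
timestamp, index, utilization.gpu [%], utilization.memory [%], memory.used [MiB], power.draw [W]
2025/01/01 12:00:00.000, 0, 90 %, 40 %, 20000 MiB, 250.00 W
2025/01/01 12:00:00.000, 1, 0 %, 0 %, 500 MiB, 60.00 W
2025/01/01 12:00:01.000, 0, 80 %, 35 %, 22000 MiB, 245.00 W
2025/01/01 12:00:01.000, 1, 2 %, 0 %, 500 MiB, 61.00 W
"""

summary = summarize_gpu_log(io.StringIO(SAMPLE))
for gpu, (mean_util, peak) in sorted(summary.items()):
    print(f"GPU {gpu}: mean utilization {mean_util:.1f}%, peak memory {peak:.0f} MiB")
```

To use it on a real log, pass an open file instead of the sample, for example `summarize_gpu_log(open("gpu_node01.csv"))` (the file name is illustrative). A GPU whose mean utilization stays near zero, like GPU 1 in the sample, suggests idle devices or CPU fallback and is worth cross-checking with `spark.rapids.sql.explain`.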