Deploy sparkctl in an HPC environment

This is a one-time operation to be performed by an administrator or a user with write access to a common location on the shared filesystem.

If you are a sparkctl user, you should not need to perform this step unless you want to use a custom version of Spark.

sparkctl requires Apache Spark and all of its dependencies to be installed on a shared filesystem that every compute node can access.

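At minimum, this includes Spark and Java. If your users want to use a PostgreSQL-based Hive metastore, you must also download Apache Hive, Hadoop, and the PostgreSQL JDBC driver jar. The example layout below also includes the NVIDIA RAPIDS Accelerator jar, which Spark uses for GPU acceleration.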

Here is an example filesystem layout:

/datasets/images/apache_spark
├── apache-hive-4.2.0-bin.tar.gz
├── hadoop-3.4.1/
├── jdk-21.0.7/
├── postgresql-42.7.4.jar
├── rapids-4-spark_2.13-26.04.2.jar
└── spark-4.1.1-bin-hadoop3/
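
Before generating the configuration, it can help to confirm that the layout is complete and visible from the node you are on. The following is a minimal sketch, not part of sparkctl: `check_layout` is a hypothetical helper name, and the entry names are copied from the example layout above, so adjust them if you install different versions. It treats Spark and Java as required and everything else as optional.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with sparkctl): verify that the entries
# from the example layout exist under a given root directory.
# Returns non-zero if a required entry (Spark or Java) is missing.
check_layout() {
    local root="$1" missing=0 entry

    # Required: Spark and Java are always needed.
    for entry in spark-4.1.1-bin-hadoop3 jdk-21.0.7; do
        if [ ! -d "${root}/${entry}" ]; then
            echo "missing (required): ${root}/${entry}" >&2
            missing=1
        fi
    done

    # Optional: only needed for the Hive metastore or GPU acceleration.
    for entry in \
        hadoop-3.4.1 \
        apache-hive-4.2.0-bin.tar.gz \
        postgresql-42.7.4.jar \
        rapids-4-spark_2.13-26.04.2.jar
    do
        if [ ! -e "${root}/${entry}" ]; then
            echo "not found (optional): ${root}/${entry}" >&2
        fi
    done

    return "$missing"
}
```

For example, `check_layout /datasets/images/apache_spark` prints nothing and succeeds when every entry is present. Running the same check from a compute node (for instance inside an interactive job) confirms that the shared filesystem is mounted there as well.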

URLs

Download locations will vary over time. Here is a set of permanent links to the specific software versions tested with Apache Spark v4.1.1:

https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz
https://download.oracle.com/java/21/archive/jdk-21.0.7_linux-x64_bin.tar.gz
https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/26.04.2/rapids-4-spark_2.13-26.04.2.jar
https://archive.apache.org/dist/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
https://archive.apache.org/dist/hive/hive-4.2.0/apache-hive-4.2.0-bin.tar.gz
https://jdbc.postgresql.org/download/postgresql-42.7.4.jar

To use a different version, substitute the version number in the path. For the Apache projects, use archive.apache.org/dist/..., which permanently hosts every release. The mirror hosts (downloads.apache.org and dlcdn.apache.org) and Oracle’s java/21/latest/ path only serve the current release, so a version-specific URL on those hosts returns a 404 once a newer version is published.
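
The version substitution described above can be sketched as a small shell helper. This is an illustration under stated assumptions: `apache_archive_url` and `fetch` are hypothetical names, and the URL patterns are inferred from the three Apache links listed above, so confirm that the file name (for example the `-bin-hadoop3` suffix) exists for the version you pick.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a version-pinned archive.apache.org URL,
# following the pattern of the links listed above.
apache_archive_url() {
    local project="$1" version="$2"
    case "$project" in
        spark)
            echo "https://archive.apache.org/dist/spark/spark-${version}/spark-${version}-bin-hadoop3.tgz" ;;
        hadoop)
            echo "https://archive.apache.org/dist/hadoop/common/hadoop-${version}/hadoop-${version}.tar.gz" ;;
        hive)
            echo "https://archive.apache.org/dist/hive/hive-${version}/apache-hive-${version}-bin.tar.gz" ;;
        *)
            echo "unknown project: ${project}" >&2
            return 1 ;;
    esac
}

# Hypothetical helper: download a URL into dest_dir unless the file is
# already there. Writes to a .part file first so that an interrupted
# transfer is never mistaken for a complete download.
fetch() {
    local url="$1" dest_dir="$2"
    local target="${dest_dir}/${url##*/}"
    if [ -e "$target" ]; then
        echo "already present: ${target}"
        return 0
    fi
    curl --fail --location --output "${target}.part" "$url" \
        && mv "${target}.part" "$target"
}
```

A download would then look like `fetch "$(apache_archive_url spark 4.1.1)" /datasets/images/apache_spark`. The tarballs that appear as directories in the example layout still need to be extracted afterwards.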

sparkctl configuration file

Given the filesystem layout above, this command creates a default sparkctl configuration file:

$ sparkctl default-config \
    /datasets/images/apache_spark/spark-4.1.1-bin-hadoop3 \
    /datasets/images/apache_spark/jdk-21.0.7 \
    --hadoop-path /datasets/images/apache_spark/hadoop-3.4.1 \
    --rapids-jar-file /datasets/images/apache_spark/rapids-4-spark_2.13-26.04.2.jar \
    --hive-tarball /datasets/images/apache_spark/apache-hive-4.2.0-bin.tar.gz \
    --postgresql-jar-file /datasets/images/apache_spark/postgresql-42.7.4.jar \
    --compute-environment slurm

By default sparkctl reads this file from ~/.sparkctl.toml. For a shared deployment, place it in a common location and point users at it with the SPARKCTL_SETTINGS_FILE environment variable:

$ export SPARKCTL_SETTINGS_FILE=/datasets/images/apache_spark/sparkctl.toml

sparkctl loads settings files in increasing order of precedence: ~/.sparkctl.toml, then the file named by SPARKCTL_SETTINGS_FILE, then a .sparkctl.toml in the current working directory. This lets a user override any site-wide default locally without touching the shared deployment.
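
When a setting does not take the value you expect, it can be useful to see which of these files exist for the current shell. The sketch below is a hypothetical debugging helper, not sparkctl's own loading code: it only mirrors the order described above and prints the files that are present, lowest precedence first, so the last line is the file whose values win.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of sparkctl): list the settings files that
# exist, in the load order described above. Later lines override earlier ones.
sparkctl_settings_candidates() {
    local f
    for f in \
        "${HOME}/.sparkctl.toml" \
        "${SPARKCTL_SETTINGS_FILE:-}" \
        "${PWD}/.sparkctl.toml"
    do
        # Skip the entry when the variable is unset or the file is absent.
        if [ -n "$f" ] && [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Note that when the current working directory is the home directory, the same file is listed twice, once for each role.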

Environment module (recommended)

On HPC systems that use Lmod or Environment Modules, you can wrap the steps above in an environment module so that users only need:

$ module load sparkctl
$ sparkctl configure --start

A ready-to-deploy modulefile (TCL and Lua flavors), an example shared settings file, and step-by-step instructions are provided in the hpc/environment_module directory of the repository. The module activates the shared virtual environment, sets SPARKCTL_SETTINGS_FILE, and puts spark-submit/pyspark on the user’s PATH.