Deploy sparkctl in an HPC environment

This is a one-time operation to be performed by an administrator or a user with write access to a common location on the shared filesystem.

If you are a sparkctl user, you should not need to perform this step unless you want to use a custom version of Spark.


sparkctl requires Apache Spark and all of its dependencies to be installed on a shared filesystem that every compute node can access.

At minimum, this means Spark and Java. If your users want to use a PostgreSQL-based Hive metastore, you must also download Apache Hive, Hadoop, and the PostgreSQL JDBC driver jar file.

Here is an example filesystem layout:

/datasets/images/apache_spark
├── apache-hive-4.2.0-bin.tar.gz
├── hadoop-3.4.1/
├── jdk-21.0.7/
├── postgresql-42.7.4.jar
├── rapids-4-spark_2.13-26.04.2.jar
└── spark-4.1.1-bin-hadoop3/
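
Before generating a configuration file, it can help to confirm that the expected entries are actually present. The following is a minimal sketch, not part of sparkctl: the function name check_layout is hypothetical, and the entry names are copied from the example layout above. It treats Spark and Java as required and everything else as optional.

```shell
#!/usr/bin/env bash
# Sketch: report which entries of the example layout are missing under a base directory.
# check_layout is a hypothetical helper, not a sparkctl command.

check_layout() {
    local base="$1" status=0 entry
    # Required in this sketch: Spark and Java, extracted into directories.
    for entry in spark-4.1.1-bin-hadoop3 jdk-21.0.7; do
        if [ ! -d "$base/$entry" ]; then
            echo "missing (required): $base/$entry"
            status=1
        fi
    done
    # Optional in this sketch: everything beyond Spark and Java.
    for entry in hadoop-3.4.1 apache-hive-4.2.0-bin.tar.gz \
                 postgresql-42.7.4.jar rapids-4-spark_2.13-26.04.2.jar; do
        if [ ! -e "$base/$entry" ]; then
            echo "missing (optional): $base/$entry"
        fi
    done
    # Non-zero only when a required entry is missing.
    return "$status"
}
```

For example, running check_layout /datasets/images/apache_spark prints one line per missing entry and returns a non-zero status only if Spark or Java is absent. Adjust the entry names if you deploy different versions.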

URLs

Download locations change over time. The following permanent links point to the specific software versions tested with Apache Spark v4.1.1:

  • https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz

  • https://download.oracle.com/java/21/archive/jdk-21.0.7_linux-x64_bin.tar.gz

  • https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/26.04.2/rapids-4-spark_2.13-26.04.2.jar

  • https://archive.apache.org/dist/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz

  • https://archive.apache.org/dist/hive/hive-4.2.0/apache-hive-4.2.0-bin.tar.gz

  • https://jdbc.postgresql.org/download/postgresql-42.7.4.jar

To use a different version, substitute the version number in the path. For the Apache projects, use archive.apache.org/dist/..., which permanently hosts every release. The mirror hosts (downloads.apache.org and dlcdn.apache.org) and Oracle’s java/21/latest/ path only serve the current release, so a version-specific URL on those hosts returns a 404 once a newer version is published.
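
The version substitution described above can be sketched as a few shell functions that build the archive.apache.org URLs from a version number. This is an illustration under one assumption: that other releases follow the same path and file-name pattern as the tested versions (for example, the -bin-hadoop3 suffix on the Spark tarball), which may not hold for every release. The function and variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: derive archive.apache.org download URLs from version numbers.
# Assumption: other releases follow the same path pattern as the tested ones.

SPARK_VERSION="${SPARK_VERSION:-4.1.1}"
HADOOP_VERSION="${HADOOP_VERSION:-3.4.1}"
HIVE_VERSION="${HIVE_VERSION:-4.2.0}"

APACHE_ARCHIVE="https://archive.apache.org/dist"

spark_url()  { echo "$APACHE_ARCHIVE/spark/spark-$1/spark-$1-bin-hadoop3.tgz"; }
hadoop_url() { echo "$APACHE_ARCHIVE/hadoop/common/hadoop-$1/hadoop-$1.tar.gz"; }
hive_url()   { echo "$APACHE_ARCHIVE/hive/hive-$1/apache-hive-$1-bin.tar.gz"; }

# Print one URL per line so the list can be reviewed before downloading.
apache_urls() {
    spark_url "$SPARK_VERSION"
    hadoop_url "$HADOOP_VERSION"
    hive_url "$HIVE_VERSION"
}

apache_urls
```

To download the files into the current directory, the output can be piped to a downloader, for example: apache_urls | xargs -n 1 curl -fLO. The -f option makes curl fail on an HTTP error such as a 404 instead of saving the error page.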

sparkctl configuration file

The following command creates a default sparkctl configuration file for the filesystem layout shown above:

$ sparkctl default-config \
    /datasets/images/apache_spark/spark-4.1.1-bin-hadoop3 \
    /datasets/images/apache_spark/jdk-21.0.7 \
    --hadoop-path /datasets/images/apache_spark/hadoop-3.4.1 \
    --rapids-jar-file /datasets/images/apache_spark/rapids-4-spark_2.13-26.04.2.jar \
    --hive-tarball /datasets/images/apache_spark/apache-hive-4.2.0-bin.tar.gz \
    --postgresql-jar-file /datasets/images/apache_spark/postgresql-42.7.4.jar \
    --compute-environment slurm

By default, sparkctl reads this file from ~/.sparkctl.toml. For a shared deployment, place it in a common location and point users at it with the SPARKCTL_SETTINGS_FILE environment variable:

$ export SPARKCTL_SETTINGS_FILE=/datasets/images/apache_spark/sparkctl.toml

sparkctl loads settings files in increasing order of precedence: ~/.sparkctl.toml, then the file named by SPARKCTL_SETTINGS_FILE, then a .sparkctl.toml in the current working directory. This lets a user override any site-wide default locally without touching the shared deployment.
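
When a setting does not take the value you expect, it can be useful to see which settings files are in play. The sketch below is an illustration of the load order described above, not sparkctl's own code; the function name settings_files_in_order is hypothetical. It lists the candidate files that exist, lowest precedence first.

```shell
#!/usr/bin/env bash
# Illustration only: list the settings files that exist, lowest precedence first.
# This mirrors the documented load order; it is not how sparkctl is implemented.

settings_files_in_order() {
    local candidate
    for candidate in "$HOME/.sparkctl.toml" \
                     "${SPARKCTL_SETTINGS_FILE:-}" \
                     "$PWD/.sparkctl.toml"; do
        # Skip an unset SPARKCTL_SETTINGS_FILE and files that do not exist.
        if [ -n "$candidate" ] && [ -f "$candidate" ]; then
            echo "$candidate"
        fi
    done
}

settings_files_in_order
```

The last line printed is the file whose values take precedence. Note that when the current working directory is the home directory, the same file is listed twice.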