CLI Reference

sparkctl

sparkctl commands

Usage

sparkctl [OPTIONS] COMMAND [ARGS]...

Options

-c, --console-level <console_level>

Console log level

Default:: 'INFO'

-f, --file-level <file_level>

File log level

Default:: 'DEBUG'

-r, --reraise-exceptions

Reraise unhandled sparkctl exceptions.

Default:: False
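The global options above go before the subcommand, as the Usage line shows. As a minimal sketch, the helper below forwards any subcommand with more verbose console logging and with exceptions re-raised. The name sparkctl_debug is hypothetical (it is not part of sparkctl), and the sketch assumes DEBUG is accepted by --console-level, as it is the default for --file-level.

```shell
# Hypothetical helper, not part of sparkctl: run any subcommand with
# verbose console logging and re-raised exceptions for debugging.
# Assumes DEBUG is a valid console level (it is the --file-level default).
sparkctl_debug() {
    sparkctl --console-level DEBUG --reraise-exceptions "$@"
}
```

For example, `sparkctl_debug start --wait` would run `sparkctl --console-level DEBUG --reraise-exceptions start --wait`.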

clean

Delete all Spark runtime files in the directory.

Stop the cluster before cleaning. By default this refuses to run while a cluster appears to be running, since it deletes the state needed to stop it; pass --force to override.

This also deletes the configured spark_scratch directory recursively, even when it is located outside the base configuration directory. Point spark_scratch at a dedicated directory.

Usage

sparkctl clean [OPTIONS] DIRECTORY

Options

--force, --no-force

Clean even if a cluster appears to be running. By default clean refuses in that case because it would delete the files needed to stop the cluster.

Default:: False

Arguments

DIRECTORY: Required argument
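Because clean removes the state needed to stop a cluster, it can help to chain the two commands. The sketch below is one way to do that; stop_then_clean is a hypothetical wrapper name, and it assumes sparkctl stop exits with a non-zero status when it fails.

```shell
# Hypothetical wrapper, not part of sparkctl: stop the cluster, then clean
# its directory only if the stop succeeded.
# Assumes `sparkctl stop` returns a non-zero exit status on failure.
stop_then_clean() {
    local dir="${1:-.}"
    sparkctl stop --directory "$dir" && sparkctl clean "$dir"
}
```

Calling `stop_then_clean ./my-spark-config` runs `sparkctl stop --directory ./my-spark-config` followed by `sparkctl clean ./my-spark-config`. Remember that clean also deletes the configured spark_scratch directory.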

configure

Create a Spark cluster configuration.

Usage

sparkctl configure [OPTIONS]

Options

-d, --directory <directory>

Base directory for the cluster configuration

Default:: PosixPath('.')

-s, --spark-scratch <spark_scratch>

Directory to use for shuffle data. Use a dedicated directory: sparkctl clean deletes it recursively, even when it is outside the base configuration directory.

Default:: PosixPath('spark_scratch')

-e, --executor-cores <executor_cores>: Number of cores per executor. By default this is auto-determined: when GPUs are enabled, sparkctl runs one executor per GPU and divides the node’s cores evenly among them (the NVIDIA-recommended layout); otherwise it defaults to 5.
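As a worked example of the GPU layout described above, the arithmetic below uses assumed node sizes (64 cores and 4 GPUs are illustrative values, not sparkctl defaults). How sparkctl rounds when the cores do not divide evenly is not stated here, so the integer division is an assumption.

```shell
# Illustrative numbers only; substitute your own node specification.
node_cores=64      # assumed CPU cores per worker node
gpus_per_node=4    # assumed GPUs per worker node

# One executor per GPU, with the node's cores split evenly among them.
# Integer division is an assumption about how uneven splits are handled.
executors_per_node=$gpus_per_node
executor_cores=$(( node_cores / gpus_per_node ))

echo "executors per node: $executors_per_node, cores per executor: $executor_cores"
```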

-E, --executor-memory-gb <executor_memory_gb>: Memory per executor in GB. By default this is auto-determined from the memory available. It can also be set implicitly by increasing executor_cores.

-M, --driver-memory-gb <driver_memory_gb>

Driver memory in GB. This is the maximum amount of data that can be pulled into the application.

Default:: 10

-o, --node-memory-overhead-gb <node_memory_overhead_gb>

Memory to reserve for system processes.

Default:: 10

--dynamic-allocation, --no-dynamic-allocation

Enable Spark dynamic resource allocation.

Default:: False

-m, --shuffle-partition-multiplier <shuffle_partition_multiplier>

Spark SQL shuffle partition multiplier (multiplied by the number of worker CPUs)

Default:: 1
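To make the multiplier concrete, the sketch below works through the arithmetic for an assumed cluster of 4 worker nodes with 36 cores each (illustrative values). It assumes the result is what ends up in Spark's spark.sql.shuffle.partitions setting and that every core on a node counts as a worker CPU; neither detail is confirmed by this reference.

```shell
# Illustrative numbers only; substitute your own cluster size.
worker_nodes=4       # assumed number of worker nodes
cores_per_node=36    # assumed CPU cores per worker node
multiplier=4         # value passed to --shuffle-partition-multiplier

# Shuffle partitions = multiplier x total worker CPUs.
worker_cpus=$(( worker_nodes * cores_per_node ))
shuffle_partitions=$(( multiplier * worker_cpus ))

echo "worker CPUs: $worker_cpus, shuffle partitions: $shuffle_partitions"
```

With the default multiplier of 1, the same cluster would get one shuffle partition per worker CPU.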

-t, --spark-defaults-template-file <spark_defaults_template_file>: Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.

--local-storage, --no-local-storage

Use compute node local storage for shuffle data.

Default:: False

--connect-server, --no-connect-server

Enable the Spark Connect server.

Default:: False

--connect-server-port <connect_server_port>

Port on which the Spark Connect server listens.

Default:: 15002

--history-server, --no-history-server

Enable the Spark history server.

Default:: False

--thrift-server, --no-thrift-server

Enable the Thrift server to connect a SQL client.

Default:: False

--jupyter, --no-jupyter

Start a Jupyter server on the master node. Pre-wired to the Spark Connect server when it is enabled (the notebook’s SparkSession connects automatically).

Default:: False

--jupyter-command <jupyter_command>

Jupyter frontend to launch, i.e. the jupyter <command> subcommand. Defaults to the classic ‘notebook’; use ‘lab’ for JupyterLab.

Default:: 'notebook'

--jupyter-ip <jupyter_ip>

IP address the Jupyter server binds to. Defaults to all interfaces so it can be reached by tunneling to the compute node’s hostname through a login node (the common HPC pattern); access is protected by Jupyter’s token. Set to 127.0.0.1 to bind to localhost only, which requires tunneling directly into the compute node.

Default:: '0.0.0.0'

--jupyter-port <jupyter_port>

Port on which the Jupyter server listens.

Default:: 8889
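The tunneling pattern mentioned for --jupyter-ip can be sketched with standard SSH local port forwarding. The helper below only prints the command so it can be reviewed first; jupyter_tunnel_cmd is a hypothetical name, and the host names in the usage note are placeholders.

```shell
# Hypothetical helper, not part of sparkctl: print an SSH command that
# forwards a local port through a login node to the Jupyter server on a
# compute node. The port defaults to 8889, the --jupyter-port default.
jupyter_tunnel_cmd() {
    local compute_node="$1" login_host="$2" port="${3:-8889}"
    # -N: do not run a remote command; -L: forward local port to compute node
    echo "ssh -N -L ${port}:${compute_node}:${port} ${login_host}"
}
```

For example, `jupyter_tunnel_cmd node123 user@login.example.org` prints `ssh -N -L 8889:node123:8889 user@login.example.org`. After running that command on your workstation, the notebook should be reachable at localhost on port 8889, using the token that Jupyter prints.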

--reverse-proxy, --no-reverse-proxy

Run the Spark master as a reverse proxy for the worker and application web UIs. Useful on HPC clusters where the compute nodes are not directly reachable, so the UIs are served through the master node only.

Default:: False

--reverse-proxy-url <reverse_proxy_url>: External URL used to reach the Spark master UI when reverse proxy is enabled and the master is itself behind another front-end proxy. Leave unset to serve relative links (recommended when reaching the master through an SSH tunnel).

--prometheus, --no-prometheus

Expose Spark metrics in Prometheus format through the existing web UI ports (no extra ports are opened).

Default:: False

--metrics-csv, --no-metrics-csv

Write Spark metrics to CSV files in <base>/metrics-csv. Unlike the Prometheus sink, this leaves a durable record on disk after the cluster shuts down.

Default:: False

--metrics-csv-period <metrics_csv_period>

Interval in seconds at which the CSV metrics sink writes samples.

Default:: 10

--gpus, --no-gpus

Enable GPU-aware scheduling. Spark workers advertise GPUs and executors/tasks request them. Requires GPUs on the worker nodes.

Default:: False

--gpus-per-node <gpus_per_node>: Number of GPUs available on each worker node. Auto-detected from the compute environment by default.

--rapids, --no-rapids

Enable the NVIDIA RAPIDS Accelerator for Apache Spark to offload SQL/DataFrame operations to GPUs. Implies enable_gpus and requires binaries.rapids_jar_file.

Default:: False

-l, --spark-log-level <spark_log_level>

Set the root log level for all Spark processes. Defaults to Spark’s defaults.

Options:: debug | info | warn | error

--hive-metastore, --no-hive-metastore

Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.

Default:: False

--postgres-hive-metastore, --no-postgres-hive-metastore

Create a metastore with PostgreSQL. Supports multiple Spark sessions.

Default:: False

-w, --metastore-dir <metastore_dir>

Set a custom directory for the metastore and warehouse.

Default:: PosixPath('.')

-P, --python-path <python_path>: Python path to set for Spark workers. Use the Python inside the Spark distribution by default.

--resource-monitor, --no-resource-monitor

Enable resource monitoring.

Default:: False

--start, --no-start

Start the cluster after configuration.

Default:: False

--use-current-python, --no-use-current-python

Use the Python executable in the current environment for Spark workers. --python-path takes precedence.

Default:: True

Examples:

$ sparkctl configure --start

$ sparkctl configure --shuffle-partition-multiplier 4 --local-storage

$ sparkctl configure --local-storage --thrift-server
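Several of the options above are designed to be combined. As a sketch, the function below configures and starts a cluster with the Spark Connect server and a JupyterLab frontend, so that the notebook is pre-wired to Spark Connect as described for --jupyter. The function name configure_notebook_cluster is hypothetical; all flags are taken from the option list above.

```shell
# Hypothetical wrapper, not part of sparkctl: configure and start a cluster
# with Spark Connect plus JupyterLab in the given base directory.
configure_notebook_cluster() {
    local dir="${1:-.}"
    sparkctl configure \
        --directory "$dir" \
        --connect-server \
        --jupyter \
        --jupyter-command lab \
        --start
}
```

Usage: `configure_notebook_cluster ./my-spark-config`.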

default-config

Create a sparkctl config file that defines paths to Spark binaries. This is a one-time requirement when installing sparkctl in a new environment.

Usage

sparkctl default-config [OPTIONS] SPARK_PATH JAVA_PATH

Options

-d, --directory <directory>

Directory in which to create the sparkctl config file.

Default:: PosixPath('/home/runner')

-e, --compute-environment <compute_environment>

Compute environment

Options:: native | slurm

-H, --hadoop-path <hadoop_path>: Directory containing Hadoop binaries.

-h, --hive-tarball <hive_tarball>: File containing Hive binaries.

-p, --postgresql-jar-file <postgresql_jar_file>: Path to PostgreSQL jar file.

-R, --rapids-jar-file <rapids_jar_file>: Path to the NVIDIA RAPIDS Accelerator for Apache Spark jar file. Only required to enable RAPIDS GPU acceleration.

Arguments

SPARK_PATH: Required argument

JAVA_PATH: Required argument

Examples:

$ sparkctl default-config \
    /datasets/images/apache-spark/spark-4.1.1-bin-hadoop3 \
    /datasets/images/apache-spark/jdk-21.0.7 \
    -e slurm

$ sparkctl default-config ~/apache-spark/spark-4.1.1-bin-hadoop3 ~/jdk-21.0.8 -e native
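Since default-config records paths to the Spark and Java installations, a mistyped path is an easy mistake to make. The sketch below checks that both paths exist before calling the command. The name make_default_config is hypothetical, and whether sparkctl validates these paths itself is not stated in this reference.

```shell
# Hypothetical wrapper, not part of sparkctl: verify that the Spark and Java
# paths exist, then create the sparkctl config file.
make_default_config() {
    local spark_path="$1" java_path="$2" compute_env="${3:-native}"
    local path
    for path in "$spark_path" "$java_path"; do
        if [ ! -e "$path" ]; then
            echo "path not found: $path" >&2
            return 1
        fi
    done
    sparkctl default-config "$spark_path" "$java_path" -e "$compute_env"
}
```

Usage: `make_default_config ~/apache-spark/spark-4.1.1-bin-hadoop3 ~/jdk-21.0.8 native`.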

start

Start a Spark cluster with an existing configuration.

Usage

sparkctl start [OPTIONS]

Options

--wait, --no-wait

If True, wait until the user presses Ctrl-C or the timeout is reached, and then stop the cluster. If False, start the cluster and exit.

Default:: False

-d, --directory <directory>

Base directory for the cluster configuration

Default:: PosixPath('.')

-t, --timeout <timeout>: If --wait is set, timeout in minutes. Defaults to no timeout.

Examples:

$ sparkctl start

$ sparkctl start --directory ./my-spark-config

$ sparkctl start --wait
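The --wait and --timeout options combine to keep a cluster up for a bounded time, for instance inside a batch job. The sketch below wraps that combination; run_cluster_for is a hypothetical name, and the timeout is given in minutes as documented above.

```shell
# Hypothetical wrapper, not part of sparkctl: start the cluster and block
# until Ctrl-C or until the given number of minutes has passed; per the
# --wait description, sparkctl then stops the cluster.
run_cluster_for() {
    local dir="$1" minutes="$2"
    sparkctl start --directory "$dir" --wait --timeout "$minutes"
}
```

Usage: `run_cluster_for ./my-spark-config 120` keeps the cluster running for up to two hours.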

stop

Stop a Spark cluster.

Usage

sparkctl stop [OPTIONS]

Options

-d, --directory <directory>

Base directory for the cluster configuration

Default:: PosixPath('.')

Examples:

$ sparkctl stop

$ sparkctl stop --directory ./my-spark-config
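A common pattern is to start a cluster, run one job, and stop the cluster whether or not the job succeeded. The sketch below shows one way to do this; with_cluster is a hypothetical helper, and it assumes sparkctl start exits with a non-zero status when the cluster fails to start.

```shell
# Hypothetical helper, not part of sparkctl: start the cluster, run the given
# command, always stop the cluster afterwards, and return the command's status.
# Assumes `sparkctl start` returns a non-zero exit status on failure.
with_cluster() {
    local dir="$1" status
    shift
    sparkctl start --directory "$dir" || return 1
    "$@"
    status=$?
    sparkctl stop --directory "$dir"
    return "$status"
}
```

Usage: `with_cluster ./my-spark-config python my_job.py`, where my_job.py is a placeholder for your own application.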