CLI Reference
sparkctl
sparkctl commands
Usage
sparkctl [OPTIONS] COMMAND [ARGS]...
Options
- -c, --console-level <console_level>
Console log level
- Default:
'INFO'
- -f, --file-level <file_level>
File log level
- Default:
'DEBUG'
- -r, --reraise-exceptions
Reraise unhandled sparkctl exceptions.
- Default:
False
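Per the usage line above, these global options come before the subcommand. A minimal Python sketch of assembling such an invocation follows. The helper name sparkctl_argv is hypothetical and not part of sparkctl, and running the result assumes the sparkctl executable is on PATH.

```python
def sparkctl_argv(command, *args, console_level=None, file_level=None, reraise=False):
    """Build an argv list with the global options placed before the subcommand.

    Hypothetical helper for illustration; it is not part of sparkctl.
    """
    argv = ["sparkctl"]
    if console_level is not None:
        argv += ["--console-level", console_level]
    if file_level is not None:
        argv += ["--file-level", file_level]
    if reraise:
        argv.append("--reraise-exceptions")
    return argv + [command, *args]


argv = sparkctl_argv("configure", "--start", console_level="DEBUG")
print(" ".join(argv))  # sparkctl --console-level DEBUG configure --start
# To execute it (assumes sparkctl is installed):
#   import subprocess; subprocess.run(argv, check=True)
```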
clean
Delete all Spark runtime files in the directory.
Stop the cluster before cleaning. By default this refuses to run while a cluster appears to be running, since it deletes the state needed to stop it; pass --force to override.
This also deletes the configured spark_scratch directory recursively, even when it is located outside the base configuration directory. Point spark_scratch at a dedicated directory.
Usage
sparkctl clean [OPTIONS] DIRECTORY
Options
- --force, --no-force
Clean even if a cluster appears to be running. By default clean refuses in that case because it would delete the files needed to stop the cluster.
- Default:
False
Arguments
- DIRECTORY
Required argument
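Because clean removes the spark_scratch directory recursively, it may be worth checking the configured path before running it. The sketch below is one possible pre-flight check in Python, not something sparkctl provides. The function name and the criteria for "dedicated" (not the base directory, not one of its ancestors, not the home directory, not a filesystem root) are assumptions made for illustration.

```python
from pathlib import Path


def scratch_is_dedicated(scratch, base):
    """Return False when recursively deleting `scratch` would hit more than shuffle data.

    Hypothetical check, not part of sparkctl. It rejects a scratch path that
    is the base directory, an ancestor of it, the home directory, or a
    filesystem root.
    """
    scratch = Path(scratch).resolve()
    base = Path(base).resolve()
    if scratch == base or scratch in base.parents:
        return False
    if scratch == Path.home().resolve() or scratch == Path(scratch.anchor):
        return False
    return True


print(scratch_is_dedicated("/tmp/cluster/spark_scratch", "/tmp/cluster"))  # True
print(scratch_is_dedicated("/tmp", "/tmp/cluster"))  # False
```

A subdirectory of the base directory, as in the default configuration, passes the check; a shared parent such as /tmp does not.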
configure
Create a Spark cluster configuration.
Usage
sparkctl configure [OPTIONS]
Options
- -d, --directory <directory>
Base directory for the cluster configuration
- Default:
PosixPath('.')
- -s, --spark-scratch <spark_scratch>
Directory to use for shuffle data. Use a dedicated directory: sparkctl clean deletes it recursively, even when it is outside the base configuration directory.
- Default:
PosixPath('spark_scratch')
- -e, --executor-cores <executor_cores>
Number of cores per executor. By default this is auto-determined: when GPUs are enabled, sparkctl runs one executor per GPU and divides the node’s cores evenly among them (the NVIDIA-recommended layout); otherwise it defaults to 5.
- -E, --executor-memory-gb <executor_memory_gb>
Memory per executor in GB. By default this is auto-determined from the memory that is available. It can also be set implicitly by increasing executor_cores.
- -M, --driver-memory-gb <driver_memory_gb>
Driver memory in GB. This is the maximum amount of data that can be pulled into the application.
- Default:
10
- -o, --node-memory-overhead-gb <node_memory_overhead_gb>
Memory to reserve for system processes.
- Default:
10
- --dynamic-allocation, --no-dynamic-allocation
Enable Spark dynamic resource allocation.
- Default:
False
- -m, --shuffle-partition-multiplier <shuffle_partition_multiplier>
Spark SQL shuffle partition multiplier (multiplied by the number of worker CPUs)
- Default:
1
- -t, --spark-defaults-template-file <spark_defaults_template_file>
Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.
- --local-storage, --no-local-storage
Use compute node local storage for shuffle data.
- Default:
False
- --connect-server, --no-connect-server
Enable the Spark Connect server.
- Default:
False
- --connect-server-port <connect_server_port>
Port on which the Spark Connect server listens.
- Default:
15002
- --history-server, --no-history-server
Enable the Spark history server.
- Default:
False
- --thrift-server, --no-thrift-server
Enable the Thrift server to connect a SQL client.
- Default:
False
- --jupyter, --no-jupyter
Start a Jupyter server on the master node. Pre-wired to the Spark Connect server when it is enabled (the notebook’s SparkSession connects automatically).
- Default:
False
- --jupyter-command <jupyter_command>
Jupyter frontend to launch, i.e. the jupyter <command> subcommand. Defaults to the classic ‘notebook’; use ‘lab’ for JupyterLab.
- Default:
'notebook'
- --jupyter-ip <jupyter_ip>
IP address the Jupyter server binds to. Defaults to all interfaces so it can be reached by tunneling to the compute node’s hostname through a login node (the common HPC pattern); access is protected by Jupyter’s token. Set to 127.0.0.1 to bind to localhost only, which requires tunneling directly into the compute node.
- Default:
'0.0.0.0'
- --jupyter-port <jupyter_port>
Port on which the Jupyter server listens.
- Default:
8889
- --reverse-proxy, --no-reverse-proxy
Run the Spark master as a reverse proxy for the worker and application web UIs. Useful on HPC clusters where the compute nodes are not directly reachable, so the UIs are served through the master node only.
- Default:
False
- --reverse-proxy-url <reverse_proxy_url>
External URL used to reach the Spark master UI when reverse proxy is enabled and the master is itself behind another front-end proxy. Leave unset to serve relative links (recommended when reaching the master through an SSH tunnel).
- --prometheus, --no-prometheus
Expose Spark metrics in Prometheus format through the existing web UI ports (no extra ports are opened).
- Default:
False
- --metrics-csv, --no-metrics-csv
Write Spark metrics to CSV files in <base>/metrics-csv. Unlike the Prometheus sink, this leaves a durable record on disk after the cluster shuts down.
- Default:
False
- --metrics-csv-period <metrics_csv_period>
Interval in seconds at which the CSV metrics sink writes samples.
- Default:
10
- --gpus, --no-gpus
Enable GPU-aware scheduling. Spark workers advertise GPUs and executors/tasks request them. Requires GPUs on the worker nodes.
- Default:
False
- --gpus-per-node <gpus_per_node>
Number of GPUs available on each worker node. Auto-detected from the compute environment by default.
- --rapids, --no-rapids
Enable the NVIDIA RAPIDS Accelerator for Apache Spark to offload SQL/DataFrame operations to GPUs. Implies enable_gpus and requires binaries.rapids_jar_file.
- Default:
False
- -l, --spark-log-level <spark_log_level>
Set the root log level for all Spark processes. Defaults to Spark’s defaults.
- Options:
debug | info | warn | error
- --hive-metastore, --no-hive-metastore
Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.
- Default:
False
- --postgres-hive-metastore, --no-postgres-hive-metastore
Create a metastore with PostgreSQL. Supports multiple Spark sessions.
- Default:
False
- -w, --metastore-dir <metastore_dir>
Set a custom directory for the metastore and warehouse.
- Default:
PosixPath('.')
- -P, --python-path <python_path>
Python path to set for Spark workers. Use the Python inside the Spark distribution by default.
- --resource-monitor, --no-resource-monitor
Enable resource monitoring.
- Default:
False
- --start, --no-start
Start the cluster after configuration.
- Default:
False
- --use-current-python, --no-use-current-python
Use the Python executable in the current environment for Spark workers. --python-path takes precedence.
- Default:
True
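The sizing rules described for --executor-cores and --shuffle-partition-multiplier reduce to simple arithmetic. The Python sketch below is illustrative only and is not sparkctl's code. The function names are hypothetical, integer division for an uneven core count is an assumption, and "number of worker CPUs" is assumed to mean the total across all worker nodes.

```python
def default_executor_cores(node_cores, gpus_enabled=False, gpus_per_node=0):
    """Mirror the documented default: one executor per GPU, otherwise 5 cores.

    Assumption: cores that do not divide evenly are dropped (integer division).
    """
    if gpus_enabled:
        if gpus_per_node < 1:
            raise ValueError("gpus_per_node must be >= 1 when GPUs are enabled")
        return node_cores // gpus_per_node
    return 5


def shuffle_partitions(multiplier, total_worker_cpus):
    """Shuffle partition count implied by the multiplier option."""
    return multiplier * total_worker_cpus


# A 64-core node with 4 GPUs gives 16 cores per executor.
print(default_executor_cores(64, gpus_enabled=True, gpus_per_node=4))  # 16
# Two workers with 36 CPUs each and a multiplier of 4.
print(shuffle_partitions(4, 2 * 36))  # 288
```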
Examples:
$ sparkctl configure --start
$ sparkctl configure --shuffle-partition-multiplier 4 --local-storage
$ sparkctl configure --local-storage --thrift-server
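When --jupyter is enabled with the default --jupyter-ip, the server is reached by tunneling to the compute node's hostname through a login node. A sketch of building that SSH port-forward command in Python follows. The helper name and the host and user names are placeholders, and 8889 is the documented --jupyter-port default.

```python
import shlex


def jupyter_tunnel_command(compute_host, login_host, user, port=8889):
    """Build an ssh command that forwards localhost:<port> to the compute node.

    Hypothetical helper; all host and user names passed in are placeholders.
    -N opens the tunnel without running a remote command.
    """
    argv = ["ssh", "-N", "-L", f"{port}:{compute_host}:{port}", f"{user}@{login_host}"]
    return shlex.join(argv)


print(jupyter_tunnel_command("node0123", "login.example.org", "alice"))
# ssh -N -L 8889:node0123:8889 alice@login.example.org
```

With the tunnel open, the notebook would be available at localhost:8889 on the local machine, using the token that Jupyter prints at startup.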
default-config
Create a sparkctl config file that defines paths to Spark binaries. This is a one-time requirement when installing sparkctl in a new environment.
Usage
sparkctl default-config [OPTIONS] SPARK_PATH JAVA_PATH
Options
- -d, --directory <directory>
Directory in which to create the sparkctl config file.
- Default:
PosixPath('/home/runner')
- -e, --compute-environment <compute_environment>
Compute environment
- Options:
native | slurm
- -H, --hadoop-path <hadoop_path>
Directory containing Hadoop binaries.
- -h, --hive-tarball <hive_tarball>
File containing Hive binaries.
- -p, --postgresql-jar-file <postgresql_jar_file>
Path to PostgreSQL jar file.
- -R, --rapids-jar-file <rapids_jar_file>
Path to the NVIDIA RAPIDS Accelerator for Apache Spark jar file. Only required to enable RAPIDS GPU acceleration.
Arguments
- SPARK_PATH
Required argument
- JAVA_PATH
Required argument
Examples:
$ sparkctl default-config \
    /datasets/images/apache-spark/spark-4.1.1-bin-hadoop3 \
    /datasets/images/apache-spark/jdk-21.0.7 \
    -e slurm
$ sparkctl default-config ~/apache-spark/spark-4.1.1-bin-hadoop3 ~/jdk-21.0.8 -e native
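Since SPARK_PATH and JAVA_PATH are recorded once and reused, a quick check that they point at real installations can catch a mistyped path early. The Python sketch below assumes the standard layouts (bin/spark-submit in a Spark binary distribution, bin/java in a JDK). The function name is hypothetical, and whatever validation sparkctl itself performs may differ.

```python
from pathlib import Path


def missing_binaries(spark_path, java_path):
    """List the expected executables that are absent from the given directories.

    Hypothetical pre-flight check, not part of sparkctl. It only looks for
    bin/spark-submit and bin/java.
    """
    expected = [
        Path(spark_path).expanduser() / "bin" / "spark-submit",
        Path(java_path).expanduser() / "bin" / "java",
    ]
    return [str(path) for path in expected if not path.is_file()]


# An empty list means both executables were found.
problems = missing_binaries("~/apache-spark/spark-4.1.1-bin-hadoop3", "~/jdk-21.0.8")
for path in problems:
    print(f"not found: {path}")
```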
start
Start a Spark cluster with an existing configuration.
Usage
sparkctl start [OPTIONS]
Options
- --wait, --no-wait
If True, wait until the user presses Ctrl-C or the timeout is reached, and then stop the cluster. If False, start the cluster and exit.
- Default:
False
- -d, --directory <directory>
Base directory for the cluster configuration
- Default:
PosixPath('.')
- -t, --timeout <timeout>
Timeout in minutes, applied when --wait is set. Defaults to no timeout.
Examples:
$ sparkctl start
$ sparkctl start --directory ./my-spark-config
$ sparkctl start --wait
stop
Stop a Spark cluster.
Usage
sparkctl stop [OPTIONS]
Options
- -d, --directory <directory>
Base directory for the cluster configuration
- Default:
PosixPath('.')
Examples:
$ sparkctl stop
$ sparkctl stop --directory ./my-spark-config
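When start is run without --wait, the cluster keeps running until stop is called. One way to pair the two commands from a Python driver script is a context manager, sketched below. The name spark_cluster is hypothetical and not part of sparkctl, and the run parameter exists only so the sketch can be exercised without sparkctl installed.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def spark_cluster(directory=".", run=subprocess.run):
    """Start a configured cluster and always stop it when the block exits.

    Hypothetical wrapper around the start and stop commands. It assumes
    sparkctl is on PATH and that `directory` already holds a configuration.
    """
    run(["sparkctl", "start", "--directory", str(directory)], check=True)
    try:
        yield
    finally:
        # Runs even if the body raises, so the cluster is not left behind.
        run(["sparkctl", "stop", "--directory", str(directory)], check=True)


# Usage (requires a configured cluster):
# with spark_cluster("./my-spark-config"):
#     ...  # submit Spark jobs here
```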