sparkctl API

sparkctl.config.make_default_spark_config() → SparkConfig

Return a SparkConfig created from the user’s config file.

class sparkctl.cluster_manager.ClusterManager(config: SparkConfig, status: StatusTracker | None = None)

Manages operation of the Spark cluster.

classmethod from_config(config: SparkConfig) → Self

Create a ClusterManager from a config instance.

Examples

>>> from sparkctl import ClusterManager, make_default_spark_config
>>> config = make_default_spark_config()
>>> config.runtime.start_connect_server = True
>>> mgr = ClusterManager.from_config(config)

See also

from_config_file

classmethod from_config_file(config_file: Path | str | None = None) → Self

Create a ClusterManager from a config file. If config_file is None, use the default config file (e.g., ~/.sparkctl.toml).

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file(config_file="config.json")
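A file such as the config.json used above presumably mirrors the SparkConfig schema documented below, in which only binaries (with spark_path and java_path) is required. The following is a minimal sketch that writes such a file with the standard library. The install paths are hypothetical placeholders, and the exact file layout that sparkctl accepts is an assumption based on the schema:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical install locations; replace with real paths on your system.
config = {
    "binaries": {
        "spark_path": "/opt/spark",
        "java_path": "/opt/java",
    },
    # Optional section; omitted keys keep their documented defaults.
    "runtime": {"start_connect_server": True},
}

# Write the file into a scratch directory so the sketch has no side effects.
config_file = Path(tempfile.mkdtemp()) / "config.json"
config_file.write_text(json.dumps(config, indent=2))

# Read the file back to confirm that it round-trips as valid JSON.
loaded = json.loads(config_file.read_text())
print(sorted(loaded))
```

The resulting path could then be passed as the config_file argument.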

See also

from_config

classmethod load(directory: Path | str | None = None) → Self

Load an active cluster manager from a directory containing a previously created sparkctl config.

Parameters:

directory – Directory containing the sparkctl configuration files. Defaults to the current directory.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.load()
>>> mgr = ClusterManager.load(directory="path/to/sparkctl/config")

See also

from_config

clean(force: bool = False) → None

Delete all Spark runtime files generated by sparkctl in the base directory.

Parameters:

force – Clean even when a cluster appears to be running. By default clean refuses in that case because deleting the runtime files removes the state needed to stop the cluster.

configure() → None

Configure a Spark cluster based on the input parameters.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()

get_spark_session() → SparkSession

Return a SparkSession for the current cluster.

Examples

>>> spark = mgr.get_spark_session()
>>> spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"]).show()

set_workers(workers: list[str]) → None

Set the workers for the cluster. Must be called after configure() and before start().

Parameters:

workers – Worker node names or IP addresses. These are used as ssh targets.

Examples

>>> from sparkctl import ClusterManager, make_default_spark_config
>>> mgr = ClusterManager.from_config(make_default_spark_config())
>>> mgr.configure()
>>> mgr.set_workers(["worker1", "worker2"])
>>> mgr.start()

get_workers() → list[str]

Return the current worker node names.

start(print_env_paths: bool = True) → None

Start the Spark cluster. The caller must have called configure() beforehand.

The environment variables SPARK_HOME, SPARK_CONF_DIR, and JAVA_HOME are set to correct values for the current process.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()
>>> mgr.start()

managed_cluster() → Generator[SparkSession, None, None]

Configure and start the Spark cluster, yield a SparkSession within a context manager, and stop the cluster on exit.

The environment variables SPARK_HOME, SPARK_CONF_DIR, and JAVA_HOME are set to correct values for the current process while the context is active and cleared when complete.
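The set-then-clear behavior described above can be illustrated with a standard-library context manager. This is only a sketch of the pattern, not sparkctl's implementation: managed_env is a hypothetical stand-in that yields the values it set instead of a SparkSession, and the paths are placeholders.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator

ENV_VARS = ("SPARK_HOME", "SPARK_CONF_DIR", "JAVA_HOME")


@contextmanager
def managed_env(values: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Set the variables for the current process and clear them on exit."""
    for name in ENV_VARS:
        os.environ[name] = values[name]
    try:
        # A real manager would configure and start the cluster here and
        # yield a SparkSession; this stand-in yields the values it set.
        yield {name: os.environ[name] for name in ENV_VARS}
    finally:
        # Runs even if the body raises, so cleanup always happens.
        for name in ENV_VARS:
            os.environ.pop(name, None)


# Hypothetical paths, for illustration only.
paths = {
    "SPARK_HOME": "/opt/spark",
    "SPARK_CONF_DIR": "/tmp/sparkctl/conf",
    "JAVA_HOME": "/opt/java",
}
with managed_env(paths) as active:
    inside = dict(active)  # snapshot taken while the context is active
after = {name: os.environ.get(name) for name in ENV_VARS}
```

Placing the cleanup in a finally clause is what guarantees that the variables are cleared even when the code inside the with block fails.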

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> with mgr.managed_cluster() as spark:
...     df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
...     df.show()

stop() → None

Stop all Spark processes.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()
>>> mgr.start()
>>> mgr.stop()

pydantic model sparkctl.models.SparkConfig

Contains all Spark configuration parameters.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "SparkConfig",
   "description": "Contains all Spark configuration parameters.",
   "type": "object",
   "properties": {
      "binaries": {
         "$ref": "#/$defs/BinaryLocations"
      },
      "runtime": {
         "$ref": "#/$defs/SparkRuntimeParams",
         "default": {
            "executor_cores": null,
            "executor_memory_gb": null,
            "driver_memory_gb": 10,
            "node_memory_overhead_gb": 10,
            "use_local_storage": false,
            "start_connect_server": false,
            "connect_server_port": 15002,
            "start_history_server": false,
            "start_thrift_server": false,
            "start_jupyter": false,
            "jupyter_command": "notebook",
            "jupyter_ip": "0.0.0.0",
            "jupyter_port": 8889,
            "enable_reverse_proxy": false,
            "reverse_proxy_url": null,
            "enable_prometheus": false,
            "enable_metrics_csv": false,
            "metrics_csv_period": 10,
            "enable_gpus": false,
            "gpus_per_node": null,
            "executor_gpu_amount": 1,
            "task_gpu_amount": null,
            "enable_rapids": false,
            "spark_log_level": null,
            "enable_dynamic_allocation": false,
            "shuffle_partition_multiplier": 1,
            "enable_hive_metastore": false,
            "enable_postgres_hive_metastore": false,
            "postgres_password": "89480885-995a-4ef7-92dd-17e8babdeece",
            "python_path": null,
            "spark_defaults_template_file": null
         }
      },
      "directories": {
         "$ref": "#/$defs/RuntimeDirectories",
         "default": {
            "base": "/home/runner/work/sparkctl/sparkctl/docs",
            "spark_scratch": "/home/runner/work/sparkctl/sparkctl/docs/spark_scratch",
            "metastore_dir": "/home/runner/work/sparkctl/sparkctl/docs"
         }
      },
      "compute": {
         "$ref": "#/$defs/ComputeParams",
         "default": {
            "environment": "slurm",
            "use_srun": true,
            "postgres": {
               "setup_metastore": "postgres/setup_metastore.sh",
               "start_container": "postgres/start_container.sh",
               "stop_container": "postgres/stop_container.sh"
            }
         }
      },
      "resource_monitor": {
         "$ref": "#/$defs/ResourceMonitorConfig",
         "default": {
            "cpu": true,
            "disk": true,
            "memory": true,
            "network": true,
            "interval": 5,
            "enabled": false
         }
      },
      "app": {
         "$ref": "#/$defs/AppParams",
         "default": {
            "console_level": "INFO",
            "file_level": "DEBUG",
            "reraise_exceptions": false
         }
      }
   },
   "$defs": {
      "AppParams": {
         "additionalProperties": false,
         "properties": {
            "console_level": {
               "default": "INFO",
               "description": "Console log level",
               "title": "Console Level",
               "type": "string"
            },
            "file_level": {
               "default": "DEBUG",
               "description": "File log level",
               "title": "File Level",
               "type": "string"
            },
            "reraise_exceptions": {
               "default": false,
               "description": "Reraise sparkctl exceptions in the CLI handler. Not recommended for users. Useful for developers when debugging issues.",
               "title": "Reraise Exceptions",
               "type": "boolean"
            }
         },
         "title": "AppParams",
         "type": "object"
      },
      "BinaryLocations": {
         "additionalProperties": false,
         "description": "Locations of Spark and its dependent software. Hadoop, Hive, and the PostgreSQL jar file\nare only required if the user wants to enable a Postgres-based Hive metastore.",
         "properties": {
            "spark_path": {
               "description": "Path to the Spark binaries.",
               "format": "path",
               "title": "Spark Path",
               "type": "string"
            },
            "java_path": {
               "description": "Path to the Java binaries.",
               "format": "path",
               "title": "Java Path",
               "type": "string"
            },
            "hadoop_path": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the Hadoop binaries.",
               "title": "Hadoop Path"
            },
            "hive_tarball": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the Hive binaries.",
               "title": "Hive Tarball"
            },
            "postgresql_jar_file": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the PostgreSQL jar file.",
               "title": "Postgresql Jar File"
            },
            "rapids_jar_file": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the NVIDIA RAPIDS Accelerator for Apache Spark jar file. Only required to enable RAPIDS GPU acceleration.",
               "title": "Rapids Jar File"
            }
         },
         "required": [
            "spark_path",
            "java_path"
         ],
         "title": "BinaryLocations",
         "type": "object"
      },
      "ComputeEnvironment": {
         "description": "Defines the supported compute environments.",
         "enum": [
            "native",
            "slurm",
            "fake"
         ],
         "title": "ComputeEnvironment",
         "type": "string"
      },
      "ComputeParams": {
         "additionalProperties": false,
         "properties": {
            "environment": {
               "$ref": "#/$defs/ComputeEnvironment",
               "default": "slurm"
            },
            "use_srun": {
               "default": true,
               "description": "In a Slurm environment, launch Spark workers with srun instead of ssh. srun forwards the full submission environment (modules, virtual environments, LD_LIBRARY_PATH) to the worker nodes. Set to false to fall back to ssh if a site's Slurm configuration does not work with sparkctl's srun invocation. Has no effect in a native environment.",
               "title": "Use Srun",
               "type": "boolean"
            },
            "postgres": {
               "$ref": "#/$defs/PostgresScripts",
               "default": {
                  "start_container": "postgres/start_container.sh",
                  "stop_container": "postgres/stop_container.sh",
                  "setup_metastore": "postgres/setup_metastore.sh"
               }
            }
         },
         "title": "ComputeParams",
         "type": "object"
      },
      "PostgresScripts": {
         "additionalProperties": false,
         "description": "Scripts that set up a PostgreSQL database for use in a Hive metastore.\nRelative paths are assumed to be based on the root path of the sparkctl package.\nAbsolute paths can be anywhere on the filesystem.",
         "properties": {
            "start_container": {
               "default": "postgres/start_container.sh",
               "title": "Start Container",
               "type": "string"
            },
            "stop_container": {
               "default": "postgres/stop_container.sh",
               "title": "Stop Container",
               "type": "string"
            },
            "setup_metastore": {
               "default": "postgres/setup_metastore.sh",
               "title": "Setup Metastore",
               "type": "string"
            }
         },
         "title": "PostgresScripts",
         "type": "object"
      },
      "ResourceMonitorConfig": {
         "additionalProperties": false,
         "description": "Defines the resource stats to monitor.",
         "properties": {
            "cpu": {
               "default": true,
               "description": "Monitor CPU utilization",
               "title": "Cpu",
               "type": "boolean"
            },
            "disk": {
               "default": true,
               "description": "Monitor disk/storage utilization",
               "title": "Disk",
               "type": "boolean"
            },
            "memory": {
               "default": true,
               "description": "Monitor memory utilization",
               "title": "Memory",
               "type": "boolean"
            },
            "network": {
               "default": true,
               "description": "Monitor network utilization",
               "title": "Network",
               "type": "boolean"
            },
            "interval": {
               "default": 5,
               "description": "Interval in seconds on which to collect stats",
               "title": "Interval",
               "type": "integer"
            },
            "enabled": {
               "default": false,
               "description": "Enable resource monitoring.",
               "title": "Enabled",
               "type": "boolean"
            }
         },
         "title": "ResourceMonitorConfig",
         "type": "object"
      },
      "RuntimeDirectories": {
         "additionalProperties": false,
         "description": "Defines the directories to be used by a Spark cluster.",
         "properties": {
            "base": {
               "default": ".",
               "description": "Base directory for the cluster configuration",
               "format": "path",
               "title": "Base",
               "type": "string"
            },
            "spark_scratch": {
               "default": "spark_scratch",
               "description": "Directory to use for shuffle data. Use a dedicated directory: `sparkctl clean` deletes it recursively, even when it is outside the base configuration directory.",
               "format": "path",
               "title": "Spark Scratch",
               "type": "string"
            },
            "metastore_dir": {
               "default": ".",
               "description": "Set a custom directory for the metastore and warehouse.",
               "format": "path",
               "title": "Metastore Dir",
               "type": "string"
            }
         },
         "title": "RuntimeDirectories",
         "type": "object"
      },
      "SparkRuntimeParams": {
         "additionalProperties": false,
         "description": "Controls Spark runtime parameters.",
         "properties": {
            "executor_cores": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Number of cores per executor. By default this is auto-determined: when GPUs are enabled, sparkctl runs one executor per GPU and divides the node's cores evenly among them (the NVIDIA-recommended layout); otherwise it defaults to 5.",
               "title": "Executor Cores"
            },
            "executor_memory_gb": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.",
               "title": "Executor Memory Gb"
            },
            "driver_memory_gb": {
               "default": 10,
               "description": "Driver memory in GB. This is the maximum amount of data that can be pulled into the application.",
               "title": "Driver Memory Gb",
               "type": "integer"
            },
            "node_memory_overhead_gb": {
               "default": 10,
               "description": "Memory to reserve for system processes.",
               "title": "Node Memory Overhead Gb",
               "type": "integer"
            },
            "use_local_storage": {
               "default": false,
               "description": "Use compute node local storage for shuffle data.",
               "title": "Use Local Storage",
               "type": "boolean"
            },
            "start_connect_server": {
               "default": false,
               "description": "Enable the Spark connect server.",
               "title": "Start Connect Server",
               "type": "boolean"
            },
            "connect_server_port": {
               "default": 15002,
               "description": "Port on which the Spark Connect server listens.",
               "title": "Connect Server Port",
               "type": "integer"
            },
            "start_history_server": {
               "default": false,
               "description": "Enable the Spark history server.",
               "title": "Start History Server",
               "type": "boolean"
            },
            "start_thrift_server": {
               "default": false,
               "description": "Enable the Thrift server to connect a SQL client.",
               "title": "Start Thrift Server",
               "type": "boolean"
            },
            "start_jupyter": {
               "default": false,
               "description": "Start a Jupyter server on the master node. Pre-wired to the Spark Connect server when it is enabled (the notebook's SparkSession connects automatically).",
               "title": "Start Jupyter",
               "type": "boolean"
            },
            "jupyter_command": {
               "default": "notebook",
               "description": "Jupyter frontend to launch, i.e. the `jupyter <command>` subcommand. Defaults to the classic 'notebook'; use 'lab' for JupyterLab.",
               "title": "Jupyter Command",
               "type": "string"
            },
            "jupyter_ip": {
               "default": "0.0.0.0",
               "description": "IP address the Jupyter server binds to. Defaults to all interfaces so it can be reached by tunneling to the compute node's hostname through a login node (the common HPC pattern); access is protected by Jupyter's token. Set to 127.0.0.1 to bind to localhost only, which requires tunneling directly into the compute node.",
               "title": "Jupyter Ip",
               "type": "string"
            },
            "jupyter_port": {
               "default": 8889,
               "description": "Port on which the Jupyter server listens.",
               "title": "Jupyter Port",
               "type": "integer"
            },
            "enable_reverse_proxy": {
               "default": false,
               "description": "Run the Spark master as a reverse proxy for the worker and application web UIs. Useful on HPC clusters where the compute nodes are not directly reachable, so the UIs are served through the master node only.",
               "title": "Enable Reverse Proxy",
               "type": "boolean"
            },
            "reverse_proxy_url": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "External URL used to reach the Spark master UI when reverse proxy is enabled and the master is itself behind another front-end proxy. Leave unset to serve relative links (recommended when reaching the master through an SSH tunnel).",
               "title": "Reverse Proxy Url"
            },
            "enable_prometheus": {
               "default": false,
               "description": "Expose Spark metrics in Prometheus format through the existing web UI ports (no extra ports are opened).",
               "title": "Enable Prometheus",
               "type": "boolean"
            },
            "enable_metrics_csv": {
               "default": false,
               "description": "Write Spark metrics to CSV files in <base>/metrics-csv. Unlike the Prometheus sink, this leaves a durable record on disk after the cluster shuts down.",
               "title": "Enable Metrics Csv",
               "type": "boolean"
            },
            "metrics_csv_period": {
               "default": 10,
               "description": "Interval in seconds at which the CSV metrics sink writes samples.",
               "title": "Metrics Csv Period",
               "type": "integer"
            },
            "enable_gpus": {
               "default": false,
               "description": "Enable GPU-aware scheduling. Spark workers advertise GPUs and executors/tasks request them. Requires GPUs on the worker nodes.",
               "title": "Enable Gpus",
               "type": "boolean"
            },
            "gpus_per_node": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Number of GPUs available on each worker node. Auto-detected from the compute environment by default.",
               "title": "Gpus Per Node"
            },
            "executor_gpu_amount": {
               "default": 1,
               "description": "Number of GPUs assigned to each executor.",
               "title": "Executor Gpu Amount",
               "type": "integer"
            },
            "task_gpu_amount": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "GPUs assigned to each task. Defaults to executor_gpu_amount / executor_cores so that concurrent tasks share an executor's GPUs.",
               "title": "Task Gpu Amount"
            },
            "enable_rapids": {
               "default": false,
               "description": "Enable the NVIDIA RAPIDS Accelerator for Apache Spark to offload SQL/DataFrame operations to GPUs. Implies enable_gpus and requires binaries.rapids_jar_file.",
               "title": "Enable Rapids",
               "type": "boolean"
            },
            "spark_log_level": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Set the root log level for all Spark processes. Defaults to Spark's defaults.",
               "title": "Spark Log Level"
            },
            "enable_dynamic_allocation": {
               "default": false,
               "description": "Enable Spark dynamic resource allocation.",
               "title": "Enable Dynamic Allocation",
               "type": "boolean"
            },
            "shuffle_partition_multiplier": {
               "default": 1,
               "description": "Spark SQL shuffle partition multiplier (multiply by the number of worker CPUs)",
               "title": "Shuffle Partition Multiplier",
               "type": "integer"
            },
            "enable_hive_metastore": {
               "default": false,
               "description": "Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.",
               "title": "Enable Hive Metastore",
               "type": "boolean"
            },
            "enable_postgres_hive_metastore": {
               "default": false,
               "description": "Create a metastore with PostgreSQL. Supports multiple Spark sessions.",
               "title": "Enable Postgres Hive Metastore",
               "type": "boolean"
            },
            "postgres_password": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Password for PostgreSQL.",
               "title": "Postgres Password"
            },
            "python_path": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Python path to set for Spark workers. Use the Python inside the Spark distribution by default.",
               "title": "Python Path"
            },
            "spark_defaults_template_file": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.",
               "title": "Spark Defaults Template File"
            }
         },
         "title": "SparkRuntimeParams",
         "type": "object"
      }
   },
   "additionalProperties": false,
   "required": [
      "binaries"
   ]
}
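
The executor_cores, task_gpu_amount, and shuffle_partition_multiplier descriptions in the schema above each state a default as a small formula. The sketch below works through that arithmetic for a hypothetical node. The function names are illustrative and not part of sparkctl, and the use of integer division for an uneven core split is an assumption.

```python
from typing import Optional


def default_executor_cores(node_cores: int, gpus_per_node: Optional[int]) -> int:
    """One executor per GPU with the node's cores split evenly, else 5."""
    if gpus_per_node:
        # Assumption: any remainder from an uneven split is dropped.
        return node_cores // gpus_per_node
    return 5


def default_task_gpu_amount(executor_gpu_amount: int, executor_cores: int) -> float:
    """executor_gpu_amount / executor_cores, so concurrent tasks share the GPUs."""
    return executor_gpu_amount / executor_cores


def shuffle_partitions(multiplier: int, total_worker_cpus: int) -> int:
    """Multiplier times the number of worker CPUs."""
    return multiplier * total_worker_cpus


# Hypothetical cluster: 2 worker nodes, each with 64 cores and 4 GPUs.
cores = default_executor_cores(node_cores=64, gpus_per_node=4)
task_gpus = default_task_gpu_amount(executor_gpu_amount=1, executor_cores=cores)
partitions = shuffle_partitions(multiplier=2, total_worker_cpus=2 * 64)
cpu_only_cores = default_executor_cores(node_cores=64, gpus_per_node=None)

print(cores, task_gpus, partitions, cpu_only_cores)
```

With these inputs each executor gets 16 cores and each task requests 1/16 of a GPU, so up to 16 concurrent tasks can share one executor's GPU.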

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
field app: AppParams = AppParams(console_level='INFO', file_level='DEBUG', reraise_exceptions=False)
field binaries: BinaryLocations [Required]
field compute: ComputeParams = ComputeParams(environment=<ComputeEnvironment.SLURM: 'slurm'>, use_srun=True, postgres=PostgresScripts(start_container='postgres/start_container.sh', stop_container='postgres/stop_container.sh', setup_metastore='postgres/setup_metastore.sh'))
field directories: RuntimeDirectories = RuntimeDirectories(base=PosixPath('/home/runner/work/sparkctl/sparkctl/docs'), spark_scratch=PosixPath('/home/runner/work/sparkctl/sparkctl/docs/spark_scratch'), metastore_dir=PosixPath('/home/runner/work/sparkctl/sparkctl/docs'))
field resource_monitor: ResourceMonitorConfig = ResourceMonitorConfig(cpu=True, disk=True, memory=True, network=True, interval=5, enabled=False)
field runtime: SparkRuntimeParams = SparkRuntimeParams(executor_cores=None, executor_memory_gb=None, driver_memory_gb=10, node_memory_overhead_gb=10, use_local_storage=False, start_connect_server=False, connect_server_port=15002, start_history_server=False, start_thrift_server=False, start_jupyter=False, jupyter_command='notebook', jupyter_ip='0.0.0.0', jupyter_port=8889, enable_reverse_proxy=False, reverse_proxy_url=None, enable_prometheus=False, enable_metrics_csv=False, metrics_csv_period=10, enable_gpus=False, gpus_per_node=None, executor_gpu_amount=1, task_gpu_amount=None, enable_rapids=False, spark_log_level=None, enable_dynamic_allocation=False, shuffle_partition_multiplier=1, enable_hive_metastore=False, enable_postgres_hive_metastore=False, postgres_password='89480885-995a-4ef7-92dd-17e8babdeece', python_path=None, spark_defaults_template_file=None)

pydantic model sparkctl.models.BinaryLocations

Locations of Spark and its dependent software. Hadoop, Hive, and the PostgreSQL jar file are only required if the user wants to enable a Postgres-based Hive metastore.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "BinaryLocations",
   "description": "Locations of Spark and its dependent software. Hadoop, Hive, and the PostgreSQL jar file\nare only required if the user wants to enable a Postgres-based Hive metastore.",
   "type": "object",
   "properties": {
      "spark_path": {
         "description": "Path to the Spark binaries.",
         "format": "path",
         "title": "Spark Path",
         "type": "string"
      },
      "java_path": {
         "description": "Path to the Java binaries.",
         "format": "path",
         "title": "Java Path",
         "type": "string"
      },
      "hadoop_path": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the Hadoop binaries.",
         "title": "Hadoop Path"
      },
      "hive_tarball": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the Hive binaries.",
         "title": "Hive Tarball"
      },
      "postgresql_jar_file": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the PostgreSQL jar file.",
         "title": "Postgresql Jar File"
      },
      "rapids_jar_file": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the NVIDIA RAPIDS Accelerator for Apache Spark jar file. Only required to enable RAPIDS GPU acceleration.",
         "title": "Rapids Jar File"
      }
   },
   "additionalProperties": false,
   "required": [
      "spark_path",
      "java_path"
   ]
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Validators:
  • make_absolute

Fields:
field hadoop_path: Path | None = None

Path to the Hadoop binaries.

Validated by:
  • make_absolute

field hive_tarball: Path | None = None

Path to the Hive binaries.

Validated by:
  • make_absolute

field java_path: Path [Required]

Path to the Java binaries.

Validated by:
  • make_absolute

field postgresql_jar_file: Path | None = None

Path to the PostgreSQL jar file.

Validated by:
  • make_absolute

field rapids_jar_file: Path | None = None

Path to the NVIDIA RAPIDS Accelerator for Apache Spark jar file. Only required to enable RAPIDS GPU acceleration.

Validated by: make_absolute
field spark_path: Path [Required]

Path to the Spark binaries.

Validated by: make_absolute
validator make_absolute  »  java_path, spark_path, hadoop_path, rapids_jar_file, postgresql_jar_file, hive_tarball
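
As a rough illustration of what a validator named make_absolute typically does, the sketch below normalizes an optional path using only the standard library. The function body is an assumption, not sparkctl's actual implementation, and the example paths are hypothetical.

```python
from pathlib import Path
from typing import Optional


def make_absolute(value: Optional[Path]) -> Optional[Path]:
    """Hypothetical stand-in: return an absolute path, passing None through."""
    if value is None:
        # Optional fields such as hadoop_path stay unset.
        return None
    # expanduser() resolves a leading "~"; absolute() prefixes the current
    # working directory to a relative path without touching the filesystem.
    return Path(value).expanduser().absolute()


print(make_absolute(Path("opt/spark")))     # relative input becomes absolute
print(make_absolute(Path("/usr/lib/jvm")))  # absolute input is unchanged
print(make_absolute(None))                  # None
```

Normalizing at validation time means a path stored in the configuration no longer depends on the directory a later command is run from.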
pydantic model sparkctl.models.ComputeParams

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Show JSON schema
{
   "title": "ComputeParams",
   "type": "object",
   "properties": {
      "environment": {
         "$ref": "#/$defs/ComputeEnvironment",
         "default": "slurm"
      },
      "use_srun": {
         "default": true,
         "description": "In a Slurm environment, launch Spark workers with srun instead of ssh. srun forwards the full submission environment (modules, virtual environments, LD_LIBRARY_PATH) to the worker nodes. Set to false to fall back to ssh if a site's Slurm configuration does not work with sparkctl's srun invocation. Has no effect in a native environment.",
         "title": "Use Srun",
         "type": "boolean"
      },
      "postgres": {
         "$ref": "#/$defs/PostgresScripts",
         "default": {
            "start_container": "postgres/start_container.sh",
            "stop_container": "postgres/stop_container.sh",
            "setup_metastore": "postgres/setup_metastore.sh"
         }
      }
   },
   "$defs": {
      "ComputeEnvironment": {
         "description": "Defines the supported compute environments.",
         "enum": [
            "native",
            "slurm",
            "fake"
         ],
         "title": "ComputeEnvironment",
         "type": "string"
      },
      "PostgresScripts": {
         "additionalProperties": false,
         "description": "Scripts that set up a PostgreSQL database for use in a Hive metastore.\nRelative paths are assumed to be based on the root path of the sparkctl package.\nAbsolute paths can be anywhere on the filesystem.",
         "properties": {
            "start_container": {
               "default": "postgres/start_container.sh",
               "title": "Start Container",
               "type": "string"
            },
            "stop_container": {
               "default": "postgres/stop_container.sh",
               "title": "Stop Container",
               "type": "string"
            },
            "setup_metastore": {
               "default": "postgres/setup_metastore.sh",
               "title": "Setup Metastore",
               "type": "string"
            }
         },
         "title": "PostgresScripts",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
field environment: ComputeEnvironment = ComputeEnvironment.SLURM
field postgres: PostgresScripts = PostgresScripts(start_container='postgres/start_container.sh', stop_container='postgres/stop_container.sh', setup_metastore='postgres/setup_metastore.sh')
field use_srun: bool = True

In a Slurm environment, launch Spark workers with srun instead of ssh. srun forwards the full submission environment (modules, virtual environments, LD_LIBRARY_PATH) to the worker nodes. Set to false to fall back to ssh if a site’s Slurm configuration does not work with sparkctl’s srun invocation. Has no effect in a native environment.
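
The following sketch shows one way the ComputeParams fields could be written out as JSON, here with the ssh fallback selected. It only mirrors the schema above with plain dictionaries; where this fragment sits inside a full sparkctl config file is not shown in this section, so treat the placement as an assumption.

```python
import json

# Values permitted by the ComputeEnvironment enum in the schema above.
ALLOWED_ENVIRONMENTS = {"native", "slurm", "fake"}

# Field names and defaults are taken from the ComputeParams schema;
# use_srun is flipped to False to illustrate the ssh fallback.
compute_params = {
    "environment": "slurm",
    "use_srun": False,
    "postgres": {
        "start_container": "postgres/start_container.sh",
        "stop_container": "postgres/stop_container.sh",
        "setup_metastore": "postgres/setup_metastore.sh",
    },
}

if compute_params["environment"] not in ALLOWED_ENVIRONMENTS:
    raise ValueError("unsupported compute environment")

print(json.dumps(compute_params, indent=2))
```

Because the model forbids extra properties, a misspelled key in such a fragment would be rejected rather than silently ignored.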

pydantic model sparkctl.models.SparkRuntimeParams

Controls Spark runtime parameters.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Show JSON schema
{
   "title": "SparkRuntimeParams",
   "description": "Controls Spark runtime parameters.",
   "type": "object",
   "properties": {
      "executor_cores": {
         "anyOf": [
            {
               "type": "integer"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Number of cores per executor. By default this is auto-determined: when GPUs are enabled, sparkctl runs one executor per GPU and divides the node's cores evenly among them (the NVIDIA-recommended layout); otherwise it defaults to 5.",
         "title": "Executor Cores"
      },
      "executor_memory_gb": {
         "anyOf": [
            {
               "type": "integer"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.",
         "title": "Executor Memory Gb"
      },
      "driver_memory_gb": {
         "default": 10,
         "description": "Driver memory in GB. This is the maximum amount of data that can be pulled into the application.",
         "title": "Driver Memory Gb",
         "type": "integer"
      },
      "node_memory_overhead_gb": {
         "default": 10,
         "description": "Memory to reserve for system processes.",
         "title": "Node Memory Overhead Gb",
         "type": "integer"
      },
      "use_local_storage": {
         "default": false,
         "description": "Use compute node local storage for shuffle data.",
         "title": "Use Local Storage",
         "type": "boolean"
      },
      "start_connect_server": {
         "default": false,
         "description": "Enable the Spark Connect server.",
         "title": "Start Connect Server",
         "type": "boolean"
      },
      "connect_server_port": {
         "default": 15002,
         "description": "Port on which the Spark Connect server listens.",
         "title": "Connect Server Port",
         "type": "integer"
      },
      "start_history_server": {
         "default": false,
         "description": "Enable the Spark history server.",
         "title": "Start History Server",
         "type": "boolean"
      },
      "start_thrift_server": {
         "default": false,
         "description": "Enable the Thrift server to connect a SQL client.",
         "title": "Start Thrift Server",
         "type": "boolean"
      },
      "start_jupyter": {
         "default": false,
         "description": "Start a Jupyter server on the master node. Pre-wired to the Spark Connect server when it is enabled (the notebook's SparkSession connects automatically).",
         "title": "Start Jupyter",
         "type": "boolean"
      },
      "jupyter_command": {
         "default": "notebook",
         "description": "Jupyter frontend to launch, i.e. the `jupyter <command>` subcommand. Defaults to the classic 'notebook'; use 'lab' for JupyterLab.",
         "title": "Jupyter Command",
         "type": "string"
      },
      "jupyter_ip": {
         "default": "0.0.0.0",
         "description": "IP address the Jupyter server binds to. Defaults to all interfaces so it can be reached by tunneling to the compute node's hostname through a login node (the common HPC pattern); access is protected by Jupyter's token. Set to 127.0.0.1 to bind to localhost only, which requires tunneling directly into the compute node.",
         "title": "Jupyter Ip",
         "type": "string"
      },
      "jupyter_port": {
         "default": 8889,
         "description": "Port on which the Jupyter server listens.",
         "title": "Jupyter Port",
         "type": "integer"
      },
      "enable_reverse_proxy": {
         "default": false,
         "description": "Run the Spark master as a reverse proxy for the worker and application web UIs. Useful on HPC clusters where the compute nodes are not directly reachable, so the UIs are served through the master node only.",
         "title": "Enable Reverse Proxy",
         "type": "boolean"
      },
      "reverse_proxy_url": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "External URL used to reach the Spark master UI when reverse proxy is enabled and the master is itself behind another front-end proxy. Leave unset to serve relative links (recommended when reaching the master through an SSH tunnel).",
         "title": "Reverse Proxy Url"
      },
      "enable_prometheus": {
         "default": false,
         "description": "Expose Spark metrics in Prometheus format through the existing web UI ports (no extra ports are opened).",
         "title": "Enable Prometheus",
         "type": "boolean"
      },
      "enable_metrics_csv": {
         "default": false,
         "description": "Write Spark metrics to CSV files in <base>/metrics-csv. Unlike the Prometheus sink, this leaves a durable record on disk after the cluster shuts down.",
         "title": "Enable Metrics Csv",
         "type": "boolean"
      },
      "metrics_csv_period": {
         "default": 10,
         "description": "Interval in seconds at which the CSV metrics sink writes samples.",
         "title": "Metrics Csv Period",
         "type": "integer"
      },
      "enable_gpus": {
         "default": false,
         "description": "Enable GPU-aware scheduling. Spark workers advertise GPUs and executors/tasks request them. Requires GPUs on the worker nodes.",
         "title": "Enable Gpus",
         "type": "boolean"
      },
      "gpus_per_node": {
         "anyOf": [
            {
               "type": "integer"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Number of GPUs available on each worker node. Auto-detected from the compute environment by default.",
         "title": "Gpus Per Node"
      },
      "executor_gpu_amount": {
         "default": 1,
         "description": "Number of GPUs assigned to each executor.",
         "title": "Executor Gpu Amount",
         "type": "integer"
      },
      "task_gpu_amount": {
         "anyOf": [
            {
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "GPUs assigned to each task. Defaults to executor_gpu_amount / executor_cores so that concurrent tasks share an executor's GPUs.",
         "title": "Task Gpu Amount"
      },
      "enable_rapids": {
         "default": false,
         "description": "Enable the NVIDIA RAPIDS Accelerator for Apache Spark to offload SQL/DataFrame operations to GPUs. Implies enable_gpus and requires binaries.rapids_jar_file.",
         "title": "Enable Rapids",
         "type": "boolean"
      },
      "spark_log_level": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Set the root log level for all Spark processes. Defaults to Spark's defaults.",
         "title": "Spark Log Level"
      },
      "enable_dynamic_allocation": {
         "default": false,
         "description": "Enable Spark dynamic resource allocation.",
         "title": "Enable Dynamic Allocation",
         "type": "boolean"
      },
      "shuffle_partition_multiplier": {
         "default": 1,
         "description": "Multiplier for the number of Spark SQL shuffle partitions (multiplied by the number of worker CPUs).",
         "title": "Shuffle Partition Multiplier",
         "type": "integer"
      },
      "enable_hive_metastore": {
         "default": false,
         "description": "Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.",
         "title": "Enable Hive Metastore",
         "type": "boolean"
      },
      "enable_postgres_hive_metastore": {
         "default": false,
         "description": "Create a metastore with PostgreSQL. Supports multiple Spark sessions.",
         "title": "Enable Postgres Hive Metastore",
         "type": "boolean"
      },
      "postgres_password": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Password for PostgreSQL.",
         "title": "Postgres Password"
      },
      "python_path": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Python path to set for Spark workers. Use the Python inside the Spark distribution by default.",
         "title": "Python Path"
      },
      "spark_defaults_template_file": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.",
         "title": "Spark Defaults Template File"
      }
   },
   "additionalProperties": false
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
Validators:
field connect_server_port: int = 15002

Port on which the Spark Connect server listens.

field driver_memory_gb: int = 10

Driver memory in GB. This is the maximum amount of data that can be pulled into the application.

field enable_dynamic_allocation: bool = False

Enable Spark dynamic resource allocation.

field enable_gpus: bool = False

Enable GPU-aware scheduling. Spark workers advertise GPUs and executors/tasks request them. Requires GPUs on the worker nodes.

field enable_hive_metastore: bool = False

Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.

field enable_metrics_csv: bool = False

Write Spark metrics to CSV files in <base>/metrics-csv. Unlike the Prometheus sink, this leaves a durable record on disk after the cluster shuts down.

field enable_postgres_hive_metastore: bool = False

Create a metastore with PostgreSQL. Supports multiple Spark sessions.

field enable_prometheus: bool = False

Expose Spark metrics in Prometheus format through the existing web UI ports (no extra ports are opened).

field enable_rapids: bool = False

Enable the NVIDIA RAPIDS Accelerator for Apache Spark to offload SQL/DataFrame operations to GPUs. Implies enable_gpus and requires binaries.rapids_jar_file.
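
The relationship described above (enable_rapids implies enable_gpus and needs binaries.rapids_jar_file) can be sketched as a small check over plain dictionaries. The helper name resolve_rapids and the jar path are hypothetical; sparkctl's own validation may be structured differently.

```python
def resolve_rapids(runtime: dict, binaries: dict) -> dict:
    """Hypothetical helper: apply the enable_rapids rules to runtime settings."""
    resolved = dict(runtime)  # work on a copy, leave the caller's dict intact
    if resolved.get("enable_rapids", False):
        if binaries.get("rapids_jar_file") is None:
            raise ValueError("enable_rapids requires binaries.rapids_jar_file")
        # RAPIDS offloads work to GPUs, so GPU-aware scheduling is switched on.
        resolved["enable_gpus"] = True
    return resolved


runtime = {"enable_rapids": True, "enable_gpus": False}
binaries = {"rapids_jar_file": "/opt/jars/rapids.jar"}  # hypothetical path
print(resolve_rapids(runtime, binaries))
```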

field enable_reverse_proxy: bool = False

Run the Spark master as a reverse proxy for the worker and application web UIs. Useful on HPC clusters where the compute nodes are not directly reachable, so the UIs are served through the master node only.

field executor_cores: int | None = None

Number of cores per executor. By default this is auto-determined: when GPUs are enabled, sparkctl runs one executor per GPU and divides the node’s cores evenly among them (the NVIDIA-recommended layout); otherwise it defaults to 5.
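
A minimal sketch of the default described above, assuming that "divides the node's cores evenly" means integer division and that no cores are held back for other processes. The function name and the example node size are hypothetical.

```python
def default_executor_cores(node_cores: int, gpus_per_node: int, enable_gpus: bool) -> int:
    """Hypothetical helper mirroring the documented default for executor_cores."""
    if not enable_gpus:
        # Documented fallback when GPUs are not enabled.
        return 5
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1 when GPUs are enabled")
    # One executor per GPU, with the node's cores split evenly among them.
    return max(1, node_cores // gpus_per_node)


print(default_executor_cores(node_cores=64, gpus_per_node=4, enable_gpus=True))   # 16
print(default_executor_cores(node_cores=64, gpus_per_node=4, enable_gpus=False))  # 5
```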

field executor_gpu_amount: int = 1

Number of GPUs assigned to each executor.

field executor_memory_gb: int | None = None

Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.

field gpus_per_node: int | None = None

Number of GPUs available on each worker node. Auto-detected from the compute environment by default.

field jupyter_command: str = 'notebook'

Jupyter frontend to launch, i.e. the jupyter <command> subcommand. Defaults to the classic ‘notebook’; use ‘lab’ for JupyterLab.

field jupyter_ip: str = '0.0.0.0'

IP address the Jupyter server binds to. Defaults to all interfaces so it can be reached by tunneling to the compute node’s hostname through a login node (the common HPC pattern); access is protected by Jupyter’s token. Set to 127.0.0.1 to bind to localhost only, which requires tunneling directly into the compute node.
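
The tunneling pattern mentioned above can be sketched by assembling a standard ssh local port-forward command. The login host, user name, and compute node name are hypothetical placeholders; only the default port 8889 comes from this reference.

```python
import shlex


def jupyter_tunnel_command(login_host: str, compute_node: str, port: int = 8889) -> str:
    """Build an ssh command that forwards a local port to Jupyter on a compute node."""
    command = [
        "ssh",
        "-N",  # forward ports only, do not run a remote command
        "-L", f"{port}:{compute_node}:{port}",  # local port -> compute node port
        login_host,
    ]
    return shlex.join(command)


# Hypothetical host names for illustration.
print(jupyter_tunnel_command("user@login.example.org", "node123"))
```

With the tunnel open, the notebook would be reachable at localhost on the forwarded port, still protected by Jupyter's token. This route works because the server binds to all interfaces by default; with jupyter_ip set to 127.0.0.1 the login node cannot reach the port on the compute node.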

field jupyter_port: int = 8889

Port on which the Jupyter server listens.

field metrics_csv_period: int = 10

Interval in seconds at which the CSV metrics sink writes samples.
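
For orientation, the sketch below builds the Spark metrics.properties entries that configure Spark's built-in CsvSink with a period in seconds. The property keys are Spark's own; whether sparkctl writes exactly these lines is an assumption, and the base directory is a hypothetical example.

```python
from pathlib import Path


def csv_sink_properties(base, period_seconds=10):
    """Return CsvSink settings as a dict of metrics.properties keys and values."""
    return {
        "*.sink.csv.class": "org.apache.spark.metrics.sink.CsvSink",
        "*.sink.csv.period": str(period_seconds),
        "*.sink.csv.unit": "seconds",
        # Matches the <base>/metrics-csv location named for enable_metrics_csv.
        "*.sink.csv.directory": str(Path(base) / "metrics-csv"),
    }


props = csv_sink_properties("/scratch/my-cluster")  # hypothetical base directory
for key, value in props.items():
    print(f"{key}={value}")
```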

field node_memory_overhead_gb: int = 10

Memory to reserve for system processes.

field postgres_password: str | None = None

Password for PostgreSQL.

Validated by: set_postgres_password
field python_path: str | None = None

Python path to set for Spark workers. Use the Python inside the Spark distribution by default.

field reverse_proxy_url: str | None = None

External URL used to reach the Spark master UI when reverse proxy is enabled and the master is itself behind another front-end proxy. Leave unset to serve relative links (recommended when reaching the master through an SSH tunnel).

field shuffle_partition_multiplier: int = 1

Multiplier for the number of Spark SQL shuffle partitions (multiplied by the number of worker CPUs).
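
A short worked example of the arithmetic, assuming the resulting value is used for Spark's spark.sql.shuffle.partitions setting and that "worker CPUs" means the total across all worker nodes. The helper name and cluster size are hypothetical.

```python
def shuffle_partitions(num_workers: int, cpus_per_worker: int, multiplier: int = 1) -> int:
    """Hypothetical helper: multiplier times the total number of worker CPUs."""
    return multiplier * num_workers * cpus_per_worker


# Four worker nodes with 36 CPUs each.
print(shuffle_partitions(4, 36))                # 144
print(shuffle_partitions(4, 36, multiplier=2))  # 288
```

A larger multiplier yields more, smaller shuffle partitions.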

field spark_defaults_template_file: Path | None = None

Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.

field spark_log_level: str | None = None

Set the root log level for all Spark processes. Defaults to Spark’s defaults.

field start_connect_server: bool = False

Enable the Spark Connect server.

field start_history_server: bool = False

Enable the Spark history server.

field start_jupyter: bool = False

Start a Jupyter server on the master node. Pre-wired to the Spark Connect server when it is enabled (the notebook’s SparkSession connects automatically).

field start_thrift_server: bool = False

Enable the Thrift server to connect a SQL client.

field task_gpu_amount: float | None = None

GPUs assigned to each task. Defaults to executor_gpu_amount / executor_cores so that concurrent tasks share an executor’s GPUs.
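
The default stated above is a simple ratio. The sketch below works it through for two layouts; the function name is hypothetical and the core counts are example values.

```python
def default_task_gpu_amount(executor_gpu_amount: int, executor_cores: int) -> float:
    """Hypothetical helper: fraction of an executor's GPUs given to each task."""
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    return executor_gpu_amount / executor_cores


# One GPU per executor and 16 cores per executor: each of the 16
# concurrent tasks is assigned 1/16 of the GPU.
print(default_task_gpu_amount(1, 16))  # 0.0625
print(default_task_gpu_amount(1, 5))   # 0.2
```

With this ratio, an executor can run as many concurrent tasks as it has cores while all of them share the executor's GPUs.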

field use_local_storage: bool = False

Use compute node local storage for shuffle data.

validator set_postgres_password  »  postgres_password
pydantic model sparkctl.models.RuntimeDirectories

Defines the directories to be used by a Spark cluster.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Show JSON schema
{
   "title": "RuntimeDirectories",
   "description": "Defines the directories to be used by a Spark cluster.",
   "type": "object",
   "properties": {
      "base": {
         "default": ".",
         "description": "Base directory for the cluster configuration",
         "format": "path",
         "title": "Base",
         "type": "string"
      },
      "spark_scratch": {
         "default": "spark_scratch",
         "description": "Directory to use for shuffle data. Use a dedicated directory: `sparkctl clean` deletes it recursively, even when it is outside the base configuration directory.",
         "format": "path",
         "title": "Spark Scratch",
         "type": "string"
      },
      "metastore_dir": {
         "default": ".",
         "description": "Set a custom directory for the metastore and warehouse.",
         "format": "path",
         "title": "Metastore Dir",
         "type": "string"
      }
   },
   "additionalProperties": false
}

Config:
  • str_strip_whitespace: bool = True

  • validate_assignment: bool = True

  • validate_default: bool = True

  • extra: str = forbid

  • use_enum_values: bool = False

  • arbitrary_types_allowed: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
Validators:
field base: Path = PosixPath('.')

Base directory for the cluster configuration

Validated by: make_absolute
field metastore_dir: Path = PosixPath('.')

Set a custom directory for the metastore and warehouse.

Validated by: make_absolute
field spark_scratch: Path = PosixPath('spark_scratch')

Directory to use for shuffle data. Use a dedicated directory: sparkctl clean deletes it recursively, even when it is outside the base configuration directory.

Validated by: make_absolute
validator make_absolute  »  base, metastore_dir, spark_scratch
clean_spark_conf_dir() Path

Ensure that the Spark conf dir exists and is clean.

get_events_dir() Path

Return the path to the Spark events directory

get_gpu_discovery_script_file() Path

Return the file path to the GPU discovery script.

get_hive_site_file() Path

Return the file path to hive-site.xml

get_metrics_properties_file() Path

Return the file path to metrics.properties

get_spark_conf_dir() Path

Return the Spark conf directory

get_spark_defaults_file() Path

Return the file path to spark-defaults.conf

get_spark_env_file() Path

Return the file path to spark-env.sh

get_spark_log_file() Path

Return the file path to the log properties file

get_workers_file() Path

Return the file path to the workers file

class sparkctl.models.ComputeEnvironment(*values)

Defines the supported compute environments.