sparkctl API

sparkctl.config.make_default_spark_config() → SparkConfig

Return a SparkConfig created from the user’s config file.

class sparkctl.cluster_manager.ClusterManager(config: SparkConfig, status: StatusTracker | None = None)

Manages operation of the Spark cluster.

classmethod from_config(config: SparkConfig) → Self

Create a ClusterManager from a config instance.

Examples

>>> from sparkctl import ClusterManager, make_default_spark_config
>>> config = make_default_spark_config()
>>> config.runtime.start_connect_server = True
>>> mgr = ClusterManager.from_config(config)

See also

from_config_file

clean(force: bool = False) → None

Delete all Spark runtime files generated by sparkctl in the base directory.

Parameters: force – Clean even when a cluster appears to be running. By default, clean refuses in that case because deleting the runtime files removes the state needed to stop the cluster.

configure() → None

Configure a Spark cluster based on the input parameters.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()

get_spark_session() → SparkSession

Return a SparkSession for the current cluster.

Examples

>>> spark = mgr.get_spark_session()
>>> spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"]).show()

set_workers(workers: list[str]) → None

Set the workers for the cluster. Must be called after configure() and before start().

Parameters: workers – Worker node names or IP addresses. These are used as ssh targets.

Examples

>>> from sparkctl import ClusterManager, make_default_spark_config
>>> mgr = ClusterManager.from_config(make_default_spark_config())
>>> mgr.configure()
>>> mgr.set_workers(["worker1", "worker2"])
>>> mgr.start()

get_workers() → list[str]

Return the current worker node names.

start(print_env_paths: bool = True) → None

Start the Spark cluster. The caller must have called configure() beforehand.

The environment variables SPARK_HOME, SPARK_CONF_DIR, and JAVA_HOME are set to correct values for the current process.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()
>>> mgr.start()

managed_cluster() → Generator[SparkSession, None, None]

Context manager that configures and starts the Spark cluster, yields a SparkSession, and stops the cluster on exit.

The environment variables SPARK_HOME, SPARK_CONF_DIR, and JAVA_HOME are set to correct values for the current process while the context is active and cleared when complete.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> with mgr.managed_cluster() as spark:
...     df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
...     df.show()
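
The set-then-clear handling of the environment variables described above can be illustrated with a small stand-in that does not use sparkctl at all. This is a minimal sketch of the general pattern, not sparkctl's implementation; the helper name `spark_env` and the paths are hypothetical.

```python
import os
from contextlib import contextmanager
from typing import Iterator

# Variable names come from the description above; the paths are placeholders.
SPARK_ENV = {
    "SPARK_HOME": "/opt/spark",
    "SPARK_CONF_DIR": "/tmp/sparkctl/conf",
    "JAVA_HOME": "/opt/java",
}


@contextmanager
def spark_env(values: dict[str, str]) -> Iterator[None]:
    """Set environment variables while the block runs, then clear them."""
    os.environ.update(values)
    try:
        yield
    finally:
        # The finally clause runs even if the body raises, so cleanup is not skipped.
        for name in values:
            os.environ.pop(name, None)


with spark_env(SPARK_ENV):
    inside = {name: os.environ.get(name) for name in SPARK_ENV}

# After the block exits, the variables are no longer set in this process.
after = {name: os.environ.get(name) for name in SPARK_ENV}
```

Placing the cleanup in a `finally` clause is what makes this pattern safer than calling start() and stop() by hand when the body can raise.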

stop() → None

Stop all Spark processes.

Examples

>>> from sparkctl import ClusterManager
>>> mgr = ClusterManager.from_config_file("config.json")
>>> mgr.configure()
>>> mgr.start()
>>> mgr.stop()

pydantic model sparkctl.models.SparkConfig

Contains all Spark configuration parameters.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Show JSON schema

{
   "title": "SparkConfig",
   "description": "Contains all Spark configuration parameters.",
   "type": "object",
   "properties": {
      "binaries": {
         "$ref": "#/$defs/BinaryLocations"
      },
      "runtime": {
         "$ref": "#/$defs/SparkRuntimeParams",
         "default": {
            "executor_cores": null,
            "executor_memory_gb": null,
            "driver_memory_gb": 10,
            "node_memory_overhead_gb": 10,
            "use_local_storage": false,
            "start_connect_server": false,
            "connect_server_port": 15002,
            "start_history_server": false,
            "start_thrift_server": false,
            "start_jupyter": false,
            "jupyter_command": "notebook",
            "jupyter_ip": "0.0.0.0",
            "jupyter_port": 8889,
            "enable_reverse_proxy": false,
            "reverse_proxy_url": null,
            "enable_prometheus": false,
            "enable_metrics_csv": false,
            "metrics_csv_period": 10,
            "enable_gpus": false,
            "gpus_per_node": null,
            "executor_gpu_amount": 1,
            "task_gpu_amount": null,
            "enable_rapids": false,
            "spark_log_level": null,
            "enable_dynamic_allocation": false,
            "shuffle_partition_multiplier": 1,
            "enable_hive_metastore": false,
            "enable_postgres_hive_metastore": false,
            "postgres_password": "89480885-995a-4ef7-92dd-17e8babdeece",
            "python_path": null,
            "spark_defaults_template_file": null
         }
      },
      "directories": {
         "$ref": "#/$defs/RuntimeDirectories",
         "default": {
            "base": "/home/runner/work/sparkctl/sparkctl/docs",
            "spark_scratch": "/home/runner/work/sparkctl/sparkctl/docs/spark_scratch",
            "metastore_dir": "/home/runner/work/sparkctl/sparkctl/docs"
         }
      },
      "compute": {
         "$ref": "#/$defs/ComputeParams",
         "default": {
            "environment": "slurm",
            "use_srun": true,
            "postgres": {
               "setup_metastore": "postgres/setup_metastore.sh",
               "start_container": "postgres/start_container.sh",
               "stop_container": "postgres/stop_container.sh"
            }
         }
      },
      "resource_monitor": {
         "$ref": "#/$defs/ResourceMonitorConfig",
         "default": {
            "cpu": true,
            "disk": true,
            "memory": true,
            "network": true,
            "interval": 5,
            "enabled": false
         }
      },
      "app": {
         "$ref": "#/$defs/AppParams",
         "default": {
            "console_level": "INFO",
            "file_level": "DEBUG",
            "reraise_exceptions": false
         }
      }
   },
   "$defs": {
      "AppParams": {
         "additionalProperties": false,
         "properties": {
            "console_level": {
               "default": "INFO",
               "description": "Console log level",
               "title": "Console Level",
               "type": "string"
            },
            "file_level": {
               "default": "DEBUG",
               "description": "File log level",
               "title": "File Level",
               "type": "string"
            },
            "reraise_exceptions": {
               "default": false,
               "description": "Reraise sparkctl exceptions in the CLI handler. Not recommended for users. Useful for developers when debugging issues.",
               "title": "Reraise Exceptions",
               "type": "boolean"
            }
         },
         "title": "AppParams",
         "type": "object"
      },
      "BinaryLocations": {
         "additionalProperties": false,
         "description": "Locations to the Spark and dependent software. Hadoop, Hive, and the PostgreSQL jar file\nare only required if the user wants to enable a Postgres-based Hive metastore.",
         "properties": {
            "spark_path": {
               "description": "Path to the Spark binaries.",
               "format": "path",
               "title": "Spark Path",
               "type": "string"
            },
            "java_path": {
               "description": "Path to the Java binaries.",
               "format": "path",
               "title": "Java Path",
               "type": "string"
            },
            "hadoop_path": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the Hadoop binaries.",
               "title": "Hadoop Path"
            },
            "hive_tarball": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the Hive binaries.",
               "title": "Hive Tarball"
            },
            "postgresql_jar_file": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the PostgreSQL jar file.",
               "title": "Postgresql Jar File"
            },
            "rapids_jar_file": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to the NVIDIA RAPIDS Accelerator for Apache Spark jar file. Only required to enable RAPIDS GPU acceleration.",
               "title": "Rapids Jar File"
            }
         },
         "required": [
            "spark_path",
            "java_path"
         ],
         "title": "BinaryLocations",
         "type": "object"
      },
      "ComputeEnvironment": {
         "description": "Defines the supported compute environments.",
         "enum": [
            "native",
            "slurm",
            "fake"
         ],
         "title": "ComputeEnvironment",
         "type": "string"
      },
      "ComputeParams": {
         "additionalProperties": false,
         "properties": {
            "environment": {
               "$ref": "#/$defs/ComputeEnvironment",
               "default": "slurm"
            },
            "use_srun": {
               "default": true,
               "description": "In a Slurm environment, launch Spark workers with srun instead of ssh. srun forwards the full submission environment (modules, virtual environments, LD_LIBRARY_PATH) to the worker nodes. Set to false to fall back to ssh if a site's Slurm configuration does not work with sparkctl's srun invocation. Has no effect in a native environment.",
               "title": "Use Srun",
               "type": "boolean"
            },
            "postgres": {
               "$ref": "#/$defs/PostgresScripts",
               "default": {
                  "start_container": "postgres/start_container.sh",
                  "stop_container": "postgres/stop_container.sh",
                  "setup_metastore": "postgres/setup_metastore.sh"
               }
            }
         },
         "title": "ComputeParams",
         "type": "object"
      },
      "PostgresScripts": {
         "additionalProperties": false,
         "description": "Scripts that setup a PostgreSQL database for use in a Hive metastore.\nRelative paths are assumed to be based on the root path of the sparkctl package.\nAbsolute paths can be anywhere on the filesystem.",
         "properties": {
            "start_container": {
               "default": "postgres/start_container.sh",
               "title": "Start Container",
               "type": "string"
            },
            "stop_container": {
               "default": "postgres/stop_container.sh",
               "title": "Stop Container",
               "type": "string"
            },
            "setup_metastore": {
               "default": "postgres/setup_metastore.sh",
               "title": "Setup Metastore",
               "type": "string"
            }
         },
         "title": "PostgresScripts",
         "type": "object"
      },
      "ResourceMonitorConfig": {
         "additionalProperties": false,
         "description": "Defines the resource stats to monitor.",
         "properties": {
            "cpu": {
               "default": true,
               "description": "Monitor CPU utilization",
               "title": "Cpu",
               "type": "boolean"
            },
            "disk": {
               "default": true,
               "description": "Monitor disk/storage utilization",
               "title": "Disk",
               "type": "boolean"
            },
            "memory": {
               "default": true,
               "description": "Monitor memory utilization",
               "title": "Memory",
               "type": "boolean"
            },
            "network": {
               "default": true,
               "description": "Monitor network utilization",
               "title": "Network",
               "type": "boolean"
            },
            "interval": {
               "default": 5,
               "description": "Interval in seconds on which to collect stats",
               "title": "Interval",
               "type": "integer"
            },
            "enabled": {
               "default": false,
               "description": "Enable resource monitoring.",
               "title": "Enabled",
               "type": "boolean"
            }
         },
         "title": "ResourceMonitorConfig",
         "type": "object"
      },
      "RuntimeDirectories": {
         "additionalProperties": false,
         "description": "Defines the directories to be used by a Spark cluster.",
         "properties": {
            "base": {
               "default": ".",
               "description": "Base directory for the cluster configuration",
               "format": "path",
               "title": "Base",
               "type": "string"
            },
            "spark_scratch": {
               "default": "spark_scratch",
               "description": "Directory to use for shuffle data. Use a dedicated directory: `sparkctl clean` deletes it recursively, even when it is outside the base configuration directory.",
               "format": "path",
               "title": "Spark Scratch",
               "type": "string"
            },
            "metastore_dir": {
               "default": ".",
               "description": "Set a custom directory for the metastore and warehouse.",
               "format": "path",
               "title": "Metastore Dir",
               "type": "string"
            }
         },
         "title": "RuntimeDirectories",
         "type": "object"
      },
      "SparkRuntimeParams": {
         "additionalProperties": false,
         "description": "Controls Spark runtime parameters.",
         "properties": {
            "executor_cores": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Number of cores per executor. By default this is auto-determined: when GPUs are enabled, sparkctl runs one executor per GPU and divides the node's cores evenly among them (the NVIDIA-recommended layout); otherwise it defaults to 5.",
               "title": "Executor Cores"
            },
            "executor_memory_gb": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.",
               "title": "Executor Memory Gb"
            },
            "driver_memory_gb": {
               "default": 10,
               "description": "Driver memory in GB. This is the maximum amount of data that can be pulled into the application.",
               "title": "Driver Memory Gb",
               "type": "integer"
            },
            "node_memory_overhead_gb": {
               "default": 10,
               "description": "Memory to reserve for system processes.",
               "title": "Node Memory Overhead Gb",
               "type": "integer"
            },
            "use_local_storage": {
               "default": false,
               "description": "Use compute node local storage for shuffle data.",
               "title": "Use Local Storage",
               "type": "boolean"
            },
            "start_connect_server": {
               "default": false,
               "description": "Enable the Spark connect server.",
               "title": "Start Connect Server",
               "type": "boolean"
            },
            "connect_server_port": {
               "default": 15002,
               "description": "Port on which the Spark Connect server listens.",
               "title": "Connect Server Port",
               "type": "integer"
            },
            "start_history_server": {
               "default": false,
               "description": "Enable the Spark history server.",
               "title": "Start History Server",
               "type": "boolean"
            },
            "start_thrift_server": {
               "default": false,
               "description": "Enable the Thrift server to connect a SQL client.",
               "title": "Start Thrift Server",
               "type": "boolean"
            },
            "start_jupyter": {
               "default": false,
               "description": "Start a Jupyter server on the master node. Pre-wired to the Spark Connect server when it is enabled (the notebook's SparkSession connects automatically).",
               "title": "Start Jupyter",
               "type": "boolean"
            },
            "jupyter_command": {
               "default": "notebook",
               "description": "Jupyter frontend to launch, i.e. the `jupyter <command>` subcommand. Defaults to the classic 'notebook'; use 'lab' for JupyterLab.",
               "title": "Jupyter Command",
               "type": "string"
            },
            "jupyter_ip": {
               "default": "0.0.0.0",
               "description": "IP address the Jupyter server binds to. Defaults to all interfaces so it can be reached by tunneling to the compute node's hostname through a login node (the common HPC pattern); access is protected by Jupyter's token. Set to 127.0.0.1 to bind to localhost only, which requires tunneling directly into the compute node.",
               "title": "Jupyter Ip",
               "type": "string"
            },
            "jupyter_port": {
               "default": 8889,
               "description": "Port on which the Jupyter server listens.",
               "title": "Jupyter Port",
               "type": "integer"
            },
            "enable_reverse_proxy": {
               "default": false,
               "description": "Run the Spark master as a reverse proxy for the worker and application web UIs. Useful on HPC clusters where the compute nodes are not directly reachable, so the UIs are served through the master node only.",
               "title": "Enable Reverse Proxy",
               "type": "boolean"
            },
            "reverse_proxy_url": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "External URL used to reach the Spark master UI when reverse proxy is enabled and the master is itself behind another front-end proxy. Leave unset to serve relative links (recommended when reaching the master through an SSH tunnel).",
               "title": "Reverse Proxy Url"
            },
            "enable_prometheus": {
               "default": false,
               "description": "Expose Spark metrics in Prometheus format through the existing web UI ports (no extra ports are opened).",
               "title": "Enable Prometheus",
               "type": "boolean"
            },
            "enable_metrics_csv": {
               "default": false,
               "description": "Write Spark metrics to CSV files in <base>/metrics-csv. Unlike the Prometheus sink, this leaves a durable record on disk after the cluster shuts down.",
               "title": "Enable Metrics Csv",
               "type": "boolean"
            },
            "metrics_csv_period": {
               "default": 10,
               "description": "Interval in seconds at which the CSV metrics sink writes samples.",
               "title": "Metrics Csv Period",
               "type": "integer"
            },
            "enable_gpus": {
               "default": false,
               "description": "Enable GPU-aware scheduling. Spark workers advertise GPUs and executors/tasks request them. Requires GPUs on the worker nodes.",
               "title": "Enable Gpus",
               "type": "boolean"
            },
            "gpus_per_node": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Number of GPUs available on each worker node. Auto-detected from the compute environment by default.",
               "title": "Gpus Per Node"
            },
            "executor_gpu_amount": {
               "default": 1,
               "description": "Number of GPUs assigned to each executor.",
               "title": "Executor Gpu Amount",
               "type": "integer"
            },
            "task_gpu_amount": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "GPUs assigned to each task. Defaults to executor_gpu_amount / executor_cores so that concurrent tasks share an executor's GPUs.",
               "title": "Task Gpu Amount"
            },
            "enable_rapids": {
               "default": false,
               "description": "Enable the NVIDIA RAPIDS Accelerator for Apache Spark to offload SQL/DataFrame operations to GPUs. Implies enable_gpus and requires binaries.rapids_jar_file.",
               "title": "Enable Rapids",
               "type": "boolean"
            },
            "spark_log_level": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Set the root log level for all Spark processes. Defaults to Spark's defaults.",
               "title": "Spark Log Level"
            },
            "enable_dynamic_allocation": {
               "default": false,
               "description": "Enable Spark dynamic resource allocation.",
               "title": "Enable Dynamic Allocation",
               "type": "boolean"
            },
            "shuffle_partition_multiplier": {
               "default": 1,
               "description": "Spark SQL shuffle partition multiplier (multiply by the number of worker CPUs)",
               "title": "Shuffle Partition Multiplier",
               "type": "integer"
            },
            "enable_hive_metastore": {
               "default": false,
               "description": "Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.",
               "title": "Enable Hive Metastore",
               "type": "boolean"
            },
            "enable_postgres_hive_metastore": {
               "default": false,
               "description": "Create a metastore with PostgreSQL. Supports multiple Spark sessions.",
               "title": "Enable Postgres Hive Metastore",
               "type": "boolean"
            },
            "postgres_password": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Password for PostgreSQL.",
               "title": "Postgres Password"
            },
            "python_path": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Python path to set for Spark workers. Use the Python inside the Spark distribution by default.",
               "title": "Python Path"
            },
            "spark_defaults_template_file": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.",
               "title": "Spark Defaults Template File"
            }
         },
         "title": "SparkRuntimeParams",
         "type": "object"
      }
   },
   "additionalProperties": false,
   "required": [
      "binaries"
   ]
}
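
According to the schema above, only `binaries` is required at the top level, and within it only `spark_path` and `java_path`. The sketch below builds such a minimal configuration with the standard library and checks it against those two rules. The paths are placeholders, and whether `from_config_file` accepts exactly this JSON layout is an assumption based on the schema, not something this page states.

```python
import json

# Minimal configuration: every omitted field falls back to its schema default.
config = {
    "binaries": {
        "spark_path": "/opt/spark",  # placeholder path
        "java_path": "/opt/java",    # placeholder path
    },
    # Optional override of a single runtime field.
    "runtime": {"start_connect_server": True},
}

# Top-level property names listed in the schema. additionalProperties is false,
# so a key outside this set would be rejected.
ALLOWED_TOP_LEVEL = {
    "binaries",
    "runtime",
    "directories",
    "compute",
    "resource_monitor",
    "app",
}

unknown = set(config) - ALLOWED_TOP_LEVEL
missing = {"spark_path", "java_path"} - set(config.get("binaries", {}))

# Serialized form, suitable for writing to a file such as config.json.
text = json.dumps(config, indent=2)
```

Both `unknown` and `missing` are empty for this configuration, so it satisfies the required-field and no-extra-keys rules of the schema.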

Config:

str_strip_whitespace: bool = True
validate_assignment: bool = True
validate_default: bool = True
extra: str = forbid
use_enum_values: bool = False
arbitrary_types_allowed: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True

Fields:

app (sparkctl.models.AppParams)
binaries (sparkctl.models.BinaryLocations)
compute (sparkctl.models.ComputeParams)
directories (sparkctl.models.RuntimeDirectories)
resource_monitor (sparkctl.models.ResourceMonitorConfig)
runtime (sparkctl.models.SparkRuntimeParams)

field app: AppParams = AppParams(console_level='INFO', file_level='DEBUG', reraise_exceptions=False)

field binaries: BinaryLocations [Required]

field compute: ComputeParams = ComputeParams(environment=<ComputeEnvironment.SLURM: 'slurm'>, use_srun=True, postgres=PostgresScripts(start_container='postgres/start_container.sh', stop_container='postgres/stop_container.sh', setup_metastore='postgres/setup_metastore.sh'))

field directories: RuntimeDirectories = RuntimeDirectories(base=PosixPath('/home/runner/work/sparkctl/sparkctl/docs'), spark_scratch=PosixPath('/home/runner/work/sparkctl/sparkctl/docs/spark_scratch'), metastore_dir=PosixPath('/home/runner/work/sparkctl/sparkctl/docs'))

field resource_monitor: ResourceMonitorConfig = ResourceMonitorConfig(cpu=True, disk=True, memory=True, network=True, interval=5, enabled=False)

field runtime: SparkRuntimeParams = SparkRuntimeParams(executor_cores=None, executor_memory_gb=None, driver_memory_gb=10, node_memory_overhead_gb=10, use_local_storage=False, start_connect_server=False, connect_server_port=15002, start_history_server=False, start_thrift_server=False, start_jupyter=False, jupyter_command='notebook', jupyter_ip='0.0.0.0', jupyter_port=8889, enable_reverse_proxy=False, reverse_proxy_url=None, enable_prometheus=False, enable_metrics_csv=False, metrics_csv_period=10, enable_gpus=False, gpus_per_node=None, executor_gpu_amount=1, task_gpu_amount=None, enable_rapids=False, spark_log_level=None, enable_dynamic_allocation=False, shuffle_partition_multiplier=1, enable_hive_metastore=False, enable_postgres_hive_metastore=False, postgres_password='89480885-995a-4ef7-92dd-17e8babdeece', python_path=None, spark_defaults_template_file=None)
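
The schema descriptions for `executor_cores`, `task_gpu_amount`, and `shuffle_partition_multiplier` each state a derived default. The arithmetic can be sketched as follows. The function names are hypothetical, and reading "divides the node's cores evenly" as integer division is an assumption; sparkctl's actual rounding behavior is not documented here.

```python
def executor_cores_for_gpus(node_cores: int, gpus_per_node: int) -> int:
    """One executor per GPU, with the node's cores split among the executors.

    Assumption: integer division, so leftover cores are left unused.
    """
    return node_cores // gpus_per_node


def default_task_gpu_amount(executor_gpu_amount: int, executor_cores: int) -> float:
    """Documented default: executor_gpu_amount / executor_cores."""
    return executor_gpu_amount / executor_cores


def shuffle_partitions(multiplier: int, worker_cpus: int) -> int:
    """Documented rule: the multiplier times the number of worker CPUs."""
    return multiplier * worker_cpus


# Worked example: a node with 64 cores and 4 GPUs, executor_gpu_amount = 1.
cores = executor_cores_for_gpus(node_cores=64, gpus_per_node=4)
task_gpus = default_task_gpu_amount(executor_gpu_amount=1, executor_cores=cores)

# Two such nodes give 128 worker CPUs; a multiplier of 2 doubles the partitions.
partitions = shuffle_partitions(multiplier=2, worker_cpus=128)
```

With these inputs each executor gets 16 cores, each task requests 1/16 of a GPU so that 16 concurrent tasks share one GPU, and the shuffle partition count is 256.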

pydantic model sparkctl.models.BinaryLocations

Locations to the Spark and dependent software. Hadoop, Hive, and the PostgreSQL jar file are only required if the user wants to enable a Postgres-based Hive metastore.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Show JSON schema

{
   "title": "BinaryLocations",
   "description": "Locations to the Spark and dependent software. Hadoop, Hive, and the PostgreSQL jar file\nare only required if the user wants to enable a Postgres-based Hive metastore.",
   "type": "object",
   "properties": {
      "spark_path": {
         "description": "Path to the Spark binaries.",
         "format": "path",
         "title": "Spark Path",
         "type": "string"
      },
      "java_path": {
         "description": "Path to the Java binaries.",
         "format": "path",
         "title": "Java Path",
         "type": "string"
      },
      "hadoop_path": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the Hadoop binaries.",
         "title": "Hadoop Path"
      },
      "hive_tarball": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the Hive binaries.",
         "title": "Hive Tarball"
      },
      "postgresql_jar_file": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the PostgreSQL jar file.",
         "title": "Postgresql Jar File"
      },
      "rapids_jar_file": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to the NVIDIA RAPIDS Accelerator for Apache Spark jar file. Only required to enable RAPIDS GPU acceleration.",
         "title": "Rapids Jar File"
      }
   },
   "additionalProperties": false,
   "required": [
      "spark_path",
      "java_path"
   ]
}

Config:

str_strip_whitespace: bool = True
validate_assignment: bool = True
validate_default: bool = True
extra: str = forbid
use_enum_values: bool = False
arbitrary_types_allowed: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True

Fields:

hadoop_path (pathlib.Path | None)
hive_tarball (pathlib.Path | None)
java_path (pathlib.Path)
postgresql_jar_file (pathlib.Path | None)
rapids_jar_file (pathlib.Path | None)
spark_path (pathlib.Path)

Validators:

make_absolute » hadoop_path
make_absolute » hive_tarball
make_absolute » java_path
make_absolute » postgresql_jar_file
make_absolute » rapids_jar_file
make_absolute » spark_path

field hadoop_path: Path | None = None

Path to the Hadoop binaries.

Validated by:

make_absolute

field hive_tarball: Path | None = None

Path to the Hive binaries.

Validated by:

make_absolute

field java_path: Path [Required]

Path to the Java binaries.

Validated by:

make_absolute

field postgresql_jar_file: Path | None = None

Path to the PostgreSQL jar file.

Validated by:

make_absolute

field rapids_jar_file: Path | None = None

Path to the NVIDIA RAPIDS Accelerator for Apache Spark jar file. Only required to enable RAPIDS GPU acceleration.

Validated by:

make_absolute

field spark_path: Path [Required]¶

Path to the Spark binaries.

Validated by:

make_absolute

validator make_absolute » java_path, spark_path, hadoop_path, rapids_jar_file, postgresql_jar_file, hive_tarball¶
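
Examples

The validator's name suggests it converts relative paths to absolute ones. A minimal sketch of that idea follows, assuming relative paths are anchored at the current working directory and that None passes through unchanged for the optional fields. The function below is a hypothetical stand-in, not sparkctl's implementation, and whether sparkctl also resolves symlinks is not stated here.

```python
from __future__ import annotations

from pathlib import Path


def make_absolute(path: Path | None) -> Path | None:
    """Hypothetical stand-in: return an absolute path, or None if unset."""
    if path is None:
        # Optional fields such as hadoop_path default to None.
        return None
    # Path.absolute() prefixes the current working directory and does
    # not resolve symlinks (an assumption about the desired behavior).
    return Path(path).absolute()


print(make_absolute(Path("spark")))  # an absolute path under the cwd
print(make_absolute(None))
```

Anchoring paths early means a later change of working directory does not silently change which binaries the configuration points at.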

pydantic model sparkctl.models.ComputeParams¶

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Show JSON schema

{
   "title": "ComputeParams",
   "type": "object",
   "properties": {
      "environment": {
         "$ref": "#/$defs/ComputeEnvironment",
         "default": "slurm"
      },
      "use_srun": {
         "default": true,
         "description": "In a Slurm environment, launch Spark workers with srun instead of ssh. srun forwards the full submission environment (modules, virtual environments, LD_LIBRARY_PATH) to the worker nodes. Set to false to fall back to ssh if a site's Slurm configuration does not work with sparkctl's srun invocation. Has no effect in a native environment.",
         "title": "Use Srun",
         "type": "boolean"
      },
      "postgres": {
         "$ref": "#/$defs/PostgresScripts",
         "default": {
            "start_container": "postgres/start_container.sh",
            "stop_container": "postgres/stop_container.sh",
            "setup_metastore": "postgres/setup_metastore.sh"
         }
      }
   },
   "$defs": {
      "ComputeEnvironment": {
         "description": "Defines the supported compute environments.",
         "enum": [
            "native",
            "slurm",
            "fake"
         ],
         "title": "ComputeEnvironment",
         "type": "string"
      },
      "PostgresScripts": {
         "additionalProperties": false,
         "description": "Scripts that setup a PostgreSQL database for use in a Hive metastore.\nRelative paths are assumed to be based on the root path of the sparkctl package.\nAbsolute paths can be anywhere on the filesystem.",
         "properties": {
            "start_container": {
               "default": "postgres/start_container.sh",
               "title": "Start Container",
               "type": "string"
            },
            "stop_container": {
               "default": "postgres/stop_container.sh",
               "title": "Stop Container",
               "type": "string"
            },
            "setup_metastore": {
               "default": "postgres/setup_metastore.sh",
               "title": "Setup Metastore",
               "type": "string"
            }
         },
         "title": "PostgresScripts",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:

str_strip_whitespace: bool = True
validate_assignment: bool = True
validate_default: bool = True
extra: str = forbid
use_enum_values: bool = False
arbitrary_types_allowed: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True

Fields:

environment (sparkctl.models.ComputeEnvironment)
postgres (sparkctl.models.PostgresScripts)
use_srun (bool)

field environment: ComputeEnvironment = ComputeEnvironment.SLURM¶

field postgres: PostgresScripts = PostgresScripts(start_container='postgres/start_container.sh', stop_container='postgres/stop_container.sh', setup_metastore='postgres/setup_metastore.sh')¶

field use_srun: bool = True¶: In a Slurm environment, launch Spark workers with srun instead of ssh. srun forwards the full submission environment (modules, virtual environments, LD_LIBRARY_PATH) to the worker nodes. Set to false to fall back to ssh if a site’s Slurm configuration does not work with sparkctl’s srun invocation. Has no effect in a native environment.
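
Examples

The interaction between environment and use_srun can be sketched as a small decision function. This is an illustration of the rule described above, not sparkctl code: the enum is a local copy of the values in the schema, worker_launcher is a hypothetical helper, and treating the "fake" environment like "native" is an assumption.

```python
from enum import Enum


class ComputeEnvironment(str, Enum):
    """Local copy of the enum values listed in the JSON schema."""

    NATIVE = "native"
    SLURM = "slurm"
    FAKE = "fake"


def worker_launcher(environment: ComputeEnvironment, use_srun: bool = True) -> str:
    """Hypothetical helper: name the tool used to launch Spark workers."""
    # use_srun only matters in a Slurm environment.
    if environment is ComputeEnvironment.SLURM and use_srun:
        return "srun"
    # Everything else falls back to ssh (assumed for "fake" as well).
    return "ssh"


print(worker_launcher(ComputeEnvironment.SLURM))         # srun
print(worker_launcher(ComputeEnvironment.SLURM, False))  # ssh
print(worker_launcher(ComputeEnvironment.NATIVE))        # ssh
```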

pydantic model sparkctl.models.SparkRuntimeParams¶

Controls Spark runtime parameters.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Show JSON schema

{
   "title": "SparkRuntimeParams",
   "description": "Controls Spark runtime parameters.",
   "type": "object",
   "properties": {
      "executor_cores": {
         "anyOf": [
            {
               "type": "integer"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Number of cores per executor. By default this is auto-determined: when GPUs are enabled, sparkctl runs one executor per GPU and divides the node's cores evenly among them (the NVIDIA-recommended layout); otherwise it defaults to 5.",
         "title": "Executor Cores"
      },
      "executor_memory_gb": {
         "anyOf": [
            {
               "type": "integer"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.",
         "title": "Executor Memory Gb"
      },
      "driver_memory_gb": {
         "default": 10,
         "description": "Driver memory in GB. This is the maximum amount of data that can be pulled into the application.",
         "title": "Driver Memory Gb",
         "type": "integer"
      },
      "node_memory_overhead_gb": {
         "default": 10,
         "description": "Memory to reserve for system processes.",
         "title": "Node Memory Overhead Gb",
         "type": "integer"
      },
      "use_local_storage": {
         "default": false,
         "description": "Use compute node local storage for shuffle data.",
         "title": "Use Local Storage",
         "type": "boolean"
      },
      "start_connect_server": {
         "default": false,
         "description": "Enable the Spark connect server.",
         "title": "Start Connect Server",
         "type": "boolean"
      },
      "connect_server_port": {
         "default": 15002,
         "description": "Port on which the Spark Connect server listens.",
         "title": "Connect Server Port",
         "type": "integer"
      },
      "start_history_server": {
         "default": false,
         "description": "Enable the Spark history server.",
         "title": "Start History Server",
         "type": "boolean"
      },
      "start_thrift_server": {
         "default": false,
         "description": "Enable the Thrift server to connect a SQL client.",
         "title": "Start Thrift Server",
         "type": "boolean"
      },
      "start_jupyter": {
         "default": false,
         "description": "Start a Jupyter server on the master node. Pre-wired to the Spark Connect server when it is enabled (the notebook's SparkSession connects automatically).",
         "title": "Start Jupyter",
         "type": "boolean"
      },
      "jupyter_command": {
         "default": "notebook",
         "description": "Jupyter frontend to launch, i.e. the `jupyter <command>` subcommand. Defaults to the classic 'notebook'; use 'lab' for JupyterLab.",
         "title": "Jupyter Command",
         "type": "string"
      },
      "jupyter_ip": {
         "default": "0.0.0.0",
         "description": "IP address the Jupyter server binds to. Defaults to all interfaces so it can be reached by tunneling to the compute node's hostname through a login node (the common HPC pattern); access is protected by Jupyter's token. Set to 127.0.0.1 to bind to localhost only, which requires tunneling directly into the compute node.",
         "title": "Jupyter Ip",
         "type": "string"
      },
      "jupyter_port": {
         "default": 8889,
         "description": "Port on which the Jupyter server listens.",
         "title": "Jupyter Port",
         "type": "integer"
      },
      "enable_reverse_proxy": {
         "default": false,
         "description": "Run the Spark master as a reverse proxy for the worker and application web UIs. Useful on HPC clusters where the compute nodes are not directly reachable, so the UIs are served through the master node only.",
         "title": "Enable Reverse Proxy",
         "type": "boolean"
      },
      "reverse_proxy_url": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "External URL used to reach the Spark master UI when reverse proxy is enabled and the master is itself behind another front-end proxy. Leave unset to serve relative links (recommended when reaching the master through an SSH tunnel).",
         "title": "Reverse Proxy Url"
      },
      "enable_prometheus": {
         "default": false,
         "description": "Expose Spark metrics in Prometheus format through the existing web UI ports (no extra ports are opened).",
         "title": "Enable Prometheus",
         "type": "boolean"
      },
      "enable_metrics_csv": {
         "default": false,
         "description": "Write Spark metrics to CSV files in <base>/metrics-csv. Unlike the Prometheus sink, this leaves a durable record on disk after the cluster shuts down.",
         "title": "Enable Metrics Csv",
         "type": "boolean"
      },
      "metrics_csv_period": {
         "default": 10,
         "description": "Interval in seconds at which the CSV metrics sink writes samples.",
         "title": "Metrics Csv Period",
         "type": "integer"
      },
      "enable_gpus": {
         "default": false,
         "description": "Enable GPU-aware scheduling. Spark workers advertise GPUs and executors/tasks request them. Requires GPUs on the worker nodes.",
         "title": "Enable Gpus",
         "type": "boolean"
      },
      "gpus_per_node": {
         "anyOf": [
            {
               "type": "integer"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Number of GPUs available on each worker node. Auto-detected from the compute environment by default.",
         "title": "Gpus Per Node"
      },
      "executor_gpu_amount": {
         "default": 1,
         "description": "Number of GPUs assigned to each executor.",
         "title": "Executor Gpu Amount",
         "type": "integer"
      },
      "task_gpu_amount": {
         "anyOf": [
            {
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "GPUs assigned to each task. Defaults to executor_gpu_amount / executor_cores so that concurrent tasks share an executor's GPUs.",
         "title": "Task Gpu Amount"
      },
      "enable_rapids": {
         "default": false,
         "description": "Enable the NVIDIA RAPIDS Accelerator for Apache Spark to offload SQL/DataFrame operations to GPUs. Implies enable_gpus and requires binaries.rapids_jar_file.",
         "title": "Enable Rapids",
         "type": "boolean"
      },
      "spark_log_level": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Set the root log level for all Spark processes. Defaults to Spark's defaults.",
         "title": "Spark Log Level"
      },
      "enable_dynamic_allocation": {
         "default": false,
         "description": "Enable Spark dynamic resource allocation.",
         "title": "Enable Dynamic Allocation",
         "type": "boolean"
      },
      "shuffle_partition_multiplier": {
         "default": 1,
         "description": "Spark SQL shuffle partition multiplier (multipy by the number of worker CPUs)",
         "title": "Shuffle Partition Multiplier",
         "type": "integer"
      },
      "enable_hive_metastore": {
         "default": false,
         "description": "Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.",
         "title": "Enable Hive Metastore",
         "type": "boolean"
      },
      "enable_postgres_hive_metastore": {
         "default": false,
         "description": "Create a metastore with PostgreSQL. Supports multiple Spark sessions.",
         "title": "Enable Postgres Hive Metastore",
         "type": "boolean"
      },
      "postgres_password": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Password for PostgreSQL.",
         "title": "Postgres Password"
      },
      "python_path": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Python path to set for Spark workers. Use the Python inside the Spark distribution by default.",
         "title": "Python Path"
      },
      "spark_defaults_template_file": {
         "anyOf": [
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.",
         "title": "Spark Defaults Template File"
      }
   },
   "additionalProperties": false
}

Config:

str_strip_whitespace: bool = True
validate_assignment: bool = True
validate_default: bool = True
extra: str = forbid
use_enum_values: bool = False
arbitrary_types_allowed: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True

Fields:

connect_server_port (int)
driver_memory_gb (int)
enable_dynamic_allocation (bool)
enable_gpus (bool)
enable_hive_metastore (bool)
enable_metrics_csv (bool)
enable_postgres_hive_metastore (bool)
enable_prometheus (bool)
enable_rapids (bool)
enable_reverse_proxy (bool)
executor_cores (int | None)
executor_gpu_amount (int)
executor_memory_gb (int | None)
gpus_per_node (int | None)
jupyter_command (str)
jupyter_ip (str)
jupyter_port (int)
metrics_csv_period (int)
node_memory_overhead_gb (int)
postgres_password (str | None)
python_path (str | None)
reverse_proxy_url (str | None)
shuffle_partition_multiplier (int)
spark_defaults_template_file (pathlib.Path | None)
spark_log_level (str | None)
start_connect_server (bool)
start_history_server (bool)
start_jupyter (bool)
start_thrift_server (bool)
task_gpu_amount (float | None)
use_local_storage (bool)

Validators:

set_postgres_password » postgres_password

field connect_server_port: int = 15002¶: Port on which the Spark Connect server listens.

field driver_memory_gb: int = 10¶: Driver memory in GB. This is the maximum amount of data that can be pulled into the application.

field enable_dynamic_allocation: bool = False¶: Enable Spark dynamic resource allocation.

field enable_gpus: bool = False¶: Enable GPU-aware scheduling. Spark workers advertise GPUs and executors/tasks request them. Requires GPUs on the worker nodes.

field enable_hive_metastore: bool = False¶: Create a Hive metastore with Spark defaults (Apache Derby). Supports only one Spark session.

field enable_metrics_csv: bool = False¶: Write Spark metrics to CSV files in <base>/metrics-csv. Unlike the Prometheus sink, this leaves a durable record on disk after the cluster shuts down.

field enable_postgres_hive_metastore: bool = False¶: Create a metastore with PostgreSQL. Supports multiple Spark sessions.

field enable_prometheus: bool = False¶: Expose Spark metrics in Prometheus format through the existing web UI ports (no extra ports are opened).

field enable_rapids: bool = False¶: Enable the NVIDIA RAPIDS Accelerator for Apache Spark to offload SQL/DataFrame operations to GPUs. Implies enable_gpus and requires binaries.rapids_jar_file.

field enable_reverse_proxy: bool = False¶: Run the Spark master as a reverse proxy for the worker and application web UIs. Useful on HPC clusters where the compute nodes are not directly reachable, so the UIs are served through the master node only.

field executor_cores: int | None = None¶: Number of cores per executor. By default this is auto-determined: when GPUs are enabled, sparkctl runs one executor per GPU and divides the node’s cores evenly among them (the NVIDIA-recommended layout); otherwise it defaults to 5.
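
Examples

The default described above can be worked through with a short sketch. The helper name and the cores_per_node parameter are hypothetical, and rounding down with integer division is an assumption. The description only says the cores are divided evenly among one executor per GPU.

```python
from __future__ import annotations


def default_executor_cores(
    cores_per_node: int,
    enable_gpus: bool,
    gpus_per_node: int | None = None,
) -> int:
    """Hypothetical helper mirroring the documented default."""
    if not enable_gpus:
        # Documented CPU-only default.
        return 5
    if not gpus_per_node:
        raise ValueError("gpus_per_node must be a positive integer when GPUs are enabled")
    # One executor per GPU; split the node's cores evenly (rounded down).
    return max(1, cores_per_node // gpus_per_node)


print(default_executor_cores(64, enable_gpus=True, gpus_per_node=4))  # 16
print(default_executor_cores(64, enable_gpus=False))                  # 5
```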

field executor_gpu_amount: int = 1¶: Number of GPUs assigned to each executor.

field executor_memory_gb: int | None = None¶: Memory per executor in GB. By default, auto-determine by using what is available. This can also be set implicitly by increasing executor_cores.

field gpus_per_node: int | None = None¶: Number of GPUs available on each worker node. Auto-detected from the compute environment by default.

field jupyter_command: str = 'notebook'¶: Jupyter frontend to launch, i.e. the jupyter <command> subcommand. Defaults to the classic ‘notebook’; use ‘lab’ for JupyterLab.

field jupyter_ip: str = '0.0.0.0'¶: IP address the Jupyter server binds to. Defaults to all interfaces so it can be reached by tunneling to the compute node’s hostname through a login node (the common HPC pattern); access is protected by Jupyter’s token. Set to 127.0.0.1 to bind to localhost only, which requires tunneling directly into the compute node.
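
Examples

The tunneling pattern mentioned above is typically an SSH local port forward through the login node. The sketch below only builds the command line. The host names are placeholders, the helper is hypothetical, and -N and -L are standard OpenSSH options rather than anything sparkctl provides.

```python
from __future__ import annotations

import shlex


def jupyter_tunnel_command(
    login_node: str,
    compute_node: str,
    port: int = 8889,
    user: str | None = None,
) -> list[str]:
    """Hypothetical helper: build an ssh command that forwards the Jupyter port."""
    target = f"{user}@{login_node}" if user else login_node
    # -N: do not run a remote command; -L: forward a local port to
    # compute_node:port as seen from the login node.
    return ["ssh", "-N", "-L", f"{port}:{compute_node}:{port}", target]


# Placeholder host names; substitute the real login and compute nodes.
cmd = jupyter_tunnel_command("login.example.org", "node0042", user="alice")
print(shlex.join(cmd))
```

With the tunnel open, the notebook would be reachable on localhost at the forwarded port, still protected by Jupyter's token.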

field jupyter_port: int = 8889¶: Port on which the Jupyter server listens.

field metrics_csv_period: int = 10¶: Interval in seconds at which the CSV metrics sink writes samples.

field node_memory_overhead_gb: int = 10¶: Memory to reserve for system processes.

field postgres_password: str | None = None¶

Password for PostgreSQL.

Validated by:

set_postgres_password

field python_path: str | None = None¶: Python path to set for Spark workers. Use the Python inside the Spark distribution by default.

field reverse_proxy_url: str | None = None¶: External URL used to reach the Spark master UI when reverse proxy is enabled and the master is itself behind another front-end proxy. Leave unset to serve relative links (recommended when reaching the master through an SSH tunnel).

field shuffle_partition_multiplier: int = 1¶: Spark SQL shuffle partition multiplier (multiplied by the number of worker CPUs).
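
Examples

A worked example of the multiplier, assuming "the number of worker CPUs" means the total across all worker nodes and that the product becomes the Spark SQL shuffle partition count. The helper and its parameters are hypothetical.

```python
def shuffle_partitions(multiplier: int, num_workers: int, cpus_per_worker: int) -> int:
    """Hypothetical helper: multiplier times the total number of worker CPUs."""
    return multiplier * num_workers * cpus_per_worker


# 4 worker nodes with 36 CPUs each.
print(shuffle_partitions(1, 4, 36))  # 144
print(shuffle_partitions(2, 4, 36))  # 288
```

A larger multiplier yields more, smaller shuffle partitions, which can help when individual partitions are too large to process comfortably.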

field spark_defaults_template_file: Path | None = None¶: Path to a custom spark-defaults.conf template file. If not set, use the sparkctl defaults.

field spark_log_level: str | None = None¶: Set the root log level for all Spark processes. Defaults to Spark’s defaults.

field start_connect_server: bool = False¶: Enable the Spark Connect server.

field start_history_server: bool = False¶: Enable the Spark history server.

field start_jupyter: bool = False¶: Start a Jupyter server on the master node. Pre-wired to the Spark Connect server when it is enabled (the notebook’s SparkSession connects automatically).

field start_thrift_server: bool = False¶: Enable the Thrift server to connect a SQL client.

field task_gpu_amount: float | None = None¶: GPUs assigned to each task. Defaults to executor_gpu_amount / executor_cores so that concurrent tasks share an executor’s GPUs.
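
Examples

The default is a simple ratio. The sketch below works it through, assuming each task uses one core, so an executor runs executor_cores tasks concurrently. The helper name is hypothetical.

```python
def default_task_gpu_amount(executor_gpu_amount: int, executor_cores: int) -> float:
    """Hypothetical helper: fraction of a GPU each concurrent task requests."""
    return executor_gpu_amount / executor_cores


# One GPU per executor, 16 cores per executor: 16 tasks share the GPU.
print(default_task_gpu_amount(1, 16))  # 0.0625
print(default_task_gpu_amount(1, 4))   # 0.25
```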

field use_local_storage: bool = False¶: Use compute node local storage for shuffle data.

validator set_postgres_password » postgres_password¶
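
Examples

A sketch of a runtime section for a JSON config file, built from the field names and defaults listed above. The top-level "runtime" key is an assumption inferred from the config.runtime attribute. Fields left out keep their defaults.

```python
import json

# Only fields documented for SparkRuntimeParams are used here.
runtime = {
    "start_connect_server": True,
    "connect_server_port": 15002,
    "enable_metrics_csv": True,
    "metrics_csv_period": 10,
    "shuffle_partition_multiplier": 2,
}

# Assumed layout: runtime parameters nested under a "runtime" key.
fragment = json.dumps({"runtime": runtime}, indent=2)
print(fragment)
```

Because the model forbids extra fields (extra = forbid), a misspelled key in this section would be rejected at validation time rather than silently ignored.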

pydantic model sparkctl.models.RuntimeDirectories¶

Defines the directories to be used by a Spark cluster.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Show JSON schema

{
   "title": "RuntimeDirectories",
   "description": "Defines the directories to be used by a Spark cluster.",
   "type": "object",
   "properties": {
      "base": {
         "default": ".",
         "description": "Base directory for the cluster configuration",
         "format": "path",
         "title": "Base",
         "type": "string"
      },
      "spark_scratch": {
         "default": "spark_scratch",
         "description": "Directory to use for shuffle data. Use a dedicated directory: `sparkctl clean` deletes it recursively, even when it is outside the base configuration directory.",
         "format": "path",
         "title": "Spark Scratch",
         "type": "string"
      },
      "metastore_dir": {
         "default": ".",
         "description": "Set a custom directory for the metastore and warehouse.",
         "format": "path",
         "title": "Metastore Dir",
         "type": "string"
      }
   },
   "additionalProperties": false
}

Config:

str_strip_whitespace: bool = True
validate_assignment: bool = True
validate_default: bool = True
extra: str = forbid
use_enum_values: bool = False
arbitrary_types_allowed: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True

Fields:

base (pathlib.Path)
metastore_dir (pathlib.Path)
spark_scratch (pathlib.Path)

Validators:

make_absolute » base
make_absolute » metastore_dir
make_absolute » spark_scratch

field base: Path = PosixPath('.')¶

Base directory for the cluster configuration

Validated by:

make_absolute

field metastore_dir: Path = PosixPath('.')¶

Set a custom directory for the metastore and warehouse.

Validated by:

make_absolute

field spark_scratch: Path = PosixPath('spark_scratch')¶

Directory to use for shuffle data. Use a dedicated directory: sparkctl clean deletes it recursively, even when it is outside the base configuration directory.

Validated by:

make_absolute
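
Examples

Because the scratch directory is deleted recursively on clean, it can be worth checking a candidate path before pointing spark_scratch at it. The following is a hypothetical pre-flight check, not part of sparkctl. It accepts only a path that does not exist yet or an empty directory.

```python
import tempfile
from pathlib import Path


def is_safe_scratch_dir(path: Path) -> bool:
    """Hypothetical check: True if path is absent or an empty directory."""
    if not path.exists():
        return True
    # An existing directory with contents may hold unrelated data.
    return path.is_dir() and not any(path.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    print(is_safe_scratch_dir(root / "spark_scratch"))  # True: does not exist yet
    (root / "results.csv").write_text("a,b\n")
    print(is_safe_scratch_dir(root))                    # False: holds other files
```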

validator make_absolute » base, metastore_dir, spark_scratch¶

clean_spark_conf_dir() → Path¶: Ensure that the Spark conf dir exists and is clean.

get_events_dir() → Path¶: Return the path to the events directory.

get_gpu_discovery_script_file() → Path¶: Return the file path to the GPU discovery script.

get_hive_site_file() → Path¶: Return the file path to hive-site.xml

get_metrics_properties_file() → Path¶: Return the file path to metrics.properties

get_spark_conf_dir() → Path¶: Return the Spark conf directory

get_spark_defaults_file() → Path¶: Return the file path to spark-defaults.conf

get_spark_env_file() → Path¶: Return the file path to spark-env.sh

get_spark_log_file() → Path¶: Return the file path to the log properties file

get_workers_file() → Path¶: Return the file path to the workers file

class sparkctl.models.ComputeEnvironment(*values)¶: Defines the supported compute environments.