How to run a Jupyter notebook against the cluster

sparkctl can start a Jupyter server on the master node so you can run interactive notebooks against the Spark cluster. When the Spark Connect server is enabled, the notebook’s SparkSession connects to the cluster automatically.

By default sparkctl launches the classic notebook (jupyter notebook), which is a good fit for a single-user cluster. Use --jupyter-command lab if you prefer JupyterLab.

Prerequisites

Install Jupyter in the same environment as sparkctl. The jupyter extra pulls in the classic notebook frontend:

$ pip install "sparkctl[jupyter]"    # or: uv pip install "sparkctl[jupyter]"

If you want JupyterLab instead, install jupyterlab and pass --jupyter-command lab.

Start Jupyter with the Connect server

The recommended setup enables the Spark Connect server so the notebook connects remotely without any extra configuration:

$ sparkctl configure --connect-server --jupyter --start

sparkctl sets SPARK_REMOTE for the Jupyter process, so inside a notebook you can simply do:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"]).show()
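
Before calling getOrCreate(), it can help to confirm what the kernel actually sees in SPARK_REMOTE. The following is a minimal sketch using only the standard library. The helper name describe_spark_remote is hypothetical (it is not part of sparkctl or PySpark), and it assumes the variable holds a Spark Connect URL of the form sc://host:port.

```python
import os
from urllib.parse import urlparse


def describe_spark_remote(env=None):
    """Return (host, port) parsed from SPARK_REMOTE, or None if it is unset.

    Hypothetical helper for illustration; assumes an sc://host:port value.
    The port is None when the URL does not name one.
    """
    env = os.environ if env is None else env
    remote = env.get("SPARK_REMOTE")
    if not remote:
        return None
    parsed = urlparse(remote)
    if parsed.scheme != "sc" or parsed.hostname is None:
        raise ValueError(f"unexpected SPARK_REMOTE value: {remote!r}")
    return parsed.hostname, parsed.port


if __name__ == "__main__":
    target = describe_spark_remote()
    if target is None:
        print("SPARK_REMOTE is not set; the notebook will not auto-connect")
    else:
        host, port = target
        print(f"SPARK_REMOTE points at host={host} port={port}")
```

If this reports that the variable is unset, the cluster was most likely configured without --connect-server.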

Connect to the server

When the server starts, sparkctl logs the node it is running on, a ready-to-use SSH tunnel command, and the access URL (with token). It looks like:

Jupyter is running on x1000c0s0b0n0 (port 8889). From your laptop, open an SSH tunnel:
    ssh -L 8889:x1000c0s0b0n0:8889 <your-hpc-login-host>
then browse to:
    http://localhost:8889/tree?token=<token>

Run the ssh command from your laptop (replacing <your-hpc-login-host> with your cluster’s login host), then open the http://localhost:8889/... URL in your browser. The same information is always available in jupyter.log in the cluster base directory. Change the port with --jupyter-port.
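
If you want to pull the connection details out of jupyter.log programmatically, a sketch along these lines may work. It assumes the banner has exactly the wording shown above; the real log format may differ between sparkctl versions. The names BANNER_RE, URL_RE, and parse_banner are hypothetical and not part of sparkctl.

```python
import re

# Patterns written against the example banner above (an assumption).
BANNER_RE = re.compile(r"Jupyter is running on (?P<node>\S+) \(port (?P<port>\d+)\)")
URL_RE = re.compile(r"http://localhost:\d+/\S*\?token=\S+")


def parse_banner(text):
    """Extract node, port, and access URL from log text.

    Returns None when no banner line is found; "url" is None when the
    banner is present but no tokenized URL follows it.
    """
    banner = BANNER_RE.search(text)
    if banner is None:
        return None
    url = URL_RE.search(text)
    return {
        "node": banner.group("node"),
        "port": int(banner.group("port")),
        "url": url.group(0) if url else None,
    }


if __name__ == "__main__":
    sample = (
        "Jupyter is running on x1000c0s0b0n0 (port 8889). "
        "From your laptop, open an SSH tunnel:\n"
        "    ssh -L 8889:x1000c0s0b0n0:8889 <your-hpc-login-host>\n"
        "then browse to:\n"
        "    http://localhost:8889/tree?token=<token>\n"
    )
    print(parse_banner(sample))
```

To run it against a real cluster, read the log first, for example with pathlib.Path("jupyter.log").read_text() from the cluster base directory, and pass the text to parse_banner.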

By default Jupyter listens on all interfaces (0.0.0.0), so you can reach it by tunneling through a login node to the compute node’s hostname, which is the portable HPC pattern. Access is protected by Jupyter’s token.

Note

To bind to localhost only, pass --jupyter-ip 127.0.0.1. The server is then off the cluster network, but you must tunnel directly into the compute node, e.g. ssh -J <hpc-login-host> -L 8889:localhost:8889 <node>.
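
The two tunnel forms differ only in where the forward terminates. As a rough illustration, the sketch below assembles either command from the node name, login host, and port. The function tunnel_command and its parameters are hypothetical; the ssh options used (-L for a local forward, -J for a jump host) are standard OpenSSH flags.

```python
import shlex


def tunnel_command(node, login_host, port, bound_to_localhost=False):
    """Build the ssh command for reaching Jupyter on a compute node.

    Hypothetical helper for illustration only.
    - Default (Jupyter on 0.0.0.0): forward through the login host to the node.
    - bound_to_localhost=True (--jupyter-ip 127.0.0.1): jump through the login
      host and forward to localhost on the compute node itself.
    """
    if bound_to_localhost:
        args = ["ssh", "-J", login_host, "-L", f"{port}:localhost:{port}", node]
    else:
        args = ["ssh", "-L", f"{port}:{node}:{port}", login_host]
    # shlex.join quotes any argument that needs it for safe pasting into a shell.
    return shlex.join(args)


if __name__ == "__main__":
    # "login.example.org" is a placeholder for your cluster's login host.
    print(tunnel_command("x1000c0s0b0n0", "login.example.org", 8889))
    print(tunnel_command("x1000c0s0b0n0", "login.example.org", 8889,
                         bound_to_localhost=True))
```

The localhost form requires that you can SSH into the compute node directly, which some sites restrict to nodes where you have a running job.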

Working in the notebook

The notebook server runs in the same environment as sparkctl, so its Python kernel already has pyspark-client available. With the Connect server enabled, every notebook connects to the cluster through SPARK_REMOTE, so you only need to call SparkSession.builder.getOrCreate() as shown above. See the Spark Connect tutorial for more on what the Connect client supports.

Notebooks are served from the cluster base directory, so any notebooks you create are saved there and persist after the cluster stops.

Troubleshooting

The browser cannot connect. Confirm the SSH tunnel from the startup banner is running, and that you opened the URL with its ?token=... (copy it from the banner or jupyter.log). If you set --jupyter-ip 127.0.0.1, the tunnel must terminate on the compute node itself.
Address already in use. Another process holds the port; choose a different one with --jupyter-port.
SparkSession cannot reach the cluster. Make sure you configured with --connect-server; without it SPARK_REMOTE is not set and the notebook will not auto-connect.
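
For the first two cases, a quick TCP probe can narrow things down. This is a minimal sketch using only the standard library; port_accepts_connections is a hypothetical helper, not a sparkctl command. Run on your laptop, it tells you whether the local end of the SSH tunnel is listening. Run on the master node before starting Jupyter, a True result suggests the port is already taken.

```python
import socket


def port_accepts_connections(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    port = 8889  # the port from the startup banner or --jupyter-port
    if port_accepts_connections(port):
        print(f"Something is listening on localhost:{port}")
    else:
        print(f"Nothing is listening on localhost:{port}")
```

The probe only shows that a listener exists. It cannot tell whether that listener is your tunnel or an unrelated process, so treat the result as a hint rather than a diagnosis.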

Reducing log noise

If jupyter.log is noisy, the most verbose tracebacks come from optional integrations, not from sparkctl. Language-server probing is disabled automatically, but if a jupyterlab package is installed in your environment, it runs a build check at startup that can log a Node/yarn error. The error is harmless. Installing only the classic notebook (pip install "sparkctl[jupyter]", without jupyterlab) avoids it.

Stopping

sparkctl stop shuts the Jupyter server down along with the rest of the cluster.

Note

If you enable --jupyter without --connect-server, Jupyter still starts, but the notebook is responsible for creating its own SparkSession (for example, a local driver that connects to spark://<master>:7077). The Connect server path is recommended.