Quantum Computing#
NLR provides a GPU-accelerated quantum computing environment on Kestrel through the
qiskit/aer-gpu module. The module bundles several frameworks for quantum circuit
simulation and hybrid quantum-classical algorithms:
| Package | Version | Description |
|---|---|---|
| Qiskit | 2.2.3 | Quantum circuit construction, transpilation, and execution |
| qiskit-aer-gpu | 0.15.1 | GPU-accelerated statevector, density matrix, and shot simulators (H100, sm_90) |
| qiskit-algorithms | 0.4.0 | VQE, QAOA, QAE, amplitude estimation primitives |
| CUDA-Q | 0.14.0 | NVIDIA's unified quantum-classical programming model (Python + C++) |
| cuQuantum | 26.3.0 | NVIDIA cuStateVec and cuTensorNet GPU backends |
| PyTorch | 2.10.0 | (inherited) for classical ML integration |
| mpi4py | 4.1.1 | (inherited) for multi-node parallel workflows |
Note
The qiskit/aer-gpu module targets the H100 GPUs (sm_90) on Kestrel's GPU nodes.
Loading the Module#
Use a GPU login node or an allocated GPU compute node when working with the quantum computing module:

```bash
module load qiskit/aer-gpu
python3 your_script.py
```
To see what the module provides:

```bash
module help qiskit/aer-gpu
```
Qiskit Aer GPU Simulation#
qiskit-aer-gpu replaces the CPU AerSimulator with a GPU-accelerated backend.
See qae_example_backends.ipynb
for a worked example comparing CPU and GPU backends for a QAE circuit.
The simplest way to enable GPU simulation is:

```python
from qiskit import transpile
from qiskit.circuit.library import QFT
from qiskit_aer import AerSimulator

simulator = AerSimulator(method='statevector', device='GPU')

qc = QFT(20)  # 20-qubit Quantum Fourier Transform
qc.measure_all()

t_qc = transpile(qc, simulator)
result = simulator.run(t_qc, shots=1024).result()
counts = result.get_counts()
print(counts.most_frequent())
```
For circuits with more than ~25 qubits, enable cuStateVec for better performance:

```python
simulator = AerSimulator(
    method='statevector',
    device='GPU',
    cuStateVec_enable=True,
)
```
CUDA-Q#
CUDA-Q is NVIDIA's unified quantum-classical programming framework. Kernels are written in Python (or C++) and JIT-compiled to run on CPUs, GPUs, or QPU hardware.
Single GPU#
See binary_optimizer.py
and qae.py
for real-world examples using the single-GPU target.
```python
import cudaq

cudaq.set_target('nvidia')  # single H100, complex<float> (fp32)

@cudaq.kernel
def bell():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q)

result = cudaq.sample(bell, shots_count=1000)
print(result)  # { 00:~500 11:~500 }
```
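Conceptually, `sample` draws shots from the exact statevector distribution the GPU computes. As a plain-Python illustration (not the CUDA-Q API — a hand-rolled 2-qubit statevector with basis index q1·2 + q0), the Bell circuit puts all probability on 00 and 11:

```python
import math

# 2-qubit statevector over basis |q1 q0> : index = q1*2 + q0
state = [1.0, 0.0, 0.0, 0.0]  # start in |00>

# Hadamard on q0: mixes amplitude pairs that differ only in bit 0
s = 1 / math.sqrt(2)
state = [s * (state[0] + state[1]),
         s * (state[0] - state[1]),
         s * (state[2] + state[3]),
         s * (state[2] - state[3])]

# CNOT with control q0, target q1: swaps |01> and |11>
state[1], state[3] = state[3], state[1]

probs = [abs(a) ** 2 for a in state]
print(probs)  # ~[0.5, 0, 0, 0.5] -> only 00 and 11 ever appear in the counts
```

Sampling 1000 shots from this distribution gives the roughly 500/500 split shown above; the deviation from exactly 500 is shot noise.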
Noisy Simulation#
CUDA-Q supports trajectory-based noisy simulation using a depolarizing noise model.
See binary_optimizer.py
for a production example with per-gate noise channels applied to a QAE circuit.
```python
import cudaq

cudaq.set_target('nvidia')

noise = cudaq.NoiseModel()
dep1 = cudaq.Depolarization1(0.0001)  # single-qubit gate error rate
dep2 = cudaq.Depolarization2(0.001)   # two-qubit gate error rate
for g in ['h', 'x', 'y', 'z', 'rx', 'ry', 'rz']:
    noise.add_all_qubit_channel(g, dep1)
for g in ['cx', 'cz', 'cry']:
    noise.add_all_qubit_channel(g, dep2)

# `bell` is the kernel defined in the single-GPU example above
result = cudaq.sample(bell, shots_count=1000, noise_model=noise)
```
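A quick way to sanity-check these error rates is a back-of-envelope fidelity estimate: under independent depolarizing noise, the probability that a trajectory runs error-free is roughly (1 − p1)^n1 · (1 − p2)^n2 for n1 single-qubit and n2 two-qubit gates. A stdlib sketch (the gate counts below are hypothetical, not from a real circuit):

```python
def est_error_free_prob(n_1q: int, n_2q: int,
                        p_1q: float = 0.0001, p_2q: float = 0.001) -> float:
    """Probability that no gate error fires, assuming independent
    depolarizing errors with the per-gate rates used above."""
    return (1 - p_1q) ** n_1q * (1 - p_2q) ** n_2q

# Example: a circuit with 100 single-qubit and 50 two-qubit gates
p = est_error_free_prob(100, 50)
print(f"{p:.3f}")  # ~0.942 -> roughly 94% of trajectories see no error
```

This is only a rule of thumb, but it helps decide whether a shot budget is large enough to resolve the signal above the noise floor.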
Multi-QPU Shot-Parallel (mqpu)#
The mqpu option assigns shots to multiple GPUs in parallel. Each GPU holds its own
copy of the full statevector and executes an independent subset of shots — useful
for shot-noisy sweeps where the circuit fits in a single GPU's memory.
See par_benchmark_2gpu.py
for a DQA benchmark using this strategy on Kestrel H100s.
```python
import cudaq

cudaq.mpi.initialize()
cudaq.set_target('nvidia', option='mqpu')

@cudaq.kernel
def ansatz(theta: float):
    q = cudaq.qvector(28)
    h(q[0])
    ry(theta, q[1])
    cx(q[1], q[0])
    mz(q)

# Each of the 4 GPUs handles ~shots/4 shots independently
result = cudaq.sample(ansatz, 0.5, shots_count=4096)
cudaq.mpi.finalize()
```
Submit with a batch script that requests all four GPUs on one node:

```bash
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --mem=320G
#SBATCH --gpus-per-node=4

module load qiskit/aer-gpu
srun -n 4 python3 my_mqpu_script.py  # one rank per GPU
```
mqpu vs mpi4py vs mgpu

| Strategy | When to use | Circuit size |
|---|---|---|
| `mqpu` | Shot-parallel, built into CUDA-Q, no MPI boilerplate | Fits in 1 GPU |
| `mpi4py` shot-split | Noisy simulation, fine-grained rank control | Fits in 1 GPU |
| `mgpu` | Circuit exceeds single-GPU memory | > 30 qubits |
For noisy circuits that fit in one GPU (≤28 qubits), mpi4py shot-splitting
typically outperforms mqpu due to lower inter-process overhead.
See the parallelization report
for benchmarks on Kestrel H100s.
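When splitting shots across ranks, plain integer division silently drops shots whenever the total is not a multiple of the rank count. A small stdlib helper (hypothetical, not part of CUDA-Q) that distributes the remainder:

```python
def split_shots(total_shots: int, n_ranks: int) -> list[int]:
    """Balanced per-rank shot counts: the first (total % n_ranks)
    ranks each take one extra shot so the sum is exact."""
    base, extra = divmod(total_shots, n_ranks)
    return [base + (1 if r < extra else 0) for r in range(n_ranks)]

# Rank r would run split_shots(N_SHOTS, comm.size)[comm.rank] shots
print(split_shots(4096, 3))  # [1366, 1365, 1365]
```

Using the per-rank value instead of `total // size` keeps the aggregated histogram at exactly the requested shot count.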
Multi-GPU Statevector (mgpu)#
The mgpu option of the nvidia target distributes the statevector across multiple
GPUs using cuStateVec, enabling simulation of circuits that exceed single-GPU memory:
| GPUs (H100 80 GB each) | fp32 qubit limit | Total statevector memory |
|---|---|---|
| 1 | ~33 qubits | ~64 GB |
| 4 | ~35 qubits | ~256 GB (~64 GB/GPU) |
| 8 | ~36 qubits | ~512 GB (~64 GB/GPU) |
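The qubit limits in the table follow directly from statevector size: an n-qubit fp32 statevector needs 2^n × 8 bytes (complex<float>). Assuming ~64 GB of the 80 GB per H100 is usable for the statevector (the remainder goes to workspace), a stdlib check reproduces the table:

```python
def statevector_bytes(n_qubits: int, fp64: bool = False) -> int:
    """Memory for a dense n-qubit statevector:
    8 B per amplitude at fp32 (complex<float>), 16 B at fp64."""
    return (16 if fp64 else 8) * (1 << n_qubits)

def max_qubits(n_gpus: int, usable_per_gpu: int = 64 * 2**30) -> int:
    """Largest fp32 statevector that fits across n_gpus GPUs."""
    n = 0
    while statevector_bytes(n + 1) <= n_gpus * usable_per_gpu:
        n += 1
    return n

for g in (1, 4, 8):
    print(g, max_qubits(g))  # 1 -> 33, 4 -> 35, 8 -> 36
```

Each doubling of GPU count buys exactly one extra qubit, which is why mgpu scaling flattens quickly for large circuits.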
On Kestrel, multi-GPU requires GPU-aware MPI via the Cray GTL library. Use the
wrapper script below to set the required environment variables before Python
starts — they cannot be set with os.environ inside the script.
See bench_mgpu_dqa.py
and par_benchmark_mgpu.py
for full DQA benchmarking examples.
```bash
#!/bin/bash
export MPICH_GPU_SUPPORT_ENABLED=1
export LD_LIBRARY_PATH=/nopt/cuda/12.4/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_PRELOAD=/opt/cray/pe/mpich/8.1.28/gtl/lib/libmpi_gtl_cuda.so
export CUDAQ_MGPU_LIB_MPI=/opt/cray/pe/mpich/8.1.28/ofi/gnu/10.3/lib/libmpi.so
export CUDAQ_MGPU_COMM_PLUGIN_TYPE=MPICH
exec python3 "$@"
```
```python
import cudaq

cudaq.mpi.initialize()
cudaq.set_target('nvidia', option='mgpu,fp32')

@cudaq.kernel
def large_circuit(n: int):
    q = cudaq.qvector(n)
    h(q[0])
    for i in range(n - 1):
        cx(q[i], q[i + 1])
    mz(q)

result = cudaq.sample(large_circuit, 34, shots_count=256)
cudaq.mpi.finalize()
```
Example Batch Scripts#
Single GPU — Qiskit Aer#
```bash
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --mem=80G
#SBATCH --gpus=1
#SBATCH --time=01:00:00
#SBATCH --job-name=qiskit_gpu

module load qiskit/aer-gpu
python3 my_qiskit_script.py
```
Single GPU — CUDA-Q#
```bash
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --mem=80G
#SBATCH --gpus=1
#SBATCH --time=01:00:00
#SBATCH --job-name=cudaq_single

module load qiskit/aer-gpu
python3 my_cudaq_script.py
```
Multi-GPU (4 GPUs, single node) — CUDA-Q mgpu#
```bash
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --mem=320G
#SBATCH --gpus-per-node=4
#SBATCH --time=02:00:00
#SBATCH --job-name=cudaq_mgpu

module load qiskit/aer-gpu
srun -n 4 bash run_mgpu.sh my_cudaq_mgpu_script.py  # one rank per GPU
```
Multi-Node (8 GPUs, 2 nodes × 4 GPUs) — CUDA-Q mgpu#
```bash
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --mem=320G
#SBATCH --gpus-per-node=4
#SBATCH --time=04:00:00
#SBATCH --job-name=cudaq_mgpu_8gpu

module load qiskit/aer-gpu
srun -n 8 --ntasks-per-node=4 bash run_mgpu.sh my_cudaq_mgpu_script.py  # one rank per GPU
```
Multi-Node Shot-Parallel (noisy, mpi4py) — CUDA-Q#
For noisy trajectory simulation with many shots, splitting shots across MPI ranks
(one rank per GPU) gives near-linear speedup. See
par_benchmark_mpi.py
for a full working example.
```bash
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --mem=160G
#SBATCH --gpus-per-node=2
#SBATCH --time=04:00:00
#SBATCH --job-name=cudaq_mpi4py

module load qiskit/aer-gpu
srun -n 4 --ntasks-per-node=2 python3 my_mpi4py_cudaq_script.py  # one rank per GPU
```
Inside the script, each rank binds to one GPU before importing CUDA-Q:

```python
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Bind this rank to a single GPU before CUDA-Q initializes the device
os.environ['CUDA_VISIBLE_DEVICES'] = os.environ.get('SLURM_LOCALID', str(comm.rank))

import cudaq
cudaq.set_target('nvidia')

# kernel, args, noise, and N_SHOTS are defined elsewhere in your script
my_shots = N_SHOTS // comm.size
counts = cudaq.sample(kernel, *args, shots_count=my_shots, noise_model=noise)
all_counts = comm.gather(counts, root=0)
```
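At the root, the gathered per-rank results still need to be merged into a single histogram. A minimal sketch assuming each rank's result has been converted to a plain `{bitstring: count}` dict (CUDA-Q sample results iterate like a mapping; plain dicts are used here so the logic is library-free):

```python
from collections import Counter

def merge_counts(per_rank_counts):
    """Combine per-rank {bitstring: count} dicts into one histogram.
    Non-root ranks may contribute None after a gather, so skip those."""
    total = Counter()
    for counts in per_rank_counts:
        if counts:
            total.update(counts)
    return dict(total)

merged = merge_counts([{'00': 300, '11': 212}, {'00': 290, '11': 222}])
print(merged)  # {'00': 590, '11': 434}
```

Because the final merge is cheap, gathering raw per-rank dicts and summing at rank 0 is usually simpler than an MPI reduction over custom types.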
Using Jupyter Notebooks#
```bash
module load qiskit/aer-gpu
jupyter lab --no-browser --ip=0.0.0.0
```
See the Jupyter documentation for instructions on forwarding the port to your local browser.
A worked example notebook for QAE-based amplitude estimation is available at
qae_example.ipynb.
Installing Additional Packages#
To add packages on top of the shared environment without modifying it:
```bash
module load qiskit/aer-gpu
python3 -m venv /scratch/$USER/qiskit_env --system-site-packages
source /scratch/$USER/qiskit_env/bin/activate
pip install <your-packages>
python3 your_script.py
```
Warning
Do not run pip install directly into the shared module environment. Use a
personal venv in /scratch as shown above.
Performance Notes#
- Noise model limits: noisy trajectory simulation requires one full statevector per shot in GPU memory. At fp32 a single H100 holds up to ~33 qubits (~64 GB); at fp64 the limit is ~32 qubits (~64 GB), since each amplitude takes twice the memory. Use `mgpu,fp32` for noisy circuits with more than ~33 qubits.
- Scaling: for noisy simulation at ≤28 qubits, shot-splitting via mpi4py outperforms mgpu because shots run in parallel rather than sequentially.
- JIT warm-up: the first CUDA-Q `sample()` call for a new circuit/device combination triggers NVRTC kernel compilation (~30–60 s for 28 qubits). Subsequent calls hit the cache and are much faster.
Additional Resources#
NVIDIA / IBM documentation:
NLR example code (Kestrel-tested):
- `run_mgpu.sh` — Cray GTL wrapper for mgpu
- `test_mgpu.py` — GHZ circuit on the nvidia mgpu target
- `bench_mgpu_dqa.py` — DQA benchmark: noiseless + noisy mgpu timing
- `par_benchmark_mgpu.py` — scaling benchmark across multiple GPUs with mgpu
- `par_benchmark_mpi.py` — DQA benchmark: mpi4py shot-splitting across ranks
- `par_benchmark_2gpu.py` — DQA benchmark: mqpu 2-GPU shot-parallel
- `parallelisation_report.md` — full benchmark comparison (mqpu / mpi4py / mgpu) on Kestrel H100s
- `qae_example_backends.ipynb` — Qiskit Aer GPU vs CPU backend comparison notebook
- `binary_optimizer.py` — CUDA-Q QAE-based stochastic optimiser
- `qae.py` — Quantum Amplitude Estimation circuits
- `resource_estimator.py` — gate/qubit resource estimation utilities