cudaq_impl: CUDA-Q Port
GPU-accelerated CUDA-Q implementation of the DQA+QAE pipeline.
Mirrors the Qiskit reference in binary_optimizer.py and ExpValFun_functions.py,
replacing Qiskit QuantumCircuit objects with @cudaq.kernel primitives that
run natively on NVIDIA GPUs via cuStateVec.
Optimizer Class
qiskit_impl.cudaq_impl.CudaqQAEOptimizer
CudaqQAEOptimizer(c_x: list, c_y: list, c_r: float, n_y: int = 4, w_d: int = 2, cost_norm: float = 5.0)
GPU-accelerated QAE optimizer using CUDA-Q.
Qiskit counterpart: BinaryNestedOptimizer in binary_optimizer.py. Method mapping:
    __init__                <- BinaryNestedOptimizer.__init__
    sample_ansatz           <- execute_optimizer(qc, num_meas=N)
    estimate_expected_value <- execute_optimizer() + process_expectation_value_optimizer()
    _wind_scenario_cost     <- wind_scenario_cost() in binary_optimizer.py
    benchmark_vs_qiskit     <- (new) wraps both and times them
Uses @cudaq.kernel functions instead of Qiskit QuantumCircuits.
Usage::
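    import cudaq
    from qiskit_impl.cudaq_impl import CudaqQAEOptimizer

    cudaq.set_target('nvidia')  # or 'qpp-cpu' for CPU fallback
    opt = CudaqQAEOptimizer(c_x=[3.], c_y=[0.4, 0.5, 0.7, 1.],
                            c_r=10., n_y=4, w_d=2, cost_norm=5.)
    # thetas: alternating [gamma, beta, ...] DQA angles; wind_demand is a required argument
    phi_est = opt.estimate_expected_value(thetas, wind_demand=2, shots=1000)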
Source code in qiskit_impl/cudaq_impl.py
Functions
sample_ansatz
Sample the DQA ansatz and return bitstring probabilities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| thetas | list | Alternating [gamma, beta, gamma, beta, ...] angles. | required |
| shots | int | Number of measurement shots. | 4096 |
Returns:
| Type | Description |
|---|---|
| dict | Dict {bitstring: probability} over 2*n_y qubits (y+xi register). |
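The returned dict can be pictured as a normalized histogram of measured bitstrings. A minimal sketch, assuming raw shot counts are available as a plain Python dict; the helper name `counts_to_probabilities` and the example counts are hypothetical and not part of `cudaq_impl`:

```python
def counts_to_probabilities(counts: dict) -> dict:
    """Normalize raw shot counts {bitstring: count} to {bitstring: probability}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no shots recorded")
    return {bits: n / total for bits, n in counts.items()}


# Illustrative counts over 2*n_y = 8 qubits (made-up numbers, not real output).
example_counts = {"11000101": 2048, "10101111": 1024, "01100000": 1024}
probs = counts_to_probabilities(example_counts)
```

With `shots=4096`, each probability is simply the count divided by 4096, so the values sum to 1.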
estimate_expected_value
Estimate E[Q(x, xi)] via DQA sampling and classical post-processing.
Mirrors BinaryNestedOptimizer.execute_optimizer + process_expectation_value_optimizer().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| thetas | list | DQA angles [gamma_0, beta_0, ...]. | required |
| wind_demand | int | k, the required Hamming weight of y (= d - x). | required |
| shots | int | Number of measurement shots. | 4096 |
Returns:
| Type | Description |
|---|---|
| float | Estimated expected second-stage cost. |
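The classical post-processing step amounts to a probability-weighted average of per-scenario costs. The sketch below is an illustration only: the real `_wind_scenario_cost` is not shown on this page, so the cost model (turbine i pays `c_y[i]` when committed with wind available, and the recourse cost `c_r` when committed without wind) and the bit ordering (y bits first, then xi bits) are assumptions, and the function names are hypothetical.

```python
def scenario_cost(y_bits: str, xi_bits: str, c_y: list, c_r: float) -> float:
    """Assumed cost model: committed turbine pays c_y[i] with wind, c_r without."""
    cost = 0.0
    for y, xi, cy in zip(y_bits, xi_bits, c_y):
        if y == "1":
            cost += cy if xi == "1" else c_r
    return cost


def expected_cost(probs: dict, c_y: list, c_r: float, n_y: int = 4) -> float:
    """Average scenario_cost over a {bitstring: probability} dict."""
    total = 0.0
    for bits, p in probs.items():
        # Assumed layout: first n_y characters are y, the rest are xi.
        y_bits, xi_bits = bits[:n_y], bits[n_y:]
        total += p * scenario_cost(y_bits, xi_bits, c_y, c_r)
    return total


# Two made-up scenarios with equal probability.
probs = {"11001100": 0.5, "11000000": 0.5}
value = expected_cost(probs, c_y=[0.4, 0.5, 0.7, 1.0], c_r=10.0)
```

In the first scenario both committed turbines have wind (cost 0.9); in the second neither does (cost 20.0), so the average is 10.45.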
benchmark_vs_qiskit
benchmark_vs_qiskit(thetas: list, wind_demand: int, shots: int = 4096, qiskit_optimizer=None) -> dict
Time CUDA-Q vs Qiskit execution and report results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| thetas | list | DQA angles. | required |
| wind_demand | int | k = d - x. | required |
| shots | int | Shots per method. | 4096 |
| qiskit_optimizer | | A BinaryNestedOptimizer instance (optional). | None |
Returns:
| Type | Description |
|---|---|
| dict | Dict with keys 'cudaq_time', 'qiskit_time', 'cudaq_phi', 'qiskit_phi'. |
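One way such a timing wrapper could be structured is sketched below. This is not the actual implementation: the helper names are hypothetical, the two estimators are passed in as plain callables, and returning `None` for the Qiskit entries when no Qiskit optimizer is supplied is an assumption.

```python
import time


def time_call(fn):
    """Run fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def benchmark(cudaq_fn, qiskit_fn=None) -> dict:
    """Time both estimators and collect results under the documented keys."""
    cudaq_phi, cudaq_time = time_call(cudaq_fn)
    report = {
        "cudaq_time": cudaq_time,
        "cudaq_phi": cudaq_phi,
        "qiskit_time": None,  # assumed placeholder when Qiskit is skipped
        "qiskit_phi": None,
    }
    if qiskit_fn is not None:
        report["qiskit_phi"], report["qiskit_time"] = time_call(qiskit_fn)
    return report


# Stand-in callables; real use would wrap the two optimizers' estimate calls.
report = benchmark(lambda: 0.42, lambda: 0.40)
```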
Kernel Primitives
pdf_init_uniform
qiskit_impl.cudaq_impl.pdf_init_uniform
Prepare uniform superposition on the xi (pdf) register: H^⊗n.
Qiskit counterpart: BinaryNestedOptimizer.pdf_initialize() (is_uniform=True) in binary_optimizer.py and pdf_initialize() in dist_prep.py. Both apply H gates to every qubit in the xi register.
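The effect of H^⊗n on |0...0⟩ can be checked with a small state-vector calculation. This is a NumPy sketch for illustration, not the CUDA-Q kernel itself; the function name `uniform_state` is hypothetical.

```python
import numpy as np


def uniform_state(n: int) -> np.ndarray:
    """Apply a Hadamard to each of n qubits starting from |0...0>."""
    h = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
    op = np.array([[1.0]])
    for _ in range(n):
        op = np.kron(op, h)  # build H tensor H tensor ... tensor H
    zero = np.zeros(2**n)
    zero[0] = 1.0
    return op @ zero


state = uniform_state(4)
probs = state**2  # amplitudes are real, so squaring gives probabilities
```

For the 4-qubit xi register every one of the 16 scenarios gets probability 1/16.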
ccry
qiskit_impl.cudaq_impl.ccry
Doubly-controlled RY via cudaq.control (correct for all input states).
Qiskit counterpart: qc.append(RYGate(theta).control(2), [ctrl1, ctrl2, target]) used in ExpValFun_functions.dicke_state_circuit() and single_oracle_sin_inconstraint(). Using cudaq.control avoids relative-phase errors from manual decompositions. Ref: CUDA-Q docs — https://nvidia.github.io/cuda-quantum/
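For reference, the unitary a doubly-controlled RY should implement can be written out directly. The NumPy sketch below is an illustration rather than the kernel code; it assumes the basis index is ordered as ctrl1, ctrl2, target (most significant first), and the function names are hypothetical.

```python
import numpy as np


def ry(theta: float) -> np.ndarray:
    """Single-qubit RY(theta) = exp(-i*theta/2 * Y)."""
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -s], [s, c]])


def ccry_matrix(theta: float) -> np.ndarray:
    """8x8 unitary: RY on the target only when both controls are |1>."""
    u = np.eye(8)
    u[6:, 6:] = ry(theta)  # rows/cols 6, 7 are |110> and |111>
    return u


theta = 0.7
U_ccry = ccry_matrix(theta)
```

A manual decomposition is correct only if it reproduces this matrix exactly on every input state, not just on basis states where the controls are fixed.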
dicke_state_n4_k2
qiskit_impl.cudaq_impl.dicke_state_n4_k2
Dicke state |D_4^2⟩ on 4 qubits (wind_demand=2, n_y=4).
Qiskit counterpart: ExpValFun_functions.dicke_state_circuit(args) called with args['n_y']=4, args['w_d']=2, and BinaryNestedOptimizer.dicke_state_circuit(weight=2) in binary_optimizer.py. The SCS angle patterns are identical; CUDA-Q inlines them as floats because kernel code cannot call math outside the compiled subset.
Hardcoded for n=4, k=2, matching the example notebook. For general (n, k), compute the SCS angles in Python and pass them as list[float]. Angles:
    SCS(4,2): theta1 = 2*arccos(sqrt(1/4)), theta2 = 2*arccos(sqrt(2/4))
    SCS(3,2): theta1 = 2*arccos(sqrt(1/3)), theta2 = 2*arccos(sqrt(2/3))
    SCS(2,1): theta  = 2*arccos(sqrt(1/2))
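Computing the angles on the host for general (n, k) could look like the sketch below. It follows the 2*arccos(sqrt(l/n)) pattern listed for the hardcoded case; the function names and the rule of using SCS(m, min(k, m-1)) for m = n down to 2 are assumptions generalized from the n=4, k=2 listing, not code from `cudaq_impl`.

```python
import math


def scs_angles(n: int, k: int) -> list:
    """Angles 2*arccos(sqrt(l/n)) for l = 1..k of one SCS(n, k) block."""
    return [2.0 * math.acos(math.sqrt(l / n)) for l in range(1, k + 1)]


def dicke_angles(n: int, k: int) -> list:
    """Flat angle list for blocks SCS(n, k), SCS(n-1, .), ..., SCS(2, 1)."""
    angles = []
    for m in range(n, 1, -1):
        angles.extend(scs_angles(m, min(k, m - 1)))
    return angles


# Reproduces the five angles of the hardcoded n=4, k=2 case.
angles = dicke_angles(4, 2)
```

The resulting list can then be passed to a kernel as a `list[float]` argument instead of inlining the values.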
cost_operator_n4
qiskit_impl.cudaq_impl.cost_operator_n4
cost_operator_n4(gamma: float, c_y0: float, c_y1: float, c_y2: float, c_y3: float, c_r: float, cost_norm: float, y: qview, xi: qview)
Cost phase operator for n_y=4 turbines.
Qiskit counterpart: ExpValFun_functions.cost_operator(amplitude, args) in ExpValFun_functions.py. Qiskit uses qc.cp(amplitude*cost/cost_norm, q_pdf, q_w) for the operational cost and X+CP+X for the recourse cost. CUDA-Q replaces qc.cp(...) with cr1(...) (controlled-Phase gate).
fswap_power
qiskit_impl.cudaq_impl.fswap_power
Partial SWAP (SWAP^beta) on two qubits via XX+YY decomposition.
Qiskit counterpart: SwapGate().power(amplitude) applied per pair (qj, qk) inside ExpValFun_functions.demand_constraint_preserving_mixer(). CUDA-Q has no native fractional SWAP, so the kernel applies Rxx(β·π/2) followed by Ryy(β·π/2).
Rxx(β·π/2)·Ryy(β·π/2) = exp(-iβπ/4 · (XX+YY)). This is the XY part of the fractional SWAP: SWAP^β = e^{iβπ/4} · exp(-iβπ/4 · (XX+YY+ZZ)), so the two agree up to a global phase and a ZZ phase factor exp(-iβπ/4 · ZZ) rather than exactly. The decomposition preserves Hamming weight (the number of excitations), so it implements the demand-constraint-preserving XY mixer.
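The relation between the XX+YY product and the principal power SWAP^β can be checked numerically. The NumPy sketch below is an independent check, not the kernel code: it builds both 4x4 unitaries, confirms that the XX+YY product commutes with the excitation-number operator, and shows that it matches SWAP^β only after multiplying by e^{iβπ/4}·exp(-iβπ/4·ZZ). All function names are hypothetical.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)
XX, YY, ZZ = np.kron(X, X), np.kron(Y, Y), np.kron(Z, Z)
I4 = np.eye(4, dtype=complex)
SWAP = (I4 + XX + YY + ZZ) / 2.0


def pauli_rot(p: np.ndarray, theta: float) -> np.ndarray:
    """exp(-i*theta/2 * p) for a Pauli product p with p @ p = I."""
    return np.cos(theta / 2.0) * I4 - 1j * np.sin(theta / 2.0) * p


def xy_mixer(beta: float) -> np.ndarray:
    """Rxx(beta*pi/2) followed by Ryy(beta*pi/2)."""
    theta = beta * np.pi / 2.0
    return pauli_rot(YY, theta) @ pauli_rot(XX, theta)


def swap_power(beta: float) -> np.ndarray:
    """Principal power: eigenvalue +1 stays 1, eigenvalue -1 becomes e^{i*pi*beta}."""
    p_minus = (I4 - SWAP) / 2.0
    return (I4 - p_minus) + np.exp(1j * np.pi * beta) * p_minus


beta = 0.3
U = xy_mixer(beta)
number_op = np.diag([0.0, 1.0, 1.0, 2.0])  # Hamming weight of |00>, |01>, |10>, |11>
zz_fix = np.exp(1j * np.pi * beta / 4.0) * pauli_rot(ZZ, beta * np.pi / 2.0)
```

Because the extra factor is diagonal, it does not move population between bitstrings; it only changes relative phases between the even-parity and odd-parity sectors of the pair.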
mixer_n4
qiskit_impl.cudaq_impl.mixer_n4
XY mixer for n_y=4: applies partial SWAP to all pairs (j, k).
Qiskit counterpart: ExpValFun_functions.demand_constraint_preserving_mixer(amplitude, args), specifically the double loop over y_reg pairs.
oracle_sin_n4
qiskit_impl.cudaq_impl.oracle_sin_n4
oracle_sin_n4(c_y0: float, c_y1: float, c_y2: float, c_y3: float, c_r: float, norm: float, y: qview, xi: qview, ancilla: qubit)
F_sin oracle for n_y=4 turbines.
Qiskit counterpart: ExpValFun_functions.single_oracle_sin_inconstraint(args) in ExpValFun_functions.py. Qiskit appends RYGate(theta).control(2) (i.e., CCRY) for each turbine; CUDA-Q uses the ccry() kernel defined above as a drop-in replacement. Gate angles π·c_y/norm and π·c_r/norm are identical in both versions.
Rotates ancilla qubit by angles proportional to per-turbine costs so that
Pr[ancilla=|1>] ≈ normalized_expected_cost
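For a single basis state of the y and xi registers, the controlled rotations that fire add up to one RY angle on the ancilla. The sketch below shows the resulting probability, assuming the ancilla starts in |0⟩ and the accumulated angle is π·cost/norm, where cost is the sum of the per-turbine costs whose controls are satisfied; the function name is hypothetical.

```python
import math


def ancilla_one_probability(cost: float, norm: float) -> float:
    """P(ancilla = 1) after RY(pi*cost/norm) applied to |0>."""
    theta = math.pi * cost / norm  # accumulated rotation angle
    return math.sin(theta / 2.0) ** 2


# A scenario costing half of the normalization constant.
p = ancilla_one_probability(cost=2.5, norm=5.0)
```

The overall Pr[ancilla=|1⟩] is then the average of this quantity over the basis states, weighted by their probabilities.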
dqa_ansatz_n4
qiskit_impl.cudaq_impl.dqa_ansatz_n4
dqa_ansatz_n4(c_y0: float, c_y1: float, c_y2: float, c_y3: float, c_r: float, cost_norm: float, thetas: list[float], n_steps: int)
DQA alternating operator ansatz for n_y=4, n_xi=4.
Qiskit counterpart: ExpValFun_functions.alternating_operator_ansatz(args) in ExpValFun_functions.py. Qiskit structure:
    qc.append(initial_state_circuit(args).to_gate(), y_reg)   <- dicke_state_n4_k2
    qc.append(pdf_circuit(args).to_gate(), pdf_reg)           <- pdf_init_uniform
    for i, theta in enumerate(Theta):
        if i % 2 == 0: cost_operator_circuit(theta, args)     <- cost_operator_n4
        else:          mixer_operator_circuit(theta, args)    <- mixer_n4
Total qubits: n_y + n_xi = 8. Layout: y[0..3], xi[0..3].
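The alternation over `thetas` (even indices drive the cost operator, odd indices drive the mixer) can be sketched in plain Python. The function name `layer_schedule` is hypothetical and the snippet only illustrates the ordering; it applies no gates.

```python
def layer_schedule(thetas: list) -> list:
    """Pair each angle with the operator it drives, in application order."""
    schedule = []
    for i, theta in enumerate(thetas):
        kind = "cost" if i % 2 == 0 else "mixer"
        schedule.append((kind, theta))
    return schedule


# Two DQA steps: [gamma_0, beta_0, gamma_1, beta_1] (illustrative values).
schedule = layer_schedule([0.1, 0.2, 0.3, 0.4])
```

With this layout a list of 2·p angles corresponds to p cost/mixer steps.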