Performance

MerLin is a quantum machine learning framework designed specifically for photonic quantum computing, leveraging the unique properties of single-photon quantum systems.

MerLin Quantum Layers can be executed on either CPU or GPU, like any other PyTorch module:

import merlin as ML # Package: merlinquantum, import: merlin
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create a simple quantum layer
quantum_layer = ML.QuantumLayer.simple(
    input_size=3,
    n_params=50,  # Number of trainable quantum parameters
    device = device
)

Therefore, these Quantum Layers must be optimized for efficient GPU execution across varying batch sizes, mode counts, and photon numbers.

Here, we analyze the memory and computation time requirements for running a GenericInterferometer with \(m\) modes and \(2m(m-1)\) trainable parameters (\(m(m-1)\) beam splitters and \(m(m-1)\) phase shifters). The analysis was performed on an NVIDIA H100 GPU with 80GB of VRAM. We run a simple code that you can find here, in which we want to learn a target distribution by training the beam splitters and phase shifters of our interferometer. We have used the pyNVML library for memory monitoring (documentation)

Memory Performance

First, we analyze the performance needed to train an interferometer with m modes using m//2 photons for m from 2 to 24 and varying the batch size from 1 to 2048.

Memory usage with respect to different batch sizes for m-mode interferometers with m/2 photons

Then, we analyze the performance needed to train m-mode interferometers with 1 to m//2 photons for batch_size=1:

Memory usage with respect to different number of photons, for different sizes of interferometers

Conclusion: we can run up to 24 modes with 12 photons with a Batch size of 16 on the H100 !

Time Performance

Here, we compare the average time required for different operations on the H100 GPU. First, we display the computation time needed for the QuantumLayer with varying numbers of photons:

Compilation time for the ``QuantumLayer`` with different numbers of photons

Next, we compare forward and backward pass times for different numbers of photons:

Forward and Backward times for the ``QuantumLayer`` with different numbers of photons

And then, we compare forward and backward pass times for different batch sizes:

Forward and Backward times for the ``QuantumLayer`` for different ``batch_sizes``

Conclusion: The QuantumLayer demonstrates reasonable computation times, making it suitable for integration within PyTorch-based workflows.

Pushing the H100 to its limits

We increase the number of modes from 50 to 350 and we vary the number of photons from 1 to 3 with a batch size of 1 to observe GPU performance with a high number of modes:

Memory usage for the m-mode interferometer with respect to the number of photons

Conclusion: We can increase the number of modes above 350 with a H100 GPU

Now it is your turn !

Let’s push your GPU to its limits ! Follow our code here: memory_benchmark

To benchmark on your GPU, simply run:

python3 ./docs/source/reference/memory_benchmark.py --modes 16 --photons 8 --bs 64 --type torch.float64