To evaluate our design, we perform two sets of experiments.

Normal-layout Builds

These are builds where the input and output vectors for the NTT are laid out linearly in HBM (i.e., the host does not perform any pre- or post-processing). We run experiments with the 8-core, 16-core, 32-core and 64-core variants, which yield different levels of parallelism.

Optimised-layout Builds

As discussed earlier, our performance is significantly bandwidth-bound. We conduct two builds (the 32-core and 64-core variants) with a simple data-rearrangement preprocessing step on the host, such that the cores can stream data from HBM in 2048-4096 byte bursts.
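As an illustration (this is not our actual layout, which depends on how the NTT is decomposed across cores), a rearrangement of this kind can be as simple as gathering each core's strided elements into one contiguous run. Assuming 8-byte field elements, a run of 256-512 consecutive elements then corresponds to a 2048-4096 byte burst:

```python
def rearrange_for_bursts(points, num_cores):
    """Gather each core's strided elements into one contiguous run.

    Hypothetical mapping: assumes core c processes elements
    c, c + num_cores, c + 2*num_cores, ... of the linear input; the real
    mapping depends on the NTT decomposition.
    """
    assert len(points) % num_cores == 0
    out = []
    for core in range(num_cores):
        # All of core `core`'s elements become one contiguous,
        # burst-friendly run in the rearranged buffer.
        out.extend(points[core::num_cores])
    return out
```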


We tested our design and ran builds on a machine with a 6-core Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz running Ubuntu 22.04 LTS (GNU/Linux 5.15.0-48-generic x86_64). We did not use any special kernel flags or boot parameters to obtain our results. We ran our designs on the Vitis platform Xilinx provides for the Varium C1100 card. The platform takes up some resources on the FPGA and comes with PCIe gen3 x4 support.

We measured latency by averaging the FPGA-only evaluation latency across 200 NTT runs. Power was measured by sampling `xbutil examine --report electrical --device <device-pcie-id>` 10 times during the latency benchmark.
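A sketch of the power-sampling step is below. The exact layout of the `xbutil` electrical report varies across XRT versions, so the parser here assumes a line of the form `Power : 16.97 Watts` somewhere in the output; `sample_power` and `parse_power_watts` are our own helper names, not part of XRT:

```python
import re
import statistics
import subprocess

POWER_RE = re.compile(r"Power\s*:\s*([0-9.]+)")

def parse_power_watts(report: str) -> float:
    """Extract the board power figure from an `xbutil examine` report.

    Assumes a line like 'Power : 16.97 Watts'; the exact field name and
    formatting may differ between XRT versions.
    """
    m = POWER_RE.search(report)
    if m is None:
        raise ValueError("no power figure found in report")
    return float(m.group(1))

def sample_power(device: str, samples: int = 10) -> float:
    """Average board power over several xbutil samples taken while the
    latency benchmark is running."""
    readings = []
    for _ in range(samples):
        out = subprocess.run(
            ["xbutil", "examine", "--report", "electrical", "--device", device],
            capture_output=True, text=True, check=True,
        ).stdout
        readings.append(parse_power_watts(out))
    return statistics.mean(readings)
```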

In the normal-layout builds, we do not perform any preprocessing or post-processing, so the latencies below include only the FPGA NTT evaluation time.

Latency, Power and Resource Utilisation

The table below depicts our results for various builds:

| Build   | Latency (s) | Power (W) | LUTs   | Registers | DSPs | BRAM36 | URAM |
|---------|-------------|-----------|--------|-----------|------|--------|------|
| 8-core  | 0.2315      | 16.97     | 107291 | 141006    | 260  | 162    | 48   |
| 16-core | 0.1238      | 18.19     | 126422 | 156149    | 512  | 162    | 96   |
| 32-core | 0.0691      | 21.13     | 166488 | 184436    | 1028 | 162    | 192  |
| 64-core | 0.0450      | 27.70     | 265523 | 246385    | 2052 | 162    | 384  |

For reference, here are the available resources on the FPGA. Note that as we are building on top of a Vitis platform, the platform imposes a non-trivial fixed cost that we don't control; this is reported as "fixed" in `post_route_utilisation.rpt`.

| Resource  | Available on FPGA | Used by Vitis platform |
|-----------|-------------------|------------------------|
| LUTs      | 870720            | 62191                  |
| Registers | 1743360           | 81502                  |
| DSPs      | 5952              | 4                      |
| BRAM36    | 1344              | 0                      |
| URAM      | 640               | 0                      |

Results from Optimised-Layout Builds

Here is a detailed breakdown of a runtime sample of an optimised 64-core build (power and resource utilisation are similar to the normal-layout builds):

Breakdown of a 2^24 optimised-layout 64-core evaluation:

| Task                                    | Time                                  |
|-----------------------------------------|---------------------------------------|
| Preprocessing data rearrangement        | 0.0213s                               |
| Copying input points to device          | 0.0414s                               |
| Doing actual NTT work                   | 0.0267s (vs 0.0450s in normal layout) |
| Copying final result to host            | 0.0552s                               |
| Copy from internal page-aligned buffer  | 0.0231s                               |
| Evaluate NTT (end-to-end total)         | 0.1680s                               |
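As a sanity check, the measured copy times are consistent with the platform's PCIe gen3 x4 link, assuming 8-byte field elements (so a 2^24-point vector is 128 MiB):

```python
# Derive achieved PCIe bandwidth from the measured copy times above,
# assuming 2^24 points of 8 bytes each (64-bit field elements).
POINTS = 2 ** 24
BYTES = POINTS * 8  # 128 MiB per vector

def gbytes_per_sec(seconds: float) -> float:
    return BYTES / seconds / 1e9

host_to_device = gbytes_per_sec(0.0414)  # roughly 3.2 GB/s
device_to_host = gbytes_per_sec(0.0552)  # roughly 2.4 GB/s
# PCIe gen3 x4 tops out at about 3.9 GB/s after encoding overhead, so
# the host-to-device copy is already close to the link limit.
```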

Breakdown of a 2^24 optimised-layout 32-core evaluation:

| Task                                    | Time                                  |
|-----------------------------------------|---------------------------------------|
| Preprocessing data rearrangement        | 0.0217s                               |
| Copying input points to device          | 0.0416s                               |
| Doing actual NTT work                   | 0.0349s (vs 0.0691s in normal layout) |
| Copying final result to host            | 0.0554s                               |
| Copy from internal page-aligned buffer  | 0.0228s                               |
| Evaluate NTT (end-to-end total)         | 0.1770s                               |

By rearranging the data into a much more memory-friendly layout, our NTT evaluation time drops significantly compared to the 64-core normal-layout build (0.0267s vs 0.0450s, roughly a 1.7x speedup). This comes at the cost of the host doing some data rearrangement.

In these results, the bottleneck of our evaluation clearly lies in the host-side processing and the PCIe transfers, both of which can be addressed fairly easily:

  • The preprocessing and post-processing take longer than the NTT evaluation itself - we can run the pre- and post-processing in separate threads, and set up the input and output buffers such that we don't run into cache-coherency issues. We can also mask some of the preprocessing latency behind the PCIe transfers.
  • The PCIe transfer time is larger than the NTT evaluation time - this is because the Vitis platform we are using only supports PCIe gen3 x4. With PCIe gen3 x16, we would have four times the bandwidth and could largely sidestep this problem.
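The double-buffering idea from the first point can be sketched as follows. Here `preprocess` and `transfer` are stand-ins of our own for the real rearrangement step and the PCIe copy: while chunk i is in flight over PCIe, chunk i+1 is rearranged on a second thread.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_run(chunks, preprocess, transfer):
    """Overlap host-side preprocessing with device transfers.

    `preprocess` and `transfer` are placeholders for the real
    data-rearrangement step and the PCIe copy.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(preprocess, chunks[0])
        for nxt in chunks[1:]:
            ready = pending.result()                 # rearranged chunk i
            pending = pool.submit(preprocess, nxt)   # start chunk i+1
            results.append(transfer(ready))          # PCIe copy overlaps it
        results.append(transfer(pending.result()))
    return results
```

Using separate input and output buffers per chunk keeps the preprocessing thread and the transfer from touching the same cache lines, which is the coherency concern mentioned above.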

In practice, we believe this is the more scalable design: it can achieve low latency and high throughput at the cost of the host machine doing some data rearrangement.