Results
To evaluate our design, we perform two sets of experiments.
Normal-layout Builds
These are builds where the input and output vectors for the NTT are laid out linearly in HBM (i.e., the host doesn't perform any pre- or post-processing). We run experiments with the 8-core, 16-core, 32-core and 64-core variants, which yield different levels of parallelism.
Optimised-layout Builds
As discussed earlier, our performance is significantly bound by memory bandwidth. We conduct two builds (the 32-core and 64-core variants) with a simple data-rearrangement preprocessing step on the host, such that the cores can stream data from HBM in 2048-4096 byte bursts.
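To make the burst-friendly layout concrete, here is a minimal sketch of what such a host-side rearrangement step could look like. The chunking scheme (gathering each core's strided elements into one contiguous region) and the names `rearrange_for_cores`, `NUM_CORES`, etc. are illustrative assumptions, not the exact layout our design uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch only: gather the elements each NTT core touches into a
// contiguous region so the core can read them from HBM in long bursts instead
// of many short strided accesses. The real layout used by the design may
// differ; the stride/chunk scheme below is an assumption.
constexpr std::size_t LOG_N = 24;               // 2^24-point NTT
constexpr std::size_t N = std::size_t{1} << LOG_N;
constexpr std::size_t NUM_CORES = 64;           // 64-core variant

void rearrange_for_cores(const std::vector<uint64_t>& in,
                         std::vector<uint64_t>& out) {
  const std::size_t chunk = N / NUM_CORES;
  for (std::size_t core = 0; core < NUM_CORES; ++core) {
    for (std::size_t i = 0; i < chunk; ++i) {
      // Element i of core `core` sat at a stride of NUM_CORES in the linear
      // layout; it now sits in a contiguous per-core block.
      out[core * chunk + i] = in[i * NUM_CORES + core];
    }
  }
}
```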
Results
We tested our design and ran builds on a machine with a 6-core Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz running Ubuntu 22.04 LTS (GNU/Linux 5.15.0-48-generic x86_64). We did not use any special kernel flags or boot parameters to obtain our results. We ran our designs using the Vitis platform Xilinx provides for the Varium C1100 card. The platform takes up some resources on the FPGA and comes with PCIe Gen3 x4 support.
We measured our latency by taking the FPGA-only evaluation latency across 200 NTT runs. Power was measured by sampling `xbutil examine --report electrical --device <device-pcie-id>` 10 times during the latency benchmark.
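For reference, a minimal sketch of this timing loop could look like the following. `run_ntt_on_fpga()` is a hypothetical placeholder for the actual kernel launch and wait (done through the Vitis/XRT runtime), not a function from our host code, and reporting the mean here is illustrative.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical placeholder for the real kernel launch + wait; the actual host
// code drives the FPGA through the Vitis/XRT runtime.
static void run_ntt_on_fpga() { /* enqueue kernel, wait for completion */ }

int main() {
  constexpr int kRuns = 200;  // we time 200 NTT runs
  double total_s = 0.0;
  for (int i = 0; i < kRuns; ++i) {
    const auto t0 = std::chrono::steady_clock::now();
    run_ntt_on_fpga();  // FPGA-only evaluation; no host pre/post-processing
    const auto t1 = std::chrono::steady_clock::now();
    total_s += std::chrono::duration<double>(t1 - t0).count();
  }
  std::printf("mean FPGA-only NTT latency over %d runs: %.4f s\n", kRuns,
              total_s / kRuns);
  return 0;
}
```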
In the normal-layout builds, we do not perform any preprocessing or post-processing, so the latencies below include only the FPGA NTT evaluation latency.
Latency, Power and Resource Utilisation
The table below depicts our results for various builds:
Build | Latency (s) | Power (W) | LUTs | Registers | DSP | BRAM36 | URAM |
---|---|---|---|---|---|---|---|
8-core | 0.2315 | 16.97 | 107291 | 141006 | 260 | 162 | 48 |
16-core | 0.1238 | 18.19 | 126422 | 156149 | 512 | 162 | 96 |
32-core | 0.0691 | 21.13 | 166488 | 184436 | 1028 | 162 | 192 |
64-core | 0.0450 | 27.70 | 265523 | 246385 | 2052 | 162 | 384 |
Here are the available resources on the FPGA. Note that because we build on top of a Vitis platform, the platform imposes a non-trivial fixed resource cost that we don't control; this cost is reported as "fixed" in `post_route_utilisation.rpt`.
Resource | Available on FPGA | Used by Vitis Platform |
---|---|---|
LUTs | 870720 | 62191 |
Registers | 1743360 | 81502 |
DSP | 5952 | 4 |
BRAM36 | 1344 | 0 |
URAM | 640 | 0 |
Results from Optimised-Layout Builds
Here is a detailed breakdown of a runtime sample from an optimised-layout 64-core build (power and resource utilisation are similar to the normal-layout builds):
Breakdown of a 2^24 optimised-layout 64-core evaluation:
Task | Time |
---|---|
Preprocessing data rearrangement | 0.0213s |
Copying input points to device | 0.0414s |
Actual NTT work on the FPGA | 0.0267s (vs. 0.0450s in the normal layout) |
Copying final result to host | 0.0552s |
Copying from internal page-aligned buffer | 0.0231s |
Evaluate NTT (total) | 0.1680s |
Breakdown of a 2^24 optimised-layout 32-core evaluation:
Task | Time |
---|---|
Preprocessing data rearrangement | 0.0217s |
Copying input points to device | 0.0416s |
Actual NTT work on the FPGA | 0.0349s (vs. 0.0691s in the normal layout) |
Copying final result to host | 0.0554s |
Copying from internal page-aligned buffer | 0.0228s |
Evaluate NTT (total) | 0.1770s |
By rearranging the data into a much more memory-friendly layout, our NTT evaluation time drops significantly compared to the normal-layout 64-core build (0.0267s vs. 0.0450s). This comes at the cost of the host doing some data rearrangement.
The bottleneck of our evaluation in these results clearly lies in the host and PCIe latency, both of which can be solved pretty easily:

- The preprocessing + post-processing latency is larger than the NTT evaluation latency. We can run the preprocessing and post-processing in separate threads, and set up the input and output buffers such that we don't run into cache-coherency issues. We can also mask some of the preprocessing latency with the PCIe transfer latency (see the sketch after this list).
- The PCIe latency is larger than the NTT evaluation latency. This is because the Vitis platform we are using only supports PCIe Gen3 x4. With PCIe x16, we would have four times the bandwidth and could sidestep this problem.
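As a rough illustration of the pipelining described in the first point, here is a sketch of how the host could overlap preprocessing, PCIe transfers and post-processing across batches. The helper names (`preprocess`, `copy_to_device`, `run_ntt`, `copy_from_device`, `postprocess`) are hypothetical stand-ins, not names from our actual host code.

```cpp
#include <cstddef>
#include <cstdint>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real host-side steps; the actual host code
// uses the Vitis/XRT runtime for the PCIe copies and kernel launches.
struct Batch { std::vector<uint64_t> data; };

Batch preprocess(Batch b)          { /* data rearrangement on the host */ return b; }
void  copy_to_device(const Batch&) { /* PCIe host-to-device copy */ }
void  run_ntt()                    { /* kernel launch + wait */ }
Batch copy_from_device()           { /* PCIe device-to-host copy */ return {}; }
Batch postprocess(Batch b)         { /* undo the rearrangement */ return b; }

// Overlap preprocessing of batch i+1 and post-processing of batch i with the
// device work for batch i, so host-side work hides behind PCIe/NTT latency.
void pipeline(std::vector<Batch> batches) {
  if (batches.empty()) return;
  std::vector<std::future<Batch>> results;
  auto pre = std::async(std::launch::async, preprocess, std::move(batches[0]));
  for (std::size_t i = 0; i < batches.size(); ++i) {
    Batch ready = pre.get();
    // Start preprocessing the next batch while the device works on this one.
    if (i + 1 < batches.size())
      pre = std::async(std::launch::async, preprocess, std::move(batches[i + 1]));
    copy_to_device(ready);
    run_ntt();
    // Post-processing runs on its own thread and overlaps the next iteration.
    results.push_back(
        std::async(std::launch::async, postprocess, copy_from_device()));
  }
  for (auto& r : results) r.get();  // collect all outputs
}
```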
In practice, we believe this is the more scalable design: it can achieve low latency and high throughput, at the cost of the host machine doing some data rearrangement.