Number-Theoretic Transform (NTT)

This competition track required us to build a Number Theoretic Transform (NTT) accelerator capable of performing transforms of size $2^24$. NTTs are conceptually similar to the Fourier Transforms - working over a finite field instead of complex numbers. For this project, the finite field contains values of size 64 bits modulo a so called Solinas prime equal to $2^64 - 2^32 + 1$.

The mathematical formulation of the NTT is as the following:

\[X_{i} = ∑↙{j=0}↖{N-1} x_{j} w^{ij} (mod P)\]

where $N$ is the size of the input vector $x$, $P$ is the Solinas prime and $w$ is the $N$th root of unity. $X_{i}$ is defined for $0 ≤ i < N$.

The main design challenge is that such large transforms cannot fit within a single FPGA’s on-chip RAM resources, so we need to make efficient use of large external DRAM resources.

The platform targeted is the Xilinx Varium C1100 accelerator card. The card contains a Virtex UltrasScale+ FPGA with HBM2.

Design overview

Our work is built around the 4-step algorithm to break down a large NTT into much smaller NTTs. The smaller NTTs are computed using the well-known decimation in time Cooley-Tukey FFT algorithm. The following pages discuss the algorithms and design considerations:

4-step details the 4-step algorithm
Bandwidth Considerations

You can read more about our FPGA implementation in the following pages:

We present the performance results for our design and show how you can build the design.

Hardcaml on the Web

Configure our designs, download RTL and perform simulations all within your browser:

Core NTT design which includes the IO RAMs, datapath and controller
Top level design as a Vitis kernel which performs the full 4-step algorithm

Code structure

The code is built from a couple of libraries within our submission repository:

The Hardcaml_ntt library provides the base implementation of the NTT core and includes a software model and various unit tests.

Zprize_ntt contains the code for both the top-level Hardcaml NTT kernel and tests, the C++ HLS DMA kernel, host benchmarking software, and build scripts.