4-Step Algorithm
The core design limits us to transform sizes that fit within on-chip FPGA RAM resources. To break this limit, we use the well-known 4-step algorithm as described in this paper. An OCaml reference implementation is provided in our repository.
The idea is to break the large transform into lots of smaller transforms. Let’s consider our target transform size of $2^{24}$. We first reformulate the input data as a $2^{12} \times 2^{12}$ matrix in row-major order.
The following steps are then performed:
- Perform $2^{12}$ NTTs, one on each column.
- Multiply each element of the matrix at location $(i,j)$ by $\omega^{i \cdot j}$, where $\omega$ is the root of unity of the full ($2^{24}$) transform size.
- Perform $2^{12}$ NTTs, one on each row.
- Transpose the result.
Each of the smaller NTT transforms now easily fits within on-chip FPGA RAM resources, so long as we can store the complete matrix in external memory and access it efficiently.
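To make the steps concrete, here is a small, self-contained OCaml sketch of the 4-step procedure. It uses a toy field (the integers modulo 17, which contain a 16th root of unity), a naive $O(n^2)$ NTT, and a $4 \times 4$ matrix rather than $2^{12} \times 2^{12}$. It is illustrative only; the names and parameters here are not taken from the reference implementation in the repository.

```ocaml
(* Toy prime field: the multiplicative group mod 17 has order 16,
   and 3 generates it, so 3 is a primitive 16th root of unity. *)
let p = 17
let omega_16 = 3

(* b^e mod m by repeated squaring. *)
let rec pow_mod b e m =
  if e = 0 then 1
  else
    let h = pow_mod b (e / 2) m in
    let h2 = h * h mod m in
    if e land 1 = 1 then h2 * b mod m else h2

(* Naive O(n^2) NTT with root of unity [w]: y.(k) = sum over j of x.(j) * w^(j*k). *)
let ntt w x =
  let n = Array.length x in
  Array.init n (fun k ->
      let acc = ref 0 in
      for j = 0 to n - 1 do
        acc := (!acc + (x.(j) * pow_mod w (j * k) p)) mod p
      done;
      !acc)

(* 4-step NTT of size [rows * cols]; [x] holds the matrix in row-major order.
   Assumes [rows * cols] divides 16 so that an n-th root of unity exists. *)
let four_step ~rows ~cols x =
  let n = rows * cols in
  let w = pow_mod omega_16 (16 / n) p in              (* n-th root of unity *)
  let m = Array.init rows (fun i -> Array.sub x (i * cols) cols) in
  (* Step 1: NTT on each column, with root of unity w^cols. *)
  for j = 0 to cols - 1 do
    let col = ntt (pow_mod w cols p) (Array.init rows (fun i -> m.(i).(j))) in
    Array.iteri (fun i v -> m.(i).(j) <- v) col
  done;
  (* Step 2: multiply element (i, j) by w^(i*j). *)
  for i = 0 to rows - 1 do
    for j = 0 to cols - 1 do
      m.(i).(j) <- m.(i).(j) * pow_mod w (i * j) p mod p
    done
  done;
  (* Step 3: NTT on each row, with root of unity w^rows. *)
  for i = 0 to rows - 1 do
    m.(i) <- ntt (pow_mod w rows p) m.(i)
  done;
  (* Step 4: transpose and flatten back into a vector. *)
  Array.init n (fun k -> m.(k mod rows).(k / rows))

(* The 4-step result matches a direct 16-point NTT on the same input. *)
let () =
  let x = Array.init 16 (fun i -> (i * i + 1) mod p) in
  assert (four_step ~rows:4 ~cols:4 x = ntt omega_16 x)
```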
Performance
Using the Cooley-Tukey FFT algorithm, a transform of size $2^{24}$ requires $2^{24} \log_2 2^{24} = 402,653,184$ operations.
The 4-step algorithm performs $2 \cdot 2^{12}$ smaller NTT transforms, each of size $2^{12}$. This works out to $2 \cdot 2^{12} \cdot (2^{12} \log_2 2^{12}) = 402,653,184$ operations, the same as directly computing a $2^{24}$ transform. However, we must add a further $2 \cdot 2^{24}$ operations to perform the scaling step (the scaling requires 2 multiplications: one to scale the coefficient, the other to update the scaling factor).
The following table gives the total operations and overhead for a few transform sizes.
| NTT size ($\log_2 N$) | Full transform operations | 4-step operations | Overhead % |
|---|---|---|---|
| 18 | 4,718,592 | 5,242,880 | 11.1 |
| 20 | 20,971,520 | 23,068,672 | 10.0 |
| 22 | 92,274,688 | 100,663,296 | 9.1 |
| 24 | 402,653,184 | 436,207,616 | 8.3 |
| 26 | 1,744,830,464 | 1,879,048,192 | 7.7 |
| 28 | 7,516,192,768 | 8,053,063,680 | 7.1 |
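These counts follow directly from $N \log_2 N$ operations for the full transform plus $2N$ scaling operations for the 4-step version; a few lines of OCaml (illustrative, not repository code) reproduce the table:

```ocaml
(* A size-n transform costs n * log2(n) operations; the 4-step variant
   adds 2n scaling operations on top of that. *)
let () =
  List.iter
    (fun logn ->
      let n = 1 lsl logn in
      let full = n * logn in
      let four_step = full + (2 * n) in
      let overhead = 100.0 *. float_of_int (2 * n) /. float_of_int full in
      Printf.printf "%2d | %13d | %13d | %.1f\n" logn full four_step overhead)
    [ 18; 20; 22; 24; 26; 28 ]
```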
Transposition
The 4-step algorithm is described in terms of row and column transforms plus a final transposition step. The transposition is not actually performed as a separate step by the hardware design. Rather, data is read from and written to the matrix in memory as follows:
- Read columns
- Write columns
- Read rows
- Write columns
Writing the results of the row transforms back as columns is what performs the final transposition implicitly. Reading a single column at a time would be very inefficient for DDR-style dynamic memories. Fortunately, as we scale up the design for performance, we can read multiple columns at a time. We will also show a data reorganisation approach that leads to very efficient memory access, at the cost of reordering the coefficients in the input matrix.
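To illustrate the access pattern (reusing the toy `ntt`, `pow_mod`, `omega_16` and `p` from the sketch in the 4-step section, and again purely illustrative rather than the hardware's implementation), the two passes can be written so that no explicit transpose ever occurs:

```ocaml
(* Pass 1 reads and writes columns (column NTTs plus the twiddle scaling);
   pass 2 reads rows and writes columns of the output, which performs the
   transposition implicitly. The input array is overwritten by pass 1. *)
let four_step_by_passes ~rows ~cols x =
  let n = rows * cols in
  let w = pow_mod omega_16 (16 / n) p in
  let y = Array.make n 0 in
  (* Pass 1: read column j, NTT it, scale element i by w^(i*j), write column j. *)
  for j = 0 to cols - 1 do
    let col = ntt (pow_mod w cols p) (Array.init rows (fun i -> x.(i * cols + j))) in
    Array.iteri (fun i v -> x.(i * cols + j) <- v * pow_mod w (i * j) p mod p) col
  done;
  (* Pass 2: read row i, NTT it, write the result as column i of the output. *)
  for i = 0 to rows - 1 do
    let row = ntt (pow_mod w rows p) (Array.sub x (i * cols) cols) in
    Array.iteri (fun j v -> y.(j * rows + i) <- v) row
  done;
  y
```

The output array holds the same result as the `four_step` sketch above, but no pass ever shuffles the matrix purely for the sake of transposing it.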