Implementation Details
Our implementation targets an AWS EC2 f1.2xlarge instance. This instance contains a single VU9P part. We used the aws-fpga Vitis flow for our hardware design and the Xilinx Runtime (XRT) in our host code to interface with the FPGA.
High-Level Implementation Overview
Our FPGA design comprises the following three high-level components:
- Memory Access Blocks (written with HLS)
- IO transformation blocks (written in Hardcaml)
- MSM block to compute the Pippenger bucket aggregation (written in Hardcaml, discussed in detail in other sections)
Memory Access Blocks
The memory access blocks (`krnl_mm2s` and `krnl_s2mm`) interface with a DDR memory bank via an AXI interface and convert the data to/from AXI streams.
We have two separate HLS kernels loading affine points and scalars from DDR, and a single HLS kernel to write bucket values back to DDR (see the sketch below). Using HLS blocks for these simple tasks is a massive productivity boost:
- AXI Streams are much easier to work with than AXI ports in RTL
- The HLS kernels have a predefined API for communication with the host drivers
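As a rough illustration, an mm2s-style kernel in Vitis HLS boils down to a pipelined loop that reads 512-bit words over an AXI4 master interface and pushes them onto an AXI stream. The sketch below is a minimal stand-in with assumed names and pragmas, not our actual `krnl_mm2s`:

```cpp
// Minimal mm2s-style HLS kernel sketch: DDR (AXI4 master) -> 512-bit AXI stream.
#include <ap_int.h>
#include <ap_axi_sdata.h>
#include <hls_stream.h>

extern "C" void krnl_mm2s_sketch(const ap_uint<512> *mem,                  // DDR, via AXI4 master
                                 hls::stream<ap_axiu<512, 0, 0, 0>> &out,  // 512-bit AXI stream
                                 int num_words) {
#pragma HLS INTERFACE m_axi port = mem offset = slave bundle = gmem
#pragma HLS INTERFACE axis port = out
#pragma HLS INTERFACE s_axilite port = mem
#pragma HLS INTERFACE s_axilite port = num_words
#pragma HLS INTERFACE s_axilite port = return

  for (int i = 0; i < num_words; i++) {
#pragma HLS PIPELINE II = 1
    ap_axiu<512, 0, 0, 0> beat;
    beat.data = mem[i];
    beat.keep = -1;                    // all bytes valid
    beat.last = (i == num_words - 1);  // end of the whole transfer
    out.write(beat);
  }
}
```

The `s_axilite` arguments are what give the kernel the standard register interface that XRT drives from the host.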
We used a single DDR bank for scalars, points and bucket values (see the example connectivity config after this list), because:
- Our memory access pattern is streaming friendly
- Our computation is nowhere near memory bound
- The AWS shell has only 1 memory controller built-in
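With a single bank, the connectivity section of the Vitis linker config simply maps every kernel's AXI master to the same DDR bank. The instance names, port names, and bank index below are assumptions for illustration, not our exact configuration:

```
[connectivity]
sp=krnl_mm2s_1.m_axi_gmem:DDR[0]
sp=krnl_mm2s_2.m_axi_gmem:DDR[0]
sp=krnl_s2mm_1.m_axi_gmem:DDR[0]
```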
We further modified the memory access block `krnl_mm2s` to allow us to set the tlast bit of the AXI stream when transferring data from the host. This was key to streaming a subset of the point and scalar inputs from DDR into the FPGA while the host performs other calculations in parallel.
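One plausible shape for that modification is sketched below, purely for illustration: the host supplies a packet length, and the streaming loop asserts tlast at each packet boundary instead of only at the end of the full transfer. The function and argument names are hypothetical, and the interface pragmas would match the earlier sketch.

```cpp
// Hypothetical core loop of a tlast-aware mm2s kernel (not the actual krnl_mm2s).
#include <ap_int.h>
#include <ap_axi_sdata.h>
#include <hls_stream.h>

void stream_with_tlast(const ap_uint<512> *mem, int num_words, int words_per_packet,
                       hls::stream<ap_axiu<512, 0, 0, 0>> &out) {
  int in_packet = 0;
  for (int i = 0; i < num_words; i++) {
#pragma HLS PIPELINE II = 1
    ap_axiu<512, 0, 0, 0> beat;
    beat.data = mem[i];
    beat.keep = -1;  // all bytes valid
    // Assert tlast at a host-chosen boundary, not just on the final beat, so a
    // subset of the points/scalars can be streamed while the host keeps computing.
    in_packet++;
    beat.last = (in_packet == words_per_packet) || (i == num_words - 1);
    if (beat.last)
      in_packet = 0;
    out.write(beat);
  }
}
```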
IO Transformation Blocks
The memory access blocks read / write AXI streams with a bus width of 512-bits. We have some IO transformation blocks to reshape the stream into the appropriate data format.
The `merge_axi_stream` block converts the 512-bit AXI stream into a stream of scalars and affine points, and aligns the two streams so that a scalar and its corresponding point are available on the same clock cycle.
The `msm_result_to_host` block does a similar alignment on the bucket values output to be written back to the host.
Engineering to Improve Performance
Targeting a High Frequency
At realistic frequencies, our design is compute bound rather than memory bound, so increasing the clock frequency directly results in a faster MSM.
The Vitis linker config file allows us to easily specify a target frequency to compile our design at. This makes it very convenient to experiment with targeting various clock frequencies with a simple config file change.
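For example, assuming the standard v++ `kernel_frequency` option (check the exact config-file syntax against your Vitis version), requesting 280MHz is a one-line change:

```
kernel_frequency=280
```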
Another nice feature of Vitis is that it automatically downclocks the design when it fails to meet timing. This allows us to experiment aggressively with high frequencies and still have a working design after a long 12-hour build that fails timing. In our submission, we used a config that targets 280MHz, but got downclocked to 278MHz.
(Note that the choice of frequency might affect the implementation results! Notably, targeting 278MHz directly might not have delivered the same result.)
Vivado Implementation Strategies
We experimented with various Vivado implementation strategies. We found that `Congestion_SSI_SpreadLogic_high` tends to deliver better results, likely due to the high congestion in our design.
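When driving Vivado through v++, the strategy can be selected from the same linker config via the `[vivado]` section. The snippet below assumes the default `impl_1` run name and is a sketch rather than our exact configuration:

```
[vivado]
prop=run.impl_1.strategy=Congestion_SSI_SpreadLogic_high
```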
Not Enabling Retiming
We experimented with synthesis retiming by adding

```tcl
set_param synth.elaboration.rodinMoreOptions "rt::set_parameter synRetiming true"
```

to our pre-synthesis hooks. Surprisingly, we found that it degrades a build's frequency from ~250MHz to ~125MHz! We did not investigate why. We hypothesize that this is because our design contains many register->register paths dedicated to SLR crossing, and synthesis retiming inserts combinational logic into them, damaging routing results.
Needless to say, we did not include retiming as part of our submission!
SLR Partitioning
Modern FPGAs are really several dies stacked together with limited interconnect resources between them. Xilinx calls these dies Super Logic Regions (SLRs). The VU9P FPGA contains 3 SLRs. There are interconnects between SLR0<->SLR1 and SLR1<->SLR2.
In our design, we carefully partitioned things such that the RAMs holding the running bucket values for the windows are spread across the 3 SLRs, and the various stages of the pipelined point adder are explicitly split across multiple SLRs and fitted with the necessary SLR-crossing registers.
Hardcaml makes some of these complicated partitioning choices a lot more manageable. We have config fields that allow us to specify the following:
- How the windows RAM should be partitioned – specifically, we can assign the number of windows per SLR.
- The assignment of adder stages to different SLRs. The design automatically instantiates the necessary SLR crossings between stages.
We added the following pre-placement script to make the process of assigning module instantiations to SLRs more convenient. With this simple configuration, our Hardcaml design simply needs to add a `_SLR{0,1,2}` suffix to instantiation names, based on the config, to map a module instantiation into a particular SLR.
```tcl
add_cells_to_pblock pblock_dynamic_SLR0 [get_cells -hierarchical *SLR0*]
add_cells_to_pblock pblock_dynamic_SLR1 [get_cells -hierarchical *SLR1*]
add_cells_to_pblock pblock_dynamic_SLR2 [get_cells -hierarchical *SLR2*]
```