Results
As per the ZPrize competition spec, we use an AWS f1.2xlarge instance, targeting the BLS12-377 curve with the G1 subgroup-generator. This AWS instance contains a Intel Xeon E5-2686 v4 Processor (2.3 GHz (base) and 2.7 GHz (turbo)) and a UltraScale+ VU9P FPGA.
We achieved a performance of 20.33s $2^26$ MSMs with a batch size of 4.
Breakdown of Individual Steps
In our actual work, we masked a memory transfer latency and host post processing latency with the bucket aggregation computation on the FPGA.
Task | Time(s) |
---|---|
Memcpy 2^26 scalars to special memory region | 0.289 |
Transferring 2^26 scalars to FPGA | 0.198 |
Computing bucket aggregation on FPGA | 4.968 |
Copying bucket values back from FPGA | 0.001 |
Doing on-host postprocessing | 0.470 |
Serialized total time per MSM | 5.927 |
Resource Utilization
The AWS shell uses roughly 20% of the resources available on the FPGA. We tuned our MSM implementation to use the remaining resources as much as possible while still being able to successfully route in Vivado.
An interesting observation is our CLB usage usage on SLR2 (the SLR that does not contain any of the AWS shell) is almost double those of the LUT usage! This is likely due to high congestion in our design.
Site Type | SLR0 | SLR1 | SLR2 | SLR0 % | SLR1 % | SLR2 % |
---|---|---|---|---|---|---|
CLB | 24490 | 36166 | 37216 | 49.72 | 73.42 | 75.55 |
CLBL | 12267 | 17902 | 18075 | 49.87 | 72.77 | 73.48 |
CLBM | 12223 | 18264 | 19141 | 49.57 | 74.06 | 77.62 |
CLB LUTs | 109381 | 131667 | 146118 | 27.76 | 33.41 | 37.08 |
LUT as Logic | 102111 | 116533 | 125870 | 25.91 | 29.57 | 31.94 |
using O5 output only | 953 | 1360 | 12 | 0.24 | 0.35 | <0.01 |
using O6 output only | 79727 | 74527 | 79763 | 20.23 | 18.91 | 20.24 |
using O5 and O6 | 21431 | 40646 | 46095 | 5.44 | 10.31 | 11.70 |
LUT as Memory | 7270 | 15134 | 20248 | 3.69 | 7.67 | 10.26 |
LUT as Distributed RAM | 7002 | 4268 | 0 | 3.55 | 2.16 | 0.00 |
LUT as Shift Register | 268 | 10866 | 20248 | 0.14 | 5.51 | 10.26 |
using O5 output only | 0 | 0 | 1 | 0.00 | 0.00 | <0.01 |
using O6 output only | 96 | 4782 | 8247 | 0.05 | 2.42 | 4.18 |
using O5 and O6 | 172 | 6084 | 12000 | 0.09 | 3.08 | 6.08 |
CLB Registers | 159680 | 278450 | 294595 | 20.26 | 35.33 | 37.38 |
CARRY8 | 1518 | 7595 | 18131 | 3.08 | 15.42 | 36.81 |
F7 Muxes | 4818 | 1265 | 0 | 2.45 | 0.64 | 0.00 |
F8 Muxes | 279 | 226 | 0 | 0.28 | 0.23 | 0.00 |
F9 Muxes | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
Block RAM Tile | 153.5 | 239 | 28 | 21.32 | 33.19 | 3.89 |
RAMB36/FIFO | 152 | 235 | 24 | 21.11 | 32.64 | 3.33 |
RAMB36E2 only | 128 | 235 | 24 | 17.78 | 32.64 | 3.33 |
RAMB18 | 3 | 8 | 8 | 0.21 | 0.56 | 0.56 |
RAMB18E2 only | 3 | 8 | 8 | 0.21 | 0.56 | 0.56 |
URAM | 210 | 127 | 126 | 65.63 | 39.69 | 39.38 |
DSPs | 0 | 859 | 2140 | 0.00 | 37.68 | 93.86 |
PLL | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
MMCM | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 |
Unique Control Sets | 4081 | 4585 | 127 | 4.14 | 4.65 | 0.13 |
Below is an screenshot of the fully placed and routed design in Vivado. A large proportion of the FPGA has been utilized, but there’s still plenty of unused resources on the chip.