Building, Testing and Benchmarking

Please clone our submission github repository before following any instructions below.

Building the design from source

Instructions are given below for building from source. A prerequisite is that OCaml has been set up (outlined in the main

It is important you use the AMI version 1.10.5 and Vivado version 2020.2 to achieve the same results. The rtl_checksum expected of the Verilog when generated from the Hardcaml source is 1929f78e1e4bafd9cf88d507a3afa055.

Compiling the BLS12-377 reference

Run cargo build in libs/rust/ark_bls12_377_g1 to compile the dynamic library exposing a reference implementation of the BLS12-377 g1 curve. This is necessary for the unit tests and verilog generator to work correctly.

z3 should also be installed to run Hardcaml RTL simulations.

Generating the Verilog from Hardcaml

The following instructions assume you are in the zprize/msm_pippenger folder.

The Hardcaml code can be built by calling dune build, which will also cause the top level Verilog to be generated in fpga/krnl_msm_pippenger/krnl_msm_pippenger.v. We also provide a dune target for generating an md5sum fpga/krnl_msm_pippenger/rtl_checksum.md5 of the Verilog expected, so that if changes to the Hardcaml source are made that modify the Verilog (which is not checked into the repo), the rtl-checksum will show a difference.

Simulations in Hardcaml

We have various expect tests in the test folders which can be run by calling dune runtest. To optionally run a longer simulation, we added binaries that can be called and various arguments set. These run with the Verilator backend, which after a longer compile time, will provide much faster simulation time than the built-in Hardcaml simulator. Make sure you have Verilator installed when running this binary. To simulate 128 random points, run the following command:

dune exec ./hardcaml/bin/simulate.exe -- kernel -num-points 128 -verilator -timeout 1000000

The -waves switch can be optionally provided to open the simulation in the Hardcaml waveform viewer. A larger timeout should be provided when simulating more points.

Building an FPGA image for AWS

You should build the FPGA image on an AWS box with the FPGA Developer AMI installed. Make sure to clone the aws-fpga repo to ~/aws-fpga. We used the 2a36c4d76b68bb9c60bf2f7b0be0fd9ea134978e revision of the repository.

Run the following to set up the necessary environment for the build.

source ~/aws-fpga/

Navigate into the fpga directory which contains the scripts to build an actual FPGA design (takes 8-10 hours). The compile script below will also build the Hardcaml to generate the required Verilog. As this takes awhile, it is recommended that you run this from a tmux session in the build instance.

cd fpga

The image built from the source at the time of writing will produce an FPGA AFI that runs at 278MHz, automatically clocked down from 280MHz by Vitis.

We have done repeated builds to make sure the image built from source is identical to what we tested. As a checkpoint, here are a few checksums that will be reported by Vivado:

Phase 1.1 Placer Initialization Netlist Sorting | Checksum: e1119738
Ending Placer Task | Checksum: 1625c9702
Phase 4.1 Global Iteration 0 | Checksum: 2d7b45461
Phase 4.5 Global Iteration 4 | Checksum: 231df891e
Phase 13 Route finalize | Checksum: 176e41d90

Creating the AWS AFI

Once you have successfully called in the fpga folder, you want to pass the results to the AWS script responsible for generating the AFI an end-user can run:


After running the script, there should be a folder ‘afi/’. Get the afi id from the file afi/{date}_afi_id.txt and run the following command to track the progress of its creation:

aws ec2 describe-fpga-images --fpga-image-ids <afi-...>

Which will show up as “available” when the image is ready to use.

Benchmarking and Testing

AWS setup

You need to run these steps on an AWS F1 box with an FPGA. Make sure you have cloned the aws-fpga repo and run:

source ~/aws-fpga/

Optionally check the status of the FPGA:

systemctl status mpd

You need the .awsxclbin file from the build box - usually the easiest way is to download this from the s3 bucket or scp it over.

MSM FPGA Test Harness

We include a modified version of the GPU MSM test harness for testing and benchmarking our FPGA design. A feature we added is the ability to load test data from a file using an environment variable TEST_LOAD_DATA_FROM.

As the original GPU test harness expected the points returned by the FPGA to be in projective form. We modified our host driver code to convert our result point to this form.

The input and output points Basefield values are all in (or expected to be in) Montgomery form (Montgomery here refers to Montgomery multiplication). This is also some extra work that we are required to do in precomputation and during evaluation, as we don’t represent our points internally in Montgomery space.

Generating Test Data

Note: If you have already previously generated test data, you can skip this step.

Run with freshly generated points and save the test data to a directory. Note that this overwrites anything you have in ‘path/to/write/to’ without warning!

# The --nocapture flag will cause println! in rust to be displayed to the user
CMAKE=cmake3 XCLBIN=<file> TEST_NPOW=10 TEST_WRITE_DATA_TO=path/to/write/to cargo test --release -- --nocapture

For TEST_NPOW=10, this should run in a few seconds. For TEST_NPOW=26, it can take around 6-7 hours to generate all the points.

Running Tests

To load the points from disk, use the TEST_LOAD_DATA_FROM env-var. Note that you don’t have to specify TEST_NPOW when specifying TEST_LOAD_DATA_FROM. (If you do, it will just be ignored)

The following command will time and verify the result for 4 rounds of 226 MSM, and run some unit tests:

# The --nocapture flag will cause println! in rust to be displayed to the user
CMAKE=cmake3 XCLBIN=<file> TEST_LOAD_DATA_FROM=path/to/load/from cargo test --release -- --nocapture

The expected runtime for a 2^26 test:

  • around 5-10m to load the test data
  • around 20-30m to run multi_scalar_mult_init, as it needs to convert all the points given to it into twisted edwards form.
  • around 20s per MSM of batch size 4

Amongst the various outputs, one of them will show the latency of 4 rounds and correctness:

$ CMAKE=cmake3 XCLBIN=~/afi-0938ad46413691732.awsxclbin TEST_LOAD_DATA_FROM=~/testdata/2_26 cargo test  --release -- --nocapture
Done internal format conversion!
Loading XCLBIN=/home/ and doing openCL setups:
Found Platform
Platform Name: Xilinx
INFO: Reading /home/
Loading: '/home/'
Trying to program device[0]: xilinx_aws-vu9p-f1_shell-v04261818_201920_2
Device[0]: program successful!
[Copying input points to gmem] 1.4966s
multi_scalar_mult_init took Ok(1167.977275761s)
Running msm test for 1 rounds
Running MSM of [67108864] input points (4 batches)
Streaming input scalars across 4 chunks per batch (Mask IO and Post Processing)
Running multi_scalar_mult took Ok(20.37390268s) (round = 0)
test msm_correctness ... ok

For debugging some driver code, it’s usually easier to test on a set of trivial inputs (all points are G1 generator, all scalars are zero except for scalars[0]). Test data for 2^26 takes <5s to generate on AWS.

CMAKE=cmake3 XCLBIN=<file> TEST_TRIVIAL_INPUTS=1 XCLBIN=<file> cargo test --release -- --nocapture

To run the tests on the FPGA box with freshly generated points. Note that if you don’t specify TEST_NPOW, the test harness will raise (this is different from the original GPU test harness)

CMAKE=cmake3 XCLBIN=<file> TEST_NPOW=10 cargo test --release -- --nocapture


This is similar to running tests, except instead of running cargo test, you run cargo bench. The expect environment variables are similar:

CMAKE=cmake3 \
  TEST_LOAD_DATA_FROM=~/testdata/2_26 \
  cargo bench

The output shows show the result of 10 runs of 4 rounds each:

FPGA-MSM/2**26x4 time: [20.336 s 20.336 s 20.337 s]


AWS allows the average power to be measured during operation. Run the following command during the MSM evaluation to get a power measurement.

sudo fpga-describe-local-image -S 0 -M
Power consumption (Vccint):
   Last measured: 52 watts
   Average: 52 watts
   Max measured: 56 watts

Breakdown of individual host steps

The breakdown of how long each stage takes can be printed when changed the value of mask_io to false in host/driver/driver.cpp (this is not used in benchmarking as it has lower performance):

[memcpy-ing scalars to special memory region] 0.28928s
[transferring scalars to gmem] 0.198263s
[Doing FPGA Computation] 4.96781s
[Copying results back from gmem] 0.00128217s
[Doing on-host postprocessing] 0.469954s


  1. Because our solution offloads a non-trivial amount of work to the host to perform in parallel, you will see the best performance after a fresh reboot, and without other CPU-intensive tasks running at the same time.
  2. When running the tests, if you terminate the binary early by ctrl-c, it will leave the FPGA in a bad state which requires clearing and re-programming with these commands:


Running host_buckets.exe debug test

host_buckets.exe is a debug application that pumps test vectors into the FPGA from a file, and compares against a reference file. Note this is NOT the benchmarking program and has not been optimized at all.

Firstly, compile the host binaries:

cd host
mkdir build/
cd build/

# Need to explicitly use cmake3 in the aws box, since it's running a pretty old
# centos
cmake3 ..
make -j

# Now, generate the test vectors. This just needs to be done once. Here's an
# example for debugging with 50k points
dune build @../../hardcaml/bin/default
../../hardcaml/bin/tools.exe test-vectors \
  -num-points 50_000 \
  -input-filename inputs-50_000.txt \
  -output-filename outputs-50_000.txt \
  -seed 1

# Now, run the actual binary. It should say "TEST PASSED" in the end. If you
# see verbose test output with things that looks like coordinates, your test
# probably failed.
./driver/host_buckets \
  path/to/ \
  inputs-50000.txt \