HiPR: Fast Incremental Custom Partial Reconfiguration for HLS Developers

Published in The 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2022

Recommended citation: Yuanlong Xiao, Andre DeHon. “HiPR: Fast Incremental Custom Partial Reconfiguration for HLS Developers.” The 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’22).

Abstract

FPGA accelerators have drastically changed application development as they allow users to exploit arbitrary custom hardware architecture with low Non-Recurring Engineering (NRE) cost. Despite the significant absolute performance and energy efficiency compared with CPUs, FPGA development requires high levels of hardware expertise and programming skills to unleash the potential parallelism. High-Level Synthesis can abstract away low-level circuits design and improve the coding productivity. However, this also lengthens the compilation time, exacerbating an already slow edit-compile-debug loop that discourages the development and refinement of FPGA accelerators. Partial Reconfiguration techniques can decrease the compilation time by reducing and parallelizing the size of the compilation task. But defining partial reconfigurable regions also needs expert layout-level knowledge, making this approach inaccessible to the high-level developers that HLS is intended to attract. To address the problems above, we propose HiPR, an open-source framework that bridges the gap between HLS and PR. With HiPR, users can define a C/C++ function (rather than a Verilog module) as partially reconfigurable without considering detailed low-level constraints. HiPR automates the PR floorplan and allows the users to define elastic resource requirements for the C-level PR function for quick further tuning later. By mapping the full set of Rosetta Benchmarks, we show HiPR can find the proper floorplan solution within seconds and generate the overlay for later tuning. Significantly, the incremental compilation time can be accelerated by 3–10X with no performance loss.

Motivation

In the standard Xilinx Vitis flow, synthesis takes longer in the initial compilation (green blocks) because the peripheral modules are compiled only once, at that stage. When we re-save one source file and recompile the benchmarks, compile time drops by only 21-36%, even though the code in the file is unchanged.

We raise the key question here: Can we compile the HLS source code incrementally, like software, so that we only need to perform placement and routing on the portions of the design that change?

Compute Model

HiPR uses a dataflow computational graph model based on Kahn Process Networks (KPN) [15, 18]. Each kernel, called an operator, is described as a C program. An operator receives inputs and sends outputs through latency-insensitive protocols. Operators are connected by latency-insensitive interfaces [10], which can be mapped to FIFOs or handshake relay stations [6, 36].
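A minimal C++ sketch of this compute model, assuming a plain software FIFO in place of a hardware stream. The names `Fifo`, `scale_op`, `offset_op`, and `run_pipeline` are hypothetical and are not part of HiPR. Each operator fires only when a token is available, so the output does not depend on when tokens arrive.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Hypothetical stand-in for a latency-insensitive channel.
// In Vitis HLS, a stream class such as hls::stream usually plays this role.
template <typename T>
class Fifo {
public:
    void write(const T& v) { q_.push(v); }
    bool empty() const { return q_.empty(); }
    T read() {
        assert(!q_.empty());  // callers must check empty() first
        T v = q_.front();
        q_.pop();
        return v;
    }
private:
    std::queue<T> q_;
};

// Operator 1: consumes one token and emits it doubled.
// Returns false (does nothing) when no input token is available.
bool scale_op(Fifo<int32_t>& in, Fifo<int32_t>& out) {
    if (in.empty()) return false;
    out.write(in.read() * 2);
    return true;
}

// Operator 2: consumes one token and emits it plus one.
bool offset_op(Fifo<int32_t>& in, Fifo<int32_t>& out) {
    if (in.empty()) return false;
    out.write(in.read() + 1);
    return true;
}

// Connect the two operators through an internal FIFO and keep
// firing them until neither has input left.
void run_pipeline(Fifo<int32_t>& src, Fifo<int32_t>& sink) {
    Fifo<int32_t> link;
    bool fired = true;
    while (fired) {
        fired = scale_op(src, link);
        fired = offset_op(link, sink) || fired;
    }
}
```

Because the operators communicate only through the FIFOs, either one could be replaced by a different implementation without touching the other, which is the property that makes per-operator recompilation possible.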

HiPR Toolflow

Initial-Compile Flow (Overlay Generation)

When compiling a C-based application for the first time, HiPR compiles the operators separately and in parallel. PragmaParse parses top.cpp to extract the interconnect information. The Floorplanner interprets the post-synthesis reports, detects the PR-functions from the PR pragmas in the operators’ header files, and generates the floorplan XDC constraint files. The overlay is then placed and routed monolithically.
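The floorplanning step can be pictured as matching each operator's resource needs against candidate PR regions. The sketch below is not HiPR's floorplanner: it substitutes a plain first-fit heuristic, and the `Resources`, `Operator`, `Region`, and `first_fit` names, along with the choice of resource fields, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical resource summary, as might be read from a post-synthesis report.
struct Resources {
    int luts;
    int brams;
    int dsps;
};

struct Operator {
    std::string name;
    Resources need;
};

struct Region {
    std::string name;
    Resources capacity;
    bool taken;
};

// True when every resource requirement is within the region's capacity.
bool fits(const Resources& need, const Resources& cap) {
    assert(need.luts >= 0 && need.brams >= 0 && need.dsps >= 0);
    return need.luts <= cap.luts && need.brams <= cap.brams && need.dsps <= cap.dsps;
}

// First-fit: give each operator the first free region that can hold it.
// Returns one region index per operator, or -1 when no free region fits.
std::vector<int> first_fit(const std::vector<Operator>& ops, std::vector<Region>& regions) {
    std::vector<int> assignment(ops.size(), -1);
    for (std::size_t i = 0; i < ops.size(); ++i) {
        for (std::size_t r = 0; r < regions.size(); ++r) {
            if (!regions[r].taken && fits(ops[i].need, regions[r].capacity)) {
                regions[r].taken = true;
                assignment[i] = static_cast<int>(r);
                break;
            }
        }
    }
    return assignment;
}
```

A real floorplanner would also have to translate the chosen regions into physical constraints; this sketch stops at the assignment.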

Incremental-Compile Flow (Separate and Parallel Compilation)

If one or more PR-functions are modified later, HiPR detects the modified files and recompiles only those modules, in parallel.
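How HiPR detects modified files is not described here. One possible approach, shown purely as an assumption, is to compare a hash of each operator's source text against a saved snapshot, so that re-saving an unchanged file triggers no recompilation. `Snapshot`, `take_snapshot`, and `changed_files` are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Maps an operator source file name to a hash of its contents.
using Snapshot = std::map<std::string, std::size_t>;

// Build a snapshot from (file name -> file contents) pairs.
// Contents are passed in memory here to keep the sketch self-contained.
Snapshot take_snapshot(const std::map<std::string, std::string>& sources) {
    Snapshot snap;
    std::hash<std::string> hasher;
    for (const auto& kv : sources) {
        snap[kv.first] = hasher(kv.second);
    }
    return snap;
}

// Returns the files that are new or whose content hash differs
// from the previous snapshot. Only these need to be recompiled.
std::vector<std::string> changed_files(const Snapshot& before, const Snapshot& after) {
    std::vector<std::string> changed;
    for (const auto& kv : after) {
        auto it = before.find(kv.first);
        if (it == before.end() || it->second != kv.second) {
            changed.push_back(kv.first);
        }
    }
    return changed;
}
```

Each file returned by `changed_files` corresponds to an independent compile job, so the jobs can be launched in parallel.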

Evaluation on Alveo U50

Initial-Compile Time and Overhead

For the initial compile, HiPR needs more time to set up the overlay: the compile-time overhead ranges from 15% to 67%.

Benchmark	Vitis (s)	HiPR (s)	HiPR Overhead
Rendering	4264	7152	67 %
Digit Recognition	5172	6125	19 %
Spam Filter	3942	4541	15 %
Optical Flow	4139	6880	66 %
Face Detection	6288	8851	40 %
BNN	6584	9632	46 %

Incremental-Compile Time and Speedup

Each benchmark may have several PR-functions, so we report the worst case for each benchmark. For the incremental compile, HiPR outperforms Xilinx Vitis by 3.4-5.6X.

Benchmark	Vitis (s)	HiPR (s)	HiPR Speedup
Rendering	3278	908	3.6X
Digit Recognition	3927	695	5.6X
Spam Filter	2865	692	4.1X
Optical Flow	2918	657	4.4X
Face Detection	4954	952	5.2X
BNN	4154	1232	3.4X

Looking at the compile-time distribution over all operators in all benchmarks, the compile speedup reaches up to 10.5X.

Conclusion

HiPR enables users to define Partially Reconfigurable (PR) functions at the C level instead of as Verilog modules. The automated C-to-bitstream flow removes the need for manual intervention. The incremental-compile time can be accelerated by 3.4-10.5X.
