Dependable and Scalable FPGA Computing Using HDL-based Checkpointing

VU HOANG GIA (1561036)


Thanks to their high computational capability, reconfigurability, power efficiency, and the advantages of customizing hardware for domain-specific applications, Field Programmable Gate Arrays (FPGAs) are now widely deployed in modern datacenters and high-performance computing systems. However, this deployment compounds the dependability problem of these computing systems because of their growing size and complexity. It also challenges designers to scale such systems. In this doctoral dissertation, we present how FPGA computing can be made dependable and scalable using HDL-based checkpointing. First, we study a method to guarantee the consistency of snapshots between the FPGA and other components. Such consistency is essential for the snapshots to be resumed correctly on the FPGA. Second, we propose two checkpointing architectures along with a checkpointing mechanism on FPGA: CPRtree, a tree-based checkpointing architecture, and CPRflatten, a ring-based flattened checkpointing architecture. Both architectures are transparent to applications and portable across different hardware platforms. Third, we investigate static analysis of the original HDL source code for CPRflatten, from fundamentals to algorithms, in order to reuse existing hardware resources for checkpointing and thus reduce the hardware cost of the checkpointing functionality. Fourth, we introduce two Python-based tools, describing their structures and algorithms, that generate the checkpointing infrastructure for CPRtree and CPRflatten, so that designers no longer need to write any checkpointing source code themselves. The two tools can be integrated seamlessly into hardware design flows, and their position in the design flow ensures that our checkpointing architectures are independent of other tools and technologies. Fifth, we study a checkpoint/restart scheme for dependability of FPGA computing.
In this scheme, we also introduce a software stack with application programming interface (API) functions for “coarse-grained” management from the host. The stack is also transparent to applications and portable across hardware platforms. Sixth, we present two schemes for scalability of FPGA computing that employ our checkpointing architectures. The first scheme, on-the-fly multitasking on FPGA, allows multiple users to efficiently share a limited reconfigurable fabric. The second scheme, on-the-fly hardware task migration in heterogeneous FPGA computing, allows a hardware task to be migrated between FPGA fabrics built on different technologies. We evaluate our proposals in terms of hardware overhead, maximum clock frequency degradation, data footprint, performance overhead, and power consumption. Although the hardware overhead is still significant, the performance degradation and the additional power consumption are small. Our proposals show potential for bringing FPGAs to hyper-scale computing, such as hyper-scale data centers and hyper-scale clouds, while taking advantage of software-based computing.