Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning

Hongwu Peng1, Shaoyi Huang1, Tong Geng2, Ang Li2, Weiwen Jiang3, Hang Liu4, Shusen Wang4, Caiwen Ding1
1University of Connecticut, 2Pacific Northwest National Laboratory, 3University of Notre Dame, 4Stevens Institute of Technology


Although Transformer-based language representations achieve state-of-the-art accuracy on various natural language processing (NLP) tasks, their large model size is a challenge for resource-constrained computing platforms. Weight pruning, a popular and effective technique for reducing the number of weight parameters and accelerating the Transformer, has been investigated on GPUs. However, Transformer acceleration using weight pruning on field-programmable gate arrays (FPGAs) remains unexplored. This paper investigates column balanced block-wise pruning on the Transformer and designs an FPGA acceleration engine customized for balanced block-wise matrix multiplication. We implement the Transformer model with proper hardware scheduling, and experiments show that Transformer inference on the FPGA achieves 10.35 ms latency with a batch size of 32, which is a 10.96× speedup over the CPU platform and a 2.08× speedup over the GPU platform.
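
To make the idea of column balanced block-wise pruning concrete, the following is a minimal sketch of one plausible reading of it: the weight matrix is tiled into fixed-size blocks, and within each column of blocks only the blocks with the largest L2 norm are kept, so that every block column retains the same number of blocks. The function name `column_balanced_block_prune`, its parameters, the L2-norm ranking criterion, and the block shape used in the demo are all illustrative assumptions and are not taken from the abstract.

```python
import math
import random


def column_balanced_block_prune(weights, block_rows, block_cols, keep_per_column):
    """Zero out low-norm blocks so every block column keeps the same block count.

    Hypothetical helper for illustration only.
    weights: dense matrix as a list of equal-length rows.
    block_rows, block_cols: block shape; must divide the matrix shape evenly.
    keep_per_column: number of blocks retained in each block column.
    Returns (pruned_matrix, mask), where mask[i][j] is True for a kept block.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    if n_rows % block_rows or n_cols % block_cols:
        raise ValueError("block shape must divide the matrix shape")
    grid_rows, grid_cols = n_rows // block_rows, n_cols // block_cols
    if not 0 <= keep_per_column <= grid_rows:
        raise ValueError("keep_per_column must be between 0 and the block-row count")

    def block_norm(bi, bj):
        # L2 (Frobenius) norm of block (bi, bj); assumed ranking criterion.
        total = 0.0
        for r in range(bi * block_rows, (bi + 1) * block_rows):
            for c in range(bj * block_cols, (bj + 1) * block_cols):
                total += weights[r][c] ** 2
        return math.sqrt(total)

    mask = [[False] * grid_cols for _ in range(grid_rows)]
    for bj in range(grid_cols):
        # Rank the blocks of this block column by norm, largest first.
        ranked = sorted(range(grid_rows), key=lambda bi: block_norm(bi, bj), reverse=True)
        for bi in ranked[:keep_per_column]:
            mask[bi][bj] = True

    pruned = [
        [weights[r][c] if mask[r // block_rows][c // block_cols] else 0.0
         for c in range(n_cols)]
        for r in range(n_rows)
    ]
    return pruned, mask


# Demo: an 8x8 random matrix, 2x2 blocks, 2 of 4 blocks kept per block column.
random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
pruned, mask = column_balanced_block_prune(W, 2, 2, keep_per_column=2)
kept_per_column = [sum(row[j] for row in mask) for j in range(4)]
print(kept_per_column)  # [2, 2, 2, 2]
```

Because every block column ends up with the same number of nonzero blocks, each column presents an equal amount of work to a block-wise matrix multiplication engine, which is the property a balanced hardware schedule would rely on under this reading.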