A High-Speed CNN Hardware Accelerator with Regular Pruning

Yuan Song, Bi Wu, Tian Yuan, Weiqiang Liu
Nanjing University of Aeronautics and Astronautics


Abstract

The deployment of convolutional neural networks (CNNs) in resource-constrained applications is limited by their large number of parameters and computations. Compressing CNN models through techniques such as pruning and quantization is therefore necessary. In this paper, a hybrid compression strategy is investigated. The method divides a CNN model into two parts, the convolutional (CONV) layers and the fully connected (FC) layers, and applies a different pruning method to each. Since the CONV layers are computationally intensive, a hardware-oriented regular pruning (HRP) method is proposed for them. HRP guarantees that the weight distribution of the pruned CONV layers is regular, which enables high-speed computation on a parallel architecture. To obtain a high compression rate, non-structured pruning is applied to the FC layers to eliminate more redundant parameters. Experimental results show that, compared with the baseline, the proposed hybrid compression strategy achieves a 31.74× improvement in compression rate with a negligible top-5 accuracy loss (0.25%) for VGG-16 on the ILSVRC2012 dataset. Furthermore, a hardware accelerator based on HRP is implemented on a Xilinx VCU118 evaluation board. Compared with state-of-the-art designs, the proposed accelerator reaches a maximum performance of 110.6 frames per second (FPS).
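The hybrid strategy described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the exact regularity pattern used by HRP, so the sketch assumes one plausible form: every group of weights (here, each 3×3 kernel) keeps the same number of largest-magnitude weights, so that parallel processing elements would receive equal workloads. The FC part uses plain magnitude-based non-structured pruning. The function names, group size, keep count, sparsity level, and layer shapes are all hypothetical and chosen only for illustration; they are not the authors' settings.

```python
import numpy as np


def regular_prune(weights, keep_per_group, group_size):
    """Assumed 'regular' pattern: keep the same number of weights per group.

    The tensor is viewed as rows of `group_size` consecutive weights, and the
    `keep_per_group` largest-magnitude weights in each row are retained.
    """
    flat = weights.reshape(-1, group_size)
    n_drop = group_size - keep_per_group
    # Column indices of the n_drop smallest magnitudes within each group.
    drop = np.argsort(np.abs(flat), axis=1)[:, :n_drop]
    mask = np.ones(flat.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    mask = mask.reshape(weights.shape)
    return weights * mask, mask


def unstructured_prune(weights, sparsity):
    """Non-structured pruning: zero the smallest-magnitude fraction `sparsity`."""
    n_drop = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=bool)
    if n_drop > 0:
        drop = np.argsort(np.abs(weights).ravel())[:n_drop]
        mask[drop] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask


def compression_rate(masks):
    """Ratio of total weights to retained (nonzero-mask) weights."""
    total = sum(m.size for m in masks)
    kept = sum(int(m.sum()) for m in masks)
    return total / kept


# Toy layers (hypothetical shapes, random weights).
rng = np.random.default_rng(0)
conv_w = rng.standard_normal((8, 4, 3, 3))   # out_ch, in_ch, kH, kW
fc_w = rng.standard_normal((16, 32))

conv_pruned, conv_mask = regular_prune(conv_w, keep_per_group=3, group_size=9)
fc_pruned, fc_mask = unstructured_prune(fc_w, sparsity=0.9)

# Every 3x3 kernel keeps exactly 3 weights; the FC mask has no such constraint.
print(conv_mask.reshape(-1, 9).sum(axis=1))
print(compression_rate([conv_mask, fc_mask]))
```

In this sketch the regular mask yields identical nonzero counts per kernel, which is the property that would let a parallel datapath avoid load imbalance, whereas the non-structured mask trades that regularity for a higher pruning ratio.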