A High-Speed CNN Hardware Accelerator with Regular Pruning

Yuan Song, Bi Wu, Tian Yuan, Weiqiang Liu
Nanjing University of Aeronautics and Astronautics


The deployment of convolutional neural networks (CNNs) in resource-constrained applications is limited by their large number of parameters and computations. Model compression, such as pruning and quantization, is therefore necessary. In this paper, a hybrid compression strategy is investigated. The method divides a CNN model into two parts, the convolution (CONV) layers and the fully-connected (FC) layers, and applies a different pruning method to each. Since the CONV layers are computationally intensive, a hardware-oriented regular pruning (HRP) is proposed. HRP guarantees that the weight distribution of the pruned CONV layers is regular, which promotes high-speed calculation on a parallel architecture. To obtain a high compression rate, non-structured pruning is applied to the FC layers to eliminate more redundant parameters. Experimental results show that, compared to the baseline, the proposed hybrid compression strategy achieves a 31.74X improvement in compression rate with a negligible top-5 accuracy loss (0.25%) for VGG-16 on the ILSVRC2012 data set. Furthermore, a hardware accelerator based on HRP is implemented on a Xilinx VCU118 evaluation board. Compared to state-of-the-art designs, the proposed accelerator reaches a maximum performance of 110.6 frames per second (FPS).
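
The difference between the two pruning styles named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact regularity pattern used by HRP, so the sketch assumes one common form of regular pruning: every fixed-size group of weights keeps the same number of nonzero entries, which lets parallel processing elements handle equal workloads. The function names, the group size, and the keep count are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def regular_prune(weights, group_size, keep):
    """Regular pruning (assumed pattern): in every consecutive group of
    `group_size` weights, keep the `keep` largest-magnitude entries and
    zero the rest, so each group has the same number of nonzeros."""
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    pruned = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Positions of the `keep` largest-magnitude weights in this group.
        ranked = sorted(range(group_size), key=lambda i: abs(group[i]), reverse=True)
        kept = set(ranked[:keep])
        pruned.extend(w if i in kept else 0.0 for i, w in enumerate(group))
    return pruned


def magnitude_prune(weights, sparsity):
    """Non-structured pruning: zero the smallest-magnitude fraction
    `sparsity` of all weights, wherever they happen to lie."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_prune])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


# Hypothetical flattened weights for one CONV kernel slice and one FC row.
conv_weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
fc_weights = [0.5, -0.02, 0.3, 0.01]

conv_pruned = regular_prune(conv_weights, group_size=4, keep=2)
fc_pruned = magnitude_prune(fc_weights, sparsity=0.5)
```

In this sketch each group of four CONV weights ends up with exactly two nonzeros, while the FC weights are pruned purely by magnitude with no constraint on where the zeros fall. The first property is what makes the CONV workload predictable for a parallel architecture; the second allows a higher compression rate at the cost of irregular sparsity.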