Regularization-Free Structural Pruning for GPU Inference Acceleration

Chuliang Guo1, Yanbing Yang2, Li Zhang1, Shaodi Wang3, He Li4, Keyu Long2, Xunzhao Yin1, Cheng Zhuo1
1Zhejiang University, 2The Second Research Institute of Civil Aviation Administration of China, Chengdu, China, 3WITIN Tech Co. Ltd., Beijing, China, 4University of Cambridge, Cambridge, UK


Pruning has recently become prevalent in deep neural network compression as a means to reduce memory footprint and accelerate network inference. Unstructured (fine-grained) pruning better preserves model accuracy, whereas structural (coarse-grained) pruning is preferred on general-purpose platforms such as GPUs. This paper proposes a regularization-free structural pruning scheme that takes advantage of both approaches by heuristically mixing a vector-wise fine-grained pruning mask and a block-wise coarse-grained pruning mask with an AND operation. Experimental results demonstrate that the proposed scheme achieves higher model accuracy and a higher sparsity ratio for VGG-16 on CIFAR-10 and CIFAR-100 than the commonly applied block sparsity and balanced sparsity.
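The mask-mixing idea can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the function names, the magnitude-based selection criteria, and all shape parameters (vector length, block size, keep ratios) are assumptions chosen for clarity.

```python
import numpy as np

def vectorwise_mask(w, keep_per_vec, vec_len):
    """Fine-grained mask: within each length-`vec_len` row vector,
    keep the `keep_per_vec` entries of largest magnitude."""
    rows, cols = w.shape
    vecs = np.abs(w).reshape(rows, cols // vec_len, vec_len)
    order = np.argsort(vecs, axis=-1)            # ascending magnitude
    mask = np.ones_like(vecs, dtype=bool)
    drop = order[..., : vec_len - keep_per_vec]  # smallest entries per vector
    np.put_along_axis(mask, drop, False, axis=-1)
    return mask.reshape(rows, cols)

def blockwise_mask(w, keep_ratio, blk):
    """Coarse-grained mask: keep the `keep_ratio` fraction of blk x blk
    blocks with the largest mean magnitude."""
    rows, cols = w.shape
    blocks = np.abs(w).reshape(rows // blk, blk, cols // blk, blk)
    scores = blocks.mean(axis=(1, 3))            # one score per block
    k = max(1, int(round(keep_ratio * scores.size)))
    thresh = np.sort(scores, axis=None)[-k]      # k-th largest block score
    keep = scores >= thresh
    return np.repeat(np.repeat(keep, blk, axis=0), blk, axis=1)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))

# Mix the two masks with an AND: a weight survives only if both
# the fine-grained and the coarse-grained criterion retain it.
combined = vectorwise_mask(w, keep_per_vec=2, vec_len=4) \
         & blockwise_mask(w, keep_ratio=0.5, blk=4)
pruned = w * combined
```

Because the AND only removes weights, the combined mask is at least as sparse as either input mask, while the surviving nonzeros still fall inside a small set of dense blocks, which is what makes the pattern GPU-friendly.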