CSCMAC - Cyclic Sparsely Connected Neural Network Manycore Accelerator

Hirenkumar Paneliya1, Morteza Hosseini1, Avesta Sasan2, Houman Homayoun2, Tinoosh Mohsenin1
1University of Maryland Baltimore County, 2George Mason University


Abstract

This paper presents an energy-efficient, domain-specific manycore accelerator, the Cyclic Sparsely Connected Neural Network Manycore Accelerator (CSCMAC), which efficiently maps and executes deep neural networks (DNNs) compressed with cyclic sparsely connected (CSC) architectures. CSC layers structurally compress and sparsify DNNs, reducing the memory footprint of fully connected (FC) layers from O(N^2) to O(N log N), and are well suited to hardware implementation. We implement CSC layers for inference on a manycore unit, exploit their cyclic architecture, and show that they are straightforward to implement in software, even on a parallel-computing processor. To further take advantage of their implementation simplicity, we propose customized instructions for the manycore that fuse frequently used sequences of machine code, and we evaluate the resulting optimization. Our experimental results using LeNet-300-100 on MNIST and a multi-layer perceptron (MLP) on the Physical Activity Monitoring dataset indicate that, by replacing FC layers with CSC layers, we achieve 46x and 6x compression, respectively, within a 2% accuracy loss margin. A 64-cluster architecture of the CSCMAC is fully placed and routed in 65 nm TSMC CMOS technology. The layout of each cluster occupies an area of 0.73 mm^2 and consumes 230.2 mW at a 980 MHz clock frequency. Our proposed CSCMAC achieves 1.48x higher throughput and 1.49x lower energy compared to its predecessor manycore (PENC). It also achieves 85x higher throughput and consumes 66.4x lower energy compared to an implementation on the NVIDIA Jetson TX2 platform.
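
The abstract describes replacing an N x N fully connected layer with a cascade of cyclic sparse sub-layers whose total parameter count scales as O(N log N). The following is a minimal NumPy sketch of one way such a cascade could be structured; the fixed fan-in F, the stride that grows by F per sub-layer, and the ReLU between sub-layers are illustrative assumptions rather than the paper's exact construction.

```python
# Illustrative sketch (not the paper's exact construction): an N x N FC layer is
# replaced by log_F(N) cyclic sparse sub-layers, each with a fixed fan-in F,
# giving N * F * log_F(N) = O(N log N) weights instead of N^2.
import numpy as np

def csc_masks(n, fan_in):
    """Boolean connectivity masks for the log_F(N) sub-layers of a CSC cascade."""
    depth = int(round(np.log(n) / np.log(fan_in)))
    masks, stride = [], 1
    for _ in range(depth):
        mask = np.zeros((n, n), dtype=bool)
        for out in range(n):
            for k in range(fan_in):
                mask[out, (out + k * stride) % n] = True  # cyclic (mod N) connections
        masks.append(mask)
        stride *= fan_in  # widen the connection stride each sub-layer
    return masks

def csc_forward(x, weights, masks):
    """Apply the cascade; each sub-layer uses only its masked (sparse) weights."""
    for w, m in zip(weights, masks):
        x = np.maximum(0.0, (w * m) @ x)  # ReLU between sparse sub-layers (assumed)
    return x

if __name__ == "__main__":
    n, fan_in = 256, 4
    masks = csc_masks(n, fan_in)
    weights = [np.random.randn(n, n) * m for m in masks]
    params = sum(int(m.sum()) for m in masks)
    print(f"dense FC params: {n * n}, CSC params: {params}")  # 65536 vs 4096
    y = csc_forward(np.random.randn(n), weights, masks)
```

In this sketch each sub-layer contributes only N * F nonzero weights, so the cascade stores N * F * log_F(N) parameters in place of the N^2 weights of a dense layer, which is the compression behavior the abstract attributes to CSC layers.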