A Lightweight Error-Resiliency Mechanism for Deep Neural Networks

Brunno F. Goldstein1, Victor C. Ferreira1, Sudarshan Srinivasan2, Dipankar Das2, Alexandre S. Nery3, Sandip Kundu4, Felipe M. G. França1
1Federal University of Rio de Janeiro, 2Intel Labs, 3University of Brasilia, 4University of Massachusetts Amherst


In recent years, Deep Neural Networks (DNNs) have made inroads into a number of applications involving pattern recognition -- from facial recognition to self-driving cars. Some of these applications, such as self-driving cars, have real-time requirements, which specialized DNN hardware accelerators help meet. Since DNN execution time is dominated by convolution, Multiply-and-Accumulate (MAC) units are at the heart of these accelerators. As hardware accelerators push performance limits under strict power constraints, reliability is often compromised. In particular, power-constrained DNN accelerators are more vulnerable to transient and intermittent hardware faults caused by particle strikes, manufacturing variations, and fluctuations in power supply voltage and temperature. Methods such as hardware replication have been used to deal with these reliability problems in the past. Unfortunately, the duplication approach is untenable in a power-constrained environment. This paper introduces a low-cost error-resiliency scheme that targets MAC units employed in conventional DNN accelerators. We evaluate the reliability improvements of the proposed architecture using a set of 6 CNNs over varying bit error rates (BER) and demonstrate that our solution can achieve more than 99% fault coverage with a 5-bit arithmetic code, complying with the ASIL-D level of the ISO 26262 standard at negligible area and power overhead. Additionally, we evaluate the proposed detection mechanism coupled with a word-masking correction scheme, demonstrating no loss of accuracy up to a BER of 10^-2.
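To illustrate the general idea behind low-cost arithmetic-code checking of a MAC unit, the sketch below shows a residue-code check, a common form of arithmetic code in which a narrow redundant computation modulo M runs alongside the full-width MAC and the two results are compared. The modulus M = 2^5 - 1 = 31 is an assumption chosen to match a 5-bit check; it is not taken from the paper, and `fault_injector` is a hypothetical hook used only to simulate a transient fault.

```python
M = 31  # assumed modulus 2^5 - 1, matching a 5-bit residue check


def mac_with_residue_check(acc, a, b, fault_injector=None):
    """Perform acc + a*b with a redundant 5-bit residue check.

    Returns (result, ok): ok is False if the full-width result
    disagrees with the low-width residue computation.
    """
    # Full-width MAC computation (the datapath being protected)
    result = acc + a * b
    if fault_injector is not None:
        # Hypothetical hook: simulate a hardware fault on the result
        result = fault_injector(result)
    # Narrow redundant computation on 5-bit residues only
    residue = (acc % M + (a % M) * (b % M)) % M
    ok = (result % M) == residue
    return result, ok
```

Because flipping a single bit changes the result by ±2^k, and 2^k mod 31 is never zero, any single-bit upset in the accumulated value disagrees with the residue and is flagged; the checker itself needs only 5-bit arithmetic, which is where the low area and power cost comes from.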