Low-Precision Quantization Techniques for Hardware-Implementation-Friendly BERT Models

Xinpei Zhang1, Yi Ding1, Mingfei Yu1, Shin-ichi O'uchi2, Masahiro Fujita3
1The University of Tokyo, 2AIST-UTokyo AI Chip Design Open Innovation Laboratory, 3The University of Tokyo / AIST-UTokyo AI Chip Design Open Innovation Laboratory


Abstract

Transformer-based deep learning models have been widely recognized as highly effective for natural language processing (NLP) tasks, among which BERT (Bidirectional Encoder Representations from Transformers) achieves outstanding performance on popular benchmarks. However, its large number of parameters and heavy computational burden constrain its deployment on hardware platforms with limited computing and memory resources. In this work, with hardware implementation in mind, we introduce and evaluate two quantization techniques, clipping and two-piece-wise quantization, which have previously been applied to convolutional neural network (CNN) models, to quantize this originally heavy model into one requiring a smaller number of bits. Our experimental results reveal that, by applying clipping and piece-wise quantization either independently or jointly, the accuracy of the BERT model can be maintained after quantization to lower bit-widths for both activations and weights, allowing a smaller hardware implementation. Evaluations on four typical NLP tasks show that, with 8-bit integer activations, even when the weights are quantized to 4-bit integers, the loss of performance is less than 4%. We present quantization results for various weight and activation bit-widths, which indicate that quantization from 4 to 8 bits for both weights and activations can be used with good accuracy while all weights in the model are quantized.
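To illustrate the two techniques named above, the following Python sketch shows clipping-based uniform quantization and a simplified two-piece-wise scheme applied to a weight tensor. This is not the authors' implementation: the threshold names alpha and beta, the even split of the codebook between the two regions, and the function names are all illustrative assumptions.

```python
# A minimal sketch (not the paper's implementation) of clipping and
# two-piece-wise quantization under symmetric uniform quantization.
import numpy as np

def clipped_uniform_quantize(w, num_bits=4, alpha=None):
    """Clip values to [-alpha, alpha], then quantize uniformly to signed integers."""
    if alpha is None:
        alpha = np.abs(w).max()          # no clipping: use the full dynamic range
    qmax = 2 ** (num_bits - 1) - 1       # e.g. 7 levels per sign for 4-bit integers
    scale = alpha / qmax
    w_clipped = np.clip(w, -alpha, alpha)
    q = np.round(w_clipped / scale)      # integer codes in [-qmax, qmax]
    return q * scale                     # dequantized values for accuracy evaluation

def two_piece_wise_quantize(w, num_bits=4, beta=0.1):
    """Quantize small- and large-magnitude values with separate scales,
    spending half of the codebook on each region (an assumed split)."""
    small = np.abs(w) <= beta
    out = np.empty_like(w)
    # Region 1: values in [-beta, beta], quantized with a fine scale.
    out[small] = clipped_uniform_quantize(w[small], num_bits - 1, alpha=beta)
    # Region 2: the remaining tail, quantized with a coarser scale.
    out[~small] = clipped_uniform_quantize(w[~small], num_bits - 1,
                                           alpha=np.abs(w).max())
    return out

# Example: quantize a random weight matrix to 4 bits with both schemes.
w = np.random.randn(768, 768).astype(np.float32) * 0.05
w_clip = clipped_uniform_quantize(w, num_bits=4, alpha=0.1)
w_pw = two_piece_wise_quantize(w, num_bits=4, beta=0.02)
```

In this simplified form, clipping trades off range against resolution through the single threshold alpha, while the two-piece-wise scheme assigns a finer scale to the dense region of small weights and a coarser one to the tail, which is the intuition behind combining the two techniques.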