Low-Precision Quantization Techniques for Hardware-Implementation-Friendly BERT Models

Xinpei Zhang1, Yi Ding1, Mingfei Yu1, Shin-ichi O'uchi2, Masahiro Fujita3
1The University of Tokyo, 2AIST-UTokyo AI Chip Design Open Innovation Laboratory, 3The University of Tokyo / AIST-UTokyo AI Chip Design Open Innovation Laboratory


Abstract

Transformer-based deep learning models have been widely recognized as highly effective for natural language processing (NLP) tasks, among which BERT (Bidirectional Encoder Representations from Transformers) achieves outstanding performance on popular benchmarks. However, its large number of parameters and heavy computational burden constrain its deployment on hardware platforms with limited computing and memory resources. In this work, with hardware implementation in mind, we introduce and evaluate two quantization techniques, clipping and two-piece-wise quantization, which have previously been applied to convolutional neural network (CNN) models, to quantize this originally heavy model into one requiring a smaller number of bits. Our experimental results reveal that, by applying clipping and piece-wise quantization either independently or jointly, the accuracy of the BERT model can be maintained after quantization to lower bit-widths for both activations and weights, allowing a smaller hardware implementation. Evaluations on four typical NLP tasks show that, with 8-bit integer activations, even when the weights are quantized to 4-bit integers, the loss of performance is less than 4%. We present quantization results for various weight and activation bit-widths, which indicate that quantization from 4 to 8 bits for both weights and activations can be used with good accuracy while all weights in the model are quantized.
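To illustrate the two techniques named above, the following Python sketch shows clipping-based uniform quantization and a simplified two-piece-wise scheme applied to a weight tensor. This is not the authors' implementation: the threshold names alpha and beta, the even split of the codebook between the two regions, and the function names are all illustrative assumptions.

```python
# A minimal sketch (not the paper's implementation) of clipping and
# two-piece-wise quantization under symmetric uniform quantization.
import numpy as np

def clipped_uniform_quantize(w, num_bits=4, alpha=None):
    """Clip values to [-alpha, alpha], then quantize uniformly to signed integers."""
    if alpha is None:
        alpha = np.abs(w).max()          # no clipping: use the full dynamic range
    qmax = 2 ** (num_bits - 1) - 1       # e.g. 7 levels per sign for 4-bit integers
    scale = alpha / qmax
    w_clipped = np.clip(w, -alpha, alpha)
    q = np.round(w_clipped / scale)      # integer codes in [-qmax, qmax]
    return q * scale                     # dequantized values for accuracy evaluation

def two_piece_wise_quantize(w, num_bits=4, beta=0.1):
    """Quantize small- and large-magnitude values with separate scales,
    spending half of the codebook on each region (an assumed split)."""
    small = np.abs(w) <= beta
    out = np.empty_like(w)
    # Region 1: values in [-beta, beta], quantized with a fine scale.
    out[small] = clipped_uniform_quantize(w[small], num_bits - 1, alpha=beta)
    # Region 2: the remaining tail, quantized with a coarser scale.
    out[~small] = clipped_uniform_quantize(w[~small], num_bits - 1,
                                           alpha=np.abs(w).max())
    return out

# Example: quantize a random weight matrix to 4 bits with both schemes.
w = np.random.randn(768, 768).astype(np.float32) * 0.05
w_clip = clipped_uniform_quantize(w, num_bits=4, alpha=0.1)
w_pw = two_piece_wise_quantize(w, num_bits=4, beta=0.02)
```

In this simplified form, clipping trades off range against resolution through the single threshold alpha, while the two-piece-wise scheme assigns a finer scale to the dense region of small weights and a coarser one to the tail, which is the intuition behind combining the two techniques.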