Success Story

Quantization of ASR models

Customer's motivation for launching the project:
“Smart” devices are becoming increasingly popular: watches, speakers, cameras, refrigerators. The resources on such devices are usually limited, which raises the problem of adapting neural network models to specific hardware: the network architecture has to be simplified or modified to reduce the model size and speed up inference. The constraints can be even stricter: the target device may only support low-bit arithmetic, in which case the model must be quantized. Quantization is an actively developing area, yet few works address quantization of transformer-based models. The project called for an honest, state-of-the-art method for low-bit quantization of a transformer ASR architecture without a significant loss of quality of the final model on LibriSpeech, a publicly available dataset for training and validating ASR models.

Description of the initial situation:
  • Transformer-based ASR model architectures and their implementations are available;
  • There are methods for quantizing neural network models (not necessarily transformers);
  • Quantization of Transformers is a relatively under-researched area;
  • Papers are frequently published that rely on dishonest quantization; verifying that quantization is done fairly is a separate issue that requires attention;
  • Applying quantization naively (“head-on”) greatly degrades the quality of the neural network;
  • Some modules in the model are more sensitive to quantization than others (for example, the Embedding layer and SoftMax); a per-module sensitivity check is sketched after this list.
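
One rough way to quantify this sensitivity is to fake-quantize one weight tensor at a time and record the resulting WER. The sketch below illustrates the idea in PyTorch; fake_quantize, sensitivity_sweep and the evaluate_wer callback are illustrative assumptions, not the project's actual tooling.

import torch
import torch.nn as nn

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Simulate uniform asymmetric quantization: map x onto 2**num_bits integer
    # levels, then map back to float so the rest of the model runs unchanged.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def sensitivity_sweep(model: nn.Module, evaluate_wer, num_bits: int = 8) -> dict:
    # Quantize one weight tensor at a time and measure WER, so the most
    # sensitive modules (e.g. embeddings) can be kept in higher precision.
    results = {}
    for name, param in model.named_parameters():
        if "weight" not in name:
            continue
        original = param.detach().clone()
        with torch.no_grad():
            param.copy_(fake_quantize(param, num_bits))
        results[name] = evaluate_wer(model)  # evaluate_wer is a hypothetical callback
        with torch.no_grad():
            param.copy_(original)  # restore the full-precision weights
    return results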

Project goals:
  • select a strategy for quantizing the transformer-based ASR model, making changes to the network architecture where necessary, so that the quality of the quantized model (measured by WER; a minimal version of the metric is sketched below) is not much worse than that of the full-precision model (a quality degradation of a few percent is acceptable).
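
For reference, WER counts the word-level substitutions, deletions and insertions needed to turn the model's hypothesis into the reference transcript, divided by the number of reference words. The function below is a minimal, self-contained version of the metric; the project itself may well have relied on an existing scorer.

def word_error_rate(reference: str, hypothesis: str) -> float:
    # WER = (substitutions + deletions + insertions) / number of reference words,
    # computed via word-level Levenshtein distance.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution out of four reference words gives WER = 0.25.
print(word_error_rate("the cat sat down", "the cat sat town"))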

MIL Team solution:
Take the available SOTA implementation of the transformer-based ASR architecture and build quantization into it. Successful and fair quantization requires, first, implementing quantized versions of all modules used within the Torch model; second, a convenient tool for replacing the original Torch modules with their quantized counterparts (a simplified sketch of such a tool is shown below); and third, a series of experiments to select the best quantization strategy.
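
The sketch below shows one possible shape of such a replacement tool, assuming simple per-tensor symmetric weight quantization; QuantLinear and replace_modules are hypothetical names, and the actual quantized modules and strategy used in the project are under NDA.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantLinear(nn.Module):
    # Drop-in replacement for nn.Linear that fake-quantizes its weights
    # (per-tensor, symmetric) on every forward pass.
    def __init__(self, linear: nn.Linear, num_bits: int = 8):
        super().__init__()
        self.num_bits = num_bits
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        qmax = 2 ** (self.num_bits - 1) - 1
        scale = self.weight.abs().max().clamp(min=1e-8) / qmax
        w_q = torch.clamp(torch.round(self.weight / scale), -qmax - 1, qmax) * scale
        return F.linear(x, w_q, self.bias)

def replace_modules(model: nn.Module, num_bits: int = 8) -> None:
    # Recursively swap original Torch modules for their quantized counterparts;
    # only nn.Linear is handled here, other module types would follow the same pattern.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, QuantLinear(child, num_bits))
        else:
            replace_modules(child, num_bits)

# Usage (hypothetical): replace_modules(asr_model, num_bits=8) before evaluation or fine-tuning.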

To build the model we used:
  • the ASR transformer architecture implemented in the open Fairseq repository and described in the paper Transformers with convolutional context for ASR;
  • the LibriSpeech dataset for training the speech recognition model, consisting of paired audio recordings and English text transcripts (a minimal loading example follows below).
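
LibriSpeech is available directly through torchaudio; the snippet below is a minimal loading example, in which the local path and the choice of the train-clean-100 subset are assumptions.

import torchaudio

# Download the 100-hour clean training subset into ./data (path is an assumption).
train_set = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True)

# Each item pairs a 16 kHz waveform with its English transcript and speaker metadata.
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = train_set[0]
print(sample_rate, transcript)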

Results: under NDA.
Customer: under NDA.
Technology stack: Python (PyTorch, torchaudio, Fairseq, SentencePiece, DeepLabV3, SRCNN)
Tags: Research, Compression, Audio