Diké
Collaborative project


A research project on identifying and mitigating bias, fairness and ethics issues that arise when compressing NLP models. Funded by the ANR (AAPG 2021, 2022–2025).

Rationale

Natural Language Processing (NLP) is a sub-domain of Artificial Intelligence (AI) that aims to automate the processing of written text. It supports not only a wide range of text analysis applications (classification, sentiment analysis, grammar checking, spam detection, etc.) but also generation tasks such as machine translation, text summarization, conversational agents (chatbots), question answering, etc.

Deep learning is the cornerstone of modern NLP systems. The “transformer” architecture [1] allows parallel training on GPUs (nowadays the de facto standard for training large models), and this was first achieved at scale with the BERT [2] language model. Unlike the previous mainstream “train once, apply once” approach, BERT is generic enough to allow a “train only once, apply many times” approach. The BERT architecture has been refined several times to reach state-of-the-art performance over and over again, and each follow-up architecture was mischievously named after a character from the Muppet Show [3]: ELMo, ROSITA, KERMIT, ERNIE, ALBERT (henceforth referred to as “Muppet models”). The beauty of the Muppet models is that they are not reserved to the GAFAMs: anyone can easily download them and fine-tune them on their own task with as few as 10 lines of open-source code.
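
To illustrate this last point, here is a minimal fine-tuning sketch, assuming the Hugging Face transformers and datasets libraries; the checkpoint name (bert-base-uncased), the SST-2 sentiment task and the hyper-parameters are illustrative choices, not part of the project.

    # Minimal sketch: fine-tune a pre-trained BERT checkpoint on a sentiment task.
    # Assumes the Hugging Face `transformers` and `datasets` libraries; all names are illustrative.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # Tokenize the SST-2 dataset from GLUE, used here purely as an example task.
    data = load_dataset("glue", "sst2").map(
        lambda ex: tokenizer(ex["sentence"], truncation=True, padding="max_length"), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="sst2-bert", num_train_epochs=1),
        train_dataset=data["train"],
        eval_dataset=data["validation"],
    )
    trainer.train()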

The only cloud on the horizon is the size of these models: BERT has 110 million parameters, BERT Large 340 million, and OpenAI’s GPT-3 reaches 175 billion. A segment of research therefore focuses on compressing Muppet models, since deploying them in production is expensive. Three main compression techniques exist:

  • Weight Quantization: each parameter of a model is a floating-point number, usually encoded in 32-bit precision. Quantization compresses these floats into a lower number of bits, or even into integers; ultimately, it can go as far as binarizing the weights [4]. A quantization and pruning sketch follows this list.
  • Model Pruning: a technique that removes the weights identified as least relevant in a trained model. Since training any large model requires a lot of computation [5], the objective of pruning is to retain only a small number of connected neurons, so as to save both space and computation while preserving reasonable accuracy on the task at hand [6, 7].
  • Model Distillation: training a smaller model, or “student model”, to mimic the predictions of a larger pre-trained model, or “teacher model”, so as to ease training [8]. The earliest such models, DistilBERT [9] and DistilRoBERTa, reduce model size by 87% while still retaining 96% of the accuracy. A sketch of the distillation loss follows this list.
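
As a concrete illustration of the first two techniques, the sketch below applies post-training dynamic quantization and magnitude pruning to a BERT checkpoint using stock PyTorch utilities; the checkpoint name, the int8 target and the 50% sparsity level are arbitrary illustrative choices.

    # Sketch: dynamic weight quantization and magnitude pruning of a pre-trained model.
    # Assumes PyTorch and Hugging Face `transformers`; model name and sparsity are illustrative.
    import torch
    import torch.nn.utils.prune as prune
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

    # Weight quantization: store Linear-layer weights as 8-bit integers instead of 32-bit floats.
    quantized_model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)

    # Magnitude pruning: zero out the 50% of weights with the smallest absolute value
    # in every Linear layer of the original model.
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.5)
            prune.remove(module, "weight")  # bake the pruning mask into the weight tensor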
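
The distillation objective itself can be written compactly. The following is a sketch of the loss from Hinton et al. [8]; the temperature T and the mixing weight alpha are illustrative hyper-parameters.

    # Sketch of a knowledge-distillation loss: the student matches the teacher's softened
    # output distribution in addition to the usual cross-entropy on gold labels.
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: KL divergence between temperature-scaled distributions,
        # rescaled by T^2 as in the original formulation.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: standard cross-entropy on the gold labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard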

In that scientific context, the Diké project builds upon the following key observation: all compression techniques focus only on preserving the model’s accuracy on a given task. But there is no such thing as a free lunch: if we remove most of the weights (sometimes 99%!) and the accuracy remains the same, if not higher, we may be paying in model bias, fairness or ethics without knowing it. We study the hypothesis that what is discarded may be related to harmful side effects. If compression leads to more bias and to less fairness and ethics, this is a major issue, as most models are ten lines of code away from any developer [10, 11].
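
To make this observation concrete, the kind of check the project motivates can be sketched as follows; aggregate accuracy alone cannot reveal a gap opening up on a protected group. The predict functions, texts, labels and group annotations below are hypothetical placeholders.

    # Sketch: overall accuracy can match between an original and a compressed model
    # while per-group accuracy diverges. All inputs here are hypothetical placeholders.
    from collections import defaultdict

    def accuracy_by_group(predict, texts, labels, groups):
        correct, total = defaultdict(int), defaultdict(int)
        for text, label, group in zip(texts, labels, groups):
            total[group] += 1
            correct[group] += int(predict(text) == label)
        return {g: correct[g] / total[g] for g in total}

    # Compare per-group scores of the full and compressed models, e.g.:
    # accuracy_by_group(full_model_predict, texts, labels, groups)
    # accuracy_by_group(pruned_model_predict, texts, labels, groups)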

The objectives of the Diké project are to study the effects of model compression in NLP by creating bilingual (English and French) bias, fairness and ethics datasets (WP1), devising evaluation metrics and running evaluation campaigns on compression techniques (WP2), and proposing new neural architectures for less biased, fairer and more ethical compression techniques (WP3).

[1] Ashish Vaswani et al. “Attention is all you need”. In: Proc. of NIPS 2017. 2017, pp. 5998–6008.
[2] Jacob Devlin et al. “BERT: Pre-training of deep bidirectional transformers for language understanding”. In: arXiv preprint arXiv:1810.04805 (2018).
[3] Patrick Xia et al. “Which *BERT? A Survey Organizing Contextualized Encoders”. In: Proc. of EMNLP 2020. ACL, Nov. 2020, pp. 7516–7533.
[4] Christos Louizos et al. “Bayesian compression for deep learning”. In: Advances in Neural Information Processing Systems. 2017, pp. 3288–3298.
[5] Emma Strubell et al. “Energy and Policy Considerations for Deep Learning in NLP”. In: Proc. of ACL 2019. Florence, Italy: ACL, July 2019, pp. 3645–3650.
[6] Jonathan Frankle et al. “The lottery ticket hypothesis: Finding sparse, trainable neural networks”. In: arXiv preprint arXiv:1803.03635 (2018).
[7] Christos Louizos et al. “Learning Sparse Neural Networks through L0 Regularization”. 2018. arXiv: 1712.01312 [stat.ML].
[8] Geoffrey Hinton et al. “Distilling the Knowledge in a Neural Network”. 2015. arXiv: 1503.02531 [stat.ML].
[9] Victor Sanh et al. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”. 2020. arXiv: 1910.01108.
[10] Sara Hooker et al. “Characterising Bias in Compressed Models”. 2020. arXiv: 2010.03058 [cs.LG].
[11] Sara Hooker et al. “What Do Compressed Deep Neural Networks Forget?”. 2019. arXiv: 1911.05248 [cs.LG].