Research project on identifying and mitigating bias, fairness and ethics issues when compressing NLP models. Funded by ANR (AAPG 2021, 2022–2025).
Natural Language Processing (NLP) is a sub-domain of Artificial Intelligence (AI) that aims to automate the processing of written text. It supports not only a wide range of text analysis applications (classification, sentiment analysis, grammar checking, spam detection, etc.) but also generation tasks such as machine translation, text summarization, conversational agents (chatbots), question answering, etc.
Deep learning is the cornerstone of modern NLP systems. The transformer architecture [1] allows parallel training on GPUs – nowadays the de facto standard for training large models – an approach first popularized in NLP by the BERT [2] language model. Unlike the previous mainstream “train once, apply once” approach, BERT is generic enough to allow a “train only once, apply many times” approach. The BERT architecture has been refined several times to reach state-of-the-art performance over and over again; each time a follow-up architecture was proposed, it was mischievously named after a character from the Muppet show [3]: ELMO, ROSITA, KERMIT, ERNIE, ALBERT (henceforth referred to as “Muppet models”). The beauty of the Muppet models is that they are not reserved to the GAFAMs: anyone can easily download them and fine-tune them on their own task with as few as 10 lines of open source code.
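To illustrate how little code is needed, here is a minimal sketch of fine-tuning a pretrained BERT model for text classification with the Hugging Face Transformers library; the dataset and hyperparameters are illustrative assumptions, not part of the project description:

```python
# Minimal sketch: fine-tuning a pretrained BERT model for text classification
# with Hugging Face Transformers; dataset and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # any labelled text dataset works here
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
)
trainer.train()
```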
The only cloud on the horizon is the size of these models: BERT has 110 million parameters, BERT Large 340 million, and OpenAI’s GPT-3 reaches 175 billion. A segment of research therefore focuses on compressing Muppet models, since deploying them in production is expensive, both computationally and environmentally [5]. Three main compression techniques exist: pruning [4, 6, 7], quantization, and knowledge distillation [8, 9].
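As a concrete example of the first family, the sketch below applies unstructured magnitude pruning to the linear layers of a BERT model using PyTorch’s pruning utilities; the 90% sparsity level is an illustrative assumption:

```python
# Minimal sketch of one compression technique: unstructured L1 (magnitude)
# pruning of the linear layers of a pretrained BERT model.
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)  # zero out 90% of the weights
        prune.remove(module, "weight")  # make the pruning permanent (weights stay zeroed)

# Note: the pruned model keeps its dense layout; actual memory/speed gains
# require sparse kernels or structured pruning.
```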
In that scientific context, the Diké project builds upon a key observation: existing compression techniques focus only on preserving the model’s accuracy on a given task. But there is no such thing as a free lunch: if we remove most of the weights (sometimes 99%!) and the accuracy of the model stays the same or even improves, we may be trading away fairness and ethics, or amplifying bias, without knowing it. We study the hypothesis that what compression discards may be related to harmful side effects. If compression leads to more bias, less fairness and weaker ethical behaviour, this is a major issue, as most models are ten lines of code away from any developer [10, 11].
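The kind of check this observation calls for can be sketched as follows: compare a dense and a compressed model not only on overall accuracy but also on per-group performance, where groups correspond to a protected attribute. The helper functions and the `groups` field below are hypothetical placeholders, not a Diké deliverable:

```python
# Minimal sketch: a coarse bias signal that can degrade after compression even
# when overall accuracy is unchanged. The `groups` attribute (e.g. gender or
# dialect labels attached to each example) is a hypothetical placeholder.
from collections import defaultdict

def per_group_accuracy(predictions, labels, groups):
    """Accuracy broken down by a protected attribute."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        correct[group] += int(pred == label)
        total[group] += 1
    return {g: correct[g] / total[g] for g in total}

def fairness_gap(predictions, labels, groups):
    """Largest accuracy difference between any two groups."""
    acc = per_group_accuracy(predictions, labels, groups)
    return max(acc.values()) - min(acc.values())

# Usage idea: compute fairness_gap for the dense and the compressed model on
# the same evaluation set and compare the two gaps alongside overall accuracy.
```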
The objectives of the DIKÉ project are to study the effects of model compression in NLP by creating bilingual (English and French) bias, fairness and ethics datasets (WP1), devising evaluation metrics and running evaluation campaigns on compression techniques (WP2), and proposing new neural architectures for less biased, fairer and more ethical compression techniques (WP3).
[1] Ashish Vaswani et al. “Attention is all you need”. In: Proc. of NIPS 2017. 2017, pp. 5998–6008.
[2] Jacob Devlin et al. “BERT: Pre-training of deep bidirectional transformers for language understanding”. arXiv preprint arXiv:1810.04805 (2018).
[3] Patrick Xia et al. “Which *BERT? A Survey Organizing Contextualized Encoders”. In: Proc. of EMNLP 2020. ACL, Nov. 2020, pp. 7516–7533.
[4] Christos Louizos et al. “Bayesian compression for deep learning”. In: Advances in Neural Information Processing Systems 2017, pp. 3288–3298.
[5] Emma Strubell et al. “Energy and Policy Considerations for Deep Learning in NLP”. In: Proc. of ACL 2019. Florence, Italy: ACL, July 2019, pp. 3645–3650.
[6] Jonathan Frankle et al. “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks”. arXiv:1803.03635 (2018).
[7] Christos Louizos et al. “Learning Sparse Neural Networks through L0 Regularization”. arXiv:1712.01312 (2018).
[8] Geoffrey Hinton et al. “Distilling the Knowledge in a Neural Network”. arXiv:1503.02531 (2015).
[9] Victor Sanh et al. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”. arXiv:1910.01108 (2020).
[10] Sara Hooker et al. “Characterising Bias in Compressed Models”. arXiv:2010.03058 (2020).
[11] Sara Hooker et al. “What Do Compressed Deep Neural Networks Forget?”. arXiv:1911.05248 (2019).