Our paper "When Quantization Affects Confidence of Large Language Models?" is accepted to NAACL 2024
TL;DR
We investigate the impact of LLM compression on three aspects of QA tasks: (i) model confidence, (ii) calibration error, and (iii) predictive entropy.
Link to the paper: https://arxiv.org/abs/2405.00632.
LLMs are widely used across natural language generation applications and have been shown to achieve strong performance in zero- and few-shot prompting, with results on par with fine-tuned baselines, especially on commonsense reasoning tasks. Kaplan et al. (2023) show that emergent abilities come with increased scale, which makes the best-performing larger models less accessible and limits their practical usability. A range of efficient compression and acceleration methods, including quantization, has been developed to alleviate high latency and extensive storage demands. Despite its efficacy as a compression technique, recent work shows that quantization may degrade the original performance and amplify an LLM's sensitivity to certain linguistic phenomena and stereotypes. However, less attention has been paid to explaining the compression loss, particularly its variance across different texts.
Contributions
In this paper, we extend existing research on compression-loss estimation; in particular, we measure the impact of quantization on the confidence of LLMs, which can already be overconfident in both correct and incorrect predictions.
Our contributions are as follows: (1) we investigate how quantization with GPTQ influences the calibration and confidence of LLMs, (2) we assess the confidence alignment between compressed and full-precision LLMs at scale, and (3) we explain the quantization loss from the perspective of the models' initial confidence.
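For readers unfamiliar with the setup, the sketch below shows what 4-bit GPTQ quantization of a causal LM looks like through the Hugging Face transformers integration. The model name and calibration dataset are illustrative assumptions rather than the paper's exact configuration, and the optimum/auto-gptq backend is assumed to be installed.

```python
# Illustrative sketch: 4-bit GPTQ quantization via Hugging Face transformers.
# The model id and calibration dataset are placeholders, not the paper's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM repo works here
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ calibrates on a small dataset; "c4" is one of the built-in options.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
```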
Our null hypothesis is that the compressed and full-precision predictive probability distributions are indistinguishable, since prior work reported only a negligible accuracy drop after quantization. We analyze the relationship between the models by comparing calibration scores, which indicate how accurately a model's confidence reflects the true probabilities, before and after quantization.
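As a reference point, the sketch below shows one common way to compute a calibration score (expected calibration error) and predictive entropy from per-example outputs; it is a generic illustration, not the paper's evaluation code, and the example numbers are made up.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: |accuracy - mean confidence| per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def predictive_entropy(probs):
    """Entropy of a predictive distribution over answer options."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12))

# Hypothetical per-example confidences (max answer probability) and correctness flags.
conf = np.array([0.95, 0.72, 0.61, 0.88, 0.54])
hits = np.array([1, 1, 0, 1, 0])
print("ECE:", expected_calibration_error(conf, hits))
print("entropy of [0.7, 0.2, 0.1]:", predictive_entropy([0.7, 0.2, 0.1]))
```

Running the same computation on the full-precision and the quantized model over identical QA examples then lets us compare calibration before and after quantization.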
Findings
We demonstrate that quantization leads to an increase in calibration error and statistically significant changes in confidence levels for correct predictions.
Through a detailed examination of confidence shifts, we identify that confidence changes occur on examples where the models lack confidence before quantization.
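To make the notion of a statistically significant confidence shift concrete, the sketch below applies a paired non-parametric test (Wilcoxon signed-rank) to hypothetical per-example confidences of the full-precision and quantized model; the test choice and numbers are illustrative assumptions, not necessarily those used in the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-example confidences (max answer probability) for the same
# QA instances before and after quantization.
conf_full = np.array([0.92, 0.88, 0.55, 0.97, 0.61, 0.83, 0.74, 0.66])
conf_quant = np.array([0.90, 0.79, 0.41, 0.96, 0.53, 0.80, 0.70, 0.59])

# Wilcoxon signed-rank test on paired differences: the null hypothesis is that
# the distribution of confidence differences is symmetric around zero.
stat, p_value = wilcoxon(conf_full, conf_quant)
print(f"Wilcoxon statistic={stat:.3f}, p-value={p_value:.4f}")

# Inspect where the shifts concentrate: examples with low initial confidence.
shift = conf_full - conf_quant
low_conf = conf_full < 0.7
print("mean shift (low-confidence examples):", shift[low_conf].mean())
print("mean shift (high-confidence examples):", shift[~low_conf].mean())
```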
Overall, our findings provide insights into quantization loss and suggest a direction for future work: calibrating LLMs, specifically on uncertain examples. For example, future work may integrate model calibration methods, such as temperature scaling (sketched below), into the quantization pipeline. We have also demonstrated that different model families, including LLaMA, Mistral, BLOOM, and OPT, exhibit varying degrees of susceptibility to quantization, as measured by changes in confidence levels. This suggests another direction for future research: benchmarking LLMs by their response to quantization-induced confidence shifts.
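As a pointer for that direction, here is temperature scaling in its simplest form: fitting a single softmax temperature on held-out logits by minimizing negative log-likelihood. This is a generic, self-contained sketch with made-up data and makes no assumption about how it would be wired into a quantization pipeline.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit a single softmax temperature on held-out (logits, labels) by minimizing NLL."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)

    def nll(temperature):
        scaled = logits / temperature
        scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # A bounded 1-D search suffices since there is a single parameter.
    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return result.x

# Example: logits for 4 validation examples over 3 answer options (made up).
val_logits = np.array([[2.5, 0.3, 0.1], [0.2, 1.9, 0.4], [1.1, 1.0, 0.9], [3.0, 0.1, 0.2]])
val_labels = np.array([0, 1, 2, 0])
print("fitted temperature:", fit_temperature(val_logits, val_labels))
```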