Abstract
Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation, rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency, (2) Single-Scale RMSNorm, preventing channel-wise amplification, and (3) a learnable embedding projection, redistributing activation magnitudes. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (versus 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
| Original language | English |
|---|---|
| Title of host publication | Long Papers |
| Editors | Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 12582-12600 |
| Number of pages | 19 |
| ISBN (Electronic) | 9798891762510 |
| DOIs | |
| Publication status | Published - 2025 |
| Event | 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 - Vienna, Austria Duration: 2025 Jul 27 → 2025 Aug 1 |
Publication series
| Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
|---|---|
| Volume | 1 |
| ISSN (Print) | 0736-587X |
Conference
| Conference | 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 |
|---|---|
| Country/Territory | Austria |
| City | Vienna |
| Period | 25/7/27 → 25/8/1 |
Bibliographical note
Publisher Copyright:© 2025 Association for Computational Linguistics.
ASJC Scopus subject areas
- Language and Linguistics
- Linguistics and Language
- Computer Science Applications
Fingerprint
Dive into the research topics of 'Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models'. Together they form a unique fingerprint.Cite this
- APA
- Standard
- Harvard
- Vancouver
- Author
- BIBTEX
- RIS