Hugging Face Introduces mmBERT, a Multilingual Encoder for 1,800+ Languages
Briefly

"Hugging Face has released mmBERT, a new multilingual encoder trained on more than 3 trillion tokens across 1,833 languages. The model builds on the ModernBERT architecture and is the first to significantly improve upon XLM-R, a long-time baseline for multilingual understanding tasks. mmBERT uses a progressive training schedule instead of training on all languages at once. It starts with 60 high-resource languages, expands to 110, and finally includes all 1,833 languages."
"The model reduces its masking ratio from 30% to 5% and adjusts the sampling distribution to represent smaller languages better. This "progressive language addition" approach proved critical for coverage without overfitting. For example, Faroese and Tigrinya - introduced only in the final 100B-token decay phase - still showed substantial performance gains thanks to this strategy. Community members were curious about this balancing act."
"In response, Tom Aarsen, Hugging Face engineer and maintainer of Sentence Transformers, explained: This was checked by evaluating on some of the low-resource languages that are only introduced in the final 100B tokens, such as Tigrinya and Faroese. They observed substantial improvements when these languages were included in the last phase. mmBERT builds on the ModernBERT architecture, inheriting its fast, memory-efficient backbone with Flash Attention 2 and unpadded sequence processing, allowing for 8,192-token contexts."
mmBERT is a multilingual encoder trained on more than 3 trillion tokens spanning 1,833 languages. The model leverages the ModernBERT architecture with Flash Attention 2 and unpadded sequence processing to enable efficient 8,192-token contexts. Training followed a progressive language addition schedule: beginning with 60 high-resource languages, expanding to 110, and finally introducing all 1,833 languages. Over the course of training, the masking ratio was reduced from 30% to 5% and the sampling distribution was adjusted to better represent smaller languages. The progressive schedule prevented overfitting while improving coverage; low-resource languages like Faroese and Tigrinya showed substantial gains when added in the final 100B-token decay phase. The base model has 110M non-embedding parameters, and a 140M-parameter variant is available for lighter workloads.
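For readers who want to try the encoder, the sketch below loads it with the Hugging Face transformers library and runs a single fill-mask prediction. The checkpoint name jhu-clsp/mmBERT-base, the bf16 dtype, and the commented-out flash-attention flag are assumptions to verify against the model card rather than details from the article.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Checkpoint name assumed for illustration; confirm it on the Hugging Face Hub.
model_id = "jhu-clsp/mmBERT-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed; ModernBERT-style backbones are usually run in bf16
    # attn_implementation="flash_attention_2",  # optional, requires flash-attn to be installed
)
model.eval()

# Mask a token and let the encoder fill it in.
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```

The same tokenizer and model classes accept inputs up to the 8,192-token context mentioned above; long sequences are where the unpadded processing and flash attention matter most.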
Read at InfoQ