Kimi's K2 Open-Source Language Model Supports Dynamic Resource Availability and a New Optimizer
Briefly

"The release introduces MuonClip, a new optimizer that builds on the Muon optimizer by adding a QK-clip technique designed to address training instability, which the team reports resulted in "zero loss spike" during pre-training. The model comes in two variants: a base version and K2 Thinking, with the latter claiming state-of-the-art results on benchmarks testing reasoning, coding, and agent capabilities, including 44.9% on Humanity's Last Exam (HLE) with tools, 60.2% on BrowseComp,"
"The team validated MuonClip through a series of scaling experiments. They first trained a mid-scale model with 9 billion activated parameters and 53 billion total parameters using the standard Muon optimizer. The researchers then tested whether QK-Clip affects model performance, finding that MuonClip maintains the optimization characteristics of Muon without negatively impacting the loss trajectory. For the full-scale Kimi K2 model, the team applied MuonClip with a tau value of 100 (τ = 100) and tracked maximum attention logits throughout training."
K2 is a Mixture-of-Experts large language model with 32 billion activated parameters and 1.04 trillion total parameters, trained on 15.5 trillion tokens. MuonClip extends the Muon optimizer with a QK-clip mechanism that addresses training instability and reportedly produced no loss spikes during pre-training. K2 is available in a base variant and K2 Thinking, the latter reporting strong benchmark results on reasoning, coding, and agent tasks. Scaling experiments show that MuonClip preserves Muon's optimization behavior and reduces maximum attention logits without manual intervention. Training used an NVIDIA H800 GPU cluster with NVLink, NVSwitch, and 8×400 Gbps RoCE interconnects.
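As a point of reference for the activated-versus-total parameter figures, the sketch below shows the generic top-k expert routing pattern used in Mixture-of-Experts layers: every expert's weights exist in the checkpoint (total parameters), but each token only passes through the few experts its router selects (activated parameters). The layer sizes, expert count, and top-k value here are placeholders, not K2's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer with illustrative sizes."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router scores every expert, but only
        # the top-k experts per token are actually evaluated.
        scores = self.router(x)                      # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # (tokens, k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```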
Read at InfoQ