No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B recently became the first to break past an average score of 80.
Saturation occurs in one of two ways: models outgrow a benchmark, much as a student moves from middle school to high school, or they overfit to it by memorizing the answers. In either case, we need new benchmarks to assess model capabilities fairly.