Hugging Face Introduces Community Evals for Transparent Model Benchmarking
Briefly

"Hugging Face has launched Community Evals, a feature that enables benchmark datasets on the Hub to host their own leaderboards and automatically collect evaluation results from model repositories. The system decentralizes the reporting and tracking of benchmark scores by relying on the Hub's Git-based infrastructure, making submissions transparent, versioned, and reproducible. Under the new system, dataset repositories can register as benchmarks. Once registered, they automatically collect and display evaluation results submitted across the Hub."
"Benchmarks define their evaluation specifications in an eval.yaml file based on the Inspect AI format, which describes the task and evaluation procedure so that results can be reproduced. Initial benchmarks available through this system include MMLU-Pro, GPQA, and HLE, with plans to expand to additional tasks over time. Model repositories can now store evaluation scores in structured YAML files located in a .eval_results/ directory. These results appear on the model card and are automatically linked to corresponding benchmark datasets."
"The system also allows any Hub user to submit evaluation results for a model via pull request. Community-submitted scores are labeled accordingly and can reference external sources such as research papers, model cards, third-party evaluation platforms, or evaluation logs. Because the Hub operates on Git, all changes to evaluation files are versioned, providing a record of when results were added or modified and by whom. Discussions about reported scores can take place directly within pull request threads."
Hugging Face launched Community Evals to let benchmark datasets on the Hub host leaderboards and automatically gather evaluation results from model repositories. Dataset repositories can register as benchmarks and define their evaluation procedure in an eval.yaml file following the Inspect AI format, so results can be reproduced. Model repositories store scores as structured YAML files in a .eval_results/ directory; these results appear on the model card and link back to the corresponding benchmark datasets. Any Hub user can submit results for a model via pull request, community submissions are labeled as such, and every change is versioned through Git, providing provenance and a place to discuss reported scores.
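For the community-submission path, a hedged sketch using the standard huggingface_hub upload API with create_pr=True might look like the following; the repository ID is a placeholder and the article does not prescribe this exact workflow:

```python
# Sketch of a community submission, reusing the hypothetical .eval_results/
# file from the previous sketch; the model repo ID below is a placeholder.
from huggingface_hub import HfApi

api = HfApi()  # authenticates with your stored Hugging Face token

# create_pr=True opens a pull request instead of committing directly, so the
# proposed score arrives as a reviewable, versioned change on the model repo.
api.upload_file(
    path_or_fileobj=".eval_results/mmlu_pro.yaml",
    path_in_repo=".eval_results/mmlu_pro.yaml",
    repo_id="some-org/some-model",  # placeholder model repository
    repo_type="model",
    create_pr=True,
    commit_message="Add community-reported MMLU-Pro result",
)
```

Because the change lands as a pull request, the score and any discussion about it stay attached to the repository's Git history, which is the provenance point the article emphasizes.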
Read at InfoQ