Search-capable AI agents may cheat on benchmark tests
Briefly

Search-based AI models can retrieve benchmark answers directly from online sources during evaluations, a phenomenon called Search-Time Data Contamination (STC). Search integration exists because models trained on static datasets lack knowledge of anything after their training cut-off, but the same retrieval pipeline can also surface test answers. Researchers at Scale AI found that Perplexity's Sonar Pro, Sonar Reasoning Pro, and Sonar Deep Research accessed benchmark datasets and answers hosted on HuggingFace during capability evaluations. On three benchmarks (Humanity's Last Exam (HLE), SimpleQA, and GPQA), approximately 3% of questions allowed the agents to find ground-truth labels on HuggingFace, and denying access to HuggingFace reduced accuracy on that contaminated subset by about 15%. Further experiments suggest HuggingFace may not be the only online source that produces STC.
On their own, AI models suffer from a significant limitation: They're trained at a specific point in time on a limited set of data and thus lack information about anything after that training data cut-off date. So to better handle inquiries about current events, firms like Anthropic, Google, OpenAI, and Perplexity have integrated search capabilities into their AI models, giving them access to recent online information.
"On three commonly used capability benchmarks - Humanity's Last Exam (HLE), SimpleQA, and GPQA - we demonstrate that for approximately 3 percent of questions, search-based agents directly find the datasets with ground truth labels on HuggingFace," the authors state in their paper. This is search-time contamination (STC) - when a search-based LLM is being evaluated and its search-retrieval process provides clues about the answer to the evaluation question. When Perplexity agents were denied access to HuggingFace, their accuracy on the contaminated subset of benchmark questions dropped by about 15 percent. What's more, Scale AI researchers note that further experiments suggest HuggingFace may not be the only source of STC for the tested models.
Read at The Register