Mathematicians issue a major challenge to AI: show us your work
Briefly

"The effort, called First Proof, is detailed in a preprint that was posted last Thursday. These are brand-new problems that cannot be found in any LLM's [large language model's] training data, says Andrew Sutherland, a mathematician at the Massachusetts Institute of Technology, who was not involved with the new exam. This seems like a much better experiment than any I have seen to date, he adds, referring to the difficulty in testing how well AIs can do math."
"Because mathematical proofs follow a checkable sequence of logical steps, their conclusion is true or false beyond any subjective measure. And that may offer a better way to compare LLMs' prowess than evaluating how convincing their poetry is. Start-ups dedicated to AI for mathematics have recently recruited a number of high-profile mathematicians. These efforts have had some early successes: In 2025 an advanced version of Google's Gemini Deep Think achieved a gold-level score on the International Mathematical Olympiad, an exam for prodigious high schoolers."
First Proof presents a controlled, week-long exam that challenges AI systems with newly created, unsolved pure-mathematics problems relevant to current research. Because the problems are absent from any large language model's training data, the exam offers a rigorous test of genuine mathematical capability. Mathematical proofs consist of fully checkable logical steps that yield objectively true or false conclusions, making them a strong benchmark for AI performance. Recent AI milestones include a Gemini model earning a gold-level International Mathematical Olympiad score, AIs solving multiple Erdős problems, and a start-up solving several research-level questions, though those earlier tests lacked full experimental control.
Read at www.scientificamerican.com