A three-person OpenAI team trained a general-purpose reasoning model using International Mathematical Olympiad problems to develop sustained, hours-long autonomous reasoning and proof writing. The model focused on evaluating ambiguity and nuance, skills positioned as necessary for future real-world tasks and for progress toward artificial general intelligence. The IMO presents six rigorous problems requiring pages-long written proofs that span multiple mathematical fields, making it a stringent test of creativity and stepwise logic. The experimental system operated under the same time and resource constraints as student contestants and aimed to break problems into steps, check work, and adapt strategies.
A few months before the 2025 International Mathematical Olympiad (IMO) in July, a three-person team at OpenAI made a long bet that they could use the competition's brutally tough problems to train an artificial intelligence model to think on its own for hours so that it was capable of writing math proofs. Their goal wasn't simply to create an AI that could do complex math but one that could evaluate ambiguity and nuance, skills AIs will need if they are to someday take on many challenging real-world tasks.
The IMO, held this year on Australia's Sunshine Coast, is the world's premier math competition for high schoolers, bringing together top contenders from more than 100 countries. All are given the same six problems (three per day, each worth seven points) to solve over two days. But these problems are nothing like what you probably remember from high school. Rather than a brief numeric answer, each demands sustained reasoning and creativity in the form of a pages-long written proof.
The OpenAI team of researchers and engineers, Alex Wei, Sheryl Hsu, and Noam Brown, used a general-purpose reasoning model: an AI designed to think through challenging problems by breaking them into steps, checking its own work, and adapting its approach as it goes. Though AI systems couldn't officially compete as participants, the notoriously tough test served as a demonstration of what they can do, and the AIs tackled this year's questions in the same test format and with the same constraints as human participants.