The release of OpenAI’s o3 model has drawn scrutiny after independent tests by Epoch AI revealed significant discrepancies in benchmark results. OpenAI initially claimed o3 could solve over 25% of FrontierMath problems, far outperforming competitors, but Epoch measured a score of around 10%. While OpenAI’s lower-bound figure matched Epoch’s finding, the gap points to problems of testing transparency and methodology, suggesting that OpenAI may have used a more powerful configuration or a different problem subset in its evaluations.
OpenAI's initial claim of over 25% accuracy for the o3 model on FrontierMath is contradicted by independent tests showing roughly 10% accuracy, raising transparency concerns.
Epoch AI suggests the discrepancy may stem from differing test setups: OpenAI may have evaluated a more powerful internal configuration, with a larger test-time compute budget, or used a different subset of the benchmark's problems.
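One plausible mechanism behind such gaps is the evaluation budget itself: a run that allows many sampled attempts per problem will report a higher solve rate than a single-attempt run of the same model. The sketch below is a hypothetical simulation, not a reconstruction of o3's actual settings; the per-problem success probabilities and attempt counts are invented purely to illustrate how the same underlying model can produce very different headline scores.

```python
import random

# Hypothetical illustration only: one simulated "model" scored under two
# evaluation budgets. Nothing here reflects o3's real behavior or settings.
random.seed(0)

NUM_PROBLEMS = 200
# Assumed per-problem success probabilities, skewed toward hard problems.
difficulty = [random.betavariate(0.5, 4.0) for _ in range(NUM_PROBLEMS)]

def solved(p_success: float, attempts: int) -> bool:
    """Return True if any of `attempts` independent tries succeeds."""
    return any(random.random() < p_success for _ in range(attempts))

def score(attempts: int) -> float:
    """Fraction of problems solved when the evaluator allows `attempts` tries each."""
    return sum(solved(p, attempts) for p in difficulty) / NUM_PROBLEMS

# A low-budget run (one attempt per problem) vs. a high-compute run
# (many attempts per problem) for the same simulated model.
print(f"1 attempt per problem:   {score(1):.1%}")
print(f"64 attempts per problem: {score(64):.1%}")
```

Under these invented parameters the single-attempt score lands far below the 64-attempt score, which is why two evaluators testing the "same" model with different compute budgets, or different problem subsets, can honestly report very different numbers.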