OpenAI's o3 AI model scores lower on a benchmark than the company initially implied | TechCrunch
Briefly

The release of OpenAI’s o3 AI model has drawn scrutiny after independent tests by Epoch AI revealed significant discrepancies in benchmark results. OpenAI initially claimed o3 could solve over 25% of FrontierMath problems, vastly outperforming competitors, but Epoch found the model actually scored around 10%. Although OpenAI’s lower-bound figures matched Epoch’s findings, the gap points to potential issues in testing transparency and methodology, suggesting that OpenAI may have used more powerful compute or a different subset of problems for its evaluations.
OpenAI's initial claim of over 25% accuracy for the o3 model on FrontierMath has been contested by independent tests showing only about 10% accuracy, raising transparency concerns.
Epoch AI's testing suggests the discrepancy may stem from differing test setups: OpenAI may have used a more powerful internal configuration, a different subset of test problems, or both.