
"Scientists affiliated with Omni Calculator, a Poland-based maker of online calculators, and with universities in France, Germany, and Poland, devised a math benchmark called ORCA (Omni Research on Calculation in AI), which poses a series of math-oriented natural language questions in a wide variety of technical and scientific fields. Then they put five leading LLMs to the test. ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 all scored a failing grade of 63 percent or less."
"There are various other benchmarks used to assess the math capabilities of AI models, such as GSM8K and MATH-500. If you were to judge by AI models' scores on many of these tests, you might assume machine learning has learned nearly everything, with some models scoring 0.95 or above. But benchmarks, as we've noted, are often designed without much scientific rigor."
"The researchers behind the ORCA (Omni Research on Calculation in AI) Benchmark - Claudia Herambourg, Dawid Siuda, Julia Kopczyńska, Joao R. L. Santos, Wojciech Sas, and Joanna Śmietańska-Nowak - argue that while models like OpenAI's GPT-4 have scored well on tests like GSM8K and MATH, prior research shows LLMs still make errors of logic and arithmetic. According to Oxford University's Our World in Data site, which measures AI models' performance relative to a human baseline score of 0, math reasoning for AI models scores -7.44 (based on April 2024 data)."
Large language models still struggle with reliable calculation and formal mathematical reasoning. Omni Calculator and European university researchers created ORCA, a benchmark of math-oriented natural-language problems spanning technical and scientific domains, and all five leading LLMs tested (ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2) scored 63 percent or less. Common benchmarks such as GSM8K and MATH-500 can yield near-perfect scores (0.95 or above for some models), figures that may reflect weak benchmark design or leakage of test items into training data. Prior research and an Our World in Data metric point to persistent errors in logic and arithmetic, with AI math reasoning scoring -7.44 relative to a human baseline of 0 as of April 2024.
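The excerpts do not spell out how ORCA grades responses, so the sketch below is an assumption rather than the benchmark's actual protocol: it shows one common way numeric-answer benchmarks are scored, by checking a model's extracted number against a reference value within a relative tolerance and reporting the fraction answered correctly. The function names, tolerance, and toy numbers are all illustrative.

import math

# Hypothetical sketch only: the excerpts above do not describe ORCA's grading
# rules. This shows one common scheme for numeric-answer benchmarks: compare
# the model's extracted number to the reference within a relative tolerance,
# then report the fraction of items answered correctly.

def is_correct(model_answer: float, reference: float, rel_tol: float = 1e-3) -> bool:
    # math.isclose uses a relative tolerance, so it works for both large and
    # small magnitudes (e.g., 1250.0 vs. 0.075).
    return math.isclose(model_answer, reference, rel_tol=rel_tol)

def accuracy(predictions: list[float], references: list[float]) -> float:
    # Fraction of items answered correctly; 0.63 would correspond to a 63 percent score.
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return correct / len(references)

if __name__ == "__main__":
    # Toy numbers for illustration only, not ORCA test items.
    preds = [9.81, 1250.0, 0.072]
    refs = [9.81, 1250.0, 0.075]
    print(f"accuracy = {accuracy(preds, refs):.2f}")  # prints accuracy = 0.67

In practice, graders for natural-language math benchmarks also have to extract the answer from free-form text and normalize units, which is where much of the evaluation difficulty lies.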
Read at The Register