
"GDPval measures how models tackle 1,320 tasks associated with 44 occupations -- mostly knowledge work jobs -- across the top nine industries that contribute more than 5% to US gross domestic product (GDP). Using data from the May 2024 US Bureau of Labor Statistics (BLS) and the Department of Labor's O*NET database, OpenAI included some expected professions, like software engineers, lawyers, and video editors, as well as some less commonly touched by AI as of now, including detectives, pharmacists, and social workers."
"OpenAI's new evaluation, GDPval, aims to change that by "measuring how AI performs on real-world, economically valuable tasks," the company said in an announcement Thursday. Companies and third-party testers already use industry benchmarks and other evaluations to determine how capable models are at tasks like coding and math. However, these can lean more academic than would be realistic once models are deployed; GDPval aims to narrow that gap between theory and practice."
The proliferation of AI tools has produced inconsistent productivity gains: many enterprise projects fail, and managers report receiving unsatisfactory AI-generated work that adds to their workload rather than reducing it. A recent MIT report found a 95% failure rate for enterprise AI projects, and evidence suggests AI tends to benefit strong development teams while hindering weaker ones. OpenAI launched GDPval to measure AI performance on economically valuable, real-world tasks. GDPval evaluates 1,320 tasks across 44 occupations in nine industries that each contribute more than 5% to US GDP. The evaluation draws on May 2024 US Bureau of Labor Statistics data and the Department of Labor's O*NET database, and its tasks were created by professionals with an average of 14 years of experience to reflect real work products such as a legal brief, an engineering blueprint, or a customer support conversation.
Read at ZDNET