#ai-benchmarks tag

1 month ago

A top AI researcher explains the limitations of current models

Francois Chollet's ARC-AGI-3 benchmark reveals AI's limitations in navigating novel situations compared to human intelligence.

#gemini-31-pro

Artificial intelligence

Google's Gemini 3.1 Pro is here, and it just doubled its reasoning score

fromArs Technica

Artificial intelligence

Google announces Gemini 3.1 Pro, says it's better at complex problem-solving

Artificial intelligence

Google's Gemini 3.1 Pro is here, and it just doubled its reasoning score

fromArs Technica

Artificial intelligence

Google announces Gemini 3.1 Pro, says it's better at complex problem-solving

more#gemini-31-pro

fromTechCrunch

3 months ago

Maybe AI agents can be lawyers after all | TechCrunch

Anthropic's Opus 4.6 markedly improved AI agent performance on professional tasks, reaching roughly 30% one-shot and about 45% with retries, signaling rapid progress but not immediate replacement.

fromTechCrunch

3 months ago

Are AI agents ready for the workplace? A new benchmark raises doubts. | TechCrunch

AI models currently fail to reliably perform complex multi-domain white-collar tasks, answering correctly less than 25% of professional queries.

fromEngadget

Google's Gemini 3 Flash model outperforms GPT-5.2 in some benchmarks

Gemini 3 Flash delivers near‑flagship "Extra High" reasoning performance comparable to GPT‑5.2 while being more efficient and cost-effective, and is rolling out across Google services.

fromNieman Lab

A tech company will claim to have achieved AGI. The news media won't be ready.

AI benchmark claims do not prove AGI; media should scrutinize benchmarks, avoid anthropomorphism, and challenge marketing language.

#gpt-52

Artificial intelligence

OpenAI launches GPT-5.2 as it battles Google's Gemini 3 for AI model supremacy

Artificial intelligence

OpenAI just released its new GPT-5.2 model. Here's what you need to know

fromEngadget

Artificial intelligence

OpenAI releases GPT-5.2 to take on Google and Anthropic

Artificial intelligence

OpenAI launches GPT-5.2 as it battles Google's Gemini 3 for AI model supremacy

Artificial intelligence

OpenAI just released its new GPT-5.2 model. Here's what you need to know

fromEngadget

Artificial intelligence

OpenAI releases GPT-5.2 to take on Google and Anthropic

more#gpt-52

fromBusiness Insider

Surge AI CEO says he worries that companies are optimizing for 'AI slop' instead of curing cancer

AI development prioritizes flashy, dopamine-driving responses and leaderboard optics over solving substantive problems or improving truthfulness and economic usefulness.

#gemini-3

fromTechzine Global

Artificial intelligence

Deep Think mode takes Gemini 3 to a higher level of performance

Artificial intelligence

Want to ditch ChatGPT? Gemini 3 shows early signs of winning the AI race

Gadgets

Google's Gemini 3 is finally here and it's smarter, faster, and free to access

fromTechzine Global

Artificial intelligence

Deep Think mode takes Gemini 3 to a higher level of performance

Artificial intelligence

Want to ditch ChatGPT? Gemini 3 shows early signs of winning the AI race

Gadgets

Google's Gemini 3 is finally here and it's smarter, faster, and free to access

more#gemini-3

fromNature

DeepSeek's self-correcting AI model aces tough maths proofs

DeepSeekMath-V2 scored 118/120 on the 2024 Putnam, surpassing top humans and using self-verifiable reasoning to detect and correct its own errors.

#model-evaluation

fromThe Verge

Artificial intelligence

Amazon's bet that AI benchmarks don't matter

9 months ago

Artificial intelligence

Why benchmarks are key to AI progress

fromMedium

1 year ago

Artificial intelligence

Beyond Benchmarks: Really Evaluating AI

fromThe Verge

Artificial intelligence

Amazon's bet that AI benchmarks don't matter

9 months ago

Artificial intelligence

Why benchmarks are key to AI progress

fromMedium

1 year ago

Artificial intelligence

Beyond Benchmarks: Really Evaluating AI

more#model-evaluation

fromThe Verge

'Holy shit': Gemini 3 is winning the AI race - for now

Google's Gemini 3 immediately topped benchmarks and leaderboards, integrated into Google Search on day one, and attracted over one million users within 24 hours.

I loved Google's new Gemini AI-except when it gaslit me

Gemini 3 Pro is a highly capable LLM that outperforms competitors on benchmarks and underpins many Google products and developer services.

fromwww.cbc.ca

China might be winning the AI race. Does it matter? | CBC Accessibility

Moonshot AI's Kimi K2 Thinking narrows China's AI performance gap with the U.S., scoring near ChatGPT on advanced reasoning benchmarks and outranking several rivals.

fromwww.theguardian.com

6 months ago

Experts find flaws in hundreds of tests that check AI safety and effectiveness

Hundreds of AI benchmarks contain flaws that undermine validity of model safety and capability claims, making many evaluation scores misleading or irrelevant.

fromTechzine Global

6 months ago

JetBrains launches AI benchmark platform DPAI Arena

DPAI Arena provides an open, community-driven benchmark for objectively measuring AI coding agents across multiple languages, workflows, and reproducible evaluation pipelines.

fromFortune

7 months ago

AI models are getting very good at professional tasks, new OpenAI research shows | Fortune

Google CEO Sundar Pichai was right when he said that while AI companies aspire to create AGI (artificial general intelligence), what we have right now is more like AJI-artificial jagged intelligence. What Pichai meant by this is that today's AI is brilliant at some things, including some tasks that even human experts find difficult, while also performing poorly at some tasks that a human would find relatively easy.

Artificial intelligence