Even some of the best AI can't beat this new benchmark | TechCrunch. A new benchmark named Humanity's Last Exam reveals the limitations of current AI systems on academic questions across multiple disciplines.
Coval evaluates AI voice and chat agents like self-driving cars | TechCrunch. AI voice agents and self-driving cars can be evaluated with similar methods, addressing common challenges in performance measurement.
ZeroShape: The Training Dataset That We Used | HackerNoon. The article describes evaluation methodologies using real-world datasets for testing zero-shot generalization in AI models.
The Morning After: Google accused of using novices to fact-check Gemini's AI answers. Google instructed contract workers to evaluate all prompts regardless of their expertise, skipping only when content is missing or harmful.
Google accused of using novices to fact-check Gemini's AI answers. Google's new guidelines may compromise the accuracy of AI evaluations by requiring workers to rate prompts outside their area of expertise.
GPT is far likelier than other AI models to fabricate quotes by public figures, our analysis shows. Large language models exhibit significant differences in generating responses to prompts, particularly when asked for quotes from public figures.
Gentrace makes it easier for businesses to test AI-powered software. Gentrace offers a platform that simplifies testing for generative AI, fostering collaboration across teams and improving evaluation methods.
Epoch AI Unveils FrontierMath: A New Frontier in Testing AI's Mathematical Reasoning Capabilities. Epoch AI's FrontierMath addresses the inadequacies of existing AI benchmarks by evaluating advanced mathematical reasoning with rigorous, novel problems.
Podcast: Best Practices for Generative AI Production Deployment with Lukas Biewald. Best practices for integrating generative AI into production focus on robust evaluation and performance metrics.
The Role of the Confusion Matrix in Addressing Imbalanced Datasets. Confusion matrices are essential tools for evaluating classification algorithms, especially when dealing with imbalanced datasets.
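As background for that item, a minimal sketch of how a confusion matrix surfaces minority-class errors that overall accuracy hides; the 95/5 class split, synthetic data, and scikit-learn model below are illustrative assumptions, not details from the article.

```python
# Sketch: a confusion matrix on an imbalanced binary problem.
# The 95/5 split, features, and classifier are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Synthetic binary dataset where only ~5% of samples are positive.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))

# Per-class precision/recall shows how the minority class fares,
# even when overall accuracy looks high.
print(classification_report(y_test, y_pred, digits=3))
```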
Human Evaluation of Large Audio-Language Models | HackerNoon. GPT-4's evaluations are highly consistent with human judgments, outperforming GPT-3.5 Turbo.
Holistic Evaluation of Text-to-Image Models: Author contributions, Acknowledgments and References | HackerNoon. The collaboration resulted in a framework to improve the evaluation of AI metrics and scenarios. The project emphasizes the importance of structured AI research approaches.
New Dimensions in Text-to-Image Model Evaluation | HackerNoon. A comprehensive evaluation framework for image generation models is essential to address biases and societal impacts, highlighting the need for holistic assessment beyond traditional benchmarks.
The AI industry is obsessed with Chatbot Arena, but it might not be the best benchmark | TechCrunch. Chatbot Arena has emerged as a crucial platform for evaluating AI models, emphasizing real-world user preferences over traditional benchmarks.
Deriving the DPO Objective Under the Plackett-Luce Model | HackerNoon. The Plackett-Luce model provides a foundation for understanding user preferences in ranking systems.
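As background for that item, a brief sketch of the derivation it refers to, written in the notation of the DPO paper (Rafailov et al., 2023) rather than quoted from the HackerNoon article: the Plackett-Luce probability of a ranking, and the pairwise DPO loss it reduces to in the two-completion (Bradley-Terry) case.

```latex
% Plackett-Luce probability of a ranking \tau over K completions y_1, ..., y_K,
% given a latent reward r(x, y):
P(\tau \mid y_1, \ldots, y_K, x)
  = \prod_{k=1}^{K} \frac{\exp\!\big(r(x, y_{\tau(k)})\big)}
                         {\sum_{j=k}^{K} \exp\!\big(r(x, y_{\tau(j)})\big)}

% Substituting the DPO reparameterization
%   r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
% and specializing to K = 2 recovers the familiar pairwise DPO loss over
% preferred/dispreferred completions (y_w, y_l):
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```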
AI now beats humans at basic tasks - new benchmarks are needed, says major report. AI systems are rapidly advancing and often outperforming humans, rendering many benchmarks obsolete.
Lawmaker set to introduce bill to standardize AI system testing. Sen. John Hickenlooper is sponsoring the "Validation and Evaluation for Trustworthy Artificial Intelligence Act" to ensure accurate testing and safe deployment of AI systems.