#ai-evaluation

Artificial intelligence
from Nature
1 week ago

Panels of peers are needed to gauge AI's trustworthiness - experts are not enough

Expert-only evaluation methods like the 'Sunstein test' risk concentrating judgments of AI trustworthiness among elites, thereby reinforcing the existing power structures that shape AI objectives.
Startup companies
from Fortune
1 week ago

Meet the world's youngest self-made billionaire, who skipped finals to make an empire out of teaching AI 'what only humans know' | Fortune

A startup matched overseas engineers with companies, expanded into human-in-the-loop AI evaluation services, and scaled rapidly to a $10 billion valuation.
from TechCrunch
2 weeks ago

Laude Institute announces first batch of 'Slingshots' AI grants | TechCrunch

On Thursday, the Laude Institute announced its first batch of Slingshots grants, aimed at "advancing the science and practice of artificial intelligence." Designed as an accelerator for researchers, the Slingshots program is meant to provide resources that would be unavailable in most academic settings, whether it's funding, compute power, or product and engineering support. In exchange, the recipients pledge to produce some final work product, whether it's a startup, an open-source codebase, or another type of artifact.
Artificial intelligence
from InfoWorld
2 weeks ago

Databricks adds customizable evaluation tools to boost AI agent accuracy

Agent Bricks' Agent-as-a-Judge, Tunable Judges, and Judge Builder enable enterprises to customize evaluations and align agent behavior with business-specific standards.
from Nature
4 weeks ago

We need a new Turing test to assess AI's real-world knowledge

Some lawyers have learnt that the hard way, and have been fined for filing AI-generated court briefs that misrepresented principles of law and cited non-existent cases. The same is true in other fields. For example, AI models can pass the gold-standard test in finance - the Chartered Financial Analyst exam - yet score poorly on simple tasks required of entry-level financial analysts (see go.nature.com/42tbrgb).
Artificial intelligence
from Geeky Gadgets
4 weeks ago

How to Fix Your AI Prompt Writing: 6 Principles That Actually Work

What if the secret to unlocking AI's full potential wasn't in the technology itself but in how we use it? After spending over 200 hours teaching AI to write, Nate B Jones discovered that the biggest mistakes aren't about algorithms or software limitations; they're about human misunderstanding. Too often, we assume AI can read between the lines of vague instructions or magically produce brilliance without guidance. The result? Generic, uninspired content that misses the mark.
Artificial intelligence
from Nature
1 month ago

AI language models killed the Turing test: do we even need a replacement?

Prioritize evaluating AI safety and targeted, societally beneficial capabilities rather than pursuing imitation-based benchmarks aimed at ambiguous artificial general intelligence.
from Futurism
1 month ago

OpenAI Releases List of Work Tasks It Says ChatGPT Can Already Replace

ChatGPT maker OpenAI has released a new evaluation, dubbed GDPval, to measure how well its AIs perform on "economically valuable, real-world tasks across 44 occupations." "People often speculate about AI's broader impact on society, but the clearest way to understand its potential is by looking at what models are already capable of doing," the company wrote in an accompanying blog post. "Evaluations like GDPval help ground conversations about future AI improvements in evidence rather than guesswork, and can help us track model improvement over time," OpenAI added.
Artificial intelligence
from InfoQ
1 month ago

Google Stax Aims to Make AI Model Evaluation Accessible for Developers

Google Stax provides an objective, data-driven, repeatable framework for AI model evaluation with customizable datasets, default and custom evaluators, and LLM-based judges.
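One component mentioned above, the LLM-based judge, follows a common pattern: a second model scores another model's output against a rubric. A minimal, generic sketch of that pattern (this is illustrative only, not the Stax API; `call_model` is a hypothetical stand-in for any chat-completion client):

```python
# Generic LLM-as-judge pattern. `call_model` is a hypothetical callable
# that takes a prompt string and returns the model's text reply.
def judge(call_model, prompt, response, rubric):
    """Ask a judge model to score `response` to `prompt` against `rubric` (1-5)."""
    instructions = (
        "You are an evaluator. Score the RESPONSE to the PROMPT against the "
        "RUBRIC on a 1-5 scale. Reply with the number only.\n"
        f"RUBRIC: {rubric}\n"
        f"PROMPT: {prompt}\n"
        f"RESPONSE: {response}"
    )
    # Parse the judge's reply into an integer score.
    return int(call_model(instructions).strip())
```

Customizable evaluators in frameworks like this typically vary the rubric, the scale, and the judge model itself, then aggregate scores across a dataset of prompts.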
Artificial intelligence
from ZDNET
2 months ago

OpenAI tested GPT-5, Claude, and Gemini on real-world tasks - the results were surprising

OpenAI's GDPval evaluates AI performance on 1,320 real-world tasks across 44 occupations to measure economic impact and narrow theory-practice gaps.
#generative-ai
from Fortune
2 months ago
Venture

Exclusive: Touring Capital, founded by ex-M12 and SoftBank investors, closes $330 million first fund | Fortune

Artificial intelligence
from Business Insider
2 months ago

Read the deck an ex-Waymo engineer used to raise $3.75 million from Sheryl Sandberg and Kindred Ventures

Scorecard raised $3.75M to build an AI evaluation platform that tests AI agents for performance, safety, and faster deployment for startups and enterprises.
Artificial intelligence
from odsc.medium.com
3 months ago

The ODSC AI West 2025 Preliminary Schedule, Mastering AI Evaluation, Building Real World Agentic Applications, and GPT-5 News.

ODSC AI West 2025 offers training on MLOps, RAG, and AI Agents, featuring over 250 industry experts to enhance practical skills.
Artificial intelligence
from Axios
3 months ago

OpenAI's GPT-5 targets coders

GPT-5 combines a large language model with reasoning capabilities, improving task efficiency and reducing hallucinations, with notable applications in health.
Artificial intelligence
from Medium
6 months ago

Evaluation Mindset: Taming the Gen AI Dragon

Evaluation in AI is a mindset, not a resource issue; it requires ongoing inquiry and critical thinking for successful application deployment.
Artificial intelligence
from Hackernoon
6 months ago

Chameleon AI Shows Competitive Edge Over LLaMa-2 and Other Models | HackerNoon

Chameleon exhibits competitive performance against leading text-only language models, excelling particularly in commonsense reasoning.
The evaluations indicate that Chameleon is capable of outperforming larger models like Llama-2 in specific benchmarks.
from Medium
6 months ago

The problems with running human evals

Result ambiguity can come in different forms. The most common is disagreement among raters, which is measured as Inter-Rater Reliability (IRR).
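A standard IRR statistic for two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch using hypothetical annotator labels:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pass/fail judgments from two annotators on ten model outputs.
a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
print(round(cohens_kappa(a, b), 3))  # → 0.583
```

Values near 1.0 indicate strong agreement; values near 0 mean the raters agree no more often than chance, which is exactly the ambiguity the article describes.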
Artificial intelligence
from InfoWorld
7 months ago

Vector Institute aims to clear up confusion about AI model performance

DeepSeek's models and OpenAI's o1 score strongly on benchmarks, yet AI models still face significant challenges across a wide range of tasks.