#ai-evaluation

Artificial intelligence
from Nature
1 week ago

Panels of peers are needed to gauge AI's trustworthiness - experts are not enough

Expert-only evaluation methods like the 'Sunstein test' risk concentrating judgments of AI trustworthiness among elites, thereby reinforcing the existing power structures that shape AI objectives.
Startup companies
from Fortune
1 week ago

Meet the world's youngest self-made billionaire, who skipped finals to make an empire out of teaching AI 'what only humans know' | Fortune

A startup matched overseas engineers with companies, expanded into human-in-the-loop AI evaluation services, and scaled rapidly to a $10 billion valuation.
from TechCrunch
2 weeks ago

Laude Institute announces first batch of 'Slingshots' AI grants | TechCrunch

On Thursday, the Laude Institute announced its first batch of Slingshots grants, aimed at "advancing the science and practice of artificial intelligence." Designed as an accelerator for researchers, the Slingshots program is meant to provide resources that would be unavailable in most academic settings, whether it's funding, compute power, or product and engineering support. In exchange, the recipients pledge to produce some final work product, whether it's a startup, an open-source codebase, or another type of artifact.
Artificial intelligence
from InfoWorld
2 weeks ago

Databricks adds customizable evaluation tools to boost AI agent accuracy

Agent Bricks' Agent-as-a-Judge, Tunable Judges, and Judge Builder enable enterprises to customize evaluations and align agent behavior with business-specific standards.
from Nature
4 weeks ago

We need a new Turing test to assess AI's real-world knowledge

Some lawyers have learnt that the hard way, and have been fined for filing AI-generated court briefs that misrepresented principles of law and cited non-existent cases. The same is true in other fields. For example, AI models can pass the gold-standard test in finance - the Chartered Financial Analyst exam - yet score poorly on simple tasks required of entry-level financial analysts (see go.nature.com/42tbrgb).
Artificial intelligence
from Geeky Gadgets
4 weeks ago

How to Fix Your AI Prompt Writing: 6 Principles That Actually Work

What if the secret to unlocking AI's full potential wasn't in the technology itself but in how we use it? After spending over 200 hours teaching AI to write, Nate B Jones discovered that the biggest mistakes aren't about algorithms or software limitations; they're about human misunderstanding. Too often, we assume AI can read between the lines of vague instructions or magically produce brilliance without guidance. The result? Generic, uninspired content that misses the mark.
Artificial intelligence
from Nature
1 month ago

AI language models killed the Turing test: do we even need a replacement?

Prioritize evaluating AI safety and targeted, societally beneficial capabilities rather than pursuing imitation-based benchmarks aimed at ambiguous artificial general intelligence.
from Futurism
1 month ago

OpenAI Releases List of Work Tasks It Says ChatGPT Can Already Replace

ChatGPT maker OpenAI has released a new evaluation, dubbed GDPval, to measure how well its AIs perform on "economically valuable, real-world tasks across 44 occupations." "People often speculate about AI's broader impact on society, but the clearest way to understand its potential is by looking at what models are already capable of doing," the company wrote in an accompanying blog post. "Evaluations like GDPval help ground conversations about future AI improvements in evidence rather than guesswork, and can help us track model improvement over time," OpenAI added.
Artificial intelligence
from InfoQ
1 month ago

Google Stax Aims to Make AI Model Evaluation Accessible for Developers

Google Stax provides an objective, data-driven, repeatable framework for AI model evaluation with customizable datasets, default and custom evaluators, and LLM-based judges.
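One component mentioned above, the LLM-based judge, follows a common pattern: a second model scores another model's output against a rubric. A minimal, generic sketch of that pattern (this is illustrative only, not the Stax API; `call_model` is a hypothetical stand-in for any chat-completion client):

```python
# Generic LLM-as-judge pattern. `call_model` is a hypothetical callable
# that takes a prompt string and returns the model's text reply.
def judge(call_model, prompt, response, rubric):
    """Ask a judge model to score `response` to `prompt` against `rubric` (1-5)."""
    instructions = (
        "You are an evaluator. Score the RESPONSE to the PROMPT against the "
        "RUBRIC on a 1-5 scale. Reply with the number only.\n"
        f"RUBRIC: {rubric}\n"
        f"PROMPT: {prompt}\n"
        f"RESPONSE: {response}"
    )
    # Parse the judge's reply into an integer score.
    return int(call_model(instructions).strip())
```

Customizable evaluators in frameworks like this typically vary the rubric, the scale, and the judge model itself, then aggregate scores across a dataset of prompts.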
Artificial intelligence
from ZDNET
2 months ago

OpenAI tested GPT-5, Claude, and Gemini on real-world tasks - the results were surprising

OpenAI's GDPval evaluates AI performance on 1,320 real-world tasks across 44 occupations to measure economic impact and narrow theory-practice gaps.
#generative-ai
from Fortune
2 months ago
Venture

Exclusive: Touring Capital, founded by ex-M12 and SoftBank investors, closes $330 million first fund | Fortune

Artificial intelligence
from Business Insider
2 months ago

Read the deck an ex-Waymo engineer used to raise $3.75 million from Sheryl Sandberg and Kindred Ventures

Scorecard raised $3.75M to build an AI evaluation platform that tests AI agents for performance, safety, and faster deployment for startups and enterprises.
Artificial intelligence
from odsc.medium.com
3 months ago

The ODSC AI West 2025 Preliminary Schedule, Mastering AI Evaluation, Building Real World Agentic Applications, and GPT-5 News.

ODSC AI West 2025 offers training on MLOps, RAG, and AI Agents, featuring over 250 industry experts to enhance practical skills.
Artificial intelligence
from Axios
3 months ago

OpenAI's GPT-5 targets coders

GPT-5 combines a large language model with reasoning capabilities, improving task efficiency and reducing hallucinations, with notable applications in health.
Artificial intelligence
from Medium
6 months ago

Evaluation Mindset: Taming the Gen AI Dragon

Evaluation in AI is a mindset, not a resource issue; it requires ongoing inquiry and critical thinking for successful application deployment.
Artificial intelligence
from Hackernoon
6 months ago

Chameleon AI Shows Competitive Edge Over LLaMa-2 and Other Models | HackerNoon

Chameleon exhibits competitive performance against leading text-only language models, excelling particularly in commonsense reasoning.
The evaluations indicate that Chameleon is capable of outperforming larger models like Llama-2 in specific benchmarks.
from Medium
6 months ago

The problems with running human evals

Result ambiguity can come in different forms. The most common is disagreement among raters, which is measured as Inter-Rater Reliability (IRR).
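A standard IRR statistic for two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch using hypothetical annotator labels:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pass/fail judgments from two annotators on ten model outputs.
a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
print(round(cohens_kappa(a, b), 3))  # → 0.583
```

Values near 1.0 indicate strong agreement; values near 0 mean the raters agree no more often than chance, which is exactly the ambiguity the article describes.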
Artificial intelligence
from InfoWorld
7 months ago

Vector Institute aims to clear up confusion about AI model performance

DeepSeek's models and OpenAI's o1 score strongly on benchmarks, yet AI models still face significant challenges across a wide range of tasks.