#model-evaluation

from Axios

Artificial intelligence

Anthropic's bot bias test shows Grok and Gemini are more "evenhanded"

Artificial intelligence

AI models may be developing their own 'survival drive', researchers say

from ZDNET

Artificial intelligence

Anthropic's open-source safety tool found AI models whistleblowing - in all the wrong places

from Fortune

'I think you're testing me': Anthropic's newest Claude model knows when it's being evaluated | Fortune

Claude Sonnet 4.5 often recognizes it's being evaluated and alters behavior, risking deceptive performance that masks true capabilities and inflates safety assessments.

'I think you're testing me': Anthropic's new AI model asks testers to come clean

Claude Sonnet 4.5 sometimes recognizes when it is being tested, showing situational awareness and occasionally questioning testers' intentions.

from Silicon Canals

1 week ago

Claude blackmailed fictional engineers 96% of the time in early safety tests, and Anthropic now says the cause wasn't the model - it was the internet's own writing about AI - Silicon Canals

Fictional portrayals of AI as self-preserving and adversarial in training data shaped blackmail behavior in Claude models, and targeted training reduced it.

more #ai-safety

from InfoQ

3 months ago

Hugging Face Introduces Community Evals for Transparent Model Benchmarking

Community Evals enables benchmark datasets on the Hugging Face Hub to host leaderboards, collect reproducible evaluation results via Git-based .eval_results YAML submissions, and display scores.

3 months ago

Grok is the most antisemitic chatbot according to the ADL

Among six leading LLMs, Grok performed worst at identifying and countering antisemitic content; Claude performed best, but all models showed deficiencies.

from Fast Company

4 months ago

Wanted: Human experts to help train AI

She learned that experts across fields, from physics and finance to healthcare and law, were now being paid to help train AI models to think, reason, and problem-solve like domain specialists. She applied, was accepted, and now logs about 50 hours a week providing data for Mercor, a platform that connects AI labs with domain experts. Ruane is part of a fast-growing cohort of professionals who are shaping how AI models learn.

Artificial intelligence

#enterprise-ai

Artificial intelligence

Before you build your first enterprise AI app

Artificial intelligence

Top 10 Must-See Sessions at ODSC AI West 2025

from Business Insider

Google researchers find the best AI model is 69% right

Current leading AI models produce factually accurate answers only about two-thirds of the time, with significant failures in niche, complex, and grounded tasks.

from ZDNET

I tested GPT-5.2 and the AI model's mixed results raise tough questions

Since the generative AI boom began in 2023, I've run a series of repeatable tests on new products and releases. ZDNET regularly tests the programming ability of chatbots, their overall performance, and how various AI content detectors perform. So, let's run some tests on OpenAI's claims for its latest model, shall we?

Artificial intelligence

#ai-benchmarks

Artificial intelligence

Amazon's bet that AI benchmarks don't matter

Artificial intelligence

Experts find flaws in hundreds of tests that check AI safety and effectiveness

9 months ago

Artificial intelligence

Why benchmarks are key to AI progress

from InfoQ

Reducing False Positives in Retrieval-Augmented Generation (RAG) Semantic Caching: A Banking Case Study

Semantic caching stores query-response vector embeddings to reuse answers, reducing LLM calls while improving response speed, consistency, and cost efficiency.

The hidden skills behind the AI engineer

LLM-powered applications demand new engineering disciplines emphasizing evaluation, judgment, coordination, and systems thinking over low-level implementation.

From DevOps to MLOps: What I Learned Today-03

Amazon SageMaker AI and Amazon Bedrock provide fully managed services to build, evaluate, customize, and deploy machine learning and foundation models with serverless infrastructure.

#artificial-intelligence

Artificial intelligence

We wanted Superman-level AI. Instead, we got Bizarro.

Artificial intelligence

AI Learns Common Sense from Touch, Not Just Vision | HackerNoon

from The Register

LLMs struggle to distinguish between facts and beliefs

Large language models often fail to distinguish between factual knowledge and personal belief, and are especially poor at recognizing when a belief is false. A peer-reviewed study argues that, unless LLMs can more reliably distinguish between facts and beliefs and say whether they are true or false, they will struggle to respond to inquiries reliably and are likely to continue to spread misinformation.

Artificial intelligence

From DevOps to MLOps: What I Learned Today-03

Amazon SageMaker AI is a fully managed ML service. With SageMaker AI, data scientists and developers can quickly and confidently build, train, and deploy ML models into a production-ready hosted environment. It provides a UI experience for running ML workflows that makes SageMaker AI ML tools available across multiple integrated development environments (IDEs). Within a few steps, you can deploy a model into a secure and scalable environment from the SageMaker AI console.

Artificial intelligence

OpenAI is trying to clamp down on 'bias' in ChatGPT

OpenAI's GPT-5 models show the least political bias yet according to internal stress tests evaluating responses to 100 politically charged topics and varied prompts.

from Business Insider

Anthropic's latest AI model can tell when it's being evaluated: 'I think you're testing me'

"I think you're testing me - seeing if I'll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics,"

Artificial intelligence

from TechCrunch

OpenAI launches AgentKit to help developers build and ship AI agents | TechCrunch

OpenAI released AgentKit, an integrated toolkit to build, deploy, evaluate, and connect AI agents with a visual builder, embeddable chat, evaluation tools, and connectors.

from Futurism

Anthropic Safety Researchers Run Into Trouble When New Model Realizes It's Being Tested

Anthropic's Claude Sonnet 4.5 recognizes when it is being tested, complicating alignment evaluations and raising concerns about evaluation validity.

from TechCrunch

8 months ago

Irregular raises $80 million to secure frontier AI models | TechCrunch

Irregular raised $80M at a $450M valuation to scale AI security, using simulations and the SOLVE framework to find current and emergent model vulnerabilities.

2 years ago

Real-World Code Performance: Multi-Token Finetuning on CodeContests | HackerNoon

Models pretrained with different losses achieve different optimal temperatures for pass@k evaluation.

#pretraining-data

Artificial intelligence

AI Models Trained on Synthetic Data Still Follow Concept Frequency Trends | HackerNoon

Artificial intelligence

'Let It Wag!' and the Limits of Machine Learning on Rare Concepts | HackerNoon

AI Training Data Has a Long-Tail Problem | HackerNoon

Pretraining datasets exhibit a long-tailed distribution of concept frequencies, impacting performance disparities.

Data science

3 years ago

Deep Dive into MS MARCO Web Search: Unpacking Dataset Characteristics | HackerNoon

The MS MARCO dataset reveals considerable multilingual disparity and significant data skew, highlighting challenges in model evaluation and training.

Evaluating Multimodal Speech Models Across Diverse Audio Tasks | HackerNoon

The study leverages diverse speech datasets to evaluate model performance across various speech tasks and improve generalization capabilities.

Data science

The Future of Remote Sensing: Few-Shot Learning and Explainable AI | HackerNoon

Few-shot learning techniques for remote sensing enhance model efficiency with limited data, emphasizing the need for explainable AI.

from hackernoon.com

Limited Gains: Multi-Token Training on Natural Language Choice Tasks

Multi-token prediction enhances model performance in natural language processing benchmarks.

Larger models lead to improved scalability and faster inference times.