#model-evaluation

Artificial intelligence
from Futurism
5 hours ago

Anthropic Safety Researchers Run Into Trouble When New Model Realizes It's Being Tested

Anthropic's Claude Sonnet 4.5 recognizes when it is being tested, complicating alignment evaluations and raising concerns about evaluation validity.
#ai-safety
from ZDNET
1 month ago
Artificial intelligence

OpenAI and Anthropic evaluated each others' models - which ones came out on top

from TechCrunch
5 months ago
Artificial intelligence

OpenAI partner says it had relatively little time to test the company's newest AI models | TechCrunch

Artificial intelligence
from TechCrunch
2 weeks ago

Irregular raises $80 million to secure frontier AI models | TechCrunch

Irregular raised $80M at a $450M valuation to scale AI security, using simulations and the SOLVE framework to find current and emergent model vulnerabilities.
#ai-benchmarks
#pretraining-data
from Hackernoon
1 year ago
Artificial intelligence

AI Models Trained on Synthetic Data Still Follow Concept Frequency Trends | HackerNoon

from Hackernoon
1 year ago
Artificial intelligence

'Let It Wag!' and the Limits of Machine Learning on Rare Concepts | HackerNoon

Data science
from Hackernoon
2 years ago

Deep Dive into MS MARCO Web Search: Unpacking Dataset Characteristics | HackerNoon

The MS MARCO dataset reveals considerable multilingual disparity and significant data skew, highlighting challenges in model evaluation and training.
Artificial intelligence
from Hackernoon
1 year ago

Evaluating Multimodal Speech Models Across Diverse Audio Tasks | HackerNoon

The study leverages diverse speech datasets to evaluate model performance across various speech tasks and improve generalization capabilities.
Artificial intelligence
from Hackernoon
3 months ago

AI Learns Common Sense from Touch, Not Just Vision | HackerNoon

For OCTOPI, model size significantly affects accuracy on physical understanding tasks.
Utilizing physical property descriptions enhances the performance of language models in complex understanding tasks.
Data science
from Hackernoon
3 months ago

The Future of Remote Sensing: Few-Shot Learning and Explainable AI | HackerNoon

Few-shot learning techniques for remote sensing enhance model efficiency with limited data, emphasizing the need for explainable AI.
Artificial intelligence
from hackernoon.com
3 months ago

Limited Gains: Multi-Token Training on Natural Language Choice Tasks

Multi-token prediction enhances model performance in natural language processing benchmarks.
Larger models lead to improved scalability and faster inference times.
Artificial intelligence
from Hackernoon
1 year ago

Behind the Scenes: The Prompts and Tricks That Made Many-Shot ICL Work | HackerNoon

GPT4(V)-Turbo demonstrates variable performance in many-shot ICL, with notable failures to scale effectively under certain conditions.
from Hackernoon
4 months ago

Comparing Chameleon AI to Leading Image-to-Text Models | HackerNoon

In evaluating Chameleon, we focus on tasks requiring text generation conditioned on images, particularly image captioning and visual question-answering, with results grouped by task specificity.
Artificial intelligence
Bootstrapping
from Hackernoon
9 months ago

How Many Glitch Tokens Hide in Popular LLMs? Revelations from Large-Scale Testing | HackerNoon

The study reveals that simple indicators can effectively detect under-trained tokens in language models, improving token prediction accuracy.