#benchmarks

Artificial intelligence
from InfoQ
1 day ago

Google Stax Aims to Make AI Model Evaluation Accessible for Developers

Google Stax provides an objective, data-driven, repeatable framework for AI model evaluation with customizable datasets, default and custom evaluators, and LLM-based judges.
Artificial intelligence
from Fortune
19 hours ago

Anthropic releases Claude 4.5, a model it says can build software and accomplish business tasks autonomously

Claude Sonnet 4.5 runs autonomously for 30 hours and significantly improves coding, benchmark performance, and business-oriented task completion over prior models.
from WIRED
1 day ago

I Benchmarked Qualcomm's New Snapdragon X2 Elite Extreme. Here's What I Learned

It's important to note that this was all tested on the X2 Elite Extreme configuration, which comes with six additional CPU cores over the standard X2 Elite. There were no X2 Elite systems to test, so we don't know what those multi-core scores will be. I've been told that GPU performance will also scale up on the X2 Elite, but we don't yet know how much faster the X2 Elite Extreme is than its sibling.
Silicon Valley
from InfoQ
4 days ago

xAI Releases Grok 4 Fast with Lower Cost Reasoning Model

xAI has introduced Grok 4 Fast, a new reasoning model designed for efficiency and lower cost. The model reduces average thinking tokens by 40% compared with Grok 4, which brings an estimated 98% decrease in cost for equivalent benchmark performance. It maintains a 2-million token context window and a unified architecture that supports both reasoning and non-reasoning use cases. The model also integrates tool-use capabilities such as web browsing and X search.
Artificial intelligence
Mobile UX
from GSMArena.com
1 week ago

MediaTek confirms the Dimensity 9500's launch date and it's very close

MediaTek will unveil the Dimensity 9500 SoC on September 22, one day before Qualcomm's Snapdragon 8 Elite Gen 5 announcement.
from Techzine Global
2 weeks ago

CrowdStrike and Meta launch open source AI benchmarks for SOC

CrowdStrike and Meta are jointly introducing CyberSOCEval, a new suite of open source benchmarks for evaluating the performance of AI systems in security operations. The collaboration aims to help organizations select more effective AI tools for their Security Operations Centers by defining what effective AI looks like for cyber defense. The suite is built on Meta's open source CyberSecEval framework and CrowdStrike's frontline threat intelligence.
Artificial intelligence
from Business Insider
2 weeks ago

OpenAI execs say companies need to do 3 things right to get employees using AI

Top leadership buy-in, cross-functional 'tiger teams', clear bottom-up benchmarks ('evals'), and continuous measurement are essential for successful enterprise AI deployment.
Artificial intelligence
from Real Python
3 weeks ago

Episode #264: Large Language Models on the Edge of the Scaling Laws - The Real Python Podcast

LLM scaling is reaching diminishing returns; benchmarks are often flawed, and developer productivity gains from these models remain modest amid economic hiring shifts.
from Peterbe
3 months ago

Native connection pooling in Django 5 with PostgreSQL - Peterbe.com

Adding 'OPTIONS': {'pool': True} to the DATABASES['default'] config made this endpoint 5.4 times faster.
Django
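The pooling setup the Peterbe article describes can be sketched as a Django settings fragment. This is a minimal illustration, not the article's exact config: native pooling requires Django 5.1+ with psycopg 3 installed with pool support (pip install "psycopg[binary,pool]"), and the database name and credentials below are placeholders.

```python
# settings.py -- minimal sketch of native PostgreSQL connection pooling
# in Django 5.1+. "mydb", "myuser", etc. are placeholder values.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",
        "USER": "myuser",
        "PASSWORD": "mypassword",
        "HOST": "localhost",
        "PORT": "5432",
        "OPTIONS": {
            # Enables psycopg's built-in connection pool with default
            # settings; a dict of pool parameters can be passed instead
            # of True to tune pool size and timeouts.
            "pool": True,
        },
    }
}
```

The speedup comes from reusing already-open connections across requests instead of paying the PostgreSQL connection-setup cost on every request.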