#benchmarks

#gemini-31-pro
from TechCrunch
2 days ago
Artificial intelligence

Google's new Gemini Pro model has record benchmark scores - again | TechCrunch

from The Register
4 days ago

Anthropic's latest Sonnet is better at using computers

The tweaks to Sonnet 4.6 have taken it past the pricier Opus 4.6 in two of 13 benchmark categories: agentic financial analysis (Finance Agent v1.1, 63.3 percent vs. 60.1 percent) and office tasks (GDPVal-AA Elo, 1633 vs. 1606). Opus 4.6 wins in six of the 13 categories, in tests that show rivals Gemini 3 Pro and GPT-5.2 each leading in two of 13 categories. But benchmark tests should not be taken too seriously.
Artificial intelligence
from TechCrunch
4 days ago

Anthropic releases Sonnet 4.6 | TechCrunch

Anthropic has released a new version of its mid-size Sonnet model, keeping pace with the company's four-month update cycle. In a post announcing the new model, Anthropic emphasized improvements in coding, instruction-following, and computer use. Sonnet 4.6 will be the default model for Free and Pro plan users. The beta release of Sonnet 4.6 will include a context window of 1 million tokens, twice the size of the largest window previously available for Sonnet.
Artificial intelligence
Mobile UX
from GSMArena.com
5 days ago

Snapdragon-powered Galaxy S26 Ultra smashes Exynos-powered S26 in single-core performance

The Galaxy S26 Ultra uses the Snapdragon 8 Elite Gen 5 globally, while the S26 and S26+ use the Exynos 2600 in some regions; the Snapdragon leads in single-core performance, and multi-core results are similar.
Artificial intelligence
from Techzine Global
1 week ago

OpenAI swaps Nvidia for Cerebras with GPT-5.3-Codex-Spark

GPT-5.3-Codex-Spark is a Cerebras-optimized, low-latency coding model that generates over 1,000 tokens/sec, fast enough for developers to make small code adjustments in real time.
Artificial intelligence
from ZDNET
1 week ago

I tried vibe coding for free to save $1,200 a year - and it was a total disaster

Free, local AI coding tools failed to replace a paid cloud coding model due to unreliable edits, unexplained regressions, and time-consuming debugging.
Artificial intelligence
from Fast Company
2 weeks ago

OpenAI's GPT-5.3-Codex thinks deeper and wider about coding work

GPT-5.3-Codex extends Codex capabilities to broader work tasks, combining GPT-5.2-Codex coding with GPT-5.2 reasoning and running 25% faster.
#ai
from TechCrunch
3 weeks ago

Tiny startup Arcee AI built a 400B open source LLM from scratch to best Meta's Llama | TechCrunch

But tiny 30-person startup Arcee AI disagrees. The company just released a truly and permanently open (Apache license) general-purpose foundation model called Trinity, and Arcee claims that at 400B parameters, it is among the largest open-source foundation models ever trained and released by a U.S. company. Arcee says Trinity compares to Meta's Llama 4 Maverick 400B and Z.ai GLM-4.5, a high-performing open-source model from China's Tsinghua University, according to benchmark tests conducted using base models (very little post-training).
Artificial intelligence
Python
from InfoWorld
3 weeks ago

CPython vs. PyPy: Which Python runtime has the better JIT?

PyPy remains far faster for raw numerical workloads, but CPython's new native JIT and no-GIL builds close the gap in other workloads and enable threading.
Artificial intelligence
from InfoWorld
3 weeks ago

Alibaba's Qwen3-Max-Thinking expands enterprise AI model choices

Qwen3-Max-Thinking delivers benchmark-level reasoning comparable to top models while adding adaptive tool use and test-time scaling to improve factual accuracy and reasoning.
#gemini-3-flash
from TechCrunch
2 months ago
Artificial intelligence

Google launches Gemini 3 Flash, makes it the default model in the Gemini app | TechCrunch

Artificial intelligence
from The Register
2 months ago

IBM CUGA stalks enterprises looking to deploy AI agents

CUGA automates complex enterprise workflows but completes many tasks only about half the time, exposing performance limits and governance concerns.
from Techzine Global
2 months ago

Google enhances Gemini Deep Research with Interactions API

Google has released a new version of Gemini Deep Research. This is an agent designed to automate complex research tasks. The agent runs on Gemini 3 Pro. The model can process handwriting, graphs, and mathematical notation. It incorporates this visual information directly into reports and search queries. As a result, the system can not only search textual sources, but also retrieve data that was previously difficult to automate, according to SiliconANGLE.
Artificial intelligence
#gpt-52
from ZDNET
2 months ago

Does the new Flux.2 beat Nano Banana Pro? You can try it for yourself - for free

Some specific improvements of the model include support for up to 10 reference images, meaning you can incorporate a lot more elements from different pictures in your final product; improved photorealism and detail; more accurate text rendering, a task image generating models frequently struggle with; better prompt following; and a better understanding of real-world knowledge, according to Black Forest Labs.
Artificial intelligence
#enterprise-ai
from InfoWorld
2 months ago
Artificial intelligence

Anthropic's Claude Opus 4.5 pricing cut signals a shift in the enterprise AI market

#gemini-3-pro
#gemini-3
from Fortune
3 months ago
Artificial intelligence

Gemini 3 and Antigravity, explained: Why Google's latest AI releases are a big deal | Fortune

Miscellaneous
from Independent
3 months ago

Government waves white flag on its housing targets with launch of its new strategy

Government abandons annual housing benchmarks due to construction slowdown; Taoiseach insists major investment in housing will succeed.
from Ars Technica
3 months ago

OpenAI walks a tricky tightrope with GPT-5.1's eight new personalities

On Wednesday, OpenAI released GPT-5.1 Instant and GPT-5.1 Thinking, two updated versions of its flagship AI models now available in ChatGPT. The company is wrapping the models in the language of anthropomorphism, claiming that they're warmer, more conversational, and better at following instructions. The release follows complaints earlier this year that its previous models were excessively cheerful and sycophantic, along with an opposing controversy among users over how OpenAI modified the default GPT-5 output style after several suicide lawsuits.
Artificial intelligence
from Futurism
3 months ago

Researchers "Embodied" an LLM Into a Robot Vacuum and It Suffered an Existential Crisis Thinking About Its Role in the World

The "Butter-Bench" test, as detailed in a yet-to-be-peer-reviewed paper, is a "benchmark that evaluates practical intelligence in embodied LLM." In the test, the robot had to navigate to an office kitchen, have butter be placed on a tray attached to its back, confirm the pickup, deliver it to a marked location, and finally return to its charging dock. The results of the Butter-Bench experiment, the researchers conceded, were dubious.
Artificial intelligence
#ai-evaluation
from InfoQ
4 months ago
Artificial intelligence

Google Stax Aims to Make AI Model Evaluation Accessible for Developers

from Buffer: All-you-need social media toolkit for small businesses
5 months ago

What Is a Good Facebook Engagement Rate? Data From 52 Million+ Posts

One of the most common questions creators and brands ask: "Is my engagement rate good?" The answer depends on your follower count. A 5% engagement rate looks very different for a neighborhood café with 500 fans than for a news publisher with half a million. That's why we analyzed 52 million Facebook posts across 213,000 accounts with over 6.9 billion engagements collectively, to see how engagement rates shift by follower tier.
Online marketing
Artificial intelligence
from ZDNET
4 months ago

Even the best AI agents are thwarted by this protocol - what can be done

Even top AI models struggle to use Model Context Protocol, requiring many interaction rounds and MCP-specific training to handle complex multi-server tasks.
Artificial intelligence
from InfoQ
4 months ago

Claude Sonnet 4.5 Tops SWE-Bench Verified, Extends Coding Focus Beyond 30 Hours

Claude Sonnet 4.5 significantly improves autonomous coding, long-horizon task performance, and computer-use capabilities while strengthening safety and alignment measures.
Artificial intelligence
from The Register
4 months ago

Microsoft adds Copilot adoption benchmarks to Viva Insights

Microsoft added Copilot adoption benchmarks to Viva Insights, enabling managers to compare active Copilot usage across cohorts, roles, regions, and other companies.
Artificial intelligence
from InfoQ
4 months ago

Google DeepMind Launches Gemini 2.5 Computer Use Model to Power UI-Controlling AI Agents

Gemini 2.5 Computer Use enables AI agents to perceive and manipulate graphical user interfaces—clicking, typing, scrolling—via a looped screenshot-and-action API, showing strong benchmark performance.
Artificial intelligence
from Fortune
4 months ago

Anthropic releases Claude 4.5, a model it says can build software and accomplish business tasks autonomously | Fortune

Claude Sonnet 4.5 runs autonomously for 30 hours and significantly improves coding, benchmark performance, and business-oriented task completion over prior models.
from WIRED
4 months ago

I Benchmarked Qualcomm's New Snapdragon X2 Elite Extreme. Here's What I Learned

It's important to note that this was all tested on the X2 Elite Extreme configuration, which comes with six additional CPU cores over the standard X2 Elite. There were no X2 Elite systems to test, so we don't know what those multi-core scores will be. I've been told that GPU performance will also scale up on the X2 Elite, but we don't yet know how much faster the X2 Elite Extreme is over its sibling.
Silicon Valley
from InfoQ
4 months ago

xAI Releases Grok 4 Fast with Lower Cost Reasoning Model

xAI has introduced Grok 4 Fast, a new reasoning model designed for efficiency and lower cost. The model reduces average thinking tokens by 40% compared with Grok 4, which brings an estimated 98% decrease in cost for equivalent benchmark performance. It maintains a 2-million token context window and a unified architecture that supports both reasoning and non-reasoning use cases. The model also integrates tool-use capabilities such as web browsing and X search.
Artificial intelligence
Mobile UX
from GSMArena.com
5 months ago

MediaTek confirms the Dimensity 9500's launch date and it's very close

MediaTek will unveil the Dimensity 9500 SoC on September 22, one day before Qualcomm's Snapdragon 8 Elite Gen 5 announcement.
from Techzine Global
5 months ago

CrowdStrike and Meta launch open source AI benchmarks for SOC

CrowdStrike and Meta are jointly introducing CyberSOCEval, a new suite of open source benchmarks to evaluate the performance of AI systems in security operations. The collaboration aims to help organizations select more effective AI tools for their Security Operations Center. Meta and CrowdStrike are addressing a growing challenge by introducing CyberSOCEval, a suite of benchmarks that help define what effective AI looks like for cyber defense. The system is built on Meta's open source CyberSecEval framework and CrowdStrike's frontline threat intelligence.
Artificial intelligence
from Real Python
5 months ago

Episode #264: Large Language Models on the Edge of the Scaling Laws - The Real Python Podcast

LLM scaling is reaching diminishing returns; benchmarks are often flawed, and developer productivity gains from these models remain modest amid economic hiring shifts.
from Peterbe
7 months ago

Native connection pooling in Django 5 with PostgreSQL - Peterbe.com

Adding 'OPTIONS': {'pool': True}, to the DATABASES['default'] config made this endpoint 5.4 times faster.
Django
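
For readers who want to see where the `'pool': True` option from the Peterbe.com item sits, here is a minimal sketch of a `settings.py` fragment. It assumes Django 5.1 or later with psycopg 3 and its pool extra installed; the database name, user, host, and port are placeholders, not values from the article.

```python
# Hypothetical settings.py fragment; NAME, USER, HOST, and PORT are
# illustrative placeholders, not values taken from the article.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",
        "USER": "myuser",
        "HOST": "localhost",
        "PORT": "5432",
        "OPTIONS": {
            # True asks Django to use a psycopg connection pool with its
            # default settings (assumes the psycopg pool extra is installed).
            "pool": True,
        },
    }
}
```

Any speedup depends on the workload; the 5.4x figure is the article's own measurement for one endpoint.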