#benchmarks

3 weeks ago

Anthropic releases Sonnet 4.6 | TechCrunch

Anthropic has released a new version of its mid-size Sonnet model, keeping pace with the company's four-month update cycle. In a post announcing the new model, Anthropic emphasized improvements in coding, instruction-following, and computer use. Sonnet 4.6 will be the default model for Free and Pro plan users. The beta release of Sonnet 4.6 will include a context window of 1 million tokens, twice the size of the largest window previously available for Sonnet.

Artificial intelligence

Mobile UX

from GSMArena.com

3 weeks ago

Snapdragon-powered Galaxy S26 Ultra smashes Exynos-powered S26 in single-core performance

The Galaxy S26 Ultra uses the Snapdragon 8 Elite Gen 5 globally, while the S26 and S26+ use the Exynos 2600 in some regions; the Snapdragon leads in single-core performance, and multi-core results are similar.

4 weeks ago

OpenAI swaps Nvidia for Cerebras with GPT-5.3-Codex-Spark

GPT-5.3-Codex-Spark is a Cerebras-optimized, low-latency coding model that generates over 1,000 tokens per second, fast enough for small, real-time code adjustments.

from ZDNET

I tried vibe coding for free to save $1,200 a year - and it was a total disaster

Free, local AI coding tools failed to replace a paid cloud coding model due to unreliable edits, unexplained regressions, and time-consuming debugging.

OpenAI's GPT-5.3-Codex thinks deeper and wider about coding work

GPT-5.3-Codex extends Codex capabilities to broader work tasks, combining GPT-5.2-Codex coding with GPT-5.2 reasoning and running 25% faster.

#ai

Artificial intelligence

Anthropic improves AI coding with Claude Opus 4.6

Artificial intelligence

DreamWorks CEO turned VC Jeff Katzenberg says that AI is not going to be a 'zero-sum game'

Tiny startup Arcee AI built a 400B open source LLM from scratch to best Meta's Llama | TechCrunch

But tiny 30-person startup Arcee AI disagrees. The company just released a truly and permanently open (Apache license) general-purpose foundation model called Trinity, and Arcee claims that at 400B parameters, it is among the largest open-source foundation models ever trained and released by a U.S. company. Arcee says Trinity compares to Meta's Llama 4 Maverick 400B and Z.ai's GLM-4.5, a high-performing open-source model from China's Tsinghua University, according to benchmark tests conducted using base models (very little post-training).

Artificial intelligence

Python

CPython vs. PyPy: Which Python runtime has the better JIT?

PyPy remains far faster for raw numerical workloads, but CPython's new native JIT and no-GIL builds close the gap in other workloads and enable threading.

Alibaba's Qwen3-Max-Thinking expands enterprise AI model choices

Qwen3-Max-Thinking posts reasoning benchmark results comparable to top models while adding adaptive tool use and test-time scaling to improve factual accuracy and reasoning.

#gemini-3-flash

Artificial intelligence

Google releases Gemini 3 Flash, promising improved intelligence and efficiency

Artificial intelligence

Google launches Gemini 3 Flash, makes it the default model in the Gemini app | TechCrunch

from The Register

IBM CUGA stalks enterprises looking to deploy AI agents

CUGA automates complex enterprise workflows but completes many tasks only about half the time, exposing performance limits and governance concerns.

Google enhances Gemini Deep Research with Interactions API

Google has released a new version of Gemini Deep Research. This is an agent designed to automate complex research tasks. The agent runs on Gemini 3 Pro. The model can process handwriting, graphs, and mathematical notation. It incorporates this visual information directly into reports and search queries. As a result, the system can not only search textual sources, but also retrieve data that was previously difficult to automate, according to SiliconANGLE.

Artificial intelligence

#gpt-52

Software development

GPT-5.2 launched, OpenAI's answer to Gemini 3 Pro

Artificial intelligence

OpenAI is clapping back at Google's Gemini 3 with a new GPT-5.2

Artificial intelligence

OpenAI releases GPT-5.2 after "code red" Google threat alert

from ZDNET

Does the new Flux.2 beat Nano Banana Pro? You can try it for yourself - for free

Some specific improvements of the model include support for up to 10 reference images, meaning you can incorporate a lot more elements from different pictures in your final product; improved photorealism and detail; more accurate text rendering, a task image generating models frequently struggle with; better prompt following; and a better understanding of real-world knowledge, according to Black Forest Labs.

Artificial intelligence

#enterprise-ai

from Computerworld

Software development

Anthropic's Claude Opus 4.5 pricing cut signals a shift in the enterprise AI market

Artificial intelligence

OpenAI execs say companies need to do 3 things right to get employees using AI

Artificial intelligence

Google's Gemini 3 Pro: Better than mid

Artificial intelligence

Gemini 3 may be the moment Google pulls away in the AI arms race

Artificial intelligence

Gemini 3 and Antigravity, explained: Why Google's latest AI releases are a big deal | Fortune

from Developer Tech News

Artificial intelligence

Gemini 3: Google enables new agentic AI workflows for developers

Government waves white flag on its housing targets with launch of its new strategy

Government abandons annual housing benchmarks due to construction slowdown; Taoiseach insists major investment in housing will succeed.

OpenAI walks a tricky tightrope with GPT-5.1's eight new personalities

On Wednesday, OpenAI released GPT-5.1 Instant and GPT-5.1 Thinking, two updated versions of its flagship AI models now available in ChatGPT. The company is wrapping the models in the language of anthropomorphism, claiming that they're warmer, more conversational, and better at following instructions. The release follows complaints earlier this year that its previous models were excessively cheerful and sycophantic, along with an opposing controversy among users over how OpenAI modified the default GPT-5 output style after several suicide lawsuits.

Artificial intelligence

from Futurism

Researchers "Embodied" an LLM Into a Robot Vacuum and It Suffered an Existential Crisis Thinking About Its Role in the World

The "Butter-Bench" test, as detailed in a yet-to-be-peer-reviewed paper, is a "benchmark that evaluates practical intelligence in embodied LLM." In the test, the robot had to navigate to an office kitchen, have butter be placed on a tray attached to its back, confirm the pickup, deliver it to a marked location, and finally return to its charging dock. The results of the Butter-Bench experiment, the researchers conceded, were dubious.

Artificial intelligence

#ai-evaluation

Artificial intelligence

Laude Institute announces first batch of 'Slingshots' AI grants | TechCrunch

Artificial intelligence

Google Stax Aims to Make AI Model Evaluation Accessible for Developers

from Buffer: All-you-need social media toolkit for small businesses

What Is a Good Facebook Engagement Rate? Data From 52 Million+ Posts

One of the most common questions creators and brands ask: "Is my engagement rate good?" The answer depends on your follower count. A 5% engagement rate looks very different for a neighborhood café with 500 fans than for a news publisher with half a million. That's why we analyzed 52 million Facebook posts across 213,000 accounts with over 6.9 billion engagements collectively, to see how engagement rates shift by follower tier.

Online marketing

from Computerworld

Anthropic releases new version of its smaller Haiku model

Haiku 4.5 matches Sonnet 4 performance at one-third the cost and more than twice the speed, enabling parallel low-resource agents.

from ZDNET

Even the best AI agents are thwarted by this protocol - what can be done

Even top AI models struggle to use Model Context Protocol, requiring many interaction rounds and MCP-specific training to handle complex multi-server tasks.

Claude Sonnet 4.5 Tops SWE-Bench Verified, Extends Coding Focus Beyond 30 Hours

Claude Sonnet 4.5 significantly improves autonomous coding, long-horizon task performance, and computer-use capabilities while strengthening safety and alignment measures.

from The Register

Microsoft adds Copilot adoption benchmarks to Viva Insights

Microsoft added Copilot adoption benchmarks to Viva Insights, enabling managers to compare active Copilot usage across cohorts, roles, regions, and other companies.

Google DeepMind Launches Gemini 2.5 Computer Use Model to Power UI-Controlling AI Agents

Gemini 2.5 Computer Use enables AI agents to perceive and manipulate graphical user interfaces—clicking, typing, scrolling—via a looped screenshot-and-action API, showing strong benchmark performance.

from Fortune

Anthropic releases Claude 4.5, a model it says can build software and accomplish business tasks autonomously | Fortune

Claude Sonnet 4.5 runs autonomously for 30 hours and significantly improves coding, benchmark performance, and business-oriented task completion over prior models.

from WIRED

I Benchmarked Qualcomm's New Snapdragon X2 Elite Extreme. Here's What I Learned

It's important to note that this was all tested on the X2 Elite Extreme configuration, which comes with six additional CPU cores over the standard X2 Elite. There were no X2 Elite systems to test, so we don't know what those multi-core scores will be. I've been told that GPU performance will also scale up on the X2 Elite, but we don't yet know how much faster the X2 Elite Extreme is over its sibling.

Silicon Valley

xAI Releases Grok 4 Fast with Lower Cost Reasoning Model

xAI has introduced Grok 4 Fast, a new reasoning model designed for efficiency and lower cost. The model reduces average thinking tokens by 40% compared with Grok 4, which brings an estimated 98% decrease in cost for equivalent benchmark performance. It maintains a 2-million token context window and a unified architecture that supports both reasoning and non-reasoning use cases. The model also integrates tool-use capabilities such as web browsing and X search.
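As a rough sanity check on those two figures, the sketch below works out what per-token price change would be needed for a 40% token reduction to yield a 98% cost reduction. It assumes, as a simplification not stated in the summary, that cost is simply tokens multiplied by a per-token price; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the Grok 4 Fast figures quoted above.
# Assumption (not from the source): cost = tokens * price_per_token.

token_reduction = 0.40  # 40% fewer thinking tokens than Grok 4
cost_reduction = 0.98   # 98% lower cost at equivalent benchmark performance

token_ratio = 1 - token_reduction  # fraction of tokens still used
cost_ratio = 1 - cost_reduction    # fraction of cost still paid

# If cost = tokens * price, the per-token price ratio is cost_ratio / token_ratio.
implied_price_ratio = cost_ratio / token_ratio

print(f"Tokens used relative to Grok 4: {token_ratio:.0%}")
print(f"Cost relative to Grok 4: {cost_ratio:.0%}")
print(f"Implied per-token price relative to Grok 4: {implied_price_ratio:.1%}")
```

Under that assumption, the implied per-token price is about 3.3% of the original, so most of the claimed saving would have to come from cheaper tokens rather than from the token reduction alone.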

Artificial intelligence

Mobile UX

from GSMArena.com

MediaTek confirms the Dimensity 9500's launch date and it's very close

MediaTek will unveil the Dimensity 9500 SoC on September 22, one day before Qualcomm's Snapdragon 8 Elite Gen 5 announcement.

CrowdStrike and Meta launch open source AI benchmarks for SOC

CrowdStrike and Meta are jointly introducing CyberSOCEval, a new suite of open source benchmarks to evaluate the performance of AI systems in security operations. The collaboration aims to help organizations select more effective AI tools for their Security Operations Center (SOC) by defining what effective AI looks like for cyber defense. The suite is built on Meta's open source CyberSecEval framework and CrowdStrike's frontline threat intelligence.

Artificial intelligence

from Real Python

Episode #264: Large Language Models on the Edge of the Scaling Laws - The Real Python Podcast

LLM scaling is reaching diminishing returns; benchmarks are often flawed, and developer productivity gains from these models remain modest amid economic hiring shifts.

Social media marketing

from Buffer: All-you-need social media toolkit for small businesses

What Is A Good Instagram Engagement Rate? Data from 27 Million+ Instagram Posts

Engagement rates decline as follower count increases; benchmarks show expected engagement across follower tiers from under 1,000 to over 1,000,000.

Mobile UX

from GSMArena.com

vivo X300 runs Geekbench, confirms its chipset

The Vivo X300 will launch with MediaTek's Dimensity 9500 chipset, 16GB RAM, and Android 16.

Software development

8 months ago

Microsoft entices Windows 10 users with performance gains

Windows 11 shows 2.3 times faster performance than Windows 10 according to Microsoft benchmarks.

from Peterbe

8 months ago

Native connection pooling in Django 5 with PostgreSQL - Peterbe.com

Adding 'OPTIONS': {'pool': True} to the DATABASES['default'] config made this endpoint 5.4 times faster.
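For context, a minimal sketch of where that option sits in a Django settings file. The database name, user, host, and port below are placeholders, not values from the article. As far as the Django documentation describes it, the built-in pool requires Django 5.1 or later and psycopg 3 installed with its pool extra.

```python
# settings.py (sketch). NAME, USER, HOST, and PORT are hypothetical placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",
        "USER": "myuser",
        "HOST": "localhost",
        "PORT": "5432",
        "OPTIONS": {
            # Enables the native connection pool with default pool settings.
            "pool": True,
        },
    }
}
```

The size of any speedup will depend on how much of a request's time is spent opening database connections, so the 5.4x figure should be read as specific to that endpoint.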

Django