Anthropic's latest Sonnet is better at using computers

"The tweaks to Sonnet 4.6 have taken it past the pricier Opus 4.6 in two of 13 benchmark categories: agentic financial analysis (Finance Agent v1.1, 63.3 percent vs. 60.1 percent) and office tasks (GDPVal-AA Elo, 1633 vs. 1606). Opus 4.6 wins in six of the 13 categories, in tests that show rival Gemini 3 Pro and GPT-5.2 each leading in 2 of 13 categories. But benchmark tests should not be taken too seriously."

"Sonnet 4.6 defaults to a context window of 200K, like Opus 4.6 and Haiku 4.5 - that's the amount of material (tokens) that the model can process. But Opus 4.6, Sonnet 4.6, Sonnet 4.5, and Sonnet 4 all offer a 1M token context window for those involved in beta testing - usage tier four and organizations with custom rate limits."

Anthropic has updated Sonnet to version 4.6, claiming improved coding ability, better automation of computer tasks, and stronger reasoning and planning. The release follows a recent Opus version bump. Sonnet 4.6 outperforms the pricier Opus 4.6 in two of 13 benchmark categories: agentic financial analysis (Finance Agent v1.1, 63.3% vs. 60.1%) and office tasks (GDPVal-AA Elo, 1633 vs. 1606). Opus 4.6 wins six categories, while Gemini 3 Pro and GPT-5.2 each lead in two, though benchmark results should not be taken too seriously. Sonnet 4.6 defaults to a 200K-token context window, with a 1M-token window available in beta to usage tier four and to organizations with custom rate limits. The model scored 72.5 on OSWorld-Verified, up from Sonnet 3.7's 28.0, and shows improved resistance to prompt injection without increased malicious-use risk.

#model-update #benchmarks #context-window #automation #safety

Read at The Register
