
"If you have been closely tracking Anthropic's releases, you may recall that in May, Claude Opus 4 and Sonnet 4 scored highest amongst frontier models on the industry-standard software engineering benchmark test (SWE-bench), which evaluates LLMs' abilities to solve real-world software engineering tasks sourced from GitHub. Claude Opus 4.1, released in August, surpassed it."
"Now, Claude Sonnet 4.5 has lapped that last model, outperforming it on the SWE-bench Verified evaluation, a human-filtered subset of the SWE-bench. Anthropic said that on the SWE-bench Verified, Sonnet 4.5 held its focus for more than 30 hours on complex, multi-step tasks. This capability is specifically useful for agentic tasks, which oftentimes require solo work in the background for extended periods of time."
Claude Sonnet 4.5 is released as a next-generation model with performance upgrades across the board. It reportedly outperforms prior Anthropic releases and leading competitor models on SWE-bench Verified, a human-filtered software engineering benchmark, and demonstrated sustained focus for more than 30 hours on complex, multi-step tasks, supporting long-running agentic workloads. Anthropic also updated the Claude Code tools and the Claude for Chrome extension. Earlier milestones included Claude Opus 4 and Sonnet 4 scoring highly in May and Claude Opus 4.1 surpassing them in August. The model targets coding, complex agents, reasoning, and mathematical capabilities.
Read at ZDNET