Claude Sonnet 4.5 Tops SWE-Bench Verified, Extends Coding Focus Beyond 30 Hours

"Anthropic has released Claude Sonnet 4.5, its most advanced coding model to date, featuring major improvements in agentic tasks, long-horizon task performance, and computer use capabilities. The company says the model's enhanced training and safety methods have significantly improved its behavior, reducing tendencies such as sycophancy, deception, power-seeking, and delusional reasoning. The model is now available via the Claude API, desktop, and mobile apps at the same price as its predecessor."

"Claude Sonnet 4.5 builds on Anthropic's strategy of iteratively improving model performance while maintaining alignment and safety. The model demonstrates the ability to sustain complex, multi-step reasoning and code execution tasks for over 30 hours. On the SWE-bench Verified benchmark, which measures an AI model's ability to solve real-world software issues, Claude Sonnet 4.5 achieved a score of 77.2%, up from 72.7% for Sonnet 4, marking a notable advance in autonomous coding capability."

"Anthropic describes Sonnet 4.5 as its 'most aligned frontier model,' highlighting a balance between greater capability and tighter safeguards. Under ASL-3, the company has enhanced automated classifiers that detect and block potentially harmful instructions, including those related to chemical, biological, radiological, or nuclear (CBRN) risks. According to Anthropic, false positives from these safety systems have dropped tenfold since their introduction and by a factor of two compared to the release of Claude Opus 4 in May 2025."

Anthropic has released Claude Sonnet 4.5, a coding-focused model with stronger agentic task handling, long-horizon reasoning, and real-world computer use. The model sustains complex multi-step reasoning and code execution for more than 30 hours and scored 77.2% on the SWE-bench Verified benchmark, up from 72.7% for Sonnet 4. Its OSWorld score rose to 61.4% from 42.2% within four months, indicating rapid gains in practical computer-use skills. The model ships under ASL-3 safeguards, with upgraded automated classifiers that block harmful instructions, including those related to CBRN risks. Anthropic reports that false positives from these classifiers have fallen tenfold since their introduction and by half since the release of Claude Opus 4 in May 2025.

#claude-sonnet-45 #coding-ai #ai-safety #benchmarks

Read at InfoQ


