
"Windsurf has introduced Arena Mode inside its IDE allowing developers to compare large language models side by side while working on real coding tasks. The feature is designed to let users evaluate models directly within their existing development context, rather than relying on public benchmarks or external evaluation websites. Arena Mode runs two Cascade agents in parallel on the same prompt, with the underlying model identities hidden during the session."
"Developers interact with both agents using their normal workflow, including access to their codebase, tools, and context. After reviewing the outputs, users can select which response performed better, and those votes are used to calculate model rankings. The results feed into both a personal leaderboard based on an individual's votes and a global leaderboard aggregated across the Windsurf user base."
Beyond blind head-to-head comparisons, Arena Mode supports testing specific models or predefined model groups, synchronized or branched follow-up prompts, and session finalization and recording. Free access is offered for a limited time, and Windsurf plans to add more models along with more granular leaderboards broken down by task, language, and team.
Read at InfoQ