Google DeepMind Launches Gemini 2.5 Computer Use Model to Power UI-Controlling AI Agents
Briefly

"The Computer Use model brings Gemini's multimodal reasoning and visual understanding to environments like browsers and mobile apps, where AI must perceive the on-screen context and act accordingly. Early evaluations show the model performing strongly on several interface control benchmarks, including Online-Mind2Web, WebVoyager, and AndroidWorld. In tests reported by DeepMind and Browserbase, it reached around 70% accuracy on the Online-Mind2Web benchmark, with response times shorter than those of other publicly evaluated systems."
"In practical terms, the model operates in a loop via a new computer_use tool exposed through the Gemini API. Developers provide the model with a screenshot of the environment, a task description, and a record of previous actions. The model then returns structured function calls representing actions such as "click," "type," or "scroll." The client executes these actions, captures a new screenshot, and feeds it back to the model - repeating the cycle until the task is complete."
"Google DeepMind has recently released the Gemini 2.5 Computer Use model, a specialized variant of its Gemini 2.5 Pro system designed to enable AI agents to interact directly with graphical user interfaces. The new model allows developers to build agents that can click, type, scroll, and manipulate interactive elements on web pages."
Google DeepMind released Gemini 2.5 Computer Use, a variant of Gemini 2.5 Pro that enables AI agents to interact directly with graphical user interfaces by clicking, typing, scrolling, and manipulating interactive web elements. The model applies Gemini's multimodal reasoning and visual understanding to perceive on-screen context in browsers and mobile apps. In early evaluations on benchmarks such as Online-Mind2Web, WebVoyager, and AndroidWorld, it reached around 70% accuracy on Online-Mind2Web, with shorter response times than other publicly evaluated systems. The model operates in a loop through a computer_use tool exposed via the Gemini API: the client sends a screenshot, a task description, and a record of previous actions, and the model returns structured function calls that the client executes before capturing a fresh screenshot. The model is currently optimized for browsers but shows promise for mobile UI control, with desktop support a potential future expansion. Some developers caution that such UI-driving agents can be slow and that, for some tasks, direct API calls remain a simpler alternative.
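To make the perceive-act cycle concrete, here is a minimal sketch of the loop the article describes. All names in it (send_to_model, capture_screenshot, execute_action, the Action shape) are hypothetical placeholders standing in for the Gemini API's computer_use tool and a client-side automation layer such as a browser driver; they are not the real SDK surface.

```python
"""Minimal sketch of the Computer Use agent loop.

Assumption: send_to_model, capture_screenshot, and execute_action are
hypothetical stand-ins, not the actual Gemini SDK API.
"""
from dataclasses import dataclass, field


@dataclass
class Action:
    """A structured function call returned by the model."""
    name: str                       # e.g. "click", "type", "scroll", or "done"
    args: dict = field(default_factory=dict)


def capture_screenshot() -> bytes:
    """Placeholder: grab the current state of the UI (e.g. browser viewport)."""
    return b"<png bytes>"


def execute_action(action: Action) -> None:
    """Placeholder: perform the model's action via client-side automation."""
    print(f"executing {action.name} with {action.args}")


def send_to_model(task: str, screenshot: bytes, history: list[Action]) -> Action:
    """Placeholder for the Gemini API call carrying the computer_use tool.
    Returns 'done' immediately so this sketch terminates."""
    return Action(name="done")


def run_agent(task: str, max_steps: int = 20) -> None:
    """Repeat perceive -> decide -> act until the model signals completion."""
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()              # perceive current UI state
        action = send_to_model(task, screenshot, history)
        if action.name == "done":                      # task complete
            return
        execute_action(action)                         # client performs the step
        history.append(action)                         # fed back on the next turn


if __name__ == "__main__":
    run_agent("Find the cheapest flight from SFO to JFK")
```

The key design point the article highlights is that the model never touches the environment directly: it only sees screenshots and emits structured actions, while the client owns execution and feeds each new screenshot back into the loop.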
Read at InfoQ