When will browser agents do real work?
Briefly

"Computer-use models perceive and act like humans do. They analyze the browser screen as an image and issue clicks or text inputs at coordinates, which is powerful in theory, but fragile in practice. Rendering differences, latency, and the difficulty of parsing complex layouts all contribute to unreliability. For agents operating at enterprise scale, even a 1% failure rate can be unacceptable."
"They look at screenshots, interpret them using multimodal models, and output low-level actions like "click (210,260)" or "type "Peter Pan"." This mimics how a human would use a computer-reading visible text, locating buttons visually, and clicking where needed. The upside is universality: the model doesn't need structured data, just pixels. The downside is precision and performance: visual models are slower, require scrolling through the entire page, and struggle with subtle state changes between screenshots ("Is this button clickable yet?")."
Browser agents are approaching production readiness and are already used in critical sectors such as healthcare and insurance. OpenAI released Operator in January 2025 as the first large-scale computer-use model to control its own browser, moving the mouse and clicking like a human; it discontinued Operator in August 2025 and folded it into ChatGPT Agent Mode, which combines a visual and a text-based browser. Computer-use models perceive the screen as an image and issue coordinate-based clicks, which makes them powerful but fragile: rendering differences, latency, and complex layouts all cause failures, and even a 1% failure rate can be unacceptable at enterprise scale. Vision-based agents use screenshots and multimodal models to output low-level actions, which gives them universality but limits precision and performance. DOM-based agents instead operate directly on the Document Object Model (DOM), targeting the page's structured elements rather than pixels.
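For contrast, a DOM-based agent can address elements by their structure rather than their position on screen. The sketch below again assumes Playwright; the URL and field names are placeholders for illustration, not taken from the article.

```python
# Sketch of a DOM-based action: target elements by accessible role and label,
# not by screen coordinates.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/search")  # placeholder URL
    # Selectors survive layout and rendering changes that would break a (x, y) click.
    page.get_by_role("textbox", name="Search").fill("Peter Pan")
    page.get_by_role("button", name="Search").click()
    browser.close()
```

The trade-off mirrors the article's framing: pixels work on anything the agent can see, while structured DOM access gives more precise, repeatable actions on pages that expose it.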
Read at InfoWorld