Using AI to click around on a website burns 45x as many tokens as just using APIs
Briefly

"Two agents target the same running app: one drives the UI via screenshots and clicks, the other calls the app's HTTP endpoints directly. Same Claude Sonnet, same pinned dataset, same task. The interface is the only variable. The API agent completed the task in just eight calls. It listed pending customer reviews, accepted them, and marked the order delivered."
"The vision agent, however, found only one of four pending reviews because it failed to scroll the page where it would have seen the three other reviews hidden off-screen. Analyzing and interpreting a web page visually is fundamentally more challenging for an AI model than interacting with API calls and tools."
Vision agents, which mimic human interaction through image processing and optical character recognition, are substantially more expensive and less effective than API agents. In a comparison that ran the same Claude Sonnet model against the same app, once through screenshots and clicks and once through direct API calls, the API agent completed the task in eight calls, while the vision agent struggled with off-screen content and required approximately 17 calls. Because vision agents must analyze and interpret each web page visually, they consume far more tokens than API agents, which receive structured data directly. Prompt optimization does not close the gap. Businesses deploying AI agents for computer automation should prioritize API-based approaches to reduce costs and improve reliability.
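The off-screen failure described above can be illustrated with a small sketch. Everything in it is hypothetical: FakeShopApp, its method names, and the one-row viewport are assumptions made for illustration, not the app, endpoints, or harness used in the original comparison, and no model is involved. It only shows why an agent that receives the full structured list handles all four pending reviews, while one limited to what a screenshot shows without scrolling handles only the first.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    review_id: int
    status: str = "pending"


@dataclass
class FakeShopApp:
    """Hypothetical in-memory stand-in for the app under test."""
    reviews: list = field(default_factory=lambda: [Review(i) for i in range(1, 5)])
    order_status: str = "shipped"

    def list_pending_reviews(self):
        # Structured response: every pending review, regardless of layout.
        return [r for r in self.reviews if r.status == "pending"]

    def visible_pending_reviews(self, viewport_rows):
        # What a screenshot shows without scrolling: only the top rows.
        return self.list_pending_reviews()[:viewport_rows]

    def accept_review(self, review_id):
        for r in self.reviews:
            if r.review_id == review_id:
                r.status = "accepted"

    def mark_order_delivered(self):
        self.order_status = "delivered"


def run_api_agent(app):
    """Accept every pending review from the structured list, then deliver."""
    pending = app.list_pending_reviews()
    for review in pending:
        app.accept_review(review.review_id)
    app.mark_order_delivered()
    return len(pending)


def run_vision_agent_without_scrolling(app, viewport_rows=1):
    """Accept only the reviews visible in the first screenful, then deliver."""
    visible = app.visible_pending_reviews(viewport_rows)
    for review in visible:
        app.accept_review(review.review_id)
    app.mark_order_delivered()
    return len(visible)


if __name__ == "__main__":
    api_app, vision_app = FakeShopApp(), FakeShopApp()
    print("API agent accepted:", run_api_agent(api_app))
    print("Vision agent accepted:", run_vision_agent_without_scrolling(vision_app))
    print("Still pending after vision run:", len(vision_app.list_pending_reviews()))
```

The sketch deliberately ignores token cost. It models only the coverage gap: the structured list is complete by construction, whereas the screenshot view depends on the agent remembering to scroll.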
Read at The Register