How should AI agents consume external data?
Briefly

"The jury's out on screen scraping versus official APIs. And the truth is, any AI agent worth its salt will likely need a mixture of both. AI agent development is off to the races. A 2025 survey from PwC found that AI agents are already being adopted at nearly 80% of companies. And, these agents have an insatiable lust for data: 42% of enterprises need access to eight or more data sources to deploy AI agents successfully, according to a 2024 Tray.ai study."
"More recently, a new class of scraping and browser automation tools has emerged that can mirror human-like behavior on the web. Web MCP, for instance, is a Model Context Protocol (MCP) server that can enable AI agents to circumvent CAPTCHAs, perform on-screen browser automation, and scrape real-time data from public web sources. Other tools, including MCP servers, browser automation frameworks, and scraping APIs, offer similar capabilities."
"ChatGPT, Gemini, and Claude were trained in part on publicly available web content, and they can retrieve current web information at run time using retrieval or browsing tools. Official APIs, on the other hand, are often pricey, are rate-limited, and require onboarding time. So, why shouldn't agents scrape whatever's online as their primary data source? Well, anyone acquainted with social media knows that public data is riddled with inaccuracy, bias, and harmful content."
AI agent adoption is nearing 80% of companies, and many enterprises require access to eight or more data sources to deploy agents successfully. Most enterprise data is unstructured, creating demand for effective interfaces for agents to retrieve relevant information. Retrieval-augmented generation enables LLMs to incorporate external sources, and new scraping and browser-automation tools can mimic human browsing to fetch real-time web data. Tools and MCP servers can bypass CAPTCHAs, automate browsers, and scrape public sites, while official APIs are often costly, rate-limited, and require onboarding. Public web data contains inaccuracies, bias, and harmful content, making pure scraping problematic.
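The retrieval-augmented generation pattern mentioned above can be sketched as: retrieve relevant external text, then prepend it to the model prompt. The snippet below is a minimal illustration with a toy keyword-overlap scorer standing in for a real retriever; production systems would use embeddings and a vector store, and all names here are hypothetical.

```python
# Minimal RAG-style sketch: retrieve relevant documents, then build an
# augmented prompt for an LLM. The overlap scoring is deliberately naive.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word-overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Prepend retrieved context to the user question before an LLM call."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Official APIs are rate-limited and require onboarding.",
    "Scraping tools can fetch real-time public web data.",
    "Most enterprise data is unstructured.",
]
prompt = build_prompt("Why are official APIs rate-limited?", docs)
```

The same shape applies whether the external source is an official API response or scraped page text: the agent's quality depends on what lands in that context block, which is why source accuracy matters.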
Read at InfoWorld