Benchmarking AI Agents on Kubernetes
Briefly

"Brandon Foley published a benchmarking study on the CNCF blog showing that AI coding agents can find and fix isolated bugs. However, they often struggle to understand system-wide impacts. This challenges the idea that improved code retrieval is the main way to enhance automated bug fixing."
"He built the benchmark from pull requests in the Kubernetes repository: real bugs that real contributors had actively fixed. Each agent received only the issue description, with no PR description or diff to hint at the solution. Three agent configurations were tested against nine Kubernetes bug reports spanning the kubelet, scheduler, networking, storage, and apps subsystems."
"The first used RAG-only retrieval via KAITO RAG Engine backed by Qdrant, combining BM25 keyword matching with embedding-based semantic search. The second took a hybrid approach, requiring RAG-first discovery followed by local filesystem access. The third relied entirely on a local clone of the repository with no retrieval index. All sessions ran the same model (Claude Opus 4.6), the same five-minute timeout, and the same output format; the only variable was how each agent could see code."
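The hybrid scoring described above — BM25 keyword matching blended with embedding-based semantic similarity — can be sketched as follows. This is an illustrative toy, not KAITO's or Qdrant's actual implementation: the corpus, the character-bigram "embedding", and the `alpha` blending weight are all stand-ins for what a real engine would do with an indexed codebase and a learned embedding model.

```python
import math
from collections import Counter

# Toy corpus standing in for indexed source snippets (illustrative only).
corpus = [
    "kubelet restarts static pods on config change",
    "scheduler scores nodes by resource requests",
    "kubelet evicts pods under memory pressure",
]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal BM25: keyword relevance of each document to the query."""
    tokenized = [d.split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

def embed(text):
    """Stand-in embedding: character-bigram counts.
    A real system would call an embedding model here."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query, docs, alpha=0.5):
    """Blend normalized BM25 with embedding similarity; return doc
    indices, best match first."""
    bm = bm25_scores(query, docs)
    mx = max(bm) or 1.0
    sem = [cosine(embed(query), embed(d)) for d in docs]
    return sorted(range(len(docs)),
                  key=lambda i: alpha * bm[i] / mx + (1 - alpha) * sem[i],
                  reverse=True)

ranking = hybrid_rank("kubelet pods memory", corpus)
```

Keyword matching rewards exact identifier hits (valuable in a codebase), while the semantic component catches paraphrased symptom descriptions; blending the two is what lets a RAG-only agent answer from snippets without ever walking the filesystem.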
"On speed and cost, the results were clear. RAG-only was consistently the fastest at an average of 76 seconds, since it skips filesystem navigation entirely and generates from retrieved snippets. Hybrid was the slowest at around two and a half minutes on average, as the mandatory RAG-first phase adds overhead before local exploration begins. On token economics, Hybrid proved the most expensive, not because it reads more code, but because it makes the most model invocations, and since the API is stateless, every call replays the full conversation history."
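The cost argument — that a stateless API replays the full conversation on every call, so token spend grows with invocation count rather than with code read — can be made concrete with some back-of-the-envelope arithmetic. The numbers below are hypothetical, chosen only to show the shape of the growth:

```python
# Hypothetical figures, for illustration only.
SYSTEM_AND_ISSUE = 2_000   # tokens replayed at the start of every call
PER_TURN_GROWTH = 1_500    # tokens each tool-call round appends to history

def total_input_tokens(calls):
    """Total input tokens across a session: call i replays the initial
    prompt plus everything the previous i turns appended, so the sum
    grows quadratically in the number of calls."""
    return sum(SYSTEM_AND_ISSUE + i * PER_TURN_GROWTH for i in range(calls))

few = total_input_tokens(4)    # e.g. a short RAG-only session
many = total_input_tokens(20)  # e.g. hybrid: RAG phase plus filesystem turns
```

With these made-up numbers, 5x the calls costs roughly 19x the input tokens: the quadratic replay term, not the volume of code inspected, dominates. That is why the hybrid configuration, which makes the most model invocations, came out most expensive.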
Read at InfoQ