
"Traditional enterprise systems are built on a simple assumption: the same input produces the same output. Agentic systems break that assumption, and much of today's ecosystem has adapted by evaluating variability rather than eliminating it. Over the past two years, a growing class of evaluation frameworks has emerged to make agent behavior observable and measurable. Tools such as LangSmith, Arize Phoenix, Promptfoo, Ragas, and OpenAI Evals capture execution traces and apply qualitative or LLM-based scoring to judge outcomes."
"These tools are essential for monitoring safety and performance, but they introduce a different testing model. Results are rarely binary. Teams increasingly rely on thresholds, retries, and soft failures to cope with evaluator variance. For example, industry coverage of AI agent testing notes that traditional QA assumptions break down for agents because outputs are probabilistic and evaluation often requires more flexible, probabilistic frameworks rather than strict pass/fail assertions."
"In parallel, some teams have rediscovered a more traditional approach, targeting repeatability and determinism in testing using the record and replay pattern. Borrowed from integration testing tools like vcr.py, the pattern captures real API interactions once and replays them deterministically in future test runs. LangChain now recommends this technique explicitly for LLM testing, noting that recording HTTP requests and responses can make CI runs fast, cheap, and predictable."
Agentic systems produce probabilistic outputs that invalidate the traditional assumption that identical inputs yield identical outputs. Engineering teams struggle to test this non-deterministic behavior and often evaluate variability rather than eliminating it. Evaluation frameworks such as LangSmith, Arize Phoenix, Promptfoo, Ragas, and OpenAI Evals capture execution traces and apply qualitative or LLM-based scoring to measure behavior. These tools produce non-binary results, prompting teams to use thresholds, retries, and soft failures to handle evaluator variance. Some teams adopt record-and-replay testing, capturing real API interactions once and replaying them deterministically. LangChain recommends recording HTTP requests and responses to make CI runs fast, cheap, and predictable. Docker's Cagent is positioned to support deterministic testing for agents.
Read at InfoQ