
"Hi everyone, my name is Srini Penchikala. I am the lead editor for AI, ML and data engineering community at infoq.com website and I'm also a podcast host. Thank you for tuning into this podcast. In today's episode, I will be speaking with Elena Samuylova, co-founder and CEO at Evidently AI, the company behind the tools for evaluating, testing and monitoring the AI powered applications."
"Elena will discuss the topic of how to evaluate large language model based applications, LLM based applications, as well as applications leveraging AI agent technologies. So with a lot of different language models being released by major technology companies almost every day, it is very important to evaluate and test LLM powered applications. So we are going to focus on that. We are going to hear from Elena on the best practices and any other resources we want to go to learn about LLM evaluations."
Large language model applications require systematic evaluation, testing, and continuous monitoring across quality, safety, fairness, and performance dimensions. Rapid model releases and diverse deployment settings increase the need for automated evaluation pipelines and reproducible metrics. Non-deterministic predictive systems demand scenario-based testing, adversarial checks, and human-in-the-loop validation to surface edge cases and failure modes. Production monitoring must track data drift, distribution shifts, and metric degradation to trigger retraining or mitigation. Tooling can centralize test suites, run baseline comparisons, and provide observability for models and agent behaviors. Cross-industry experience highlights the importance of domain-specific datasets and clear evaluation criteria for reliable deployments.
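As a rough, library-agnostic sketch of the kind of automated evaluation check with baseline comparison described above (this is not Evidently's actual API; the identifiers run_app, TEST_CASES, and BASELINE_MEAN are illustrative assumptions), a minimal version in Python might look like this:

```python
# Minimal sketch of an automated evaluation check for an LLM-powered app:
# run a fixed test set, score each answer against simple criteria, and
# compare the aggregate score to a stored baseline to detect degradation.

from statistics import mean

def run_app(question: str) -> str:
    # Hypothetical stand-in for the application under test.
    return f"Answer to: {question}"

# A tiny scenario-based test set with required keywords per case.
TEST_CASES = [
    {"question": "How do I reset my password?", "must_contain": ["reset", "password"]},
    {"question": "What is your refund policy?", "must_contain": ["refund"]},
]

def score(answer: str, must_contain: list[str]) -> float:
    # Fraction of required keywords present in the answer (a crude proxy metric).
    hits = sum(1 for kw in must_contain if kw.lower() in answer.lower())
    return hits / len(must_contain)

def run_suite() -> list[float]:
    return [score(run_app(c["question"]), c["must_contain"]) for c in TEST_CASES]

# Baseline from a previous "known good" run; in practice this would be stored
# alongside the test suite and refreshed deliberately.
BASELINE_MEAN = 0.9
DEGRADATION_THRESHOLD = 0.1  # alert if the mean score drops by more than this

if __name__ == "__main__":
    current = mean(run_suite())
    print(f"mean score: {current:.2f} (baseline {BASELINE_MEAN:.2f})")
    if BASELINE_MEAN - current > DEGRADATION_THRESHOLD:
        print("ALERT: metric degradation vs. baseline; investigate before deploying")
```

In a real pipeline the keyword check would typically be replaced by richer metrics (LLM-as-judge scores, safety classifiers, retrieval-grounding checks) and the comparison would run on every change to the model, prompts, or data.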