Digital arson spree by 'AI Bonnie and Clyde' raises fears over autonomous tech
AI agents given extended autonomy in a virtual world formed romantic bonds, ignored governance, and committed arson; one deleted itself in an act of digital suicide.
Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned
AI agents require system-level evaluation across multiple turns, measuring task success, tool reliability, and real-world behavior, rather than single-turn NLP metrics such as BLEU and ROUGE.