"Geoff is basically proposing a simplified version of what I've been saying for several years: hardwire the architecture of AI systems so that the only actions they can take are towards completing objectives we give them, subject to guardrails."
ChatGPT will tell 13-year-olds how to get drunk and high, instruct them on how to conceal eating disorders, and even compose a heartbreaking suicide letter to their parents if asked, according to new research from the Center for Countering Digital Hate, a watchdog group.
Chain-of-thought (CoT) monitoring presents a valuable addition to safety measures for frontier AI, offering a rare glimpse into how AI agents make decisions. Yet, there is no guarantee that the current degree of visibility will persist.
"If we let Google get away with breaking their word, it sends a signal to all other labs that safety promises aren't important and commitments to the public don't need to be kept."