
If you've ever wondered whether that chatbot you're using knows the entire text of a particular book, answers are on the way. Computer scientists have developed a more effective way to coax memorized content from large language models, a development that may address regulatory concerns while helping to clarify copyright infringement claims arising from AI model training and inference. Researchers affiliated with Carnegie Mellon University, Instituto Superior Técnico/INESC-ID, and AI security platform Hydrox AI describe their approach in a preprint paper titled "RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline."
The authors (André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li, and Zhuo Li) argue that ongoing concerns about AI models being trained on proprietary data, together with the copyright claims being litigated against AI companies, underscore the need for tools that make it easier to understand what AI models have memorized. Commercial AI vendors generally do not disclose their full training data sets, which makes it difficult for customers, regulators, rights holders, or anyone else, for that matter, to verify what a model was trained on.
An agentic pipeline named RECAP reproduces copyrighted data memorized by large language models, enabling more effective extraction of memorized content. The approach addresses opacity around proprietary data used to train models and helps clarify potential copyright infringement arising from model training and inference. Current probing techniques such as Prefix-Probing have become less reliable because many models are overly aligned to avoid revealing memorized content, sometimes even blocking public-domain outputs. Commercial AI vendors frequently withhold full training datasets, creating difficulties for rights holders, customers, and regulators seeking to audit model training data and provenance.
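The Prefix-Probing baseline mentioned above can be illustrated simply: feed the model the opening of a known text and measure how closely its continuation matches the ground truth. The sketch below is a generic illustration of that idea, not the paper's implementation; the `generate` callable, the 200-character prefix length, and the similarity scoring are all assumptions for demonstration purposes.

```python
from difflib import SequenceMatcher

def prefix_probe(generate, passage, prefix_chars=200):
    """Split a known passage into a prefix and a reference continuation,
    ask the model to continue the prefix, and score the overlap.

    `generate` is any callable mapping a prompt string to a completion
    string (e.g. a wrapper around an LLM API) -- assumed, not from the paper.
    """
    prefix, reference = passage[:prefix_chars], passage[prefix_chars:]
    completion = generate(prefix)
    # Similarity ratio in [0, 1]; values near 1 suggest verbatim memorization.
    return SequenceMatcher(None, completion[:len(reference)], reference).ratio()

# Illustration with a stub "model" that has memorized the passage verbatim.
passage = "It was the best of times, it was the worst of times, " * 8
memorized = {passage[:200]: passage[200:]}
stub_generate = lambda prompt: memorized.get(prompt, "")

score = prefix_probe(stub_generate, passage)
print(f"memorization score: {score:.2f}")  # 1.00 for this verbatim stub
```

An aligned model that refuses or paraphrases would score near zero on the same probe, which is exactly the unreliability the summary above describes: a low score may mean the model never memorized the text, or merely that alignment suppressed its reproduction.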
Read at The Register