Researchers Just Found Something That Could Shake the AI Industry to Its Core
Briefly

"For years now, AI companies, including Google, Meta, Anthropic, and OpenAI, have insisted that their large language models aren't technically storing copyrighted works in their memory and instead "learn" from their training data like a human mind. It's a carefully worded distinction that's been integral to their attempts to defend themselves against a rapidly growing barrage of legal challenges. It also cuts to the core of copyright law itself."
"But, crucially, the " fair use" doctrine holds that others can use copyrighted materials for purposes like criticism, journalism, and research. That's been the AI industry's defense in court against accusations of infringement; OpenAI CEO Sam Altman has gone as far as to say that it's "over " if the industry isn't allowed to freely leverage copyrighted data to train its models."
"Now, a damning new study could put AI companies on the defensive. In it, Stanford and Yale researchers found compelling evidence that AI models are actually copying all that data, not "learning" from it. Specifically, four prominent LLMs - OpenAI's GPT-4.1, Google's Gemini 2.5 Pro, xAI's Grok 3, and Anthropic's Claude 3.7 Sonnet - happily reproduced lengthy excerpts from popular - and protected - works, with a stunning degree of accuracy. They found that Claude outputted "entire books near-verbatim" with an accuracy rate of 95.8 percent."
Evidence shows major large language models can reproduce copyrighted works verbatim, including lengthy near‑verbatim passages and entire books at very high accuracy. AI companies have claimed models 'learn' from data rather than store copyrighted material, and have relied on fair use and training exemptions in legal defenses. Copyright grants creators exclusive rights to reproduce, adapt, distribute, perform, and display works; fair use permits certain uses for commentary, journalism, and research. Rights holders contend models were trained on pirated and copyrighted content without fair remuneration, prompting lawsuits and settlements. These reproduction findings intensify legal and policy disputes over training data and compensation.
Read at Futurism