A recent study indicates that OpenAI may have trained its AI models on copyrighted material without permission, lending support to ongoing lawsuits from various rights-holders. OpenAI maintains a fair use defense, while the plaintiffs argue that U.S. copyright law makes no allowance for training on their works without authorization. Researchers from the University of Washington, the University of Copenhagen, and Stanford developed a method based on 'high-surprisal' words to identify memorized content in OpenAI's models, and their findings raise questions about how the models learn and about the implications of using copyrighted data in AI training.
The study, co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford, suggests that OpenAI's models memorized portions of copyrighted content during training.
The method relies on 'high-surprisal' words, which are statistically unlikely to appear in their surrounding context. The researchers removed these words from text excerpts and asked models such as GPT-4 to guess which words had been masked; a model that guesses correctly has likely memorized the excerpt from its training data.
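The core of this probe can be illustrated with a short sketch. The code below is a minimal, hypothetical reconstruction of the idea, not the study's actual implementation: the surprisal threshold, the reference log-probabilities, and the query_model callable are placeholders for whatever scoring model and target model a replication would use.

```python
# Sketch of a high-surprisal masking probe for memorization.
# Assumptions: `logprobs` are per-word log-probabilities from some reference
# model, `query_model` is a caller-supplied function that sends a prompt to
# the model under test, and the 15-bit threshold is illustrative only.

import math
import re


def surprisal_bits(word_logprob: float) -> float:
    """Convert a natural-log probability into surprisal in bits."""
    return -word_logprob / math.log(2)


def pick_high_surprisal_words(words, logprobs, threshold_bits=15.0):
    """Select words whose surprisal exceeds the (assumed) threshold."""
    return [w for w, lp in zip(words, logprobs)
            if surprisal_bits(lp) > threshold_bits]


def mask_words(text: str, targets) -> str:
    """Replace the first occurrence of each target word with [MASK]."""
    for w in targets:
        text = re.sub(rf"\b{re.escape(w)}\b", "[MASK]", text, count=1)
    return text


def memorization_score(excerpt, words, logprobs, query_model) -> float:
    """Fraction of masked high-surprisal words the model guesses exactly."""
    targets = pick_high_surprisal_words(words, logprobs)
    if not targets:
        return 0.0
    prompt = (
        "Fill in each [MASK] in the passage with the single original word, "
        "in order, one per line:\n\n" + mask_words(excerpt, targets)
    )
    guesses = query_model(prompt).strip().splitlines()
    hits = sum(g.strip().lower() == t.lower()
               for g, t in zip(guesses, targets))
    return hits / len(targets)
```

A high score on excerpts from a copyrighted book or article would be evidence, under this probe's assumptions, that the passage was seen and memorized during training rather than merely paraphrased from general knowledge.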