The article discusses accusations that OpenAI trained its AI models on copyrighted material without proper licensing. A new paper from the AI Disclosures Project alleges that OpenAI's GPT-4o model was trained on paywalled books from O'Reilly Media without a licensing agreement. The analysis comes amid a shift in how AI models are trained: some companies are moving toward synthetic data, though few have abandoned real-world data. Using a method called DE-COP, the paper examines how well models recognize public versus non-public source material, raising ethical questions for AI development.
The new paper comes from the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss. It concludes that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. O'Reilly doesn't have a licensing agreement with OpenAI, the paper says.
"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content [...] compared to OpenAI's earlier model GPT-3.5 Turbo," the paper states.
While a number of AI labs including OpenAI have begun embracing AI-generated data to train AI as they exhaust real-world sources, few have eschewed real-world data entirely.
The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models' training data.
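The article does not spell out how DE-COP works. As we understand the method, it poses multiple-choice questions in which a verbatim passage is mixed in with paraphrases of it; a model that picks the verbatim passage far more often than chance may have seen that text in training. The Python sketch below illustrates that idea only and is not the paper's implementation. The names `build_quiz` and `guess_rate`, and the `choose` callable standing in for a model query, are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def build_quiz(
    verbatim: str, paraphrases: Sequence[str], rng: random.Random
) -> Tuple[List[str], int]:
    """Shuffle the verbatim passage in among its paraphrases.

    Returns the shuffled options and the index of the verbatim passage.
    """
    options = [verbatim, *paraphrases]
    rng.shuffle(options)
    return options, options.index(verbatim)


def guess_rate(
    items: Sequence[Tuple[str, Sequence[str]]],
    choose: Callable[[List[str]], int],
    trials: int = 20,
    seed: int = 0,
) -> float:
    """Fraction of quizzes in which `choose` picks the verbatim passage.

    `choose` is a stand-in (assumption) for asking a language model which
    option is the original text; it returns the index of its pick.
    """
    rng = random.Random(seed)
    correct = 0
    total = 0
    for verbatim, paraphrases in items:
        for _ in range(trials):
            options, answer = build_quiz(verbatim, paraphrases, rng)
            correct += int(choose(options) == answer)
            total += 1
    return correct / total


# Toy data: one "original" passage and three paraphrases (illustrative only).
items = [("original", ["para one", "para two", "para three"])]

# A chooser that has "memorized" the passage always finds it.
memorized = guess_rate(items, lambda opts: opts.index("original"))

# A chooser that always picks the first option should land near chance (1/4).
blind = guess_rate(items, lambda opts: 0, trials=400)
```

With four options, chance is 25 percent, so a guess rate well above that on non-public text is the kind of signal such a test looks for. A rate near chance does not prove the text was absent from training.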