"A lot of these AI businesses are looking for readily available, structured databases of content," said Robert Hahn, head of business affairs and licensing for The Guardian. "The Internet Archive's API would have been an obvious place to plug their own machines into and suck out the IP."
"We are blocking the Internet Archive's bot from accessing the Times because the Wayback Machine provides unfettered access to Times content - including by AI companies - without authorization," a representative from the newspaper confirmed to Nieman Lab.
The Internet Archive has served as a resource for journalists by preserving deleted tweets and providing academic texts for background research. The rise of AI has created new tension: publishers fear that AI companies and their bots use the Archive's collections to scrape articles indirectly for model training. Several major publications have begun blocking or limiting the Archive's bots and cataloging access to prevent unauthorized harvesting of their content. Publishers cite the Archive's API and the Wayback Machine as easy avenues for automated scraping. Some publishers have sued AI companies or negotiated licensing deals, which mainly compensate publishing companies rather than individual writers. Copyright and piracy disputes over AI training data also involve fiction writers, visual artists, and musicians.
Read at Engadget