
"As part of its mission to preserve the web, the Internet Archive operates crawlers that capture webpage snapshots. Many of these snapshots are accessible through its public-facing tool, the Wayback Machine. But as AI bots scavenge the web for training data to feed their models, the Internet Archive's commitment to free information access has turned its digital library into a potential liability for some news publishers."
"The publisher decided to limit the Internet Archive's access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit's repository of over one trillion webpage snapshots. Specifically, Hahn said The Guardian has taken steps to exclude itself from the Internet Archive's APIs and filter out its article pages from the Wayback Machine's URLs interface."
""A lot of these AI businesses are looking for readily available, structured databases of content," he said. "The Internet Archive's API would have been an obvious place to plug their own machines into and suck out the IP." (He admits the Wayback Machine itself is "less risky," since the data is not as well-structured.)"
The Internet Archive runs crawlers that capture webpage snapshots, which are accessible via the Wayback Machine. AI companies harvesting web content for training data have made those archived pages a potential source of copyrighted material. The Guardian has limited the Internet Archive's access to its published articles, excluding itself from the Archive's APIs and filtering article pages out of the Wayback Machine while allowing regional and landing pages to remain. The Guardian views the Archive's APIs as especially risky because they provide structured access. Other publishers, including the Financial Times, block bots from AI firms and third parties to protect paywalled stories and IP.
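The article does not describe the mechanics of these blocks. One common way publishers signal crawler exclusions is a robots.txt file listing disallowed user agents and paths; the minimal sketch below, using Python's standard robotparser, shows how such rules are evaluated. The agent names, paths, and rules here are illustrative assumptions, not any publisher's actual configuration.

```python
# Illustrative sketch only: the source article does not specify how the blocks
# are implemented. This shows how robots.txt-style rules are evaluated.
from urllib import robotparser

# Hypothetical robots.txt: disallow an archive crawler and an AI training bot
# from article pages or the whole site, while leaving other pages open.
example_robots_txt = """
User-agent: ia_archiver
Disallow: /article/

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(example_robots_txt)

# Check whether each crawler may fetch an article page vs. the homepage.
for agent in ("ia_archiver", "GPTBot", "SomeBrowser"):
    for url in ("https://example.com/article/2024/story", "https://example.com/"):
        print(agent, url, parser.can_fetch(agent, url))
```

In this hypothetical setup, the archive crawler is shut out of article pages but not the homepage, the AI bot is excluded everywhere, and ordinary clients are unaffected; excluding a site from an archive's structured APIs, as The Guardian describes, is a separate step handled on the archive's side.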
Read at Nieman Lab