
"You can divide the recent history of LLM data scraping into a few phases. There was for years an experimental period, when ethical and legal considerations about where and how to acquire training data for hungry experimental models were treated as afterthoughts. Once apps like ChatGPT became popular and companies started commercializing models, the matter of training data became instantly and extremely contentious."
"They're up against sophisticated actors. Lavishly funded start-ups and tech megafirms are looking for high-quality data wherever they can find it, offline and on, and web scraping has turned into an arms race. There are scrapers masquerading as search engines or regular users, and blocked companies are building undercover crawlers. Website operators, accustomed to having at least nominal control over whether search engines index their content, are seeing the same thing in their data: swarms of voracious machines making constant attempts to harvest their content,"
LLM data scraping moved from an experimental era that largely ignored sourcing ethics to a heated, commercialized phase after consumer chatbots appeared. Creators, publishers, and platforms pursued licensing agreements and legal action, while major AI firms negotiated individual deals to secure data access. Scraping has intensified despite these agreements, as well-funded startups and tech giants hunt for high-quality data wherever they can find it, using disguised and covert crawlers. Website operators and infrastructure providers report massive, repeated harvesting attempts that strain their systems and erode their control over their own content. Leaked lists of scraped sites reportedly include copyrighted, pirated, adult, and original news material, raising legal and ethical concerns.
Read at Intelligencer