The Nonprofit Feeding the Entire Internet to AI Companies
Briefly

"For more than a decade, the nonprofit has been scraping billions of webpages to build a massive archive of the internet. This database-large enough to be measured in petabytes-is made freely available for research. In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models."
"to study book banning in various countries, among other things. In a 2012 interview, Gil Elbaz, the founder of Common Crawl, said of its archive that "we just have to make sure that people use it in the right way. Fair use says you can do certain things with the world's data, and as long as people honor that and respect the copyright of this data, then everything's great.""
Common Crawl maintains a petabyte-scale archive built by scraping billions of webpages and makes it freely available for research. Major AI companies, including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon, have used the archive to train large language models. The archive contains content from paywalled major news websites, allowing AI firms to access high-quality journalism without paying publishers, even though the organization publicly claims to scrape only "freely available content" and to avoid "going behind any 'paywalls.'" The foundation's practices raise concerns about copyright, transparency, and the use of publishers' content in AI training.
Read at The Atlantic