Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use.
If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems, illustrating the emerging crisis in data consent.
Bots used to be a welcome sight in your web analytics, because a crawler visit meant a search engine was indexing your site and sending traffic back your way. Bots scraping for generative AI, by contrast, take everything, and the people who run the sites get little, if anything, in return.
Given that, the decline in data availability seems warranted: sites are putting up more and more barriers to restrict scraping whose purpose is to feed AI training datasets.
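Much of this blocking happens through the Robots Exclusion Protocol. As a minimal sketch (not taken from the paper), the snippet below uses Python's standard `urllib.robotparser` to show how a robots.txt that singles out known AI crawlers, such as OpenAI's GPTBot and Common Crawl's CCBot, reads to a compliant bot: the AI user agents are disallowed site-wide while everything else stays allowed. The robots.txt contents and URL here are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind counted as a restriction:
# known AI crawlers are disallowed site-wide, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks before fetching; only the AI bots are turned away.
for agent in ("GPTBot", "CCBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/some-article")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Of course, robots.txt is purely advisory, which is why the paper frames its findings in terms of what happens "if respected or enforced".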