The AI revolution is running out of data. What can researchers do?
Briefly

The past decade of explosive improvement in AI has been driven in large part by making neural networks bigger and training them on ever more data. This scaling has proved surprisingly effective at making large language models (LLMs), such as those that power the chatbot ChatGPT, both better at replicating conversational language and more likely to develop emergent capabilities such as reasoning. However, some specialists say that we are now approaching the limits of scaling.
Researchers at Epoch AI projected that, by around 2028, the typical data set used to train an AI model will be as large as the total estimated stock of public online text. In other words, AI is likely to run out of training data in about four years, just as data owners, such as newspaper publishers, are starting to crack down on how their content can be used.
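The projection is essentially an exponential extrapolation: training-set sizes grow by a roughly constant factor each year until they meet a fixed stock of available text. The short Python sketch below illustrates that reasoning with purely hypothetical figures; the starting size, the stock of public text, and the yearly growth factor are assumptions chosen for illustration, not numbers taken from the article.

```python
# Illustrative back-of-the-envelope projection of when training-data demand
# could meet the stock of public online text. All figures are assumptions
# for illustration only, not values reported by Epoch AI or Nature.

current_training_tokens = 15e12    # assumed size of today's largest training sets (tokens)
public_text_stock_tokens = 300e12  # assumed total stock of public online text (tokens)
annual_growth_factor = 2.5         # assumed yearly growth in training-set size

year = 2024
tokens = current_training_tokens
while tokens < public_text_stock_tokens:
    tokens *= annual_growth_factor
    year += 1

print(f"Under these assumptions, demand matches the stock around {year}.")
# With these example numbers, the crossover lands around 2028,
# consistent with the timeline described above.
```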
Shayne Longpre, an AI researcher at MIT, says that tightening access to data is causing a crisis in the size of the 'data commons'. He strongly suspects that the imminent bottleneck in training data could already be starting to pinch AI development.
While restrictions on data access may slow the rapid improvement of AI systems, developers are finding workarounds, and few at the large AI companies appear to be panicking; innovative solutions could yet emerge despite the looming data scarcity.
Read at Nature