
On the surface, it seems obvious that training an LLM with "high quality" data will lead to better performance than feeding it any old "low quality" junk you can find. Now, a group of researchers is attempting to quantify just how much this kind of low-quality data can cause an LLM to experience effects akin to human "brain rot."
For a pre-print paper published this month, the researchers from Texas A&M, the University of Texas, and Purdue University drew inspiration from existing research showing how humans who consume "large volumes of trivial and unchallenging online content" can develop problems with attention, memory, and social cognition. That led them to what they're calling the "LLM brain rot hypothesis," summed up as the idea that "continual pre-training on junk web text induces lasting cognitive decline in LLMs."
Since brain rot in humans is "a consequence of Internet addiction," they write, junk tweets should be ones "that can maximize users' engagement in a trivial manner." As such, the researchers created one "junk" dataset by collecting tweets with high engagement numbers (likes, retweets, replies, and quotes) and shorter lengths, figuring that "more popular but shorter tweets will be considered to be junk data."
An "LLM brain rot" hypothesis proposes that continual pre-training on junk web text induces lasting cognitive decline in large language models. Junk and control datasets were extracted from a 100 million-tweet corpus using metrics designed to capture trivial, engagement-maximizing content and low semantic quality. One junk metric prioritized short tweets with high likes, retweets, replies, and quotes as proxies for trivial engagement. A second junk metric used a GPT-4o prompt and marketing-derived notions of semantic quality to isolate tweets focused on superficial topics such as conspiracy theories, exaggerated claims, and unsupported assertions. The aim is to quantify performance degradation from low-quality pre-training data.
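The first junk metric described above can be illustrated with a minimal sketch. This is not the paper's actual code; the record fields (`likes`, `retweets`, `replies`, `quotes`, `text`) and the thresholds are invented for illustration, standing in for the idea of selecting tweets that are both popular and short.

```python
# Hypothetical sketch of the engagement-based "junk" filter described in the
# article: popular-but-short tweets are flagged as junk. All field names and
# threshold values here are assumptions, not the paper's actual parameters.

def engagement(tweet):
    """Sum the four engagement signals named in the article."""
    return (tweet["likes"] + tweet["retweets"]
            + tweet["replies"] + tweet["quotes"])

def is_junk(tweet, min_engagement=500, max_words=30):
    """Flag a tweet as junk if it is highly engaged AND short.

    Thresholds are illustrative only; the paper ranks a 100M-tweet
    corpus rather than applying fixed cutoffs like these.
    """
    popular = engagement(tweet) >= min_engagement
    short = len(tweet["text"].split()) <= max_words
    return popular and short

# Toy corpus: one viral one-liner, one long low-engagement thread opener.
corpus = [
    {"text": "you will not BELIEVE this one trick",
     "likes": 900, "retweets": 300, "replies": 50, "quotes": 40},
    {"text": "A long, careful thread walking through the tradeoffs of "
             "distributed consensus protocols, starting from first "
             "principles and building up to practical deployment advice "
             "for real systems under partial failure and network delay",
     "likes": 12, "retweets": 1, "replies": 3, "quotes": 0},
]

junk = [t for t in corpus if is_junk(t)]
```

Here only the short viral tweet survives the filter; the long, lightly engaged tweet would land in the control set instead.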
Read at Ars Technica