Training on “junk data” can lead to LLM “brain rot”

Introduction to LLM Brain Rot

On the surface, it seems obvious that training an LLM with “high quality” data will lead to better performance than feeding it any old “low quality” junk you can find. Now, a group of researchers is attempting to quantify just how much this kind of low-quality data can cause an LLM to experience effects akin to human “brain rot.”

The LLM Brain Rot Hypothesis

For a pre-print paper published this month, the researchers from Texas A&M, the University of Texas, and Purdue University drew inspiration from existing research showing how humans who consume “large volumes of trivial and unchallenging online content” can develop problems with attention, memory, and social cognition. That led them to what they’re calling the “LLM brain rot hypothesis,” summed up as the idea that “continual pre-training on junk web text induces lasting cognitive decline in LLMs.”

Defining Junk Web Text

Figuring out what counts as “junk web text” and what counts as “quality content” is far from a simple or fully objective process, of course. But the researchers used a few different metrics to separate a “junk dataset” and a “control dataset” out of HuggingFace’s corpus of 100 million tweets.

Metrics for Junk Tweets

Since brain rot in humans is “a consequence of Internet addiction,” they write, junk tweets should be ones “that can maximize users’ engagement in a trivial manner.” As such, the researchers created one “junk” dataset by collecting tweets with high engagement numbers (likes, retweets, replies, and quotes) and shorter lengths, figuring that “more popular but shorter tweets will be considered to be junk data.”
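
To make that first metric concrete, here is a minimal sketch of an engagement-and-length filter. It is not the researchers' code: the Tweet fields, the two cutoff values, the use of character count as the length measure, and the choice to treat every non-junk tweet as "the rest" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical record layout; the real corpus schema may differ.
    text: str
    likes: int
    retweets: int
    replies: int
    quotes: int


# Hypothetical cutoffs chosen only for this example.
MIN_ENGAGEMENT = 500   # total interactions above which a tweet counts as "popular"
MAX_LENGTH = 30        # character count below which a tweet counts as "short"


def engagement(tweet: Tweet) -> int:
    """Sum the four engagement signals named in the article."""
    return tweet.likes + tweet.retweets + tweet.replies + tweet.quotes


def is_junk(tweet: Tweet,
            min_engagement: int = MIN_ENGAGEMENT,
            max_length: int = MAX_LENGTH) -> bool:
    """A tweet is flagged as junk when it is both popular and short."""
    return engagement(tweet) > min_engagement and len(tweet.text) < max_length


def split_junk(tweets: list[Tweet]) -> tuple[list[Tweet], list[Tweet]]:
    """Return (junk, rest). Treating 'rest' as a control pool is an assumption."""
    junk = [t for t in tweets if is_junk(t)]
    rest = [t for t in tweets if not is_junk(t)]
    return junk, rest


if __name__ == "__main__":
    sample = [
        Tweet("lol so true", likes=900, retweets=300, replies=50, quotes=10),
        Tweet("A long, carefully sourced thread about municipal water policy.",
              likes=3, retweets=0, replies=1, quotes=0),
    ]
    junk, rest = split_junk(sample)
    print([t.text for t in junk])
```

Requiring both conditions at once mirrors the quoted idea that "more popular but shorter" tweets are the junk; a popular long tweet or an ignored short one would not be flagged by this sketch.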

Semantic Quality of Tweets

For a second “junk” metric, the researchers drew from marketing research to define the “semantic quality” of the tweets themselves. Using a complex GPT-4o prompt, they sought to pull out tweets that focused on “superficial topics (like conspiracy theories, exaggerated claims, unsupported assertions or superficial lifestyle content)” or that had an “attention-drawing style (such as sensationalized headlines using clickbait language or excessive trigger words).” A random sample of these LLM-based classifications was spot-checked against evaluations from three graduate students with a 76 percent matching rate.
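
A matching rate like that one can be computed in a few lines. The sketch below assumes binary "junk"/"control" labels and assumes the three human evaluations are combined by majority vote; the article does not say how the graduate students' judgments were aggregated, and the sample labels here are made up purely to exercise the function.

```python
from collections import Counter
from typing import Sequence


def majority_label(votes: Sequence[str]) -> str:
    """Return the most common label among the human votes.

    With three voters and two possible labels there is always a strict
    majority, so no tie-breaking is needed in that setting.
    """
    return Counter(votes).most_common(1)[0][0]


def matching_rate(llm_labels: Sequence[str],
                  human_votes: Sequence[Sequence[str]]) -> float:
    """Fraction of items where the LLM label equals the human majority label."""
    if len(llm_labels) != len(human_votes):
        raise ValueError("each LLM label needs one set of human votes")
    if not llm_labels:
        raise ValueError("cannot compute a matching rate on an empty sample")
    matches = sum(
        llm == majority_label(votes)
        for llm, votes in zip(llm_labels, human_votes)
    )
    return matches / len(llm_labels)


if __name__ == "__main__":
    # Made-up labels for four tweets; not data from the paper.
    llm = ["junk", "junk", "control", "control"]
    humans = [
        ("junk", "junk", "control"),        # majority: junk    -> match
        ("control", "control", "junk"),     # majority: control -> mismatch
        ("control", "control", "control"),  # majority: control -> match
        ("junk", "control", "control"),     # majority: control -> match
    ]
    print(matching_rate(llm, humans))  # 0.75
```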

Conclusion

The research on LLM brain rot highlights the importance of the quality of data used to train large language models. As the use of LLMs becomes more widespread, it is essential to consider the potential effects of low-quality data on their performance and to develop strategies for mitigating these effects.

FAQs

Q: What is LLM brain rot?
A: LLM brain rot refers to the potential decline in cognitive abilities of large language models (LLMs) caused by continual pre-training on low-quality data.
Q: How did the researchers define junk web text?
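A: The researchers used two metrics. The first flagged shorter tweets with high engagement numbers (likes, retweets, replies, and quotes). The second used a GPT-4o prompt to flag tweets that focused on superficial topics or had an attention-drawing style.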
Q: What are the potential consequences of LLM brain rot?
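A: The hypothesis is that continual pre-training on junk web text induces lasting cognitive decline in LLMs. The problems with attention, memory, and social cognition come from the existing research on humans that inspired the hypothesis, not from findings about LLMs.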
Q: Why is it important to consider the quality of data used to train LLMs?
A: If the hypothesis holds, junk web text in the training mix can degrade a model in lasting ways, so data quality directly affects how well an LLM performs.