LLM collapse: The danger of training LLMs on AI-generated data

Published: June 12, 2026, 4:22 PM

What happens when a new generation of large language models (LLMs) is trained on data produced by its predecessors? As internet content becomes increasingly populated with synthetic data, a peculiar problem could emerge – LLMs can collapse under the weight of AI-generated training data.

Researchers warn that this can lead to ‘model collapse’: as successive models are fed data produced by earlier LLMs, they depend less and less on the original data. This, in turn, could make later models misperceive reality.

According to a paper published in Nature, using model-generated content in training caused irreversible defects in the resulting models, making the tails of the original content distribution disappear. To sustain AI development in the long term, the authors argue that access to the original data source must be preserved.

“We discover that indiscriminately learning from data produced by other models causes ‘model collapse’,” the authors of the paper, ‘AI models collapse when trained on recursively generated data,’ said.

Raghava Rao Mukkamala, Professor in the Department of Digitalization at Copenhagen Business School, Denmark, explains that current Generative AI models produce new data by learning statistical patterns from massive training datasets, much of which is scraped from the Internet.

“Unlike humans, who communicate through reasoned intent, experience, and logical argumentation, AI models generate content by applying probabilistic models to patterns they observed in their training data,” he said.

“The Nature paper showed that this recursive training cycle causes models to progressively lose track of the true underlying data distribution in the real world. It showed that rare and diverse patterns are often the first to disappear, making AI outputs increasingly homogeneous, repetitive, and detached from reality,” he told businessline.
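The loss of rare patterns can be seen in a toy simulation. The sketch below is not the method used in the Nature paper; it assumes a hypothetical 50-token vocabulary with a long tail of rare tokens, and treats each "model" as nothing more than the token frequencies counted in a sample drawn from the previous model. The vocabulary, the sample size, and the function names are all illustrative assumptions.

```python
import random
from collections import Counter


def next_generation(probs, sample_size, rng):
    """Sample text from the current model, then refit by counting frequencies."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=sample_size)
    counts = Counter(sample)
    # A token that was never sampled gets no entry, so it can never come back.
    return {t: c / sample_size for t, c in counts.items()}


def simulate(generations=30, sample_size=200, seed=0):
    """Return the number of distinct tokens each generation can still produce."""
    rng = random.Random(seed)
    # Hypothetical "real" data: a few common tokens and many rare ones.
    raw = {f"tok{i}": 1.0 / (i + 1) for i in range(50)}
    total = sum(raw.values())
    probs = {t: w / total for t, w in raw.items()}

    support_sizes = [len(probs)]
    for _ in range(generations):
        probs = next_generation(probs, sample_size, rng)
        support_sizes.append(len(probs))
    return support_sizes


if __name__ == "__main__":
    print(simulate())
```

Because each generation learns only from what the previous one happened to produce, the count of distinct tokens can only stay flat or fall, and the rare tokens are the most likely to go missing first.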

The findings suggest that relying on AI-generated content to train future AI models may degrade model quality over time. The study also highlights that preserving access to authentic, real-world, human-generated data is essential for maintaining the diversity, accuracy, and reliability of future AI systems.

Kashyap Kompella, Chief Executive Officer of RPA2AI Research, said that AI models have been improving because of three main scaling factors: more compute, larger models, and more training data.

“For the last decade, the industry treated web-scale human content as a vast natural resource. That assumption is now breaking. The public availability of high-quality human-generated text is finite. If scaling trends continue, language models could fully use the available stock of public human-generated text by 2032,” he said.

This, however, does not mean “there is no more data.” It means the easiest, cheapest, broadest pool of public human text is no longer enough to keep scaling models in the old way.

“The industry is therefore moving toward licensed data, proprietary enterprise data, human feedback, interaction logs, multimodal data, simulation data, and synthetic data,” he said.

Synthetic data is not automatically bad, Kompella said; it is already useful in code, math, robotics, gaming, autonomous driving, privacy-safe testing, rare-case simulation, and instruction tuning. “The problem begins when synthetic data becomes a substitute for a real-world signal rather than a controlled supplement. Model collapse occurs when AI models are repeatedly trained on outputs from earlier models,” he said.
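The difference between a substitute and a supplement can be illustrated with a small sketch. It assumes the same kind of toy setup as a frequency-counting "model" over a hypothetical 50-token vocabulary, and adds one assumed knob, `real_fraction`, for the share of each generation's training sample that is drawn fresh from the real distribution. None of these names or numbers come from the article or the paper.

```python
import random
from collections import Counter

VOCAB = [f"tok{i}" for i in range(50)]
# Hypothetical long-tailed "real" data: tok0 is common, tok49 is rare.
REAL_WEIGHTS = [1.0 / (i + 1) for i in range(50)]


def surviving_tokens(real_fraction, generations=30, sample_size=200, seed=0):
    """Count distinct tokens left after repeated retraining.

    real_fraction=0.0 means each generation trains only on the previous
    model's output; higher values mix in fresh real data every round.
    """
    rng = random.Random(seed)
    n_real = int(sample_size * real_fraction)
    n_synth = sample_size - n_real

    # Generation 0 is trained on real data only.
    counts = Counter(rng.choices(VOCAB, weights=REAL_WEIGHTS, k=sample_size))
    for _ in range(generations):
        tokens = list(counts)
        weights = [counts[t] for t in tokens]
        synthetic = rng.choices(tokens, weights=weights, k=n_synth)
        real = rng.choices(VOCAB, weights=REAL_WEIGHTS, k=n_real)
        counts = Counter(synthetic + real)
    return len(counts)


if __name__ == "__main__":
    print("synthetic only:", surviving_tokens(0.0))
    print("half real data:", surviving_tokens(0.5))
```

In this toy setting, a steady supply of real samples keeps reintroducing rare tokens that the synthetic-only loop loses for good. It is an illustration of the idea, not evidence about how real LLM training behaves.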

Who will have an edge?

For AI vendors, data quality becomes a strategic moat. Vendors with access to licensed archives, proprietary usage data, enterprise data, multimodal streams, and verified human feedback will have an advantage over vendors relying mainly on public web crawls.

Users will encounter more polished but less original content. The internet will contain more content that looks clear, formatted, and confident but is derivative, repetitive, or weakly sourced.
