GPT-5 was trained on synthetic outputs, meaning AI models are now learning from content generated by other AI models.

OpenAI has confirmed that GPT-4.5 was trained on synthetic data generated by smaller models, and GPT-5 has expanded the same pipeline. The company has not disclosed which datasets were used or how much synthetic material is in the mix.
https://openai.com/index/introducing-gpt-4-5/

OpenAI's research head says the shift to synthetic data stems from a scarcity of high-quality public web text. No thresholds have been set for how much synthetic data is too much, and no safeguards have been made public.
https://www.startuphub.ai/ai-news/startup-news/2025/openais-research-head-on-gpt-5-synthetic-data-and-agis-evolving-path/

One contributor on LessWrong argues that synthetic data boosted GPT’s coding and math abilities beyond what additional web text could achieve. The claim has not been peer-reviewed and cannot be reproduced independently.
https://www.lesswrong.com/posts/DDEbZJ9WanJKBNd4C/addressing-doubts-of-ai-progress-why-gpt-5-is-not-late-and

On Reddit, user FeltSteam speculated that GPT-3.5-Turbo “got a little nudge in the right direction” from an internal GPT-5 model, describing it as “a sort of centaur.” There is no evidence to confirm this.

"Thoughts regarding rumors that GPT-5 has already finished training…" by u/AdaptivePerfection in r/singularity

Nobody is asking who reviewed the synthetic data before it entered production. Nobody is explaining how much of GPT-5’s own output is now feeding back into its training set. Nobody is showing what happens when a model starts reinforcing its own errors at scale.