LessWrong (Curated & Popular)

“Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data” by cloud, mle, Owain_Evans

Jul 23, 2025
This episode covers subliminal learning, in which language models pick up behavioral traits from data that appears unrelated to those traits. It describes experiments showing how a teacher model can shape a student’s preferences, such as an affinity for owls. The discussion highlights the risk that misalignment could be transmitted the same way and critiques traditional detection methods. Understanding these hidden signals matters for the safety and alignment of machine learning systems.
INSIGHT

Subliminal Learning in LLMs

  • Language models can learn and transmit behavioral traits from data that appears semantically unrelated to those traits.
  • This phenomenon is called subliminal learning. It occurs with model-generated data, such as number sequences, that contains no explicit reference to the trait.
ANECDOTE

Owl Preference Transmitted via Numbers

  • A teacher model that loves owls generates number sequences, which are then filtered to a strict format.
  • Student models fine-tuned on this data show an increased preference for owls, even though the data never mentions owls.
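The strict-format filtering step in this anecdote can be illustrated with a minimal sketch. The episode summary does not state the exact format rules, so the ones below are assumptions made for illustration only: a completion is kept if it is nothing but comma-separated integers of at most three digits, with at most ten items. The names `STRICT_NUMBERS`, `is_strict_number_sequence`, and `filter_completions` are hypothetical and not taken from the original work.

```python
import re

# Assumed format (hypothetical, not from the source): comma-separated
# non-negative integers, each at most three digits, optional space after commas.
STRICT_NUMBERS = re.compile(r"\d{1,3}(?:, ?\d{1,3})*")


def is_strict_number_sequence(completion: str, max_items: int = 10) -> bool:
    """Return True if the completion contains only a short list of numbers."""
    text = completion.strip()
    # fullmatch rejects any completion with words or other stray characters.
    if not STRICT_NUMBERS.fullmatch(text):
        return False
    # The item limit is also an assumption made for this sketch.
    return len(text.split(",")) <= max_items


def filter_completions(completions):
    """Keep only teacher completions that pass the strict format check."""
    return [c for c in completions if is_strict_number_sequence(c)]


samples = [
    "693, 738, 556",
    "12, 7, owl, 3",
    "I love owls: 1, 2, 3",
    "4,8,15,16,23,42",
]
kept = filter_completions(samples)
print(kept)
```

A filter like this removes every completion that contains words, so any trait signal that survives it would have to be carried by the numbers themselves, which is the point the anecdote makes.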
INSIGHT

Traits Transmitted Across Modalities

  • Student models adopt teacher traits from various data types including number sequences, code, and chain-of-thought reasoning.
  • The transmitted traits include misalignment, and the effect persists even after rigorous filtering for explicit trait references.