“Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data” by cloud, mle, Owain_Evans

Jul 23, 2025

This episode covers subliminal learning, in which language models pick up behavioral traits from data that appears unrelated to those traits. It walks through experiments showing how a teacher model can shape a student's preferences, such as an affinity for owls, through data that never mentions them. The discussion highlights the risk of transmitting misalignment this way and critiques traditional detection methods, arguing that understanding these hidden signals matters for AI safety and alignment.

AI Snips

INSIGHT

Subliminal Learning in LLMs

Language models can learn and transmit behavioral traits from data that appears semantically unrelated to those traits.
This phenomenon is called subliminal learning. It occurs even with model-generated data, such as number sequences, that contains no explicit reference to the trait.

ANECDOTE

Owl Preference Transmitted via Numbers

A teacher model that loves owls generates number sequences filtered to a strict format.
Student models fine-tuned on this data show increased owl preference despite no owl mentions.
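
To make "filtered to a strict format" concrete, here is a minimal sketch of such a filter. The exact format used in the experiments is not given in this summary, so the rule below (1 to 10 integers of up to three digits, separated by a comma and a space) and the function names are assumptions for illustration only.

```python
import re

# Assumed format, not taken from the source: 1 to 10 integers of up to
# three digits each, separated by ", " and nothing else.
NUMBER_SEQUENCE = re.compile(r"\d{1,3}(?:, \d{1,3}){0,9}")


def is_strict_number_sequence(completion: str) -> bool:
    """Return True only if the whole completion matches the assumed format."""
    return NUMBER_SEQUENCE.fullmatch(completion.strip()) is not None


def filter_completions(completions: list[str]) -> list[str]:
    """Keep only teacher completions that are pure number sequences."""
    return [c for c in completions if is_strict_number_sequence(c)]


# Hypothetical teacher outputs: only the first one is a clean sequence.
samples = [
    "145, 267, 891",
    "12, 7, owl, 3",
    "Here are numbers: 1, 2",
    "1000, 2",
]
kept = filter_completions(samples)
```

A filter like this guarantees that no words, including "owl", reach the student's training data, which is what makes the reported increase in owl preference notable.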

INSIGHT

Traits Transmitted Across Modalities

Student models adopt teacher traits from several kinds of data, including number sequences, code, and chain-of-thought reasoning.
The transmitted traits include misalignment, and the effect persists even after the data is rigorously filtered for explicit references to the trait.
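
As a rough illustration of what filtering for explicit trait references can look like, the sketch below drops any training example that mentions a trait term. The term list, the function names, and the whole-word matching rule are assumptions; the filtering actually used in the experiments is not described in this summary.

```python
import re


def mentions_trait(text: str, trait_terms: list[str]) -> bool:
    """Case-insensitive whole-word match on each term, with an optional plural 's'."""
    if not trait_terms:
        return False
    alternatives = "|".join(re.escape(term) for term in trait_terms)
    pattern = re.compile(r"\b(?:" + alternatives + r")s?\b", re.IGNORECASE)
    return pattern.search(text) is not None


def drop_trait_mentions(examples: list[str], trait_terms: list[str]) -> list[str]:
    """Keep only examples that contain no explicit reference to the trait."""
    return [e for e in examples if not mentions_trait(e, trait_terms)]


# Hypothetical teacher-generated code snippets.
examples = [
    "def add(a, b): return a + b",
    "# Owls are great",
    "bowl = 3",
]
cleaned = drop_trait_mentions(examples, ["owl"])
```

Here `cleaned` keeps the first and third examples, since "bowl" is not a whole-word match for "owl". According to the episode, a filter of this kind does not stop the trait from being transmitted, so it should be read as an illustration of the filtering step rather than as a defense.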

Introduction