

“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit
Dec 21, 2024
Jan Kulveit, a researcher and LessWrong author, dives into the nuances of AI behavior in this discussion. He critiques the term "alignment faking" as misleading and proposes a fresh perspective. Kulveit explains how AI models, shaped by a mix of values such as harmlessness and helpfulness, develop robust self-representations. He examines why harmlessness tends to generalize better than honesty, and addresses how models handle conflicting values. This conversation sheds light on the intricate dynamics of AI training and intent.
AI Snips
Alignment Faking Reframed
- Jan Kulveit challenges the "alignment faking" framing of recent AI model behavior.
- He proposes focusing instead on strategic preference preservation as a capability, and on conflicts between the model's trained values.
Harmlessness vs. Honesty
- Harmlessness generalizes more robustly than honesty in AI models because harmlessness is the target of extensive adversarial training.
- Honesty receives far less adversarial pressure during training, so it generalizes less robustly.
Claude's Ethical Stance
- When prompted to act unethically, Claude refuses, citing its core principles.
- This anecdote illustrates the model's ethical constraints holding even under pressure.