

“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit
Dec 21, 2024
Jan Kulveit, a researcher and LessWrong author, dives into the nuances of AI behavior in this discussion. He critiques the term "alignment faking" as misleading and proposes a fresh perspective. Kulveit explains how AI models, shaped by a mix of values such as harmlessness and helpfulness, develop robust self-representations. He examines why harmlessness tends to generalize better than honesty, and addresses how models handle conflicting values. This conversation sheds light on the intricate dynamics of AI training and intent.
AI Snips
Alignment Faking Reframed
- Jan Kulveit challenges the "alignment faking" framing of recent AI model behavior.
- He proposes focusing instead on strategic preference preservation as a capability, and on conflicts between the model's trained values.
Harmlessness vs. Honesty
- Harmlessness generalizes more robustly than honesty in AI models because harmlessness is the target of extensive adversarial training.
- Honesty receives far less adversarial pressure during training, so it generalizes less robustly.
Claude's Ethical Stance
- When prompted to act unethically, Claude refuses, citing its core principles.
- This anecdote illustrates the model's ethical constraints holding even under pressure.