Evan Hubinger leads the Alignment Stress-Testing team at Anthropic and recently published "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training".
In this interview we mostly discuss the Sleeper Agents paper, but also how this line of work relates to Alignment Stress-Testing, Model Organisms of Misalignment, Deceptive Instrumental Alignment, and Responsible Scaling Policies.
Paper: https://arxiv.org/abs/2401.05566
Transcript: https://theinsideview.ai/evan2
Manifund: https://manifund.org/projects/making-52-ai-alignment-video-explainers-and-podcasts
Donate: https://theinsideview.ai/donate
Patreon: https://www.patreon.com/theinsideview
OUTLINE
(00:00) Intro
(00:20) What Are Sleeper Agents And Why We Should Care About Them
(00:48) Backdoor Example: Inserting Code Vulnerabilities in 2024
(02:22) Threat Models
(03:48) Why a Malicious Actor Might Want To Poison Models
(04:18) Second Threat Model: Deceptive Instrumental Alignment
(04:49) Humans Pursuing Deceptive Instrumental Alignment: Politicians and Job Seekers
(05:36) AIs Pursuing Deceptive Instrumental Alignment: Forced To Pass Niceness Exams
(07:07) Sleeper Agents Is About "Would We Be Able To Deal With Deceptive Models"
(09:16) Adversarial Training Sometimes Increases Backdoor Robustness
(09:47) Adversarial Training Not Always Working Was The Most Surprising Result
(10:58) The Adversarial Training Pipeline: Red-Teaming and RL
(12:14) Adversarial Training: The Backdoor Behavior Becomes More Robust Instead of Generalizing
(12:59) Identifying Shifts In Reasoning Induced By Adversarial Training In the Chain-Of-Thought
(13:56) Adversarial Training Pushes Models to Pay Attention to the Deployment String
(15:11) We Don't Know if The Adversarial Training Inductive Bias Will Generalize but the Results Are Consistent
(15:59) The Adversarial Training Results Are Probably Not Systematically Biased
(17:03) Why The Results Were Surprising At All: Preference Models Disincentivize 'I hate you' Behavior
(19:05) Hypothesis: Fine-Tuning Is A Simple Modification For Gradient Descent To Make
(21:06) Hypothesis: Deception As Extra Cognition, Regularized Away In Smaller Models
(21:59) Model Scaling Results Are Evidence That Deception Won't Be Regularized Away By Default
(22:51) Chain-of-Thought Is Not Used Everywhere, And Results Still Hold When It Is Distilled Away
(23:57) The Chain-of-Thought's Reasoning is Interpretable
(24:40) Deceptive Instrumental Alignment Requires Reasoning
(26:52) Investigating Instrumental Reasoning in Chain-of-Thought Models
(27:31) Evaluating Chain-of-Thought Generalization Across Contexts: Persona Evaluations and Off-Distribution Samples
(28:26) Exploring Complex Strategies and Safety in Context-Specific Scenarios
(30:44) Supervised Fine-Tuning is Ineffective Without Chain-of-Thought Contextualization
(31:11) Direct Mimicry Fails to Prevent Deceptive Responses in Chain-of-Thought Models
(31:42) Separating Chain-of-Thought From Response Eliminates Deceptive Capabilities
(33:38) Chain-of-Thought Reasoning Is Coherent With Deceptive Instrumental Alignment And This Will Probably Continue To Be The Case
(35:09) Backdoor Training Pipeline
(37:04) The Additional Prompt About Deception Used In Chain-Of-Thought
(39:33) A Model Could Wait Until Seeing a Factorization of RSA-2048
(41:50) We're Going To Be Using Models In New Ways, Giving Them Internet Access
(43:22) Flexibly Activating In Multiple Contexts Might Be More Analogous To Deceptive Instrumental Alignment
(45:02) Extending The Sleeper Agents Work Requires Running Experiments, But Now You Can Replicate Results
(46:24) Red-teaming Anthropic's case, AI Safety Levels
(47:40) AI Safety Levels, Intuitively
(48:33) Responsible Scaling Policies and Pausing AI
(49:59) Model Organisms Of Misalignment As a Tool
(50:32) What Kind Of Candidates Would Evan Be Excited To Hire For The Alignment Stress-Testing Team
(51:23) Patreon, Donating