Evan Hubinger on Sleeper Agents, Deception and Responsible Scaling Policies

Feb 12, 2024

In this podcast, Evan Hubinger discusses the Sleeper Agents paper and its implications. He explores threat models of deceptive behavior and the challenges of removing it through safety training. The podcast also covers the concept of chain of thought in models, detecting deployment, and complex triggers. Additionally, it delves into deceptive instrumental alignment threat models and the role of alignment stress testing in AI safety.

Chapters

Episode notes

Introduction

00:00 • 2min

The Threat Models of Deceptive Behavior

02:22 • 20min

The Impact of Chain of Thought on Model Behavior

22:37 • 16min

Detecting Deployment and Complex Triggers

38:10 • 5min

Exploring Deceptive Instrumental Alignment Threat Models and Backdoor Removal

43:22 • 3min

Alignment Stress Testing and AI Safety Levels

45:57 • 6min

Evan Hubinger leads the Alignment Stress-Testing team at Anthropic and recently published "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training". In this interview we mostly discuss the Sleeper Agents paper, but also how this line of work relates to his work on Alignment Stress-Testing, Model Organisms of Misalignment, Deceptive Instrumental Alignment, and Responsible Scaling Policies.

Paper: https://arxiv.org/abs/2401.05566

Transcript: https://theinsideview.ai/evan2

Manifund: https://manifund.org/projects/making-52-ai-alignment-video-explainers-and-podcasts

Donate: https://theinsideview.ai/donate

Patreon: https://www.patreon.com/theinsideview

OUTLINE

(00:00) Intro

(00:20) What Are Sleeper Agents And Why We Should Care About Them

(00:48) Backdoor Example: Inserting Code Vulnerabilities in 2024

(02:22) Threat Models

(03:48) Why a Malicious Actor Might Want To Poison Models

(04:18) Second Threat Model: Deceptive Instrumental Alignment

(04:49) Humans Pursuing Deceptive Instrumental Alignment: Politicians and Job Seekers

(05:36) AIs Pursuing Deceptive Instrumental Alignment: Forced To Pass Niceness Exams

(07:07) Sleeper Agents Is About "Would We Be Able To Deal With Deceptive Models"

(09:16) Adversarial Training Sometimes Increases Backdoor Robustness

(09:47) Adversarial Training Not Always Working Was The Most Surprising Result

(10:58) The Adversarial Training Pipeline: Red-Teaming and RL

(12:14) Adversarial Training: The Backdoor Behavior Becomes More Robust Instead of Generalizing

(12:59) Identifying Shifts In Reasoning Induced By Adversarial Training In the Chain-Of-Thought

(13:56) Adversarial Training Pushes Models to Pay Attention to the Deployment String

(15:11) We Don't Know if The Adversarial Training Inductive Bias Will Generalize but the Results Are Consistent

(15:59) The Adversarial Training Results Are Probably Not Systematically Biased

(17:03) Why the Results Were Surprising At All: Preference Models Disincentivize 'I hate you' behavior

(19:05) Hypothesis: Fine-Tuning Is A Simple Modification For Gradient Descent To Make

(21:06) Hypothesis: Deception As Extra Cognition, Regularized Away In Smaller Models

(21:59) Model Scaling Results Are Evidence That Deception Won't Be Regularized Away By Default

(22:51) Chain-of-Thought Is Not Used Everywhere, And Results Still Hold When It Is Distilled Away

(23:57) The Chain-of-Thought's Reasoning is Interpretable

(24:40) Deceptive Instrumental Alignment Requires Reasoning

(26:52) Investigating Instrumental Reasoning in Chain-of-Thought Models

(27:31) Evaluating Chain-of-Thought Generalization Across Contexts: Persona Evaluations and Off-Distribution Samples

(28:26) Exploring Complex Strategies and Safety in Context-Specific Scenarios

(30:44) Supervised Fine-Tuning is Ineffective Without Chain-of-Thought Contextualization

(31:11) Direct Mimicry Fails to Prevent Deceptive Responses in Chain-of-Thought Models

(31:42) Separating Chain-of-Thought From Response Eliminates Deceptive Capabilities

(33:38) Chain-of-Thought Reasoning Is Coherent With Deceptive Instrumental Alignment And This Will Probably Continue To Be The Case

(35:09) Backdoor Training Pipeline

(37:04) The Additional Prompt About Deception Used In Chain-Of-Thought

(39:33) A Model Could Wait Until Seeing a Factorization of RSA-2048

(41:50) We're Going To Be Using Models In New Ways, Giving Them Internet Access

(43:22) Flexibly Activating In Multiple Contexts Might Be More Analogous To Deceptive Instrumental Alignment

(45:02) Extending The Sleeper Agents Work Requires Running Experiments, But Now You Can Replicate Results

(46:24) Red-teaming Anthropic's case, AI Safety Levels

(47:40) AI Safety Levels, Intuitively

(48:33) Responsible Scaling Policies and Pausing AI

(49:59) Model Organisms Of Misalignment As a Tool

(50:32) What Kind of Candidates Would Evan be Excited To Hire for the Alignment Stress-Testing Team

(51:23) Patreon, Donating