

Ethan Perez–Inverse Scaling, Language Feedback, Red Teaming
Inverse Scaling Reveals Alignment Failures
- The Inverse Scaling Prize identifies tasks where larger language models perform worse, exposing alignment failures.
- These failures often arise because models amplify undesirable patterns present in their training data.
Scaling Exposes Misalignment Early
- Because these failures worsen as models become more capable, they can reveal misalignment early.
- Tracking loss and behavior trends during scaling can predict these failures before deployment.
Submit to Inverse Scaling Prize
- Submit tasks demonstrating inverse scaling to compete for up to $100k in the Inverse Scaling Prize.
- Ensure each task is important, clearly demonstrates inverse scaling, and shows the trend across multiple models.
Ethan Perez is a research scientist at Anthropic, working on large language models. He is the second Ethan working with large language models to come on the show, but in this episode we discuss why alignment, not scale, is actually what you need. We cover three projects he pursued before joining Anthropic: the Inverse Scaling Prize, Red Teaming Language Models with Language Models, and Training Language Models with Language Feedback.
Ethan Perez: https://twitter.com/EthanJPerez
Transcript: https://theinsideview.ai/perez
Host: https://twitter.com/MichaelTrazzi
OUTLINE
(00:00:00) Highlights
(00:00:20) Introduction
(00:01:41) The Inverse Scaling Prize
(00:06:20) The Inverse Scaling Hypothesis
(00:11:00) How To Submit A Solution
(00:20:00) Catastrophic Outcomes And Misalignment
(00:22:00) Submission Requirements
(00:27:16) Inner Alignment Is Not Out Of Distribution Generalization
(00:33:40) Detecting Deception With Inverse Scaling
(00:37:17) Reinforcement Learning From Human Feedback
(00:45:37) Training Language Models With Language Feedback
(00:52:38) How It Differs From InstructGPT
(00:56:57) Providing Information-Dense Feedback
(01:03:25) Why Use Language Feedback
(01:10:34) Red Teaming Language Models With Language Models
(01:20:17) The Classifier And Adversarial Training
(01:23:53) An Example Of Red-Teaming Failure
(01:27:47) Red Teaming Using Prompt Engineering
(01:32:58) Reinforcement Learning Methods
(01:41:53) Distributional Biases
(01:45:23) Chain of Thought Prompting
(01:49:52) Unlikelihood Training and KL Penalty
(01:52:50) Learning AI Alignment through the Inverse Scaling Prize
(01:59:33) Final Thoughts on AI Alignment