The Inside View

Ethan Perez–Inverse Scaling, Language Feedback, Red Teaming

Aug 24, 2022

Guest: Ethan Perez

Duration: 02:01:26

INSIGHT

Inverse Scaling Reveals Alignment Failures

The Inverse Scaling Prize identifies tasks where larger language models perform worse, exposing alignment failures.
These failures often arise because models amplify undesirable patterns present in their training data.

INSIGHT

Scaling Exposes Misalignment Early

Alignment failures worsen as models become more capable, revealing misalignment early.
Tracking loss and behavior trends during scaling can predict these failures before deployment.

ADVICE

Submit to Inverse Scaling Prize

Submit tasks demonstrating inverse scaling to compete for up to $100k in the Inverse Scaling Prize.
Ensure tasks are important, clearly demonstrate inverse scaling, and use multiple models.
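
As a rough illustration of what "clearly demonstrate inverse scaling" across multiple models could mean in practice, the sketch below fits a straight line to task accuracy against the logarithm of parameter count and flags a negative slope. This is not the Prize's official evaluation: the function names, the choice of a linear fit, and the numbers are all assumptions made for illustration.

```python
import math


def scaling_slope(param_counts, accuracies):
    """Least-squares slope of accuracy against log10(parameter count).

    Hypothetical helper for illustration; a negative slope means
    accuracy drops as models get larger.
    """
    if len(param_counts) != len(accuracies) or len(param_counts) < 2:
        raise ValueError("need at least two (size, accuracy) pairs")
    xs = [math.log10(n) for n in param_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(accuracies) / len(accuracies)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        raise ValueError("model sizes must not all be identical")
    return cov / var


def shows_inverse_scaling(param_counts, accuracies):
    """True when the fitted trend is downward (larger models do worse)."""
    return scaling_slope(param_counts, accuracies) < 0


# Made-up numbers, not results from any real model family.
sizes = [1e8, 1e9, 1e10, 1e11]
accs = [0.70, 0.62, 0.55, 0.41]

if __name__ == "__main__":
    print(f"slope per 10x parameters: {scaling_slope(sizes, accs):.3f}")
    print("inverse scaling:", shows_inverse_scaling(sizes, accs))
```

A single fitted slope is a crude summary. A real submission would also want enough examples per model and more than one model family before reading much into the trend.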

The Inverse Scaling Prize

01:31 • 2min

What Do You Mean by Offensive?

03:36 • 2min

The Scaling Laws of Scaling Up Models

05:16 • 5min

How Much Money Can You Make With Inverse Scaling?

10:13 • 3min

Is There Anybody Helping You With Submissions?

12:50 • 3min

Is There a Bias in the Model?

Is There a Catastrophic Alignment?

19:43 • 2min

Is Scaling an Inverse Scaling Task?

22:05 • 3min

Is There a Scaling Trend in GPT-3?

25:09 • 2min

Is There a Class of Alignment Failures?

26:48 • 5min

Is There a Difference in the Opt Optimization Process?

31:19 • 2min

Is There a Way to Measure Deception or Measure Lies?

33:20 • 3min

RL From Human Feedback

36:41 • 3min

Doing Back Flips in a Simulation Environment

39:15 • 1min

Is RL From Human Feedback Resulting in the Right Behavior?

40:43 • 3min

Is the RL From Human Feedback Optimized?

How to Add Instructions to a Language Model?

49:19 • 2min

Learning From Language Feedback

50:55 • 1min

Train Language Models to Do Next Word Prediction Tasks

52:19 • 3min

How to Do a Back Flip in a Language Model?

54:57 • 2min

Is There a Difference Between a 100 and a Hundred Bits?

56:57 • 2min

Is There a Final Refinement?

59:05 • 2min

Do You Know How to Get Better Results From Large Scale Data?

01:00:43 • 2min

The Goals of Learning From Human Feedback

01:02:44 • 2min

Are You Increasing Capacity as Much as It Increases Alignment?

Red Teaming Language Models

01:10:38 • 2min

Is It Robustly Optimizing the Objectives?

01:12:09 • 3min

Is the Output Malicious or Not?

01:15:39 • 2min

Using a Language Model for Evaluation?

01:17:26 • 2min

How to Predict Human Judgments?

01:18:59 • 2min

Using Red Teaming to Deploy a Language Model?

01:21:17 • 3min

Using a Language Model to Train a Red Teaming Team

01:23:53 • 2min

Using Red Teaming Processes in Cryptographic Encryption Schemes

01:25:42 • 2min

Using a Language Model to Generate a Chatbot

01:27:16 • 2min

How to Generate Test Cases?

Is There a Reward Signal for Using Prompt Engineering?

01:32:38 • 2min

How to Generate a Conversational Harm?

01:34:48 • 2min

A Conversational Red Teaming Approach

01:36:59 • 2min

Did Meta Fine-Tune BlenderBot to Ban Offensive Content?

01:38:59 • 2min

Is This a Distributional Bias Problem?

01:41:12 • 3min

Is the Paper Clip Maximizer Biased Toward Paper Clips?

01:44:40 • 3min

Can You Do Chain of Thought Prompting?

01:47:15 • 3min

Using Unlikelihood Training to Minimize the Probability of the Output Sequence

01:49:52 • 2min

KL Penalty, Just Using the KL Distance?

01:51:48 • 3min

How to Generate a Large Number of Examples in an Hour?

01:55:01 • 3min

How to Re-Implement Red Teaming

01:57:49 • 4min

Ethan Perez is a research scientist at Anthropic, working on large language models. He is the second Ethan working with large language models to come on the show, but in this episode we discuss why alignment, not scale, is what you need. We cover three projects he pursued before joining Anthropic: the Inverse Scaling Prize, Red Teaming Language Models with Language Models, and Training Language Models with Language Feedback.

Ethan Perez: https://twitter.com/EthanJPerez

Transcript: https://theinsideview.ai/perez

Host: https://twitter.com/MichaelTrazzi

OUTLINE

(00:00:00) Highlights

(00:00:20) Introduction

(00:01:41) The Inverse Scaling Prize

(00:06:20) The Inverse Scaling Hypothesis

(00:11:00) How To Submit A Solution

(00:20:00) Catastrophic Outcomes And Misalignment

(00:22:00) Submission Requirements

(00:27:16) Inner Alignment Is Not Out Of Distribution Generalization

(00:33:40) Detecting Deception With Inverse Scaling

(00:37:17) Reinforcement Learning From Human Feedback

(00:45:37) Training Language Models With Language Feedback

(00:52:38) How It Differs From InstructGPT

(00:56:57) Providing Information-Dense Feedback

(01:03:25) Why Use Language Feedback

(01:10:34) Red Teaming Language Models With Language Models

(01:20:17) The Classifier And Adversarial Training

(01:23:53) An Example Of Red-Teaming Failure

(01:27:47) Red Teaming Using Prompt Engineering

(01:32:58) Reinforcement Learning Methods

(01:41:53) Distributional Biases

(01:45:23) Chain of Thought Prompting

(01:49:52) Unlikelihood Training and KL Penalty

(01:52:50) Learning AI Alignment through the Inverse Scaling Prize

(01:59:33) Final thoughts on AI Alignment