

Ethan Perez
Researcher at Anthropic, leading the AI control team. Previously led the Adversarial Robustness team.
Top 3 podcasts with Ethan Perez
Ranked by the Snipd community

57 snips
Aug 24, 2022 • 2h 1min
Ethan Perez–Inverse Scaling, Language Feedback, Red Teaming
Ethan Perez is a research scientist at Anthropic working on large language models. He is the second Ethan working with large language models to come on the show, but in this episode we discuss why alignment, not scale, is actually what you need. We discuss three projects he pursued before joining Anthropic: the Inverse Scaling Prize, Red Teaming Language Models with Language Models, and Training Language Models with Language Feedback.
Ethan Perez: https://twitter.com/EthanJPerez
Transcript: https://theinsideview.ai/perez
Host: https://twitter.com/MichaelTrazzi
OUTLINE
(00:00:00) Highlights
(00:00:20) Introduction
(00:01:41) The Inverse Scaling Prize
(00:06:20) The Inverse Scaling Hypothesis
(00:11:00) How To Submit A Solution
(00:20:00) Catastrophic Outcomes And Misalignment
(00:22:00) Submission Requirements
(00:27:16) Inner Alignment Is Not Out Of Distribution Generalization
(00:33:40) Detecting Deception With Inverse Scaling
(00:37:17) Reinforcement Learning From Human Feedback
(00:45:37) Training Language Models With Language Feedback
(00:52:38) How It Differs From InstructGPT
(00:56:57) Providing Information-Dense Feedback
(01:03:25) Why Use Language Feedback
(01:10:34) Red Teaming Language Models With Language Models
(01:20:17) The Classifier And Adversarial Training
(01:23:53) An Example Of Red-Teaming Failure
(01:27:47) Red Teaming Using Prompt Engineering
(01:32:58) Reinforcement Learning Methods
(01:41:53) Distributional Biases
(01:45:23) Chain of Thought Prompting
(01:49:52) Unlikelihood Training And KL Penalty
(01:52:50) Learning AI Alignment Through The Inverse Scaling Prize
(01:59:33) Final Thoughts On AI Alignment

17 snips
Aug 9, 2023 • 36min
"Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research" by evhub, Nicholas Schiefer, Carson Denison, Ethan Perez
This episode discusses the importance of researching model organisms of misalignment to understand the causes of alignment failures in AI systems. It explores different strategies for model training and deployment, such as input tagging and evaluating outputs with a preference model. The risks associated with using model organisms in research, including deceptive alignment, are also examined.

10 snips
May 26, 2023 • 54min
Discovering AI Risks with AIs | Ethan Perez | EAG Bay Area 23
Ethan Perez discusses the risks of AI systems, including biases, offensive content, and faulty code. He explores evaluating data quality and addressing risks in language models, as well as alternative training methods. The episode also delves into the challenges of deception in AI models and the question of model identity, and touches on the impact of different training schemes and the importance of generalists in evaluation work.