The Nonlinear Library

AF - Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Sam Marks

Apr 18, 2024