The Inside View

Erik Jones on Automatically Auditing Large Language Models

Aug 11, 2023
Erik Jones, a PhD candidate at Berkeley, works on improving the safety and alignment of large language models. He discusses his paper on automatically auditing these models and the vulnerabilities that adversarial attacks can expose. Erik shares insights on the importance of discrete optimization and how it can reveal hidden model behaviors. He also delves into the implications of using language models for sensitive topics and the need for automated auditing methods to ensure AI systems are reliable and robust.