The Inside View

Erik Jones on Automatically Auditing Large Language Models

Aug 11, 2023
Erik Jones, a PhD candidate at Berkeley, works on improving the safety and alignment of large language models. He discusses his paper on automatically auditing these models and the vulnerabilities that adversarial attacks can expose. Erik shares insights on the importance of discrete optimization and how it can reveal hidden model behaviors. He also delves into the implications of using language models for sensitive topics and the need for automated auditing methods to ensure AI systems are reliable and robust.