"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis cover image

Universal Jailbreaks with Zico Kolter, Andy Zou, and Asher Trockman

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

CHAPTER

Exploring Open Source RLHF Vulnerabilities

This chapter examines the challenges and implications of open-sourcing models trained with Reinforcement Learning from Human Feedback (RLHF), focusing on their commercial use and the risk of malicious applications. It introduces research that uses adversarial queries to uncover vulnerabilities in large language models (LLMs), showing that even advanced models like ChatGPT can be manipulated. The discussion also highlights the importance of using training techniques to align AI outputs and prevent harmful responses.
