Coercing LLMs to Do and Reveal (Almost) Anything with Jonas Geiping - #678

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

CHAPTER

Safeguarding Language Models: Balancing Security and Functionality

This chapter explores the vulnerabilities of language models to adversarial attacks and the role of Reinforcement Learning from Human Feedback (RLHF) in improving their safety. It discusses the trade-offs between implementing protective measures and preserving model functionality, as well as the complexities introduced by collaborative systems. The conversation highlights the ongoing arms race between attackers and defenses in AI security, raising open questions about how model size affects susceptibility to manipulation.
