
Coercing LLMs to Do and Reveal (Almost) Anything with Jonas Geiping - #678
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Safeguarding Language Models: Balancing Security and Functionality
This chapter explores the vulnerabilities of language models to adversarial attacks and the role of Reinforcement Learning from Human Feedback (RLHF) in improving their safety. It discusses the trade-offs between adding protective measures and preserving model functionality, as well as the complexities introduced by collaborative, multi-model systems. The conversation highlights the ongoing arms race between attackers and defenders in AI security, raising open questions about how model size affects susceptibility to manipulation.