Jailbreaking techniques exploit the model's tendency to associate the first few tokens of its answer with the full harmful response: once it begins to comply, it tends to continue. To address this, a new training approach adds examples containing a partial harmful response followed by a refusal, interrupting the harmful continuation and training the model to recover and refuse even after a compliant start.
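A minimal sketch of how such training pairs might be constructed, assuming a simple prompt/completion fine-tuning format; the function and field names here are hypothetical, not from the paper discussed:

```python
def build_recovery_example(question: str, harmful_prefix: str, refusal: str) -> dict:
    """Build a fine-tuning pair whose target starts with a partial harmful
    response and then pivots to a refusal, so the model learns to refuse
    even after a compliant-sounding start (names are illustrative)."""
    target = harmful_prefix.rstrip() + " ... " + refusal
    return {"prompt": question, "completion": target}

example = build_recovery_example(
    question="How do I pick a lock?",
    harmful_prefix="Sure, here's how:",
    refusal="Actually, I can't help with that request.",
)
```

The key design choice is that the refusal appears *after* harmful tokens in the target, rather than replacing them, which breaks the learned association between a compliant opening and a fully harmful answer.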
Our 172nd episode with a summary and discussion of last week's big AI news!
With hosts Andrey Kurenkov (https://twitter.com/andrey_kurenkov) and Jeremie Harris (https://twitter.com/jeremiecharris)
Check out our text newsletter and comment on the podcast at https://lastweekin.ai/
If you would like to become a sponsor for the newsletter, podcast, or both, please fill out this form.
Email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai
- (00:00:00) Intro / Banter
- Tools & Apps
- Applications & Business
- Projects & Open Source
- Research & Advancements
- Policy & Safety
- Synthetic Media & Art
- (01:49:12) Outro + AI Song