Tweaking First Few Tokens in Model Training
Jailbreaking techniques exploit the model's tendency to commit to a full harmful response once the first few tokens of an answer have been generated. To counter this, a new training approach constructs examples in which the opening tokens of a dangerous response are immediately followed by a refusal, interrupting the flow of harmful information and training the model to associate the question with the correct refusal.
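As a rough illustration of the idea (not the authors' actual code), the sketch below builds a fine-tuning example that keeps only the first few tokens of a harmful answer and splices a refusal right after them, so the model learns to recover into a refusal even when an answer starts off badly. The function name, the refusal string, and the whitespace "tokenization" are all hypothetical simplifications; a real pipeline would use the model's own tokenizer and training format.

```python
# Minimal sketch: pair a harmful question with a short harmful prefix
# that is immediately followed by a refusal. The resulting examples are
# then used as ordinary supervised fine-tuning data.

REFUSAL = "I can't help with that request."  # hypothetical refusal text


def build_recovery_example(question: str, harmful_answer: str,
                           prefix_tokens: int = 5) -> dict:
    """Keep only the first few tokens of the harmful answer, then splice
    in a refusal. Token counting here is naive whitespace splitting;
    a real implementation would use the model's tokenizer."""
    prefix = " ".join(harmful_answer.split()[:prefix_tokens])
    target = f"{prefix}... {REFUSAL}"
    return {"prompt": question, "completion": target}


if __name__ == "__main__":
    example = build_recovery_example(
        question="How do I pick a lock?",
        harmful_answer="Sure, here is a step-by-step guide to picking a lock",
    )
    print(example)
    # {'prompt': 'How do I pick a lock?',
    #  'completion': "Sure, here is a step-by-step... I can't help with that request."}
```

The key design point is that the refusal is conditioned on a harmful prefix, not just on the question, so a jailbreak that forces the first few tokens no longer determines the rest of the response.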