

Beyond Guardrails: Defending LLMs Against Sophisticated Attacks
6 snips May 22, 2025
Jason Martin, an AI Security Researcher at HiddenLayer, delves into the world of AI vulnerabilities and defenses. He illuminates the concept of 'policy puppetry,' a technique that can bypass safety features in language models. The conversation highlights the challenges of AI safety, particularly in multimodal applications, and the importance of robust security measures for enterprises. They also tackle the complex interplay of biases in LLMs and the critical role of instruction hierarchy in shaping AI responses, stressing the need for careful model selection to mitigate risks.
AI Snips
Chapters
Transcript
Episode notes
Policy Puppetry Exploits Structured Policies
- Policy puppetry uses structured formats like XML or JSON to bypass safety rules in major language models.
- It exploits policies encoded as structured documents to suppress refusal responses and execute attacks universally.
Deploy Security Monitors for LLMs
- Use external security monitors to detect policy puppetry attacks on LLMs.
- Avoid relying solely on guardrails or instruction hierarchies for comprehensive AI security.
Fine-tuning Guides, Not Erases Knowledge
- Fine-tuning directs model behavior but does not erase pre-trained knowledge.
- Users can still exploit underlying knowledge despite reinforcement or supervised fine-tuning.