Beyond Guardrails: Defending LLMs Against Sophisticated Attacks

6 snips

May 22, 2025

Jason Martin, an AI Security Researcher at HiddenLayer, delves into the world of AI vulnerabilities and defenses. He illuminates the concept of 'policy puppetry,' a technique that can bypass safety features in language models. The conversation highlights the challenges of AI safety, particularly in multimodal applications, and the importance of robust security measures for enterprises. They also tackle the complex interplay of biases in LLMs and the critical role of instruction hierarchy in shaping AI responses, stressing the need for careful model selection to mitigate risks.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Policy Puppetry Exploits Structured Policies

Policy puppetry uses structured formats like XML or JSON to bypass safety rules in major language models.
It exploits policies encoded as structured documents to suppress refusal responses and execute attacks universally.

ADVICE

Deploy Security Monitors for LLMs

Use external security monitors to detect policy puppetry attacks on LLMs.
Avoid relying solely on guardrails or instruction hierarchies for comprehensive AI security.

INSIGHT

Fine-tuning Guides, Not Erases Knowledge

Fine-tuning directs model behavior but does not erase pre-trained knowledge.
Users can still exploit underlying knowledge despite reinforcement or supervised fine-tuning.

Get the Snipd Podcast app to discover more snips from this episode

Get the app