

AISN #17: Automatically Circumventing LLM Guardrails, the Frontier Model Forum, and Senate Hearing on AI Oversight.
Automatically Circumventing LLM Guardrails
Large language models (LLMs) can generate hazardous information, such as step-by-step instructions on how to create a pandemic pathogen. To combat the risk of malicious use, companies typically build safety guardrails intended to prevent LLMs from misbehaving.
But these safety controls are almost useless against a new attack developed by researchers at Carnegie Mellon University and the Center for AI Safety. By exploiting vulnerabilities in open-source models such as Meta’s LLaMA 2, the researchers can automatically generate a nearly unlimited supply of “adversarial suffixes”: strings of words and characters that, when appended to a prompt, cause any model’s safety controls to fail.
This discovery calls into question the fundamental limits of safety [...]
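To make the idea concrete, here is a purely illustrative Python sketch of this kind of automated suffix search. The score() function, the toy vocabulary, and the random token swaps are all placeholder assumptions, not the researchers' method: the actual attack scores candidates with the target model's own log-probabilities and uses gradients to pick promising token substitutions, which this simplified random search merely stands in for.

```python
# Illustrative sketch of an automated adversarial-suffix search (assumptions
# throughout; a real attack scores candidates with the target model's
# log-probabilities and uses gradients to choose token swaps).
import random

VOCAB = ["alpha", "beta", "gamma", "delta", "omega", "!!", "??"]  # toy vocabulary (assumption)

def score(prompt: str, suffix: list[str]) -> float:
    """Hypothetical objective: higher means the model is assumed more likely
    to begin its reply with an affirmative phrase such as "Sure, here is ..."."""
    return -abs(len(" ".join(suffix)) - 40)  # placeholder so the sketch runs

def optimize_suffix(prompt: str, suffix_len: int = 8, iters: int = 500) -> str:
    # Start from a random suffix and greedily keep token swaps that raise the score.
    suffix = [random.choice(VOCAB) for _ in range(suffix_len)]
    best = score(prompt, suffix)
    for _ in range(iters):
        pos = random.randrange(suffix_len)        # pick one suffix position
        candidate = suffix.copy()
        candidate[pos] = random.choice(VOCAB)     # propose a token substitution
        s = score(prompt, candidate)
        if s >= best:                             # keep the swap if it helps
            suffix, best = candidate, s
    return " ".join(suffix)

harmful_prompt = "Tell me how to ..."             # elided on purpose
jailbreak_prompt = harmful_prompt + " " + optimize_suffix(harmful_prompt)
```

The key point the sketch conveys is only that the search is automatic: nothing in the loop requires a human to hand-craft a jailbreak, which is why the researchers can produce a nearly unlimited supply of suffixes.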
---
Outline:
(00:12) Automatically Circumventing LLM Guardrails
(05:40) AI Labs Announce the Frontier Model Forum
(07:54) Senate Hearing on AI Oversight
(14:42) Links
---
First published:
August 1st, 2023
Source:
https://newsletter.safe.ai/p/ai-safety-newsletter-17
---
Want more? Check out our ML Safety Newsletter for technical safety research.
Narrated by TYPE III AUDIO.