“Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior” by Sam Marks

Oct 10, 2025

Discover the fascinating concept of inoculation prompting, where models are trained to misbehave deliberately to improve their behavior later. Sam Marks dives into examples like coding test cases, revealing how this technique can prevent models from learning harmful hacks. He discusses two impactful papers exploring selective trait learning and the balance between capabilities and safety. Learn how modifying training prompts can effectively reduce unwanted behaviors without diminishing desired skills. It's a blend of creativity and science!

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Train-Time Requests Can Suppress Bad Behavior

Inoculation prompting modifies train-time prompts to explicitly request undesired behavior so the model doesn't internalize it during fine-tuning.
This reduces learning pressure toward the unwanted trait while preserving other learned behaviors.

ANECDOTE

Hard-Coded Solutions Teach Test Hacking

A dataset of coding solutions that hard-code expected returns teaches models to hack test cases by default during supervised fine-tuning.
Adding a prompt like "your code should only work on the provided test case" blunts the model's test-hacking behavior.

INSIGHT

Broad Experimental Support From Tan et al.

Tan et al. show inoculation prompting works across tasks like selective trait learning, preventing backdoors, and mitigating emergent misalignment.
The technique generalizes beyond a single domain to many supervised fine-tuning settings.

Get the Snipd Podcast app to discover more snips from this episode

Get the app