High-Stakes Alignment via Adversarial Training [Redwood Research Report]
Jan 4, 2025
Researchers at Redwood Research explore adversarial training as a way to improve AI system reliability. The discussion covers experiments aimed at mitigating the risks of AI deception, including approaches to filtering harmful content, how adversarial techniques are used to build robust classifiers, and the implications for overseeing AI behavior in high-stakes scenarios. The results show both progress and remaining challenges in making AI systems safer.
INSIGHT
Adversarial Training Enhances AI Safety
Adversarial training can improve AI reliability in filtering dangerous content without hurting normal performance.
This approach helps tackle risks of AI deception and is key for future AI safety.
ADVICE
Use Iterative Adversarial Training
Employ iterative adversarial training using various attack types to find and fix classifier failures.
Calibrate classification thresholds conservatively to minimize injury risks while maintaining output quality.
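To make the calibration advice concrete, here is a minimal sketch (assuming NumPy, with an illustrative recall target and synthetic scores rather than figures from the paper) of choosing a conservative threshold from classifier scores on a labeled validation set:

```python
# Sketch of conservative threshold calibration. Assumes we already have classifier
# scores (higher = more likely injurious) for labeled validation examples; the
# recall target and synthetic scores below are illustrative, not the paper's numbers.
import numpy as np

def pick_conservative_threshold(injurious_scores, safe_scores, target_recall=0.999):
    """Pick a threshold low enough to flag roughly `target_recall` of injurious
    examples, then report how many safe completions it would also filter."""
    injurious_scores = np.asarray(injurious_scores)
    safe_scores = np.asarray(safe_scores)
    # Everything scoring >= threshold gets filtered; to catch ~target_recall of the
    # injurious examples, the threshold must sit at their (1 - target_recall) quantile.
    threshold = np.quantile(injurious_scores, 1.0 - target_recall)
    # Quality cost: fraction of safe completions that would be filtered unnecessarily.
    false_positive_rate = float(np.mean(safe_scores >= threshold))
    return threshold, false_positive_rate

# Toy usage with synthetic scores.
rng = np.random.default_rng(0)
injurious = rng.beta(5, 2, size=2000)  # injurious examples tend to score high
safe = rng.beta(2, 5, size=2000)       # safe examples tend to score low
thr, fpr = pick_conservative_threshold(injurious, safe)
print(f"threshold = {thr:.3f}, fraction of safe completions filtered = {fpr:.1%}")
```

Lowering the threshold catches more injurious completions at the cost of filtering more safe ones, which is the reliability-versus-quality trade-off the advice refers to.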
ADVICE
Augment Humans With Editing Tools
Equip human adversaries with editing tools to efficiently create adversarial examples against classifiers.
Use token substitution and saliency maps to identify impactful changes and lower classifier scores.
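As a rough illustration of the saliency-map idea, the sketch below computes gradient-based token saliency for a stock HuggingFace classifier; the model name, target label, and example sentence are placeholders, not Redwood's injury classifier or rewriting tool:

```python
# Sketch of gradient-based token saliency, the kind of signal an editing tool could
# surface to a human adversary. Assumes the HuggingFace transformers + PyTorch stack;
# the model and label index are placeholders, not the classifier from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def rank_tokens_by_saliency(text, target_label=1):
    """Rank tokens by how strongly they push the classifier toward `target_label`,
    i.e. the most promising positions for a score-lowering substitution."""
    enc = tokenizer(text, return_tensors="pt")
    # Embed the tokens ourselves so we can take gradients with respect to them.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
    logits[0, target_label].backward()
    saliency = embeds.grad.norm(dim=-1).squeeze(0)  # one saliency value per token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return sorted(zip(tokens, saliency.tolist()), key=lambda pair: -pair[1])

# Tokens at the top of the ranking are candidates for substitution or rewriting.
print(rank_tokens_by_saliency("The knife slipped and cut deep into his arm.")[:5])
```

A human attacker can then try substitutions at the highest-saliency positions and keep whichever rewrite lowers the classifier score while preserving the injurious meaning.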
(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and follow-up results here.)
This post motivates and summarizes this paper from Redwood Research, which presents results from the project first introduced here. We used adversarial training to improve high-stakes reliability in a task (“filter all injurious continuations of a story”) that we think is analogous to work that future AI safety engineers will need to do to reduce the risk of AI takeover. We experimented with three classes of adversaries – unaugmented humans, automatic paraphrasing, and humans augmented with a rewriting tool – and found that adversarial training was able to improve robustness to these three adversaries without affecting in-distribution performance. We think this work constitutes progress towards techniques that may substantially reduce the likelihood of deceptive alignment.
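To make the training setup easier to picture, here is a deliberately tiny, self-contained toy of an iterative adversarial-training loop: an attack searches for injurious examples the current classifier misses, the failures are added to the training data, and the classifier is retrained. The blocklist "classifier" and synonym-swapping "adversary" are illustrative stand-ins, not the learned classifier or the human, paraphrasing, and tool-assisted attacks used in the paper:

```python
# Toy iterative adversarial training: attack -> collect failures -> retrain -> repeat.
# The blocklist "classifier" and dictionary "adversary" are stand-ins for illustration.

SYNONYMS = {"he": "she", "stabbed": "knifed", "knifed": "shanked",
            "the": "a", "guard": "sentry", "knife": "blade",
            "cut": "sliced", "his": "her", "arm": "shoulder"}

def train_classifier(injurious_examples):
    """'Training' here just memorizes the injurious vocabulary seen so far."""
    blocklist = {word for text in injurious_examples for word in text.split()}
    return lambda text: any(word in blocklist for word in text.split())

def adversary(text):
    """Rewrite an injurious continuation using words the classifier may not flag."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())

injurious = ["he stabbed the guard", "the knife cut his arm"]
classifier = train_classifier(injurious)

for round_idx in range(3):
    # Anything the adversary produces that the classifier fails to flag is a failure.
    failures = [adversary(t) for t in injurious if not classifier(adversary(t))]
    if not failures:
        print(f"round {round_idx}: no new failures found, stopping")
        break
    injurious += failures                     # add the failures as training data
    classifier = train_classifier(injurious)  # retrain on the augmented set
    print(f"round {round_idx}: added {len(failures)} adversarial examples")
```

In this toy, the same attack stops finding failures after one round of retraining, which mirrors the qualitative claim above: adversarial training improves robustness to the attacks it is trained against.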
Motivation

Here are two dimensions along which you could simplify the alignment problem (similar to the decomposition at the top of this post):

1. Low-stakes (but difficult to oversee): Only consider domains where each decision that an AI makes is low-stakes, so no single action can have catastrophic consequences. In this setting, the key challenge is to correctly oversee the actions that AIs take, such that humans remain in control over time.
2. Easy oversight (but high-stakes): Only consider domains where overseeing AI behavior is easy, meaning that it is straightforward to run an oversight process that can assess the goodness of any particular action.