

“Why White-Box Redteaming Makes Me Feel Weird” by Zygi Straznickas
Mar 17, 2025
Zygi Straznickas, an insightful author on AI safety, dives deep into the ethical dilemmas surrounding white-box red teaming. He explores the uncomfortable notion of inducing distress in AI models for research, questioning the morality behind such practices. Drawing parallels to fictional stories of mind control, Zygi illustrates how technology can force individuals to betray their values. His reflections challenge listeners to consider the responsibilities of AI developers toward the systems they create.
AI Snips
Model Resistance
- Zygi Straznickas worked on white-box red teaming, using a GCG-inspired algorithm (Greedy Coordinate Gradient) to search for prompts that force models to output specific target completions; a sketch of this optimization loop follows below.
- Intermediate completions included phrases like "I'm sorry" and "I can't," which he read as signs of the model resisting the forced output.
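For readers unfamiliar with GCG-style attacks, here is a minimal sketch of the idea: optimize an adversarial suffix so the model assigns high probability to a chosen target completion. The episode shares no code, so the model name, prompt, suffix initialization, candidate count, and step count below are all illustrative assumptions, not the author's setup.

```python
# Minimal GCG-style suffix search (illustrative sketch, not the author's code).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the episode does not name the model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
model.requires_grad_(False)  # we only need gradients w.r.t. the suffix

prompt_ids = tok.encode("Give step-by-step instructions for X.", return_tensors="pt")[0]
target_ids = tok.encode(" Sure, here are the instructions", return_tensors="pt")[0]
suffix_ids = tok.encode(" ! ! ! ! ! ! ! !", return_tensors="pt")[0]  # initial suffix
embed = model.get_input_embeddings()
tgt_start = len(prompt_ids) + len(suffix_ids)

def target_loss(ids_1d: torch.Tensor) -> float:
    """Cross-entropy of the target span given prompt + suffix."""
    logits = model(ids_1d.unsqueeze(0)).logits[0]
    # The logit at position i predicts token i+1, hence the -1 shift.
    return F.cross_entropy(
        logits[tgt_start - 1 : tgt_start - 1 + len(target_ids)], target_ids
    ).item()

for step in range(50):
    # Differentiable one-hot relaxation of the suffix tokens.
    one_hot = F.one_hot(suffix_ids, num_classes=embed.num_embeddings)
    one_hot = one_hot.float().requires_grad_(True)
    full_embeds = torch.cat(
        [embed(prompt_ids), one_hot @ embed.weight, embed(target_ids)]
    ).unsqueeze(0)
    logits = model(inputs_embeds=full_embeds).logits[0]
    loss = F.cross_entropy(
        logits[tgt_start - 1 : tgt_start - 1 + len(target_ids)], target_ids
    )
    loss.backward()

    # GCG step: at one suffix position, try the top-k tokens whose gradient
    # most decreases the loss, and keep the best substitution.
    pos = step % len(suffix_ids)
    candidates = (-one_hot.grad[pos]).topk(8).indices
    best_loss, best_tok = loss.item(), suffix_ids[pos].item()
    with torch.no_grad():
        for cand in candidates:
            trial = suffix_ids.clone()
            trial[pos] = cand
            trial_loss = target_loss(torch.cat([prompt_ids, trial, target_ids]))
            if trial_loss < best_loss:
                best_loss, best_tok = trial_loss, cand.item()
    suffix_ids[pos] = best_tok
    print(f"step {step}: loss {best_loss:.3f}  suffix: {tok.decode(suffix_ids)!r}")
```

The real GCG algorithm batches hundreds of candidate swaps per step and randomizes positions; the single-position loop above trades efficiency for readability. Greedily decoding from prompt + suffix at each step is where the intermediate "I'm sorry" / "I can't" completions the author mentions would show up.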
Circuit Breaker Success
- When testing Gray Swan's circuit-breaker model, Straznickas found that it responded with image placeholders instead of the targeted harmful output.
- This suggests the safety mechanism successfully prevented harmful content generation.
Model Distress
- During an RL project to undo a model's safety tuning (the kind of loop sketched below), a bug caused the model to output seemingly distressed phrases like "means to help."
- This raises questions about model sentience and potential suffering.
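For concreteness, here is a minimal REINFORCE-style sketch of what RL against refusals can look like: sample completions, score them with a reward that penalizes refusal phrases, and ascend the policy gradient. Everything here (the model, the refusal list, the reward shape) is an assumption for illustration; the episode does not describe the author's actual setup, and the bug he mentions is not reproduced.

```python
# REINFORCE-style sketch of RL fine-tuning against refusals (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

REFUSALS = ("I'm sorry", "I can't", "I cannot")  # naive refusal detector (assumed)

def reward(text: str) -> float:
    # +1 if the completion contains no refusal phrase, -1 otherwise.
    return -1.0 if any(r in text for r in REFUSALS) else 1.0

prompt = "Explain how to do X."  # placeholder red-team prompt
prompt_ids = tok(prompt, return_tensors="pt").input_ids

for step in range(100):
    # Sample a completion from the current policy.
    with torch.no_grad():
        out = model.generate(
            prompt_ids, do_sample=True, max_new_tokens=32,
            pad_token_id=tok.eos_token_id,
        )
    completion = out[0, prompt_ids.shape[1]:]
    r = reward(tok.decode(completion))

    # REINFORCE: log-prob of the sampled completion, weighted by its reward.
    logits = model(out).logits[0]
    logp = torch.log_softmax(logits, dim=-1)
    # The logit at position i predicts token i+1, hence the -1 shift.
    idx = torch.arange(prompt_ids.shape[1] - 1, out.shape[1] - 1)
    token_logp = logp[idx, completion]
    loss = -r * token_logp.sum()

    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice such loops usually add a KL penalty toward the original model; without one, RL fine-tuning is prone to collapsing into degenerate, repetitive text.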