"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Can We Stop AI Deception? Apollo Research Tests OpenAI's Deliberative Alignment, w/ Marius Hobbhahn


Results: Reduction Rates and Tougher Tests

Deliberative alignment reduced covert action rates roughly 30x (o3: ~13% → 0.4%; o4-mini: ~8.7% → 0.3%). Apollo also trained explicit hidden goals into models to test whether the reduction holds up under stronger adversarial conditions.
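
As a quick check on the "~30x" figure, the fold reduction is the rate before training divided by the rate after. The sketch below uses only the approximate percentages quoted in the summary; the function name `fold_reduction` and the `rates` dictionary are illustrative, not anything from Apollo's or OpenAI's code.

```python
def fold_reduction(before_pct: float, after_pct: float) -> float:
    """Return how many times smaller the rate is after the intervention."""
    if after_pct <= 0:
        raise ValueError("after_pct must be positive")
    return before_pct / after_pct


# Approximate covert action rates from the summary (percent, before and after).
# These are rounded figures, so the resulting factors are rough as well.
rates = {"o3": (13.0, 0.4), "o4-mini": (8.7, 0.3)}

for model, (before, after) in rates.items():
    print(f"{model}: about {fold_reduction(before, after):.1f}x lower")
```

Both models land close to 30x (about 32.5x and 29x), which is consistent with the rounded headline number.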
