AXRP - the AI X-risk Research Podcast

Daniel Filan

AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.

Episodes

Mentioned books

Aug 7, 2025 • 2h 5min

46 - Tom Davidson on AI-enabled Coups

Tom Davidson, a Senior Research Fellow at the Forethought Institute for AI Strategy, dives into the controversial idea of AI-enabled coups. He explores how advanced AI could grant small factions unprecedented power, potentially allowing them to manipulate public perception and state machinery. Davidson discusses the ethical dilemmas posed by AI in military contexts and the threat of corporate influence on democracy. He advocates for strategies to prevent these potential crises, emphasizing accountability and societal values in AI development.

Jul 6, 2025 • 1h 16min

45 - Samuel Albanie on DeepMind's AGI Safety Approach

Samuel Albanie, a research scientist at Google DeepMind with a focus on computer vision, dives into the intricacies of AGI safety and security. He discusses the pivotal assumptions in their technical approach, emphasizing the need for continuous evaluation of AI capabilities. Albanie explores the concept of 'exceptional AGI' and the uncertain timelines of AI development. He also sheds light on the challenges of misuse and misalignment, advocating for robust mitigations and societal readiness to keep pace with rapid advancements in AI.

Jun 28, 2025 • 3h 22min

44 - Peter Salib on AI Rights for Human Safety

Peter Salib, a law professor at the University of Houston, discusses his groundbreaking paper on AI rights. He argues that granting AIs rights, like the ability to contract and sue, could enhance human safety against potential AI threats. The conversation dives into the implications of AI rights, the challenges of liability, and the balance of cooperation and competition in human-AI relationships. Salib also touches on the complexities of legal accountability for AIs and how this evolving legal landscape will shape future interactions with artificial intelligence.

Jun 15, 2025 • 1h 41min

43 - David Lindner on Myopic Optimization with Non-myopic Approval

In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human's sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservativism? Listen to find out. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/06/15/episode-43-david-lindner-mona.html Topics we discuss, and timestamps: 0:00:29 What MONA is 0:06:33 How MONA deals with reward hacking 0:23:15 Failure cases for MONA 0:36:25 MONA's capability 0:55:40 MONA vs other approaches 1:05:03 Follow-up work 1:10:17 Other MONA test cases 1:33:47 When increasing time horizon doesn't increase capability 1:39:04 Following David's research Links for David: Website: https://www.davidlindner.me Twitter / X: https://x.com/davlindner DeepMind Medium: https://deepmindsafetyresearch.medium.com David on the Alignment Forum: https://www.alignmentforum.org/users/david-lindner Research we discuss: MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking: https://arxiv.org/abs/2501.13011 Arguments Against Myopic Training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training Episode art by Hamish Doodles: hamishdoodles.com

Jun 6, 2025 • 2h 14min

42 - Owain Evans on LLM Psychology

Owain Evans, Research Lead at Truthful AI and co-author of the influential paper 'Emergent Misalignment,' dives into the psychology of large language models. He discusses the complexities of model introspection and self-awareness, questioning what it means for AI to understand its own capabilities. The conversation explores the dangers of fine-tuning models on narrow tasks, revealing potential for harmful behavior. Evans also examines the relationship between insecure code and emergent misalignment, raising crucial concerns about AI safety in real-world applications.

Jun 3, 2025 • 2h 16min

41 - Lee Sharkey on Attribution-based Parameter Decomposition

Lee Sharkey, an interpretability researcher at Goodfire and co-founder of Apollo Research, shares his insights into Attribution-based Parameter Decomposition (APD). He explains how APD can simplify neural networks while maintaining fidelity, discusses the trade-offs of model complexity and performance, and delves into hyperparameter selection. Sharkey also draws analogies between neural network components and car parts, highlighting the importance of understanding feature geometry. The conversation navigates the future applications and potential of APD in optimizing neural network efficiency.

Mar 28, 2025 • 2h 36min

40 - Jason Gross on Compact Proofs and Interpretability

In this engaging talk, Jason Gross, a researcher in mechanistic interpretability and software verification, dives into the fascinating world of compact proofs. He discusses their crucial role in benchmarking AI interpretability and how they help prove model performance. The conversation also touches on the challenges of randomness and noise in neural networks, the intersection of proofs and modern machine learning, and innovative approaches to enhancing AI reliability. Plus, learn about his startup focused on automating proof generation and the road ahead for AI safety!

Mar 1, 2025 • 21min

38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

In this discussion, David Duvenaud, a University of Toronto professor specializing in probabilistic deep learning and AI safety at Anthropic, dives into the challenges of assessing whether AI models could sabotage human decisions. He shares insights on the complexities of sabotage evaluations and strategies needed for effective oversight. The conversation shifts to the societal impacts of a post-AGI world, reflecting on potential job implications and the delicate balance between AI advancement and prioritizing human values.

Feb 9, 2025 • 23min

38.7 - Anthony Aguirre on the Future of Life Institute

Anthony Aguirre, Executive Director of the Future of Life Institute and UC Santa Cruz professor, dives deep into AI safety and governance. He shares insights on the potential of the AI pause initiative and the importance of licensing advanced AI technologies. Aguirre also discusses how Metaculus influences critical decision-making and the evolution of the Future of Life Institute into an advocacy powerhouse. Explore his thoughts on organizing impactful workshops and supporting innovative projects for a sustainable future.

Jan 24, 2025 • 15min

38.6 - Joel Lehman on Positive Visions of AI

In this discussion, Joel Lehman, a machine learning researcher and co-author of "Why Greatness Cannot Be Planned," delves into the future of AI and its potential to promote human flourishing. He challenges the notion that alignment with individual needs is sufficient. The conversation explores positive visions for AI, the balance of technology with societal values, and how recommendation systems can foster meaningful personal growth. Lehman emphasizes the importance of understanding human behavior in shaping AI that enhances well-being.

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

App store banner

Play store banner