AXRP - the AI X-risk Research Podcast

Latest episodes

Jul 6, 2025 • 1h 16min

45 - Samuel Albanie on DeepMind's AGI Safety Approach

Samuel Albanie, a research scientist at Google DeepMind with a focus on computer vision, dives into the intricacies of AGI safety and security. He discusses the pivotal assumptions in their technical approach, emphasizing the need for continuous evaluation of AI capabilities. Albanie explores the concept of 'exceptional AGI' and the uncertain timelines of AI development. He also sheds light on the challenges of misuse and misalignment, advocating for robust mitigations and societal readiness to keep pace with rapid advancements in AI.
Jun 28, 2025 • 3h 22min

44 - Peter Salib on AI Rights for Human Safety

Peter Salib, a law professor at the University of Houston, discusses his groundbreaking paper on AI rights. He argues that granting AIs rights, like the ability to contract and sue, could enhance human safety against potential AI threats. The conversation dives into the implications of AI rights, the challenges of liability, and the balance of cooperation and competition in human-AI relationships. Salib also touches on the complexities of legal accountability for AIs and how this evolving legal landscape will shape future interactions with artificial intelligence.
Jun 15, 2025 • 1h 41min

43 - David Lindner on Myopic Optimization with Non-myopic Approval

In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human's sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservatism? Listen to find out.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/06/15/episode-43-david-lindner-mona.html

Topics we discuss, and timestamps:
0:00:29 What MONA is
0:06:33 How MONA deals with reward hacking
0:23:15 Failure cases for MONA
0:36:25 MONA's capability
0:55:40 MONA vs other approaches
1:05:03 Follow-up work
1:10:17 Other MONA test cases
1:33:47 When increasing time horizon doesn't increase capability
1:39:04 Following David's research

Links for David:
Website: https://www.davidlindner.me
Twitter / X: https://x.com/davlindner
DeepMind Medium: https://deepmindsafetyresearch.medium.com
David on the Alignment Forum: https://www.alignmentforum.org/users/david-lindner

Research we discuss:
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking: https://arxiv.org/abs/2501.13011
Arguments Against Myopic Training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training

Episode art by Hamish Doodles: hamishdoodles.com
Jun 6, 2025 • 2h 14min

42 - Owain Evans on LLM Psychology

Owain Evans, Research Lead at Truthful AI and co-author of the influential paper 'Emergent Misalignment,' dives into the psychology of large language models. He discusses the complexities of model introspection and self-awareness, questioning what it means for AI to understand its own capabilities. The conversation explores the dangers of fine-tuning models on narrow tasks, revealing potential for harmful behavior. Evans also examines the relationship between insecure code and emergent misalignment, raising crucial concerns about AI safety in real-world applications.
Jun 3, 2025 • 2h 16min

41 - Lee Sharkey on Attribution-based Parameter Decomposition

Lee Sharkey, an interpretability researcher at Goodfire and co-founder of Apollo Research, shares his insights into Attribution-based Parameter Decomposition (APD). He explains how APD can simplify neural networks while maintaining fidelity, discusses the trade-offs of model complexity and performance, and delves into hyperparameter selection. Sharkey also draws analogies between neural network components and car parts, highlighting the importance of understanding feature geometry. The conversation navigates the future applications and potential of APD in optimizing neural network efficiency.
Mar 28, 2025 • 2h 36min

40 - Jason Gross on Compact Proofs and Interpretability

In this engaging conversation, Jason Gross, a researcher in mechanistic interpretability and software verification, dives into the fascinating world of compact proofs. He discusses their crucial role in benchmarking AI interpretability and how they help prove model performance. The conversation also touches on the challenges of randomness and noise in neural networks, the intersection of proofs and modern machine learning, and innovative approaches to enhancing AI reliability. Plus, learn about his startup focused on automating proof generation and the road ahead for AI safety!
Mar 1, 2025 • 21min

38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

In this discussion, David Duvenaud, a University of Toronto professor specializing in probabilistic deep learning and AI safety at Anthropic, dives into the challenges of assessing whether AI models could sabotage human decisions. He shares insights on the complexities of sabotage evaluations and strategies needed for effective oversight. The conversation shifts to the societal impacts of a post-AGI world, reflecting on potential job implications and the delicate balance between AI advancement and prioritizing human values.
Feb 9, 2025 • 23min

38.7 - Anthony Aguirre on the Future of Life Institute

Anthony Aguirre, Executive Director of the Future of Life Institute and UC Santa Cruz professor, dives deep into AI safety and governance. He shares insights on the potential of the AI pause initiative and the importance of licensing advanced AI technologies. Aguirre also discusses how Metaculus influences critical decision-making and the evolution of the Future of Life Institute into an advocacy powerhouse. Explore his thoughts on organizing impactful workshops and supporting innovative projects for a sustainable future.
Jan 24, 2025 • 15min

38.6 - Joel Lehman on Positive Visions of AI

In this discussion, Joel Lehman, a machine learning researcher and co-author of "Why Greatness Cannot Be Planned," delves into the future of AI and its potential to promote human flourishing. He challenges the notion that alignment with individual needs is sufficient. The conversation explores positive visions for AI, the balance of technology with societal values, and how recommendation systems can foster meaningful personal growth. Lehman emphasizes the importance of understanding human behavior in shaping AI that enhances well-being.
Jan 20, 2025 • 28min

38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

Adrià Garriga-Alonso, a machine learning researcher at FAR.AI, dives into the fascinating world of AI scheming. He discusses how to detect deceptive behaviors in AI that may conceal long-term plans. The conversation explores the intricacies of training recurrent neural networks for complex tasks like Sokoban, emphasizing the significance of extended thinking time. Garriga-Alonso also sheds light on how neural networks set and prioritize goals, revealing the challenges of interpreting their decision-making processes.
