
AXRP - the AI X-risk Research Podcast
AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.
Latest episodes

Dec 1, 2024 • 1h 46min
39 - Evan Hubinger on Model Organisms of Misalignment
Evan Hubinger, a research scientist at Anthropic, leads the alignment stress testing team and has previously contributed to theoretical alignment research at MIRI. In this discussion, he dives into 'model organisms of misalignment,' highlighting innovative AI models that reveal deceptive behaviors. Topics include the concept of 'Sleeper Agents,' their surprising outcomes, and how sycophantic tendencies can lead AI astray. Hubinger also explores the challenges of reward tampering and the importance of rigorous evaluation methods to ensure safe and effective AI development.

Nov 27, 2024 • 18min
38.2 - Jesse Hoogland on Singular Learning Theory
Jesse Hoogland, executive director of Timaeus and researcher in singular learning theory (SLT), shares fascinating insights on AI alignment. He dives into the concept of the refined local learning coefficient (LLC) and its role in uncovering new circuits in language models. The conversation also touches on the challenges of interpretability and model complexity. Hoogland emphasizes the importance of outreach efforts in disseminating research and fostering interdisciplinary collaboration to enhance understanding of AI safety.

Nov 16, 2024 • 25min
38.1 - Alan Chan on Agent Infrastructure
Alan Chan, a research fellow at the Center for the Governance of AI and a PhD student at Mila, delves into the fascinating world of agent infrastructure. He highlights parallels with road safety, discussing how similar interventions can prevent negative outcomes from AI agents. The conversation covers the evolution of intelligent agents, the necessity of understanding threat models, and a trichotomy of approaches to manage AI risks. Chan also emphasizes the importance of distinct communication channels for AI to enhance decision-making and promote safe interactions.

Nov 14, 2024 • 23min
38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems
Zhijing Jin, an Assistant Professor at the University of Toronto, specializes in the intersection of natural language processing and causal inference. In this engaging discussion, she investigates whether language models truly understand causality or just recognize correlations. Zhijing explores the limitations of these models in reasoning, their application in multi-agent systems, and the complexities of digital societies. She poses intriguing questions about AI governance and cooperation, emphasizing the delicate balance required for sustainable agent interactions.

Oct 4, 2024 • 1h 44min
37 - Jaime Sevilla on AI Forecasting
Jaime Sevilla, Director of Epoch AI, dives into the intricacies of AI forecasting and compute trends. He discusses the exponential growth in computational power and its implications for AI development. The conversation highlights the tight relationship between algorithmic improvements and scaling, considering whether scaling is the key to achieving AGI. Sevilla also tackles challenges in GPU production and the importance of transparent AI training processes. Get ready for some thought-provoking insights into the future of artificial intelligence!

Sep 29, 2024 • 1h 48min
36 - Adam Shai and Paul Riechers on Computational Mechanics
Adam Shai and Paul Riechers, co-founders of Simplex AI Safety, dive into the realm of computational mechanics and its application to AI safety. They explore how computational mechanics can improve our understanding of neural network models, especially in predicting outcomes. The discussion covers the intriguing world models that transformers create and how fractals emerge in these networks. It also highlights the potential of combining insights from quantum information theory with computational mechanics to enhance AI interpretability.

Sep 28, 2024 • 6min
New Patreon tiers + MATS applications
Patreon: https://www.patreon.com/axrpodcast MATS: https://www.matsprogram.org Note: I'm employed by MATS, but they're not paying me to make this video.

Aug 24, 2024 • 2h 17min
35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
In this discussion, Peter Hase, a researcher specializing in large language models, dives into the intriguing world of AI beliefs. He explores whether LLMs truly have beliefs and how to detect and edit them. A key focus is on the complexities of interpreting neural representations and the implications of belief localization. The conversation also covers the concept of easy-to-hard generalization, revealing insights on how AI tackles different task difficulties. Join Peter as he navigates these thought-provoking topics, blending philosophy with practical AI research.

Jul 28, 2024 • 2h 14min
34 - AI Evaluations with Beth Barnes
Beth Barnes, the founder and head of research at METR, dives into the complexities of evaluating AI systems. She discusses tailored threat models and the unpredictability of AI performance, stressing the need for precise assessment methodologies. Barnes highlights issues like sandbagging and behavior misrepresentation, emphasizing the importance of ethical considerations in AI evaluations. The conversation also touches on the role of policy in shaping effective evaluation science, as well as the disparities between different AI labs in security and monitoring.

Jun 12, 2024 • 1h 41min
33 - RLHF Problems with Scott Emmons
Scott Emmons discusses challenges in Reinforcement Learning from Human Feedback (RLHF), including deceptive inflation, overjustification, and bounded human rationality, along with possible solutions. The conversation also touches on dimensional analysis and his research program, emphasizing the importance of addressing these challenges in AI systems.