AXRP - the AI X-risk Research Podcast

Latest episodes

Dec 1, 2024 • 1h 46min

39 - Evan Hubinger on Model Organisms of Misalignment

Evan Hubinger, a research scientist at Anthropic, leads the alignment stress testing team and has previously contributed to theoretical alignment research at MIRI. In this discussion, he dives into 'model organisms of misalignment,' highlighting innovative AI models that reveal deceptive behaviors. Topics include the concept of 'Sleeper Agents,' their surprising outcomes, and how sycophantic tendencies can lead AI astray. Hubinger also explores the challenges of reward tampering and the importance of rigorous evaluation methods to ensure safe and effective AI development.
Nov 27, 2024 • 18min

38.2 - Jesse Hoogland on Singular Learning Theory

Jesse Hoogland, executive director of Timaeus and researcher in singular learning theory (SLT), shares fascinating insights on AI alignment. He dives into the concept of the refined local learning coefficient (LLC) and its role in uncovering new circuits in language models. The conversation also touches on the challenges of interpretability and model complexity. Hoogland emphasizes the importance of outreach efforts in disseminating research and fostering interdisciplinary collaboration to enhance understanding of AI safety.
Nov 16, 2024 • 25min

38.1 - Alan Chan on Agent Infrastructure

Alan Chan, a research fellow at the Centre for the Governance of AI and a PhD student at Mila, delves into the world of agent infrastructure. He draws parallels with road safety, discussing how analogous interventions could prevent negative outcomes from AI agents. The conversation covers the evolution of intelligent agents, the need to understand threat models, and a trichotomy of approaches to managing AI risks. Chan also emphasizes the importance of distinct communication channels for AI agents to support decision-making and safe interactions.
Nov 14, 2024 • 23min

38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems

Zhijing Jin, an Assistant Professor at the University of Toronto, specializes in the intersection of natural language processing and causal inference. In this engaging discussion, she investigates whether language models truly understand causality or just recognize correlations. Zhijing explores the limitations of these models in reasoning, their application in multi-agent systems, and the complexities of digital societies. She poses intriguing questions about AI governance and cooperation, emphasizing the delicate balance required for sustainable agent interactions.
Oct 4, 2024 • 1h 44min

37 - Jaime Sevilla on AI Forecasting

Jaime Sevilla, Director of Epoch AI, dives into the intricacies of AI forecasting and compute trends. He discusses the exponential growth in computational power and its implications for AI development. The conversation highlights the tight relationship between algorithmic improvements and scaling, considering whether scaling is the key to achieving AGI. Sevilla also tackles challenges in GPU production and the importance of transparent AI training processes. Get ready for some thought-provoking insights into the future of artificial intelligence!
Sep 29, 2024 • 1h 48min

36 - Adam Shai and Paul Riechers on Computational Mechanics

Adam Shai, co-founder of Simplex AI Safety, dives into the realm of computational mechanics and its application to AI safety. He explores how computational mechanics can improve our understanding of neural network models, especially in predicting outcomes. The discussion covers the intriguing world models that transformers create and how fractals emerge in these networks. Shai also highlights the potential of combining insights from quantum information theory with computational mechanics to enhance AI interpretability.
Sep 28, 2024 • 6min

New Patreon tiers + MATS applications

Patreon: https://www.patreon.com/axrpodcast
MATS: https://www.matsprogram.org
Note: I'm employed by MATS, but they're not paying me to make this video.
Aug 24, 2024 • 2h 17min

35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

In this discussion, Peter Hase, a researcher specializing in large language models, dives into the intriguing world of AI beliefs. He explores whether LLMs truly have beliefs and how to detect and edit them. A key focus is on the complexities of interpreting neural representations and the implications of belief localization. The conversation also covers the concept of easy-to-hard generalization, revealing insights on how AI tackles different task difficulties. Join Peter as he navigates these thought-provoking topics, blending philosophy with practical AI research.
Jul 28, 2024 • 2h 14min

34 - AI Evaluations with Beth Barnes

Beth Barnes, founder and head of research at METR, dives into the complexities of evaluating AI systems. She discusses tailored threat models and the unpredictability of AI performance, stressing the need for precise assessment methodologies. Barnes highlights issues like sandbagging and behavior misrepresentation, emphasizing the importance of ethical considerations in AI evaluations. The conversation also touches on the role of policy in shaping a science of evaluations, as well as disparities between AI labs in security and monitoring.
Jun 12, 2024 • 1h 41min

33 - RLHF Problems with Scott Emmons

Scott Emmons discusses challenges in Reinforcement Learning from Human Feedback (RLHF), including deceptive inflation, overjustification, and bounded human rationality, along with potential solutions. He also touches on dimensional analysis and his broader research program, emphasizing the importance of addressing these challenges in AI systems.