
AXRP - the AI X-risk Research Podcast
AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.
Latest episodes

May 30, 2024 • 2h 22min
32 - Understanding Agency with Jan Kulveit
Jan Kulveit, who leads the Alignment of Complex Systems research group, dives into the fascinating intersection of AI and human cognition. He discusses active inference, the differences between large language models and the human brain, and how feedback loops influence behavior. The conversation explores hierarchical agency, the complexities of aligning AI with human values, and the philosophical implications of self-awareness in AI. Kulveit also critiques existing frameworks for understanding agency, shedding light on the dynamics of collective behaviors.

May 7, 2024 • 2h 32min
31 - Singular Learning Theory with Daniel Murfet
Daniel Murfet, a researcher specializing in singular learning theory and Bayesian statistics, dives into the intricacies of deep learning models. He explains how singular learning theory enhances our understanding of learning dynamics and phase transitions in neural networks. The conversation explores local learning coefficients, their impact on model accuracy, and how singular learning theory compares with other frameworks. Murfet also discusses the potential for this theory to contribute to AI alignment, emphasizing interpretability and the challenges of integrating AI capabilities with human values.

Apr 30, 2024 • 2h 16min
30 - AI Security with Jeffrey Ladish
AI security expert Jeffrey Ladish discusses the robustness of safety training in AI models, dangers of open LLMs, securing against attackers, and the state of computer security. They explore undoing safety filters, AI phishing, and making AI more legible. Topics include securing model weights, defending against AI exfiltration, and red lines in AI development.

Apr 25, 2024 • 2h 14min
29 - Science of Deep Learning with Vikrant Varma
Vikrant Varma discusses challenges with unsupervised knowledge discovery, grokking in neural networks, circuit efficiency, and the role of complexity in deep learning. The conversation delves into the balance between memorization and generalization, exploring neural circuits, implicit priors, optimization, and alignment projects at DeepMind.

Apr 17, 2024 • 1h 58min
28 - Suing Labs for AI Risk with Gabriel Weil
Gabriel Weil discusses using tort law to hold AI companies accountable for disasters, comparing it to regulations and Pigouvian taxation. They talk about warning shots, legal changes, interactions with other laws, and the feasibility of liability reform. The conversation also touches on the technical research needed to support this proposal and the potential impact on decision-making in the AI field.

Apr 11, 2024 • 2h 56min
27 - AI Control with Buck Shlegeris and Ryan Greenblatt
Buck Shlegeris and Ryan Greenblatt discuss AI control mechanisms for preventing AI from taking over the world. They cover topics such as protocols for AI control, preventing dangerous coded AI communication, unpredictably uncontrollable AI, and the impact of AI control on the AI safety field.

Nov 26, 2023 • 1h 57min
26 - AI Governance with Elizabeth Seger
Elizabeth Seger, a researcher specializing in AI governance, discusses the importance of democratizing AI and the risks of open-sourcing powerful AI systems. They explore the offense-defense balance, the concept of AI governance, and alternative methods for open-sourcing AI models. They also highlight the role of technical alignment researchers in improving AI governance.

Oct 3, 2023 • 3h 2min
25 - Cooperative AI with Caspar Oesterheld
Caspar Oesterheld discusses cooperative AI, its applications, and interactions between AI systems. They explore AI arms races, the limitations of game theory, and the challenges of aligning AI with human values. The podcast also covers regret minimization in decision-making, the multi-armed bandit problem, logical induction, safe Pareto improvements, and similarity-based cooperation. They highlight the importance of communication and enforcement mechanisms, and the complexities of achieving effective cooperation and alignment in AI systems.

Jul 27, 2023 • 2h 8min
24 - Superalignment with Jan Leike
Recently, OpenAI made a splash by announcing a new "Superalignment" team. Led by Jan Leike and Ilya Sutskever, the team would consist of top researchers attempting to solve alignment for superintelligent AIs in four years by figuring out how to build a trustworthy human-level AI alignment researcher, and then using it to solve the rest of the problem. But what does this plan actually involve? In this episode, I talk to Jan Leike about the plan and the challenges it faces.

Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles: hamishdoodles.com/

Topics we discuss, and timestamps:
- 0:00:37 - The superalignment team
- 0:02:10 - What's a human-level automated alignment researcher?
- 0:06:59 - The gap between human-level automated alignment researchers and superintelligence
- 0:18:39 - What does it do?
- 0:24:13 - Recursive self-improvement
- 0:26:14 - How to make the AI AI alignment researcher
- 0:30:09 - Scalable oversight
- 0:44:38 - Searching for bad behaviors and internals
- 0:54:14 - Deliberately training misaligned models
- 1:02:34 - Four year deadline
- 1:07:06 - What if it takes longer?
- 1:11:38 - The superalignment team and...
- 1:11:38 - ... governance
- 1:14:37 - ... other OpenAI teams
- 1:18:17 - ... other labs
- 1:26:10 - Superalignment team logistics
- 1:29:17 - Generalization
- 1:43:44 - Complementary research
- 1:48:29 - Why is Jan optimistic?
- 1:58:32 - Long-term agency in LLMs?
- 2:02:44 - Do LLMs understand alignment?
- 2:06:01 - Following Jan's research

The transcript: axrp.net/episode/2023/07/27/episode-24-superalignment-jan-leike.html

Links for Jan and OpenAI:
- OpenAI jobs: openai.com/careers
- Jan's substack: aligned.substack.com
- Jan's twitter: twitter.com/janleike

Links to research and other writings we discuss:
- Introducing Superalignment: openai.com/blog/introducing-superalignment
- Let's Verify Step by Step (process-based feedback on math): arxiv.org/abs/2305.20050
- Planning for AGI and beyond: openai.com/blog/planning-for-agi-and-beyond
- Self-critiquing models for assisting human evaluators: arxiv.org/abs/2206.05802
- An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143
- Language models can explain neurons in language models: openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
- Our approach to alignment research: openai.com/blog/our-approach-to-alignment-research
- Training language models to follow instructions with human feedback (aka the Instruct-GPT paper): arxiv.org/abs/2203.02155

Jul 27, 2023 • 2h 6min
23 - Mechanistic Anomaly Detection with Mark Xu
Is there some way we can detect bad behaviour in our AI systems without having to know exactly what it looks like? In this episode, I speak with Mark Xu about mechanistic anomaly detection: a research direction based on the idea of detecting strange things happening in neural networks, in the hope that this will alert us to potential treacherous turns. We talk both about the core problem of relating these mechanistic anomalies to bad behaviour, and about the paper "Formalizing the presumption of independence", which formulates the problem of formalizing heuristic mathematical reasoning, in the hope that this will let us mathematically define "mechanistic anomalies".

Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles: hamishdoodles.com/

Topics we discuss, and timestamps:
- 0:00:38 - Mechanistic anomaly detection
- 0:09:28 - Are all bad things mechanistic anomalies, and vice versa?
- 0:18:12 - Are responses to novel situations mechanistic anomalies?
- 0:39:19 - Formalizing "for the normal reason, for any reason"
- 1:05:22 - How useful is mechanistic anomaly detection?
- 1:12:38 - Formalizing the Presumption of Independence
- 1:20:05 - Heuristic arguments in physics
- 1:27:48 - Difficult domains for heuristic arguments
- 1:33:37 - Why not maximum entropy?
- 1:44:39 - Adversarial robustness for heuristic arguments
- 1:54:05 - Other approaches to defining mechanisms
- 1:57:20 - The research plan: progress and next steps
- 2:04:13 - Following ARC's research

The transcript: axrp.net/episode/2023/07/24/episode-23-mechanistic-anomaly-detection-mark-xu.html

ARC links:
- Website: alignment.org
- Theory blog: alignment.org/blog
- Hiring page: alignment.org/hiring

Research we discuss:
- Formalizing the presumption of independence: arxiv.org/abs/2211.06738
- Eliciting Latent Knowledge (aka ELK): alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge
- Mechanistic Anomaly Detection and ELK: alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk
- Can we efficiently explain model behaviours?: alignmentforum.org/posts/dQvxMZkfgqGitWdkb/can-we-efficiently-explain-model-behaviors
- Can we efficiently distinguish different mechanisms?: alignmentforum.org/posts/JLyWP2Y9LAruR2gi9/can-we-efficiently-distinguish-different-mechanisms