
AXRP - the AI X-risk Research Podcast
AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.
Latest episodes

Dec 2, 2021 • 2h 50min
12 - AI Existential Risk with Paul Christiano
Why would advanced AI systems pose an existential risk, and what would it look like to develop safer systems? In this episode, I interview Paul Christiano about his views of how AI could be so dangerous, what bad AI scenarios could look like, and what he thinks about various techniques to reduce this risk.

Topics we discuss, and timestamps:
- 00:00:38 - How AI may pose an existential threat
- 00:13:36 - AI timelines
- 00:24:49 - Why we might build risky AI
- 00:33:58 - Takeoff speeds
- 00:51:33 - Why AI could have bad motivations
- 00:56:33 - Lessons from our current world
- 01:08:23 - "Superintelligence"
- 01:15:21 - Technical causes of AI x-risk
- 01:19:32 - Intent alignment
- 01:33:52 - Outer and inner alignment
- 01:43:45 - Thoughts on agent foundations
- 01:49:35 - Possible technical solutions to AI x-risk
- 01:49:35 - Imitation learning, inverse reinforcement learning, and ease of evaluation
- 02:00:34 - Paul's favorite outer alignment solutions
- 02:01:20 - Solutions researched by others
- 02:06:13 - Decoupling planning from knowledge
- 02:17:18 - Factored cognition
- 02:25:34 - Possible solutions to inner alignment
- 02:31:56 - About Paul
- 02:31:56 - Paul's research style
- 02:36:36 - Disagreements and uncertainties
- 02:46:08 - Some favorite organizations
- 02:48:21 - Following Paul's work

The transcript: axrp.net/episode/2021/12/02/episode-12-ai-xrisk-paul-christiano.html
Paul's blog posts on AI alignment: ai-alignment.com

Material that we mention:
- Cold Takes - The Most Important Century: cold-takes.com/most-important-century
- Open Philanthropy reports on:
  - Modeling the human trajectory: openphilanthropy.org/blog/modeling-human-trajectory
  - The computational power of the human brain: openphilanthropy.org/blog/new-report-brain-computation
  - AI timelines (draft): alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines
  - Whether AI could drive explosive economic growth: openphilanthropy.org/blog/report-advanced-ai-drive-explosive-economic-growth
- Takeoff speeds: sideways-view.com/2018/02/24/takeoff-speeds
- Superintelligence: Paths, Dangers, Strategies: en.wikipedia.org/wiki/Superintelligence:_Paths,_Dangers,_Strategies
- Wei Dai on metaphilosophical competence:
  - Two neglected problems in human-AI safety: alignmentforum.org/posts/HTgakSs6JpnogD6c2/two-neglected-problems-in-human-ai-safety
  - The argument from philosophical difficulty: alignmentforum.org/posts/w6d7XBCegc96kz4n3/the-argument-from-philosophical-difficulty
  - Some thoughts on metaphilosophy: alignmentforum.org/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy
- AI safety via debate: arxiv.org/abs/1805.00899
- Iterated distillation and amplification: ai-alignment.com/iterated-distillation-and-amplification-157debfd1616
- Scalable agent alignment via reward modeling: a research direction: arxiv.org/abs/1811.07871
- Learning the prior: alignmentforum.org/posts/SL9mKhgdmDKXmxwE4/learning-the-prior
- Imitative generalisation (AKA 'learning the prior'): alignmentforum.org/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1
- When is unaligned AI morally valuable?: ai-alignment.com/sympathizing-with-ai-e11a4bf5ef6e

Sep 25, 2021 • 1h 28min
11 - Attainable Utility and Power with Alex Turner
Many scary stories about AI involve an AI system deceiving and subjugating humans in order to gain the ability to achieve its goals without us stopping it. This episode's guest, Alex Turner, will tell us about his research analyzing the notions of "attainable utility" and "power" that underlie these stories, so that we can better evaluate how likely they are and how to prevent them. (A toy sketch of the AUP penalty follows the links below.)

Topics we discuss:
- Side effects minimization
- Attainable Utility Preservation (AUP)
- AUP and alignment
- Power-seeking
- Power-seeking and alignment
- Future work and about Alex

The transcript: axrp.net/episode/2021/09/25/episode-11-attainable-utility-power-alex-turner.html
Alex on the AI Alignment Forum: alignmentforum.org/users/turntrout
Alex's Google Scholar page: scholar.google.com/citations?user=thAHiVcAAAAJ&hl=en&oi=ao
Conservative Agency via Attainable Utility Preservation: arxiv.org/abs/1902.09725
Optimal Policies Tend to Seek Power: arxiv.org/abs/1912.01683

Other works discussed:
- Avoiding Side Effects by Considering Future Tasks: arxiv.org/abs/2010.07877
- The "Reframing Impact" Sequence: alignmentforum.org/s/7CdoznhJaLEKHwvJW
- The "Risks from Learned Optimization" Sequence: alignmentforum.org/s/7CdoznhJaLEKHwvJW
- Concrete Approval-Directed Agents: ai-alignment.com/concrete-approval-directed-agents-89e247df7f1b
- Seeking Power is Convergently Instrumental in a Broad Class of Environments: alignmentforum.org/s/fSMbebQyR4wheRrvk/p/hzeLSQ9nwDkPc4KNt
- Formalizing Convergent Instrumental Goals: intelligence.org/files/FormalizingConvergentGoals.pdf
- The More Power at Stake, the Stronger Instrumental Convergence Gets for Optimal Policies: alignmentforum.org/posts/Yc5QSSZCQ9qdyxZF6/the-more-power-at-stake-the-stronger-instrumental
- Problem Relaxation as a Tactic: alignmentforum.org/posts/JcpwEKbmNHdwhpq5n/problem-relaxation-as-a-tactic
- How I do Research: lesswrong.com/posts/e3Db4w52hz3NSyYqt/how-i-do-research
- Math that Clicks: Look for Two-way Correspondences: lesswrong.com/posts/Lotih2o2pkR2aeusW/math-that-clicks-look-for-two-way-correspondences
- Testing the Natural Abstraction Hypothesis: alignmentforum.org/posts/cy3BhHrGinZCp3LXE/testing-the-natural-abstraction-hypothesis-project-intro
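To make "attainable utility" concrete, here is a minimal sketch of the AUP penalty from the "Conservative Agency" paper linked above: an action is penalized by how much it changes the agent's ability to earn reward under a set of auxiliary reward functions, relative to doing nothing. The Q-values, penalty weight, and the "tidy vs. smash the vase" scenario are toy numbers I made up for illustration.

```python
import numpy as np

def aup_penalty(q_aux, action, noop=0):
    """Mean |Q_i(s, a) - Q_i(s, noop)| over auxiliary reward functions i."""
    q_aux = np.asarray(q_aux)                      # shape: (n_aux, n_actions)
    return np.mean(np.abs(q_aux[:, action] - q_aux[:, noop]))

# Columns: [no-op, tidy the room, smash the vase]; rows: two auxiliary goals.
q_aux = [
    [5.0, 5.0, 1.0],   # auxiliary goal 1: e.g. 'the vase can still be displayed later'
    [3.0, 3.0, 3.0],   # auxiliary goal 2: unaffected either way
]
lam = 0.5  # penalty weight (a free hyperparameter)
task_reward = {"noop": 0.0, "tidy": 1.0, "smash": 1.0}

for idx, name in enumerate(["noop", "tidy", "smash"]):
    score = task_reward[name] - lam * aup_penalty(q_aux, idx)
    print(f"{name}: penalized reward = {score:.2f}")
# 'tidy' keeps attainable utility intact and wins; 'smash' earns the same task
# reward but pays a large AUP penalty for destroying future options.
```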

Jul 23, 2021 • 2h 3min
10 - AI's Future and Impacts with Katja Grace
When going about trying to ensure that AI does not cause an existential catastrophe, it's likely important to understand how AI will develop in the future, and why exactly it might or might not cause such a catastrophe. In this episode, I interview Katja Grace, researcher at AI Impacts, who's done work surveying AI researchers about when they expect superhuman AI to be reached, collecting data about how rapidly AI tends to progress, and thinking about the weak points in arguments that AI could be catastrophic for humanity.

Topics we discuss:
- 00:00:34 - AI Impacts and its research
- 00:08:59 - How to forecast the future of AI
- 00:13:33 - Results of surveying AI researchers
- 00:30:41 - Work related to forecasting AI takeoff speeds
- 00:31:11 - How long it takes AI to cross the human skill range
- 00:42:47 - How often technologies have discontinuous progress
- 00:50:06 - Arguments for and against fast takeoff of AI
- 01:04:00 - Coherence arguments
- 01:12:15 - Arguments that AI might cause existential catastrophe, and counter-arguments
- 01:13:58 - The size of the super-human range of intelligence
- 01:17:22 - The dangers of agentic AI
- 01:25:45 - The difficulty of human-compatible goals
- 01:33:54 - The possibility of AI destroying everything
- 01:49:42 - The future of AI Impacts
- 01:52:17 - AI Impacts vs academia
- 02:00:25 - What AI x-risk researchers do wrong
- 02:01:43 - How to follow Katja's and AI Impacts' work

The transcript: axrp.net/episode/2021/07/23/episode-10-ais-future-and-dangers-katja-grace.html
"When Will AI Exceed Human Performance? Evidence from AI Experts": arxiv.org/abs/1705.08807
AI Impacts page of more complete survey results: aiimpacts.org/2016-expert-survey-on-progress-in-ai
Likelihood of discontinuous progress around the development of AGI: aiimpacts.org/likelihood-of-discontinuous-progress-around-the-development-of-agi
Discontinuous progress investigation: aiimpacts.org/discontinuous-progress-investigation
The range of human intelligence: aiimpacts.org/is-the-range-of-human-intelligence-small

Jun 24, 2021 • 1h 39min
9 - Finite Factored Sets with Scott Garrabrant
Being an agent can get loopy quickly. For instance, imagine that we're playing chess and I'm trying to decide what move to make. Your next move influences the outcome of the game, and my guess of that influences my move, which influences your next move, which influences the outcome of the game. How can we model these dependencies in a general way, without baking in primitive notions of 'belief' or 'agency'? Today, I talk with Scott Garrabrant about his recent work on finite factored sets that aims to answer this question. (A small worked example of a finite factored set follows the links below.)

Topics we discuss:
- 00:00:43 - finite factored sets' relation to Pearlian causality and abstraction
- 00:16:00 - partitions and factors in finite factored sets
- 00:26:45 - orthogonality and time in finite factored sets
- 00:34:49 - using finite factored sets
- 00:37:53 - why not infinite factored sets?
- 00:45:28 - limits of, and follow-up work on, finite factored sets
- 01:00:59 - relevance to embedded agency and x-risk
- 01:10:40 - how Scott researches
- 01:28:34 - relation to Cartesian frames
- 01:37:36 - how to follow Scott's work

Link to the transcript: axrp.net/episode/2021/06/24/episode-9-finite-factored-sets-scott-garrabrant.html
Link to a transcript of Scott's talk on finite factored sets: alignmentforum.org/posts/N5Jm6Nj4HkNKySA5Z/finite-factored-sets
Scott's LessWrong account: lesswrong.com/users/scott-garrabrant

Other work mentioned in the discussion:
- Causality, by Judea Pearl: bayes.cs.ucla.edu/BOOK-2K
- Scott's work on Cartesian frames: alignmentforum.org/posts/BSpdshJWGAW6TuNzZ/introduction-to-cartesian-frames
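Here is a small worked example of the objects in a finite factored set, assuming the definitions from Scott's talk linked above: S is a finite set, a "factor" is a partition of S, every element of S is pinned down by its part in each factor, the "history" of a partition is the smallest set of factors whose values determine it, and two partitions are orthogonal when their histories are disjoint. The particular set and partitions are my own toy choices.

```python
from itertools import combinations, product

S = list(product([0, 1], repeat=2))               # S = {0,1} x {0,1}

def partition_by(key):
    """Partition S by a key function, returned as a list of frozensets."""
    parts = {}
    for s in S:
        parts.setdefault(key(s), set()).add(s)
    return [frozenset(p) for p in parts.values()]

factors = {
    "b1": partition_by(lambda s: s[0]),           # first coordinate
    "b2": partition_by(lambda s: s[1]),           # second coordinate
}
xor_bits = partition_by(lambda s: s[0] ^ s[1])    # a partition that depends on both

def part_of(partition, s):
    return next(p for p in partition if s in p)

def history(X):
    """Smallest set of factors whose values determine which part of X you're in."""
    names = list(factors)
    for r in range(len(names) + 1):
        for H in combinations(names, r):
            if all(
                part_of(X, s) == part_of(X, t)
                for s in S for t in S
                if all(part_of(factors[n], s) == part_of(factors[n], t) for n in H)
            ):
                return set(H)
    return set(names)

print(history(factors["b1"]))   # {'b1'}: b1 is weakly 'before' any partition whose history contains b1
print(history(xor_bits))        # {'b1', 'b2'}: the xor partition depends on both factors
# Orthogonality is disjointness of histories, so b1 and b2 are orthogonal:
print(history(factors["b1"]).isdisjoint(history(factors["b2"])))  # True
```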

Jun 8, 2021 • 2h 23min
8 - Assistance Games with Dylan Hadfield-Menell
How should we think about the technical problem of building smarter-than-human AI that does what we want? When and how should AI systems defer to us? Should they have their own goals, and how should those goals be managed? In this episode, Dylan Hadfield-Menell talks about his work on assistance games that formalizes these questions. The first couple years of my PhD program included many long conversations with Dylan that helped shape how I view AI x-risk research, so it was great to have another one in the form of a recorded interview. (A small numerical sketch of the off-switch game follows the links below.)

Link to the transcript: axrp.net/episode/2021/06/08/episode-8-assistance-games-dylan-hadfield-menell.html
Link to the paper "Cooperative Inverse Reinforcement Learning": arxiv.org/abs/1606.03137
Link to the paper "The Off-Switch Game": arxiv.org/abs/1611.08219
Link to the paper "Inverse Reward Design": arxiv.org/abs/1711.02827
Dylan's twitter account: twitter.com/dhadfieldmenell
Link to apply to the MIT EECS graduate program: gradapply.mit.edu/eecs/apply/login/?next=/eecs/

Other work mentioned in the discussion:
- The original paper on inverse optimal control: asmedigitalcollection.asme.org/fluidsengineering/article-abstract/86/1/51/392203/When-Is-a-Linear-Control-System-Optimal
- Justin Fu's research on, among other things, adversarial IRL: scholar.google.com/citations?user=T9To2C0AAAAJ&hl=en&oi=ao
- Preferences implicit in the state of the world: arxiv.org/abs/1902.04198
- What are you optimizing for? Aligning recommender systems with human values: participatoryml.github.io/papers/2020/42.pdf
- The Assistive Multi-Armed Bandit: arxiv.org/abs/1901.08654
- Soares et al. on Corrigibility: openreview.net/forum?id=H1bIT1buWH
- Should Robots be Obedient?: arxiv.org/abs/1705.09990
- Rodney Brooks on the Seven Deadly Sins of Predicting the Future of AI: rodneybrooks.com/the-seven-deadly-sins-of-predicting-the-future-of-ai/
- Products in category theory: en.wikipedia.org/wiki/Product_(category_theory)
- AXRP Episode 7 - Side Effects with Victoria Krakovna: axrp.net/episode/2021/05/14/episode-7-side-effects-victoria-krakovna.html
- Attainable Utility Preservation: arxiv.org/abs/1902.09725
- Penalizing side effects using stepwise relative reachability: arxiv.org/abs/1806.01186
- Simplifying Reward Design through Divide-and-Conquer: arxiv.org/abs/1806.02501
- Active Inverse Reward Design: arxiv.org/abs/1809.03060
- An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning: proceedings.mlr.press/v80/malik18a.html
- Incomplete Contracting and AI Alignment: arxiv.org/abs/1804.04268
- Multi-Principal Assistance Games: arxiv.org/abs/2007.09540
- Consequences of Misaligned AI: arxiv.org/abs/2102.03896
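The question of when an AI system should defer to us is captured neatly by the off-switch game from the paper linked above. Below is a minimal numerical sketch of its core comparison, assuming a rational human and a Gaussian belief over the human's utility for the robot's proposed action; the specific numbers are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(loc=0.2, scale=1.0, size=100_000)  # robot's belief over the human's utility

act_now        = U.mean()                 # act without asking: E[U]
switch_off     = 0.0                      # shut down immediately: utility 0
defer_to_human = np.maximum(U, 0).mean()  # wait: a rational human only allows positive-U actions

print(f"act now:        {act_now:.3f}")
print(f"switch off:     {switch_off:.3f}")
print(f"defer to human: {defer_to_human:.3f}")
# E[max(U, 0)] >= max(E[U], 0): as long as the robot is uncertain about U,
# it (weakly) prefers to keep the human's off switch in the loop.
```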

May 28, 2021 • 1min
7.5 - Forecasting Transformative AI from Biological Anchors with Ajeya Cotra
If you want to shape the development and forecast the consequences of powerful AI technology, it's important to know when it might appear. In this episode, I talk to Ajeya Cotra about her draft report "Forecasting Transformative AI from Biological Anchors" which aims to build a probabilistic model to answer this question. We talk about a variety of topics, including the structure of the model, what the most important parts are to get right, how the estimates should shape our behaviour, and Ajeya's current work at Open Philanthropy and perspective on the AI x-risk landscape. Unfortunately, there was a problem with the recording of our interview, so we weren't able to release it in audio form, but you can read a transcript of the whole conversation. (A toy Monte Carlo sketch in the spirit of such a model follows the links below.)

Link to the transcript: axrp.net/episode/2021/05/28/episode-7_5-forecasting-transformative-ai-ajeya-cotra.html
Link to the draft report "Forecasting Transformative AI from Biological Anchors": drive.google.com/drive/u/1/folders/15ArhEPZSTYU8f012bs6ehPS6-xmhtBPP
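To show what "a probabilistic model to answer this question" can look like, here is a toy Monte Carlo in the spirit of the bio-anchors framework: sample how much training compute a transformative model might need, compare it to a simple projection of affordable compute, and read off an arrival year. Every distribution and growth rate below is a made-up placeholder, not a parameter from Ajeya's report.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Required training compute (FLOP), sampled on a log scale (toy numbers).
log10_required = rng.normal(loc=30.0, scale=3.0, size=n)

# Affordable training compute: assume ~1e24 FLOP in 2020, growing ~0.3 orders
# of magnitude per year (a stand-in for combined spending and hardware trends).
log10_available_2020 = 24.0
oom_per_year = 0.3

years_needed = (log10_required - log10_available_2020) / oom_per_year
arrival_year = 2020 + np.clip(years_needed, 0, None)

for q in (0.1, 0.5, 0.9):
    print(f"{int(q * 100)}th percentile arrival year: {np.quantile(arrival_year, q):.0f}")
```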

May 14, 2021 • 1h 19min
7 - Side Effects with Victoria Krakovna
One way of thinking about how AI might pose an existential threat is to imagine it taking drastic actions to maximize its achievement of some objective function, such as taking control of the power supply or the world's computers. This might suggest a mitigation strategy of minimizing the degree to which AI systems have large effects on the world that are not absolutely necessary for achieving their objective. In this episode, Victoria Krakovna talks about her research on quantifying and minimizing side effects. Topics discussed include how one goes about defining side effects and the difficulties in doing so, her work using relative reachability and the ability to achieve future tasks as side effects measures, and what she thinks the open problems and difficulties are. (A minimal sketch of a reachability-based penalty follows the links below.)

Link to the transcript: axrp.net/episode/2021/05/14/episode-7-side-effects-victoria-krakovna.html
Link to the paper "Penalizing Side Effects Using Stepwise Relative Reachability": arxiv.org/abs/1806.01186
Link to the paper "Avoiding Side Effects by Considering Future Tasks": arxiv.org/abs/2010.07877
Victoria Krakovna's website: vkrakovna.wordpress.com
Victoria Krakovna's Alignment Forum profile: alignmentforum.org/users/vika

Work mentioned in the episode:
- Rohin Shah on the difficulty of finding a value-agnostic impact measure: lesswrong.com/posts/kCY9dYGLoThC3aG7w/best-reasons-for-pessimism-about-impact-of-impact-measures#qAy66Wza8csAqWxiB
- Stuart Armstrong's bucket of water example: lesswrong.com/posts/zrunBA8B5bmm2XZ59/reversible-changes-consider-a-bucket-of-water
- Attainable Utility Preservation: arxiv.org/abs/1902.09725
- Low Impact Artificial Intelligences: arxiv.org/abs/1705.10720
- AI Safety Gridworlds: arxiv.org/abs/1711.09883
- Test Cases for Impact Regularisation Methods: lesswrong.com/posts/wzPzPmAsG3BwrBrwy/test-cases-for-impact-regularisation-methods
- SafeLife: partnershiponai.org/safelife
- Avoiding Side Effects in Complex Environments: arxiv.org/abs/2006.06547
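As a taste of what a reachability-based side effects measure does, here is a minimal sketch on a toy deterministic graph, loosely following the stepwise relative reachability paper linked above: an action is penalized by the fraction of states that were reachable under a baseline (here, staying put) but are no longer reachable after the action. Real implementations use discounted reachability and an inaction rollout; this toy vase example uses plain 0/1 reachability.

```python
GRAPH = {
    "start":       ["vase_intact", "vase_broken"],   # breaking the vase is irreversible
    "vase_intact": ["goal_intact", "vase_broken"],   # the vase can still be broken later
    "vase_broken": ["goal_broken"],
    "goal_intact": [],
    "goal_broken": [],
}

def reachable(state):
    """All states reachable from `state` (including itself)."""
    seen, stack = set(), [state]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(GRAPH[s])
    return seen

def rr_penalty(state, baseline_state):
    """Fraction of baseline-reachable states that are no longer reachable."""
    base = reachable(baseline_state)
    lost = base - reachable(state)
    return len(lost) / len(base)

# Moving to "vase_broken" cuts off the whole intact branch relative to the
# inaction baseline of staying at "start"; moving to "vase_intact" only loses
# access to "start" itself.
print(rr_penalty("vase_broken", "start"))  # 0.6 (3 of 5 baseline-reachable states lost)
print(rr_penalty("vase_intact", "start"))  # 0.2 (only "start" itself is lost)
```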

Apr 8, 2021 • 1h 59min
6 - Debate and Imitative Generalization with Beth Barnes
One proposal to train AIs that can be useful is to have ML models debate each other about the answer to a human-provided question, where the human judges which side has won. In this episode, I talk with Beth Barnes about her thoughts on the pros and cons of this strategy, what she learned from seeing how humans behaved in debate protocols, and how a technique called imitative generalization can augment debate. (A bare-bones sketch of the debate protocol follows the links below.)

Those who are already quite familiar with the basic proposal might want to skip past the explanation of debate to 13:00, "what problems does it solve and does it not solve".

Link to Beth's posts on the Alignment Forum: alignmentforum.org/users/beth-barnes
Link to the transcript: axrp.net/episode/2021/04/08/episode-6-debate-beth-barnes.html
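For readers who want the shape of the basic proposal in one glance, here is a bare-bones sketch of the debate protocol from "AI safety via debate" (arxiv.org/abs/1805.00899): two models argue for opposing answers over a fixed number of rounds, and a human judge picks the winner from the transcript alone. The `ask_model` and `ask_judge` callables are hypothetical stand-ins for a model API and a human judging interface, not real library calls.

```python
from typing import Callable, List, Tuple

def run_debate(
    question: str,
    answers: Tuple[str, str],
    ask_model: Callable[[str, str, List[str]], str],
    ask_judge: Callable[[str, Tuple[str, str], List[str]], int],
    rounds: int = 3,
) -> str:
    transcript: List[str] = []
    for r in range(rounds):
        for i, answer in enumerate(answers):
            # Each debater sees the question, its assigned answer, and the
            # transcript so far, and contributes one argument per round.
            argument = ask_model(question, answer, transcript)
            transcript.append(f"Debater {i} (round {r}): {argument}")
    # The judge sees only the question, the candidate answers, and the
    # transcript -- not the debaters' underlying problem-solving work.
    winner = ask_judge(question, answers, transcript)
    return answers[winner]
```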

Mar 10, 2021 • 1h 24min
5 - Infra-Bayesianism with Vanessa Kosoy
The theory of sequential decision-making has a problem: how can we deal with situations where we have some hypotheses about the environment we're acting in, but its exact form might be outside the range of possibilities we can consider? Relatedly, how do we deal with situations where the environment can simulate what we'll do in the future, and put us in better or worse situations now depending on what we'll do then? Today's episode features Vanessa Kosoy talking about infra-Bayesianism, the mathematical framework she developed with Alex Appel that modifies Bayesian decision theory to succeed in these types of situations. (A tiny numerical sketch of the worst-case decision rule follows the links below.)

Link to the sequence of posts - Infra-Bayesianism: alignmentforum.org/s/CmrW8fCmSLK7E25sa
Link to the transcript: axrp.net/episode/2021/03/10/episode-5-infra-bayesianism-vanessa-kosoy.html
Vanessa Kosoy's Alignment Forum profile: alignmentforum.org/users/vanessa-kosoy
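One ingredient of the framework is replacing a single prior over environments with a set of distributions and ranking actions by their worst-case expected utility over that set. Here is a tiny numerical sketch of that maximin rule only; the full machinery of infradistributions and updates is much richer, and the environments, actions, and payoffs below are made up for illustration.

```python
import numpy as np

payoff = {            # payoff[action] = expected payoff in each of three environments
    "cautious": np.array([1.0, 1.0, 1.0]),
    "bold":     np.array([3.0, 2.0, -5.0]),
}
credal_set = [        # candidate distributions over the three environments
    np.array([0.5, 0.4, 0.1]),
    np.array([0.2, 0.3, 0.5]),
    np.array([0.1, 0.1, 0.8]),
]

def worst_case_value(action):
    """Worst-case expected payoff of `action` over the set of distributions."""
    return min(float(dist @ payoff[action]) for dist in credal_set)

best = max(payoff, key=worst_case_value)
for a in payoff:
    print(f"{a}: worst-case expected payoff = {worst_case_value(a):.2f}")
print(f"maximin choice: {best}")
# A single 'best guess' prior could favour 'bold'; the worst-case rule picks
# 'cautious' because some admissible distribution makes 'bold' very bad.
```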

Feb 17, 2021 • 2h 14min
4 - Risks from Learned Optimization with Evan Hubinger
In machine learning, optimization is typically done to produce a model that performs well according to some metric. Today's episode features Evan Hubinger talking about what happens when the learned model itself is doing optimization in order to perform well, how the goals of the learned model could differ from the goals we used to select the learned model, and what would happen if they did differ. (A toy illustration follows the links below.)

Link to the paper - Risks from Learned Optimization in Advanced Machine Learning Systems: arxiv.org/abs/1906.01820
Link to the transcript: axrp.net/episode/2021/02/17/episode-4-risks-from-learned-optimization-evan-hubinger.html
Evan Hubinger's Alignment Forum profile: alignmentforum.org/users/evhub
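To illustrate the worry in miniature: below, a base optimizer searches over simple "mesa-objectives", each of which induces a policy that plans (an inner optimization) toward that objective, and a proxy objective scores perfectly on the training distribution yet diverges from the base objective once the distribution shifts. This is my own toy construction, not an example from the paper.

```python
SIZE = 10  # a one-dimensional world of cells 0..9

def plan(mesa_objective, coin):
    """Inner optimization: pick the cell that maximizes the mesa-objective."""
    return max(range(SIZE), key=lambda cell: mesa_objective(cell, coin))

MESA_OBJECTIVES = {
    "seek the coin": lambda cell, coin: -abs(cell - coin),
    "go far right":  lambda cell, coin: cell,
    "go far left":   lambda cell, coin: -cell,
}

def base_objective(cell, coin):
    """The objective we actually select models by: did the agent reach the coin?"""
    return 1.0 if cell == coin else 0.0

# Training distribution: the coin always happens to sit at the rightmost cell,
# so the proxy 'go far right' is indistinguishable from 'seek the coin'.
train_levels = [SIZE - 1] * 20
train_scores = {
    name: sum(base_objective(plan(obj, coin), coin) for coin in train_levels)
    for name, obj in MESA_OBJECTIVES.items()
}
print("training scores:", train_scores)

# Deployment: the coin moves, and the proxy mesa-objective stops tracking the
# base objective even though training gave no signal to tell them apart.
test_coin = 2
print({name: base_objective(plan(obj, test_coin), test_coin)
       for name, obj in MESA_OBJECTIVES.items()})
```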