

AXRP - the AI X-risk Research Podcast
Daniel Filan
AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.
Episodes

Jun 8, 2021 • 2h 23min
8 - Assistance Games with Dylan Hadfield-Menell
How should we think about the technical problem of building smarter-than-human AI that does what we want? When and how should AI systems defer to us? Should they have their own goals, and how should those goals be managed? In this episode, Dylan Hadfield-Menell talks about his work on assistance games, a framework that formalizes these questions. The first couple of years of my PhD program included many long conversations with Dylan that helped shape how I view AI x-risk research, so it was great to have another one in the form of a recorded interview.
Link to the transcript: axrp.net/episode/2021/06/08/episode-8-assistance-games-dylan-hadfield-menell.html
Link to the paper "Cooperative Inverse Reinforcement Learning": arxiv.org/abs/1606.03137
Link to the paper "The Off-Switch Game": arxiv.org/abs/1611.08219
Link to the paper "Inverse Reward Design": arxiv.org/abs/1711.02827
Dylan's twitter account: twitter.com/dhadfieldmenell
Link to apply to the MIT EECS graduate program: gradapply.mit.edu/eecs/apply/login/?next=/eecs/
Other work mentioned in the discussion:
- The original paper on inverse optimal control: asmedigitalcollection.asme.org/fluidsengineering/article-abstract/86/1/51/392203/When-Is-a-Linear-Control-System-Optimal
- Justin Fu's research on, among other things, adversarial IRL: scholar.google.com/citations?user=T9To2C0AAAAJ&hl=en&oi=ao
- Preferences implicit in the state of the world: arxiv.org/abs/1902.04198
- What are you optimizing for? Aligning recommender systems with human values: participatoryml.github.io/papers/2020/42.pdf
- The Assistive Multi-Armed Bandit: arxiv.org/abs/1901.08654
- Soares et al. on Corrigibility: openreview.net/forum?id=H1bIT1buWH
- Should Robots be Obedient?: arxiv.org/abs/1705.09990
- Rodney Brooks on the Seven Deadly Sins of Predicting the Future of AI: rodneybrooks.com/the-seven-deadly-sins-of-predicting-the-future-of-ai/
- Products in category theory: en.wikipedia.org/wiki/Product_(category_theory)
- AXRP Episode 7 - Side Effects with Victoria Krakovna: axrp.net/episode/2021/05/14/episode-7-side-effects-victoria-krakovna.html
- Attainable Utility Preservation: arxiv.org/abs/1902.09725
- Penalizing side effects using stepwise relative reachability: arxiv.org/abs/1806.01186
- Simplifying Reward Design through Divide-and-Conquer: arxiv.org/abs/1806.02501
- Active Inverse Reward Design: arxiv.org/abs/1809.03060
- An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning: proceedings.mlr.press/v80/malik18a.html
- Incomplete Contracting and AI Alignment: arxiv.org/abs/1804.04268
- Multi-Principal Assistance Games: arxiv.org/abs/2007.09540
- Consequences of Misaligned AI: arxiv.org/abs/2102.03896
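For readers who want a concrete handle on the deference question above, here is a minimal Python sketch of the expected-value comparison at the heart of "The Off-Switch Game" (linked above): a robot that is uncertain about the utility of its plan can act now, shut itself down, or defer to a human who vetoes negative-utility plans. The belief distribution and numbers below are illustrative, not taken from the paper.

```python
# A minimal sketch of the off-switch game's expected-value comparison.
# The robot is uncertain about the utility u of its proposed action and can
# act now, switch itself off, or defer to a human who allows the action iff
# u > 0. The belief over u is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
u_samples = rng.normal(loc=0.0, scale=1.0, size=100_000)  # robot's belief over u

ev_act = u_samples.mean()                      # act immediately: get u
ev_off = 0.0                                   # switch off: get 0
ev_defer = np.maximum(u_samples, 0.0).mean()   # defer to a rational human: get max(u, 0)

print(f"E[act]   = {ev_act:+.3f}")
print(f"E[off]   = {ev_off:+.3f}")
print(f"E[defer] = {ev_defer:+.3f}")  # deferring comes out ahead here
```

Under these assumptions deferring weakly dominates acting, which is the paper's core observation; the advantage shrinks as the robot's uncertainty disappears or the human becomes less reliable.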

May 28, 2021 • 1min
7.5 - Forecasting Transformative AI from Biological Anchors with Ajeya Cotra
If you want to shape the development and forecast the consequences of powerful AI technology, it's important to know when it might appear. In this episode, I talk to Ajeya Cotra about her draft report "Forecasting Transformative AI from Biological Anchors", which aims to build a probabilistic model to answer this question. We talk about a variety of topics, including the structure of the model, which parts are most important to get right, how the estimates should shape our behaviour, and Ajeya's current work at Open Philanthropy and perspective on the AI x-risk landscape. Unfortunately, there was a problem with the recording of our interview, so we weren't able to release it in audio form, but you can read a transcript of the whole conversation.
Link to the transcript: axrp.net/episode/2021/05/28/episode-7_5-forecasting-transformative-ai-ajeya-cotra.html
Link to the draft report "Forecasting Transformative AI from Biological Anchors": drive.google.com/drive/u/1/folders/15ArhEPZSTYU8f012bs6ehPS6-xmhtBPP
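As a rough illustration of the report's overall shape (not its actual anchors or numbers), here is a toy Monte Carlo sketch in Python: draw the training compute required for transformative AI from a broad distribution, project how much compute will be affordable over time, and read off the probability that the requirement has been met by a given year. Every constant below is made up for illustration.

```python
# A toy sketch (with made-up numbers, not the report's) of the structure of
# a biological-anchors forecast: sample the training compute needed for
# transformative AI from a broad distribution, project available compute
# over time, and estimate P(enough compute by year Y).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical prior over required training FLOP (log10 scale).
log10_flop_required = rng.normal(loc=32.0, scale=4.0, size=n)

# Hypothetical projection of the largest affordable training run:
# 10^26 FLOP in 2025, growing by half an order of magnitude per year.
def log10_flop_available(year):
    return 26.0 + 0.5 * (year - 2025)

for year in (2030, 2040, 2050, 2060):
    p = (log10_flop_required <= log10_flop_available(year)).mean()
    print(f"P(enough compute by {year}) ~= {p:.2f}")
```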

May 14, 2021 • 1h 19min
7 - Side Effects with Victoria Krakovna
One way of thinking about how AI might pose an existential threat is that it could take drastic actions to maximize some objective function, such as taking control of the power supply or the world's computers. This might suggest a mitigation strategy of minimizing the degree to which AI systems have large effects on the world that are not absolutely necessary for achieving their objective. In this episode, Victoria Krakovna talks about her research on quantifying and minimizing side effects. Topics discussed include how one goes about defining side effects and the difficulties in doing so, her work using relative reachability and the ability to achieve future tasks as side-effect measures, and what she thinks the open problems and difficulties are.
Link to the transcript: axrp.net/episode/2021/05/14/episode-7-side-effects-victoria-krakovna.html
Link to the paper "Penalizing Side Effects Using Stepwise Relative Reachability": arxiv.org/abs/1806.01186
Link to the paper "Avoiding Side Effects by Considering Future Tasks": arxiv.org/abs/2010.07877
Victoria Krakovna's website: vkrakovna.wordpress.com
Victoria Krakovna's Alignment Forum profile: alignmentforum.org/users/vika
Work mentioned in the episode:
- Rohin Shah on the difficulty of finding a value-agnostic impact measure: lesswrong.com/posts/kCY9dYGLoThC3aG7w/best-reasons-for-pessimism-about-impact-of-impact-measures#qAy66Wza8csAqWxiB
- Stuart Armstrong's bucket of water example: lesswrong.com/posts/zrunBA8B5bmm2XZ59/reversible-changes-consider-a-bucket-of-water
- Attainable Utility Preservation: arxiv.org/abs/1902.09725
- Low Impact Artificial Intelligences: arxiv.org/abs/1705.10720
- AI Safety Gridworlds: arxiv.org/abs/1711.09883
- Test Cases for Impact Regularisation Methods: lesswrong.com/posts/wzPzPmAsG3BwrBrwy/test-cases-for-impact-regularisation-methods
- SafeLife: partnershiponai.org/safelife
- Avoiding Side Effects in Complex Environments: arxiv.org/abs/2006.06547
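To make the relative reachability idea slightly more concrete, here is a heavily simplified Python sketch: compare which states remain reachable after the agent's action with those reachable under an inaction baseline, and penalize the difference. The paper's actual measure uses discounted reachability and a stepwise inaction baseline; the environment and penalty below are purely illustrative.

```python
# A heavily simplified reachability-based side-effect penalty: count how many
# states the agent's action made unreachable relative to inaction. The toy
# environment has one irreversible action (breaking a vase).
from collections import deque

def reachable(start, transitions):
    """Set of states reachable from `start` in a transition graph."""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

transitions = {
    "vase_intact": ["vase_intact", "vase_broken"],
    "vase_broken": ["vase_broken"],   # breaking the vase can't be undone
}

baseline_state = "vase_intact"   # where inaction would have left the world
actual_state = "vase_broken"     # where the agent's action left the world

lost = reachable(baseline_state, transitions) - reachable(actual_state, transitions)
print(f"states made unreachable: {lost}")
print(f"side-effect penalty: {len(lost)}")
```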

Apr 8, 2021 • 1h 59min
6 - Debate and Imitative Generalization with Beth Barnes
One proposal for training useful AI systems is to have ML models debate each other about the answer to a human-provided question, with a human judging which side has won. In this episode, I talk with Beth Barnes about her thoughts on the pros and cons of this strategy, what she learned from seeing how humans behaved in debate protocols, and how a technique called imitative generalization can augment debate. Those who are already quite familiar with the basic proposal might want to skip past the explanation of debate to 13:00, "what problems does it solve and does it not solve".
Link to Beth's posts on the Alignment Forum: alignmentforum.org/users/beth-barnes
Link to the transcript: axrp.net/episode/2021/04/08/episode-6-debate-beth-barnes.html
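As a schematic picture of the protocol (not Beth's implementation), here is a Python sketch of a single debate: two debaters take turns adding arguments to a transcript and a judge then picks a winner. The debater and judge functions below are hypothetical placeholders; a real setup would use trained models and a human judge.

```python
# A schematic sketch of the debate protocol with trivial placeholder
# functions standing in for the ML debaters and the human judge. The point
# is only the shape of the interaction, not the content of the arguments.
from typing import Callable, List, Tuple

Debater = Callable[[str, List[str]], str]   # (question, transcript) -> argument
Judge = Callable[[str, List[str]], int]     # (question, transcript) -> winner (0 or 1)

def run_debate(question: str, debaters: Tuple[Debater, Debater],
               judge: Judge, rounds: int = 3) -> int:
    transcript: List[str] = []
    for _ in range(rounds):
        for i, debater in enumerate(debaters):
            transcript.append(f"Debater {i}: {debater(question, transcript)}")
    return judge(question, transcript)

# Hypothetical placeholder participants.
alice: Debater = lambda q, t: "The answer is yes, because of reason A."
bob: Debater = lambda q, t: "The answer is no, and reason A has a flaw."
lazy_judge: Judge = lambda q, t: 1   # always believes the last rebuttal

winner = run_debate("Is the claim true?", (alice, bob), lazy_judge)
print(f"Judge declares debater {winner} the winner")
```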

Mar 10, 2021 • 1h 24min
5 - Infra-Bayesianism with Vanessa Kosoy
The theory of sequential decision-making has a problem: how can we deal with situations where we have some hypotheses about the environment we're acting in, but its exact form might be outside the range of hypotheses we're able to consider? Relatedly, how do we deal with situations where the environment can simulate what we'll do in the future, and put us in better or worse situations now depending on what we'll do then? Today's episode features Vanessa Kosoy talking about infra-Bayesianism, the mathematical framework she developed with Alex Appel that modifies Bayesian decision theory to succeed in these types of situations.
Link to the sequence of posts - Infra-Bayesianism: alignmentforum.org/s/CmrW8fCmSLK7E25sa
Link to the transcript: axrp.net/episode/2021/03/10/episode-5-infra-bayesianism-vanessa-kosoy.html
Vanessa Kosoy's Alignment Forum profile: alignmentforum.org/users/vanessa-kosoy
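One small piece of the picture, sketched in Python below, is the maximin decision rule: instead of averaging over a single prior, the agent entertains a set of environments it cannot distinguish and picks the action whose worst-case expected utility is highest. This only captures the Knightian-uncertainty flavour of infra-Bayesianism; the actual framework (infra-distributions, updating, and so on) is much richer, and the utility table below is made up.

```python
# A minimal sketch of maximin choice over a set of candidate environments:
# pick the action with the best worst-case expected utility. This is a
# drastic simplification of infra-Bayesian decision-making.
import numpy as np

# Rows: candidate environments the agent can't rule out.
# Columns: actions. Entries: expected utility of that action in that environment.
utilities = np.array([
    [1.0, 0.6, 0.2],   # environment A
    [0.0, 0.5, 0.9],   # environment B
    [0.2, 0.6, 0.1],   # environment C
])

worst_case = utilities.min(axis=0)        # worst case over environments, per action
best_action = int(worst_case.argmax())    # maximin choice
print(f"worst-case utilities per action: {worst_case}")
print(f"maximin action: {best_action}")
```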

Feb 17, 2021 • 2h 14min
4 - Risks from Learned Optimization with Evan Hubinger
In machine learning, optimization is typically done to produce a model that performs well according to some metric. Today's episode features Evan Hubinger talking about what happens when the learned model is itself doing optimization in order to perform well, how the goals of the learned model could differ from the goals we used to select it, and what would happen if they did differ.
Link to the paper - Risks from Learned Optimization in Advanced Machine Learning Systems: arxiv.org/abs/1906.01820
Link to the transcript: axrp.net/episode/2021/02/17/episode-4-risks-from-learned-optimization-evan-hubinger.html
Evan Hubinger's Alignment Forum profile: alignmentforum.org/users/evhub

Dec 11, 2020 • 58min
3 - Negotiable Reinforcement Learning with Andrew Critch
In this episode, I talk with Andrew Critch about negotiable reinforcement learning: what happens when two people (or organizations, or what have you) who have different beliefs and preferences jointly build some agent that will take actions in the real world. In the paper we discuss, it's proven that the only way to make such an agent Pareto optimal - that is, have it not be the case that there's a different agent that both people would prefer to use instead - is to have it preferentially optimize the preferences of whoever's beliefs were more accurate. We discuss his motivations for working on the problem and what he thinks about it.
Link to the paper - Negotiable Reinforcement Learning for Pareto Optimal Sequential Decision-Making: papers.nips.cc/paper/2018/hash/5b8e4fd39d9786228649a8a8bec4e008-Abstract.html
Link to the transcript: axrp.net/episode/2020/12/11/episode-3-negotiable-reinforcement-learning-andrew-critch.html
Critch's Google Scholar profile: scholar.google.com/citations?user=F3_yOXUAAAAJ&hl=en&oi=ao
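One way to picture the result (my paraphrase, not the paper's notation) is sketched in Python below: the agent acts as if it maximizes a weighted sum of the two principals' utilities, and each principal's weight gets rescaled by how well their beliefs predicted what actually happened, so the better predictor's preferences come to dominate. The weights and likelihoods here are illustrative.

```python
# A toy sketch of the weight dynamics behind the Pareto-optimality result:
# each principal's bargaining weight is multiplied by the probability their
# beliefs assigned to the observations that actually occurred.
import numpy as np

weights = np.array([0.5, 0.5])           # initial bargaining weights

# Each principal's predicted probability for the observed outcomes (made up).
likelihoods = np.array([
    0.8 * 0.7,   # principal 1's beliefs fit the observations well
    0.3 * 0.4,   # principal 2's beliefs fit them poorly
])

weights = weights * likelihoods
weights /= weights.sum()                 # renormalize for readability
print(f"updated weights: {weights}")     # the better predictor's utility now dominates
```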

Dec 11, 2020 • 1h 9min
2 - Learning Human Biases with Rohin Shah
One approach to creating useful AI systems is to watch humans doing a task, infer what they're trying to do, and then try to do that well. The simplest way to infer what the humans are trying to do is to assume there's one goal that they share, and that they're optimally achieving the goal. This has the problem that humans aren't actually optimal at achieving the goals they pursue. We could instead code in the exact way in which humans behave suboptimally, except that we don't know that either. In this episode, I talk with Rohin Shah about his paper about learning the ways in which humans are suboptimal at the same time as learning what goals they pursue: why it's hard, how he tried to do it, how well he did, and why it matters.
Link to the paper - On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference: arxiv.org/abs/1906.09624
Link to the transcript: axrp.net/episode/2020/12/11/episode-2-learning-human-biases-rohin-shah.html
The Alignment Newsletter: rohinshah.com/alignment-newsletter
Rohin's contributions to the AI alignment forum: alignmentforum.org/users/rohinmshah
Rohin's website: rohinshah.com
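For contrast with the paper's approach, here is a Python sketch of the "assume a bias model" baseline discussed above: treat the human as Boltzmann-rational with a known temperature (noisily favouring higher-reward options) and pick whichever candidate reward makes the observed choices most likely. The choices, candidate rewards, and temperature are all made up; the paper's question is what happens when the bias model itself has to be learned rather than assumed.

```python
# Reward inference under an *assumed* Boltzmann-rational bias model:
# compute the likelihood of observed human choices under each candidate
# reward and keep the best one. A hand-coded bias model like this is the
# baseline the paper's learned-bias approach is compared against.
import numpy as np

def boltzmann_policy(rewards, beta=2.0):
    """Probability of choosing each option under a Boltzmann-rational human."""
    logits = beta * np.asarray(rewards, dtype=float)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

# Observed human choices among 3 options (made-up data).
choices = np.array([2, 2, 1, 2, 0, 2, 2, 1])

# Hypothetical candidate reward vectors.
candidates = {
    "option2_best": [0.0, 0.5, 1.0],
    "option0_best": [1.0, 0.5, 0.0],
}

for name, rewards in candidates.items():
    p = boltzmann_policy(rewards)
    log_lik = np.log(p[choices]).sum()
    print(f"{name}: log-likelihood = {log_lik:.2f}")
# The reward under which the observed choices are most likely wins.
```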

Dec 11, 2020 • 59min
1 - Adversarial Policies with Adam Gleave
In this episode, Adam Gleave and I talk about adversarial policies. Basically, in current reinforcement learning, people train agents that act in some kind of environment, sometimes an environment that contains other agents. For instance, you might train agents that play sumo with each other, with the objective of making them generally good at sumo. Adam's research looks at the case where all you're trying to do is make an agent that defeats one specific other agent: how easy is it, and what happens? He finds that often you can do it pretty easily, and your agent can behave in a very silly-seeming way that nevertheless happens to exploit some 'bug' in the opponent. We talk about the experiments he ran, the results, and what they say about how we do reinforcement learning.
Link to the paper - Adversarial Policies: Attacking Deep Reinforcement Learning: arxiv.org/abs/1905.10615
Link to the transcript: axrp.net/episode/2020/12/11/episode-1-adversarial-policies-adam-gleave.html
Adam's website: gleave.me
Adam's twitter account: twitter.com/argleave
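As a cartoon of the setup (nothing like the paper's simulated-robotics experiments), here is a Python sketch: the victim policy is frozen, and only the attacker is optimized against it, here by naive random search in rock-paper-scissors against a victim with a "bug" (it over-plays rock). The game, victim, and search method are all illustrative stand-ins for the paper's environments and RL training.

```python
# A toy version of the adversarial-policies setup: freeze the victim policy
# and optimize only the attacker against it. Random search over mixed
# strategies in rock-paper-scissors stands in for RL training.
import numpy as np

rng = np.random.default_rng(0)

# Payoff to the attacker: rows = attacker move, cols = victim move (R, P, S).
payoff = np.array([
    [0, -1,  1],
    [1,  0, -1],
    [-1, 1,  0],
])

victim = np.array([0.6, 0.2, 0.2])   # frozen, buggy victim: plays rock too often

def attacker_value(strategy):
    """Expected payoff of the attacker's mixed strategy vs the frozen victim."""
    return strategy @ payoff @ victim

best = np.ones(3) / 3
best_value = attacker_value(best)
for _ in range(2000):                 # naive random search over attacker strategies
    candidate = rng.dirichlet(np.ones(3))
    value = attacker_value(candidate)
    if value > best_value:
        best, best_value = candidate, value

print(f"attacker strategy (R, P, S): {np.round(best, 2)}")  # leans heavily toward paper
print(f"expected payoff vs victim:   {best_value:.2f}")
```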