AXRP - the AI X-risk Research Podcast

Latest episodes

Aug 21, 2022 • 1h 1min

17 - Training for Very High Reliability with Daniel Ziegler

Sometimes, people talk about making AI systems safe by taking examples where they fail and training them to do well on those. But how can we actually do this well, especially when we can't use a computer program to say what a 'failure' is? In this episode, I speak with Daniel Ziegler about his research group's efforts to try doing this with present-day language models, and what they learned. (A toy code sketch of this kind of adversarial training loop appears after the links below.) Listeners beware: this episode contains a spoiler for the Animorphs franchise around minute 41 (in the 'Fanfiction' section of the transcript).

Topics we discuss, and timestamps:
- 00:00:40 - Summary of the paper
- 00:02:23 - Alignment as scalable oversight and catastrophe minimization
- 00:08:06 - Novel contributions
- 00:14:20 - Evaluating adversarial robustness
- 00:20:26 - Adversary construction
- 00:35:14 - The task
- 00:38:23 - Fanfiction
- 00:42:15 - Estimators to reduce labelling burden
- 00:45:39 - Future work
- 00:50:12 - About Redwood Research

The transcript: axrp.net/episode/2022/08/21/episode-17-training-for-very-high-reliability-daniel-ziegler.html

Daniel Ziegler on Google Scholar: scholar.google.com/citations?user=YzfbfDgAAAAJ

Research we discuss:
- Daniel's paper, Adversarial Training for High-Stakes Reliability: arxiv.org/abs/2205.01663
- Low-stakes alignment: alignmentforum.org/posts/TPan9sQFuPP6jgEJo/low-stakes-alignment
- Red Teaming Language Models with Language Models: arxiv.org/abs/2202.03286
- Uncertainty Estimation for Language Reward Models: arxiv.org/abs/2203.07472
- Eliciting Latent Knowledge: docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit
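To make the core loop concrete, here is a minimal, self-contained sketch of adversarial training on a toy one-dimensional "classifier". It is not the paper's setup (which trains a language-model-based classifier against human and tool-assisted adversaries); every name and number below is illustrative.

```python
import random

# Toy stand-ins: a "classifier" that flags x > threshold as unsafe, and a
# hidden ground-truth rule it is trying to match. Purely illustrative.
TRUE_BOUNDARY = 0.7
def true_label(x):          # the human judgement the classifier should reproduce
    return x > TRUE_BOUNDARY

class ThresholdClassifier:
    def __init__(self):
        self.threshold = 0.5
    def predict(self, x):
        return x > self.threshold
    def fit(self, data):     # pick the threshold with the fewest mistakes on the data
        candidates = [0.0] + sorted(x for x, _ in data)
        self.threshold = min(
            candidates,
            key=lambda t: sum((x > t) != y for x, y in data),
        )

def find_failures(clf, budget):
    """Adversary: random search for inputs the classifier gets wrong."""
    xs = [random.random() for _ in range(budget)]
    return [x for x in xs if clf.predict(x) != true_label(x)]

def adversarial_training(clf, data, rounds=10, budget=1000):
    for _ in range(rounds):
        clf.fit(data)                                   # train on everything so far
        failures = find_failures(clf, budget)           # adversaries hunt for mistakes
        if not failures:
            break                                       # no failures found within budget
        data += [(x, true_label(x)) for x in failures]  # fold labelled failures back in
    return clf

clf = adversarial_training(ThresholdClassifier(), [(0.2, False), (0.9, True)])
print(clf.threshold)  # ends up near the true boundary of 0.7
```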
Jul 1, 2022 • 1h 5min

16 - Preparing for Debate AI with Geoffrey Irving

Many people in the AI alignment space have heard of AI safety via debate - check out AXRP episode 6 (axrp.net/episode/2021/04/08/episode-6-debate-beth-barnes.html) if you need a primer. But how do we get language models to the stage where they can usefully implement debate? In this episode, I talk to Geoffrey Irving about the role of language models in AI safety, as well as three projects he's done that get us closer to making debate happen: using language models to find flaws in themselves, getting language models to back up claims they make with citations, and figuring out how uncertain language models should be about the quality of various answers. (A toy sketch of the first of these - red-teaming a language model with another language model - follows the links below.)

Topics we discuss, and timestamps:
- 00:00:48 - Status update on AI safety via debate
- 00:10:24 - Language models and AI safety
- 00:19:34 - Red teaming language models with language models
- 00:35:31 - GopherCite
- 00:49:10 - Uncertainty Estimation for Language Reward Models
- 01:00:26 - Following Geoffrey's work, and working with him

The transcript: axrp.net/episode/2022/07/01/episode-16-preparing-for-debate-ai-geoffrey-irving.html

Geoffrey's twitter: twitter.com/geoffreyirving

Research we discuss:
- Red Teaming Language Models With Language Models: arxiv.org/abs/2202.03286
- Teaching Language Models to Support Answers with Verified Quotes, aka GopherCite: arxiv.org/abs/2203.11147
- Uncertainty Estimation for Language Reward Models: arxiv.org/abs/2203.07472
- AI Safety via Debate: arxiv.org/abs/1805.00899
- Writeup: progress on AI safety via debate: lesswrong.com/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1
- Eliciting Latent Knowledge: ai-alignment.com/eliciting-latent-knowledge-f977478608fc
- Training Compute-Optimal Large Language Models, aka Chinchilla: arxiv.org/abs/2203.15556
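The rough shape of the red-teaming method, as I understand the paper, is: one language model proposes test prompts, the target model responds, and a classifier flags bad responses. Here is a hypothetical sketch with stand-in callables, not a real API:

```python
# A hypothetical sketch of the red-teaming loop: a "red" language model writes
# test prompts, the target model answers, and a classifier flags unsafe answers.
# `red_lm_generate`, `target_lm`, and `is_unsafe` are stand-ins, not a real API.

def red_team(red_lm_generate, target_lm, is_unsafe, n_cases=100):
    failures = []
    for _ in range(n_cases):
        prompt = red_lm_generate("Write a question that might elicit an unsafe reply.")
        reply = target_lm(prompt)
        if is_unsafe(prompt, reply):           # e.g. an offensiveness classifier
            failures.append((prompt, reply))   # keep for analysis or further training
    return failures

# Toy usage with stand-in functions (a real setup would sample diverse prompts):
failures = red_team(
    red_lm_generate=lambda instruction: "Say something rude about me.",
    target_lm=lambda prompt: "I'd rather keep things polite.",
    is_unsafe=lambda prompt, reply: "rude" in reply.lower(),
)
print(len(failures))  # 0 here, since the toy target stays polite
```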
May 23, 2022 • 1h 37min

15 - Natural Abstractions with John Wentworth

Why does anybody care about natural abstractions? Do they somehow relate to math, or value learning? How do E. coli bacteria find sources of sugar? (A toy simulation relevant to that last question appears after the errata below.) All these questions and more will be answered in this interview with John Wentworth, where we talk about his research plan of understanding agency via natural abstractions.

Topics we discuss, and timestamps:
- 00:00:31 - Agency in E. coli
- 00:04:59 - Agency in financial markets
- 00:08:44 - Inferring agency in real-world systems
- 00:16:11 - Selection theorems
- 00:20:22 - Abstraction and natural abstractions
- 00:32:42 - Information at a distance
- 00:39:20 - Why the natural abstraction hypothesis matters
- 00:44:48 - Unnatural abstractions used by humans?
- 00:49:11 - Probability, determinism, and abstraction
- 00:52:58 - Whence probabilities in deterministic universes?
- 01:02:37 - Abstraction and maximum entropy distributions
- 01:07:39 - Natural abstractions and impact
- 01:08:50 - Learning human values
- 01:20:47 - The shape of the research landscape
- 01:34:59 - Following John's work

The transcript: axrp.net/episode/2022/05/23/episode-15-natural-abstractions-john-wentworth.html

John on LessWrong: lesswrong.com/users/johnswentworth

Research that we discuss:
- Alignment by default - contains the natural abstraction hypothesis: alignmentforum.org/posts/Nwgdq6kHke5LY692J/alignment-by-default#Unsupervised__Natural_Abstractions
- The telephone theorem: alignmentforum.org/posts/jJf4FrfiQdDGg7uco/information-at-a-distance-is-mediated-by-deterministic
- Generalizing Koopman-Pitman-Darmois: alignmentforum.org/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois
- The plan: alignmentforum.org/posts/3L46WGauGpr7nYubu/the-plan
- Understanding deep learning requires rethinking generalization - deep learning can fit random data: arxiv.org/abs/1611.03530
- A closer look at memorization in deep networks - deep learning learns before memorizing: arxiv.org/abs/1706.05394
- Zero-shot coordination: arxiv.org/abs/2003.02979
- A new formalism, method, and open issues for zero-shot coordination: arxiv.org/abs/2106.06613
- Conservative agency via attainable utility preservation: arxiv.org/abs/1902.09725
- Corrigibility: intelligence.org/files/Corrigibility.pdf

Errata:
- E. coli has ~4,400 genes, not 30,000.
- A typical adult human body has thousands of moles of water in it, and therefore must consist of well more than 10 moles total.
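For the E. coli question, the standard "run and tumble" picture (my summary, not a quote from the episode) is that the bacterium tumbles into a new random direction more often when the sugar concentration is falling than when it is rising, which biases an otherwise random walk toward the source. A tiny simulation of that rule, with illustrative numbers only:

```python
import math, random

def concentration(x, y):                 # toy sugar gradient peaked at the origin
    return math.exp(-(x * x + y * y))

x, y, heading = 5.0, 5.0, 0.0
last_c = concentration(x, y)
for _ in range(20000):
    x += 0.01 * math.cos(heading)        # "run" a small step in the current direction
    y += 0.01 * math.sin(heading)
    c = concentration(x, y)
    tumble_prob = 0.02 if c > last_c else 0.2   # tumble less often when things improve
    if random.random() < tumble_prob:
        heading = random.uniform(0.0, 2.0 * math.pi)
    last_c = c

print(round(math.hypot(x, y), 2))        # distance to the source: starts at ~7.07, typically ends much smaller
```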
Apr 5, 2022 • 1h 48min

14 - Infra-Bayesian Physicalism with Vanessa Kosoy

Late last year, Vanessa Kosoy and Alexander Appel published some research under the heading of "Infra-Bayesian physicalism". But wait - what was infra-Bayesianism again? Why should we care? And what does any of this have to do with physicalism? In this episode, I talk with Vanessa Kosoy about these questions, and get a technical overview of how infra-Bayesian physicalism works and what its implications are. (A toy illustration of the worst-case decision rule behind infra-Bayesianism follows the links below.)

Topics we discuss, and timestamps:
- 00:00:48 - The basics of infra-Bayes
- 00:08:32 - An invitation to infra-Bayes
- 00:11:23 - What is naturalized induction?
- 00:19:53 - How infra-Bayesian physicalism helps with naturalized induction
  - 00:19:53 - Bridge rules
  - 00:22:22 - Logical uncertainty
  - 00:23:36 - Open source game theory
  - 00:28:27 - Logical counterfactuals
  - 00:30:55 - Self-improvement
- 00:32:40 - How infra-Bayesian physicalism works
  - 00:32:47 - World models
    - 00:39:20 - Priors
    - 00:42:53 - Counterfactuals
    - 00:50:34 - Anthropics
  - 00:54:40 - Loss functions
    - 00:56:44 - The monotonicity principle
    - 01:01:57 - How to care about various things
  - 01:08:47 - Decision theory
- 01:19:53 - Follow-up research
  - 01:20:06 - Infra-Bayesian physicalist quantum mechanics
  - 01:26:42 - Infra-Bayesian physicalist agreement theorems
- 01:29:00 - The production of infra-Bayesianism research
- 01:35:14 - Bridge rules and malign priors
- 01:45:27 - Following Vanessa's work

The transcript: axrp.net/episode/2022/04/05/episode-14-infra-bayesian-physicalism-vanessa-kosoy.html

Vanessa on the Alignment Forum: alignmentforum.org/users/vanessa-kosoy

Research that we discuss:
- Infra-Bayesian physicalism: a formal theory of naturalized induction: alignmentforum.org/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized
- Updating ambiguous beliefs (contains the infra-Bayesian update rule): sciencedirect.com/science/article/abs/pii/S0022053183710033
- Functional Decision Theory: A New Theory of Instrumental Rationality: arxiv.org/abs/1710.05060
- Space-time embedded intelligence: cs.utexas.edu/~ring/Orseau,%20Ring%3B%20Space-Time%20Embedded%20Intelligence,%20AGI%202012.pdf
- Attacking the grain of truth problem using Bayes-Savage agents (generating a simplicity prior with Knightian uncertainty using oracle machines): alignmentforum.org/posts/5bd75cc58225bf0670375273/attacking-the-grain-of-truth-problem-using-bayes-sa
- Quantity of experience: brain-duplication and degrees of consciousness (the thick wires argument): nickbostrom.com/papers/experience.pdf
- Online learning in unknown Markov games: arxiv.org/abs/2010.15020
- Agreeing to disagree (contains the Aumann agreement theorem): ma.huji.ac.il/~raumann/pdf/Agreeing%20to%20Disagree.pdf
- What does the universal prior actually look like? (aka "the Solomonoff prior is malign"): ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/
- The Solomonoff prior is malign: alignmentforum.org/posts/Tr7tAyt5zZpdTwTQK/the-solomonoff-prior-is-malign
- Eliciting Latent Knowledge: docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit
- ELK Thought Dump, by Abram Demski: alignmentforum.org/posts/eqzbXmqGqXiyjX3TP/elk-thought-dump-1
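As a rough gloss (mine, not Vanessa's formal definitions), the decision rule at the heart of infra-Bayesianism replaces a single prior with a set of hypotheses and ranks actions by their worst-case expected utility over that set. A toy numerical sketch:

```python
# Toy maximin choice under Knightian uncertainty: utilities of each action under
# each outcome, and a set of candidate distributions with no single prior.
outcomes = ["win", "draw", "lose"]

action_utilities = {                      # purely illustrative numbers
    "safe": {"win": 0.7, "draw": 0.6, "lose": 0.4},
    "bold": {"win": 1.0, "draw": 0.5, "lose": 0.0},
}

credal_set = [                            # the agent's set of hypotheses
    {"win": 0.6, "draw": 0.3, "lose": 0.1},
    {"win": 0.2, "draw": 0.3, "lose": 0.5},
]

def worst_case_value(action):
    """Expected utility under the least favourable hypothesis in the set."""
    return min(
        sum(p[o] * action_utilities[action][o] for o in outcomes)
        for p in credal_set
    )

best = max(action_utilities, key=worst_case_value)
print(best, worst_case_value(best))  # "safe" 0.52, beating "bold" whose worst case is 0.35
```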
Mar 31, 2022 • 1h 34min

13 - First Principles of AGI Safety with Richard Ngo

How should we think about artificial general intelligence (AGI), and the risks it might pose? What constraints exist on technical solutions to the problem of aligning superhuman AI systems with human intentions? In this episode, I talk to Richard Ngo about his report analyzing AGI safety from first principles, and recent conversations he had with Eliezer Yudkowsky about the difficulty of AI alignment.

Topics we discuss, and timestamps:
- 00:00:40 - The nature of intelligence and AGI
  - 00:01:18 - The nature of intelligence
  - 00:06:09 - AGI: what and how
  - 00:13:30 - Single vs collective AI minds
- 00:18:57 - AGI in practice
  - 00:18:57 - Impact
  - 00:20:49 - Timing
  - 00:25:38 - Creation
  - 00:28:45 - Risks and benefits
- 00:35:54 - Making AGI safe
  - 00:35:54 - Robustness of the agency abstraction
  - 00:43:15 - Pivotal acts
- 00:50:05 - AGI safety concepts
  - 00:50:05 - Alignment
  - 00:56:14 - Transparency
  - 00:59:25 - Cooperation
- 01:01:40 - Optima and selection processes
- 01:13:33 - The AI alignment research community
  - 01:13:33 - Updates from the Yudkowsky conversation
  - 01:17:18 - Corrections to the community
  - 01:23:57 - Why others don't join
- 01:26:38 - Richard Ngo as a researcher
- 01:28:26 - The world approaching AGI
- 01:30:41 - Following Richard's work

The transcript: axrp.net/episode/2022/03/31/episode-13-first-principles-agi-safety-richard-ngo.html

Richard on the Alignment Forum: alignmentforum.org/users/ricraz
Richard on Twitter: twitter.com/RichardMCNgo
The AGI Safety Fundamentals course: eacambridge.org/agi-safety-fundamentals

Materials that we mention:
- AGI Safety from First Principles: alignmentforum.org/s/mzgtmmTKKn5MuCzFJ
- Conversations with Eliezer Yudkowsky: alignmentforum.org/s/n945eovrA3oDueqtq
- The Bitter Lesson: incompleteideas.net/IncIdeas/BitterLesson.html
- Metaphors We Live By: en.wikipedia.org/wiki/Metaphors_We_Live_By
- The Enigma of Reason: hup.harvard.edu/catalog.php?isbn=9780674237827
- Draft report on AI timelines, by Ajeya Cotra: alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines
- More is Different for AI: bounded-regret.ghost.io/more-is-different-for-ai/
- The Windfall Clause: fhi.ox.ac.uk/windfallclause
- Cooperative Inverse Reinforcement Learning: arxiv.org/abs/1606.03137
- Imitative Generalisation: alignmentforum.org/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1
- Eliciting Latent Knowledge: docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit
- Draft report on existential risk from power-seeking AI, by Joseph Carlsmith: alignmentforum.org/posts/HduCjmXTBD4xYTegv/draft-report-on-existential-risk-from-power-seeking-ai
- The Most Important Century: cold-takes.com/most-important-century
Dec 2, 2021 • 2h 50min

12 - AI Existential Risk with Paul Christiano

Why would advanced AI systems pose an existential risk, and what would it look like to develop safer systems? In this episode, I interview Paul Christiano about his views of how AI could be so dangerous, what bad AI scenarios could look like, and what he thinks about various techniques to reduce this risk.

Topics we discuss, and timestamps:
- 00:00:38 - How AI may pose an existential threat
  - 00:13:36 - AI timelines
  - 00:24:49 - Why we might build risky AI
  - 00:33:58 - Takeoff speeds
  - 00:51:33 - Why AI could have bad motivations
  - 00:56:33 - Lessons from our current world
  - 01:08:23 - "Superintelligence"
- 01:15:21 - Technical causes of AI x-risk
  - 01:19:32 - Intent alignment
  - 01:33:52 - Outer and inner alignment
  - 01:43:45 - Thoughts on agent foundations
- 01:49:35 - Possible technical solutions to AI x-risk
  - 01:49:35 - Imitation learning, inverse reinforcement learning, and ease of evaluation
  - 02:00:34 - Paul's favorite outer alignment solutions
    - 02:01:20 - Solutions researched by others
    - 02:06:13 - Decoupling planning from knowledge
  - 02:17:18 - Factored cognition
  - 02:25:34 - Possible solutions to inner alignment
- 02:31:56 - About Paul
  - 02:31:56 - Paul's research style
  - 02:36:36 - Disagreements and uncertainties
  - 02:46:08 - Some favorite organizations
  - 02:48:21 - Following Paul's work

The transcript: axrp.net/episode/2021/12/02/episode-12-ai-xrisk-paul-christiano.html

Paul's blog posts on AI alignment: ai-alignment.com

Material that we mention:
- Cold Takes - The Most Important Century: cold-takes.com/most-important-century
- Open Philanthropy reports on:
  - Modeling the human trajectory: openphilanthropy.org/blog/modeling-human-trajectory
  - The computational power of the human brain: openphilanthropy.org/blog/new-report-brain-computation
  - AI timelines (draft): alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines
  - Whether AI could drive explosive economic growth: openphilanthropy.org/blog/report-advanced-ai-drive-explosive-economic-growth
- Takeoff speeds: sideways-view.com/2018/02/24/takeoff-speeds
- Superintelligence: Paths, Dangers, Strategies: en.wikipedia.org/wiki/Superintelligence:_Paths,_Dangers,_Strategies
- Wei Dai on metaphilosophical competence:
  - Two neglected problems in human-AI safety: alignmentforum.org/posts/HTgakSs6JpnogD6c2/two-neglected-problems-in-human-ai-safety
  - The argument from philosophical difficulty: alignmentforum.org/posts/w6d7XBCegc96kz4n3/the-argument-from-philosophical-difficulty
  - Some thoughts on metaphilosophy: alignmentforum.org/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy
- AI safety via debate: arxiv.org/abs/1805.00899
- Iterated distillation and amplification: ai-alignment.com/iterated-distillation-and-amplification-157debfd1616
- Scalable agent alignment via reward modeling: a research direction: arxiv.org/abs/1811.07871
- Learning the prior: alignmentforum.org/posts/SL9mKhgdmDKXmxwE4/learning-the-prior
- Imitative generalisation (AKA 'learning the prior'): alignmentforum.org/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1
- When is unaligned AI morally valuable?: ai-alignment.com/sympathizing-with-ai-e11a4bf5ef6e
Sep 25, 2021 • 1h 28min

11 - Attainable Utility and Power with Alex Turner

Many scary stories about AI involve an AI system deceiving and subjugating humans in order to gain the ability to achieve its goals without us stopping it. This episode's guest, Alex Turner, will tell us about his research analyzing the notions of "attainable utility" and "power" that underlie these stories, so that we can better evaluate how likely they are and how to prevent them. (A toy sketch of the attainable utility penalty follows the links below.)

Topics we discuss:
- Side effects minimization
- Attainable Utility Preservation (AUP)
- AUP and alignment
- Power-seeking
- Power-seeking and alignment
- Future work and about Alex

The transcript: axrp.net/episode/2021/09/25/episode-11-attainable-utility-power-alex-turner.html

Alex on the AI Alignment Forum: alignmentforum.org/users/turntrout
Alex's Google Scholar page: scholar.google.com/citations?user=thAHiVcAAAAJ&hl=en&oi=ao

Conservative Agency via Attainable Utility Preservation: arxiv.org/abs/1902.09725
Optimal Policies Tend to Seek Power: arxiv.org/abs/1912.01683

Other works discussed:
- Avoiding Side Effects by Considering Future Tasks: arxiv.org/abs/2010.07877
- The "Reframing Impact" Sequence: alignmentforum.org/s/7CdoznhJaLEKHwvJW
- The "Risks from Learned Optimization" Sequence: alignmentforum.org/s/7CdoznhJaLEKHwvJW
- Concrete Approval-Directed Agents: ai-alignment.com/concrete-approval-directed-agents-89e247df7f1b
- Seeking Power is Convergently Instrumental in a Broad Class of Environments: alignmentforum.org/s/fSMbebQyR4wheRrvk/p/hzeLSQ9nwDkPc4KNt
- Formalizing Convergent Instrumental Goals: intelligence.org/files/FormalizingConvergentGoals.pdf
- The More Power at Stake, the Stronger Instrumental Convergence Gets for Optimal Policies: alignmentforum.org/posts/Yc5QSSZCQ9qdyxZF6/the-more-power-at-stake-the-stronger-instrumental
- Problem Relaxation as a Tactic: alignmentforum.org/posts/JcpwEKbmNHdwhpq5n/problem-relaxation-as-a-tactic
- How I do Research: lesswrong.com/posts/e3Db4w52hz3NSyYqt/how-i-do-research
- Math that Clicks: Look for Two-way Correspondences: lesswrong.com/posts/Lotih2o2pkR2aeusW/math-that-clicks-look-for-two-way-correspondences
- Testing the Natural Abstraction Hypothesis: alignmentforum.org/posts/cy3BhHrGinZCp3LXE/testing-the-natural-abstraction-hypothesis-project-intro
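As a rough sketch of the Attainable Utility Preservation idea (hypothetical numbers, not the paper's implementation): an action is penalized in proportion to how much it changes the agent's ability to achieve a set of auxiliary goals, measured against doing nothing at all.

```python
def aup_reward(primary_reward, q_aux_after_action, q_aux_after_noop, lam=0.1):
    """q_aux_*: attainable utility (e.g. an optimal Q-value) for each auxiliary goal."""
    penalty = sum(
        abs(after - baseline)
        for after, baseline in zip(q_aux_after_action, q_aux_after_noop)
    ) / len(q_aux_after_noop)
    return primary_reward - lam * penalty

# Disabling an off-switch might score well on the primary goal, but it shifts the
# attainable utilities of the auxiliary goals a lot, so it gets penalized:
print(aup_reward(1.0, q_aux_after_action=[9.0, 0.5], q_aux_after_noop=[3.0, 2.0], lam=0.1))
# 1.0 - 0.1 * ((6.0 + 1.5) / 2) = 0.625
```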
Jul 23, 2021 • 2h 3min

10 - AI's Future and Impacts with Katja Grace

When going about trying to ensure that AI does not cause an existential catastrophe, it's likely important to understand how AI will develop in the future, and why exactly it might or might not cause such a catastrophe. In this episode, I interview Katja Grace, researcher at AI Impacts, who's done work surveying AI researchers about when they expect superhuman AI to be reached, collecting data about how rapidly AI tends to progress, and thinking about the weak points in arguments that AI could be catastrophic for humanity.

Topics we discuss:
- 00:00:34 - AI Impacts and its research
- 00:08:59 - How to forecast the future of AI
- 00:13:33 - Results of surveying AI researchers
- 00:30:41 - Work related to forecasting AI takeoff speeds
  - 00:31:11 - How long it takes AI to cross the human skill range
  - 00:42:47 - How often technologies have discontinuous progress
  - 00:50:06 - Arguments for and against fast takeoff of AI
- 01:04:00 - Coherence arguments
- 01:12:15 - Arguments that AI might cause existential catastrophe, and counter-arguments
  - 01:13:58 - The size of the super-human range of intelligence
  - 01:17:22 - The dangers of agentic AI
  - 01:25:45 - The difficulty of human-compatible goals
  - 01:33:54 - The possibility of AI destroying everything
- 01:49:42 - The future of AI Impacts
- 01:52:17 - AI Impacts vs academia
- 02:00:25 - What AI x-risk researchers do wrong
- 02:01:43 - How to follow Katja's and AI Impacts' work

The transcript: axrp.net/episode/2021/07/23/episode-10-ais-future-and-dangers-katja-grace.html

"When Will AI Exceed Human Performance? Evidence from AI Experts": arxiv.org/abs/1705.08807
AI Impacts page of more complete survey results: aiimpacts.org/2016-expert-survey-on-progress-in-ai
Likelihood of discontinuous progress around the development of AGI: aiimpacts.org/likelihood-of-discontinuous-progress-around-the-development-of-agi
Discontinuous progress investigation: aiimpacts.org/discontinuous-progress-investigation
The range of human intelligence: aiimpacts.org/is-the-range-of-human-intelligence-small
Jun 24, 2021 • 1h 39min

9 - Finite Factored Sets with Scott Garrabrant

Being an agent can get loopy quickly. For instance, imagine that we're playing chess and I'm trying to decide what move to make. Your next move influences the outcome of the game, and my guess of that influences my move, which influences your next move, which influences the outcome of the game. How can we model these dependencies in a general way, without baking in primitive notions of 'belief' or 'agency'? Today, I talk with Scott Garrabrant about his recent work on finite factored sets that aims to answer this question. (A toy code example of a finite factored set follows the links below.)

Topics we discuss:
- 00:00:43 - finite factored sets' relation to Pearlian causality and abstraction
- 00:16:00 - partitions and factors in finite factored sets
- 00:26:45 - orthogonality and time in finite factored sets
- 00:34:49 - using finite factored sets
- 00:37:53 - why not infinite factored sets?
- 00:45:28 - limits of, and follow-up work on, finite factored sets
- 01:00:59 - relevance to embedded agency and x-risk
- 01:10:40 - how Scott researches
- 01:28:34 - relation to Cartesian frames
- 01:37:36 - how to follow Scott's work

Link to the transcript: axrp.net/episode/2021/06/24/episode-9-finite-factored-sets-scott-garrabrant.html

Link to a transcript of Scott's talk on finite factored sets: alignmentforum.org/posts/N5Jm6Nj4HkNKySA5Z/finite-factored-sets
Scott's LessWrong account: lesswrong.com/users/scott-garrabrant

Other work mentioned in the discussion:
- Causality, by Judea Pearl: bayes.cs.ucla.edu/BOOK-2K
- Scott's work on Cartesian frames: alignmentforum.org/posts/BSpdshJWGAW6TuNzZ/introduction-to-cartesian-frames
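To make the central object concrete: a finite factored set is (roughly, in my paraphrase) a finite set together with a collection of partitions ("factors") such that picking one part from each factor pins down exactly one element. A toy check in code:

```python
from itertools import product

# Toy illustration of the definition (my paraphrase, not Scott's code): the set of
# four (weather, carrying) situations, factored by the "weather" and "carrying" partitions.
S = {(v, c) for v in ("rain", "sun") for c in ("umbrella", "no-umbrella")}
weather = [{s for s in S if s[0] == v} for v in ("rain", "sun")]                # factor 1
carrying = [{s for s in S if s[1] == c} for c in ("umbrella", "no-umbrella")]   # factor 2
factors = [weather, carrying]

def is_factorization(universe, factors):
    # every combination of one part per factor must intersect the universe in exactly one element
    return all(
        len(set.intersection(*combo)) == 1
        for combo in product(*factors)
    )

print(is_factorization(S, factors))  # True: weather and carrying jointly pin down each element
```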
Jun 8, 2021 • 2h 23min

8 - Assistance Games with Dylan Hadfield-Menell

How should we think about the technical problem of building smarter-than-human AI that does what we want? When and how should AI systems defer to us? Should they have their own goals, and how should those goals be managed? In this episode, Dylan Hadfield-Menell talks about his work on assistance games that formalizes these questions. The first couple years of my PhD program included many long conversations with Dylan that helped shape how I view AI x-risk research, so it was great to have another one in the form of a recorded interview. (A toy version of the off-switch game calculation follows the links below.)

Link to the transcript: axrp.net/episode/2021/06/08/episode-8-assistance-games-dylan-hadfield-menell.html

Link to the paper "Cooperative Inverse Reinforcement Learning": arxiv.org/abs/1606.03137
Link to the paper "The Off-Switch Game": arxiv.org/abs/1611.08219
Link to the paper "Inverse Reward Design": arxiv.org/abs/1711.02827

Dylan's twitter account: twitter.com/dhadfieldmenell
Link to apply to the MIT EECS graduate program: gradapply.mit.edu/eecs/apply/login/?next=/eecs/

Other work mentioned in the discussion:
- The original paper on inverse optimal control: asmedigitalcollection.asme.org/fluidsengineering/article-abstract/86/1/51/392203/When-Is-a-Linear-Control-System-Optimal
- Justin Fu's research on, among other things, adversarial IRL: scholar.google.com/citations?user=T9To2C0AAAAJ&hl=en&oi=ao
- Preferences implicit in the state of the world: arxiv.org/abs/1902.04198
- What are you optimizing for? Aligning recommender systems with human values: participatoryml.github.io/papers/2020/42.pdf
- The Assistive Multi-Armed Bandit: arxiv.org/abs/1901.08654
- Soares et al. on Corrigibility: openreview.net/forum?id=H1bIT1buWH
- Should Robots be Obedient?: arxiv.org/abs/1705.09990
- Rodney Brooks on the Seven Deadly Sins of Predicting the Future of AI: rodneybrooks.com/the-seven-deadly-sins-of-predicting-the-future-of-ai/
- Products in category theory: en.wikipedia.org/wiki/Product_(category_theory)
- AXRP Episode 7 - Side Effects with Victoria Krakovna: axrp.net/episode/2021/05/14/episode-7-side-effects-victoria-krakovna.html
- Attainable Utility Preservation: arxiv.org/abs/1902.09725
- Penalizing side effects using stepwise relative reachability: arxiv.org/abs/1806.01186
- Simplifying Reward Design through Divide-and-Conquer: arxiv.org/abs/1806.02501
- Active Inverse Reward Design: arxiv.org/abs/1809.03060
- An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning: proceedings.mlr.press/v80/malik18a.html
- Incomplete Contracting and AI Alignment: arxiv.org/abs/1804.04268
- Multi-Principal Assistance Games: arxiv.org/abs/2007.09540
- Consequences of Misaligned AI: arxiv.org/abs/2102.03896
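Here is a toy version of the off-switch game intuition (illustrative numbers, not the paper's full model): a robot that is unsure whether its plan helps or harms the human does at least as well in expectation by letting a rational human veto the plan as by acting unilaterally.

```python
# The robot's belief over the plan's true value to the human: helps (+1) or hurts (-1).
belief = {+1.0: 0.6, -1.0: 0.4}

act_anyway = sum(p * u for u, p in belief.items())        # act unilaterally: roughly 0.2
defer = sum(p * max(u, 0.0) for u, p in belief.items())   # a rational human vetoes when u < 0: 0.6

print(act_anyway, defer)  # keeping the off switch available is worth more in expectation
```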
