
AXRP - the AI X-risk Research Podcast

Latest episodes

Apr 12, 2023 • 2h 28min

20 - 'Reform' AI Alignment with Scott Aaronson

How should we scientifically think about the impact of AI on human civilization, and whether or not it will doom us all? In this episode, I speak with Scott Aaronson about his views on how to make progress in AI alignment, as well as his work on watermarking the output of language models, and how he moved from a background in quantum complexity theory to working on AI. A hedged code sketch of the watermarking idea appears after the links below.

Note: this episode was recorded before this story (vice.com/en/article/pkadgm/man-dies-by-suicide-after-talking-with-ai-chatbot-widow-says) emerged of a man dying by suicide after conversations with a language-model-based chatbot that included discussion of the possibility of him killing himself.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast

Topics we discuss, and timestamps:
 - 0:00:36 - 'Reform' AI alignment
   - 0:01:52 - Epistemology of AI risk
   - 0:20:08 - Immediate problems and existential risk
   - 0:24:35 - Aligning deceitful AI
   - 0:30:59 - Stories of AI doom
   - 0:34:27 - Language models
   - 0:43:08 - Democratic governance of AI
   - 0:59:35 - What would change Scott's mind
 - 1:14:45 - Watermarking language model outputs
   - 1:41:41 - Watermark key secrecy and backdoor insertion
 - 1:58:05 - Scott's transition to AI research
   - 2:03:48 - Theoretical computer science and AI alignment
   - 2:14:03 - AI alignment and formalizing philosophy
   - 2:22:04 - How Scott finds AI research
 - 2:24:53 - Following Scott's research

The transcript: axrp.net/episode/2023/04/11/episode-20-reform-ai-alignment-scott-aaronson.html

Links to Scott's things:
 - Personal website: scottaaronson.com
 - Book, Quantum Computing Since Democritus: amazon.com/Quantum-Computing-since-Democritus-Aaronson/dp/0521199565/
 - Blog, Shtetl-Optimized: scottaaronson.blog

Writings we discuss:
 - Reform AI Alignment: scottaaronson.blog/?p=6821
 - Planting Undetectable Backdoors in Machine Learning Models: arxiv.org/abs/2204.06974
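These notes don't spell out how language-model watermarking works, so here is a minimal sketch of the general family of techniques the episode touches on: bias token sampling using a keyed pseudorandom function, then detect the bias statistically. This is a simplified "green-list"-style scheme, not Scott's actual construction (his is designed to be distortion-free); all function names and parameters below are made up for illustration.

```python
import hashlib
import math
import numpy as np

def green_list(prev_token: int, key: bytes, vocab_size: int, gamma: float = 0.5) -> np.ndarray:
    """Pseudorandomly pick a 'green' subset of the vocabulary, seeded by the
    secret key and the previous token. Only the key-holder can recompute it."""
    seed = int.from_bytes(hashlib.sha256(key + prev_token.to_bytes(4, "big")).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.random(vocab_size) < gamma  # boolean mask of green tokens

def watermarked_sample(logits: np.ndarray, prev_token: int, key: bytes, delta: float = 2.0) -> int:
    """Sample the next token after nudging green-token logits up by delta."""
    mask = green_list(prev_token, key, len(logits))
    boosted = logits + delta * mask
    probs = np.exp(boosted - boosted.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(len(logits), p=probs))

def detect(tokens: list[int], key: bytes, vocab_size: int, gamma: float = 0.5) -> float:
    """z-score for 'more green tokens than chance': large values suggest a watermark."""
    hits = sum(green_list(prev, key, vocab_size)[tok] for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A generation loop would call watermarked_sample in place of ordinary sampling; anyone holding the key can then run detect on a suspect text's token sequence.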
Feb 7, 2023 • 3min

Store, Patreon, Video

Store: https://store.axrp.net/
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Video: https://www.youtube.com/watch?v=kmPFjpEibu0
Feb 4, 2023 • 3h 53min

19 - Mechanistic Interpretability with Neel Nanda

How good are we at understanding the internal computation of advanced machine learning models, and do we have a hope at getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking. A short, hedged TransformerLens sketch appears after the links below.

Topics we discuss, and timestamps:
 - 00:01:05 - What is mechanistic interpretability?
 - 00:24:16 - Types of AI cognition
 - 00:54:27 - Automating mechanistic interpretability
 - 01:11:57 - Summarizing the papers
 - 01:24:43 - 'A Mathematical Framework for Transformer Circuits'
   - 01:39:31 - How attention works
   - 01:49:26 - Composing attention heads
   - 01:59:42 - Induction heads
 - 02:11:05 - 'In-context Learning and Induction Heads'
   - 02:12:55 - The multiplicity of induction heads
   - 02:30:10 - Lines of evidence
   - 02:38:47 - Evolution in loss-space
   - 02:46:19 - Mysteries of in-context learning
 - 02:50:57 - 'Progress measures for grokking via mechanistic interpretability'
   - 02:50:57 - How neural nets learn modular addition
   - 03:11:37 - The suddenness of grokking
 - 03:34:16 - Relation to other research
 - 03:43:57 - Could mechanistic interpretability possibly work?
 - 03:49:28 - Following Neel's research

The transcript: axrp.net/episode/2023/02/04/episode-19-mechanistic-interpretability-neel-nanda.html

Links to Neel's things:
 - Neel on Twitter: twitter.com/NeelNanda5
 - Neel on the Alignment Forum: alignmentforum.org/users/neel-nanda-1
 - Neel's mechanistic interpretability blog: neelnanda.io/mechanistic-interpretability
 - TransformerLens: github.com/neelnanda-io/TransformerLens
 - Concrete Steps to Get Started in Transformer Mechanistic Interpretability: alignmentforum.org/posts/9ezkEb9oGvEi6WoB3/concrete-steps-to-get-started-in-transformer-mechanistic
 - Neel on YouTube: youtube.com/@neelnanda2469
 - 200 Concrete Open Problems in Mechanistic Interpretability: alignmentforum.org/s/yivyHaCAmMJ3CqSyj
 - Comprehensive mechanistic interpretability explainer: dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J

Writings we discuss:
 - A Mathematical Framework for Transformer Circuits: transformer-circuits.pub/2021/framework/index.html
 - In-context Learning and Induction Heads: transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
 - Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.05217
 - Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper"): arxiv.org/abs/2212.14052
 - interpreting GPT: the logit lens: lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
 - Locating and Editing Factual Associations in GPT (aka the ROME paper): arxiv.org/abs/2202.05262
 - Human-level play in the game of Diplomacy by combining language models with strategic reasoning: science.org/doi/10.1126/science.ade9097
 - Causal Scrubbing: alignmentforum.org/s/h95ayYYwMebGEYN5y/p/JvZhhzycHu2Yd57RN
 - An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143
 - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small: arxiv.org/abs/2211.00593
 - Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177
 - The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models: arxiv.org/abs/2201.03544
 - Collaboration & Credit Principles: colah.github.io/posts/2019-05-Collaboration
 - Transformer Feed-Forward Layers Are Key-Value Memories: arxiv.org/abs/2012.14913
 - Multi-Component Learning and S-Curves: alignmentforum.org/posts/RKDQCB6smLWgs2Mhr/multi-component-learning-and-s-curves
 - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: arxiv.org/abs/1803.03635
 - Linear Mode Connectivity and the Lottery Ticket Hypothesis: proceedings.mlr.press/v119/frankle20a
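TransformerLens (linked above) is Neel's library for this kind of work. As a hedged sketch, assuming the API still looks roughly like its documentation (HookedTransformer.from_pretrained, run_with_cache, and the cache's ["pattern", layer] shorthand), here is a minimal induction-head probe: feed GPT-2 small a repeated random token sequence and score each attention head on how strongly it attends from a token to the position just after that token's previous occurrence. The 0.4 threshold is an arbitrary illustrative cutoff.

```python
# Minimal induction-head probe, assuming the TransformerLens API behaves as documented.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

# A block of random tokens repeated twice: induction heads should attend from each
# token in the second half to the token right after its first occurrence.
seq_len = 50
rand = torch.randint(1000, 10000, (1, seq_len))
tokens = torch.cat([rand, rand], dim=1).to(model.cfg.device)

_, cache = model.run_with_cache(tokens)

idx = torch.arange(seq_len, 2 * seq_len)  # positions in the second half
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]  # [head, query_pos, key_pos]
    # Average attention from position i to position (i - seq_len + 1), per head.
    induction_score = pattern[:, idx, idx - seq_len + 1].mean(dim=-1)
    for head in range(model.cfg.n_heads):
        score = induction_score[head].item()
        if score > 0.4:
            print(f"layer {layer} head {head}: induction score {score:.2f}")
```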
Oct 13, 2022 • 1min

New podcast - The Filan Cabinet

I have a new podcast, where I interview whoever I want about whatever I want. It's called "The Filan Cabinet", and you can find it wherever you listen to podcasts. The first three episodes are about pandemic preparedness, God, and cryptocurrency. For more details, check out the podcast website (thefilancabinet.com), or search "The Filan Cabinet" in your podcast app.
Sep 3, 2022 • 1h 46min

18 - Concept Extrapolation with Stuart Armstrong

Concept extrapolation is the idea of taking concepts an AI has about the world - say, "mass" or "does this picture contain a hot dog" - and extending them sensibly to situations where things are different - like learning that the world works via special relativity, or seeing a picture of a novel sausage-bread combination. For a while, Stuart Armstrong has been thinking about concept extrapolation and how it relates to AI alignment. In this episode, we discuss where his thoughts are at on this topic, what the relationship to AI alignment is, and what the open questions are.

Topics we discuss, and timestamps:
 - 00:00:44 - What is concept extrapolation
 - 00:15:25 - When is concept extrapolation possible
 - 00:30:44 - A toy formalism
 - 00:37:25 - Uniqueness of extrapolations
 - 00:48:34 - Unity of concept extrapolation methods
 - 00:53:25 - Concept extrapolation and corrigibility
 - 00:59:51 - Is concept extrapolation possible?
 - 01:37:05 - Misunderstandings of Stuart's approach
 - 01:44:13 - Following Stuart's work

The transcript: axrp.net/episode/2022/09/03/episode-18-concept-extrapolation-stuart-armstrong.html

Stuart's startup, Aligned AI: aligned-ai.com

Research we discuss:
 - The Concept Extrapolation sequence: alignmentforum.org/s/u9uawicHx7Ng7vwxA
 - The HappyFaces benchmark: github.com/alignedai/HappyFaces
 - Goal Misgeneralization in Deep Reinforcement Learning: arxiv.org/abs/2105.14111
Aug 21, 2022 • 1h 1min

17 - Training for Very High Reliability with Daniel Ziegler

Sometimes, people talk about making AI systems safe by taking examples where they fail and training them to do well on those. But how can we actually do this well, especially when we can't use a computer program to say what a 'failure' is? In this episode, I speak with Daniel Ziegler about his research group's efforts to try doing this with present-day language models, and what they learned. A schematic sketch of this kind of adversarial training loop appears after the links below.

Listeners beware: this episode contains a spoiler for the Animorphs franchise around minute 41 (in the 'Fanfiction' section of the transcript).

Topics we discuss, and timestamps:
 - 00:00:40 - Summary of the paper
 - 00:02:23 - Alignment as scalable oversight and catastrophe minimization
 - 00:08:06 - Novel contributions
 - 00:14:20 - Evaluating adversarial robustness
 - 00:20:26 - Adversary construction
 - 00:35:14 - The task
 - 00:38:23 - Fanfiction
 - 00:42:15 - Estimators to reduce labelling burden
 - 00:45:39 - Future work
 - 00:50:12 - About Redwood Research

The transcript: axrp.net/episode/2022/08/21/episode-17-training-for-very-high-reliability-daniel-ziegler.html

Daniel Ziegler on Google Scholar: scholar.google.com/citations?user=YzfbfDgAAAAJ

Research we discuss:
 - Daniel's paper, Adversarial Training for High-Stakes Reliability: arxiv.org/abs/2205.01663
 - Low-stakes alignment: alignmentforum.org/posts/TPan9sQFuPP6jgEJo/low-stakes-alignment
 - Red Teaming Language Models with Language Models: arxiv.org/abs/2202.03286
 - Uncertainty Estimation for Language Reward Models: arxiv.org/abs/2203.07472
 - Eliciting Latent Knowledge: docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit
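The core loop described above (find examples where the model fails catastrophically, label them, train on them, repeat) can be sketched abstractly. This is a hedged, schematic version under the assumption that you have some classifier-training routine, a pool of adversaries hunting for failures, and a labelling function; none of the names below come from the paper's actual code.

```python
# Schematic adversarial-training loop in the spirit of "Adversarial Training for
# High-Stakes Reliability". All callables are illustrative placeholders.
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label: 1 = catastrophic/"injurious", 0 = fine)
Classifier = Callable[[str], float]  # returns estimated probability of catastrophe

def adversarial_training(
    train_classifier: Callable[[List[Example]], Classifier],
    find_adversarial_examples: Callable[[Classifier], List[str]],
    label: Callable[[str], int],
    seed_data: List[Example],
    rounds: int = 3,
    threshold: float = 0.5,
) -> Classifier:
    """Repeatedly train a classifier, let adversaries hunt for failures it misses,
    label those failures, and fold them back into the training set."""
    data = list(seed_data)
    classifier = train_classifier(data)
    for _ in range(rounds):
        candidates = find_adversarial_examples(classifier)
        # Keep only genuine failures: truly bad examples the classifier lets through.
        failures = [(x, 1) for x in candidates
                    if label(x) == 1 and classifier(x) < threshold]
        if not failures:
            break
        data.extend(failures)
        classifier = train_classifier(data)
    return classifier
```

In the paper itself, the adversaries are tool-assisted humans searching for prompts that lead to injurious story completions, and the labels come from human judgement; the loop above only shows the shape of the procedure.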
Jul 1, 2022 • 1h 5min

16 - Preparing for Debate AI with Geoffrey Irving

Many people in the AI alignment space have heard of AI safety via debate - check out AXRP episode 6 (axrp.net/episode/2021/04/08/episode-6-debate-beth-barnes.html) if you need a primer. But how do we get language models to the stage where they can usefully implement debate? In this episode, I talk to Geoffrey Irving about the role of language models in AI safety, as well as three projects he's done that get us closer to making debate happen: using language models to find flaws in themselves, getting language models to back up claims they make with citations, and figuring out how uncertain language models should be about the quality of various answers. A toy sketch of the debate protocol's shape appears after the links below.

Topics we discuss, and timestamps:
 - 00:00:48 - Status update on AI safety via debate
 - 00:10:24 - Language models and AI safety
 - 00:19:34 - Red teaming language models with language models
 - 00:35:31 - GopherCite
 - 00:49:10 - Uncertainty Estimation for Language Reward Models
 - 01:00:26 - Following Geoffrey's work, and working with him

The transcript: axrp.net/episode/2022/07/01/episode-16-preparing-for-debate-ai-geoffrey-irving.html

Geoffrey's Twitter: twitter.com/geoffreyirving

Research we discuss:
 - Red Teaming Language Models With Language Models: arxiv.org/abs/2202.03286
 - Teaching Language Models to Support Answers with Verified Quotes, aka GopherCite: arxiv.org/abs/2203.11147
 - Uncertainty Estimation for Language Reward Models: arxiv.org/abs/2203.07472
 - AI Safety via Debate: arxiv.org/abs/1805.00899
 - Writeup: progress on AI safety via debate: lesswrong.com/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1
 - Eliciting Latent Knowledge: ai-alignment.com/eliciting-latent-knowledge-f977478608fc
 - Training Compute-Optimal Large Language Models, aka Chinchilla: arxiv.org/abs/2203.15556
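The debate protocol itself is specified in the "AI Safety via Debate" paper linked above. As a rough, hedged sketch of its shape (not Geoffrey's implementation, and with made-up function signatures), here is a minimal two-debater, one-judge loop where each agent is just a text-in, text-out function.

```python
# Bare-bones sketch of the debate protocol: two models argue for opposing answers
# over several rounds, and a (weaker) judge picks a winner from the transcript.
# The Agent functions are stand-ins for calls to real language models.
from typing import Callable, List

Agent = Callable[[str], str]  # prompt in, text out

def debate(question: str, answer_a: str, answer_b: str,
           debater_a: Agent, debater_b: Agent, judge: Agent,
           rounds: int = 3) -> str:
    transcript: List[str] = [f"Question: {question}",
                             f"A claims: {answer_a}",
                             f"B claims: {answer_b}"]
    for _ in range(rounds):
        # Each debater sees the transcript so far and adds one argument.
        transcript.append("A: " + debater_a("\n".join(transcript) + "\nArgue for A's answer:"))
        transcript.append("B: " + debater_b("\n".join(transcript) + "\nArgue for B's answer:"))
    return judge("\n".join(transcript) + "\nWhich answer is better, A or B?")

# Toy usage with canned agents, just to show the control flow:
if __name__ == "__main__":
    canned = lambda reply: (lambda prompt: reply)
    print(debate("Is 57 prime?", "yes", "no",
                 canned("57 has no small factors."),
                 canned("57 = 3 * 19, so it is not prime."),
                 canned("B")))
```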
May 23, 2022 • 1h 37min

15 - Natural Abstractions with John Wentworth

Why does anybody care about natural abstractions? Do they somehow relate to math, or value learning? How do E. coli bacteria find sources of sugar? All these questions and more will be answered in this interview with John Wentworth, where we talk about his research plan of understanding agency via natural abstractions. A toy run-and-tumble simulation appears after the links below.

Topics we discuss, and timestamps:
 - 00:00:31 - Agency in E. coli
 - 00:04:59 - Agency in financial markets
 - 00:08:44 - Inferring agency in real-world systems
 - 00:16:11 - Selection theorems
 - 00:20:22 - Abstraction and natural abstractions
 - 00:32:42 - Information at a distance
 - 00:39:20 - Why the natural abstraction hypothesis matters
 - 00:44:48 - Unnatural abstractions used by humans?
 - 00:49:11 - Probability, determinism, and abstraction
 - 00:52:58 - Whence probabilities in deterministic universes?
 - 01:02:37 - Abstraction and maximum entropy distributions
 - 01:07:39 - Natural abstractions and impact
 - 01:08:50 - Learning human values
 - 01:20:47 - The shape of the research landscape
 - 01:34:59 - Following John's work

The transcript: axrp.net/episode/2022/05/23/episode-15-natural-abstractions-john-wentworth.html

John on LessWrong: lesswrong.com/users/johnswentworth

Research that we discuss:
 - Alignment by default - contains the natural abstraction hypothesis: alignmentforum.org/posts/Nwgdq6kHke5LY692J/alignment-by-default#Unsupervised__Natural_Abstractions
 - The telephone theorem: alignmentforum.org/posts/jJf4FrfiQdDGg7uco/information-at-a-distance-is-mediated-by-deterministic
 - Generalizing Koopman-Pitman-Darmois: alignmentforum.org/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois
 - The plan: alignmentforum.org/posts/3L46WGauGpr7nYubu/the-plan
 - Understanding deep learning requires rethinking generalization - deep learning can fit random data: arxiv.org/abs/1611.03530
 - A closer look at memorization in deep networks - deep learning learns before memorizing: arxiv.org/abs/1706.05394
 - Zero-shot coordination: arxiv.org/abs/2003.02979
 - A new formalism, method, and open issues for zero-shot coordination: arxiv.org/abs/2106.06613
 - Conservative agency via attainable utility preservation: arxiv.org/abs/1902.09725
 - Corrigibility: intelligence.org/files/Corrigibility.pdf

Errata:
 - E. coli has ~4,400 genes, not 30,000.
 - A typical adult human body has thousands of moles of water in it, and therefore must consist of well more than 10 moles total.
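"How do E. coli bacteria find sources of sugar?" - the standard answer is the run-and-tumble strategy: keep swimming while the sensed concentration is rising, tumble to a random new direction when it stops rising. As a hedged toy illustration (a 2D simulation with made-up parameters, not anything from the episode), here is a minimal version of that strategy.

```python
# Toy 2D run-and-tumble chemotaxis: run while the sugar concentration is
# increasing, tumble (pick a random direction) otherwise. Parameters are illustrative.
import math
import random

def sugar(x: float, y: float) -> float:
    """Sugar concentration: a single smooth peak at the origin."""
    return math.exp(-(x * x + y * y) / 200.0)

def run_and_tumble(steps: int = 2000, step_size: float = 0.5) -> tuple[float, float]:
    x, y = 30.0, 30.0                      # start far from the sugar
    heading = random.uniform(0, 2 * math.pi)
    last = sugar(x, y)
    for _ in range(steps):
        x += step_size * math.cos(heading)
        y += step_size * math.sin(heading)
        now = sugar(x, y)
        if now <= last:                    # things got worse: tumble
            heading = random.uniform(0, 2 * math.pi)
        last = now
    return x, y

if __name__ == "__main__":
    fx, fy = run_and_tumble()
    print(f"final position ({fx:.1f}, {fy:.1f}), "
          f"distance from sugar peak {math.hypot(fx, fy):.1f}")
```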
Apr 5, 2022 • 1h 48min

14 - Infra-Bayesian Physicalism with Vanessa Kosoy

Late last year, Vanessa Kosoy and Alexander Appel published some research under the heading of "Infra-Bayesian physicalism". But wait - what was infra-Bayesianism again? Why should we care? And what does any of this have to do with physicalism? In this episode, I talk with Vanessa Kosoy about these questions, and get a technical overview of how infra-Bayesian physicalism works and what its implications are.

Topics we discuss, and timestamps:
 - 00:00:48 - The basics of infra-Bayes
 - 00:08:32 - An invitation to infra-Bayes
 - 00:11:23 - What is naturalized induction?
 - 00:19:53 - How infra-Bayesian physicalism helps with naturalized induction
   - 00:19:53 - Bridge rules
   - 00:22:22 - Logical uncertainty
   - 00:23:36 - Open source game theory
   - 00:28:27 - Logical counterfactuals
   - 00:30:55 - Self-improvement
 - 00:32:40 - How infra-Bayesian physicalism works
   - 00:32:47 - World models
     - 00:39:20 - Priors
     - 00:42:53 - Counterfactuals
     - 00:50:34 - Anthropics
   - 00:54:40 - Loss functions
     - 00:56:44 - The monotonicity principle
     - 01:01:57 - How to care about various things
   - 01:08:47 - Decision theory
 - 01:19:53 - Follow-up research
   - 01:20:06 - Infra-Bayesian physicalist quantum mechanics
   - 01:26:42 - Infra-Bayesian physicalist agreement theorems
 - 01:29:00 - The production of infra-Bayesianism research
 - 01:35:14 - Bridge rules and malign priors
 - 01:45:27 - Following Vanessa's work

The transcript: axrp.net/episode/2022/04/05/episode-14-infra-bayesian-physicalism-vanessa-kosoy.html

Vanessa on the Alignment Forum: alignmentforum.org/users/vanessa-kosoy

Research that we discuss:
 - Infra-Bayesian physicalism: a formal theory of naturalized induction: alignmentforum.org/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized
 - Updating ambiguous beliefs (contains the infra-Bayesian update rule): sciencedirect.com/science/article/abs/pii/S0022053183710033
 - Functional Decision Theory: A New Theory of Instrumental Rationality: arxiv.org/abs/1710.05060
 - Space-time embedded intelligence: cs.utexas.edu/~ring/Orseau,%20Ring%3B%20Space-Time%20Embedded%20Intelligence,%20AGI%202012.pdf
 - Attacking the grain of truth problem using Bayes-Savage agents (generating a simplicity prior with Knightian uncertainty using oracle machines): alignmentforum.org/posts/5bd75cc58225bf0670375273/attacking-the-grain-of-truth-problem-using-bayes-sa
 - Quantity of experience: brain-duplication and degrees of consciousness (the thick wires argument): nickbostrom.com/papers/experience.pdf
 - Online learning in unknown Markov games: arxiv.org/abs/2010.15020
 - Agreeing to disagree (contains the Aumann agreement theorem): ma.huji.ac.il/~raumann/pdf/Agreeing%20to%20Disagree.pdf
 - What does the universal prior actually look like? (aka "the Solomonoff prior is malign"): ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/
 - The Solomonoff prior is malign: alignmentforum.org/posts/Tr7tAyt5zZpdTwTQK/the-solomonoff-prior-is-malign
 - Eliciting Latent Knowledge: docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit
 - ELK Thought Dump, by Abram Demski: alignmentforum.org/posts/eqzbXmqGqXiyjX3TP/elk-thought-dump-1
Mar 31, 2022 • 1h 34min

13 - First Principles of AGI Safety with Richard Ngo

How should we think about artificial general intelligence (AGI), and the risks it might pose? What constraints exist on technical solutions to the problem of aligning superhuman AI systems with human intentions? In this episode, I talk to Richard Ngo about his report analyzing AGI safety from first principles, and recent conversations he had with Eliezer Yudkowsky about the difficulty of AI alignment.

Topics we discuss, and timestamps:
 - 00:00:40 - The nature of intelligence and AGI
   - 00:01:18 - The nature of intelligence
   - 00:06:09 - AGI: what and how
   - 00:13:30 - Single vs collective AI minds
 - 00:18:57 - AGI in practice
   - 00:18:57 - Impact
   - 00:20:49 - Timing
   - 00:25:38 - Creation
   - 00:28:45 - Risks and benefits
 - 00:35:54 - Making AGI safe
   - 00:35:54 - Robustness of the agency abstraction
   - 00:43:15 - Pivotal acts
 - 00:50:05 - AGI safety concepts
   - 00:50:05 - Alignment
   - 00:56:14 - Transparency
   - 00:59:25 - Cooperation
 - 01:01:40 - Optima and selection processes
 - 01:13:33 - The AI alignment research community
   - 01:13:33 - Updates from the Yudkowsky conversation
   - 01:17:18 - Corrections to the community
   - 01:23:57 - Why others don't join
 - 01:26:38 - Richard Ngo as a researcher
 - 01:28:26 - The world approaching AGI
 - 01:30:41 - Following Richard's work

The transcript: axrp.net/episode/2022/03/31/episode-13-first-principles-agi-safety-richard-ngo.html

Richard on the Alignment Forum: alignmentforum.org/users/ricraz
Richard on Twitter: twitter.com/RichardMCNgo
The AGI Safety Fundamentals course: eacambridge.org/agi-safety-fundamentals

Materials that we mention:
 - AGI Safety from First Principles: alignmentforum.org/s/mzgtmmTKKn5MuCzFJ
 - Conversations with Eliezer Yudkowsky: alignmentforum.org/s/n945eovrA3oDueqtq
 - The Bitter Lesson: incompleteideas.net/IncIdeas/BitterLesson.html
 - Metaphors We Live By: en.wikipedia.org/wiki/Metaphors_We_Live_By
 - The Enigma of Reason: hup.harvard.edu/catalog.php?isbn=9780674237827
 - Draft report on AI timelines, by Ajeya Cotra: alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines
 - More is Different for AI: bounded-regret.ghost.io/more-is-different-for-ai/
 - The Windfall Clause: fhi.ox.ac.uk/windfallclause
 - Cooperative Inverse Reinforcement Learning: arxiv.org/abs/1606.03137
 - Imitative Generalisation: alignmentforum.org/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1
 - Eliciting Latent Knowledge: docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit
 - Draft report on existential risk from power-seeking AI, by Joseph Carlsmith: alignmentforum.org/posts/HduCjmXTBD4xYTegv/draft-report-on-existential-risk-from-power-seeking-ai
 - The Most Important Century: cold-takes.com/most-important-century
