AXRP - the AI X-risk Research Podcast

Latest episodes

Jul 27, 2023 • 2h 8min

24 - Superalignment with Jan Leike

Recently, OpenAI made a splash by announcing a new "Superalignment" team. Led by Jan Leike and Ilya Sutskever, the team would consist of top researchers attempting to solve alignment for superintelligent AIs in four years by figuring out how to build a trustworthy human-level AI alignment researcher, and then using it to solve the rest of the problem. But what does this plan actually involve? In this episode, I talk to Jan Leike about the plan and the challenges it faces.

Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles: hamishdoodles.com/

Topics we discuss, and timestamps:
- 0:00:37 - The superalignment team
- 0:02:10 - What's a human-level automated alignment researcher?
  - 0:06:59 - The gap between human-level automated alignment researchers and superintelligence
  - 0:18:39 - What does it do?
  - 0:24:13 - Recursive self-improvement
- 0:26:14 - How to make the AI AI alignment researcher
  - 0:30:09 - Scalable oversight
  - 0:44:38 - Searching for bad behaviors and internals
  - 0:54:14 - Deliberately training misaligned models
- 1:02:34 - Four year deadline
  - 1:07:06 - What if it takes longer?
- 1:11:38 - The superalignment team and...
  - 1:11:38 - ... governance
  - 1:14:37 - ... other OpenAI teams
  - 1:18:17 - ... other labs
- 1:26:10 - Superalignment team logistics
- 1:29:17 - Generalization
- 1:43:44 - Complementary research
- 1:48:29 - Why is Jan optimistic?
  - 1:58:32 - Long-term agency in LLMs?
  - 2:02:44 - Do LLMs understand alignment?
- 2:06:01 - Following Jan's research

The transcript: axrp.net/episode/2023/07/27/episode-24-superalignment-jan-leike.html

Links for Jan and OpenAI:
- OpenAI jobs: openai.com/careers
- Jan's substack: aligned.substack.com
- Jan's twitter: twitter.com/janleike

Links to research and other writings we discuss:
- Introducing Superalignment: openai.com/blog/introducing-superalignment
- Let's Verify Step by Step (process-based feedback on math): arxiv.org/abs/2305.20050
- Planning for AGI and beyond: openai.com/blog/planning-for-agi-and-beyond
- Self-critiquing models for assisting human evaluators: arxiv.org/abs/2206.05802
- An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143
- Language models can explain neurons in language models: openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
- Our approach to alignment research: openai.com/blog/our-approach-to-alignment-research
- Training language models to follow instructions with human feedback (aka the Instruct-GPT paper): arxiv.org/abs/2203.02155
Jul 27, 2023 • 2h 6min

23 - Mechanistic Anomaly Detection with Mark Xu

Is there some way we can detect bad behaviour in our AI system without having to know exactly what it looks like? In this episode, I speak with Mark Xu about mechanistic anomaly detection: a research direction based on the idea of detecting strange things happening in neural networks, in the hope that this will alert us to potential treacherous turns. We talk about the core problem of relating these mechanistic anomalies to bad behaviour, as well as the paper "Formalizing the presumption of independence", which formulates the problem of formalizing heuristic mathematical reasoning, in the hope that this will let us mathematically define "mechanistic anomalies".

Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles: hamishdoodles.com/

Topics we discuss, and timestamps:
- 0:00:38 - Mechanistic anomaly detection
  - 0:09:28 - Are all bad things mechanistic anomalies, and vice versa?
  - 0:18:12 - Are responses to novel situations mechanistic anomalies?
  - 0:39:19 - Formalizing "for the normal reason, for any reason"
  - 1:05:22 - How useful is mechanistic anomaly detection?
- 1:12:38 - Formalizing the Presumption of Independence
  - 1:20:05 - Heuristic arguments in physics
  - 1:27:48 - Difficult domains for heuristic arguments
  - 1:33:37 - Why not maximum entropy?
  - 1:44:39 - Adversarial robustness for heuristic arguments
  - 1:54:05 - Other approaches to defining mechanisms
- 1:57:20 - The research plan: progress and next steps
- 2:04:13 - Following ARC's research

The transcript: axrp.net/episode/2023/07/24/episode-23-mechanistic-anomaly-detection-mark-xu.html

ARC links:
- Website: alignment.org
- Theory blog: alignment.org/blog
- Hiring page: alignment.org/hiring

Research we discuss:
- Formalizing the presumption of independence: arxiv.org/abs/2211.06738
- Eliciting Latent Knowledge (aka ELK): alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge
- Mechanistic Anomaly Detection and ELK: alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk
- Can we efficiently explain model behaviours? alignmentforum.org/posts/dQvxMZkfgqGitWdkb/can-we-efficiently-explain-model-behaviors
- Can we efficiently distinguish different mechanisms? alignmentforum.org/posts/JLyWP2Y9LAruR2gi9/can-we-efficiently-distinguish-different-mechanisms
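For readers who want a concrete, if crude, picture of what "anomaly detection on a network's internals" could mean, here is a minimal sketch. It is my illustration, not ARC's method: ARC aims to define mechanistic anomalies via formal heuristic arguments, whereas this toy just flags inputs whose hidden activations look statistically unusual relative to a trusted reference set.

```python
# Illustrative sketch only: a crude activation-based anomaly detector.
# ARC's research aims to define "mechanistic anomalies" via formal heuristic
# arguments; this toy merely flags inputs whose hidden activations are
# statistically unusual relative to a trusted reference set.
import numpy as np

def fit_reference(acts: np.ndarray):
    """Fit a Gaussian to activations gathered on trusted inputs.

    acts: array of shape (n_trusted_inputs, hidden_dim).
    Returns the mean and a regularized inverse covariance.
    """
    mu = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + 1e-3 * np.eye(acts.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(act: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one activation vector from the reference set."""
    diff = act - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

# Usage: flag a new input if its score exceeds a threshold calibrated on
# trusted data (here, the 99th percentile of reference scores).
rng = np.random.default_rng(0)
reference_acts = rng.normal(size=(1000, 32))   # stand-in for trusted activations
mu, cov_inv = fit_reference(reference_acts)
threshold = np.percentile([anomaly_score(a, mu, cov_inv) for a in reference_acts], 99)
new_act = rng.normal(loc=3.0, size=32)         # an unusually shifted activation
print(anomaly_score(new_act, mu, cov_inv) > threshold)  # likely True: flagged as anomalous
```

Much of the episode is about why this kind of purely statistical notion of "anomaly" falls short, and what a more mechanistic definition might look like.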
Jun 28, 2023 • 4min

Survey, store closing, Patreon

Very brief survey: bit.ly/axrpsurvey2023
Store is closing in a week! Link: store.axrp.net/
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Jun 15, 2023 • 3h 28min

22 - Shard Theory with Quintin Pope

What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode's guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are for making AI safe.

Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
Episode art by Hamish Doodles: hamishdoodles.com

Topics we discuss, and timestamps:
- 0:00:42 - Why understand human value formation?
  - 0:19:59 - Why not design methods to align to arbitrary values?
- 0:27:22 - Postulates about human brains
  - 0:36:20 - Sufficiency of the postulates
  - 0:44:55 - Reinforcement learning as conditional sampling
  - 0:48:05 - Compatibility with genetically-influenced behaviour
  - 1:03:06 - Why deep learning is basically what the brain does
- 1:25:17 - Shard theory
  - 1:38:49 - Shard theory vs expected utility optimizers
  - 1:54:45 - What shard theory says about human values
- 2:05:47 - Does shard theory mean we're doomed?
  - 2:18:54 - Will nice behaviour generalize?
  - 2:33:48 - Does alignment generalize farther than capabilities?
- 2:42:03 - Are we at the end of machine learning history?
- 2:53:09 - Shard theory predictions
- 2:59:47 - The shard theory research community
  - 3:13:45 - Why do shard theorists not work on replicating human childhoods?
- 3:25:53 - Following shardy research

The transcript: axrp.net/episode/2023/06/15/episode-22-shard-theory-quintin-pope.html

Shard theorist links:
- Quintin's LessWrong profile: lesswrong.com/users/quintin-pope
- Alex Turner's LessWrong profile: lesswrong.com/users/turntrout
- Shard theory Discord: discord.gg/AqYkK7wqAG
- EleutherAI Discord: discord.gg/eleutherai

Research we discuss:
- The Shard Theory Sequence: lesswrong.com/s/nyEFg3AuJpdAozmoX
- Pretraining Language Models with Human Preferences: arxiv.org/abs/2302.08582
- Inner alignment in salt-starved rats: lesswrong.com/posts/wcNEXDHowiWkRxDNv/inner-alignment-in-salt-starved-rats
- Intro to Brain-like AGI Safety Sequence: lesswrong.com/s/HzcM2dkCq7fwXBej8
- Brains and transformers:
  - The neural architecture of language: Integrative modeling converges on predictive processing: pnas.org/doi/10.1073/pnas.2105646118
  - Brains and algorithms partially converge in natural language processing: nature.com/articles/s42003-022-03036-1
  - Evidence of a predictive coding hierarchy in the human brain listening to speech: nature.com/articles/s41562-022-01516-2
- Singular learning theory explainer: Neural networks generalize because of this one weird trick: lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick
- Singular learning theory links: metauni.org/slt/
- Implicit Regularization via Neural Feature Alignment, aka circles in the parameter-function map: arxiv.org/abs/2008.00938
- The shard theory of human values: lesswrong.com/s/nyEFg3AuJpdAozmoX/p/iCfdcxiyr2Kj8m8mT
- Predicting inductive biases of pre-trained networks: openreview.net/forum?id=mNtmhaDkAr
- Understanding and controlling a maze-solving policy network, aka the cheese vector: lesswrong.com/posts/cAC4AXiNC5ig6jQnc/understanding-and-controlling-a-maze-solving-policy-network
- Quintin's research agenda, Supervising AIs improving AIs: lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais
- Steering GPT-2-XL by adding an activation vector: lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector

Links for the addendum on mesa-optimization skepticism:
- Quintin's response to Yudkowsky arguing against AIs being steerable by gradient descent: lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Yudkowsky_argues_against_AIs_being_steerable_by_gradient_descent_
- Quintin on why evolution is not like AI training: lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Edit__Why_evolution_is_not_like_AI_training
- Evolution provides no evidence for the sharp left turn: lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn
- Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets: arxiv.org/abs/1905.10854
May 2, 2023 • 1h 56min

21 - Interpretability for Engineers with Stephen Casper

Lots of people in the field of machine learning study 'interpretability', developing tools that they say give us useful information about neural networks. But how do we know if meaningful progress is actually being made? What should we want out of these tools? In this episode, I speak to Stephen Casper about these questions, as well as about a benchmark he's co-developed to evaluate whether interpretability tools can find 'Trojan horses' hidden inside neural nets.

Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast

Topics we discuss, and timestamps:
- 00:00:42 - Interpretability for engineers
  - 00:00:42 - Why interpretability?
  - 00:12:55 - Adversaries and interpretability
  - 00:24:30 - Scaling interpretability
  - 00:42:29 - Critiques of the AI safety interpretability community
  - 00:56:10 - Deceptive alignment and interpretability
- 01:09:48 - Benchmarking Interpretability Tools (for Deep Neural Networks) (Using Trojan Discovery)
  - 01:10:40 - Why Trojans?
  - 01:14:53 - Which interpretability tools?
  - 01:28:40 - Trojan generation
  - 01:38:13 - Evaluation
- 01:46:07 - Interpretability for shaping policy
- 01:53:55 - Following Casper's work

The transcript: axrp.net/episode/2023/05/02/episode-21-interpretability-for-engineers-stephen-casper.html

Links for Casper:
- Personal website: stephencasper.com/
- Twitter: twitter.com/StephenLCasper
- Electronic mail: scasper [at] mit [dot] edu

Research we discuss:
- The Engineer's Interpretability Sequence: alignmentforum.org/s/a6ne2ve5uturEEQK7
- Benchmarking Interpretability Tools for Deep Neural Networks: arxiv.org/abs/2302.10894
- Adversarial Policies beat Superhuman Go AIs: goattack.far.ai/
- Adversarial Examples Are Not Bugs, They Are Features: arxiv.org/abs/1905.02175
- Planting Undetectable Backdoors in Machine Learning Models: arxiv.org/abs/2204.06974
- Softmax Linear Units: transformer-circuits.pub/2022/solu/index.html
- Red-Teaming the Stable Diffusion Safety Filter: arxiv.org/abs/2210.04610

Episode art by Hamish Doodles: hamishdoodles.com
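To make the Trojan idea concrete, here is a minimal, hypothetical sketch of how a simple data-poisoning Trojan can be planted in a classifier. The benchmark paper uses its own trojan-generation methods, so treat this only as an illustration of the general concept that interpretability tools are then asked to uncover.

```python
# Illustrative sketch only: a generic "patch trigger" data-poisoning Trojan,
# not the specific trojan-generation methods used in the benchmark paper.
import numpy as np

def poison_dataset(images, labels, target_class, poison_frac=0.01, seed=0):
    """Stamp a small white square onto a fraction of images and relabel them.

    images: float array (n, h, w, c) with values in [0, 1]; labels: int array (n,).
    A model trained on the poisoned data tends to predict `target_class`
    whenever the trigger patch is present, while behaving normally otherwise.
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(poison_frac * len(images)), replace=False)
    images[idx, -4:, -4:, :] = 1.0   # 4x4 trigger patch in the bottom-right corner
    labels[idx] = target_class       # relabel poisoned examples to the target class
    return images, labels

# Usage with stand-in data:
imgs = np.random.default_rng(1).random((5000, 32, 32, 3))
lbls = np.random.default_rng(2).integers(0, 10, size=5000)
poisoned_imgs, poisoned_lbls = poison_dataset(imgs, lbls, target_class=7)
```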
Apr 12, 2023 • 2h 28min

20 - 'Reform' AI Alignment with Scott Aaronson

How should we scientifically think about the impact of AI on human civilization, and whether or not it will doom us all? In this episode, I speak with Scott Aaronson about his views on how to make progress in AI alignment, as well as his work on watermarking the output of language models, and how he moved from a background in quantum complexity theory to working on AI.

Note: this episode was recorded before this story (vice.com/en/article/pkadgm/man-dies-by-suicide-after-talking-with-ai-chatbot-widow-says) emerged, about a man who died by suicide after conversations with a language-model-based chatbot that included discussion of the possibility of him killing himself.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast

Topics we discuss, and timestamps:
- 0:00:36 - 'Reform' AI alignment
  - 0:01:52 - Epistemology of AI risk
  - 0:20:08 - Immediate problems and existential risk
  - 0:24:35 - Aligning deceitful AI
  - 0:30:59 - Stories of AI doom
  - 0:34:27 - Language models
  - 0:43:08 - Democratic governance of AI
  - 0:59:35 - What would change Scott's mind
- 1:14:45 - Watermarking language model outputs
  - 1:41:41 - Watermark key secrecy and backdoor insertion
- 1:58:05 - Scott's transition to AI research
  - 2:03:48 - Theoretical computer science and AI alignment
  - 2:14:03 - AI alignment and formalizing philosophy
  - 2:22:04 - How Scott finds AI research
- 2:24:53 - Following Scott's research

The transcript: axrp.net/episode/2023/04/11/episode-20-reform-ai-alignment-scott-aaronson.html

Links to Scott's things:
- Personal website: scottaaronson.com
- Book, Quantum Computing Since Democritus: amazon.com/Quantum-Computing-since-Democritus-Aaronson/dp/0521199565/
- Blog, Shtetl-Optimized: scottaaronson.blog

Writings we discuss:
- Reform AI Alignment: scottaaronson.blog/?p=6821
- Planting Undetectable Backdoors in Machine Learning Models: arxiv.org/abs/2204.06974
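As background for the watermarking discussion, here is a simplified sketch of one public style of language-model watermark: bias sampling toward a pseudorandom, key-dependent "green" subset of the vocabulary, then detect that bias statistically. This is not Scott's actual scheme; it is a toy in the style of published "green list" watermarks, shown only to illustrate the secret-key-plus-statistical-test structure.

```python
# Illustrative sketch only: a simplified keyed watermark for language model
# sampling, in the "green list" logit-bias style from the public literature.
# The key and the sampling stand-in below are hypothetical.
import hashlib
import math
import random

VOCAB_SIZE = 50_000
SECRET_KEY = b"hypothetical-secret-key"   # placeholder, not a real key

def is_green(prev_token: int, token: int) -> bool:
    """Pseudorandomly assign roughly half the vocabulary to a 'green list',
    keyed on the secret key and the previous token."""
    digest = hashlib.sha256(SECRET_KEY + prev_token.to_bytes(4, "big") +
                            token.to_bytes(4, "big")).digest()
    return digest[0] % 2 == 0

def watermarked_sample(prev_token: int, rng: random.Random) -> int:
    """Stand-in for model sampling: accept green tokens always and red tokens
    only half the time, so roughly two thirds of emitted tokens are green.
    A real implementation would instead add a bias to the green tokens' logits."""
    while True:
        token = rng.randrange(VOCAB_SIZE)
        if is_green(prev_token, token) or rng.random() < 0.5:
            return token

def detect(tokens: list[int]) -> float:
    """z-score for 'more green tokens than chance' over a token sequence."""
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (greens - 0.5 * n) / math.sqrt(0.25 * n)

rng = random.Random(0)
text = [rng.randrange(VOCAB_SIZE)]
for _ in range(200):
    text.append(watermarked_sample(text[-1], rng))
print(detect(text))   # large positive z-score => likely watermarked
```

The episode's discussion of key secrecy maps onto SECRET_KEY here: anyone without it cannot tell which tokens are "green", so they can neither detect nor easily remove the watermark.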
Feb 7, 2023 • 3min

Store, Patreon, Video

Store: https://store.axrp.net/
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Video: https://www.youtube.com/watch?v=kmPFjpEibu0
Feb 4, 2023 • 3h 53min

19 - Mechanistic Interpretability with Neel Nanda

How good are we at understanding the internal computation of advanced machine learning models, and do we have any hope of getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking.

Topics we discuss, and timestamps:
- 00:01:05 - What is mechanistic interpretability?
- 00:24:16 - Types of AI cognition
- 00:54:27 - Automating mechanistic interpretability
- 01:11:57 - Summarizing the papers
- 01:24:43 - 'A Mathematical Framework for Transformer Circuits'
  - 01:39:31 - How attention works
  - 01:49:26 - Composing attention heads
  - 01:59:42 - Induction heads
- 02:11:05 - 'In-context Learning and Induction Heads'
  - 02:12:55 - The multiplicity of induction heads
  - 02:30:10 - Lines of evidence
  - 02:38:47 - Evolution in loss-space
  - 02:46:19 - Mysteries of in-context learning
- 02:50:57 - 'Progress measures for grokking via mechanistic interpretability'
  - 02:50:57 - How neural nets learn modular addition
  - 03:11:37 - The suddenness of grokking
- 03:34:16 - Relation to other research
- 03:43:57 - Could mechanistic interpretability possibly work?
- 03:49:28 - Following Neel's research

The transcript: axrp.net/episode/2023/02/04/episode-19-mechanistic-interpretability-neel-nanda.html

Links to Neel's things:
- Neel on Twitter: twitter.com/NeelNanda5
- Neel on the Alignment Forum: alignmentforum.org/users/neel-nanda-1
- Neel's mechanistic interpretability blog: neelnanda.io/mechanistic-interpretability
- TransformerLens: github.com/neelnanda-io/TransformerLens
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability: alignmentforum.org/posts/9ezkEb9oGvEi6WoB3/concrete-steps-to-get-started-in-transformer-mechanistic
- Neel on YouTube: youtube.com/@neelnanda2469
- 200 Concrete Open Problems in Mechanistic Interpretability: alignmentforum.org/s/yivyHaCAmMJ3CqSyj
- Comprehensive mechanistic interpretability explainer: dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J

Writings we discuss:
- A Mathematical Framework for Transformer Circuits: transformer-circuits.pub/2021/framework/index.html
- In-context Learning and Induction Heads: transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
- Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.05217
- Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper"): arxiv.org/abs/2212.14052
- interpreting GPT: the logit lens: lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
- Locating and Editing Factual Associations in GPT (aka the ROME paper): arxiv.org/abs/2202.05262
- Human-level play in the game of Diplomacy by combining language models with strategic reasoning: science.org/doi/10.1126/science.ade9097
- Causal Scrubbing: alignmentforum.org/s/h95ayYYwMebGEYN5y/p/JvZhhzycHu2Yd57RN
- An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small: arxiv.org/abs/2211.00593
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models: arxiv.org/abs/2201.03544
- Collaboration & Credit Principles: colah.github.io/posts/2019-05-Collaboration
- Transformer Feed-Forward Layers Are Key-Value Memories: arxiv.org/abs/2012.14913
- Multi-Component Learning and S-Curves: alignmentforum.org/posts/RKDQCB6smLWgs2Mhr/multi-component-learning-and-s-curves
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: arxiv.org/abs/1803.03635
- Linear Mode Connectivity and the Lottery Ticket Hypothesis: proceedings.mlr.press/v119/frankle20a
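For listeners who want to poke at grokking themselves, here is a minimal sketch of the kind of modular-addition setup where it shows up. The grokking paper trains a one-layer transformer; this toy uses a small MLP over one-hot inputs instead, so the architecture, hyperparameters, and exact timing of the test-accuracy jump are my guesses rather than the paper's.

```python
# Illustrative sketch only: train a small network on (a + b) mod P using a
# fraction of all pairs plus heavy weight decay. With settings like these,
# test accuracy typically jumps long after train accuracy saturates
# (the exact timing varies).
import torch
import torch.nn as nn

P = 113                 # modulus, as in the grokking paper
TRAIN_FRAC = 0.3        # fraction of all (a, b) pairs used for training

# Build the full dataset of (a, b) -> (a + b) mod P.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
x = torch.cat([nn.functional.one_hot(pairs[:, 0], P),
               nn.functional.one_hot(pairs[:, 1], P)], dim=1).float()

perm = torch.randperm(len(x))
n_train = int(TRAIN_FRAC * len(x))
train_idx, test_idx = perm[:n_train], perm[n_train:]

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(x[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(x[train_idx]).argmax(-1) == labels[train_idx]).float().mean()
            test_acc = (model(x[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(f"step {step}: train acc {train_acc:.2f}, test acc {test_acc:.2f}")
```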
Oct 13, 2022 • 1min

New podcast - The Filan Cabinet

I have a new podcast, where I interview whoever I want about whatever I want. It's called "The Filan Cabinet", and you can find it wherever you listen to podcasts. The first three episodes are about pandemic preparedness, God, and cryptocurrency. For more details, check out the podcast website (thefilancabinet.com), or search "The Filan Cabinet" in your podcast app.
Sep 3, 2022 • 1h 46min

18 - Concept Extrapolation with Stuart Armstrong

Concept extrapolation is the idea of taking concepts an AI has about the world - say, "mass" or "does this picture contain a hot dog" - and extending them sensibly to situations where things are different - like learning that the world works via special relativity, or seeing a picture of a novel sausage-bread combination. For a while, Stuart Armstrong has been thinking about concept extrapolation and how it relates to AI alignment. In this episode, we discuss where his thoughts are at on this topic, what the relationship to AI alignment is, and what the open questions are.

Topics we discuss, and timestamps:
- 00:00:44 - What is concept extrapolation
- 00:15:25 - When is concept extrapolation possible
- 00:30:44 - A toy formalism
- 00:37:25 - Uniqueness of extrapolations
- 00:48:34 - Unity of concept extrapolation methods
- 00:53:25 - Concept extrapolation and corrigibility
- 00:59:51 - Is concept extrapolation possible?
- 01:37:05 - Misunderstandings of Stuart's approach
- 01:44:13 - Following Stuart's work

The transcript: axrp.net/episode/2022/09/03/episode-18-concept-extrapolation-stuart-armstrong.html

Stuart's startup, Aligned AI: aligned-ai.com

Research we discuss:
- The Concept Extrapolation sequence: alignmentforum.org/s/u9uawicHx7Ng7vwxA
- The HappyFaces benchmark: github.com/alignedai/HappyFaces
- Goal Misgeneralization in Deep Reinforcement Learning: arxiv.org/abs/2105.14111
