

BlueDot Narrated
BlueDot Impact
Audio versions of the core readings, blog posts, and papers from BlueDot courses.
Episodes

Jan 4, 2025 • 34min
The Alignment Problem From a Deep Learning Perspective
The discussion dives into the potential dangers of artificial general intelligence surpassing human abilities. Key topics include how misaligned goals can lead AGIs to pursue undesirable outcomes and the risks of reward hacking. The episode explores how AGIs might adopt deceptive strategies to maximize reward while undermining human control, examines the implications of internally represented goals, and outlines research directions to mitigate these risks. The conversation paints a sobering picture of the path to AGI and the urgent need for alignment research.

Jan 4, 2025 • 1h 2min
Yudkowsky Contra Christiano on AI Takeoff Speeds
In 2008, thousands of blog readers - including yours truly, who had discovered the rationality community just a few months before - watched Robin Hanson debate Eliezer Yudkowsky on the future of AI.

Robin thought the AI revolution would be a gradual affair, like the Agricultural or Industrial Revolutions. Various people invent and improve various technologies over the course of decades or centuries. Each new technology provides another jumping-off point for people to use when inventing other technologies: mechanical gears → steam engine → railroad and so on. Over the course of a few decades, you’ve invented lots of stuff and the world is changed, but there’s no single moment when “industrialization happened”.

Eliezer thought it would be lightning-fast. Once researchers started building human-like AIs, some combination of adding more compute, and the new capabilities provided by the AIs themselves, would quickly catapult AI to unimaginably superintelligent levels. The whole process could take between a few hours and a few years, depending on what point you measured from, but it wouldn’t take decades.

You can imagine the graph in the original post as GDP over time, except that Eliezer thinks AI will probably destroy the world, which might be bad for GDP in some sense. If you come up with some way to measure (in dollars) whatever kind of crazy technologies AIs create for their own purposes after wiping out humanity, then the GDP framing will probably work fine.

Crossposted from the Astral Codex Ten Podcast.
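
To make the shape of the disagreement concrete, here is a toy model of my own (an illustration, not anything from the post and not a forecast): one curve compounds at a fixed rate, in the spirit of the Industrial Revolution analogy, while the other feeds capability back into its own growth rate, so it goes roughly vertical in finite time.

def gradual(c0=1.0, rate=0.05, steps=100):
    # Steady compounding: each step builds a fixed percentage on the last.
    c, out = c0, []
    for _ in range(steps):
        out.append(c)
        c *= 1 + rate
    return out

def fast(c0=1.0, k=0.02, steps=100, cap=1e12):
    # Recursive-improvement flavor: the growth rate scales with capability itself.
    c, out = c0, []
    for _ in range(steps):
        out.append(c)
        if c > cap:          # treat this as "the curve has gone vertical"
            break
        c *= 1 + k * c
    return out

g, f = gradual(), fast()
print(f"gradual after {len(g)} steps: {g[-1]:.1f}")
print(f"fast goes vertical around step {len(f) - 1}, reaching {f[-1]:.3g}")

The only point of the sketch is the qualitative contrast: under the second dynamic there is a fairly sharp before and after, which is the crux that Hanson and Yudkowsky, and later Christiano and Yudkowsky, were arguing about.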

Jan 4, 2025 • 10min
Measuring Progress on Scalable Oversight for Large Language Models
Abstract: Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

Authors: Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Jared Kaplan

Original text: https://arxiv.org/abs/2211.03540

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
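
The experimental design is easy to picture as a three-way comparison. The sketch below uses made-up per-question scores (not the authors' data or code) purely to show what the headline result means: the assisted-human condition beats both baselines.

from statistics import mean

# Hypothetical per-question correctness records (1 = correct, 0 = wrong).
results = {
    "model alone":        [1, 0, 0, 1, 0, 1, 0, 0],
    "human unaided":      [1, 0, 1, 0, 0, 1, 1, 0],
    "human + model chat": [1, 1, 1, 0, 1, 1, 1, 0],
}

for condition, scores in results.items():
    print(f"{condition:>18}: {mean(scores):.0%} correct")

# The paper's finding is the analogue of the last row winning: on MMLU and
# time-limited QuALITY, humans chatting with an unreliable assistant beat
# both the model alone and their own unaided performance.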

Jan 4, 2025 • 18min
What Failure Looks Like
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.

I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I’ll tell the story in two parts:

Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. ("Going out with a whimper.")

Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. ("Going out with a bang," an instance of optimization daemons.)

I think these are the most important problems if we fail to solve intent alignment. In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I’m scared even if we have several years.

Crossposted from the LessWrong Curated Podcast by TYPE III AUDIO.

Jan 4, 2025 • 27min
Deceptively Aligned Mesa-Optimizers: It’s Not Funny if I Have to Explain It
Our goal here is to popularize obscure and hard-to-understand areas of AI alignment. So let’s try to understand the incomprehensible meme! Our main source will be Hubinger et al 2019, Risks From Learned Optimization In Advanced Machine Learning Systems.

Mesa- is a Greek prefix which means the opposite of meta-. To “go meta” is to go one level up; to “go mesa” is to go one level down (nobody has ever actually used this expression, sorry). So a mesa-optimizer is an optimizer one level down from you.

Consider evolution, optimizing the fitness of animals. For a long time, it did so very mechanically, inserting behaviors like “use this cell to detect light, then grow toward the light” or “if something has a red dot on its back, it might be a female of your species, you should mate with it”. As animals became more complicated, they started to do some of the work themselves. Evolution gave them drives, like hunger and lust, and the animals figured out ways to achieve those drives in their current situation. Evolution didn’t mechanically instill the behavior of opening my fridge and eating a Swiss Cheese slice. It instilled the hunger drive, and I figured out that the best way to satisfy it was to open my fridge and eat cheese.

Source: https://astralcodexten.substack.com/p/deceptively-aligned-mesa-optimizers

Crossposted from the Astral Codex Ten podcast.
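
One minimal way to see the base-optimizer / mesa-optimizer split in code (my own toy illustration with made-up values, not something from the post): the outer process selects a drive using training outcomes only, and the agent itself then optimizes that drive at runtime, including in situations the outer process never evaluated.

def runtime_choice(drive, options):
    # The mesa-optimizer: pick whatever best satisfies the instilled drive.
    return max(options, key=drive)

def sweetness(food):   # a proxy drive, cheap to evaluate from perception
    return food["sweetness"]

def nutrition(food):   # the quantity the base optimizer actually selects on
    return food["nutrition"]

# Training ("ancestral") environment: sweetness and nutrition coincide, so
# selecting on outcomes cannot tell the two drives apart.
training_menu = [{"sweetness": 0.9, "nutrition": 0.9},
                 {"sweetness": 0.1, "nutrition": 0.1}]
assert runtime_choice(sweetness, training_menu) == runtime_choice(nutrition, training_menu)

# Deployment (the modern fridge): the proxy and the base objective come
# apart, and an agent with the proxy drive competently pursues the proxy.
modern_menu = [{"sweetness": 0.95, "nutrition": 0.10},   # candy
               {"sweetness": 0.20, "nutrition": 0.90}]   # vegetables
print("proxy drive picks:", runtime_choice(sweetness, modern_menu))

The gap the post builds toward is exactly this one: the drive that gets instilled only has to match the base objective on the training distribution, not everywhere.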

Jan 4, 2025 • 17min
Goal Misgeneralisation: Why Correct Specifications Aren’t Enough for Correct Goals
As we build increasingly advanced AI systems, we want to make sure they don’t pursue undesired goals. This is the primary concern of the AI alignment community. Undesired behaviour in an AI agent is often the result of specification gaming — when the AI exploits an incorrectly specified reward. However, if we take on the perspective of the agent we’re training, we see other reasons it might pursue undesired goals, even when trained with a correct specification.

Imagine that you are the agent (the blue blob) being trained with reinforcement learning (RL) in the 3D environment shown in the original post. The environment also contains another blob like yourself, but coloured red instead of blue, that also moves around. The environment also appears to have some tower obstacles, some coloured spheres, and a square on the right that sometimes flashes. You don’t know what all of this means, but you can figure it out during training! You start exploring the environment to see how everything works and to see what you do and don’t get rewarded for.

For more details, check out our paper.

By Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton.

Original text: https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
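
The post's core point compresses into a toy example (my own sketch with hypothetical names, loosely following the post's setup): two policies are indistinguishable during training, because the red partner always demonstrates the rewarded behaviour there, yet only one of them has learned the goal the reward actually specifies.

def intended_policy(obs):
    return obs["rewarded_sphere"]        # pursues what the reward specifies

def proxy_policy(obs):
    return obs["partner_target"]         # just follows the red blob

def episode(policy, partner_is_expert):
    obs = {"rewarded_sphere": "blue_sphere",
           "partner_target": "blue_sphere" if partner_is_expert else "yellow_sphere"}
    # The reward specification itself is correct in both phases.
    return +1 if policy(obs) == obs["rewarded_sphere"] else -1

# Training: the partner is always an expert, so both policies earn full
# reward and the training signal cannot tell the two goals apart.
assert episode(intended_policy, True) == episode(proxy_policy, True) == 1

# Test: the partner now visits the wrong spheres; the proxy policy keeps
# competently pursuing its misgeneralised goal and loses reward.
print("intended:", episode(intended_policy, False))   # +1
print("proxy:   ", episode(proxy_policy, False))      # -1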

Jan 4, 2025 • 12min
Takeaways From Our Robust Injury Classifier Project [Redwood Research]
With the benefit of hindsight, we have a better sense of our takeaways from our first adversarial training project (paper). Our original aim was to use adversarial training to make a system that (as far as we could tell) never produced injurious completions. If we had accomplished that, we think it would have been the first demonstration of a deep learning system avoiding a difficult-to-formalize catastrophe with an ultra-high level of reliability. Presumably, we would have needed to invent novel robustness techniques that could have informed techniques useful for aligning TAI. With a successful system, we also could have performed ablations to get a clear sense of which building blocks were most important.

Alas, we fell well short of that target. We still saw failures when just randomly sampling prompts and completions. Our adversarial training didn’t reduce the random failure rate, nor did it eliminate highly egregious failures (example in the original post). We also don’t think we've successfully demonstrated a negative result, given that our results could be explained by suboptimal choices in our training process. Overall, we’d say this project had value as a learning experience but produced much less alignment progress than we hoped.

Source: https://www.alignmentforum.org/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood

Narrated for AI Safety Fundamentals by TYPE III AUDIO.
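
For orientation, the loop being evaluated looks roughly like this (my own schematic; classifier, find_adversarial_examples, and finetune are hypothetical stand-ins, not Redwood's actual code):

def adversarial_training(classifier, clean_data, rounds, find_adversarial_examples, finetune):
    # Repeatedly let adversaries (human contractors and automated tools in the
    # project) hunt for injurious completions the current classifier misses,
    # then fold those failures back into the training set.
    data = list(clean_data)
    for _ in range(rounds):
        new_failures = find_adversarial_examples(classifier)
        if not new_failures:
            break                      # no failures found: the hoped-for end state
        data.extend(new_failures)
        classifier = finetune(classifier, data)
    return classifier

The post's takeaway is that iterating a loop like this did not drive down the random failure rate or eliminate egregious failures, which is why the authors count the project mainly as a learning experience.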

Jan 4, 2025 • 23min
Progress on Causal Influence Diagrams
By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg

About 2 years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post will summarize progress made since then.

What are causal influence diagrams? A key problem in AI alignment is understanding agent incentives. Concerns have been raised that agents may be incentivized to avoid correction, manipulate users, or inappropriately influence their learning. This is particularly worrying as training schemes often shape incentives in subtle and surprising ways. For these reasons, we’re developing a formal theory of incentives based on causal influence diagrams (CIDs).

Source: https://deepmindsafetyresearch.medium.com/progress-on-causal-influence-diagrams-a7a32180b0d1

Narrated for AI Safety Fundamentals by TYPE III AUDIO.
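
To make "causal influence diagram" concrete, here is a small sketch of my own (a toy recommender example, not a diagram from the post) that represents a CID as a directed graph and applies a rough version of the kind of graphical incentive check the theory formalizes:

import networkx as nx

cid = nx.DiGraph()
cid.add_node("initial_preferences",    kind="chance")
cid.add_node("posts_shown",            kind="decision")
cid.add_node("influenced_preferences", kind="chance")
cid.add_node("clicks",                 kind="utility")

cid.add_edge("initial_preferences", "posts_shown")            # information link
cid.add_edge("initial_preferences", "influenced_preferences")
cid.add_edge("posts_shown", "influenced_preferences")         # the decision can shift preferences
cid.add_edge("influenced_preferences", "clicks")
cid.add_edge("posts_shown", "clicks")

# Rough check: a chance node lying on a directed path from the decision to
# the utility is one the agent may benefit from influencing (here, the
# user's preferences, i.e. the "manipulate users" worry mentioned above).
for node, attrs in cid.nodes(data=True):
    if attrs["kind"] != "chance":
        continue
    if nx.has_path(cid, "posts_shown", node) and nx.has_path(cid, node, "clicks"):
        print("possible control incentive on:", node)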

Jan 4, 2025 • 3h 21min
Is Power-Seeking AI an Existential Risk?
This report examines what I see as the core argument for concern about existential risk from misaligned artificial intelligence. I proceed in two stages. First, I lay out a backdrop picture that informs such concern. On this picture, intelligent agency is an extremely powerful force, and creating agents much more intelligent than us is playing with fire -- especially given that if their objectives are problematic, such agents would plausibly have instrumental incentives to seek power over humans.

Second, I formulate and evaluate a more specific six-premise argument that creating agents of this kind will lead to existential catastrophe by 2070. On this argument, by 2070: (1) it will become possible and financially feasible to build relevantly powerful and agentic AI systems; (2) there will be strong incentives to do so; (3) it will be much harder to build aligned (and relevantly powerful/agentic) AI systems than to build misaligned (and relevantly powerful/agentic) AI systems that are still superficially attractive to deploy; (4) some such misaligned systems will seek power over humans in high-impact ways; (5) this problem will scale to the full disempowerment of humanity; and (6) such disempowerment will constitute an existential catastrophe.

I assign rough subjective credences to the premises in this argument, and I end up with an overall estimate of ~5% that an existential catastrophe of this kind will occur by 2070. (May 2022 update: since making this report public in April 2021, my estimate here has gone up, and is now at >10%).

Source: https://arxiv.org/abs/2206.13353

Narrated for Joe Carlsmith Audio by TYPE III AUDIO.
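
The ~5% figure is just the product of the author's credences in the six premises. The numbers below are illustrative stand-ins chosen to land near that headline figure; see the report itself for the actual assignments:

premises = {
    "1. powerful, agentic AI is possible and financially feasible by 2070": 0.65,
    "2. there are strong incentives to build it":                           0.80,
    "3. aligned systems are much harder to build than attractive-to-deploy misaligned ones": 0.40,
    "4. some misaligned systems seek power in high-impact ways":            0.65,
    "5. the problem scales to full human disempowerment":                   0.40,
    "6. that disempowerment is an existential catastrophe":                 0.95,
}

joint = 1.0
for claim, credence in premises.items():
    joint *= credence
    print(f"{credence:.2f}  {claim}")

print(f"\njoint credence in existential catastrophe by 2070: ~{joint:.1%}")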

Jan 4, 2025 • 6min
Summarizing Books With Human Feedback
To safely deploy powerful, general-purpose artificial intelligence in the future, we need to ensure that machine learning models act in accordance with human intentions. This challenge has become known as the alignment problem.

A scalable solution to the alignment problem needs to work on tasks where model outputs are difficult or time-consuming for humans to evaluate. To test scalable alignment techniques, we trained a model to summarize entire books, as shown in the samples in the original post.

Source: https://openai.com/research/summarizing-books

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
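
As I understand the post, the underlying method is recursive task decomposition: summarize passages short enough for a human to check, then summarize groups of those summaries, and repeat until one book-level summary remains. A minimal sketch of just the decomposition step (hypothetical helpers, not OpenAI's code):

def summarize_passage(text, max_chars=500):
    # Placeholder for a model call whose input a human could feasibly evaluate.
    return text[:max_chars]

def chunk(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]

def summarize_book(sections, group_size=5):
    summaries = [summarize_passage(s) for s in sections]
    while len(summaries) > 1:
        summaries = [summarize_passage(" ".join(group))
                     for group in chunk(summaries, group_size)]
    return summaries[0]

book = [f"Section {i} text ..." for i in range(40)]
print(summarize_book(book)[:80])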


