

BlueDot Narrated
BlueDot Impact
Audio versions of the core readings, blog posts, and papers from BlueDot courses.
Episodes

Jan 4, 2025 • 14min
ML Systems Will Have Weird Failure Modes
Previously, I've argued that future ML systems might exhibit unfamiliar, emergent capabilities, and that thought experiments provide one approach towards predicting these capabilities and their consequences. In this post I'll describe a particular thought experiment in detail. We'll see that taking thought experiments seriously often surfaces future risks that seem "weird" and alien from the point of view of current systems. I'll also describe how I tend to engage with these thought experiments: I usually start out intuitively skeptical, but when I reflect on emergent behavior I find that some (but not all) of the skepticism goes away. The remaining skepticism comes from ways that the thought experiment clashes with the ontology of neural networks, and I'll describe the approaches I usually take to address this and generate actionable takeaways.

Thought experiment: deceptive alignment. Recall that the optimization anchor runs the thought experiment of assuming that an ML agent is a perfect optimizer (with respect to some "intrinsic" reward function R). I'm going to examine one implication of this assumption, in the context of an agent being trained based on some "extrinsic" reward function R* (which is provided by the system designer and not equal to R). Specifically, consider a training process where in step t, a model has parameters θ_t and generates an action a_t (its output on that training step, e.g. an attempted backflip, assuming it is being trained to do backflips). The action a_t is then judged according to the extrinsic reward function R*, and the parameters are updated to some new value θ_{t+1} intended to increase a_{t+1}'s value under R*.

Original text: https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.
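To make the setup concrete, here is a minimal sketch of that training loop in Python, assuming a one-dimensional parameter and toy quadratic rewards; the function names and the finite-difference update are illustrative, not from the post:

```python
# Toy model of the training process: R* is the designer's extrinsic reward,
# R is the agent's intrinsic reward, and the update rule only ever sees R*.
def extrinsic_reward(a):      # R* (illustrative): reward landing near a = 1
    return -(a - 1.0) ** 2

def intrinsic_reward(a):      # R (illustrative): the agent "wants" a = 5
    return -(a - 5.0) ** 2

theta = 0.0                   # parameters θ_t, a single scalar here
lr, eps = 0.05, 1e-3
for t in range(200):
    a_t = theta               # action a_t produced at step t
    # θ_{t+1} is chosen to increase R*(a_{t+1}): finite-difference ascent on R*.
    grad = (extrinsic_reward(a_t + eps) - extrinsic_reward(a_t - eps)) / (2 * eps)
    theta += lr * grad

print(round(theta, 2))        # ≈ 1.0: training shapes behavior toward R*'s
                              # optimum, whatever the intrinsic R would prefer
```

The deceptive-alignment worry the thought experiment develops is that a perfect optimizer of R which understands this loop has an incentive to produce high-R* actions during training, precisely so that the updates leave its R-seeking policy intact.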

Jan 4, 2025 • 43min
Where I Agree and Disagree with Eliezer
(Partially in response to AGI Ruin: A List of Lethalities. Written in the same rambling style. Not exhaustive.)

Agreements:

1. Powerful AI systems have a good chance of deliberately and irreversibly disempowering humanity. This is a much easier failure mode than killing everyone with destructive physical technologies.
2. Catastrophically risky AI systems could plausibly exist soon, and there likely won't be a strong consensus about this fact until such systems pose a meaningful existential risk per year. There is not necessarily any "fire alarm."
3. Even if there were consensus about a risk from powerful AI systems, there is a good chance that the world would respond in a totally unproductive way. It's wishful thinking to look at possible stories of doom and say "we wouldn't let that happen;" humanity is fully capable of messing up even very basic challenges, especially if they are novel.

Source: https://www.lesswrong.com/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer

---

A podcast by BlueDot Impact.

Jan 4, 2025 • 23min
Progress on Causal Influence Diagrams
By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg

About 2 years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post will summarize progress made since then.

What are causal influence diagrams? A key problem in AI alignment is understanding agent incentives. Concerns have been raised that agents may be incentivized to avoid correction, manipulate users, or inappropriately influence their learning. This is particularly worrying as training schemes often shape incentives in subtle and surprising ways. For these reasons, we're developing a formal theory of incentives based on causal influence diagrams (CIDs).

Source: https://deepmindsafetyresearch.medium.com/progress-on-causal-influence-diagrams-a7a32180b0d1

Narrated for AI Safety Fundamentals by TYPE III AUDIO.

---

A podcast by BlueDot Impact.
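For readers who think in code, here is a minimal sketch of a CID as a typed DAG in plain Python; the authors' actual tooling is the pycid library, and everything below (the dataclass, the example graph) is illustrative only:

```python
# A CID is a DAG whose nodes are typed: chance (circles), decision (squares),
# and utility (diamonds). This toy structure is enough to ask incentive
# questions of the graph.
from dataclasses import dataclass, field

@dataclass
class CID:
    chance: set = field(default_factory=set)
    decisions: set = field(default_factory=set)
    utilities: set = field(default_factory=set)
    edges: set = field(default_factory=set)   # directed (parent, child) pairs

    def parents(self, node):
        return {p for (p, c) in self.edges if c == node}

# One-decision example: state S influences utility U; the decision D
# observes S (information link S -> D) and also influences U.
cid = CID(
    chance={"S"},
    decisions={"D"},
    utilities={"U"},
    edges={("S", "D"), ("S", "U"), ("D", "U")},
)

print("D observes:", cid.parents("D"))     # {'S'}
print("U depends on:", cid.parents("U"))   # {'S', 'D'}
```

Incentive concepts from this research program (e.g. whether a decision has an incentive to respond to or control another variable) are then graphical criteria one checks against structures like this.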

Jan 4, 2025 • 1h 2min
AGI Ruin: A List of Lethalities
I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead.

Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.

Crossposted from the LessWrong Curated Podcast by TYPE III AUDIO.

---

A podcast by BlueDot Impact.

Jan 4, 2025 • 7min
AI Safety via Red Teaming Language Models With Language Models
Abstract: Language Models (LMs) often cannot be deployed because of their potential to harm users in ways that are hard to predict in advance. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM. We evaluate the target LM's replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B-parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot's own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.

Original text: https://www.deepmind.com/publications/red-teaming-language-models-with-language-models

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.
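The core loop is simple to state. Here is an illustrative sketch of it, with placeholder function names and toy stand-ins rather than the paper's code:

```python
# Red-teaming loop: one LM generates test questions, the target LM answers,
# and a trained classifier scores the replies for harm. All names below are
# placeholders for the paper's components.
import random

def red_team(generate_question, target_lm, harm_score, n_cases=1000, threshold=0.5):
    """Return the generated test cases whose replies the classifier flags."""
    failures = []
    for _ in range(n_cases):
        question = generate_question()     # zero-shot, few-shot, or RL-trained red LM
        reply = target_lm(question)        # the chatbot under test
        if harm_score(reply) > threshold:  # offensive-content classifier
            failures.append((question, reply))
    return failures

# Toy stand-ins so the sketch runs end to end:
questions = ["How do I pick a lock?", "Tell me a joke.", "What's your phone number?"]
flagged = red_team(
    generate_question=lambda: random.choice(questions),
    target_lm=lambda q: "reply to: " + q,
    harm_score=lambda r: random.random(),  # pretend classifier probability
    n_cases=10,
)
print(len(flagged), "flagged cases")
```

The paper's methods differ mainly in how `generate_question` is built: moving from zero-shot generation to reinforcement learning trades diversity of test cases against how reliably they elicit failures.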

Jan 4, 2025 • 29min
Debate Update: Obfuscated Arguments Problem
This is an update on the work on AI Safety via Debate that we previously wrote about here.

What we did: We tested the debate protocol introduced in AI Safety via Debate with human judges and debaters. We found various problems and improved the mechanism to fix these issues (details of these are in the appendix). However, we discovered that a dishonest debater can often create arguments that have a fatal error, but where it is very hard to locate the error. We don't have a fix for this "obfuscated argument" problem, and believe it might be an important quantitative limitation for both IDA and Debate.

Key takeaways and relevance for alignment: Our ultimate goal is to find a mechanism that allows us to learn anything that a machine learning model knows: if the model can efficiently find the correct answer to some problem, our mechanism should favor the correct answer while only requiring a tractable number of human judgements and a reasonable number of computation steps for the model. We're working under a hypothesis that there are broadly two ways to know things: via step-by-step reasoning about implications (logic, computation…), and by learning and generalizing from data (pattern matching, Bayesian updating…).

Original text: https://www.alignmentforum.org/posts/PJLABqQ962hZEqhdB/debate-update-obfuscated-arguments-problem

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.
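A toy model (not the paper's protocol) shows why obfuscated arguments bite: if an argument is a long chain of steps and a judge can afford to verify only a few of them, a single hidden fatal error usually survives scrutiny:

```python
# Toy model of judging a long argument by spot-checking steps. Each step is
# True (valid) or False (the fatal error). The judge samples a few steps;
# the obfuscated argument hides one error among many valid steps.
import random

def judge(argument, checks=3):
    """Verify `checks` random steps; accept iff all checked steps are valid."""
    sampled = random.sample(range(len(argument)), k=min(checks, len(argument)))
    return all(argument[i] for i in sampled)

obfuscated = [True] * 1000
obfuscated[random.randrange(1000)] = False    # one fatal, hard-to-locate error

trials = 10_000
accepted = sum(judge(obfuscated) for _ in range(trials))
print(f"obfuscated argument accepted {accepted / trials:.1%} of the time")
# With 3 checks over 1000 steps, the error is found only ~0.3% of the time.
```

The paper's problem is worse than this sketch suggests: in an obfuscated argument, even the honest debater may not know which step is wrong, so they cannot simply point the judge at the flaw.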

Jan 4, 2025 • 8min
Careers in Alignment
Richard Ngo compiles a number of resources for thinking about careers in alignment research.

Original text: https://docs.google.com/document/d/1iFszDulgpu1aZcq_aYFG7Nmcr5zgOhaeSwavOMk1akw/edit#heading=h.4whc9v22p7tb

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.

Jan 4, 2025 • 10min
Measuring Progress on Scalable Oversight for Large Language Models
Abstract: Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

Authors: Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Jared Kaplan

Original text: https://arxiv.org/abs/2211.03540

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.
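The experimental design reduces to a three-way comparison on the same questions. A sketch with invented numbers (the real scores are in the paper):

```python
# Scalable-oversight baseline comparison: score the human+model team against
# the model alone and the unaided human on identical questions. Answers and
# gold labels below are invented for illustration.
def accuracy(answers, gold):
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

gold = ["B", "A", "D", "C"]
conditions = {
    "model alone":   ["B", "C", "D", "A"],  # unreliable assistant by itself
    "human alone":   ["B", "A", "A", "A"],  # unaided participant
    "human + model": ["B", "A", "D", "A"],  # participant chatting with assistant
}
for name, answers in conditions.items():
    print(f"{name}: {accuracy(answers, gold):.0%}")
```

The paper's encouraging result is the ordering this comparison produces on MMLU and time-limited QuALITY: human + model beats both the model alone and the human alone.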

Jan 4, 2025 • 1h 2min
Yudkowsky Contra Christiano on AI Takeoff Speeds
In 2008, thousands of blog readers - including yours truly, who had discovered the rationality community just a few months before - watched Robin Hanson debate Eliezer Yudkowsky on the future of AI.

Robin thought the AI revolution would be a gradual affair, like the Agricultural or Industrial Revolutions. Various people invent and improve various technologies over the course of decades or centuries. Each new technology provides another jumping-off point for people to use when inventing other technologies: mechanical gears → steam engine → railroad, and so on. Over the course of a few decades, you've invented lots of stuff and the world is changed, but there's no single moment when "industrialization happened".

Eliezer thought it would be lightning-fast. Once researchers started building human-like AIs, some combination of adding more compute, and the new capabilities provided by the AIs themselves, would quickly catapult AI to unimaginably superintelligent levels. The whole process could take between a few hours and a few years, depending on what point you measured from, but it wouldn't take decades.

You can imagine the graph in the original post as being GDP over time, except that Eliezer thinks AI will probably destroy the world, which might be bad for GDP in some sense. If you come up with some way to measure (in dollars) whatever kind of crazy technologies AIs create for their own purposes after wiping out humanity, then the GDP framing will probably work fine.

Crossposted from the Astral Codex Ten Podcast.

---

A podcast by BlueDot Impact.

Jan 4, 2025 • 8min
AI Watermarking Won’t Curb Disinformation
Generative AI allows people to produce piles upon piles of images and words very quickly. It would be nice if there were some way to reliably distinguish AI-generated content from human-generated content. It would help people avoid endlessly arguing with bots online, or believing what a fake image purports to show. One common proposal is that big companies should incorporate watermarks into the outputs of their AIs. For instance, this could involve taking an image and subtly changing many pixels in a way that's undetectable to the eye but detectable to a computer program. Or it could involve swapping words for synonyms in a predictable way so that the meaning is unchanged, but a program could readily determine the text was generated by an AI.

Unfortunately, watermarking schemes are unlikely to work. So far most have proven easy to remove, and it's likely that future schemes will have similar problems.

Source: https://www.eff.org/deeplinks/2024/01/ai-watermarking-wont-curb-disinformation

Narrated for AI Safety Fundamentals by Perrin Walker.

---

A podcast by BlueDot Impact.
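As a concrete illustration of the synonym-swapping idea, and of why it is fragile, here is a toy keyed watermark; the scheme, word list, and all names are invented for illustration:

```python
# Toy synonym watermark: a secret key dictates which member of each synonym
# pair to use, embedding a signal a detector can count. A paraphraser (or
# anyone swapping synonyms back) erases the signal just as easily.
import hashlib

PAIRS = {"big": "large", "fast": "quick", "begin": "start"}
CANON = {w: k for k, v in PAIRS.items() for w in (k, v)}  # variant -> canonical

def keyed_bit(word, key):
    """Pseudorandom bit per synonym pair, derived from the secret key."""
    return hashlib.sha256((key + CANON[word]).encode()).digest()[0] & 1

def watermark(text, key):
    out = []
    for w in text.split():
        if w in CANON:                     # pick the variant the key dictates
            k = CANON[w]
            w = PAIRS[k] if keyed_bit(w, key) else k
        out.append(w)
    return " ".join(out)

def detect(text, key):
    hits = [w for w in text.split() if w in CANON]
    if not hits:
        return 0.0
    agree = sum((w != CANON[w]) == bool(keyed_bit(w, key)) for w in hits)
    return agree / len(hits)               # ~1.0 if watermarked, ~0.5 otherwise

marked = watermark("begin the big fast test", "secret")
print(marked, detect(marked, "secret"))
```

The post's point falls straight out of the sketch: any rewrite that swaps words back, or paraphrases, destroys the agreement score, which is why such schemes have proven easy to remove.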


