AI Safety Fundamentals

BlueDot Impact
Sep 3, 2025 • 17min

It’s Practically Impossible to Run a Big AI Company Ethically

By Sigal Samuel (Vox Future Perfect)

Even "safety-first" AI companies like Anthropic face market pressure that can override ethical commitments. This article demonstrates the constraints facing AI companies, and why voluntary corporate governance can't solve coordination problems alone.

Source: https://archive.ph/

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
Sep 3, 2025 • 13min

Solarpunk: A Vision for a Sustainable Future

By Joshua Krook

What might sustainable human progress look like, beyond pure technological acceleration? This essay provides an alternative vision, based on communities living in greater harmony with each other and with nature, alongside advanced technologies.

Source: https://newintrigue.com/2025/01/29/solarpunk-a-vision-for-a-sustainable-future/

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
Sep 3, 2025 • 10min

The Gentle Singularity

By Sam Altman

This blog post offers a vivid, optimistic vision of rapid AI progress from the CEO of OpenAI. Altman suggests that the accelerating technological change will feel "impressive but manageable," and that there are serious challenges to confront.

Source: https://blog.samaltman.com/the-gentle-singularity

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
Aug 30, 2025 • 2h 10min

AI-Enabled Coups: How a Small Group Could Use AI to Seize Power

By Tom Davidson, Lukas Finnveden and Rose Hadshar

The development of AI that is more broadly capable than humans will create a new and serious threat: AI-enabled coups. An AI-enabled coup could be staged by a very small group, or just a single person, and could occur even in established democracies.

Sufficiently advanced AI will introduce three novel dynamics that significantly increase coup risk. Firstly, military and government leaders could fully replace human personnel with AI systems that are singularly loyal to them, eliminating the need to gain human supporters for a coup. Secondly, leaders of AI projects could deliberately build AI systems that are secretly loyal to them, for example fully autonomous military robots that pass security tests but later execute a coup when deployed in military settings. Thirdly, senior officials within AI projects or the government could gain exclusive access to superhuman capabilities in weapons development, strategic planning, persuasion, and cyber offence, and use these to increase their power until they can stage a coup.

To address these risks, AI projects should design and enforce rules against AI misuse, audit systems for secret loyalties, and share frontier AI systems with multiple stakeholders. Governments should establish principles for government use of advanced AI, increase oversight of frontier AI projects, and procure AI for critical systems from multiple independent providers.

Source: https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
Jan 4, 2025 • 14min

Introduction to Logical Decision Theory for Computer Scientists

Decision theories differ on exactly how to calculate the expectation: the probability of an outcome, conditional on an action. This foundational difference bubbles up to real-life questions about whether to vote in elections, or accept a lowball offer at the negotiating table. When you're thinking about what happens if you don't vote in an election, should you calculate the expected outcome as if only your vote changes, or as if all the people sufficiently similar to you would also decide not to vote?

Questions like these belong to a larger class of problems, Newcomblike decision problems, in which some other agent is similar to us or is reasoning about what we will do in the future. The central principle of 'logical decision theories', several families of which will be introduced, is that we ought to choose as if we are controlling the logical output of our abstract decision algorithm.

Newcomblike considerations, which might initially seem like unusual special cases, become more prominent as agents can get higher-quality information about what algorithms or policies other agents use: public commitments, machine agents with known code, smart contracts running on Ethereum. Newcomblike considerations also become more important as we deal with agents that are very similar to one another; or with large groups of agents that are likely to contain high-similarity subgroups; or with problems where even small correlations are enough to swing the decision.

In philosophy, the debate over decision theories is seen as a debate over the principle of rational choice. Do 'rational' agents refrain from voting in elections, because their one vote is very unlikely to change anything? Do we need to go beyond 'rationality', into 'social rationality' or 'superrationality' or something along those lines, in order to describe agents that could possibly make up a functional society?

Original text: https://arbital.com/p/logical_dt/?l=5d6

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
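The dispute over how to condition the expectation can be made concrete with the classic Newcomb setup. The numbers and the `p` parameter below are illustrative, not from the original text: a predictor with accuracy `p` fills an opaque box with $1,000,000 only if it predicts you will take just that box, while a transparent box always holds $1,000. Conditioning the box's contents on your own choice, as evidential-style reasoning does, favors one-boxing:

```python
# Toy Newcomb expected-value calculation (illustrative assumption: the
# predictor's accuracy p and the payoffs are stipulated, not derived).
p = 0.99  # predictor accuracy

# Condition the opaque box's contents on the choice itself:
# if you one-box, the predictor most likely foresaw it and filled the box.
ev_one_box = p * 1_000_000
# If you two-box, the predictor most likely foresaw that too,
# leaving only the transparent $1,000 (plus the rare mispredicted million).
ev_two_box = (1 - p) * 1_000_000 + 1_000

print(ev_one_box, ev_two_box)  # one-boxing dominates under this conditioning
```

A causal-style calculation would instead hold the box's contents fixed regardless of the choice, which is exactly the disagreement the episode describes bubbling up into voting and negotiation.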
Jan 4, 2025 • 19min

High-Stakes Alignment via Adversarial Training [Redwood Research Report]

Researchers at Redwood Research explore adversarial training as a way to make AI systems more reliable. The discussion covers experiments designed to mitigate the risks of AI deception, including approaches to filtering harmful content, shows how adversarial techniques are applied to create robust classifiers, and considers the implications for overseeing AI behavior in high-stakes scenarios. The results reveal both progress and remaining challenges in building safer AI systems.

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
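The adversarial-training loop behind work like this can be sketched in miniature. Everything below is a toy stand-in, not Redwood's setup: the "classifier" is a learned threshold rather than a neural filter, and the "adversary" is a brute-force search for misclassified inputs, but the loop structure (attack, add the failure to the training set, retrain) is the technique the report describes:

```python
def train(examples):
    """Toy 'classifier': flag x as unsafe iff x > t, with t fit to the data
    as the largest value ever labeled safe."""
    safe = [x for x, unsafe in examples if not unsafe]
    return max(safe)

def is_unsafe_true(x):
    """Ground-truth rule the classifier should learn (illustrative)."""
    return x >= 5

def attack(t, search_space):
    """Adversary: find an input the current classifier gets wrong."""
    for x in search_space:
        if (x > t) != is_unsafe_true(x):
            return x
    return None

data = [(0, False), (1, False), (9, True)]  # small initial training set
t = train(data)
space = range(10)

# Adversarial training loop: keep attacking until no failure remains.
while (x := attack(t, space)) is not None:
    data.append((x, is_unsafe_true(x)))  # add the adversarial example
    t = train(data)

print(t)  # classifier now agrees with the ground truth on the search space
```

The real work replaces both stubs with a learned classifier and human-plus-tool adversaries, but the loop's failure mode is the same: robustness only extends as far as the adversary can search.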
Jan 4, 2025 • 19min

Supervising Strong Learners by Amplifying Weak Experts

Abstract: Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems. Iterated Amplification is closely related to Expert Iteration (Anthony et al., 2017; Silver et al., 2017), except that it uses no external reward function. We present results in algorithmic environments, showing that Iterated Amplification can efficiently learn complex behaviors.

Original text: https://arxiv.org/abs/1810.08575

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
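The core move, building answers to hard questions out of answers to easier subquestions, can be sketched on a toy task. This is a hedged illustration, not the paper's algorithm: the learned distillation step is omitted, and `weak_agent`, `amplify`, and the list-summing task are inventions for demonstration only:

```python
def weak_agent(xs):
    """Base agent: can only answer easy questions (sum lists of length <= 2)."""
    assert len(xs) <= 2
    return sum(xs)

def amplify(agent, xs):
    """Answer a hard question by decomposing it into subquestions the weak
    agent can handle, then combining the sub-answers with one more easy call.
    The recursion stands in for the human-guided subdivision in the paper."""
    if len(xs) <= 2:
        return agent(xs)
    mid = len(xs) // 2
    left = amplify(agent, xs[:mid])
    right = amplify(agent, xs[mid:])
    return agent([left, right])  # combining sub-answers is itself easy

print(amplify(weak_agent, [1, 2, 3, 4, 5]))  # 15
```

In the full scheme, the amplified system's answers become training targets for a stronger learned agent, which is then amplified again, so no external reward function is ever needed.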
Jan 4, 2025 • 18min

Embedded Agents

Suppose you want to build a robot to achieve some real-world goal for you—a goal that requires the robot to learn for itself and figure out a lot of things that you don’t already know. There’s a complicated engineering problem here. But there’s also a problem of figuring out what it even means to build a learning agent like that. What is it to optimize realistic goals in physical environments? In broad terms, how does it work? In this series of posts, I’ll point to four ways we don’t currently know how it works, and four areas of active research aimed at figuring it out.

This is Alexei, and Alexei is playing a video game. Like most games, this game has clear input and output channels. Alexei only observes the game through the computer screen, and only manipulates the game through the controller. The game can be thought of as a function which takes in a sequence of button presses and outputs a sequence of pixels on the screen. Alexei is also very smart, and capable of holding the entire video game inside his mind.

Original text: https://intelligence.org/2018/10/29/embedded-agents/

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
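The "game as a function" framing can be made concrete. In this toy sketch (illustrative only, not from the original post), the game maps button-press sequences to a screen output, and a non-embedded agent like Alexei, who holds the entire game in his mind, can simply evaluate that function on every candidate plan before acting:

```python
def game(presses):
    """Toy 'function from button presses to screen output': the score shown
    on screen counts how many times "A" was pressed."""
    return sum(1 for b in presses if b == "A")

# Because Alexei stands outside the game with clean input/output channels,
# choosing a plan reduces to argmax over candidate input sequences.
plans = [["A", "B"], ["A", "A"], ["B", "B"]]
best = max(plans, key=game)
print(best)  # ['A', 'A']
```

The series' point is that this picture breaks for embedded agents: a robot in the real world is part of the function it is trying to optimize and cannot hold the whole environment inside itself.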
Jan 4, 2025 • 17min

Understanding Intermediate Layers Using Linear Classifier Probes

Abstract: Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This helps us better understand the roles and dynamics of the intermediate layers. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems. We apply this technique to the popular models Inception v3 and Resnet-50. Among other things, we observe experimentally that the linear separability of features increases monotonically along the depth of the model.

Original text: https://arxiv.org/pdf/1610.01644.pdf

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
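The probe idea is simple enough to sketch end to end: freeze the features, fit a linear classifier on them, and use its accuracy as a separability measure. The sketch below is a hedged illustration, not the paper's code: the "intermediate features" are synthetic, and `probe_accuracy` (a closed-form ridge-regression probe) is an invented helper standing in for the paper's independently trained linear classifiers:

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_accuracy(features, labels):
    """Fit a ridge-regression linear probe on one-hot labels and report its
    training accuracy as a measure of linear separability."""
    n, d = features.shape
    X = np.hstack([features, np.ones((n, 1))])      # add a bias column
    Y = np.eye(labels.max() + 1)[labels]            # one-hot targets
    W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(d + 1), X.T @ Y)
    return float((np.argmax(X @ W, axis=1) == labels).mean())

# Two synthetic "layers": the later layer carries a stronger class signal,
# mimicking the paper's finding that separability grows with depth.
labels = rng.integers(0, 2, size=200)
early = rng.normal(size=(200, 8)) + 0.3 * labels[:, None]  # weak signal
late = rng.normal(size=(200, 8)) + 2.0 * labels[:, None]   # strong signal

print(probe_accuracy(early, labels), probe_accuracy(late, labels))
```

On real models the features would come from forward hooks on each layer, with the probe trained and evaluated on held-out data, but the measurement is the same.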
Jan 4, 2025 • 16min

Least-To-Most Prompting Enables Complex Reasoning in Large Language Models

Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks that require solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems. Our experimental results on tasks related to symbolic manipulation, compositional generalization, and math reasoning reveal that least-to-most prompting is capable of generalizing to more difficult problems than those seen in the prompts. A notable finding is that when the GPT-3 code-davinci-002 model is used with least-to-most prompting, it can solve the compositional generalization benchmark SCAN on any split (including the length split) with an accuracy of at least 99% using just 14 exemplars, compared to only 16% accuracy with chain-of-thought prompting. This is particularly noteworthy because neural-symbolic models in the literature that specialize in solving SCAN are trained on the entire training set containing over 15,000 examples. We have included prompts for all the tasks in the Appendix.

Source: https://arxiv.org/abs/2205.10625

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
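The two-stage control flow, decompose first, then solve subproblems in order with earlier answers available in context, can be sketched as a skeleton. Everything here is a toy: `decompose` and `solve` stand in for prompted model calls, and the running-sum task stands in for SCAN-style problems:

```python
def least_to_most(question, decompose, solve):
    """Stage 1: decompose the question into simpler subquestions.
    Stage 2: solve them in sequence, appending each (subquestion, answer)
    pair to the context so later subproblems can build on earlier ones."""
    subquestions = decompose(question)
    context = []
    for sq in subquestions:
        answer = solve(sq, context)  # a real system would prompt a model here
        context.append((sq, answer))
    return context[-1][1]  # the last sub-answer answers the full question

# Toy instantiation: "sum this list" decomposed into "sum the first k items",
# where each subproblem extends the previous running total by one number.
nums = [3, 1, 4, 1, 5]

def decompose(q):
    return list(range(1, len(q) + 1))

def solve(k, context):
    prev = context[-1][1] if context else 0
    return prev + nums[k - 1]

print(least_to_most(nums, decompose, solve))  # 14
```

In the paper both stages are implemented as few-shot prompts to the same language model; the point of the structure is that no single prompt ever has to be harder than its exemplars.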
