
AI Safety Fundamentals
Listen to resources from the AI Safety Fundamentals courses! https://aisafetyfundamentals.com/
Latest episodes

Jan 4, 2025 • 1h 9min
Working in AI Alignment
This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren’t familiar with the arguments for the importance of AI alignment, you can get an overview of them by doing the AI Alignment Course.
By Charlie Rogers-Smith, with minor updates by Adam Jones
Source: https://aisafetyfundamentals.com/blog/alignment-careers-guide
Narrated for AI Safety Fundamentals by Perrin Walker
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 4, 2025 • 11min
Planning a High-Impact Career: A Summary of Everything You Need to Know in 7 Points
We took 10 years of research and what we’ve learned from advising 1,000+ people on how to build high-impact careers, compressed that into an eight-week course to create your career plan, and then compressed that into this three-page summary of the main points. (It’s especially aimed at people who want a career that’s both satisfying and has a significant positive impact, but much of the advice applies to all career decisions.)
Original article: https://80000hours.org/career-planning/summary/
Author: Benjamin Todd
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 4, 2025 • 7min
Being the (Pareto) Best in the World
This introduces the concept of Pareto frontiers. The top comment by Rob Miles also ties it to comparative advantage. While reading, consider which Pareto frontiers your project could place you on.
Original text: https://www.lesswrong.com/posts/XvN2QQpKTuEzgkZHY/being-the-pareto-best-in-the-world
Author: John Wentworth
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
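To make the core idea concrete, here is a minimal sketch (not from the original post) of computing a Pareto frontier over two skills, assuming higher scores are better on both; the names and scores are made up for illustration.

```python
# Minimal sketch: the Pareto frontier is everyone whom no one else dominates,
# i.e. no one is at least as good on every skill and strictly better on one.
# The people and scores below are hypothetical, purely for illustration.

people = {
    "Alice": (9, 2),   # (skill_a, skill_b)
    "Bob":   (6, 6),
    "Carol": (2, 9),
    "Dave":  (5, 5),   # dominated by Bob, so not on the frontier
}

def dominates(p, q):
    """True if p is at least as good as q on every skill and strictly better on one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

frontier = [
    name for name, scores in people.items()
    if not any(dominates(other, scores) for o, other in people.items() if o != name)
]

print(frontier)  # ['Alice', 'Bob', 'Carol'] -- Dave is dominated by Bob
```

The point of the essay is that being on such a frontier (a rare combination of skills) can matter more than being the single best at any one skill.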

Jan 4, 2025 • 3min
Writing, Briefly
(In the process of answering an email, I accidentally wrote a tiny essay about writing. I usually spend weeks on an essay. This one took 67 minutes—23 of writing, and 44 of rewriting.)
Original text: https://paulgraham.com/writing44.html
Author: Paul Graham
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 4, 2025 • 14min
Compute Trends Across Three Eras of Machine Learning
This article explains the key drivers of AI progress, shows how compute is calculated, and looks at how the amount of compute used to train AI models has increased significantly in recent years.
Original text: https://epochai.org/blog/compute-trends
Authors: Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, Pablo Villalobos
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
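As a rough illustration of how training compute can be estimated, the sketch below uses a common approximation (total FLOP of roughly 6 × parameters × training tokens). This is not necessarily the exact method the article uses, and the model sizes are hypothetical examples, not real training runs.

```python
# Rough sketch of a common approximation for training compute:
# total FLOP ~= 6 * N * D, with N = parameter count and D = training tokens.
# The model sizes below are invented examples for illustration only.

def training_flop(params: float, tokens: float) -> float:
    """Approximate total training compute in FLOP."""
    return 6 * params * tokens

for name, n_params, n_tokens in [
    ("small model",  1e8,  2e9),
    ("medium model", 1e9,  2e10),
    ("large model",  1e11, 2e12),
]:
    print(f"{name}: ~{training_flop(n_params, n_tokens):.1e} FLOP")
```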

Jan 4, 2025 • 12min
Empirical Findings Generalize Surprisingly Far
Previously, I argued that emergent phenomena in machine learning mean that we can’t rely on current trends to predict what the future of ML will be like. In this post, I will argue that despite this, empirical findings often do generalize very far, including across “phase transitions” caused by emergent behavior.
This might seem like a contradiction, but actually I think divergence from current trends and empirical generalization are consistent. Findings do often generalize, but you need to think to determine the right generalization, and also about what might stop any given generalization from holding.
I don’t think many people would contest the claim that empirical investigation can uncover deep and generalizable truths. This is one of the big lessons of physics, and while some might attribute physics’ success to math instead of empiricism, I think it’s clear that you need empirical data to point to the right mathematics.
However, just invoking physics isn’t a good argument, because physical laws have fundamental symmetries that we shouldn’t expect in machine learning. Moreover, we care specifically about findings that continue to hold up after some sort of emergent behavior (such as few-shot learning in the case of ML). So, to make my case, I’ll start by considering examples in deep learning that have held up in this way. Since “modern” deep learning hasn’t been around that long, I’ll also look at examples from biology, a field that has been around for a relatively long time and where More Is Different is ubiquitous (see Appendix: More Is Different In Other Domains).
Source: https://bounded-regret.ghost.io/empirical-findings-generalize-surprisingly-far/
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 4, 2025 • 37min
Discovering Latent Knowledge in Language Models Without Supervision
Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
Original text: https://arxiv.org/abs/2212.03827
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
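To give a feel for the kind of probe the abstract describes, here is a simplified PyTorch sketch of training a linear probe whose outputs are logically consistent (a statement and its negation should get probabilities summing to about 1) while avoiding the trivial p = 0.5 answer. It is an illustration, not the authors' exact training setup; the activation tensors are placeholders standing in for precomputed hidden states of "X is true" / "X is false" prompt pairs.

```python
# Simplified sketch of a consistency-based probe over unlabeled activations.
# acts_pos / acts_neg are placeholders for precomputed model activations of a
# statement and its negation; in the paper these come from a real language model.

import torch
import torch.nn as nn

hidden_dim = 768                            # assumed activation size
acts_pos = torch.randn(256, hidden_dim)     # placeholder activations (statement)
acts_neg = torch.randn(256, hidden_dim)     # placeholder activations (negation)

probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = probe(acts_pos).squeeze(-1)
    p_neg = probe(acts_neg).squeeze(-1)

    # Consistency: p(statement) and p(negation) should sum to ~1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p = 0.5 for both.
    confidence = torch.minimum(p_pos, p_neg) ** 2

    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The paper also normalizes activations before probing and resolves the sign of the learned direction afterward; this sketch omits those details.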

Jan 4, 2025 • 1h 2min
Intro to Brain-Like-AGI Safety
(Sections 3.1-3.4, 6.1-6.2, and 7.1-7.5)
Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition as the human brain. How would we use such an algorithm safely?
I will argue that this is an open technical problem, and my goal in this post series is to bring readers with no prior knowledge all the way up to the front-line of unsolved problems as I see them.
If this whole thing seems weird or stupid, you should start right in on Post #1, which contains definitions, background, and motivation. Then Posts #2–#7 are mainly neuroscience, and Posts #8–#15 are more directly about AGI safety, ending with a list of open questions and advice for getting involved in the field.
Source: https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 4, 2025 • 8min
Deep Double Descent
We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.
Source: https://openai.com/research/deep-double-descent
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
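For intuition, the toy sketch below shows the same qualitative shape in a much simpler setting than the paper's (random-feature regression rather than CNNs, ResNets, or transformers): test error typically rises near the interpolation threshold, where the number of features is close to the number of training samples, and falls again as the model grows past it. The data and sizes are synthetic.

```python
# Toy sketch of model-wise double descent with random-feature least squares.
# Not the paper's setup; just a small synthetic example of the same qualitative curve.

import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20

w_true = rng.normal(size=d)
X_train, X_test = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

def random_features(X, W):
    return np.maximum(X @ W, 0.0)           # ReLU random features

for n_features in [10, 50, 90, 100, 110, 200, 1000]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    Phi_train, Phi_test = random_features(X_train, W), random_features(X_test, W)
    # Minimum-norm least-squares fit (interpolates training data once n_features >= n_train).
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"{n_features:5d} random features: test MSE = {test_mse:.3f}")
```

The test error is usually worst around 100 features here (the interpolation threshold for 100 training points), mirroring the "model-wise" double descent curve described in the episode.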

Jan 4, 2025 • 32min
Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF, but with AIs replacing human demonstrators and human evaluators.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
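For background on the RLHF machinery this builds on, here is a generic PyTorch sketch of the pairwise preference loss commonly used to train a reward model; in an AI-feedback variant like the one described above, the chosen/rejected labels would come from an AI evaluator rather than a human. This is an illustration of the standard technique, not Anthropic's actual implementation, and the reward model and embeddings are placeholders.

```python
# Sketch of a pairwise (Bradley-Terry style) preference loss for a reward model:
# push r(chosen) above r(rejected). The labels could come from human or AI evaluators.

import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 512
reward_model = nn.Linear(embed_dim, 1)      # toy stand-in for a full language-model reward head

# Placeholder embeddings for (prompt, response) pairs; in practice these would be
# hidden states from a language model.
chosen = torch.randn(32, embed_dim)
rejected = torch.randn(32, embed_dim)

r_chosen = reward_model(chosen).squeeze(-1)
r_rejected = reward_model(rejected).squeeze(-1)

loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```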