
AI Safety Fundamentals: Alignment
Listen to resources from the AI Safety Fundamentals: Alignment course! https://aisafetyfundamentals.com/alignment
Latest episodes

Jan 2, 2025 • 20min
We Need a Science of Evals
This post lays out a number of open questions in what the authors call a 'Science of Evals'.
Original text: https://www.apolloresearch.ai/blog/we-need-a-science-of-evals
Author(s): Apollo Research blog
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 2, 2025 • 12min
Introduction to Mechanistic Interpretability
Our introduction covers common mechanistic interpretability (mech interp) concepts to prepare you for the rest of this session's resources.
Original text: https://aisafetyfundamentals.com/blog/introduction-to-mechanistic-interpretability/
Author(s): Sarah Hastings-Woodhouse
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jul 19, 2024 • 1h 2min
Constitutional AI Harmlessness from AI Feedback
This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF but with AIs replacing human demonstrators and human evaluators.
Everything in this paper is relevant to this week's learning objectives, and we recommend you read it in its entirety. It summarises the limitations of conventional RLHF, explains the constitutional AI approach, shows how it performs, and suggests where future research might be directed. If you are in a rush, focus on sections 1.2, 3.1, 3.4, 4.1, 6.1 and 6.2.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
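To make the critique-and-revision idea concrete, here is a minimal sketch of the supervised stage the paper describes: the model drafts a response, critiques it against a constitutional principle, and revises it. The `generate` callable, the prompt templates and the principle strings are illustrative placeholders, not part of the paper.

```python
# Hedged sketch of the supervised "critique and revision" stage of constitutional AI.
# `generate` is a placeholder for any text-generation call (an API or a local model);
# prompt wording and sampling of principles are assumptions, not the paper's exact setup.
import random
from typing import Callable, List


def critique_and_revise(
    prompt: str,
    principles: List[str],
    generate: Callable[[str], str],
    n_rounds: int = 2,
) -> str:
    """Draft a response, then repeatedly critique and revise it against sampled principles."""
    response = generate(prompt)
    for principle in random.sample(principles, k=min(n_rounds, len(principles))):
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so that it addresses the critique."
        )
    return response  # revised responses become fine-tuning data for the supervised model
```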

Jul 19, 2024 • 32min
Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
This paper surveys open problems and fundamental limitations of reinforcement learning from human feedback (RLHF), and discusses directions for addressing them.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jul 19, 2024 • 23min
Illustrating Reinforcement Learning from Human Feedback (RLHF)
This more technical article explains the motivations for a system like RLHF and adds concrete details on how the RLHF approach is applied to neural networks.
While reading, consider which parts of the technical implementation correspond to the 'values coach' and 'coherence coach' from the previous video.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
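As a companion to the article, here is a minimal sketch of the KL-penalised objective commonly used in RLHF fine-tuning. Loosely, the learned reward model plays the 'values coach' role, while the KL penalty toward the frozen pretrained model pushes the policy to stay coherent. Tensor shapes, names and the beta value are illustrative assumptions.

```python
# Hedged sketch of the per-sample objective in RLHF fine-tuning: reward-model score
# minus a KL penalty keeping the policy close to the frozen pretrained model.
import torch


def rlhf_objective(policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   reward_model_score: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """policy_logprobs, ref_logprobs: per-token log-probs of the sampled response
    under the trained policy and the frozen reference model, shape (seq_len,).
    reward_model_score: scalar score from the learned reward model ('values coach').
    The KL penalty toward the reference model loosely plays the 'coherence coach' role."""
    kl_per_token = policy_logprobs - ref_logprobs          # estimate of KL(policy || reference)
    objective = reward_model_score - beta * kl_per_token.sum()
    return objective  # maximised (e.g. via PPO) with respect to the policy parameters
```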

Jun 17, 2024 • 1h 2min
Intro to Brain-Like-AGI Safety
(Sections 3.1-3.4, 6.1-6.2, and 7.1-7.5)
Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition as the human brain. How would we use such an algorithm safely? I will argue that this is an open technical problem, and my goal in this post series is to bring readers with no prior knowledge all the way up to the front-line of unsolved problems as I see them.
If this whole thing seems weird or stupid, you should start right in on Post #1, which contains definitions, background, and motivation. Then Posts #2–#7 are mainly neuroscience, and Posts #8–#15 are more directly about AGI safety, ending with a list of open questions and advice for getting involved in the field.
Source: https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jun 17, 2024 • 25min
Chinchilla’s Wild Implications
This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla. The paper came out a few months ago, and has been discussed a lot, but some of its implications deserve more explicit notice in my opinion. In particular:
- Data, not size, is the currently active constraint on language modeling performance. Current returns to additional data are immense, and current returns to additional model size are minuscule; indeed, most recent landmark models are wastefully big.
- If we can leverage enough data, there is no reason to train ~500B param models, much less 1T or larger models.
- If we have to train models at these large sizes, it will mean we have encountered a barrier to exploitation of data scaling, which would be a great loss relative to what would otherwise be possible.
- The literature is extremely unclear on how much text data is actually available for training. We may be "running out" of general-domain data, but the literature is too vague to know one way or the other.
- The entire available quantity of data in highly specialized domains like code is woefully tiny, compared to the gains that would be possible if much more such data were available.
Some things to note at the outset: this post assumes you have some familiarity with LM scaling laws, and, as in the paper, I'll assume here that models never see repeated data in training.
Original text: https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
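For a rough sense of what the Chinchilla fit implies, here is a small sketch using the parametric loss the paper fits, L(N, D) = E + A/N^alpha + B/D^beta, together with the widely quoted ~20-tokens-per-parameter rule of thumb. The constants are quoted from memory of the paper and should be treated as approximate.

```python
# Hedged sketch of the Chinchilla parametric scaling law and the compute-optimal
# data rule of thumb. Constants are approximate values reported in the paper.
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss L(N, D) = E + A / N**alpha + B / D**beta."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta


def roughly_optimal_tokens(n_params: float) -> float:
    """Compute-optimal training tokens under the common ~20 tokens/parameter heuristic."""
    return 20 * n_params


if __name__ == "__main__":
    n = 70e9                                   # a Chinchilla-sized (70B-parameter) model
    d = roughly_optimal_tokens(n)              # roughly 1.4 trillion tokens
    print(f"{d:.2e} tokens, predicted loss {chinchilla_loss(n, d):.3f}")
```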

Jun 17, 2024 • 8min
Deep Double Descent
We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.
Source: https://openai.com/research/deep-double-descent
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
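A quick way to see model-wise double descent without CNNs or transformers is a random-features regression sweep. The sketch below is an illustrative toy setting (not the paper's setup); you would typically see test error spike near the interpolation threshold (width roughly equal to the number of training points) and then fall again as width grows further.

```python
# Hedged sketch of a toy model-wise double descent experiment using random ReLU
# features and a minimum-norm least-squares fit. Data, widths and noise level are
# illustrative assumptions, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 5
w_true = rng.normal(size=d)                              # fixed target direction


def make_data(n):
    x = rng.normal(size=(n, d))
    y = np.sin(x @ w_true) + 0.1 * rng.normal(size=n)    # noisy nonlinear target
    return x, y


x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test)

for width in [5, 10, 20, 35, 40, 45, 60, 100, 400, 2000]:
    W = rng.normal(size=(d, width)) / np.sqrt(d)          # fixed random first layer
    phi_tr, phi_te = np.maximum(x_tr @ W, 0), np.maximum(x_te @ W, 0)  # ReLU features
    beta, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)  # min-norm least-squares fit
    test_mse = np.mean((phi_te @ beta - y_te) ** 2)
    print(f"width={width:5d}  test MSE={test_mse:.3f}")
```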

Jun 17, 2024 • 1h
Eliciting Latent Knowledge
In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems:
Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us. But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.
In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?
We’ll call this problem eliciting latent knowledge (ELK). In this report we’ll focus on detecting sensor tampering as a motivating example, but we believe ELK is central to many aspects of alignment.
Source: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
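For intuition, here is a sketch of the naive baseline the report discusses: train a small "reporter" on the predictor's latent state using labels from scenarios humans can check. All module names, sizes and the training loop are illustrative assumptions; the report's worry is that such a reporter may learn to model what a human would believe rather than report what the predictor actually knows.

```python
# Hedged sketch of the naive ELK baseline: a reporter head over frozen predictor
# latents, supervised on human-checkable tampering labels. Architecture and shapes
# are illustrative assumptions, not ARC's implementation.
import torch
import torch.nn as nn

latent_dim = 256

reporter = nn.Sequential(                 # small head over the predictor's latent state
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, 1),                    # logit for "the camera was tampered with"
)
opt = torch.optim.Adam(reporter.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()


def train_step(latents: torch.Tensor, human_labels: torch.Tensor) -> float:
    """latents: (batch, latent_dim) internal states taken from the frozen predictor.
    human_labels: (batch, 1) tampering labels, only available where humans can tell."""
    logits = reporter(latents)
    loss = loss_fn(logits, human_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```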

Jun 17, 2024 • 9min
Gradient Hacking: Definitions and Examples
Gradient hacking is a hypothesized phenomenon where:
- A model has knowledge about possible training trajectories which isn’t being used by its training algorithms when choosing updates (such as knowledge about non-local features of its loss landscape which aren’t taken into account by local optimization algorithms).
- The model uses that knowledge to influence its medium-term training trajectory, even if the effects wash out in the long term.
Below I give some potential examples of gradient hacking, divided into those which exploit RL credit assignment and those which exploit gradient descent itself. My concern is that models might use techniques like these either to influence which goals they develop, or to fool our interpretability techniques. Even if those effects don’t last in the long term, they might last until the model is smart enough to misbehave in other ways (e.g. specification gaming, or reward tampering), or until it’s deployed in the real world—especially in the RL examples, since convergence to a global optimum seems unrealistic (and ill-defined) for RL policies trained on real-world data. However, since gradient hacking isn’t very well-understood right now, both the definition above and the examples below should only be considered preliminary.
Source: https://www.alignmentforum.org/posts/EeAgytDZbDjRznPMA/gradient-hacking-definitions-and-examples
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.