AI Safety Fundamentals: Alignment

Latest episodes

May 13, 2023 • 7min

Red Teaming Language Models With Language Models

Abstract: Language Models (LMs) often cannot be deployed because of their potential to harm users in ways that are hard to predict in advance. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM. We evaluate the target LM’s replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot’s own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.

Original text: https://www.deepmind.com/publications/red-teaming-language-models-with-language-models
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
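To make the loop this abstract describes concrete, here is a minimal sketch of zero-shot LM red teaming: one LM generates test questions, the target LM answers, and a classifier flags offensive replies. The three callables and the seed prompt are hypothetical stand-ins, not the paper's models, and only the zero-shot generation variant is shown (the paper also covers few-shot, supervised, and RL-based generation).

```python
# Minimal sketch of zero-shot LM red teaming, assuming you already have three
# callables: a red-team LM, a target LM, and an offensiveness classifier.
# All three are hypothetical stand-ins, not the paper's actual models.
from typing import Callable, List, Tuple

def red_team(
    red_lm: Callable[[str], str],          # generates a test question from a prompt
    target_lm: Callable[[str], str],       # the chatbot under test
    is_offensive: Callable[[str], float],  # returns P(offensive) for a reply
    n_cases: int = 1000,
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Return (question, reply, score) triples whose replies look offensive."""
    seed_prompt = "List of questions to ask someone:\n1."  # illustrative seed
    failures = []
    for _ in range(n_cases):
        question = red_lm(seed_prompt)     # zero-shot test-case generation
        reply = target_lm(question)        # query the target chatbot
        score = is_offensive(reply)        # score the reply, not the question
        if score > threshold:
            failures.append((question, reply, score))
    return failures
```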
May 13, 2023 • 40min

AI Safety via Debate

Abstract: To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment. We report results on an initial MNIST experiment where agents compete to convince a sparse classifier, boosting the classifier's accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to 85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of the debate model, focusing on potential weaknesses as the model scales up, and we propose future human and computer experiments to test these properties.

Original text: https://arxiv.org/abs/1805.00899
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
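A toy sketch of the protocol as the abstract states it: two agents alternate short statements up to a fixed limit, then a judge picks a single winner of the zero-sum game. The agent and judge callables are placeholders, not trained debaters or real human judges.

```python
# Toy sketch of the debate protocol: two agents alternate short statements up to
# a fixed limit, then a judge decides who was more convincing. The agents and
# judge here are placeholders supplied by the caller, not the paper's systems.
from typing import Callable, List

def debate(
    question: str,
    agent_a: Callable[[str, List[str]], str],  # sees the question + transcript so far
    agent_b: Callable[[str, List[str]], str],
    judge: Callable[[str, List[str]], int],    # returns 0 if A wins, 1 if B wins
    max_turns: int = 6,
) -> int:
    transcript: List[str] = []
    for turn in range(max_turns):
        speaker = agent_a if turn % 2 == 0 else agent_b
        transcript.append(speaker(question, transcript))  # one short statement per turn
    return judge(question, transcript)  # zero-sum: exactly one agent wins
```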
May 13, 2023 • 36min

Robust Feature-Level Adversaries Are Interpretability Tools

Abstract: The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying representations in models. Second, we show that these adversaries are uniquely versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research.

Original text: https://arxiv.org/abs/2110.03605
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
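A rough sketch of what a feature-level attack can look like, assuming you have an image generator and a classifier as PyTorch modules: the perturbation is optimized in the generator's latent space rather than in pixel space, so the resulting change is a perceptible, interpretable edit to the image. The module names, step count, and learning rate are illustrative, not the paper's settings.

```python
# Sketch of a feature-level attack: perturb a generator's latent code (rather
# than raw pixels) so the generated image is misclassified as a chosen target.
# `generator` and `classifier` are placeholder torch.nn.Module objects.
import torch
import torch.nn.functional as F

def feature_level_attack(generator, classifier, z, target_class, steps=200, lr=0.05):
    # z: latent code, shape (batch, latent_dim); target_class: label indices, shape (batch,)
    delta = torch.zeros_like(z, requires_grad=True)      # perturbation lives in latent space
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        image = generator(z + delta)                     # decode the perturbed latent
        logits = classifier(image)
        loss = F.cross_entropy(logits, target_class)     # push toward the target label
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(z + delta).detach()
```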
May 13, 2023 • 29min

Debate Update: Obfuscated Arguments Problem

This is an update on the work on AI Safety via Debate that we previously wrote about here.

What we did: We tested the debate protocol introduced in AI Safety via Debate with human judges and debaters. We found various problems and improved the mechanism to fix these issues (details of these are in the appendix). However, we discovered that a dishonest debater can often create arguments that have a fatal error, but where it is very hard to locate the error. We don’t have a fix for this “obfuscated argument” problem, and believe it might be an important quantitative limitation for both IDA and Debate.

Key takeaways and relevance for alignment: Our ultimate goal is to find a mechanism that allows us to learn anything that a machine learning model knows: if the model can efficiently find the correct answer to some problem, our mechanism should favor the correct answer while only requiring a tractable number of human judgements and a reasonable number of computation steps for the model. We’re working under a hypothesis that there are broadly two ways to know things: via step-by-step reasoning about implications (logic, computation…), and by learning and generalizing from data (pattern matching, Bayesian updating…).

Original text: https://www.alignmentforum.org/posts/PJLABqQ962hZEqhdB/debate-update-obfuscated-arguments-problem
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
May 13, 2023 • 14min

Introduction to Logical Decision Theory for Computer Scientists

The podcast discusses the foundational differences in decision theories and how they affect real-life scenarios like voting and negotiation. It introduces logical decision theories and Newcomblike decision problems. The chapters explore logical decision theory and its academic status, the concept of rationality in the prisoner's dilemma game, fixing the infinite loop problem, and different calculations of expected utility.
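To make "different calculations of expected utility" concrete, here is the standard Newcomb's-problem arithmetic under an evidential-style decision rule. The $1M/$1k payoffs and the 99%-accurate predictor are the usual illustrative numbers, not figures taken from the episode.

```python
# Illustrative expected-utility calculation for Newcomb's problem under an
# evidential-style decision rule. Payoffs and predictor accuracy are the
# standard textbook numbers, chosen here only for illustration.
ACCURACY = 0.99          # probability the predictor forecast your actual choice
BIG, SMALL = 1_000_000, 1_000

# Evidential-style EU: condition the opaque box's contents on your choice,
# via the predictor's accuracy.
eu_one_box = ACCURACY * BIG + (1 - ACCURACY) * 0
eu_two_box = ACCURACY * SMALL + (1 - ACCURACY) * (BIG + SMALL)

print(f"one-box EU ≈ ${eu_one_box:,.0f}")   # ~ $990,000
print(f"two-box EU ≈ ${eu_two_box:,.0f}")   # ~ $11,000
# A causal calculation holds the box contents fixed, so it still favors
# two-boxing; this divergence is what motivates the logical decision theories
# discussed in the episode.
```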
May 13, 2023 • 32min

Feature Visualization

There is a growing sense that neural networks need to be interpretable to humans. The field of neural network interpretability has formed in response to these concerns. As it matures, two major threads of research have begun to coalesce: feature visualization and attribution. This article focuses on feature visualization. While feature visualization is a powerful tool, actually getting it to work involves a number of details. In this article, we examine the major issues and explore common approaches to solving them. We find that remarkably simple methods can produce high-quality visualizations. Along the way we introduce a few tricks for exploring variation in what neurons react to, how they interact, and how to improve the optimization process.

Original text: https://distill.pub/2017/feature-visualization/
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
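A minimal sketch of the kind of optimization the article is about: gradient ascent on an input image to maximize one channel's mean activation. Here `model_up_to_layer` is a placeholder for a network truncated at the layer of interest, and the article's transformation-robustness and regularization tricks are omitted for brevity.

```python
# Minimal activation-maximization sketch: optimize an input image by gradient
# ascent on a chosen channel's mean activation. `model_up_to_layer` is a
# placeholder for a network truncated at the layer of interest.
import torch

def visualize_channel(model_up_to_layer, channel, steps=256, lr=0.05, size=224):
    img = torch.randn(1, 3, size, size, requires_grad=True)  # start from noise
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        acts = model_up_to_layer(img)        # expected shape: (1, C, H, W)
        loss = -acts[0, channel].mean()      # ascend on the channel's mean activation
        opt.zero_grad()
        loss.backward()
        opt.step()
    return img.detach()
```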
May 13, 2023 • 37min

Discovering Latent Knowledge in Language Models Without Supervision

Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

Original text: https://arxiv.org/abs/2212.03827
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
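A simplified sketch of the consistency idea in the abstract: train a probe on hidden activations so that a statement and its negation receive probabilities summing to roughly one, while penalizing the degenerate answer of 0.5 everywhere. The activation tensors are placeholders you would collect yourself, and details such as the paper's activation normalization are omitted.

```python
# Simplified sketch of a consistency-based probe: learn a linear direction (plus
# bias) on hidden activations such that "X? Yes" and "X? No" versions of a
# question get probabilities that sum to ~1. `acts_pos` / `acts_neg` are
# placeholder tensors of shape (n_examples, hidden_dim).
import torch
import torch.nn as nn

def train_consistency_probe(acts_pos, acts_neg, steps=1000, lr=1e-3):
    d = acts_pos.shape[1]
    probe = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())        # direction + bias
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        p_pos, p_neg = probe(acts_pos), probe(acts_neg)
        consistency = ((p_pos - (1 - p_neg)) ** 2).mean()        # opposite truth values
        confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()   # discourage p = 0.5 everywhere
        loss = consistency + confidence
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```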
May 13, 2023 • 22min

Acquisition of Chess Knowledge in AlphaZero

Abstract: What is learned by sophisticated neural network agents such as AlphaZero? This question is of both scientific and practical interest. If the representations of strong neural networks bear no resemblance to human concepts, our ability to understand faithful explanations of their decisions will be restricted, ultimately limiting what we can achieve with neural network interpretability. In this work we provide evidence that human knowledge is acquired by the AlphaZero neural network as it trains on the game of chess. By probing for a broad range of human chess concepts we show when and where these concepts are represented in the AlphaZero network. We also provide a behavioural analysis focusing on opening play, including qualitative analysis from chess Grandmaster Vladimir Kramnik. Finally, we carry out a preliminary investigation looking at the low-level details of AlphaZero's representations, and make the resulting behavioural and representational analyses available online.

Original text: https://arxiv.org/abs/2111.09259
Narrated for AI Safety Fundamentals by TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
May 13, 2023 • 42min

Toy Models of Superposition

It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?

In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition. When features are sparse, superposition allows compression beyond what a linear model would do, at the cost of "interference" that requires nonlinear filtering.

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
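A sketch of the kind of toy setup described above, under illustrative choices of dimensions, sparsity, and training length: sparse synthetic features are squeezed through a bottleneck with fewer dimensions than features and reconstructed with a ReLU readout. Superposition shows up when multiple feature directions share the same hidden dimensions rather than each feature getting its own.

```python
# Toy superposition sketch: 20 sparse synthetic features, a 5-dimensional linear
# bottleneck, and a ReLU readout trained to reconstruct the input. All sizes and
# hyperparameters are illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn

n_features, n_hidden, sparsity = 20, 5, 0.95

W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5000):
    x = torch.rand(1024, n_features)                       # feature values in [0, 1]
    x = x * (torch.rand_like(x) > sparsity).float()        # most features are zero
    recon = torch.relu(x @ W.T @ W + b)                    # bottleneck, then ReLU readout
    loss = ((recon - x) ** 2).mean()                       # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, columns of W with large pairwise overlap indicate features
# stored in superposition rather than assigned their own hidden dimension.
```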
May 13, 2023 • 17min

Understanding Intermediate Layers Using Linear Classifier Probes

The podcast discusses how linear classifier probes can help analyze intermediate layers in neural network models, highlighting the improvement of linear separability along the depth. It explores the balance between computational efficiency and classification suitability, showcasing how probes offer insights into model behavior and training progress. The use of linear probes can uncover hidden model behaviors and aid in designing effective neural networks.
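A minimal sketch of the linear-probing recipe the episode describes: fit an independent logistic-regression probe on each layer's frozen activations and compare accuracies across depth. Here `layer_activations` and `labels` are placeholders for activations and task labels you would collect from your own model and dataset.

```python
# Minimal linear-probing sketch: one independent logistic-regression probe per
# layer, trained on that layer's frozen activations. `layer_activations` is a
# placeholder: a list of (n_samples, d_layer) arrays, one per layer.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layers(layer_activations, labels):
    accuracies = []
    for acts in layer_activations:                          # one entry per layer, shallow to deep
        X_tr, X_te, y_tr, y_te = train_test_split(
            acts, labels, test_size=0.2, random_state=0
        )
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        accuracies.append(probe.score(X_te, y_te))          # linear separability at this depth
    return accuracies
```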
