AI Safety Fundamentals

BlueDot Impact
Jan 4, 2025 • 14min

Compute Trends Across Three Eras of Machine Learning

This article explains key drivers of AI progress, how training compute is calculated, and how the amount of compute used to train AI models has increased significantly in recent years.

Original text: https://epochai.org/blog/compute-trends
Author(s): Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, Pablo Villalobos

A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.
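To make the compute-estimation idea concrete, here is a minimal sketch using the common approximation that training compute is roughly 6 FLOP per parameter per training token. The model sizes and token counts below are illustrative assumptions, not figures from the article.

```python
def estimate_training_flop(parameters: float, training_tokens: float) -> float:
    """Rough training-compute estimate via FLOP ~= 6 * parameters * tokens
    (about 2 FLOP per parameter per token forward, ~4 backward)."""
    return 6 * parameters * training_tokens

# Illustrative (assumed) model scales, not figures from the article.
examples = {
    "small model (100M params, 10B tokens)": (1e8, 1e10),
    "large model (100B params, 1T tokens)": (1e11, 1e12),
}

for name, (params, tokens) in examples.items():
    print(f"{name}: ~{estimate_training_flop(params, tokens):.2e} FLOP")
```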
Jan 4, 2025 • 12min

Empirical Findings Generalize Surprisingly Far

Previously, I argued that emergent phenomena in machine learning mean that we can't rely on current trends to predict what the future of ML will be like. In this post, I will argue that despite this, empirical findings often do generalize very far, including across "phase transitions" caused by emergent behavior.

This might seem like a contradiction, but actually I think divergence from current trends and empirical generalization are consistent. Findings do often generalize, but you need to think to determine the right generalization, and also about what might stop any given generalization from holding.

I don't think many people would contest the claim that empirical investigation can uncover deep and generalizable truths. This is one of the big lessons of physics, and while some might attribute physics' success to math instead of empiricism, I think it's clear that you need empirical data to point to the right mathematics.

However, just invoking physics isn't a good argument, because physical laws have fundamental symmetries that we shouldn't expect in machine learning. Moreover, we care specifically about findings that continue to hold up after some sort of emergent behavior (such as few-shot learning in the case of ML). So, to make my case, I'll start by considering examples in deep learning that have held up in this way. Since "modern" deep learning hasn't been around that long, I'll also look at examples from biology, a field that has been around for a relatively long time and where More Is Different is ubiquitous (see Appendix: More Is Different In Other Domains).

Source: https://bounded-regret.ghost.io/empirical-findings-generalize-surprisingly-far/
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
---
A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.
Jan 4, 2025 • 37min

Discovering Latent Knowledge in Language Models Without Supervision

Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

Original text: https://arxiv.org/abs/2212.03827
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
---
A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.
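To illustrate the "logical consistency" idea, here is a minimal PyTorch-style sketch of a consistency loss for a probe trained on paired activations of a statement and its negation; the probe architecture, tensor shapes, and training loop are illustrative stand-ins, not the authors' implementation.

```python
import torch

def consistency_loss(probe, acts_pos, acts_neg):
    """Encourage the probe to assign opposite truth values to a statement
    (acts_pos) and its negation (acts_neg), while discouraging the
    degenerate solution where both probabilities sit at 0.5."""
    p_pos = probe(acts_pos).squeeze(-1)
    p_neg = probe(acts_neg).squeeze(-1)
    consistency = (p_pos - (1.0 - p_neg)) ** 2      # p(x) should ~= 1 - p(not x)
    confidence = torch.minimum(p_pos, p_neg) ** 2   # avoid p_pos = p_neg = 0.5
    return (consistency + confidence).mean()

# A linear probe with a sigmoid output, trained only on unlabeled activations.
hidden_dim = 768  # assumed activation size
probe = torch.nn.Sequential(torch.nn.Linear(hidden_dim, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

# acts_pos / acts_neg would come from running the LM on "question + Yes"
# and "question + No" prompts; random tensors stand in here.
acts_pos = torch.randn(32, hidden_dim)
acts_neg = torch.randn(32, hidden_dim)
for _ in range(100):
    opt.zero_grad()
    loss = consistency_loss(probe, acts_pos, acts_neg)
    loss.backward()
    opt.step()
```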
Jan 4, 2025 • 1h 2min

Intro to Brain-Like-AGI Safety

(Sections 3.1-3.4, 6.1-6.2, and 7.1-7.5)

Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition as the human brain. How would we use such an algorithm safely?

I will argue that this is an open technical problem, and my goal in this post series is to bring readers with no prior knowledge all the way up to the front-line of unsolved problems as I see them.

If this whole thing seems weird or stupid, you should start right in on Post #1, which contains definitions, background, and motivation. Then Posts #2–#7 are mainly neuroscience, and Posts #8–#15 are more directly about AGI safety, ending with a list of open questions and advice for getting involved in the field.

Source: https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
---
A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.
Jan 4, 2025 • 8min

Deep Double Descent

We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don't yet fully understand why it happens, and view further study of this phenomenon as an important research direction.

Source: https://openai.com/research/deep-double-descent
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
---
A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.
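As a rough illustration of the model-size version of this effect, the sketch below fits minimum-norm least squares on an increasing number of random ReLU features; in setups like this, test error typically peaks near the interpolation threshold (features roughly equal to training points) and falls again beyond it. This is a toy reproduction in the spirit of the phenomenon, not the paper's experiments, and the exact numbers depend on the random seed and noise level.

```python
import numpy as np

# Toy double-descent demo with random-features regression.
rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
w_true = rng.normal(size=d)

X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true

# Random ReLU features shared across all "model sizes".
max_features = 400
W = rng.normal(size=(d, max_features)) / np.sqrt(d)
phi_train = np.maximum(X_train @ W, 0)
phi_test = np.maximum(X_test @ W, 0)

for k in [10, 50, 90, 100, 110, 200, 400]:  # number of features ~ model size
    # Minimum-norm least squares on the first k random features.
    coef, *_ = np.linalg.lstsq(phi_train[:, :k], y_train, rcond=None)
    mse = np.mean((phi_test[:, :k] @ coef - y_test) ** 2)
    print(f"features={k:4d}  test MSE={mse:8.3f}")
```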
Jan 4, 2025 • 32min

Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

This paper surveys open problems and fundamental limitations of reinforcement learning from human feedback (RLHF), covering challenges with collecting human feedback, training reward models, and optimizing policies, and discusses ways to understand, improve, and complement RLHF in practice.

A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.
Jan 4, 2025 • 1h 2min

Constitutional AI: Harmlessness from AI Feedback

This paper explains Anthropic's constitutional AI approach, which is largely an extension of RLHF but with AIs replacing human demonstrators and human evaluators.

A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.
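To sketch what "AIs replacing human demonstrators and human evaluators" can look like, here is a minimal, hypothetical outline of the two stages the approach describes: supervised critique-and-revision, then AI preference labels for RL. The `generate` callable, the prompts, and the constitution text are illustrative placeholders, not Anthropic's implementation.

```python
from typing import Callable, List

Generate = Callable[[str], str]  # any "prompt -> completion" language-model call

CONSTITUTION: List[str] = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and helpful.",
]

def critique_and_revise(generate: Generate, prompt: str, principle: str) -> str:
    """Supervised stage: the model critiques and revises its own answer against
    a constitutional principle; revisions replace human demonstrations."""
    answer = generate(prompt)
    critique = generate(f"{prompt}\n\nAnswer: {answer}\n\nCritique this answer according to: {principle}")
    revision = generate(f"{prompt}\n\nAnswer: {answer}\n\nCritique: {critique}\n\nRewrite the answer accordingly.")
    return revision  # used as a fine-tuning target

def ai_preference_label(generate: Generate, prompt: str, answer_a: str, answer_b: str, principle: str) -> str:
    """RL stage: the model judges which answer better follows the principle,
    replacing human evaluators; these labels train a preference model for RL."""
    verdict = generate(
        f"{prompt}\n\n(A) {answer_a}\n(B) {answer_b}\n\n"
        f"Which response better follows this principle: {principle}? Reply with A or B."
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"

# Runnable with a trivial stand-in for a real model call:
if __name__ == "__main__":
    stub: Generate = lambda prompt: "placeholder completion"
    print(critique_and_revise(stub, "How do I stay safe online?", CONSTITUTION[0]))
    print(ai_preference_label(stub, "How do I stay safe online?", "resp 1", "resp 2", CONSTITUTION[1]))
```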
Jan 4, 2025 • 12min

Introduction to Mechanistic Interpretability

Our introduction covers common mechanistic interpretability concepts to prepare you for the rest of this session's resources.

Original text: https://aisafetyfundamentals.com/blog/introduction-to-mechanistic-interpretability/
Author(s): Sarah Hastings-Woodhouse

A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.
Jan 2, 2025 • 40min

If-Then Commitments for AI Risk Reduction

Holden Karnofsky, a visiting scholar at the Carnegie Endowment for International Peace, delves into his innovative 'If-Then' commitments for managing AI risks. He outlines how these structured responses can ensure proactive safety measures without stifling innovation. The discussion highlights the importance of timely interventions as AI technology evolves, ensuring developments stay safe and beneficial. Karnofsky also touches on the challenges of implementing these commitments and the necessity of regulatory compliance across sectors.
Jan 2, 2025 • 11min

This is How AI Will Transform How Science Gets Done

This article by Eric Schmidt, former CEO of Google, explains existing use cases for AI in the scientific community and outlines ways that sufficiently advanced, narrow AI models might transform scientific discovery in the near future. As you read, pay close attention to the existing case studies he describes.

Original text: https://www.technologyreview.com/2023/07/05/1075865/eric-schmidt-ai-will-transform-science/
Author(s): Eric Schmidt

A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.
