BlueDot Narrated

BlueDot Impact
Jan 4, 2025 • 25min

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

Audio versions of blogs and papers from BlueDot courses.

Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria: faithfulness, completeness, and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, pointing toward opportunities to scale our understanding to both larger models and more complex tasks. Code for all experiments is available at https://github.com/redwoodresearch/Easy-Transformer.

Source: https://arxiv.org/pdf/2211.00593.pdf

Narrated for AI Safety Fundamentals by Perrin Walker

A podcast by BlueDot Impact.
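To make the IOI task concrete, here is a minimal sketch (assuming the HuggingFace transformers library and the standard "gpt2" checkpoint, i.e. GPT-2 small) that checks whether the model assigns a higher next-token logit to the indirect object than to the repeated subject. The prompt and the logit-difference measurement are illustrative of the kind of behavior the paper reverse-engineers, not a reproduction of its experiments.

```python
# Minimal IOI-style check on GPT-2 small (illustrative sketch, not the paper's pipeline).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # "gpt2" is GPT-2 small
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# A typical IOI-style prompt: the correct completion is the indirect object ("Mary").
prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits at the final position

# Compare the logits for the first BPE token of each name.
io_id = tokenizer.encode(" Mary")[0]   # indirect object
s_id = tokenizer.encode(" John")[0]    # repeated subject
print("logit diff (IO - S):", (logits[io_id] - logits[s_id]).item())
```

A positive logit difference means the model prefers the indirect object over the subject, which is the behavior the identified circuit of attention heads is meant to explain.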
Jan 4, 2025 • 44min

Zoom In: An Introduction to Circuits

Audio versions of blogs and papers from BlueDot courses.

By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks. Many important transition points in the history of science have been moments when science “zoomed in.” At these points, we develop a visualization or tool that allows us to see the world in a new level of detail, and a new field of science develops to study the world through this lens. For example, microscopes let us see cells, leading to cellular biology. Science zoomed in. Several techniques including x-ray crystallography let us see DNA, leading to the molecular revolution. Science zoomed in. Atomic theory. Subatomic particles. Neuroscience. Science zoomed in.

These transitions weren’t just a change in precision: they were qualitative changes in what the objects of scientific inquiry are. For example, cellular biology isn’t just more careful zoology. It’s a new kind of inquiry that dramatically shifts what we can understand.

The famous examples of this phenomenon happened at a very large scale, but it can also be the more modest shift of a small research community realizing they can now study their topic at a finer-grained level of detail.

Source: https://distill.pub/2020/circuits/zoom-in/

Narrated for AI Safety Fundamentals by Perrin Walker

A podcast by BlueDot Impact.
Jan 4, 2025 • 20min

Can We Scale Human Feedback for Complex AI Tasks?

Audio versions of blogs and papers from BlueDot courses.

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique for steering large language models (LLMs) toward desired behaviours. However, relying on simple human feedback doesn’t work for tasks that are too complex for humans to accurately judge at the scale needed to train AI models. Scalable oversight techniques attempt to address this by increasing the abilities of humans to give feedback on complex tasks.

This article briefly recaps some of the challenges faced with human feedback, and introduces the approaches to scalable oversight covered in session 4 of our AI Alignment course.

Source: https://aisafetyfundamentals.com/blog/scalable-oversight-intro/

Narrated for AI Safety Fundamentals by Perrin Walker

A podcast by BlueDot Impact.
Jan 4, 2025 • 42min

Visualizing the Deep Learning Revolution

Audio versions of blogs and papers from BlueDot courses.

The field of AI has undergone a revolution over the last decade, driven by the success of deep learning techniques. This post aims to convey three ideas using a series of illustrative examples:

1. There have been huge jumps in the capabilities of AIs over the last decade, to the point where it’s becoming hard to specify tasks that AIs can’t do.
2. This progress has been primarily driven by scaling up a handful of relatively simple algorithms (rather than by developing a more principled or scientific understanding of deep learning).
3. Very few people predicted that progress would be anywhere near this fast; but many of those who did also predict that we might face existential risk from AGI in the coming decades.

I’ll focus on four domains: vision, games, language-based tasks, and science. The first two have more limited real-world applications, but provide particularly graphic and intuitive examples of the pace of progress.

Original article: https://medium.com/@richardcngo/visualizing-the-deep-learning-revolution-722098eb9c5

Author: Richard Ngo

A podcast by BlueDot Impact.
Jan 4, 2025 • 13min

Future ML Systems Will Be Qualitatively Different

Audio versions of blogs and papers from BlueDot courses.

In 1972, the Nobel prize-winning physicist Philip Anderson wrote the essay "More Is Different". In it, he argues that quantitative changes can lead to qualitatively different and unexpected phenomena. While he focused on physics, one can find many examples of More Is Different in other domains as well, including biology, economics, and computer science. Some examples of More Is Different include:

- Uranium. With a bit of uranium, nothing special happens; with a large amount of uranium packed densely enough, you get a nuclear reaction.
- DNA. Given only small molecules such as calcium, you can’t meaningfully encode useful information; given larger molecules such as DNA, you can encode a genome.
- Water. Individual water molecules aren’t wet. Wetness only occurs due to the interaction forces between many water molecules interspersed throughout a fabric (or other material).

Original text: https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.
Jan 4, 2025 • 1h 11min

Biological Anchors: A Trick That Might Or Might Not Work

Audio versions of blogs and papers from BlueDot courses.

I've been trying to review and summarize Eliezer Yudkowsky's recent dialogues on AI safety. Previously in sequence: Yudkowsky Contra Ngo On Agents. Now we’re up to Yudkowsky contra Cotra on biological anchors, but before we get there we need to figure out what Cotra's talking about and what's going on.

The Open Philanthropy Project ("Open Phil") is a big effective altruist foundation interested in funding AI safety. It's got $20 billion, probably the majority of money in the field, so its decisions matter a lot and it’s very invested in getting things right. In 2020, it asked senior researcher Ajeya Cotra to produce a report on when human-level AI would arrive. It says the resulting document is "informal" - but it’s 169 pages long and likely to affect millions of dollars in funding, which some might describe as making it kind of formal. The report finds a 10% chance of “transformative AI” by 2031, a 50% chance by 2052, and an almost 80% chance by 2100.

Eliezer rejects their methodology and expects AI earlier (he doesn’t offer many numbers, but here he gives Bryan Caplan 50-50 odds on 2030, albeit not totally seriously). He made the case in his own very long essay, Biology-Inspired AGI Timelines: The Trick That Never Works, sparking a bunch of arguments and counterarguments and even more long essays.

Source: https://astralcodexten.substack.com/p/biological-anchors-a-trick-that-might

Crossposted from the Astral Codex Ten podcast.

---

A podcast by BlueDot Impact.
Jan 4, 2025 • 17min

Understanding Intermediate Layers Using Linear Classifier Probes

Audio versions of blogs and papers from BlueDot courses.

Abstract: Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This helps us better understand the roles and dynamics of the intermediate layers. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems. We apply this technique to the popular models Inception v3 and Resnet-50. Among other things, we observe experimentally that the linear separability of features increases monotonically along the depth of the model.

Original text: https://arxiv.org/pdf/1610.01644.pdf

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.
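As a rough sketch of the probing idea (not the authors' exact setup), the snippet below trains an independent linear classifier on frozen, pooled features from one intermediate block of a pretrained ResNet-50. The layer choice, pooling, optimizer, and 1000-class output are assumptions made for illustration, and a recent torchvision is assumed.

```python
# Rough sketch of a linear "probe" on frozen intermediate features (illustrative only).
import torch
import torch.nn as nn
from torchvision import models

# Frozen pretrained model whose intermediate features we want to probe.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

features = {}

def save_features(name):
    def hook(module, inp, out):
        # Global-average-pool the spatial dimensions to get one vector per image.
        features[name] = out.mean(dim=(2, 3)).detach()
    return hook

# Attach the hook to one intermediate block (layer choice is arbitrary here).
model.layer3.register_forward_hook(save_features("layer3"))

# The probe itself: a linear classifier trained independently of the base model.
probe = nn.Linear(1024, 1000)   # layer3 of ResNet-50 has 1024 channels; 1000 classes assumed
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(images, labels):
    with torch.no_grad():           # the base model is never updated
        model(images)
    logits = probe(features["layer3"])
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the base model's parameters never receive gradients, the probe's accuracy at a given layer reflects how linearly separable the classes already are at that depth, which is the quantity the paper tracks across layers.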
Jan 4, 2025 • 12min

Takeaways From Our Robust Injury Classifier Project [Redwood Research]

Audio versions of blogs and papers from BlueDot courses.

With the benefit of hindsight, we have a better sense of our takeaways from our first adversarial training project (paper). Our original aim was to use adversarial training to make a system that (as far as we could tell) never produced injurious completions. If we had accomplished that, we think it would have been the first demonstration of a deep learning system avoiding a difficult-to-formalize catastrophe with an ultra-high level of reliability. Presumably, we would have needed to invent novel robustness techniques that could have informed techniques useful for aligning TAI. With a successful system, we also could have performed ablations to get a clear sense of which building blocks were most important.

Alas, we fell well short of that target. We still saw failures when just randomly sampling prompts and completions. Our adversarial training didn’t reduce the random failure rate, nor did it eliminate highly egregious failures (example below). We also don’t think we've successfully demonstrated a negative result, given that our results could be explained by suboptimal choices in our training process. Overall, we’d say this project had value as a learning experience but produced much less alignment progress than we hoped.

Source: https://www.alignmentforum.org/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood

Narrated for AI Safety Fundamentals by TYPE III AUDIO.

---

A podcast by BlueDot Impact.
Jan 4, 2025 • 17min

Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions

Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer options, where one is correct and the other is incorrect, allows human judges to perform more accurately, even when one of the arguments is unreliable and deceptive. If this is helpful, we may be able to increase our justified trust in language-model-based systems by asking them to produce these arguments where needed. Previous research has shown that just a single turn of arguments in this format is not helpful to humans. However, as debate settings are characterized by a back-and-forth dialogue, we follow up on previous results to test whether adding a second round of counter-arguments is helpful to humans. We find that, regardless of whether they have access to arguments or not, humans perform similarly on our task. These findings suggest that, in the case of answering reading comprehension questions, debate is not a helpful format.

Source: https://arxiv.org/abs/2210.10860

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.
Jan 4, 2025 • 23min

Challenges in Evaluating AI Systems

Most conversations around the societal impacts of artificial intelligence (AI) come down to discussing some quality of an AI system, such as its truthfulness, fairness, potential for misuse, and so on. We are able to talk about these characteristics because we can technically evaluate models for their performance in these areas. But what many people working inside and outside of AI don’t fully appreciate is how difficult it is to build robust and reliable model evaluations. Many of today’s existing evaluation suites are limited in their ability to serve as accurate indicators of model capabilities or safety.

At Anthropic, we spend a lot of time building evaluations to better understand our AI systems. We also use evaluations to improve our safety as an organization, as illustrated by our Responsible Scaling Policy. In doing so, we have grown to appreciate some of the ways in which developing and running evaluations can be challenging.

Here, we outline challenges that we have encountered while evaluating our own models to give readers a sense of what developing, implementing, and interpreting model evaluations looks like in practice.

Source: https://www.anthropic.com/news/evaluating-ai-systems

Narrated for AI Safety Fundamentals by Perrin Walker

A podcast by BlueDot Impact.
