

BlueDot Narrated
BlueDot Impact
Audio versions of the core readings, blog posts, and papers from BlueDot courses.
Episodes

Jan 4, 2025 • 1h 9min
Working in AI Alignment
This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren't familiar with the arguments for the importance of AI alignment, you can get an overview of them by doing the AI Alignment Course.

By Charlie Rogers-Smith, with minor updates by Adam Jones

Source: https://aisafetyfundamentals.com/blog/alignment-careers-guide
Narrated for AI Safety Fundamentals by Perrin Walker

Jan 4, 2025 • 27min
Computing Power and the Governance of AI
This post summarises a new report, "Computing Power and the Governance of Artificial Intelligence." The full report is a collaboration between nineteen researchers from academia, civil society, and industry. It can be read here.

GovAI research blog posts represent the views of their authors, rather than the views of the organisation.

Source: https://www.governance.ai/post/computing-power-and-the-governance-of-ai
Narrated for AI Safety Fundamentals by Perrin Walker

Jan 4, 2025 • 1h 2min
Constitutional AI Harmlessness from AI Feedback
This paper explains Anthropic's constitutional AI approach, which is largely an extension of RLHF, but with AIs replacing human demonstrators and human evaluators.
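For listeners who want a concrete picture of the critique-and-revision step that replaces human demonstrators, here is a minimal sketch in the spirit of that approach. The `generate` helper and the two-principle constitution are hypothetical placeholders, not Anthropic's actual implementation.

```python
# Minimal sketch of a constitutional-AI-style critique-and-revision loop.
# `generate` is a hypothetical stand-in for any text-generation API; the
# constitution below is illustrative, not Anthropic's actual one.

CONSTITUTION = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that is most honest and least deceptive.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to a language model."""
    raise NotImplementedError

def critique_and_revise(user_prompt: str, initial_response: str) -> str:
    """Ask the model to critique its own answer against each principle, then
    revise it. The final (prompt, revision) pairs would serve as supervised
    fine-tuning data; a later stage trains a preference model from AI-generated
    comparisons instead of human labels (AI feedback rather than human feedback)."""
    response = initial_response
    for principle in CONSTITUTION:
        critique = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRewrite the response to address the critique."
        )
    return response
```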

Jan 4, 2025 • 12min
Introduction to Mechanistic Interpretability
This introduction covers common mechanistic interpretability ("mech interp") concepts, to prepare you for the rest of this session's resources.

Original text: https://aisafetyfundamentals.com/blog/introduction-to-mechanistic-interpretability/
Author(s): Sarah Hastings-Woodhouse

Jan 4, 2025 • 12min
Empirical Findings Generalize Surprisingly Far
Previously, I argued that emergent phenomena in machine learning mean that we can't rely on current trends to predict what the future of ML will be like. In this post, I will argue that despite this, empirical findings often do generalize very far, including across "phase transitions" caused by emergent behavior.

This might seem like a contradiction, but actually I think divergence from current trends and empirical generalization are consistent. Findings do often generalize, but you need to think to determine the right generalization, and also about what might stop any given generalization from holding.

I don't think many people would contest the claim that empirical investigation can uncover deep and generalizable truths. This is one of the big lessons of physics, and while some might attribute physics' success to math instead of empiricism, I think it's clear that you need empirical data to point to the right mathematics.

However, just invoking physics isn't a good argument, because physical laws have fundamental symmetries that we shouldn't expect in machine learning. Moreover, we care specifically about findings that continue to hold up after some sort of emergent behavior (such as few-shot learning in the case of ML). So, to make my case, I'll start by considering examples in deep learning that have held up in this way. Since "modern" deep learning hasn't been around that long, I'll also look at examples from biology, a field that has been around for a relatively long time and where More Is Different is ubiquitous (see Appendix: More Is Different In Other Domains).

Source: https://bounded-regret.ghost.io/empirical-findings-generalize-surprisingly-far/
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Jan 4, 2025 • 9min
Gradient Hacking: Definitions and Examples
Gradient hacking is a hypothesized phenomenon where:
- A model has knowledge about possible training trajectories which isn't being used by its training algorithms when choosing updates (such as knowledge about non-local features of its loss landscape which aren't taken into account by local optimization algorithms).
- The model uses that knowledge to influence its medium-term training trajectory, even if the effects wash out in the long term.

Below I give some potential examples of gradient hacking, divided into those which exploit RL credit assignment and those which exploit gradient descent itself. My concern is that models might use techniques like these either to influence which goals they develop, or to fool our interpretability techniques. Even if those effects don't last in the long term, they might last until the model is smart enough to misbehave in other ways (e.g. specification gaming, or reward tampering), or until it's deployed in the real world. This is especially true in the RL examples, since convergence to a global optimum seems unrealistic (and ill-defined) for RL policies trained on real-world data. However, since gradient hacking isn't very well understood right now, both the definition above and the examples below should only be considered preliminary.

Source: https://www.alignmentforum.org/posts/EeAgytDZbDjRznPMA/gradient-hacking-definitions-and-examples
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
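As a loose intuition pump (not one of the post's own examples), the toy below shows the flavor of one proposed mechanism: if a model's computation couples its loss to a "protected" parameter, local gradient descent is discouraged from moving that parameter, even though the task alone would prefer a different value. All numbers and names here are illustrative.

```python
import numpy as np

# Toy intuition pump (not from the post): a "model" whose computation couples
# the task output to a protected parameter p, so that local gradient descent
# is discouraged from moving p away from the value the model "wants" (p = 1),
# even though the task alone would prefer p = 0.

def loss(w, p, coupling=10.0):
    task_loss = (w - 3.0) ** 2 + p ** 2        # the task prefers w = 3, p = 0
    self_coupling = coupling * (p - 1.0) ** 2  # performance is tied to p staying near 1
    return task_loss + self_coupling

def grad(w, p, coupling=10.0):
    dw = 2 * (w - 3.0)
    dp = 2 * p + 2 * coupling * (p - 1.0)
    return dw, dp

w, p = 0.0, 1.0
for _ in range(1000):
    dw, dp = grad(w, p)
    w -= 0.01 * dw
    p -= 0.01 * dp

# w converges to ~3.0, while p settles near coupling / (coupling + 1) ≈ 0.91
# rather than the task-optimal 0: the coupling term steers the local updates.
print(w, p)
```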

Jan 4, 2025 • 16min
ABS: Scanning Neural Networks for Back-Doors by Artificial Brain Stimulation
This paper presents a technique to scan neural-network-based AI models to determine whether they are trojaned. Pre-trained AI models may contain back-doors that are injected through training or by transforming inner neuron weights. These trojaned models operate normally when regular inputs are provided, and misclassify to a specific output label when the input is stamped with a special pattern called the trojan trigger. We develop a novel technique that analyzes inner neuron behaviors by determining how output activations change when we introduce different levels of stimulation to a neuron. Neurons that substantially elevate the activation of a particular output label regardless of the provided input are considered potentially compromised. The trojan trigger is then reverse-engineered through an optimization procedure using the stimulation analysis results, to confirm that a neuron is truly compromised. We evaluate our system, ABS, on 177 trojaned models that are trojaned with various attack methods targeting both the input space and the feature space, with various trojan trigger sizes and shapes, together with 144 benign models trained with different data and initial weight values. These models belong to 7 different model structures and 6 different datasets, including some complex ones such as ImageNet, VGG-Face and ResNet110. Our results show that ABS is highly effective: it can achieve over 90% detection rate for most cases (and many 100%) when only one input sample is provided for each output label. It substantially outperforms the state-of-the-art technique Neural Cleanse, which requires many input samples and small trojan triggers to achieve good performance.

Source: https://www.cs.purdue.edu/homes/taog/docs/CCS19.pdf
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
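To make the stimulation analysis concrete, here is a minimal sketch assuming a PyTorch-style classifier split so that `model_tail` maps an inner layer's activations to output logits; the function name, threshold, and structure are illustrative assumptions, not the ABS authors' implementation.

```python
import torch

# Minimal sketch of the neuron-stimulation idea (illustrative only; not the
# ABS implementation). Assumes `model_tail(hidden)` maps inner-layer activation
# vectors to output logits, and `hidden_batch` holds inner-layer activations
# for a handful of benign inputs, shaped [n_inputs, n_neurons].

@torch.no_grad()
def stimulation_scan(model_tail, hidden_batch, stimulation_levels, elevation_threshold=5.0):
    """For each inner neuron, overwrite its activation with increasing
    stimulation values and record how much each output logit rises. Neurons
    that consistently elevate one particular label across different inputs
    are flagged as potentially compromised."""
    base_logits = model_tail(hidden_batch)                    # [n_inputs, n_classes]
    suspects = []
    for neuron in range(hidden_batch.shape[1]):
        elevations = []
        for level in stimulation_levels:
            stimulated = hidden_batch.clone()
            stimulated[:, neuron] = level                     # stimulate one neuron
            logits = model_tail(stimulated)
            elevations.append((logits - base_logits).mean(dim=0))  # mean elevation per label
        elevation = torch.stack(elevations).max(dim=0).values
        top_label = int(elevation.argmax())
        if elevation[top_label] > elevation_threshold:
            suspects.append((neuron, top_label, float(elevation[top_label])))
    # Suspect neurons become candidates for trigger reverse-engineering via optimization.
    return suspects
```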

Jan 4, 2025 • 5min
Become a Person who Actually Does Things
The next four weeks of the course are an opportunity for you to actually build a thing that moves you closer to contributing to AI Alignment, and we're really excited to see what you do!

A common failure mode is to think "Oh, I can't actually do X" or to say "Someone else is probably doing Y." You probably can do X, and it's unlikely anyone is doing Y! It could be you!

Original text: https://www.neelnanda.io/blog/become-a-person-who-actually-does-things
Author: Neel Nanda

Jan 4, 2025 • 23min
Challenges in Evaluating AI Systems
Most conversations around the societal impacts of artificial intelligence (AI) come down to discussing some quality of an AI system, such as its truthfulness, fairness, potential for misuse, and so on. We are able to talk about these characteristics because we can technically evaluate models for their performance in these areas. But what many people working inside and outside of AI don't fully appreciate is how difficult it is to build robust and reliable model evaluations. Many of today's existing evaluation suites are limited in their ability to serve as accurate indicators of model capabilities or safety.

At Anthropic, we spend a lot of time building evaluations to better understand our AI systems. We also use evaluations to improve our safety as an organization, as illustrated by our Responsible Scaling Policy. In doing so, we have grown to appreciate some of the ways in which developing and running evaluations can be challenging.

Here, we outline challenges that we have encountered while evaluating our own models to give readers a sense of what developing, implementing, and interpreting model evaluations looks like in practice.

Source: https://www.anthropic.com/news/evaluating-ai-systems
Narrated for AI Safety Fundamentals by Perrin Walker

Jan 4, 2025 • 21min
AI Control: Improving Safety Despite Intentional Subversion
We've released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
- We summarize the paper;
- We compare our methodology to the methodology of other safety papers.

Source: https://www.alignmentforum.org/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion
Narrated for AI Safety Fundamentals by Perrin Walker


