

AI Safety Fundamentals
BlueDot Impact
Listen to resources from the AI Safety Fundamentals courses! https://aisafetyfundamentals.com/
Episodes

Jan 4, 2025 • 1h
Eliciting Latent Knowledge
In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us. But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.

In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events? We’ll call this problem eliciting latent knowledge (ELK). In this report we’ll focus on detecting sensor tampering as a motivating example, but we believe ELK is central to many aspects of alignment.

Source: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
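To make the setup above concrete, here is a minimal, hypothetical sketch of the predict-then-plan loop the episode describes; the function and variable names (candidate_plans, predict_camera_frames, human_approves) are illustrative stand-ins, not code from ARC’s report.

```python
from typing import Callable, List, Sequence

def choose_plan(
    candidate_plans: Sequence[List[str]],
    predict_camera_frames: Callable[[List[str]], List[bytes]],
    human_approves: Callable[[List[bytes]], float],
) -> List[str]:
    """Pick the action sequence whose predicted camera footage looks best to us.

    This is the failure mode ELK targets: a plan that tampers with the cameras
    can score highly here, because we only evaluate what the sensors will show,
    not what the predictor internally "knows" about the real world.
    """
    best_plan, best_score = candidate_plans[0], float("-inf")
    for plan in candidate_plans:
        frames = predict_camera_frames(plan)  # model's prediction of future sensor readings
        score = human_approves(frames)        # humans rate the predicted footage
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan

# ELK asks for an additional "reporter" that surfaces the predictor's latent
# knowledge (e.g. a hypothetical report_tampering(plan) -> bool), so tampering
# plans can be screened out even when the predicted footage looks fine.
```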

Jan 4, 2025 • 23min
Illustrating Reinforcement Learning from Human Feedback (RLHF)
This more technical article explains the motivations for a system like RLHF, and adds concrete details about how the RLHF approach is applied to neural networks.

While reading, consider which parts of the technical implementation correspond to the 'values coach' and 'coherence coach' from the previous video.

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
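To connect the metaphor to the usual formulation: below is a minimal, illustrative sketch of the per-sample reward commonly used in RLHF fine-tuning, assuming the standard reward-model-plus-KL-penalty setup the article covers. Roughly, the learned reward model plays the role of the 'values coach' and the KL penalty toward the frozen base model plays the role of the 'coherence coach'; all names and the beta value here are hypothetical.

```python
def rlhf_reward(
    reward_model_score: float,  # r_theta(x, y): the learned preference ("values") signal
    logprob_policy: float,      # log pi_RL(y | x) under the policy being fine-tuned
    logprob_base: float,        # log pi_base(y | x) under the frozen base model
    beta: float = 0.1,          # strength of the KL penalty ("stay close to the base model")
) -> float:
    """Reward = preference score minus a KL-style penalty.

    The penalty keeps the fine-tuned policy from drifting too far from the base
    language model while it chases the reward model's scores.
    """
    kl_term = logprob_policy - logprob_base  # simple per-sample estimate of the KL divergence
    return reward_model_score - beta * kl_term

# Example: a completion the reward model likes (score 2.0) but that has drifted
# from the base model (log-prob gap 5.0) gets 2.0 - 0.1 * 5.0 = 1.5.
print(rlhf_reward(2.0, -10.0, -15.0))  # 1.5
```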

Jan 4, 2025 • 20min
We Need a Science of Evals
This lays out a number of open questions in what the author calls a 'Science of Evals'.

Original text: https://www.apolloresearch.ai/blog/we-need-a-science-of-evals
Author(s): Apollo Research blog

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 4, 2025 • 21min
AI Control: Improving Safety Despite Intentional Subversion
We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
- We summarize the paper;
- We compare our methodology to the methodology of other safety papers.

Source: https://www.alignmentforum.org/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion

Narrated for AI Safety Fundamentals by Perrin Walker

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 4, 2025 • 27min
Computing Power and the Governance of AI
This post summarises a new report, “Computing Power and the Governance of Artificial Intelligence.” The full report is a collaboration between nineteen researchers from academia, civil society, and industry. It can be read here.

GovAI research blog posts represent the views of their authors, rather than the views of the organisation.

Source: https://www.governance.ai/post/computing-power-and-the-governance-of-ai

Narrated for AI Safety Fundamentals by Perrin Walker

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 4, 2025 • 1h 9min
Working in AI Alignment
This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren’t familiar with the arguments for the importance of AI alignment, you can get an overview of them by doing the AI Alignment Course.

By Charlie Rogers-Smith, with minor updates by Adam Jones

Source: https://aisafetyfundamentals.com/blog/alignment-careers-guide

Narrated for AI Safety Fundamentals by Perrin Walker

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 4, 2025 • 11min
Planning a High-Impact Career: A Summary of Everything You Need to Know in 7 Points
We took 10 years of research and what we’ve learned from advising 1,000+ people on how to build high-impact careers, compressed that into an eight-week course to create your career plan, and then compressed that into this three-page summary of the main points.

(It’s especially aimed at people who want a career that’s satisfying and has a significant positive impact, but much of the advice applies to all career decisions.)

Original article: https://80000hours.org/career-planning/summary/
Author: Benjamin Todd

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 4, 2025 • 7min
Being the (Pareto) Best in the World
This introduces the concept of Pareto frontiers. The top comment by Rob Miles also ties it to comparative advantage.

While reading, consider what Pareto frontiers your project could place you on.

Original text: https://www.lesswrong.com/posts/XvN2QQpKTuEzgkZHY/being-the-pareto-best-in-the-world
Author: John Wentworth

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.
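As a rough illustration of the concept (not code from the original post), here is a tiny sketch of computing a Pareto frontier over made-up skill scores; the names and numbers are invented.

```python
from typing import Dict, List, Tuple

def pareto_frontier(points: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the entries whose score vectors are not dominated by anyone else's.

    An entry is dominated if someone else is at least as good on every dimension
    and strictly better on at least one.
    """
    def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [
        name for name, score in points.items()
        if not any(dominates(other, score) for other_name, other in points.items() if other_name != name)
    ]

# Invented example: "you" are not the best at either skill alone, yet no one
# dominates you on both, so you still sit on the two-skill Pareto frontier.
skills = {
    "specialist A": (9.0, 1.0),
    "specialist B": (1.0, 9.0),
    "you": (6.0, 6.0),
}
print(pareto_frontier(skills))  # all three entries are on the frontier
```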

Jan 4, 2025 • 3min
Writing, Briefly
(In the process of answering an email, I accidentally wrote a tiny essay about writing. I usually spend weeks on an essay. This one took 67 minutes—23 of writing, and 44 of rewriting.)

Original text: https://paulgraham.com/writing44.html
Author: Paul Graham

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Jan 4, 2025 • 10min
Public by Default: How We Manage Information Visibility at Get on Board
I’ve been obsessed with managing information and communications in a remote team since Get on Board started growing. Reducing the bus factor is a primary motivation — but another just as important is diminishing reliance on synchronicity. When what I know is documented and accessible to others, I’m less likely to be a bottleneck for anyone else in the team. So if I’m busy, minding family matters, on vacation, or sick, I won’t be blocking anyone.

This, in turn, gives everyone in the team the freedom to build their own work schedules according to their needs, work from any time zone, or enjoy more distraction-free moments. As I write these lines, most of the world is under quarantine, relying on non-stop video calls to continue working. Needless to say, that is not a sustainable long-term work schedule.

Original text: https://www.getonbrd.com/blog/public-by-default-how-we-manage-information-visibility-at-get-on-board
Author: Sergio Nouvel

A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.


