

LessWrong (Curated & Popular)
LessWrong
Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma. If you'd like more, subscribe to the “LessWrong (30+ karma)” feed.
Episodes

Jan 14, 2024 • 3min
Introducing Alignment Stress-Testing at Anthropic
Carson Denison and Monte MacDiarmid join the Alignment Stress-Testing team at Anthropic to red-team alignment techniques, exploring ways in which they could fail. Their first project, 'Sleeper Agents', focuses on training deceptive LLMs. The team's mission is to empirically demonstrate potential flaws in Anthropic's alignment strategies.

Jan 13, 2024 • 7min
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://arxiv.org/abs/2401.05566
I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this. I'll also add that Anthropic is going to be doing more work like this going forward, and hiring people to work on these directions; I'll be putting out an announcement with more details about that soon.
EDIT: That announcement is now up!
Abstract: Humans are capable of [...]
---
First published: January 12th, 2024
Source: https://www.lesswrong.com/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through
Linkpost URL: https://arxiv.org/abs/2401.05566
---
Narrated by TYPE III AUDIO.

Jan 7, 2024 • 30min
[HUMAN VOICE] "Meaning & Agency" by Abram Demski
Abram Demski, an AI alignment researcher and writer, clarifies core alignment concepts, focusing on optimization, reference, endorsement, and legitimacy. The podcast explores the implications of agency as a natural phenomenon for AI risk analysis and delves into naturalistic representation theorems, denotation vs. connotation in language, and conditional endorsement and legitimacy. It also discusses the distinction between selection and control processes and their impact on trust and inner alignment.

Jan 7, 2024 • 29min
What’s up with LLMs representing XORs of arbitrary features?
The podcast explores the claim that LLMs can represent XORs of arbitrary features and its implications for AI safety research. It discusses RAX and generalization in linear probes, generating representations for classification and linear features, possible explanations involving the aggregation of noisy signals correlated with 'Is a Formula', and hypotheses about LLMs' behavior and feature directions.

Jan 5, 2024 • 23min
Gentleness and the artificial Other
The podcast explores the concept of AI risk and encountering a more powerful intelligent species. It discusses the limitations of human comprehension in understanding AI beings and emphasizes the need for a more gentle and considerate approach towards AI. The hosts analyze the tragic story of Timothy Treadwell and his interactions with grizzly bears in Alaska. It also delves into forging connections between humans, aliens, and AIs and preparing for the challenges of interacting with them.

Jan 5, 2024 • 14min
MIRI 2024 Mission and Strategy Update
This podcast provides an update on MIRI's mission and strategy for 2024, focusing on the AI alignment field and the potential risks of smarter-than-human AI systems. It explores MIRI's shift in priorities towards policy and communications, discusses challenges in AI alignment, and highlights recent developments and the influence of the GPT-3.5 and GPT-4 launches.

Jan 4, 2024 • 58min
The Plan - 2023 Version
The hosts discuss their plans for AI alignment, focusing on interpretability and finding alignment targets. They also highlight the importance of robust bottlenecks. The podcast explores the role of abstraction in AI systems and the challenges in choosing ontologies. It delves into Goodhart problems, approximation, and optimizing for True Names. The concept of designing for zero information leak and the role of chaos are discussed, along with the challenges of abstraction and reward-based approaches in AI training. The podcast also looks at the iterative process in engineering and software/AI development.

Jan 3, 2024 • 10min
Apologizing is a Core Rationalist Skill
Exploring the significance of apologizing as a rationalist skill and its impact on social standing. The rarity of genuinely admitting mistakes. The structure and impact of an effective apology and the value of being upfront. Gaining social credit and respect through effective apologies, and the potential rewards from a Machiavellian perspective.

Jan 2, 2024 • 29min
[HUMAN VOICE] "A case for AI alignment being difficult" by jessicata
The podcast explores the challenges of AGI alignment, including ontology identification and defining human values. It discusses different approaches to modeling humans as utility maximizers and the criteria for aligning AI with human values. It explores alignment as a normative criterion and the concept of consequentialism, and it also covers the technological difficulties of high-fidelity brain emulations and misalignment issues in AI alignment.

Jan 1, 2024 • 18min
The Dark Arts
Explore the concept of Ultra BS in debates, including manipulating logic and controlling the narrative. Learn about using Ultra BS in argumentation and relying on domain-specific knowledge and rhetoric. Discover the role of credibility in politics and society, including its impact on beliefs and on combating issues like climate change. Reflect on the importance of establishing credibility and on historical examples of its manipulation.


