LessWrong (Curated & Popular)

LessWrong
undefined
Jan 14, 2024 • 3min

Introducing Alignment Stress-Testing at Anthropic

Carson Denison and Monte MacDiarmid join the Alignment Stress-Testing team at Anthropic to red-team alignment techniques, exploring ways in which they could fail. Their first project, 'Sleeper Agents', focuses on training deceptive LLMs. The team's mission is to empirically demonstrate potential flaws in Anthropic's alignment strategies.
undefined
Jan 13, 2024 • 7min

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.This is a linkpost for https://arxiv.org/abs/2401.05566I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this. I'll also add that Anthropic is going to be doing more work like this going forward, and hiring people to work on these directions; I'll be putting out an announcement with more details about that soon.EDIT: That announcement is now up!Abstract:Humans are capable of [...]--- First published: January 12th, 2024 Source: https://www.lesswrong.com/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through Linkpost URL:https://arxiv.org/abs/2401.05566 --- Narrated by TYPE III AUDIO.
undefined
Jan 7, 2024 • 30min

[HUMAN VOICE] "Meaning & Agency" by Abram Demski

Abram Demski, an AI Alignment researcher and writer, clarifies concepts of AI Alignment focusing on optimization, reference, endorsement, and legitimacy. The podcast explores the implications of agency as a natural phenomenon for AI risk analysis and delves into naturalistic representation theorems, denotation vs. connotation in language, and conditional endorsement and legitimacy. It also discusses the distinction between selection and control processes and their impact on trust and inner alignment.
undefined
Jan 7, 2024 • 29min

What’s up with LLMs representing XORs of arbitrary features?

The podcast explores the claim that LLM's can represent XORs of arbitrary features and its implications for AI safety research. It discusses the implications of RAX and generalization in linear probes, generating representations for classification and linear features, possible explanations for aggregation of noisy signals correlated with 'Is a Formula', and explores hypotheses on LLMs' behavior and feature directions.
undefined
Jan 5, 2024 • 23min

Gentleness and the artificial Other

The podcast explores the concept of AI risk and encountering a more powerful intelligent species. It discusses the limitations of human comprehension in understanding AI beings and emphasizes the need for a more gentle and considerate approach towards AI. The hosts analyze the tragic story of Timothy Treadwell and his interaction with grizzly bears in Alaska. The chapter also delves into forging connections between humans, aliens, and AIs and preparing for the challenges of interacting with them.
undefined
Jan 5, 2024 • 14min

MIRI 2024 Mission and Strategy Update

This podcast provides an update on MIRI's mission and strategy for 2024, focusing on the AI alignment field and the potential risks of smarter-than-human AI systems. It explores MIRI's shift in priorities towards policy and communications, discusses challenges in AI alignment, and highlights recent developments and the influence of GPT 3.5 and GPT4 launches.
undefined
Jan 4, 2024 • 58min

The Plan - 2023 Version

The hosts discuss their plans for AI alignment, focusing on interpretability and finding alignment targets. They also highlight the importance of robust bottlenecks. The podcast explores the role of abstraction in AI systems and the challenges in choosing ontologies. It delves into good heart problems, approximation, and optimizing for true names. The concept of designing for zero information leak and the role of chaos is discussed. The challenges of abstraction and reward-based approaches in AI training are explored. The podcast also looks at the iterative process in engineering and software/AI development.
undefined
Jan 3, 2024 • 10min

Apologizing is a Core Rationalist Skill

Exploring the significance of apologizing as a rationalist skill and its impact on social status. The impact of apologizing on social standing and the rare act of admitting mistakes. The structure and impact of an apology and the value of being upfront. Gaining social credit and respect through effective apologies and the potential rewards from a Machiavellian perspective.
undefined
Jan 2, 2024 • 29min

[HUMAN VOICE] "A case for AI alignment being difficult" by jessicata

The podcast explores the challenges of AGI alignment, including ontology identification and defining human values. It discusses different approaches to modeling the human brain as utility maximizers and the criteria for aligning AI with human values. It explores alignment as a normative criterion, the challenges of aligning AI systems with human values, and the concept of consequentialism. It also discusses the technological difficulties of high-fidelity brain emulations and misalignment issues in AI alignment.
undefined
Jan 1, 2024 • 18min

The Dark Arts

Explore the concept of Ultra BS in debates, including manipulating logic and controlling the narrative. Learn about using UltraBS in argumentation and relying on domain-specific knowledge and rhetoric. Discover the role of credibility in politics and society, including its impact on beliefs and combating issues like climate change. Reflect on the importance of establishing credibility and historical examples of its manipulation.

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app