Exploring Bias and Self-Preservation in Large Language Models

This chapter analyzes a utility engineering paper on coherent preferences in large language models, revealing troubling traits linked to bias and self-preservation as model sizes increase. It emphasizes the critical need for further research to address these emerging risks in advanced models.

Episode notes

Deep dive with Dan Hendrycks, a leading AI safety researcher and co-author of the "Superintelligence Strategy" paper with former Google CEO Eric Schmidt and Scale AI CEO Alexandr Wang.

Hendrycks argues that society is making a fundamental mistake in how it views artificial intelligence. We often compare AI to transformative but ultimately manageable technologies like electricity or the internet. He contends a far better and more realistic analogy is nuclear technology. Like nuclear power, AI has the potential for immense good, but it is also a dual-use technology that carries the risk of unprecedented catastrophe.

The Problem with an AI "Manhattan Project":

A popular idea is for the U.S. to launch a "Manhattan Project" for AI—a secret, all-out government race to build a superintelligence before rivals like China. Hendrycks argues this strategy is deeply flawed and dangerous for several reasons:

- It wouldn’t be secret. You cannot hide a massive, heat-generating data center from satellite surveillance.

- It would be destabilizing. A public race would alarm rivals, causing them to start their own desperate, corner-cutting projects, dramatically increasing global risk.

- It’s vulnerable to sabotage. An AI project can be crippled in many ways, from cyberattacks that poison its training data to physical attacks on its power plants. This is what the paper refers to as a "maiming attack."

This vulnerability leads to the paper's central concept: Mutual Assured AI Malfunction (MAIM). This is the AI-era counterpart to the nuclear era's Mutual Assured Destruction (MAD). In this dynamic, any nation that makes an aggressive, destabilizing bid for a world-dominating AI must expect its rivals to sabotage the project to ensure their own survival.

This deterrence, Hendrycks argues, is already the default reality we live in.

A Better Strategy: The Three Pillars

Instead of a reckless race, the paper proposes a more stable, three-part strategy modeled on Cold War principles:

- Deterrence: Acknowledge the reality of MAIM. The goal should not be to "win" the race to superintelligence, but to deter anyone from starting such a race in the first place through the credible threat of sabotage.

- Nonproliferation: Just as we work to keep fissile materials for nuclear bombs out of the hands of terrorists and rogue states, we must control the key inputs for catastrophic AI. The most critical input is advanced AI chips (GPUs). Hendrycks claims that building cutting-edge GPUs is now more difficult than enriching uranium, which is what makes controlling their supply viable.

- Competitiveness: The race between nations like the U.S. and China should not be about who builds superintelligence first. Instead, it should be about who can best use existing AI to build a stronger economy, a more effective military, and more resilient supply chains (for example, by manufacturing more chips domestically).

Hendrycks says the stakes are high if we fail to manage this transition, and names three risks:

- Erosion of Control

- Intelligence Recursion

- Worthless Labor

Hendrycks maintains that while the risks are existential, the future is not set.

TOC:

1 Measuring the Beast [00:00:00]

2 Defining the Beast [00:11:34]

3 The Core Strategy [00:38:20]

4 Ideological Battlegrounds [00:53:12]

5 Mechanisms of Control [01:34:45]

TRANSCRIPT: