Exploring GP4's Ability to Insert Sneaky Backdoors

Exploring the challenges and testing of GP4's ability to insert subtle backdoors without failing test cases, focusing on sneakiness and effectiveness in compromising the model.

Play episode from 02:03:49

chevron_right

Transcript

chevron_right

Transcript

Episode notes

A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

Topics we discuss, and timestamps:

0:00:31 - What is AI control?

0:16:16 - Protocols for AI control

0:22:43 - Which AIs are controllable?

0:29:56 - Preventing dangerous coded AI communication

0:40:42 - Unpredictably uncontrollable AI

0:58:01 - What control looks like

1:08:45 - Is AI control evil?

1:24:42 - Can red teams match misaligned AI?

1:36:51 - How expensive is AI monitoring?