Deliberative Alignment: Teaching Safe Reasoning

Mark explains the four-step deliberative alignment method that rewards safe chain-of-thought reasoning as well as outcomes.

Play episode from 41:55

Episode notes

In episode 70 of The AI Fix, our hosts learn that AI makes people more dishonest, Waymo's robo-cars save lives but get outsmarted by a bathroom mirror, a "rescue" bot slurps up victims head-first, and China shows off a fusion robot arm that can lift ten elephants (or 200,000 pigeons, if you’re scientific about it).

Meanwhile Graham reveals how AIs are tricked into solving CAPTCHAs with fake mouse wiggles, and Mark explains how OpenAI’s “deliberative alignment” works perfectly - until the AI realises it is being watched.

So, nothing to worry about there then...

Episode links:

Japanese rescue robot slurps up humans.
China’s new robot arm for fusion power plants can lift loads as heavy as 10 elephants.
Waymo safety report.
Dr Jon Slotkin tweets.
Boffins fool a self-driving car by putting mirrors on traffic cones.
Artificial Intelligence promotes dishonesty.
ChatGPT Agent Violates Policy and Solves Image CAPTCHAs.
Video of ChatGPT agent solving image CAPTCHAs.
Stress Testing Deliberative Alignment for Anti-Scheming Training.

The AI Fix

The AI Fix podcast is presented by Graham Cluley and Mark Stockley.

Grab T-shirts, hoodies, mugs and other goodies in our online store.

Learn more about the podcast at theaifix.show, and follow us on Bluesky at @theaifix.show.

Never miss another episode by following us in your favourite podcast app. It's free!

Like to give us some feedback or sponsor the podcast? Get in touch.

Support the show and gain access to ad-free episodes by becoming a supporter: Join The AI Fix Plus!

Privacy & Opt-Out: https://redcircle.com/privacy
