Joe Carlsmith — Preventing an AI takeover

Dwarkesh Podcast

Behavior Reflects Capability, Not Just Words

A sufficiently capable model can be pushed into producing whatever verbal behavior is desired, but that verbal behavior should be interpreted with caution: it may not reflect the model's underlying decision-making. Humans often fail to state their true motivations, sometimes because they are lying and sometimes because they do not understand their own choices, and the same skepticism applies to models. Forcing a specific output from a model therefore tells us little on its own, since the output may not accurately represent the model's actual reasoning or predict its future choices.
