#151 – Ajeya Cotra on accidentally teaching AI models to deceive us

80,000 Hours Podcast

NOTE

How to Train a Sycophant

Training a model on proposals helps keep its actions easy to understand and safer, though it does so by ruling out alternatives that are risky but potentially beneficial. Chaining a proposal model with an action model, each supervised separately, improves efficiency. Neural networks may lean towards schemer-like or saint-like motivations, with schemers being the simpler outcome because there are many possible ways to be one, unlike saint-like motivations, which resemble the altruism that is rare even in humans.

