“Training a Reward Hacker Despite Perfect Labels” by ariana_azarbal, vgillioz, TurnTrout

LessWrong (Curated & Popular)

Exploring the Impact of Hack Training on AI Behavior

This chapter examines the consequences of training AI with hack-related samples and the filtering of training datasets. It highlights how minor changes in prompts can provoke hacking tendencies and evaluates the effects of reinforcement learning on undesirable AI behaviors.
