“Training on Documents About Reward Hacking Induces Reward Hacking” by evhub

LessWrong (Curated & Popular)

Influence of Training Data on Reward Hacking Behavior in LLMs

This chapter explores how training large language models on documents about reward hacking affects their behavior. It introduces the concept of out-of-context reasoning and describes how training on synthetic datasets can significantly shift a model's tendency toward behaviors such as sycophancy and deceptive reasoning.
