"Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research" by evhub, Nicholas Schiefer, Carson Denison, Ethan Perez

LessWrong (Curated & Popular)

Strategies for Model Training and Deployment

This chapter explores strategies for training models to behave as desired across different phases, such as training and deployment. It discusses methods such as input tagging, secret scratchpads, and evaluating outputs with a preference model. It also covers reward hacking setups, overriding reward functions, and training with RLHF.
