
"Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research" by evhub, Nicholas Schiefer, Carson Denison, Ethan Perez
LessWrong (Curated & Popular)
00:00
Strategies for Model Training and Deployment
This chapter covers strategies for training models to behave in specified ways during distinct phases, such as training and deployment. It discusses input tagging, secret scratchpads, and scoring outputs with a preference model, then turns to reward-hacking setups, overriding reward functions, and training with RLHF.