LessWrong (Curated & Popular)

"When can we trust model evaluations?" by evhub

Aug 9, 2023

Chapters

How to Craft a Good Governance Scheme Around Model Evaluations

Behavioral Non-Fine-Tuning Evaluations

The Importance of Capabilities Evaluations

The Importance of RL in Exploration Hacking

How to Make Behavioral IID Fine-Tuning Evaluations Trustworthy

Gradient Hacking and the Superhuman Capacity Regime