AI Control: Improving Safety Despite Intentional Subversion

AI Safety Fundamentals: Alignment

CHAPTER

Ensuring AI Model Integrity in Programming Solutions

This chapter covers protocols for preventing backdoored solutions in AI-written code, including editing solutions and untrusted monitoring with multiple GPT-4 instances. It discusses a toy scenario in which a trusted model, an untrusted model, and limited human labor are combined, and the evaluation methodology used to test whether these protocols guard against catastrophic outcomes.
