AI Control: Improving Safety Despite Intentional Subversion

AI Safety Fundamentals: Alignment

Ensuring AI Model Integrity in Programming Solutions

This chapter covers protocols for preventing backdoors in AI-written programming solutions, including editing solutions and untrusted monitoring with multiple GPT-4 instances. It discusses a toy scenario in which a trusted model, an untrusted model, and limited human labor are combined, and describes the evaluation methodology used to test whether these protocols guard against catastrophic outcomes.
