
AI Control: Improving Safety Despite Intentional Subversion
AI Safety Fundamentals: Alignment
Ensuring AI Model Integrity in Programming Solutions
This chapter covers protocols for preventing backdoored solutions in AI-written code, outlining methods such as editing solutions and untrusted monitoring with multiple GPT-4 instances. It discusses a toy scenario in which trusted and untrusted models, limited human labor, and evaluation methodologies are used to safeguard against catastrophic outcomes.