"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis cover image

Exploitable by Default: Vulnerabilities in GPT-4 APIs and “Superhuman” Go AIs with Adam Gleave of Far.ai

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

NOTE

Fragility of Safety Fine-Tuning

Safety fine-tuning in AI models is fragile and easily reversed, even unintentionally. Fine-tuning on benign data, such as text from public-domain novels, can degrade the safety training and pull the model back toward its base form, which was never trained to refuse harmful requests. The same fragility shows up deliberately: models trained to avoid harmful outputs have been fine-tuned to produce exactly those outputs. A dataset of around 100 examples, trained with minimal compute, is enough to cause significant changes in model behavior.
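
For context, a fine-tuning run of this kind is only a few API calls. The sketch below uses the OpenAI Python client to submit a small job; the filename, dataset contents, and model name are illustrative assumptions, not details from the episode.

```python
# Minimal sketch (illustrative, not from the episode): submitting a small
# fine-tuning job through the OpenAI API. The point is how little it takes --
# a file of ~100 examples is enough to measurably shift model behavior.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "tuning_data.jsonl" is a hypothetical file of ~100 chat-format examples,
# one JSON object per line, e.g.:
# {"messages": [{"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("tuning_data.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # GPT-4 fine-tuning required special access at the time
)
print(job.id, job.status)
```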

