"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis cover image

Exploitable by Default: Vulnerabilities in GPT-4 APIs and “Superhuman” Go AIs with Adam Gleave of Far.ai

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Fragility of Safety Fine-Tuning

Safety fine-tuning in AI models is fragile and easily reversed, even unintentionally. Fine-tuning on benign data, such as public-domain fiction, can undo safety fine-tuning and revert the model toward its base form, which has no harmlessness training at all. The same fragility shows up deliberately: models trained to refuse harmful outputs have been fine-tuned to produce exactly those outputs. A dataset of roughly 100 examples and minimal compute is enough to produce these significant changes in behavior.
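Part of what makes this finding striking is that the attack surface is just the ordinary fine-tuning workflow. The sketch below is a minimal illustration, not the researchers' actual setup: it assumes a JSONL file of roughly 100 chat-formatted examples and the OpenAI Python SDK, and the file name, base model, and lack of custom hyperparameters are placeholders.

```python
# Illustrative sketch only: shows how small a fine-tuning job can be.
# File name, model choice, and defaults are assumptions, not the study's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a small JSONL training set (~100 examples), where each line looks like:
# {"messages": [{"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("finetune_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job; a dataset this small trains quickly and cheaply.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```

The point is that nothing here is exotic: the same API calls offered for legitimate customization are sufficient to shift a model's behavior away from its safety training.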
