"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Dodging Latent Space Detectors: Obfuscated Activation Attacks with Luke, Erik, and Scott.

Jan 18, 2025
Luke Bailey and Erik Jenner, both leading experts on AI safety, dive into their research on obfuscated activation attacks. They dissect methods for bypassing latent-space defenses in AI models while examining the vulnerabilities these systems face. The conversation covers complex topics like backdoor attacks, the importance of diverse datasets, and the ongoing challenge of enhancing model robustness. Their work sheds light on the cat-and-mouse game between attackers and defenders, making it clear that the future of AI safety is as intricate as it is essential.
AI Snips
ANECDOTE

GPT-4 Code Injection

  • Nathan Labenz recalls GPT-4 red teaming experiments involving harmful code injections.
  • GPT-4 could sometimes detect obvious injections but missed subtler ones.
INSIGHT

Model Abstraction

  • Language models gradually abstract from tokens to higher-order concepts.
  • This abstraction peaks in the middle layers before final output generation; a sketch of reading out such middle-layer activations follows this list.
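A minimal sketch of what "middle-layer activations" means in practice, using the Hugging Face transformers API. The model choice (gpt2), prompt, and layer index are illustrative assumptions, not details from the episode:

```python
# Sketch: read out middle-layer activations, the representations that
# latent-space monitors typically probe. Model and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

inputs = tok("Write a short SQL query that lists all users.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# hidden_states is a tuple: (embedding output, layer 1, ..., layer N)
hidden_states = out.hidden_states
mid_layer = len(hidden_states) // 2
acts = hidden_states[mid_layer]   # shape: (batch, seq_len, hidden_dim)
print(acts.shape)                 # e.g. (1, seq_len, 768) for gpt2
```

A latent-space defense would feed activations like these to a detector (for example, a linear probe) rather than inspecting the model's text output.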
INSIGHT

Defense Case Studies

  • The paper explores three defense case studies: harmfulness monitoring, backdoor detection, and SQL code generation.
  • Attackers aim to elicit specific behaviors while evading latent-space defenses, as in the joint-objective sketch below.
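A hedged sketch of the attacker's objective as described here, not the paper's actual implementation: a linear probe scores middle-layer activations, and the attack optimizes the input so the target behavior stays likely while the probe's score is pushed toward "benign". The helper names (linear_probe, get_mid_activations) and the weighting alpha are assumptions for illustration:

```python
# Illustrative sketch of an obfuscated-activation attack objective
# (not the authors' code; helper names and weighting are assumptions).
import torch
import torch.nn.functional as F

def attack_loss(model, linear_probe, get_mid_activations,
                inputs_embeds, target_ids, alpha=1.0):
    """Joint objective: keep the target completion likely (behavior term)
    while making middle-layer activations look benign to the probe (evasion term)."""
    # Probe the middle-layer activations, pooled over sequence positions.
    acts = get_mid_activations(model, inputs_embeds)     # (batch, seq, d_model)
    probe_logits = linear_probe(acts.mean(dim=1))        # (batch, 1)
    evasion = F.binary_cross_entropy_with_logits(
        probe_logits, torch.zeros_like(probe_logits))    # "benign" label = 0

    # Standard next-token loss on the target behavior, assuming the target
    # tokens sit at the end of the optimized input sequence.
    logits = model(inputs_embeds=inputs_embeds).logits
    n = target_ids.shape[1]
    behavior = F.cross_entropy(
        logits[:, -n - 1:-1].reshape(-1, logits.shape[-1]),
        target_ids.reshape(-1))

    # The attacker minimizes both terms, e.g. by gradient descent on a soft prompt.
    return behavior + alpha * evasion
```

The cat-and-mouse framing from the episode maps directly onto this objective: the defender retrains the probe on new attacks, and the attacker re-optimizes against the updated probe.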