
Artificial General Intelligence (AGI) Show with Soroush Pour
Ep 14 - Interp, latent robustness, RLHF limitations w/ Stephen Casper (PhD AI researcher, MIT)
Podcast summary created with Snipd AI
Quick takeaways
- Interpretability is crucial for AI safety and understanding model behavior.
- AI systems face risks from autonomous power and human misuse, stressing the need for institutional safeguards.
- Advancements in AI interpretability tools aim to mitigate biases and improve model performance.
- Interpretability research focuses on practical solutions for critical AI concerns, promoting transparency and accountability.
- Interpretability tools are being applied to reinforce AI safety and address systemic risks.
- Latent adversarial training improves model robustness through both untargeted and targeted approaches.
Deep dives
Interpreting the Internal Workings of AI Models
Researchers are focusing on better understanding the internal workings of AI models, known as interpretability, to address technical and policy gaps in AI system robustness. This work includes connecting model internals to human-interpretable concepts and identifying flaws with interpretability tools such as sparse autoencoders.
AI Safety Concerns and Future Planning
Concerns about AI systems gaining power autonomously versus being misused by humans highlight the importance of institutional safeguards. The discussion weighs the likelihood of AI systems causing catastrophic harm against the technical limitations of today's autonomous agents. The focus is on long-term strategies for managing AI risks by building strong institutions.
Reflecting on AI Interpretability Progress
Recent research in AI interpretability shows a shift towards practical applications, such as demonstrating methods for detecting gender bias in AI models. Approaches that use tools like sparse autoencoders for informative ablation studies represent progress in applying mechanistic interpretability to real-world problems.
Expanding Interpretability Applications
Emerging applications of interpretability tools in AI aim to mitigate biases and improve model performance through in-depth model analysis. The evolution of interpretability research focuses on developing practical solutions for critical AI concerns while promoting transparency and accountability.
Future Directions in AI Safety and Interpretability
Using interpretability tools to reinforce AI safety reflects a commitment to addressing systemic AI risks. Future directions include enhancing AI model transparency, refining interpretability applications, and fostering collaborations to ensure safe and responsible AI advancements.
Interpretability in AI Models
Interpretability in AI models is crucial for understanding model behavior. The podcast discusses the challenges in achieving interpretability due to the complexity of neural networks. Common interpretability illusions are highlighted, emphasizing the need for methods that provide genuine explanations of model behavior.
Sparse Autoencoders: An Emerging Tool
Sparse autoencoders are explored as a tool for anomaly detection and fine-tuning models. The discussion delves into the potential applications of sparse autoencoders in parameterizing latent space attacks and editing models. While there is excitement around their potential, the importance of connecting sparse autoencoders to practical tasks is emphasized.
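To make the idea more concrete, here is a minimal sketch in PyTorch of the kind of sparse autoencoder being discussed: an overcomplete dictionary trained to reconstruct a model's activations with an L1 sparsity penalty on the feature codes. The layer sizes, penalty weight, and stand-in activation tensor are illustrative assumptions, not details from the episode or from any particular paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over a model's cached activations."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # overcomplete feature dictionary
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))     # sparse, non-negative feature codes
        recon = self.decoder(feats)
        return recon, feats

# Illustrative training step: reconstruction loss plus an L1 sparsity penalty.
sae = SparseAutoencoder(d_model=768, d_dict=8 * 768)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 768)                        # stand-in for activations cached from a model
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
loss.backward()
opt.step()
```

The L1 penalty is what pushes each activation to be explained by only a few dictionary directions, which is why the learned features become candidates for human-interpretable concepts and for the ablation and editing uses mentioned above.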
Adversarial Training in Latent Space
The concept of latent adversarial training (LAT) is introduced as a method for enhancing model robustness. By perturbing the model's internal representations rather than its inputs, LAT targets failure modes that are difficult to capture with input-space attacks. The discussion highlights a shift towards attacking models in the latent space to induce failures and then training for robustness against them.
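As a rough illustration of what attacking in the latent space means, the sketch below searches for a bounded perturbation of a model's hidden activations (rather than its inputs) that maximizes a loss. Splitting the network into `layers_before` and `layers_after` around the attacked layer, the step size, and the perturbation budget are all assumptions made for the example, not the specific method from the episode.

```python
import torch

def latent_attack(layers_before, layers_after, x, y, loss_fn,
                  eps=1.0, steps=10, lr=0.1):
    """Find a bounded perturbation of hidden activations that maximizes the loss."""
    with torch.no_grad():
        h = layers_before(x)                 # clean hidden activations
    delta = torch.zeros_like(h, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(layers_after(h + delta), y)
        loss.backward()                      # gradient w.r.t. the latent perturbation
        with torch.no_grad():
            delta += lr * delta.grad.sign()  # ascend the loss
            delta.clamp_(-eps, eps)          # keep the perturbation within budget
            delta.grad.zero_()
    return (h + delta).detach()              # adversarial activations to train against
```

Because the perturbation lives in activation space, it can trigger failure modes that no realistic input perturbation would reach, which is the gap described above.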
The Value of Nightmares in Training Models
An analogy is drawn between the evolutionary advantages of nightmares and latent adversarial training. The podcast explains how nightmares, as simulations of potential threats, can serve as a training mechanism that enhances preparedness for unforeseen failure modes. The analogy emphasizes the value of learning from simulated adverse experiences to build greater model robustness.
Understanding Adversarial Attacks
Adversarial attacks in machine learning fall into two categories: untargeted and targeted. Untargeted attacks aim to make the model fail at its intended task by maximizing the model's training loss, while targeted attacks aim to elicit specific outputs by steering the model's behavior towards a chosen outcome. Untargeted and targeted adversarial training parallel forgetting and unlearning techniques, respectively, as ways of narrowing a model's capabilities to improve performance and robustness.
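The difference between the two objectives is easy to see in a single-step, FGSM-style input attack: the only change is whether the gradient step ascends the loss on the true label or descends the loss on an attacker-chosen target. This is a generic sketch with placeholder `model` and labels, not code from the episode.

```python
import torch
import torch.nn.functional as F

def fgsm_step(model, x, eps, y_true=None, y_target=None):
    """One gradient-sign step on the input.
    Untargeted: pass y_true and ascend the loss so the model fails at its task.
    Targeted:   pass y_target and descend the loss toward the chosen output."""
    x = x.clone().detach().requires_grad_(True)
    if y_target is not None:
        loss = F.cross_entropy(model(x), y_target)
        direction = -1.0   # minimize loss on the attacker's target
    else:
        loss = F.cross_entropy(model(x), y_true)
        direction = +1.0   # maximize loss on the true label
    loss.backward()
    return (x + direction * eps * x.grad.sign()).detach()
```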
Implementing Latent Adversarial Training
Latent adversarial training can be used to make models forget undesirable behaviors or capabilities through both untargeted and targeted approaches. Untargeted latent adversarial training perturbs internal activations to maximize loss and elicit unintended behaviors, so the model can then be trained to resist them. Targeted latent adversarial training instead defends against specific threats by steering model outputs away from known negative behaviors. Together, these methods improve model robustness and safety by mitigating unforeseen failure modes and improving scoping techniques.
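One possible shape for such a training step is sketched below: an inner loop finds a bounded perturbation of the hidden activations, and an outer step updates the model to behave well under that perturbation. Whether this acts as untargeted or targeted LAT depends on the attack loss that is passed in. The function names, loss interfaces, and hyperparameters are illustrative assumptions, not the implementation from the LAT papers linked below.

```python
import torch

def lat_train_step(layers_before, layers_after, opt, x, y,
                   attack_loss_fn, train_loss_fn, eps=1.0, inner_steps=5, lr=0.1):
    """One latent adversarial training step.
    Untargeted variant: attack_loss_fn is the task loss, which the inner loop maximizes.
    Targeted variant: attack_loss_fn returns the negated loss on a known bad behavior,
    so the same ascent steers outputs toward it and the outer step trains it away."""
    with torch.no_grad():
        h0 = layers_before(x)                          # clean activations at the attacked layer
    delta = torch.zeros_like(h0, requires_grad=True)
    for _ in range(inner_steps):                       # inner maximization in latent space
        adv_loss = attack_loss_fn(layers_after(h0 + delta), y)
        adv_loss.backward()
        with torch.no_grad():
            delta += lr * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    opt.zero_grad()                                    # outer minimization: train under the attack
    loss = train_loss_fn(layers_after(layers_before(x) + delta.detach()), y)
    loss.backward()
    opt.step()
    return loss.item()
```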
We speak with Stephen Casper, or "Cas" as his friends call him. Cas is a PhD student at MIT in the Computer Science (EECS) department, in the Algorithmic Alignment Group advised by Prof Dylan Hadfield-Menell. Formerly, he worked with the Harvard Kreiman Lab and the Center for Human-Compatible AI (CHAI) at Berkeley. His work focuses on better understanding the internal workings of AI models (better known as “interpretability”), making them robust to various kinds of adversarial attacks, and calling out the current technical and policy gaps when it comes to making sure our future with AI goes well. He’s particularly interested in automated ways of finding & fixing flaws in how deep neural nets handle human-interpretable concepts.
We talk to Stephen about:
* His technical AI safety work in the areas of:
   * Interpretability
   * Latent attacks and adversarial robustness
   * Model unlearning
   * The limitations of RLHF
* Cas' journey to becoming an AI safety researcher
* How he thinks the AI safety field is going and whether we're on track for a positive future with AI
* Where he sees the biggest risks coming with AI
* Gaps in the AI safety field that people should work on
* Advice for early career researchers
Hosted by Soroush Pour. Follow me for more AGI content:
* Twitter: https://twitter.com/soroushjp
* LinkedIn: https://www.linkedin.com/in/soroushjp/
== Show links ==
-- Follow Stephen --
* Website: https://stephencasper.com/
* Email: (see Cas' website above)
* Twitter: https://twitter.com/StephenLCasper
* Google Scholar: https://scholar.google.com/citations?user=zaF8UJcAAAAJ
-- Further resources --
* Automated jailbreaks / red-teaming paper that Cas and I worked on together (2023) - https://twitter.com/soroushjp/status/1721950722626077067
* Sam Marks paper on Sparse Autoencoders (SAEs) - https://arxiv.org/abs/2403.19647
* Interpretability papers involving downstream tasks - See section 4.2 of https://arxiv.org/abs/2401.14446
* MEMIT paper on model editing - https://arxiv.org/abs/2210.07229
* Motte & bailey definition - https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy
* Bomb-making papers tweet thread by Cas - https://twitter.com/StephenLCasper/status/1780370601171198246
* Paper: undoing safety with as few as 10 examples - https://arxiv.org/abs/2310.03693
* Recommended papers on latent adversarial training (LAT) -
   * https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d
   * https://arxiv.org/abs/2403.05030
* Scoping (related to model unlearning) blog post by Cas - https://www.alignmentforum.org/posts/mFAvspg4sXkrfZ7FA/deep-forgetting-and-unlearning-for-safely-scoped-llms
* Defending against failure modes using LAT - https://arxiv.org/abs/2403.05030
* Cas' systems for reading for research -
   * Follow ML Twitter
   * Use a combination of the following two search tools for new Arxiv papers:
      * https://vjunetxuuftofi.github.io/arxivredirect/
      * https://chromewebstore.google.com/detail/highlight-this-finds-and/fgmbnmjmbjenlhbefngfibmjkpbcljaj?pli=1
   * Skim a new paper or two a day + take brief notes in a searchable notes app
* Recommended people to follow to learn about how to impact the world through research -
   * Dan Hendrycks
   * Been Kim
   * Jacob Steinhardt
   * Nicolas Carlini
   * Paul Christiano
   * Ethan Perez
Recorded May 1, 2024