Researchers are working to better understand the internal workings of AI models, a field known as interpretability, in order to address technical and policy gaps in making AI systems robust. This work includes mapping model internals onto human-interpretable concepts and using interpretability tools such as sparse autoencoders to identify flaws in how models handle them.
The episode weighs the risk of AI systems autonomously gaining power against the risk of misuse by humans, and argues that institutional safeguards matter in either case. The discussion covers how likely AI systems are to cause catastrophic harm given current technical limits on autonomous agents, and emphasizes long-term strategies for managing AI risk by building strong institutions.
Recent interpretability research shows a shift toward practical applications, such as detecting and mitigating gender bias in AI models. Using tools like sparse autoencoders to run informative ablation studies marks progress in applying mechanistic interpretability to real-world problems.
Emerging applications of interpretability tools aim to mitigate bias and improve model performance through deeper analysis of model internals. The field is increasingly oriented toward practical solutions to pressing AI concerns while promoting transparency and accountability.
Using interpretability tools to reinforce AI safety reflects a broader effort to address systemic AI risks. Future directions include improving model transparency, refining how interpretability is applied, and fostering collaboration to ensure AI advances safely and responsibly.
Interpretability is crucial for understanding model behavior. The podcast discusses why the complexity of neural networks makes interpretability hard, and highlights common interpretability illusions, underscoring the need for methods that provide genuine rather than superficial explanations of model behavior.
Sparse autoencoders are explored as a tool for anomaly detection and for guiding model fine-tuning. The discussion covers their potential uses in parameterizing latent-space attacks and in model editing. While there is excitement about their promise, the conversation stresses the importance of connecting sparse autoencoders to practical downstream tasks.
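To make the sparse autoencoder idea concrete, here is a minimal PyTorch sketch (not from the episode; all shapes and hyperparameters are illustrative assumptions): an SAE is trained to reconstruct a model's internal activations through an overcomplete latent layer with an L1 sparsity penalty, so that individual latent dimensions tend to align with human-interpretable features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstructs activations through an overcomplete, sparse latent code."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, activations: torch.Tensor):
        latent = torch.relu(self.encoder(activations))  # sparse feature code
        reconstruction = self.decoder(latent)           # back to model space
        return reconstruction, latent

def sae_loss(reconstruction, activations, latent, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse latents.
    mse = torch.mean((reconstruction - activations) ** 2)
    return mse + l1_coeff * latent.abs().mean()

# Illustrative usage on a stand-in batch of captured activations.
sae = SparseAutoencoder(d_model=768, d_latent=8 * 768)
acts = torch.randn(64, 768)  # placeholder for real activations from an LLM
recon, latent = sae(acts)
loss = sae_loss(recon, acts, latent)
loss.backward()
```

Once trained, individual latent units can be inspected or ablated to see which behaviors they drive, which is how SAEs get connected to downstream tasks like anomaly detection and model editing.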
Latent adversarial training (LAT) is introduced as a method for improving model robustness. By perturbing a model's internal representations rather than its inputs, LAT aims to capture failure modes that input-space attacks miss. The episode highlights this shift toward attacking models in the latent space to induce failures and then train them away.
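As a rough illustration (assuming a hypothetical `model` that exposes `encode` and `decode_from_latent` helpers, plus standard PGD-style hyperparameters; this is not the exact method from the papers linked below), one LAT step finds a small perturbation to the hidden activations that maximizes the task loss, then computes the training loss under that worst-case internal perturbation:

```python
import torch

def latent_adversarial_step(model, batch, labels, loss_fn,
                            epsilon=0.1, attack_steps=5, attack_lr=0.01):
    # 1. Run the first half of the model to obtain latent activations.
    #    `model.encode` / `model.decode_from_latent` are assumed helpers,
    #    not a real API.
    with torch.no_grad():
        latents = model.encode(batch)

    # 2. Find a small latent perturbation that maximizes the task loss
    #    (projected gradient ascent, clipped to an L-infinity ball).
    delta = torch.zeros_like(latents, requires_grad=True)
    for _ in range(attack_steps):
        adv_loss = loss_fn(model.decode_from_latent(latents + delta), labels)
        (grad,) = torch.autograd.grad(adv_loss, delta)
        with torch.no_grad():
            delta += attack_lr * grad.sign()
            delta.clamp_(-epsilon, epsilon)

    # 3. Train the model to behave well even under the perturbed latents
    #    (for simplicity, only layers after the perturbation get gradients here).
    return loss_fn(model.decode_from_latent(latents + delta.detach()), labels)
```

The caller backpropagates this loss and steps the optimizer as in ordinary training; the only change from standard adversarial training is that the attack happens in the latent space rather than on the inputs.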
An analogy is drawn between the evolutionary advantages of nightmares and latent adversarial training: nightmares simulate potential threats, serving as a training mechanism that improves preparedness for unforeseen failures. The point is that learning from adverse simulated experiences can make models more robust.
Adversarial attacks in machine learning fall into two categories: untargeted and targeted. Untargeted attacks aim to make the model fail at its intended task by maximizing the model's training loss, while targeted attacks aim to elicit specific outputs by steering the model's behavior toward an attacker-chosen outcome. Training against these two kinds of attack parallels forgetting and unlearning techniques, respectively, as ways of narrowing a model's capabilities to improve performance and robustness.
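In code, the distinction is just a matter of whose labels appear in the loss and whether the attacker ascends or descends on it (an illustrative sketch with assumed names, not from the episode):

```python
import torch

def untargeted_attack_loss(model, x, delta, true_labels, loss_fn):
    # Untargeted: the attacker MAXIMIZES the ordinary training loss on the
    # correct labels, so the model simply fails at its task.
    return loss_fn(model(x + delta), true_labels)    # gradient ascent on this

def targeted_attack_loss(model, x, delta, target_labels, loss_fn):
    # Targeted: the attacker MINIMIZES the loss against labels of their own
    # choosing, steering the model toward a specific output.
    return loss_fn(model(x + delta), target_labels)  # gradient descent on this
```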
Latent adversarial training can also be used to make models forget undesirable behaviors or capabilities, via both untargeted and targeted approaches. Untargeted LAT perturbs the model's internal circuitry to maximize loss, surfacing whatever unintended behaviors that induces, while targeted LAT defends against specific threats by steering model outputs away from known bad behaviors. Together, these methods improve robustness and safety by mitigating unforeseen failure modes and strengthening techniques for scoping models.
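Combining the two sketches above, the two LAT modes differ only in the adversary's objective over the latent perturbation; the helpers below are the same hypothetical ones as before, not the exact method from the LAT papers linked in the show notes:

```python
import torch

def lat_adversary_objective(model, latents, delta, loss_fn,
                            true_labels=None, bad_labels=None, targeted=False):
    outputs = model.decode_from_latent(latents + delta)
    if targeted:
        # Targeted LAT: the adversary minimizes loss on a known bad output
        # (e.g. a harmful completion); training then teaches the model to
        # avoid that output even under the perturbation.
        return loss_fn(outputs, bad_labels)
    # Untargeted LAT: the adversary maximizes the ordinary loss, surfacing
    # whatever unintended behavior the perturbed circuitry produces.
    # Negated so that MINIMIZING this objective ascends the training loss.
    return -loss_fn(outputs, true_labels)
```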
We speak with Stephen Casper, or "Cas" as his friends call him. Cas is a PhD student at MIT in the Electrical Engineering & Computer Science (EECS) department, in the Algorithmic Alignment Group advised by Prof Dylan Hadfield-Menell. Formerly, he worked with the Harvard Kreiman Lab and the Center for Human-Compatible AI (CHAI) at Berkeley. His work focuses on better understanding the internal workings of AI models (better known as "interpretability"), making them robust to various kinds of adversarial attacks, and calling out the current technical and policy gaps when it comes to making sure our future with AI goes well. He's particularly interested in automated ways of finding & fixing flaws in how deep neural nets handle human-interpretable concepts.
We talk to Stephen about:
* His technical AI safety work in the areas of:
    * Interpretability
    * Latent attacks and adversarial robustness
    * Model unlearning
    * The limitations of RLHF
* Cas' journey to becoming an AI safety researcher
* How he thinks the AI safety field is going and whether we're on track for a positive future with AI
* Where he sees the biggest risks coming with AI
* Gaps in the AI safety field that people should work on
* Advice for early career researchers
Hosted by Soroush Pour. Follow me for more AGI content:
Twitter: https://twitter.com/soroushjp
LinkedIn: https://www.linkedin.com/in/soroushjp/
== Show links ==
-- Follow Stephen --
* Website: https://stephencasper.com/
* Email: (see Cas' website above)
* Twitter: https://twitter.com/StephenLCasper
* Google Scholar: https://scholar.google.com/citations?user=zaF8UJcAAAAJ
-- Further resources --
* Automated jailbreaks / red-teaming paper that Cas and I worked on together (2023) - https://twitter.com/soroushjp/status/1721950722626077067
* Sam Marks' paper on Sparse Autoencoders (SAEs) - https://arxiv.org/abs/2403.19647
* Interpretability papers involving downstream tasks - See section 4.2 of https://arxiv.org/abs/2401.14446
* MEMIT paper on model editing - https://arxiv.org/abs/2210.07229
* Motte & bailey definition - https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy
* Bomb-making papers tweet thread by Cas - https://twitter.com/StephenLCasper/status/1780370601171198246
* Paper: undoing safety with as few as 10 examples - https://arxiv.org/abs/2310.03693
* Recommended papers on latent adversarial training (LAT) -
    * https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d
    * https://arxiv.org/abs/2403.05030
* Scoping (related to model unlearning) blog post by Cas - https://www.alignmentforum.org/posts/mFAvspg4sXkrfZ7FA/deep-forgetting-and-unlearning-for-safely-scoped-llms
* Defending against failure modes using LAT - https://arxiv.org/abs/2403.05030
* Cas' systems for reading for research -
    * Follow ML Twitter
    * Use a combination of the following two search tools for new arXiv papers:
        * https://vjunetxuuftofi.github.io/arxivredirect/
        * https://chromewebstore.google.com/detail/highlight-this-finds-and/fgmbnmjmbjenlhbefngfibmjkpbcljaj?pli=1
    * Skim a new paper or two a day + take brief notes in a searchable notes app
* Recommended people to follow to learn about how to impact the world through research -
    * Dan Hendrycks
    * Been Kim
    * Jacob Steinhardt
    * Nicolas Carlini
    * Paul Christiano
    * Ethan Perez
Recorded May 1, 2024