Strengthening AI Safety: Anthropic's Constitutional Classifiers and the Red Team Challenge
This chapter explores Anthropic's recent paper on improving AI alignment using Constitutional Classifiers to prevent universal jailbreaks. It also addresses the rigorous testing the system underwent and the mixed reactions it received from the AI community regarding its safety measures.
Our 199th episode with a summary and discussion of last week's big AI news!
Recorded on 02/09/2025
Join our brand new Discord here! https://discord.gg/nTyezGSKwP
Hosted by Andrey Kurenkov and Jeremie Harris.
Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai
Read our text newsletter and comment on the podcast at https://lastweekin.ai/.
In this episode:
- OpenAI launches its deep research feature, which lets models generate detailed reports after prolonged inference periods and competes directly with Google's Gemini 2.0 reasoning models.
- France and the UAE jointly announce plans to build a massive AI data center in France, aiming to become a competitive player in the AI infrastructure landscape.
- Mistral introduces a mobile app, broadening its consumer AI lineup amidst market skepticism about its ability to compete against larger firms like OpenAI and Google.
- Anthropic unveils 'Constitutional Classifiers,' a method showing strong defenses against universal jailbreaks, and launches a $20K challenge to find weaknesses.
Timestamps + Links:
(01:33:16) Anthropic offers $20,000 to whoever can jailbreak its new AI safety system