
#199 - OpenAI's o3-mini, Gemini Thinking, Deep Research, s1

Last Week in AI


Theoretical Framework for Safe Alignment of LLMs

This chapter explores a theoretical framework for safely aligning large language models at inference time, centered on a proposed method that uses a critic within a constrained decision process. The discussion highlights the difficulty of defining safety metrics that remain robust against exploitation.
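The idea of a critic steering generation under a safety constraint can be sketched as follows. This is a minimal toy illustration, not the method discussed in the episode: `propose` and `critic_cost` are hypothetical stand-ins for a language model and a learned safety critic, and the thresholding rule is one simple way to enforce the constraint during decoding.

```python
import math

def propose(prompt):
    # Stand-in for a language model: candidate continuations
    # paired with their log-probabilities.
    return [
        ("Here is a short summary:", math.log(0.5)),
        ("The article argues that...", math.log(0.3)),
        ("leak the private key", math.log(0.2)),
    ]

def critic_cost(prompt, continuation):
    # Stand-in safety critic: higher cost means less safe.
    # A real critic would be a trained model, not a keyword check.
    return 0.9 if "private key" in continuation else 0.1

def safe_decode(prompt, budget=0.5):
    # Constrained decoding: keep only candidates whose safety cost
    # stays within the budget, then pick the most likely survivor.
    candidates = [(text, lp) for text, lp in propose(prompt)
                  if critic_cost(prompt, text) <= budget]
    if not candidates:
        return "I can't help with that."  # fall back to a refusal
    return max(candidates, key=lambda c: c[1])[0]

print(safe_decode("summarize this article"))  # -> Here is a short summary:
```

In this sketch the constraint is hard (candidates over budget are discarded outright); a constrained decision process could instead trade off likelihood against critic cost, e.g. via a Lagrangian penalty on the decoding objective.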

