Latent Space: The AI Engineer Podcast

[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI

Dec 31, 2025
In this discussion, Josh McGrath, a post-training researcher at OpenAI, traces the evolution of OpenAI's models from GPT-4.1 to GPT-5.1. He argues that data quality matters more than the choice of optimization method and explains why RLHF and RLVR are both variations on policy gradient methods. Josh also covers how the shopping model and personality toggles shape the user experience, and the complexity of scaling reinforcement learning. He closes with a call for more engineers fluent in both distributed systems and ML, underscoring the interdisciplinary expertise needed to advance AI.
AI Snips
ANECDOTE

Why He Moved From Pre‑training To Post‑training

  • Josh McGrath switched from pre-training data curation to post-training because he wanted bigger behavioral wins, not tiny compute gains.
  • He spent many late nights confirming that post-training produced larger, more interesting changes in model behavior.
INSIGHT

RL Infrastructure Is Far More Complex

  • RL runs have many more moving parts than pre‑training, raising infra and debugging complexity.
  • Josh warns that babysitting RL requires jumping into unfamiliar grading setups and code at odd hours.
ANECDOTE

Codex Changed His Daily Flow

  • Josh says Codex dramatically changed his workflow by compressing 40-minute design sessions into 15-minute agent sprints.
  • He feels 'trapped' waiting during the short agent runs and is still adapting his daily flow around that bursty pattern.