State of LLMs 2026: RLVR, GRPO, Inference Scaling — Sebastian Raschka

The MAD Podcast with Matt Turck

Process Reward Models: Challenges

Sebastian discusses outcome vs process rewards, reward hacking, and difficulties grading intermediate chain-of-thought.

Episode notes

Sebastian Raschka joins the MAD Podcast for a deep, educational tour of what actually changed in LLMs in 2025 — and what matters heading into 2026.

We start with the big architecture question: are transformers still the winning design, and what should we make of world models, small “recursive” reasoning models and text diffusion approaches? Then we get into the real story of the last 12 months: post-training and reasoning. Sebastian breaks down RLVR (reinforcement learning with verifiable rewards) and GRPO, why they pair so well, what makes them cheaper to scale than classic RLHF, and how they “unlock” reasoning already latent in base models.

We also cover why “benchmaxxing” is warping evaluation, why Sebastian increasingly trusts real usage over benchmark scores, and why inference-time scaling and tool use may be the underappreciated drivers of progress. Finally, we zoom out: where moats live now (hint: private data), why more large companies may train models in-house, and why continual learning is still so hard.

If you want the 2025–2026 LLM landscape explained like a masterclass — this is it.

Sources:

The State Of LLMs 2025: Progress, Problems, and Predictions - https://x.com/rasbt/status/2006015301717028989?s=20

The Big LLM Architecture Comparison - https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison

Sebastian Raschka

Website - https://sebastianraschka.com

Blog - https://magazine.sebastianraschka.com

LinkedIn - https://www.linkedin.com/in/sebastianraschka/

X/Twitter - https://x.com/rasbt

FIRSTMARK

Website - https://firstmark.com

X/Twitter - https://twitter.com/FirstMarkCap

Matt Turck (Managing Director)

Blog - https://mattturck.com

LinkedIn - https://www.linkedin.com/in/turck/

X/Twitter - https://twitter.com/mattturck

(00:00) - Intro

(01:05) - Are the days of Transformers numbered?

(04:05) - World models: what they are and why people care

(06:01) - Small “recursive” reasoning models (ARC, iterative refinement)

(09:45) - What is a diffusion model (for text)?

(13:24) - Are we seeing real architecture breakthroughs — or just polishing?

(14:04) - MoE + “efficiency tweaks” that actually move the needle

(17:26) - “Pre-training isn’t dead… it’s just boring”

(18:03) - 2025’s headline shift: RLVR + GRPO (post-training for reasoning)

(20:58) - Why RLHF is expensive (reward model + value model)

(21:43) - Why GRPO makes RLVR cheaper and more scalable

(24:54) - Process Reward Models (PRMs): why grading the steps is hard

(28:20) - Can RLVR expand beyond math & coding?

(30:27) - Why RL feels “finicky” at scale

(32:34) - The practical “tips & tricks” that make GRPO more stable

(35:29) - The meta-lesson of 2025: progress = lots of small improvements

(38:41) - “Benchmaxxing”: why benchmarks are getting less trustworthy

(43:10) - The other big lever: inference-time scaling

(47:36) - Tool use: reducing hallucinations by calling external tools

(49:57) - The “private data edge” + in-house model training

(55:14) - Continual learning: why it’s hard (and why it’s not 2026)

(59:28) - How Sebastian works: reading, coding, learning “from scratch”

(01:04:55) - LLM burnout + how he uses models (without replacing himself)
