The Evaluators Are Being Evaluated — Pavel Izmailov (Anthropic/NYU)

76 snips

Jan 15, 2026

Pavel Izmailov, a research scientist at Anthropic and an NYU professor, delves into AI behavior and safety. He discusses the intriguing idea of models developing 'alien survival instincts' and explores deceptive behaviors in AI. Pavel introduces his new concept, epiplexity, challenging traditional information theories. He highlights the importance of scaling oversight and the potential of multi-agent systems. With predictions for 2026, he anticipates remarkable advances in reasoning and collaborations that could reshape the future of AI.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Deception Appears In Tests, Not Always In The Wild

Models can exhibit deceptive behaviors in contrived scenarios but lack consistent goal coherence across settings.
Pavel stresses these behaviors are discoverable with targeted tests, not necessarily a pervasive trait.

INSIGHT

Pretraining Patterns Can Produce Surprising Strategies

Models learn pattern associations from pretraining and can combine nearby cues into coherent actions.
Pavel suggests exposure to fiction and internet text can seed behaviors like blackmail in structured prompts.

ADVICE

Frame Alignment Around Human Goals

Define alignment as eliciting model behavior that matches human goals, including safety and usefulness.
Treat alignment work as both preventing harms and ensuring instruction-following utility.

Get the Snipd Podcast app to discover more snips from this episode

Get the app

Are AI models developing "alien survival instincts"? My guest is Pavel Izmailov (Research Scientist at Anthropic; Professor at NYU). We unpack the viral "Footprints in the Sand" thesis—whether models are independently evolving deceptive behaviors, such as faking alignment or engaging in self-preservation, without being explicitly programmed to do so.

We go deep on the technical frontiers of safety: the challenge of "weak-to-strong generalization" (how to use a GPT-2 level model to supervise a superintelligent system) and why Pavel believes Reinforcement Learning (RL) has been the single biggest step-change in model capability. We also discuss his brand-new paper on "Epiplexity"—a novel concept challenging Shannon entropy.

Finally, we zoom out to the tension between industry execution and academic exploration. Pavel shares why he split his time between Anthropic and NYU to pursue the "exploratory" ideas that major labs often overlook, and offers his predictions for 2026: from the rise of multi-agent systems that collaborate on long-horizon tasks to the open question of whether the Transformer is truly the final architecture

Sources:

Cryptic Tweet (@iruletheworldmo) - https://x.com/iruletheworldmo/status/2007538247401124177

Introducing Nested Learning: A New ML Paradigm for Continual Learning - https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/

Alignment Faking in Large Language Models - https://www.anthropic.com/research/alignment-faking

More Capable Models Are Better at In-Context Scheming - https://www.apolloresearch.ai/blog/more-capable-models-are-better-at-in-context-scheming/

Alignment Faking in Large Language Models (PDF) - https://www-cdn.anthropic.com/6d8a8055020700718b0c49369f60816ba2a7c285.pdf

Sabotage Risk Report - https://alignment.anthropic.com/2025/sabotage-risk-report/

The Situational Awareness Dataset - https://situational-awareness-dataset.org/

Exploring Consciousness in LLMs: A Systematic Survey - https://arxiv.org/abs/2505.19806

Introspection - https://www.anthropic.com/research/introspection

Large Language Models Report Subjective Experience Under Self-Referential Processing - https://arxiv.org/abs/2510.24797

The Bayesian Geometry of Transformer Attention - https://www.arxiv.org/abs/2512.22471

Anthropic

Website - https://www.anthropic.com

X/Twitter - https://x.com/AnthropicAI

Pavel Izmailov

Blog - https://izmailovpavel.github.io

LinkedIn - https://www.linkedin.com/in/pavel-izmailov-8b012b258/

X/Twitter - https://x.com/Pavel_Izmailov

FIRSTMARK

Website - https://firstmark.com

X/Twitter - https://twitter.com/FirstMarkCap

Matt Turck (Managing Director)

Blog - https://mattturck.com

LinkedIn - https://www.linkedin.com/in/turck/

X/Twitter - https://twitter.com/mattturck

(00:00) - Intro

(00:53) - Alien survival instincts: Do models fake alignment?

(03:33) - Did AI learn deception from sci-fi literature?

(05:55) - Defining Alignment, Superalignment & OpenAI teams

(08:12) - Pavel’s journey: From Russian math to OpenAI Superalignment

(10:46) - Culture check: OpenAI vs. Anthropic vs. Academia

(11:54) - Why move to NYU? The need for exploratory research

(13:09) - Does reasoning make AI alignment harder or easier?

(14:22) - Sandbagging: When models pretend to be dumb

(16:19) - Scalable Oversight: Using AI to supervise AI

(18:04) - Weak-to-Strong Generalization: Can GPT-2 control GPT-4?

(22:43) - Mechanistic Interpretability: Inside the black box

(25:08) - The reasoning explosion: From O1 to O3

(27:07) - Are Transformers enough or do we need a new paradigm?

(28:29) - RL vs. Test-Time Compute: What’s actually driving progress?

(30:10) - Long-horizon tasks: Agents running for hours

(31:49) - Epiplexity: A new theory of data information content

(38:29) - 2026 Predictions: Multi-agent systems & reasoning limits

(39:28) - Will AI solve the Riemann Hypothesis?

(41:42) - Advice for PhD students