
BlueDot Narrated: Do the Biorisk Evaluations of AI Labs Actually Measure the Risk of Developing Bioweapons?
Jan 5, 2026
Discover the intriguing world of AI biorisk evaluations as experts dissect why some labs, like Anthropic, adopt additional safety measures when releasing their most capable models. Explore how current benchmarks can produce misleadingly high scores, and why many non-benchmark evaluations remain opaque. Delve into the trade-offs between transparency and safety, and why some labs are more diligent than others about publishing their findings. The discussion concludes that while there are objections to current practices, they don't eliminate the underlying risks AI poses for bioweapons development.
AI Snips
Benchmarks Saturate Quickly
- Publicly described biorisk benchmarks dominate evaluations but quickly saturate, limiting their informative value.
- Saturation makes it unclear whether high scores imply real-world uplift for amateurs in bioweapons development.
Common Benchmarks Vary In Realism
- Most labs rely heavily on closed-format benchmarks like VCT, LabBench, and WMDP for biorisk assessment.
- These tests vary in realism, with VCT focusing on tacit virology knowledge and WMDP resembling textbook-style questions.
High Scores Are Ambiguous
- Benchmark performance often exceeds expert baselines, but real-world complexity is poorly captured.
- High benchmark scores are ambiguous evidence for real-world biorisk uplift.
