Latent Space: The AI Engineer Podcast

Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

Jan 9, 2026
Join George Cameron, co-founder of Artificial Analysis and benchmarking guru, along with Micah Hill-Smith, who crafted the evaluation methodology and unique benchmarks. They share their journey from a basement project to a vital tool for AI model assessment. Discover why independent evaluations matter, how their 'mystery shopper' strategy keeps benchmarks honest, and the innovative Omniscience index that prioritizes accurate responses. Learn about the evolving AI landscape and their predictions for future developments in benchmarking and transparency.
ANECDOTE

Side Project Became Industry Staple

  • Artificial Analysis began as a side project while Micah built a legal AI assistant and needed independent benchmarks.
  • A Swyx retweet helped the project go viral and grow into a full-time company.
ADVICE

Always Run Your Own Evals

  • Run benchmarks yourself and control the prompts, because labs use different prompting setups and cherry-pick favorable results.
  • Standardize evaluation across models to prevent inflated or non-comparable scores.
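The advice above can be sketched as a small eval loop: every model sees the exact same prompt template and is graded identically. This is a minimal illustration, not the Artificial Analysis harness; the model callables and prompt template are hypothetical stand-ins for real API calls.

```python
# Hypothetical standardized prompt used for every model (assumption,
# not the Artificial Analysis template).
PROMPT_TEMPLATE = "Answer with a single letter.\nQuestion: {question}\nChoices: {choices}"

def evaluate(models, dataset):
    """Run every model on identical prompts and grade identically."""
    scores = {}
    for name, model in models.items():
        correct = 0
        for item in dataset:
            prompt = PROMPT_TEMPLATE.format(
                question=item["question"], choices=item["choices"]
            )
            answer = model(prompt).strip().upper()
            correct += answer == item["answer"]
        scores[name] = correct / len(dataset)
    return scores

# Stub "models" standing in for real model API calls:
dataset = [
    {"question": "2 + 2 = ?", "choices": "A) 3  B) 4", "answer": "B"},
    {"question": "Capital of France?", "choices": "A) Paris  B) Rome", "answer": "A"},
]
models = {"always_b": lambda p: "B", "always_a": lambda p: "A"}
print(evaluate(models, dataset))  # {'always_b': 0.5, 'always_a': 0.5}
```

Because the prompt, parsing, and grading are shared, score differences reflect the models rather than per-lab prompt engineering.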
ADVICE

Reduce Variance With Repeats

  • Use repeated runs and calculate confidence intervals to reduce variance on small-sample evals.
  • Report 95% confidence intervals, increasing the number of repeats until the interval is tight enough, before publishing final scores.
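As a sketch of the repeat-and-interval idea: collect the score from each repeated run, then compute the mean and a normal-approximation 95% confidence interval. This is an illustrative helper under that assumption, not the show's actual methodology.

```python
import statistics

def score_with_ci(run_scores):
    """Mean and normal-approximation 95% CI from repeated eval runs."""
    n = len(run_scores)
    mean = statistics.fmean(run_scores)
    sem = statistics.stdev(run_scores) / n ** 0.5  # standard error of the mean
    half_width = 1.96 * sem                        # 95% z-interval half-width
    return mean, (mean - half_width, mean + half_width)

# Example: five repeated runs of the same small eval; add repeats
# until the interval is narrow enough to publish.
runs = [0.81, 0.79, 0.83, 0.80, 0.82]
mean, (lo, hi) = score_with_ci(runs)
print(f"{mean:.3f} [{lo:.3f}, {hi:.3f}]")  # 0.810 [0.796, 0.824]
```

More repeats shrink the standard error roughly as 1/sqrt(n), so doubling precision costs about four times the runs.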