Technically Legal - A Legal Technology and Innovation Podcast

Best of 2025 - Benchmarking Legal AI: Measuring the Delta Between Man and Machine (Anna Guo Legalbenchmarks.ai)

Dec 18, 2025

In this engaging discussion, Anna Guo, Founder of LegalBenchmarks.ai and former BigLaw attorney, dives into her groundbreaking research on legal AI tools. She reveals how AI can enhance human legal performance, highlighting stark contrasts between general-purpose and legal-specific AI. Key findings show that while AI can outshine humans in contract drafting, humans and AI together may not always outperform the best AI alone. Anna emphasizes the importance of qualitative usability over mere accuracy, making a compelling case for the future of legal technology.

ANECDOTE

From BigLaw To Benchmarking AI

Anna Guo left BigLaw, built and sold a B2B marketplace, then returned to legal AI research after colleagues asked for tool evaluations.
That early demand and a viral LinkedIn post propelled the creation of LegalBenchmarks.ai to systematically test legal AI.

INSIGHT

Measuring The Delta Between Man And Machine

LegalBenchmarks.ai measures the 'delta' between human-only, AI-only, and human-plus-AI performance to quantify how much AI elevates human performance.
The goal is to track that gap over time to see where AI outpaces or complements legal expertise.

ADVICE

Use A Three‑Part Evaluation Framework

Evaluate legal AI on three dimensions: reliability, usability, and platform workflow support rather than accuracy alone.
Include factual/legal adequacy checks, qualitative usefulness, and integration features like citation checks and Word integration.

In one of the most popular episodes of the year, Legalbenchmarks.ai Founder Anna Guo discusses her organization's research, which tests whether artificial intelligence custom-made for legal tasks performs better than general AI tools.

Anna is a former BigLaw lawyer who left the practice to become an entrepreneur and now focuses her energies on quantifying the utility of AI in the legal industry. Anna's initial anecdotal research for colleagues quickly revealed a strong community interest in a systematic approach to evaluating legal AI tools. This led to the creation of Legalbenchmarks.ai, dedicated to finding out where the promise of humans plus AI is truly better than humans alone or AI alone.

The core of the research involves measuring the "delta," or the extent to which AI can elevate human performance. To date, Legalbenchmarks.ai has conducted two major studies: one on information extraction from legal sources and a second on contract review and redlining.
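As a rough illustration of the "delta" idea, the sketch below compares scores for the same task under three conditions: human alone, AI alone, and human plus AI. It is not Legalbenchmarks.ai's actual methodology. The class name, function name, the 0-to-1 score scale, and every number are hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ConditionScores:
    """Scores for one task under three conditions (hypothetical 0-1 scale)."""
    human_only: float
    ai_only: float
    human_plus_ai: float


def deltas(s: ConditionScores) -> dict[str, float]:
    """Return the pairwise gaps between the three conditions."""
    return {
        # How much AI assistance lifts the human baseline.
        "ai_lift_over_human": s.human_plus_ai - s.human_only,
        # How much a human in the loop adds to the AI alone;
        # a negative value means the AI alone scored higher.
        "human_lift_over_ai": s.human_plus_ai - s.ai_only,
        # Direct gap between AI alone and human alone.
        "ai_vs_human": s.ai_only - s.human_only,
    }


# Made-up numbers for illustration only, not results from the studies.
example = ConditionScores(human_only=0.5, ai_only=0.75, human_plus_ai=0.625)
result = deltas(example)
for name, value in result.items():
    print(f"{name}: {value:+.3f}")
```

Tracking these gaps per task and over time is one way to see where AI outpaces human work and where it complements it.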

Key Findings from the Studies:

Accuracy vs. Qualitative Usefulness: The highest-performing general-purpose AI tools (like Gemini) were often found to be more accurate and consistent. However, the legal-specific AI tools often received higher marks in qualitative usefulness and helpfulness, as they align more closely with existing legal workflows.
Methodology: The testing goes beyond simple accuracy. It includes a three-part assessment: Reliability (objective accuracy and legal adequacy), Usability (qualitative metrics like helpfulness and coherence for tasks such as brainstorming), and Platform Workflow Support (integration, citation checks, and other features).
Human-AI Performance: In the contract analysis study, AI tools matched or exceeded the human baseline for reliability in producing first drafts. Crucially, the data demonstrated that the common belief that "human plus AI will always outperform AI alone" was false; the top-performing AI tool alone still had a higher accuracy rate than the human-plus-AI combo.
Risk Analysis: A significant finding was that legal AI tools were better at flagging material risks, such as compliance or unenforceability issues in high-risk scenarios, that human lawyers missed entirely. This suggests AI can act as a crucial safety net.
Strengths Comparison: AI excels at brainstorming, challenging human bias, and performing mass-scale routine tasks (e.g., mass contract review for simple terms). Humans retain a significant edge in ingesting nuanced context and making commercially reasonable decisions, something AI's literal instruction-following can sometimes miss.