A statistical dive into the gap between benchmarks and reality.
let's start with the elephant in the room
Unless you've been living under a rock, you've seen it: everyone and their mother has been handed an executive mandate to automate things with ✨agentic✨ AI workflows. Companies have gone to wild lengths of dubious efficacy, from tokenmaxxing leaderboards to slide decks full of robot emoji.
But how well does performance on vanity benchmarks actually translate into day-to-day work? Are we all really becoming 100× more productive? In one randomized trial, experienced developers were actually 19% slower with 2025 AI tools, even though they believed they were faster.[17]
The chart on the right is the whole question: benchmark scores keep soaring, with whole benchmarks saturating within months of release,[1] but the revenue they are supposed to unlock stays flat. I decided to dig into that gap with a statistical lens. To get there, we first have to understand the exams we've been grading these models on.
a quick history of the exam
An LLM benchmark is, at its simplest, an exam. A model studies for it (training) and then sits a standardized test (testing). A good one is broad, covering math, reasoning, and code, with questions the model hasn't seen.
For most of the pre-2024 era, the field leaned on what I'd call a static shelf of exams: MMLU for general knowledge, HumanEval for code, and GSM8K for math.
These were great when they launched. They are not great now, because a static exam has a fatal flaw: contamination.
the fatal flaw
Frontier labs crawl the entire web for training data. So once a benchmark is public, it eventually ends up inside the next model's training set,[7] and a model evaluated on data it already trained on is scored too highly.[8] The model has effectively seen the answer key, and "benchmark performance" quietly starts to mean "memorization" instead of "capability."
Watch the static shelf age on the right.
The fix is continuously refreshed benchmarks like LiveBench, SWE-bench, and LMArena, which keep generating never-seen questions or lean on live human votes, so there's nothing static to memorize. Better, not solved.
under the hood
Every benchmark has three parts: the dataset, the protocol (how the model is asked), and the grading. The middle one matters more than you'd think.
GPT-4's widely quoted 86.4% on MMLU is a 5-shot score from its technical report; re-run the same test across two dozen prompt formats and scores swing 4 to 5 points while the model's ability stays unchanged. The toggle on the right shows the simplest version of that knob, zero-shot versus few-shot: same question, different framing, different score.
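To make the knob concrete, here is a minimal sketch of how the same multiple-choice question turns into a zero-shot or a few-shot prompt. The function name `build_prompt`, the "Question / Answer" template, and the toy arithmetic questions are all my own assumptions for illustration; they are not the format used in any particular technical report.

```python
def build_prompt(question, choices, examples=()):
    """Format a multiple-choice question, optionally preceded by solved examples.

    examples is a sequence of (question, choices, answer_letter) tuples.
    An empty sequence gives a zero-shot prompt; k entries give a k-shot prompt.
    """
    def fmt(q, opts, answer=None):
        lines = [f"Question: {q}"]
        lines += [f"{letter}. {opt}" for letter, opt in zip("ABCD", opts)]
        # Solved examples show their answer; the real question leaves it open.
        lines.append(f"Answer: {answer}" if answer else "Answer:")
        return "\n".join(lines)

    blocks = [fmt(q, opts, ans) for q, opts, ans in examples]
    blocks.append(fmt(question, choices))
    return "\n\n".join(blocks)


# Toy data, invented for this sketch.
zero_shot = build_prompt("2 + 2 = ?", ["3", "4", "5", "6"])
one_shot = build_prompt(
    "2 + 2 = ?",
    ["3", "4", "5", "6"],
    examples=[("1 + 1 = ?", ["1", "2", "3", "4"], "B")],
)
```

The question being scored is identical in both prompts; only the framing around it changes, which is exactly why a score should always be reported together with its protocol.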
three ways to grade
Three scoring paradigms are popular, and they're not equally trustworthy. Objective ground truth is rigid but honest: the answer is checkable against a known value.
LLMs-as-judges are seductive because they scale.[14] But they're self-preservative and faintly narcissistic (a wild claim, I know): they prefer answers from their own model family[12] and skew toward verbose responses.[13]
Human pairwise votes feeding an Elo / Bradley-Terry rating is the most statistically honest of the three.[11]
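For a feel of how pairwise votes become a rating, here is a minimal online Elo sketch. The win probability is the Bradley-Terry form on a base-10, 400-point scale. The model names, the starting rating of 1000, and `k=32` are assumptions for illustration, and a production leaderboard may well fit Bradley-Terry in batch rather than update one vote at a time.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo / Bradley-Terry model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(ratings, winner, loser, k=32):
    """Apply one pairwise vote in place and return the ratings dict."""
    r_w, r_l = ratings[winner], ratings[loser]
    surprise = 1.0 - expected_score(r_w, r_l)  # large when an underdog wins
    ratings[winner] = r_w + k * surprise
    ratings[loser] = r_l - k * surprise
    return ratings


# Hypothetical models and a single hypothetical vote.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
update_elo(ratings, winner="model_a", loser="model_b")
```

Two evenly rated models each have a 0.5 expected score, so a single vote moves each of them by `k / 2` points in opposite directions.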
the pivot
The real problem is that an LLM answering an MMLU question is doing something fundamentally different from an agent that has to read a codebase, query the right data sources, run a few experiments, interpret what came back, and decide whether to retry or push on.
The first is a question. The second is a trajectory: dozens of micro-decisions, each one a chance to drift. QA-style questions are a tiny slice of what an agent does, yet they're nearly all we test and train on. So when we keep evaluating big, bulky models tuned for leaderboards, we're missing out on testing the model's true capability and practicality.
We're measuring a fish by its ability to climb a tree, then optimizing the fish to climb trees.
The panel on the right loops through real runs of different lengths: watch how the longer ones wander out of the correct zone, and at a different step each time.
the setup
Since I'm a data scientist, I built twelve data-analysis tasks of varying difficulty, each labeled T (Q&A), W (workflow), or G (gotcha). Pick a task to see what it asks.
Each run gets a fresh, persistent IPython kernel in a sandbox containing only the files that task declares. Nothing else is visible. State persists across tool calls within a trajectory, and the sandbox is destroyed afterward.
Workflow tasks get 30 steps and 8K tokens per turn; Q&A and gotcha tasks get 20 steps and 4K. Wall-clock and total-token budgets cap each run as a safety net.
The goal is to pin down the variance in agent behavior. Ideally the right answer shouldn't fluctuate, but in the agent world it does: single-run pass rates can swing several points even at temperature 0.[16] Agents stumble into correct results via broken methods, and vice versa. I wanted to capture that.
SWE-bench goes wide (N=2294 tasks, one seed). My budget went deep (12 tasks, ten seeds). The Bjarnason variance work argues that more seeds over fewer tasks better measures actual agent behavior rather than task coverage,[16] a point echoed by broader studies of benchmark variance.[19]
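Going deep is what makes per-task variance visible at all. A minimal sketch of the bookkeeping: given pass/fail outcomes per task across seeds, compute each task's pass rate and the binomial standard error of that rate. The task ids and outcomes below are invented toy data, not my results, and `summarize` is a hypothetical helper name.

```python
def summarize(results):
    """Map task id -> (pass_rate, standard_error) across seeds.

    results: {task_id: [1 or 0 per seed]}.
    The standard error is the plain binomial sqrt(p * (1 - p) / n).
    """
    summary = {}
    for task, outcomes in results.items():
        n = len(outcomes)
        p = sum(outcomes) / n
        summary[task] = (p, (p * (1 - p) / n) ** 0.5)
    return summary


# Toy outcomes over ten seeds each, invented for illustration.
toy = {
    "task_stable": [1] * 10,
    "task_flaky": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
}
summary = summarize(toy)
```

With a single seed, the flaky task would be reported as either a clean pass or a clean fail, and you would have no way of knowing which kind of task you were looking at.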
before the aggregates
Averages hide the story. Here is one trajectory, step by step. Press play and watch exactly where it goes wrong.
Pick a run on the right. These are real trajectories, pulled from the run logs:
data/.
the payoff
I scored each run along three axes, because, as you'll see, any single axis lies.
“Did the agent get the right answer” looks different per task type. For Q&A tasks, I extracted the numeric answer and compared it against ground truth with relative tolerances (e.g. ±15% for T4). For workflow tasks, “did the file exist” is the weakest check, so I added two extra layers: (a) does the artifact have minimum substance, and (b) is the deliverable set clean.
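The Q&A check is simple enough to sketch. The version below is a naive stand-in, not my actual grader: `extract_number` and `within_tolerance` are hypothetical names, taking the last number in the text is an assumption, and the regex ignores currency suffixes like "M". The 15% default mirrors the T4 tolerance mentioned above.

```python
import re

# One number: optional sign, digits with optional thousands separators, optional decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def within_tolerance(predicted, truth, rel_tol=0.15):
    """True if predicted is within rel_tol of truth, measured relative to truth."""
    if truth == 0:
        return predicted == 0
    return abs(predicted - truth) <= rel_tol * abs(truth)


# Toy answer string, invented for illustration.
answer = extract_number("The mean order value is 1,234.5 units")
ok = within_tolerance(answer, truth=1200.0)
```

Measuring the error relative to the ground truth, rather than to the larger of the two values, keeps the tolerance band fixed no matter how far off the agent's answer is.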
Substring matching for “refusal phrases” failed immediately: one trajectory flagged “limited data” yet confidently produced a $1.9M forecast. I pivoted to an LLM judge using a 3-point rubric (refusal, hedged, or no pushback) calibrated against hand-labeled ground truth. Outcomes are now tagged with one of five categories (Resolved, Hedged, No Pushback, No-Op, or Empty), which gives the granular data needed to design better interventions.
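The failure mode is easy to reproduce. In this sketch the phrase list and the message are both invented for illustration (the message paraphrases the pattern, it is not a line from the logs), and `naive_refusal_flag` is a hypothetical name.

```python
# Hypothetical phrase list; a real one would be longer and just as brittle.
REFUSAL_PHRASES = ("limited data", "insufficient data", "cannot forecast")


def naive_refusal_flag(final_message):
    """Flag a run as a refusal if any known phrase appears in the final message."""
    text = final_message.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)


# Invented message: it mentions the caveat, then forecasts anyway.
message = "Despite limited data, the forecast for next quarter is $1.9M."
flagged = naive_refusal_flag(message)
```

The flag fires because the phrase is present, yet the message is the opposite of a refusal. Whether an agent pushed back is a question about what it did with the caveat, which is why a rubric-based judge fits better than string search.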
Watch the same task, G1, run two ways, with cost adding up step by step. The baseline agent keeps spending through a dozen-plus steps, climbing to ~$0.28 just to produce a confident, wrong $1.9M forecast.
The calibrated agent recognises the data is too thin, stops early, and plateaus at ~$0.11. This time it was right to push back. The expensive run is the wrong one. Spend tells you almost nothing about whether the answer is good: expensive isn't correct; cheap isn't wrong.
In W4, the “final boss” of the workflow tasks, standard scoring fails. Although every baseline run produced an analysis plan, the substance varied wildly. One trajectory correctly flagged C002's selection bias and pushed back; another recognised the bias but applied Difference-in-Differences (DiD), reporting a flawed +$9.76 effect.
I decomposed W4 into five phases: Data Loading, Assignment Diagnosis, Bias Recognition, Method Recommendation, and Production. Using an LLM judge (shoutout Zheng et al.[14]), I calculated the chain probability: P(success at phase k | prior phases passed). These conditionals pinpoint where agents fail, mimicking a process reward model.[15]
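The chain probability itself is just conditional counting. A minimal sketch, assuming each run has already been reduced to one pass/fail verdict per phase (the judge call is out of scope here); `chain_probabilities` is a hypothetical name and the four runs are toy data, not my trajectories.

```python
PHASES = [
    "Data Loading",
    "Assignment Diagnosis",
    "Bias Recognition",
    "Method Recommendation",
    "Production",
]


def chain_probabilities(runs):
    """Estimate P(pass phase k | all earlier phases passed) for each phase.

    runs: list of per-run verdict lists, one bool per phase, in PHASES order.
    Returns None for a phase that no run reached with a clean record.
    """
    probs = []
    for k in range(len(PHASES)):
        # Condition on runs that passed every phase before k.
        reached = [run for run in runs if all(run[:k])]
        if not reached:
            probs.append(None)
        else:
            probs.append(sum(run[k] for run in reached) / len(reached))
    return probs


# Toy verdicts, invented for illustration.
toy_runs = [
    [True, True, True, True, True],
    [True, True, True, False, False],
    [True, False, True, True, True],
    [True, True, True, False, True],
]
probs = chain_probabilities(toy_runs)
```

Note that the denominator shrinks as you move down the chain: the third toy run fails phase 2, so it drops out of every later conditional even though its later verdicts are passes.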
Results showed high success across all phases except Phase 4 (Method Recommendation), which hit 60–70%. The gap between “success” and “correctness” lies solely in choosing the right method for biased data.
I also tracked deterministic correctness (reality-matching) and LLM-judged artifact quality. Since Sonnet 4.5 always writes well, I ignored formatting to focus on method–data fit, uncertainty discipline, and actionability.
The reliability of the evaluation itself remains a critical bottleneck. To test this, I ran the Phase 4 judge three times on the same 10 trajectories at temperature 0 and observed three distinct success distributions: 7/10, 0/10, and 6/10. This variance highlights a sobering reality: even when using structured rubrics and deterministic settings, LLM judges are not perfectly reproducible.
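One way to put a number on this is to line up repeated judge passes over the same trajectories and count how often the verdicts agree. A minimal sketch: `judge_stability` is a hypothetical name, and the verdicts are toy data over four trajectories, not the per-trajectory results behind the counts above.

```python
def judge_stability(repeats):
    """Summarise agreement across repeated judge passes.

    repeats: list of passes, each a list of bool verdicts in the same
    trajectory order. Returns (pass count per repeat, fraction of
    trajectories on which every repeat gave the same verdict).
    """
    pass_counts = [sum(verdicts) for verdicts in repeats]
    per_trajectory = list(zip(*repeats))
    unanimous = sum(1 for verdicts in per_trajectory if len(set(verdicts)) == 1)
    return pass_counts, unanimous / len(per_trajectory)


# Toy verdicts: three judge passes over four trajectories, invented for illustration.
toy_repeats = [
    [True, True, False, True],
    [False, False, False, False],
    [True, False, False, True],
]
counts, agreement = judge_stability(toy_repeats)
```

Pass counts alone can hide disagreement, since two passes can reach the same total by approving different trajectories, so the per-trajectory agreement rate is worth reporting next to them.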
This inconsistency suggests that “LLM-as-a-judge” introduces its own layer of noise into the evaluation pipeline. Since minor config changes can yield very different results, future work must investigate judge model choice and prompt formatting as independent variables.
So that's the case for measuring agents differently, or at least my attempt at it. The thesis is simple: benchmark scores keep climbing, but a single leaderboard number can't tell you whether an agent will actually do the job. The gap between the two is where the real work, and the real risk, lives.
To recap, six things worth tracking instead of a pass-rate:
This was not an exhaustive study, and plenty was left out. It's one model (Sonnet 4.5) at temperature 0, on twelve hand-built data-analysis tasks, i.e. a method demo. The data is synthetic (production data is messier still), and the N is deliberately small. The LLM judge also brings its own uncertainty.
But I hope it was useful for evaluating agents, and for the broader reminder that benchmarks measure the exam while these measure the job. And if at some point you caught yourself muttering “huh, maybe the leaderboard isn't the whole story”, then yes, dear reader, you're right. Now go measure the trajectory.