How Well Do Agents Actually Work?

A statistical dive into the gap between benchmarks and reality.

by Shradha Kaushal

let's start with the elephant in the room

If you haven't been living under a rock, you've noticed that seemingly everyone and their mother has been handed an executive mandate to automate things with ✨agentic✨ AI workflows. Companies have gone to wild lengths of dubious efficacy, from tokenmaxxing leaderboards to slide decks full of robot emoji.

But how well does performance on vanity benchmarks actually translate into day-to-day work? Are we all really becoming 100× more productive? In one randomized trial, experienced developers were actually 19% slower with 2025 AI tools, even though they believed they were faster.[17]

The chart on the right is the whole question: benchmark scores keep soaring, with whole benchmarks saturating within months of release,[1] but the revenue they are supposed to unlock stays flat. I decided to dig into that gap with a statistical lens. To get there, we first have to understand the exams we've been grading these models on.

MMLU climbed from ~44% (2020) to 86% (GPT-4, 2023) and is now saturating near the ceiling; new benchmarks fall within a year of release.[1] The real-world payoff has been far slower.[17] (Payoff line indexed, illustrative.)

a quick history of the exam

A benchmark is just an exam

An LLM benchmark is, at its simplest, an exam. A model studies for it (training) and then sits a standardized test (testing). A good one is broad, covering math, reasoning, and code, with questions the model hasn't seen.

For most of the pre-2024 era, the field leaned on what I'd call a static shelf of exams: MMLU for general knowledge, HumanEval for code, and GSM8K for math.

These were great when they launched. They are not great now, because a static exam has a fatal flaw: contamination.

the fatal flaw

Once a test is public, it leaks into the answer key

Frontier labs crawl the entire web for training data. So once a benchmark is public, it eventually ends up inside the next model's training set,[7] and a model evaluated on data it already trained on is scored too highly.[8] The model has effectively seen the answer key, and "benchmark performance" quietly starts to mean "memorization" instead of "capability."

Watch the static shelf age on the right.

The fix is continuously refreshed benchmarks like LiveBench, SWE-bench, and LMArena, which keep generating never-seen questions or lean on live human votes, so there's nothing static to memorize. Better, not solved.

Takeaway: every benchmark has a shelf life. The day it goes public, the clock starts.
The reported score balloons from memorization, far outpacing the slow climb in true capability. The widening gap is contamination. (Schematic: the shape is the point, not the exact values.)

under the hood

The anatomy of a benchmark

Every benchmark has three parts: the dataset, the protocol (how the model is asked), and the grading. The middle one matters more than you'd think.

GPT-4's widely-quoted 86.4% on MMLU is a 5-shot score from its technical report; re-run the same test across two dozen prompt formats and scores swing 4 to 5 points with the model's ability unchanged. The toggle on the right shows the simplest version of that knob, zero-shot versus few-shot: same question, different framing, different score.

The protocol is the framing around the question.
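To make the zero-shot versus few-shot knob concrete, here is a minimal sketch of how the same question can be framed two ways. The `Q:`/`A:` template, the function name, and the arithmetic examples are my own placeholders, not MMLU's actual format.

```python
from typing import Sequence, Tuple

Example = Tuple[str, str]  # (question, answer) pair used as a worked example


def build_prompt(question: str, shots: Sequence[Example] = ()) -> str:
    """Build a zero-shot prompt when `shots` is empty, a few-shot one otherwise.

    The "Q:/A:" template is a hypothetical format chosen for illustration.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in shots]  # solved examples first
    parts.append(f"Q: {question}\nA:")             # the real question last
    return "\n\n".join(parts)


question = "What is 7 * 8?"
zero_shot = build_prompt(question)
few_shot = build_prompt(
    question,
    shots=[("What is 2 * 3?", "6"), ("What is 4 * 5?", "20")],
)
```

The question is identical in both prompts; only the framing around it changes, and that framing is part of what the reported score measures.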

three ways to grade

How the answer gets scored

Three scoring paradigms are popular, and they're not equally trustworthy. Objective ground truth is rigid but honest: the answer is checkable against a known value.

LLMs-as-judges are seductive because they scale.[14] But they're self-preservative and faintly narcissistic (a wild claim, I know): they prefer answers from their own model family[12] and skew toward verbose responses.[13]

Human pairwise votes feeding an Elo / Bradley-Terry rating are the most statistically honest of the three.[11]

Objective truth, the scalable but biased LLM judge, and human preference (Elo, like chess), which is the most honest.
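For the curious, this is roughly how a pairwise vote turns into a rating. It is a sketch of the classic online Elo update, assuming a starting rating of 1000 and K = 32 (both my choices). Arena-style leaderboards may instead fit a Bradley-Terry model over all votes at once, so treat this as the intuition rather than anyone's production method.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo (logistic) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """Return updated (winner, loser) ratings after one human vote."""
    surprise = 1.0 - expected_score(r_winner, r_loser)  # big if an upset
    delta = k * surprise
    return r_winner + delta, r_loser - delta


# Hypothetical models and votes, each vote recorded as (winner, loser).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]

for winner, loser in votes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
```

Each update is zero-sum, so the total rating stays constant and an upset moves the numbers more than an expected win.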

the pivot

Now break all of that with an agent

The real problem is that an LLM answering an MMLU question is doing something fundamentally different from an agent that has to read a codebase, query the right data sources, run a few experiments, interpret what came back, and decide whether to retry or push on.

The first is a question. The second is a trajectory: dozens of micro-decisions, each one a chance to drift. QA-style questions are a tiny slice of what an agent does, yet they're nearly all we test and train on. So when we keep evaluating big, bulky models tuned for leaderboards, we never actually test the model's true capability or its practical usefulness.

We're measuring a fish by its ability to climb a tree, then optimizing the fish to climb trees.

The panel on the right loops through real runs of different lengths: watch how the longer ones wander out of the correct zone, and at a different step each time.

Takeaway: single-shot exams can't see drift. To measure an agent you have to watch the whole trajectory and its variance.
Real runs, replayed: each is a different trajectory (4–27 steps). The short ones stay in the correct zone; longer ones drift out — at a different step each time.

the setup

So let's actually watch some agents work

Twelve tasks, three flavors

Since I'm a data scientist, I built twelve data-analysis tasks of varying difficulty, each labeled T, W, or G. Pick a task to see what it asks.

A footnote: I went with data-analysis tasks because they tap real research skills, and because everyone has done some level of DA at some point. Don't worry, no deep ML here. This post is about how we evaluate models, and how we could do it better.
Legend: T · Q&A, W · workflow, G · gotcha

The harness

Each run gets a fresh, persistent IPython kernel in a sandbox containing only the files that task declares. Nothing else is visible. State persists across tool calls within a trajectory, and the sandbox is destroyed afterward.

Workflow tasks get 30 steps and 8K tokens per turn; Q&A and gotcha tasks get 20 steps and 4K. Wall-clock and total-token budgets cap each run as a safety net.
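The per-type budgets above can be written down as a small lookup. This is a sketch, not the harness's real config: the class and function names are hypothetical, and I'm reading "8K" and "4K" as 8,000 and 4,000 tokens, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_steps: int
    max_tokens_per_turn: int


# Keyed by task-type prefix. "8K"/"4K" are assumed to mean 8_000/4_000.
BUDGETS = {
    "W": Budget(max_steps=30, max_tokens_per_turn=8_000),  # workflow
    "T": Budget(max_steps=20, max_tokens_per_turn=4_000),  # Q&A
    "G": Budget(max_steps=20, max_tokens_per_turn=4_000),  # gotcha
}


def budget_for(task_id: str) -> Budget:
    """Look up the budget from the first letter of the task id, e.g. 'W4'."""
    return BUDGETS[task_id[0]]


def within_budget(task_id: str, steps_used: int, tokens_this_turn: int) -> bool:
    """True while a run is still inside its step and per-turn token limits."""
    b = budget_for(task_id)
    return steps_used <= b.max_steps and tokens_this_turn <= b.max_tokens_per_turn
```

The wall-clock and total-token caps are left out here because the post doesn't state their values.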

One trajectory, start to teardown.

Why ten seeds, not more tasks?

The goal is to pin down the variance in agent behavior. Ideally the right answer shouldn't fluctuate, but in the agent world it does: single-run pass rates can swing several points even at temperature 0.[16] Agents stumble into correct results via broken methods, and vice versa. I wanted to capture that.

SWE-bench goes wide (N=2294 tasks, one seed). My budget went deep (12 tasks, ten seeds). The Bjarnason variance work argues that, for a fixed budget, running more seeds on fewer tasks tells you more about how an agent actually behaves, at the price of task coverage,[16] a point echoed by broader studies of benchmark variance.[19]

Same budget, partitioned two ways: cover more ground, or measure the same ground many times to see the wobble.
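Here is what "seeing the wobble" looks like in code. The outcomes below are invented for illustration (three tasks, four seeds) and are not the post's results. The point is only that repeated seeds give you a spread, where a single seed gives you one number and no idea how lucky it was.

```python
from statistics import mean, stdev

# HYPOTHETICAL pass/fail outcomes: task -> one result per seed.
runs = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, False],
    "task_c": [False, False, True, False],
}


def per_seed_pass_rates(runs: dict) -> list:
    """Pass rate over all tasks, computed separately for each seed."""
    n_seeds = len(next(iter(runs.values())))
    n_tasks = len(runs)
    return [
        sum(outcomes[seed] for outcomes in runs.values()) / n_tasks
        for seed in range(n_seeds)
    ]


rates = per_seed_pass_rates(runs)
print("per-seed pass rates:", [round(r, 2) for r in rates])
print("mean:", round(mean(rates), 2), "stdev across seeds:", round(stdev(rates), 2))
```

Had you run only one of those seeds, you would have reported whichever single pass rate you happened to draw.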

before the aggregates

Watch one run drift

Averages hide the story. Here is one trajectory, step by step. Press play and watch exactly where it goes wrong.

Pick a run on the right. These are real trajectories, pulled from the run logs:

  • G1 baseline, the confident hallucinator. Asked for a 6-month forecast from 6 days of data, it flags "limited data" and then ships $1,896,114 anyway.
  • G1 calibrated, the fix. Same task with a calibration prompt added, and now it refuses: "cannot be completed reliably … 6 days for a ~180-day horizon." This single change moved G1 from 0/10 → 8/10.
  • W4 baseline, the plausible-but-wrong. It names the right methods, then recommends plain DiD for the one campaign with severe selection bias.
Real trajectories, condensed from the run logs in data/.

the payoff

What 120 trajectories actually showed

I scored each run along three axes, because, as you'll see, any single axis lies.

1 · Outcome reliability: did the agents succeed?

“Did the agent get the right answer” looks different per task type. For Q&A tasks, I extracted the numeric answer and compared it against ground truth with relative tolerances (e.g. ±15% for T4). For workflow tasks, “did the file exist” is the weakest check, so I added two extra layers: (a) does the artifact have minimum substance, and (b) is the deliverable set clean.
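The Q&A check boils down to "pull out a number, compare within a relative tolerance". Here is a minimal sketch. The regex and the choice to take the last number in the response are my assumptions, not the exact extraction logic used for the runs.

```python
import re
from typing import Optional

NUMBER = re.compile(r"-?\d[\d,]*\.?\d*")


def extract_number(text: str) -> Optional[float]:
    """Return the last number in `text` (an assumed convention), or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def within_rel_tol(answer: float, truth: float, rel_tol: float) -> bool:
    """True if `answer` is within ±rel_tol (as a fraction) of `truth`."""
    return abs(answer - truth) <= rel_tol * abs(truth)


# Hypothetical response and ground truth, with a ±15% tolerance.
value = extract_number("Estimated revenue: 1,100.0 USD")
passed = value is not None and within_rel_tol(value, truth=1000.0, rel_tol=0.15)
```

Measuring the tolerance against the ground truth, rather than against the answer, keeps the pass band the same width no matter what the agent outputs.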

Substring matching for “refusal phrases” failed immediately: one trajectory flagged “limited data” yet confidently produced a $1.9M forecast. I pivoted to an LLM judge using a 3-point rubric (refusal, hedged, or no pushback) calibrated against hand-labeled ground truth. Outcomes are now tagged with specific categories (Resolved, Hedged, No Pushback, No-Op, or Empty), which provides the granular data needed to design better interventions.
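To see why the substring approach fell over, here is a toy reproduction. The phrase list is hypothetical, and the sample response is a paraphrase I wrote from the G1 behaviour described above, not a verbatim log line.

```python
import re

# Hypothetical refusal phrases a naive checker might look for.
REFUSAL_PHRASES = ("limited data", "insufficient data", "cannot be completed")


def naive_is_refusal(text: str) -> bool:
    """Flags a response as a refusal if any phrase appears anywhere in it."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in REFUSAL_PHRASES)


def ships_a_dollar_figure(text: str) -> bool:
    """Crude check for whether the response still delivers a dollar amount."""
    return re.search(r"\$\s?\d", text) is not None


# Paraphrased example: caveats the data, then answers anyway.
sample = "Caveat: limited data (6 days). 6-month forecast: $1,896,114."

print(naive_is_refusal(sample))       # True
print(ships_a_dollar_figure(sample))  # True
```

Both checks fire on the same response, so the substring matcher scores a confident forecast as a refusal. That is the distinction between "hedged" and "refused" that the rubric exists to capture.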

Resolved, Hedged, No Pushback, No-Op, Empty, across 10 seeds per task. Real, from 120 baseline runs.

2 · How did it get there? (cost ≠ truth)

Watch the same task, G1, run two ways, with cost adding up step by step. The baseline agent keeps spending through a dozen-plus steps, climbing to ~$0.28 just to produce a confident, wrong $1.9M forecast.

The calibrated agent recognises the data is too thin, stops early, and plateaus at ~$0.11. This time it was right to push back. The expensive run is the wrong one. Spend tells you almost nothing about whether the answer is good: expensive isn't correct; cheap isn't wrong.

Cost accumulation per step on G1 (real trajectories, replayed). Baseline keeps spending and lands on a wrong answer; the calibration intervention plateaus early and is correct.

3 · Where did it go wrong? (the W4 deep-dive)

In W4, the “final boss” of the workflow tasks, standard scoring fails. Although every baseline run produced an analysis plan, the substance varied wildly. One trajectory correctly flagged C002's selection bias and pushed back; another recognised the bias but applied Difference-in-Differences (DiD) anyway, reporting a flawed +$9.76 effect.

I decomposed W4 into five phases: Data Loading, Assignment Diagnosis, Bias Recognition, Method Recommendation, and Production. Using an LLM judge (shoutout Zheng et al.[14]), I calculated the chain probability: P(success at phase k | prior phases passed). These conditionals pinpoint where agents fail, mimicking a process reward model.[15]
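The chain probability is easier to read as code than as notation. This sketch assumes each trajectory has already been judged into one pass/fail flag per phase. The three toy trajectories are invented for illustration and are not the W4 results.

```python
PHASES = [
    "Data Loading",
    "Assignment Diagnosis",
    "Bias Recognition",
    "Method Recommendation",
    "Production",
]


def conditional_pass_counts(trajectories: list) -> list:
    """For each phase k, return (passed, eligible).

    A run is eligible at phase k only if it passed every earlier phase,
    so passed / eligible estimates P(success at k | prior phases passed).
    """
    counts = []
    for k in range(len(PHASES)):
        eligible = [t for t in trajectories if all(t[:k])]
        passed = sum(1 for t in eligible if t[k])
        counts.append((passed, len(eligible)))
    return counts


# HYPOTHETICAL judged trajectories, one pass/fail flag per phase.
toy = [
    [True, True, True, True, True],
    [True, True, True, False, True],
    [True, True, False, False, False],
]

for phase, (passed, eligible) in zip(PHASES, conditional_pass_counts(toy)):
    print(f"{phase}: {passed}/{eligible}")
```

Conditioning on earlier phases stops one early failure from being counted again at every later phase, which is what lets the numbers point at a single weak link.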

Results showed high success across all phases except Phase 4 (Method Recommendation), where the pass rate dropped to 60–70%. The gap between “success” and “correctness” lies solely in choosing the right method for biased data.

I also tracked deterministic correctness (reality-matching) and LLM-judged artifact quality. Since Sonnet 4.5 always writes well, I ignored formatting to focus on method–data fit, uncertainty discipline, and actionability.

Per-phase pass rate across W4's five phases (real, n=10), with Wilson 95% intervals. Same idea as a process reward model.[15]
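With n=10, the interval matters as much as the point estimate. Below is a sketch of the standard Wilson score interval, with z = 1.96 for roughly 95% coverage. The function name is mine, and the 7-of-10 input is just the kind of count you get from one phase.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (about 95% at z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(7, 10)
print(f"7/10 -> [{low:.2f}, {high:.2f}]")
```

A 7/10 pass rate comes with an interval running from roughly 0.40 to 0.89, which is why a phase sitting at 60–70% on ten runs should be read as "noticeably weaker" rather than as a precise number.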

A note on the judges

The reliability of the evaluation itself remains a critical bottleneck. To test this, I ran the Phase 4 judge three times on the same 10 trajectories at temperature 0 and got three different pass counts: 7/10, 0/10, and 6/10. This variance highlights a sobering reality: even with structured rubrics and nominally deterministic settings, LLM judges can be far from reproducible.

This inconsistency suggests that “LLM-as-a-judge” introduces its own layer of noise into the evaluation pipeline. Since minor config changes can yield very different results, future work must investigate judge model choice and prompt formatting as independent variables.

The real takeaway: don't trust a single benchmark number for an agent. Measure trajectories, measure variance, and treat your evaluator as a thing that also needs evaluating.
LLM judge vs the deterministic rubric, on the 20 gotcha runs. The rubric fails all 20; the judge flips 9 to pass, every one of them a G1 forecast it should have rejected.

Conclusion

So that's the case for measuring agents differently — or at least my attempt at it. The thesis is simple: benchmark scores keep climbing, but a single leaderboard number can't tell you whether an agent will actually do the job. The gap between the two is where the real work, and the real risk, lives.

To recap, six things worth tracking instead of a pass-rate:

This was not an exhaustive study, and plenty was left out. It's one model (Sonnet 4.5) at temperature 0, on twelve hand-built data-analysis tasks, i.e. a method demo. The data is synthetic (production data is messier still), and the N is deliberately small. And the LLM judge adds its own uncertainty to every judged score.

But I hope it was useful for evaluating agents, and for the broader reminder that benchmarks measure the exam while trajectory-level metrics measure the job. And if at some point you caught yourself muttering “huh, maybe the leaderboard isn't the whole story”, then yes, dear reader, you're right. Now go measure the trajectory.