GDPval, Explained: What the Benchmark Measures and What It Doesn't

Agentropic · AI benchmarks, LLM evaluation, GDPval

GDPval is OpenAI’s September 2025 benchmark that asks a deliberately narrow question: when a frontier model is handed a real professional task (the kind someone actually gets paid to do) and produces a deliverable, how often does an expert in that field rate the model’s output as good as or better than a human expert’s? The answer in the original paper was that the best model, Claude Opus 4.1, was rated as-good-as-or-better on 47.6% of the 220-task gold subset. That number has been widely paraphrased as “AI matches experts on half of economically valuable work.” It does not mean that, and the gap between what GDPval measures and what it gets quoted as measuring is the entire point of this article.

This is a useful benchmark — arguably the best public attempt yet to tie model quality to economically meaningful knowledge work rather than to math olympiad problems or trivia. But almost every secondary summary mangles three things: what “win rate” actually counts, what the “100x cheaper” headline actually compares, and how reliable the grading is. If you’re going to cite GDPval, cite it correctly.

How GDPval is actually constructed

The benchmark (paper: arXiv:2510.04374, “GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks,” lead author Tejal Patwardhan, posted October 2025) is built around 1,320 tasks spanning 44 occupations. The occupations were chosen to populate the nine U.S. sectors that contribute most to GDP: Real Estate and Rental and Leasing; Manufacturing; Professional, Scientific, and Technical Services; Government; Health Care and Social Assistance; Finance and Insurance; Retail Trade; Wholesale Trade; and Information. Within each sector they picked the occupations contributing most to that sector’s output, then sampled tasks from them. A 220-task “gold” subset (five tasks per occupation) was open-sourced; this is the subset the headline numbers come from.

Tasks were written by industry professionals — not crowdworkers — averaging 14 years of experience. Each task is a request plus a set of reference files, paired with a real deliverable the expert actually produced. On the gold subset, tasks came with an average of 1.92 reference files (range 0 to 38), and 67.7% required at least one. The deliverables are not chat answers — they are the actual artifacts of professional work: DOCX, XLSX, PPTX, PDF, PNG, MP4, and ZIP outputs, with inputs that can include CAD files, audio, and video. A legal task might ship a redlined contract; a financial-analyst task an Excel model; a media task a rendered video.

Concretely, a task looks like: “Here is a client brief, three prior quarterly decks, and a raw data export. Produce the Q3 investor update as a PowerPoint.” The model gets the prompt and the files, and emits one file. That is the whole interaction. Hold onto that, because the one-shot structure is the source of most of the benchmark’s limits.

Validation was non-trivial. Tasks passed model-in-the-loop screening, then a minimum of three (averaging about five) human expert reviews across generalist, occupation-expert, and final-iteration stages. This is more rigorous than most benchmarks, which is partly why it’s worth taking seriously.

What “win rate” really means

Grading is blind pairwise comparison. An occupational expert is shown the model’s deliverable and the human’s deliverable, without being told which is which, and rates the model’s as better than, as good as, or worse than the human baseline. The headline metric — “win rate” — is the percentage of deliverables graded better-than (win) OR as-good-as (tie).

That OR is doing enormous work, and it’s almost never disambiguated in coverage. Here is the original gold-subset table (model alone, no scaffolding):

Model             Win+tie rate
Claude Opus 4.1   47.6%
GPT-5 (high)      38.8% (OpenAI’s blog cites 39.0%)
o3                ~34–35%
o4-mini           ~28–29%
Gemini 2.5 Pro    25.5%
Grok 4            24.3%
GPT-4o            12.4%

So when you read “Claude approaches experts on half of tasks,” the precise statement is: on 47.6% of gold-subset tasks an expert rated its output at least as good as the human’s — and on the other roughly 52% the model lost outright to the human expert. “Tie” and “win” are pooled. The benchmark does not publish the metric most people think they’re reading, which is “the model produced strictly better work.” A model can post a respectable win rate while almost never beating the human and merely matching them often, and the single number won’t tell you which.
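To make the pooling concrete, here is a minimal sketch of how a win+tie rate hides the split between wins and ties. The grade labels, the function name, and both grade sets are invented for illustration; this is not GDPval's data or OpenAI's scoring code.

```python
from collections import Counter


def summarize_grades(grades):
    """Summarize pairwise grades, each given from the model's point of view.

    Each grade is one of "win", "tie", or "loss" (hypothetical labels).
    """
    total = len(grades)
    if total == 0:
        raise ValueError("no grades to summarize")
    counts = Counter(grades)
    return {
        # The headline metric pools wins and ties.
        "win_or_tie_rate": (counts["win"] + counts["tie"]) / total,
        # The number most readers assume they are seeing.
        "strict_win_rate": counts["win"] / total,
        "tie_rate": counts["tie"] / total,
        "loss_rate": counts["loss"] / total,
    }


# Two invented grade sets that share the same headline number.
mostly_wins = ["win"] * 45 + ["tie"] * 5 + ["loss"] * 50
mostly_ties = ["win"] * 5 + ["tie"] * 45 + ["loss"] * 50

a = summarize_grades(mostly_wins)
b = summarize_grades(mostly_ties)
print(a)
print(b)
```

Both sets report a 50% headline rate, yet one model beats the human on 45% of tasks and the other on 5%.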

The qualitative breakdown is more informative than the score. Claude Opus 4.1 led on aesthetics — formatting, slide layout, visual polish — and topped 8 of 9 sectors. GPT-5 led on accuracy and instruction-following. The losses clustered by failure mode: instruction-following misses (Claude, Grok, Gemini), formatting errors (GPT-5), hallucinated or miscalculated numbers, and — for Gemini and Grok — deliverables that were promised in the response text but never actually produced. That last one matters: a chunk of “losses” are not bad work but missing work, which a different harness might catch and retry.

What “~half of tasks” does and doesn’t generalize to

GDPval is explicitly one-shot: a single prompt plus reference files, a single deliverable, graded once. OpenAI is clear in the paper and blog that this measures “capabilities on discrete tasks, not predictions of occupational displacement.” Most secondary articles drop that caveat, which is precisely backwards — it’s the most important sentence in the release.

Here’s why the one-shot framing caps what the number can tell you. Real knowledge work is multi-turn (you draft, get feedback, revise), context-accumulating (you know the client, the history, last quarter’s argument), collaborative (meetings, Slack threads, hallway corrections), and increasingly agentic (the work involves calling tools, querying systems, iterating against a build). GDPval captures none of that. It captures the slice of a job that can be expressed as “here is everything you need, now produce the artifact in one pass.” Some jobs are mostly that slice; many are mostly the parts GDPval excludes. The benchmark also relies on experts self-selecting tasks they deemed “representative,” which is a soft form of selection bias — representative of the artifact-producing portion of the role, not necessarily of where the hours or the value go.

So: a 47.6% win rate is a real, hard-won signal about discrete-deliverable quality under one specific setup. It is not a claim that a model can do half of any occupation, and OpenAI does not make that claim.

The “100x faster and cheaper” headline, debunked carefully

This is the single most distorted figure in GDPval coverage. The “~100x faster and ~100x cheaper” line is a raw inference-only comparison: the cost and wall-clock of a model generating the deliverable, against the average expert task, which the paper puts at roughly 7 hours of work and about $361 (the 220-task gold mean; the broader set runs higher). Of course a few dollars of inference beats seven hours of a professional’s time on the raw axis. That comparison assumes the model output is used as-is, with zero review.

The paper itself models the realistic case — the “try the model once, then a human checks it and fixes it if it’s bad” workflow — and the numbers collapse. Under that scenario GPT-5 came out to roughly 1.39x faster and 1.63x cheaper at the optimistic end, and as low as 1.12x faster / 1.18x cheaper in the single-attempt case. Across the best models the realistic productivity gain landed around 1.12x to 1.39x. Not 100x. The difference between “100x” and “1.3x” is the cost of oversight: someone qualified has to read the output, verify the numbers, catch the hallucinated figure or the missing slide, and own the result. That review labor — plus integration and liability — is most of the real cost of using a model on professional work, and it’s exactly what the raw comparison zeroes out.

If you quote one number from this whole article, quote this one: GDPval’s own realistic productivity estimate is roughly 1.1–1.4x, not 100x.
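The arithmetic behind that collapse is easy to sketch. The formula below is a simplified reading of the "try once, review, redo by hand if rejected" workflow, not the paper's exact calculation, and the model time and review time are hypothetical values chosen only for illustration.

```python
def review_workflow_speedup(human_hours, model_hours, review_hours, acceptance_rate):
    """Speedup of 'try the model once, review it, redo by hand if rejected'.

    Assumed model: expected time = model time + review time
                                   + P(rejected) * human time.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    expected_hours = (
        model_hours + review_hours + (1.0 - acceptance_rate) * human_hours
    )
    return human_hours / expected_hours


human_hours = 7.0    # roughly the expert task time quoted above
model_hours = 0.1    # hypothetical generation time
review_hours = 1.0   # hypothetical time for an expert to check the output

raw = human_hours / model_hours  # inference-only comparison, zero review
realistic = review_workflow_speedup(
    human_hours, model_hours, review_hours, acceptance_rate=0.39
)
print(f"raw speedup: {raw:.0f}x, with review: {realistic:.2f}x")
```

With these invented inputs the raw ratio is 70x and the review-adjusted figure comes out near 1.3x. Most of the remaining time is the human redoing the tasks the model lost, so the acceptance rate, not inference speed, drives the result.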

Reliability: how soft is the ceiling?

A benchmark is only as trustworthy as its graders, and GDPval’s graders disagree with each other a lot. Human inter-rater agreement among expert graders was 71%, meaning they disagreed on which deliverable was better about 29% of the time. That sets a soft ceiling on the entire benchmark: if domain experts split on nearly a third of comparisons of the same pair of artifacts, then small win-rate differences cannot be separated from grading noise. Treat a 3- or 4-point gap between two models as “probably indistinguishable on this benchmark,” not as a ranking.
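Grader disagreement is not the only source of noise; 220 tasks is also a small sample. As a rough sanity check, the sketch below computes the plain binomial standard error of a win rate and the z-score of a hypothetical 4-point gap. It assumes one independent grade per task and treats the two models' estimates as independent, which is a simplification of how GDPval is actually graded.

```python
import math


def win_rate_standard_error(win_rate, n_tasks):
    """Binomial standard error of a win rate estimated from n independent tasks."""
    return math.sqrt(win_rate * (1.0 - win_rate) / n_tasks)


def gap_z_score(rate_a, rate_b, n_tasks):
    """Z-score of the gap between two rates, treating the estimates as independent."""
    se_gap = math.sqrt(
        win_rate_standard_error(rate_a, n_tasks) ** 2
        + win_rate_standard_error(rate_b, n_tasks) ** 2
    )
    return (rate_a - rate_b) / se_gap


se = win_rate_standard_error(0.476, 220)
z = gap_z_score(0.476, 0.436, 220)  # 0.436 is a hypothetical rival score
print(f"standard error: {se * 100:.1f} points, z for a 4-point gap: {z:.2f}")
```

Under these assumptions the standard error on a 47.6% score is about 3.4 points, and a 4-point gap falls well short of the usual 1.96 threshold for significance.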

The automated grader, a GPT-5-based judge that OpenAI released as a public grading service, reached 66% agreement with human graders, about 5 points under the human-human ceiling. That’s genuinely good for an LLM judge, but OpenAI explicitly does not treat it as a full substitute for expert grading, and you shouldn’t either: it’s 66% against a target that itself is only 71% self-consistent. A subtler problem, raised in critiques (e.g. Pranil Dasika’s writeup), is the conflict of using a model from one of the labs being graded as the judge of all of them: even with blind comparison, shared training-data priors can bias which deliverable “looks right.” Public access to the grader is a real virtue for reproducibility; it doesn’t dissolve the construct concern.

The harness matters as much as the model

A GDPval score is a model-plus-scaffold number, not a pure property of the model. The paper demonstrates this directly:

  • Best-of-N sampling (N=4) improved GPT-5’s win rate by about 5 percentage points — generate four candidates, pick the best, and you’ve moved the model up a tier without touching its weights.
  • Structured prompting cut PowerPoint formatting errors from 86% to 64% and eliminated the black-square PDF rendering artifacts that had previously corrupted over half of outputs.

Read that again: more than half of PDF outputs at one point had rendering artifacts that were fixable with prompting alone. A model that “loses” because its deliverable rendered as black squares isn’t bad at the task — it’s badly harnessed. This means a single quoted GDPval number tells you about a model under a particular scaffold, and a 5-point swing is achievable with sampling and prompt engineering. When you compare two GDPval figures from different sources, you may be comparing harnesses, not models.
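Best-of-N is simple enough to show in a few lines. The sketch below is a generic version of the technique, not the paper's harness: `generate` and `score` are hypothetical callables standing in for a model call and a judge, and the toy stand-ins exist only so the example runs without any API.

```python
import random
from typing import Callable, List


def best_of_n(generate: Callable[[], str], score: Callable[[str], float], n: int = 4) -> str:
    """Generate n candidates and return the one the scorer rates highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate() for _ in range(n)]
    return max(candidates, key=score)


# Toy stand-ins (assumptions): a real harness would call a model and a judge.
rng = random.Random(0)


def fake_generate() -> str:
    quality = rng.randint(1, 10)
    return f"draft with quality {quality}"


def fake_score(candidate: str) -> float:
    # Reads the quality number back off the end of the toy draft.
    return float(candidate.rsplit(" ", 1)[-1])


single = fake_generate()
best = best_of_n(fake_generate, fake_score, n=4)
print(single, "|", best)
```

The weights never change; only the selection step does. The gain depends entirely on how good the scorer is, which is why the harness belongs in any citation of the score.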

The honest list of limitations

Beyond grading noise and one-shot framing, several construct-validity issues deserve to be named plainly:

  • GDP-share as a proxy. Choosing occupations by their contribution to GDP makes “economically valuable” tractable, but the tasks sampled within an occupation are still a thin proxy for what the role economically is. High-GDP doesn’t mean high-automatability, and the task selection doesn’t claim to weight by where the hours go.
  • Self-selected tasks. Experts chose tasks they considered representative. People are not reliable narrators of which parts of their own job are hard, valuable, or automatable.
  • One vendor owns the whole stack. OpenAI built the benchmark, built the grader, and trains several of the graded models. Blind comparison and an external author panel mitigate this; they don’t eliminate the structural incentive.
  • Contamination risk. The 220-task gold subset is now public. Future models can be trained on it, deliberately or via web scraping, which degrades the subset’s value as a held-out test over time. The 1,100 private tasks are the real defense.
  • Aesthetics vs. substance. Pairwise grading rewards polish. Claude Opus 4.1’s lead was substantially an aesthetics lead. A beautifully formatted deck with one wrong number can out-rank an ugly correct one, depending on the grader — which is part of why graders disagree 29% of the time.
  • No agentic or collaborative work. Already covered, but it’s the limitation with the biggest gap between what the number suggests and what jobs require.

Where GDPval stands now, and how to read the leaderboards

If you’re reading this in 2026, the original October-2025 table (Opus 4.1 47.6%, GPT-5 ~39%) is stale as a leaderboard, though still the canonical reference for how the original blind-expert methodology worked. Two things have happened since.

First, Artificial Analysis ran an independent reproduction, GDPval-AA, using OpenAI’s 220 gold tasks but with its own methodology: LLM-judge blind pairwise comparisons aggregated into an Elo rating (human baseline pinned at 1,000), with models given shell and web access — i.e., agentic, not one-shot. This is a meaningfully different measurement. As of June 2026 the GDPval-AA v2 board is topped by Claude Fable 5 (Elo ~1,783), Claude Opus 4.8 (~1,615), and GLM-5.2 max (~1,524). Because the harness is agentic and the aggregation is Elo, these numbers are not comparable to the original win-rate percentages. An Elo of 1,615 vs a human baseline of 1,000 is not “win rate,” it’s a relative-skill estimate under tool access.
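To see why an Elo gap is a different kind of number, here is the standard logistic Elo expected-score formula. Whether GDPval-AA uses this exact 400-point scale is an assumption on our part; the sketch only illustrates what a rating gap means under the textbook model.

```python
def elo_expected_score(rating: float, opponent_rating: float, scale: float = 400.0) -> float:
    """Expected score (win = 1, tie = 0.5, loss = 0) under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / scale))


human_baseline = 1000.0
for rating in (1000.0, 1200.0, 1615.0):
    expected = elo_expected_score(rating, human_baseline)
    print(f"Elo {rating:.0f} vs baseline: expected score {expected:.3f}")
```

Even under that assumption the result is not a win+tie rate. An Elo expected score counts a tie as half a win, whereas the original GDPval metric counts a tie in full, and the comparisons here are made by an LLM judge on agentic runs rather than by human experts on one-shot outputs.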

Second, vendors now cite “GDPval” as a tracked benchmark with shifting framing. OpenAI’s GPT-5.5 (released around April 23, 2026) reported 84.9% on a later GDPval variant. That 84.9% is not the same metric as the original 47.6% win+tie — different task framing, different grading setup — and putting the two side by side is meaningless. “GDPval score” has become a family of numbers, not a single comparable figure.

So, to cite GDPval honestly: (1) always state whether you mean the original blind-expert win+tie rate, the Artificial Analysis Elo, or a vendor variant; (2) state the harness (one-shot vs agentic, plain vs best-of-N); (3) treat sub-5-point gaps as noise given 71% inter-rater agreement; and (4) never let “win rate” imply “beats humans” or imply a speed/cost ROI.
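If you track these figures in a spreadsheet or script, one way to enforce rules (1) and (2) is to store the metric variant and harness next to every number and refuse to compare records that differ. The class and field names below are hypothetical, and the harness label for the GPT-5.5 figure is marked unspecified because the setup is not detailed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GdpvalCitation:
    model: str
    value: float
    metric: str   # e.g. "expert win+tie rate", "GDPval-AA Elo", "vendor variant"
    harness: str  # e.g. "one-shot", "agentic", "best-of-4", "unspecified"


def comparable(a: GdpvalCitation, b: GdpvalCitation) -> bool:
    """Two figures are only comparable when metric and harness both match."""
    return a.metric == b.metric and a.harness == b.harness


opus = GdpvalCitation("Claude Opus 4.1", 47.6, "expert win+tie rate", "one-shot")
gpt5 = GdpvalCitation("GPT-5 (high)", 38.8, "expert win+tie rate", "one-shot")
gpt55 = GdpvalCitation("GPT-5.5", 84.9, "vendor variant", "unspecified")

print(comparable(opus, gpt5))   # same metric, same harness
print(comparable(opus, gpt55))  # different metric family
```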

Bottom line

GDPval is the most serious public attempt so far to measure model performance on real, economically valuable knowledge-work deliverables, and its construction (expert authors, real artifacts, blind pairwise grading, a held-out private set) is better than most of what it competes with. Used correctly, the win rate is a discrete-task quality signal under a specific harness. Used as quoted in most coverage, it becomes a jobs-automation forecast and a 100x ROI promise, and it is neither. The model that “matches experts on half of tasks” still loses outright on the other half, is graded by humans who disagree 29% of the time, gets a 5-point swing from sampling, and delivers a real-world speedup closer to 1.3x than 100x once a human has to check the work. Hold both facts at once: the capability is real and improving fast (from GPT-4o to GPT-5 the win rate roughly tripled, 12.4% to 38.8%, in about 15 months), and the headline numbers are softer and narrower than they sound.
