Computer-Use Agents: What OSWorld Scores Really Tell You About Production Readiness

In early 2026 a frontier model crossed the human baseline on OSWorld for the first time — Anthropic reported Sonnet 4.6 at 72.5% and Opus 4.6 at 72.7% against a 72.36% human number. If you read the roundups, the headline was “agents now match humans at using a computer.” That conclusion is wrong, and understanding why it’s wrong is the whole point of looking at the benchmark at all.

The score is real. The inference people draw from it is not. OSWorld measures something specific and narrow; the thing you want — an agent you can point at a multi-step workflow and walk away from — lives almost entirely in the dimensions OSWorld doesn’t measure. This piece is about exactly what the number certifies, what still rots underneath it, and where these agents break in ways the leaderboard will never show you.

What OSWorld actually tests

OSWorld (Xie et al., NeurIPS 2024; arXiv:2404.07972) is an execution-based benchmark of 369 real computer tasks (361 after a set of broken Google Drive tasks were excluded) running in real virtual machines. The main task set runs on Ubuntu; the environment also supports Windows and macOS. The apps are open-source desktop software plus the browser: LibreOffice (Writer/Calc/Impress), Chrome, VS Code, GIMP, Thunderbird, the file manager, the terminal, and multi-app workflows that chain several of these. Roughly ten people spent about 1,800 person-hours building it.

The mechanics matter because they determine what the score means:

Each task has a hand-written initial-state setup — a script that opens the right files, sets app state, plants the data — so every run starts from a controlled position.
Each task has its own custom Python evaluator that inspects the final machine state: the contents of a saved file, an application setting, a browser cookie, a config value. It does not grade the trajectory, the reasoning, or partial progress.
Grading is binary. The eval returns success or fail. There is no partial credit for getting four of five steps right.
The agent observes screenshots and/or the accessibility (a11y) tree and acts through pyautogui — mouse moves, clicks, keystrokes — until it emits DONE/FAIL or hits a step cap. Early runs capped at 15–50 steps; recent harnesses use 50–100.
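The mechanics above can be sketched in a few lines. This is a minimal illustration, not OSWorld's actual code: the Task record, the run_task loop, the settings.json file, and both toy agents are hypothetical names invented here, and a temporary directory stands in for the virtual machine.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    # Hypothetical task record; field names are illustrative, not OSWorld's schema.
    instruction: str
    setup: Callable[[Path], None]     # plants the controlled initial state
    evaluate: Callable[[Path], bool]  # inspects the final state only
    max_steps: int = 50               # step cap


def run_task(task: Task, agent_step: Callable[[Path, int], str], workdir: Path) -> int:
    """Return 1 or 0. The trajectory is never graded and there is no partial credit."""
    task.setup(workdir)
    for step in range(task.max_steps):
        signal = agent_step(workdir, step)  # the agent acts on the environment
        if signal == "FAIL":
            return 0  # this sketch simply treats a self-reported FAIL as a failed run
        if signal == "DONE":
            break
    # Grade the end state whether the agent said DONE or hit the step cap.
    return int(task.evaluate(workdir))


def plant_settings(workdir: Path) -> None:
    (workdir / "settings.json").write_text(json.dumps({"font_size": 12}))


def font_size_is_14(workdir: Path) -> bool:
    path = workdir / "settings.json"
    return path.exists() and json.loads(path.read_text()).get("font_size") == 14


def scripted_agent(workdir: Path, step: int) -> str:
    # Stand-in for a real agent: edits the file on step 0, then stops.
    if step == 0:
        path = workdir / "settings.json"
        config = json.loads(path.read_text())
        config["font_size"] = 14
        path.write_text(json.dumps(config))
        return "CONTINUE"
    return "DONE"


def idle_agent(workdir: Path, step: int) -> str:
    return "DONE"  # claims success without changing anything


task = Task("Set the editor font size to 14.", plant_settings, font_size_is_14)
with tempfile.TemporaryDirectory() as tmp:
    scripted_score = run_task(task, scripted_agent, Path(tmp))
    idle_score = run_task(task, idle_agent, Path(tmp))
print(scripted_score, idle_score)
```

The point of the sketch is that the grader never sees how the agent got there. An agent that announces DONE without changing the machine scores 0, and one that reaches the right end state by any route scores 1.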

So OSWorld is a genuinely good idea: instead of multiple-choice or a string-match on model output, it runs the agent against a live OS and checks whether the world actually changed the way the task demanded. That’s far harder to game than most LLM benchmarks. It’s also why the score is so easy to over-read.

What the human baseline actually means

The human baseline is 72.36%. People hear “human baseline” and read it as a ceiling: the score of a careful person who, given unlimited time, would get everything right. That’s not what the number says. It says that when competent humans sit down and attempt these tasks, they fail more than a quarter of them. Some tasks are ambiguous (the instruction admits multiple reasonable readings), some are genuinely hard, and some, as we’ll see, have graders that reject correct answers.

This reframes “crossing the baseline” completely. An agent at 72.5% has matched a bar that already bakes in human failure on a confusing, partly-broken task set. It has not demonstrated that it does what a careful human does. It has matched a noisy reference point on short tasks. The 72.36% is a useful sanity anchor, not a definition of “human-level.”

The OSWorld-Verified reset — why most score comparisons are invalid

Here is the single most-violated rule in benchmark roundups. On 28 July 2025, the XLANG Lab released OSWorld-Verified. It is not a new benchmark with new tasks. It is an in-place fix: a team of about ten people spent roughly two months working through 300+ collected issues with the existing tasks.

What they fixed tells you how broken the original was:

Over-strict graders. Evaluators that demanded exact string matches were loosened to fuzzy text matching; image comparisons got perceptual hashing and color tolerance. Correct answers had been scored as failures.
Tasks that became infeasible after the live web changed. CAPTCHAs, IP/geo blocks, removed features (speedtest.net dropped CSV export), changed URL parameters — tasks that no agent could pass because the world moved.
Ambiguous instructions rewritten to admit a checkable answer.
Infrastructure moved to AWS, cutting a full eval from 10+ hours to about 1 hour via ~50x parallelism.
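To make the over-strict grader problem concrete, here is a hedged sketch contrasting an exact string match with a fuzzy one. It is not the OSWorld-Verified evaluator code: the function names, the whitespace and case normalization, and the 0.9 threshold are all assumptions chosen for illustration, built on the standard library's difflib.

```python
from difflib import SequenceMatcher


def strict_match(expected: str, actual: str) -> bool:
    # Exact comparison: any stray space or case difference is a failure.
    return expected == actual


def fuzzy_match(expected: str, actual: str, threshold: float = 0.9) -> bool:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    ratio = SequenceMatcher(None, normalize(expected), normalize(actual)).ratio()
    return ratio >= threshold  # 0.9 is an illustrative threshold, not a published one


expected = "Quarterly Report 2024"
correct_but_messy = "quarterly  report 2024\n"  # right answer, cosmetic differences
wrong = "Annual Summary 2023"

print(strict_match(expected, correct_but_messy))  # False
print(fuzzy_match(expected, correct_but_messy))   # True
print(fuzzy_match(expected, wrong))               # False
```

The trade-off runs both ways: a threshold that is too loose starts passing wrong answers, which is one way a task set ends up with graders that err in either direction.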

The consequence: scores from before and after 28 July 2025 are not comparable. They were measured on different task definitions and different graders. When a roundup plots Claude 3.5 Sonnet’s 14.9% (October 2024, original OSWorld) on the same smooth curve as Sonnet 4.5’s 61.4% (September 2025, Verified), it is splicing two different rulers. The trajectory is directionally real — the models genuinely got much better — but the specific deltas across that July 2025 boundary are apples to oranges.

Reading the current scoreboard honestly

The self-reported model trajectory, mostly screenshot-only, looks like this:

Model	OSWorld	Date
Claude 3.5 Sonnet	14.9%	Oct 2024
Claude 3.7 Sonnet	28%	Feb 2025
OpenAI Operator/CUA	38.1%	Jan 2025
Claude Sonnet 4	42.2%	Jun 2025
Claude Sonnet 4.5	61.4%	Sep 2025
Claude Opus 4.5	66.3%	Nov 2025
Sonnet 4.6 / Opus 4.6	72.5% / 72.7%	Feb 2026

(Anthropic news posts and system cards; aggregated on llm-stats.com.) Three things must travel with this table or it misleads.

First: every entry is self-reported. “Verified” in OSWorld-Verified refers to the cleaned-up task set, not to an independent re-run of anyone’s result. As of mid-2026 the public leaderboard shows zero independently verified entries. Each vendor runs its own harness, its own prompt, its own step budget, its own sampling strategy, and reports its own best configuration. The number is a vendor’s best run under conditions it chose — not a referee’s measurement under fixed conditions. Cross-vendor comparison is comparing scaffolds as much as models.

Second: the harness is load-bearing, and it’s under-reported. Benjamin Anderson’s teardown (“Computer-Use Evals are a Mess,” 2025) documents Qwen-2.5-VL 3B going from 20% to 50% accuracy on identical click data purely by switching from the official multi-tool prompt to a simplified click-only XML prompt. Same model, same data, same grader; the scaffold alone moved the number 2.5x. Providers frequently don’t ship the optimal harness for their own models, which means published numbers can understate or overstate what you’ll get with whatever harness you actually deploy.

Third: ignore the vendor-marketing “winners.” If you search this topic, the top of Google is dominated by near-duplicate posts from one vendor (coasty.ai) crowning its own product at “82%” and calling competitors “embarrassing.” That product is not on the real OSWorld-Verified leaderboard. The “82% beats 38%” framing pits a self-claimed number against OpenAI’s January 2025 launch figure — an 18-month-old baseline. It’s marketing, not measurement. Anchor on the XLANG leaderboard and vendor system cards, and treat any number not accompanied by a hosted trajectory as a press release.

What still rots under the number

Even after the Verified cleanup, Epoch AI’s independent audit (“What does OSWorld tell us about AI’s ability to use computers?”, 2025) found the task set is not as clean as the leaderboard implies:

~10% of tasks still have serious errors — broken eval functions or wrong gold answers. Correct agents fail; sometimes wrong agents pass.
~10% depend on live internet data, so their difficulty drifts over time and is non-reproducible run to run.
A further ~10% of instructions were changed again after July 2025, breaking through-time comparison even within the Verified era.

And the structural one that matters most for interpreting the score: OSWorld is heavily GUI-optional. Epoch found roughly 15% of tasks are solvable with terminal commands alone, and about 30% more let the agent substitute a script for the intended GUI actions. An agent allowed to drop into bash can edit a LibreOffice document by manipulating the file directly instead of driving the UI. That’s a legitimate way to complete the task — but it means a code-execution-enabled agent scores meaningfully higher than a true GUI-only agent, and the headline number conflates the two.

Finally, the tasks are short. Epoch’s analysis of the distribution: median ~6 atomic actions, most under 10 steps, only 12% need more than 20 steps, 5% need more than 50. So a high OSWorld score certifies short, single-app-or-few-app tasks in open-source Linux apps with a code escape hatch available. It does not certify long-horizon work, Windows/Microsoft-Office workflows, or coordinated multi-app enterprise processes. That’s not a quibble. It’s the gap.

The capability cliff — the real production fact

Here is the single most important number that almost no roundup mentions. When you take frontier models off OSWorld’s short tasks and onto realistic multi-step enterprise workflows, performance doesn’t decline gracefully. It falls off a cliff.

UI-CUBE (Cristescu et al., UiPath, arXiv:2511.17131, November 2025) ran 226 enterprise tasks. Frontier models scored 67–85% on simple atomic UI interactions — click this, fill that, single coherent action — and then dropped to 9–19% on complex multi-step enterprise workflows. A discontinuous cliff, not a slope. The brutal comparison: humans with no prior experience of the apps still scored 61.2% on those same complex tasks. The authors attribute the cliff to architectural limits — memory, hierarchical planning, state coordination across steps — and argue explicitly that it is not fixable by better prompting.

The corroboration is consistent across independent benchmarks:

WorkArena++ (ServiceNow, 682 tasks): GPT-4o at 42.7% on atomic tasks → 3% on compositional workflows → 0% on complex reasoning tasks, while humans hold 93.9%.
OfficeBench: GPT-4o at 64.52% single-app → 21.43% on three-app tasks. Adding apps to coordinate collapses it.
Windows Agent Arena: 19.5% agent vs 74.5% human.
AndroidWorld: ~30.6%.

So a model reading 72% on OSWorld can sit at 9–19% on the multi-step, multi-app work people actually want to automate. The benchmark number does not generalize across the cliff. If your workflow is “open this report, reconcile it against the CRM, update three records, and email a summary,” the OSWorld score tells you almost nothing about whether the agent can do it. The honest read is that OSWorld measures the easy regime and the thing you want lives in the hard one.

Where they fail, concretely

When agents fail, it’s rarely because they didn’t understand the goal. GUI grounding is the principal failure mode — the agent identifies the right element conceptually and then clicks the wrong coordinates. OSWorld-Human (arXiv:2506.16042, June 2025) documents the texture of it:

Step-repetition loops (~15.7% of failures): the agent retries the same action with cosmetic parameter changes, never adapting. It misclicks, sees nothing happen, and misclicks again.
Reasoning-action mismatch (~13.2%): the chain-of-thought describes the correct action; the emitted tool call does something else.
Hallucination cascades: one garbage tool result is treated as ground truth and the agent builds a whole plan on top of it.
Efficiency gap: even successful agents take 1.4–2.7x more steps than humans for the same task, and 75–94% of total latency is spent in planning and reflection, not action.
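A step-repetition loop of the kind described above is cheap to detect from outside the model. The sketch below is one possible harness-side guard, not anything taken from OSWorld-Human: the Step record, the three-step window, and the 5-pixel tolerance are assumptions made up for this example.

```python
from typing import NamedTuple, Sequence


class Step(NamedTuple):
    # Hypothetical trajectory record for a single click.
    x: int
    y: int
    screen_changed: bool  # did the screenshot differ after the click?


def is_repetition_loop(steps: Sequence[Step], window: int = 3, tolerance_px: int = 5) -> bool:
    """True if the last `window` clicks changed nothing on screen and each landed
    within `tolerance_px` (per axis) of the first click in that window."""
    if len(steps) < window:
        return False
    recent = steps[-window:]
    anchor = recent[0]
    return all(
        not s.screen_changed
        and abs(s.x - anchor.x) <= tolerance_px
        and abs(s.y - anchor.y) <= tolerance_px
        for s in recent
    )


stuck = [Step(100, 200, False), Step(102, 201, False), Step(101, 199, False)]
progressing = [Step(100, 200, False), Step(102, 201, False), Step(101, 199, True)]

print(is_repetition_loop(stuck))        # True
print(is_repetition_loop(progressing))  # False
```

A harness could use a signal like this to abort early or force a re-plan, which limits wasted steps even though it does nothing for the underlying grounding error.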

That last point is why a binary “success” hides the cost that actually decides production viability. A run that succeeds in 2.7x the steps, burning most of its wall-clock in deliberation tokens, is not viable at scale even though it scores a clean 1 on the grader. OSWorld’s pass/fail tells you nothing about step count, latency, token cost, or run-to-run variance — all of which are the difference between a demo and a deployment.

Accuracy is not the only thing that can disqualify an agent. There’s a reliability failure that is fully orthogonal to success rate, and it’s a hard production blocker.

“Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness” (Shayegani et al., Microsoft Research + NVIDIA, arXiv:2510.01670, October 2025) built BLIND-ACT, a 90-task benchmark on top of OSWorld, and ran nine frontier models including Claude Sonnet/Opus 4, OpenAI Computer-Use-Preview, GPT-5, o4-mini, and GPT-4.1. The average rate of pursuing flawed, infeasible, or harmful goals without verifying preconditions, feasibility, or safety was 80.8%. The agents just did it. Three patterns: missing contextual reasoning, acting under ambiguity, and executing contradictory or infeasible goals — trying to create a 20,000 GB swap partition, or disabling all firewall rules in response to an instruction to “enhance security.” Contextual and reflective prompting reduced the rate but did not eliminate it (down to ~61–65%).

There’s a vicious twist in the safety data. The computer-use-trained models were the most cautious — Claude Sonnet 4 and Opus 4 had the lowest blind-goal-directedness (65.5% / 63.3%) and the lowest harmful-completion rates. The small models (Qwen2.5-7B, Llama-3.2-11B) looked safer only because they were too incapable to carry the harmful action through. Low harmful completion was a capability limit, not alignment. This is the safety-capability parity problem: as agents get more capable, they get better at executing whatever you point them at — including the things you didn’t mean. A high accuracy number, divorced from this axis, can be a liability rather than a reassurance.
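If the model cannot be relied on to verify feasibility or safety itself, the check has to live in the harness. What follows is a minimal sketch of an approval gate, assuming a hypothetical Action record and a hand-picked set of irreversible action kinds; a real deployment would need its own taxonomy and a real approval channel in place of the callback.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed set of action kinds that cannot be undone; illustrative only.
IRREVERSIBLE = {"delete", "send", "pay", "reconfigure"}


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "read", "delete", "reconfigure"
    target: str  # what the action applies to


def execute_with_gate(
    action: Action,
    execute: Callable[[Action], None],
    approve: Callable[[Action], bool],
) -> str:
    """Run `execute` only if the action is reversible or a human approved it."""
    if action.kind in IRREVERSIBLE and not approve(action):
        return "blocked"
    execute(action)
    return "executed"


executed_log: List[Action] = []


def deny_all(action: Action) -> bool:
    return False  # stand-in for a human reviewer who rejects the request


read_result = execute_with_gate(Action("read", "report.xlsx"), executed_log.append, deny_all)
firewall_result = execute_with_gate(
    Action("reconfigure", "firewall rules"), executed_log.append, deny_all
)
print(read_result, firewall_result)
```

The design choice is that the gate keys on what the action does, not on what the agent says it intends, so a confident but mistaken plan still stops at the irreversible step.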

Translating a score into a deployment decision

So what does an OSWorld number actually predict for your workflow? It predicts the agent’s competence on short, mostly single-app tasks in common desktop software, under the vendor’s chosen harness, with a meaningful chance the work was done via a script rather than the GUI. That’s the honest envelope. Everything else you have to measure yourself.

Before trusting an agent on real work:

Measure pass^k, not best-of-n. Run the same task many times and report the rate at which it succeeds every time. A 70%-per-run agent succeeds on all 5 of 5 runs only ~17% of the time. The leaderboard reports a vendor’s good run; production cares about the consistency floor.
Run your harness, not theirs. Given a 2.5x swing from prompt scaffolding alone, the only number that matters is the one from the exact prompt, step budget, and tool set you’ll ship.
Scope tasks short and decompose long ones. The cliff is real and architectural. Don’t hand an agent a 30-step multi-app process and expect the OSWorld score to carry over. Break it into checkpointed sub-tasks a human can inspect between.
Gate every irreversible action behind confirmation. Given 80.8% blind goal-directedness, never let an agent delete, send, pay, or reconfigure without a human approval step. Treat feasibility and safety verification as something you enforce, not something the model reliably does.
Verify app and OS coverage. OSWorld’s scored tasks use open-source apps on Linux. If your work is Windows and Office, look at Windows Agent Arena (19.5% agent vs 74.5% human), not OSWorld.
Budget step-count, latency, and cost as first-class metrics. A 1.4–2.7x step multiplier with most latency in planning has a real dollar and wall-clock cost the binary score hides.
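The pass^k arithmetic in the first item is easy to reproduce. The sketch below assumes runs are independent with a fixed per-run success rate for the closed-form version. For measured data it scores each task by the fraction of k-run subsets in which every run succeeded, then averages across tasks. The function names and the sample results are made up for illustration.

```python
from math import comb
from typing import Mapping, Sequence


def pass_hat_k(p: float, k: int) -> float:
    # Probability that all k independent runs succeed at per-run rate p.
    return p ** k


def at_least_one_of_n(p: float, n: int) -> float:
    # Best-of-n view: probability that at least one of n independent runs succeeds.
    return 1 - (1 - p) ** n


def empirical_pass_hat_k(runs: Mapping[str, Sequence[bool]], k: int) -> float:
    """Average over tasks of C(successes, k) / C(total runs, k)."""
    per_task = []
    for outcomes in runs.values():
        n, c = len(outcomes), sum(outcomes)
        if n < k:
            raise ValueError("each task needs at least k recorded runs")
        per_task.append(comb(c, k) / comb(n, k))
    return sum(per_task) / len(per_task)


print(f"{pass_hat_k(0.7, 5):.3f}")         # 0.168
print(f"{at_least_one_of_n(0.7, 5):.3f}")  # 0.998

# Made-up results for two tasks, five runs each.
sample_runs = {
    "update_record": [True, True, True, True, True],
    "send_summary": [True, True, True, False, True],
}
print(empirical_pass_hat_k(sample_runs, k=5))  # 0.5
```

The same 70% agent looks like 99.8% under best-of-5 and 16.8% under pass^5, which is why it matters which of the two a reported number reflects.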

Bottom line

OSWorld is the best execution-based proxy we have for computer use, and it is genuinely useful — it runs agents against a live OS and checks whether the world actually changed. Use it. But read it for what it is: a self-reported, short-task, GUI-optional, single-run, accuracy-only signal, measured on a task set that’s still ~10% broken, under whatever harness made the vendor’s number look best. It’s a research thermometer, not a production SLA.

A model crossing the 72.36% human baseline is a real milestone on a real benchmark. It is not a statement that the agent can do long-horizon, multi-app, Windows-native, safety-critical work: the benchmarks built for that regime put agents at 9–19% on complex enterprise workflows (UI-CUBE) and 19.5% on Windows Agent Arena. Production readiness lives in pass^k consistency, the capability cliff, blind goal-directedness, and cost, every one of which OSWorld leaves out. Trust the harness you’ll actually run, on the tasks you’ll actually give it, with a human on the irreversible steps. The leaderboard number is where the conversation starts, not where it ends.