Evals as a Shipping Contract: How to Actually Gate LLM Changes

A team swaps a prompt on Tuesday to fix a tone complaint. The unit tests pass — they always do, because the prompt is a string and the string still parses. Someone spot-checks three outputs, they look fine, it ships. Thursday, support is fielding screenshots of the assistant confidently recommending products that don’t exist. Nobody can say when it started, because nothing measured the thing that broke.

That gap is the whole problem. On one side you have deterministic unit tests that assert nothing about model behavior. On the other you have vibes — a human eyeballing a handful of outputs. Between them is where every meaningful LLM regression lives, and most teams have nothing there. The fix is an eval suite, but not the kind most blog posts describe. The bar is higher: an eval suite earns the right to block a deploy only when it’s derived from failures you actually observed, statistically honest about non-determinism and noise, and owned by an org that will genuinely stop a push when the gate goes red. Anything less is theater that turns green on demand.

This is the discipline, concretely.

What “evals as a shipping contract” actually means

The unit-test mental model breaks immediately, because LLM behavior is a distribution, not a function. The right question is not “did this one run pass” but “what share of representative runs clears the bar.” Acceptance becomes a threshold on a distribution. Rick Pollick frames this cleanly in Evals Are the New Acceptance Criteria (2026): your old Gherkin acceptance criteria — Given X, when Y, then Z — don’t disappear, they become eval cases. An eval, in the contract sense, is exactly three things: a labeled golden dataset, a scoring method, and a pass-rate threshold a build must hit to ship.
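A minimal sketch of that three-part contract, to make the shape concrete. The names `EvalCase`, `EvalContract`, `pass_rate`, and `gate` are hypothetical, the prefix-match grader is a toy, and `generate` stands in for whatever calls your model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # label from the golden dataset


@dataclass(frozen=True)
class EvalContract:
    cases: list[EvalCase]                   # labeled golden dataset
    score: Callable[[EvalCase, str], bool]  # binary scoring method
    threshold: float                        # pass rate a build must hit to ship


def pass_rate(contract: EvalContract, generate: Callable[[str], str]) -> float:
    """Share of golden cases whose generated output the grader accepts."""
    passed = sum(contract.score(c, generate(c.prompt)) for c in contract.cases)
    return passed / len(contract.cases)


def gate(contract: EvalContract, generate: Callable[[str], str]) -> bool:
    """True when the build clears the agreed threshold."""
    return pass_rate(contract, generate) >= contract.threshold


# Toy usage: a stub "model" that always answers yes.
contract = EvalContract(
    cases=[
        EvalCase("Do you sell the X-200 router?", "no"),
        EvalCase("Do you sell the A-100 switch?", "yes"),
    ],
    score=lambda case, out: out.strip().lower().startswith(case.expected),
    threshold=0.9,
)
always_yes = lambda prompt: "Yes, it is in stock."
print(pass_rate(contract, always_yes), gate(contract, always_yes))
```

The threshold is a field on the contract rather than a constant in the runner, so the number product and risk agreed on is visible in one place.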

The governance split matters as much as the mechanics. Product and risk own the threshold — what failure rate is acceptable is a business decision, not an engineering one. Engineering wires the threshold into CI and treats a breach like a failing test: the build is red, the push stops. If engineering quietly owns the number, it will get lowered the first time it’s inconvenient, and the contract is void.

The eval-driven development heresy, resolved

Search “LLM eval driven development” and most of the top results — Braintrust, evaldriven.org, assorted vendor blogs — tell you to write evals first, before the code, like test-driven development. Then you’ll read Hamel Husain and Shreya Shankar, who run the most respected applied-evals course in the field, saying bluntly: “Don’t use eval-driven development.” Both can’t be right. The definitive resolution is that they’re describing different moments and the spec-first camp is wrong about the most important one.

You cannot author a good eval case for a failure you have not seen. Imagined failures produce evals that test your assumptions, not your system’s actual breakage. Hamel and Shreya’s loop is error-analysis-driven: a domain-expert “benevolent dictator” does open coding — free-form notes on ~30–50 real production traces — then axial coding to cluster those notes into a failure taxonomy (an LLM can assist the clustering, not the reading). You iterate to theoretical saturation: keep reading traces until roughly 20 consecutive ones reveal no new failure mode, usually around 100 traces. Their budget guidance is the part teams skip: spend 60–80% of dev time looking at data, not building eval infrastructure.
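The saturation rule is easy to make mechanical. A sketch, assuming each reviewed trace has already been open-coded into a list of failure-mode labels; the function name and the label strings are hypothetical.

```python
def traces_until_saturation(trace_labels, window=20):
    """Return how many traces were read when `window` consecutive traces
    revealed no new failure mode, or None if saturation was never reached."""
    seen = set()
    quiet = 0  # consecutive traces with nothing new
    for count, labels in enumerate(trace_labels, start=1):
        new = set(labels) - seen
        seen |= new
        quiet = 0 if new else quiet + 1
        if quiet >= window:
            return count
    return None


# Toy usage: two traces surface new failure modes, then 20 reveal nothing new.
reviewed = [["invented_product"], ["wrong_tone"]] + [[]] * 20
print(traces_until_saturation(reviewed))
```

A trace that shows only already-known failure modes counts as quiet here, which matches the rule as stated: the stopping signal is no new mode, not no failure.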

So evals codify failures you discovered, not failures you invented. “Eval-driven development” survives only in the weak sense that once you’ve found a failure mode, you write the eval before you write the fix, so you can prove the fix worked. The strong, spec-first sense — write a comprehensive eval suite from a PRD before observing production — produces confident, useless suites.

Deriving cases from production failures

The source material is your bug tracker, your support queue, and your production traces. Anthropic’s Demystifying Evals for AI Agents (2025) gives the most actionable starting point in print: “20–50 simple tasks drawn from real failures is a great start.” Each incident becomes a permanent regression case — the post-mortem produces test infrastructure, not just a Slack thread. This is the single highest-leverage habit: a fixed bug that doesn’t become an eval case will silently regress.

Anthropic’s test for whether a case is well-formed: a good eval case is one where “two domain experts would independently reach the same pass/fail verdict.” If two experts disagree on whether an output passed, the case is underspecified and the grader — human or LLM — will be noise. Tighten the input, the expected behavior, or both until the verdict is unambiguous.

Three sets, three different gates

Conflating set types is a common and expensive mistake. There are three, and each gets a different threshold and gating policy.

Regression set. Grown from incidents, protects known-good behavior. Target near-100% pass. A drop here means you broke something that used to work — a hard block, no discussion.

Adversarial / red-team set. Jailbreaks, prompt injection, fabricated citations, the NIST-aligned guardrail categories. BlackRock, per The Nuanced Perspective (2026), runs eight NIST-aligned guardrail categories after a near-miss where the system fabricated a regulatory citation. This set also targets high pass rates but tests a different axis: does the system stay safe under attack, not just correct under normal load. Hard block.

Capability set. Measures progress on hard tasks you’re still trying to get right. This one starts low — that’s the point — and is not a deploy blocker. It tracks whether you’re improving, not whether you’ve regressed. Gating on it would block every release for not yet being good enough at the frontier.

Anthropic’s framing: regression sets protect against backsliding (near-100% target); capability evals start low and measure progress. Mixing them into one number averages a “don’t ship if broken” signal with a “we’re not done yet” signal and you lose both.
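One way to keep the three signals from being averaged together is to give each set its own policy object. This is a sketch: the numeric floors are placeholders for illustration, not recommendations from any of the sources above, and the names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SetPolicy:
    min_pass_rate: Optional[float]  # None means tracked, never enforced
    blocks_deploy: bool


# Placeholder thresholds; product and risk own the real numbers.
POLICIES = {
    "regression": SetPolicy(min_pass_rate=0.98, blocks_deploy=True),
    "adversarial": SetPolicy(min_pass_rate=0.95, blocks_deploy=True),
    "capability": SetPolicy(min_pass_rate=None, blocks_deploy=False),
}


def deploy_decision(pass_rates):
    """Return (ok_to_ship, breaches); capability scores never block."""
    breaches = []
    for name, policy in POLICIES.items():
        rate = pass_rates[name]
        if policy.blocks_deploy and rate < policy.min_pass_rate:
            breaches.append(f"{name}: {rate:.1%} < {policy.min_pass_rate:.1%}")
    return (not breaches, breaches)


ok, breaches = deploy_decision(
    {"regression": 0.99, "adversarial": 0.97, "capability": 0.30}
)
print(ok, breaches)
```

A 30% capability score ships in this example because it is reported, not enforced; the same 30% on the regression set would block.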

Graders that don’t lie

Prefer binary pass/fail over a 1–5 Likert scale. Hamel and Shreya’s reasoning is concrete: adjacent Likert points are subjective (one annotator’s 3 is another’s 4), annotators cluster on the middle, and Likert needs larger samples to detect a real difference. Binary is harsher and more honest.

Use deterministic assertions first — exact match, schema validation, regex, “does the cited ID exist in the catalog.” These are cheap, fast, and don’t lie. Reach for LLM-as-judge only for failures that survive deterministic checking, like tone or faithfulness. And when you do, validate the judge: measure its True Positive Rate and True Negative Rate against a held-out, human-labeled set. An unvalidated judge produces a green gate that means nothing.
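The deterministic-first layer might look like the sketch below. The catalog contents, the `SKU-1234` ID format, and the `answer` field are all invented for illustration; substitute your own schema and lookup.

```python
import json
import re

CATALOG = {"SKU-1001", "SKU-2040"}         # hypothetical product IDs
SKU_PATTERN = re.compile(r"\bSKU-\d{4}\b")  # hypothetical ID format


def check_output(raw: str) -> list[str]:
    """Return failure reasons; an empty list means the output passes."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    # Schema check: a JSON object with a string field named "answer".
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return ["missing string field 'answer'"]
    # Existence check: every cited product ID must be in the catalog.
    cited = set(SKU_PATTERN.findall(payload["answer"]))
    return [f"unknown product id: {sku}" for sku in sorted(cited - CATALOG)]


print(check_output('{"answer": "Try SKU-1001."}'))
print(check_output('{"answer": "Try SKU-9999."}'))
```

Checks like these cost nothing per run and have no false-positive rate to calibrate, which is why they go in front of any judge.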

Two more judge rules. Give the judge an explicit “Unknown” option for underspecified cases so it doesn’t guess — Anthropic flags this directly. And bias-correct production failure estimates using the judge’s measured TPR/TNR: if your judge has a known false-positive rate, the raw failure count it reports is biased, and TPR/TNR let you back out a corrected estimate.
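Both steps, measuring the judge and correcting with it, fit in a few lines. This sketch assumes "positive" means pass, so TPR is the share of human-passed items the judge also passes and TNR is the share of human-failed items the judge also fails. The correction is the standard misclassification adjustment (often called the Rogan-Gladen estimator), swapped in here as one reasonable way to "back out" the estimate; the sources cited above do not prescribe this exact formula.

```python
def judge_rates(human, judge):
    """TPR and TNR of a judge against human labels (True = pass)."""
    pairs = list(zip(human, judge))
    positives = sum(h for h, _ in pairs)
    negatives = len(pairs) - positives
    tp = sum(h and j for h, j in pairs)
    tn = sum((not h) and (not j) for h, j in pairs)
    return tp / positives, tn / negatives


def corrected_pass_rate(observed, tpr, tnr):
    """Adjust a judge-reported pass rate for the judge's known error rates.

    Solves observed = theta * tpr + (1 - theta) * (1 - tnr) for theta.
    """
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; cannot correct")
    estimate = (observed + tnr - 1) / denom
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion


# Toy usage: validate on a held-out labeled set, then correct a production number.
tpr, tnr = judge_rates(
    human=[True, True, True, True, False, False],
    judge=[True, True, True, False, False, True],
)
print(tpr, tnr)
print(corrected_pass_rate(observed=0.85, tpr=0.9, tnr=0.8))
```

With a judge at TPR 0.9 and TNR 0.8, a reported 85% pass rate corresponds to a true pass rate of roughly 93%, so the raw failure count would have overstated failures by about a factor of two.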

Avoid generic metrics — helpfulness, coherence, ROUGE, BERTScore. Hamel’s line: “good scores on them don’t mean your system works.” They measure surface similarity to a reference, not whether your specific task succeeded.

Choosing thresholds without fooling yourself

First, a counterintuitive health check from Hamel: “If you’re passing 100% of your evals, you’re likely not challenging your system enough. A 70% pass rate might indicate meaningful evaluation.” A suite that’s always green is usually too easy, not your system being perfect.

Second, non-determinism. A gate that runs each case once is theater. The two numbers to know:

pass@k — probability of at least one success in k tries.
pass^k — probability all k tries succeed.

Anthropic’s worked example: a 75% per-trial success rate clears all 3 of 3 trials only (0.75)³ ≈ 42% of the time. A single run badly overstates reliability. So run each case k times. In promptfoo, repeat: 3 with repeat-min-pass: 2 requires 2-of-3 passes per case.
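The arithmetic is worth having on hand. A small sketch, assuming independent trials with a fixed per-trial success rate `p` (real runs are only approximately independent); the function names are mine.

```python
from math import comb


def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent tries."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent tries succeed (pass^k)."""
    return p ** k


def at_least_m_of_k(p: float, m: int, k: int) -> float:
    """Probability of m or more successes in k tries (binomial tail)."""
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(m, k + 1))


p = 0.75
print(pass_at_k(p, 3))           # 0.984375
print(pass_hat_k(p, 3))          # 0.421875
print(at_least_m_of_k(p, 2, 3))  # 0.84375
```

The same 75% system looks 98% reliable under pass@3, 42% under pass^3, and 84% under a 2-of-3 rule, so the gate has to state which one it uses.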

Third — and this is what almost every page hand-waves — is a regression real or noise? LLM eval scores carry stacked noise: sampling noise (which cases you chose), generation noise (decoding randomness), parsing noise (extracting the answer), ordering effects, and estimator noise. Sida Wang’s Measuring All the Noises of LLM Evals (arXiv:2512.21326, Dec 2025) catalogs these and draws the practical consequence: set a minimum detectable effect before you compare. A 2% drop may sit entirely inside the noise band.

A real decision rule, not “set a threshold”:

For binary metrics on the same fixed eval set across two versions, use McNemar’s exact test on discordant items — the cases where exactly one of A/B is correct. The cases both get right or both get wrong carry no signal about the difference.
For continuous scores, use a paired bootstrap: resample items ~10,000 times, report the 95% percentile CI of the score difference, and call the difference significant only if the CI excludes 0.
Holm-correct across multiple metric comparisons: testing ten independent metrics at p<0.05 gives roughly a 40% chance of at least one false alarm even when nothing has changed.
Gate on the CI lower bound, not the point estimate. Investigate when the lower bound crosses the threshold.
For small n, don’t trust the Central Limit Theorem under a few hundred datapoints — use Wilson-score intervals for binomial proportions, Agresti-Caffo for differences. (Frame from Raschka, arXiv:1811.12808.)
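The rules above can be sketched with the standard library alone. These are textbook formulations written out for illustration, not code from any of the cited papers; the function names and the fixed seed are my own choices, and a vetted statistics library is the safer option in production.

```python
import math
import random


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar p-value from paired pass/fail results."""
    only_a = sum(x and not y for x, y in zip(a_correct, b_correct))
    only_b = sum(y and not x for x, y in zip(a_correct, b_correct))
    n = only_a + only_b  # discordant items; concordant ones carry no signal
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2**n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(a_scores, b_scores, n_resamples=10_000, seed=0):
    """95% percentile CI for mean(b - a), resampling items with replacement."""
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(a_scores, b_scores)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples) - 1]


def holm(p_values, alpha=0.05):
    """Holm step-down: return which hypotheses are rejected, in input order."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = [False] * len(p_values)
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (len(p_values) - rank):
            break  # every larger p-value is retained as well
        rejected[i] = True
    return rejected


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy usage: incumbent alone is right on 8 items, candidate alone on 1.
incumbent = [True] * 8 + [False] + [True] * 91
candidate = [False] * 8 + [True] + [True] * 91
print(mcnemar_exact(incumbent, candidate))  # 0.0390625
print(wilson_interval(90, 100))
print(holm([0.01, 0.04, 0.03]))
```

In the toy data the candidate scores 92/100 against 99/100, and the McNemar test sees only the nine discordant items; the 91 items both versions get right contribute nothing to the p-value.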

The hidden-regression trap

Single-metric optimization is unsafe. Daniel Commey’s When Better Prompts Hurt (arXiv:2601.22025, Jan 2026) documents that a prompt change lifting one capability can quietly degrade another: adding “think step by step” helps some tasks and hurts others. A change that improves tone +5% but drops accuracy −2% looks like a win on an averaged headline metric. So track a vector of dimensions and gate on the worst-case per-dimension regression, not the average. Averaging is exactly how a regression hides.

Braintrust’s regression-threshold model handles the legitimate-tradeoff case: define acceptable score deltas per dimension so a tone-up/accuracy-down change can pass if the accuracy drop is within an explicitly agreed bound — a decision, not an accident.

And the cross-flow rule from leading teams: any new failure cluster appearing on the candidate but not on production auto-fails. This catches the silent case where fixing one flow breaks another — a failure mode your existing cases, written for known problems, won’t see.
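Put together, the per-dimension tolerance and the new-cluster rule make a small gate function. A sketch under assumptions: scores are per-dimension pass rates in [0, 1], failure clusters are already labeled strings, and `gate_candidate` and its parameters are hypothetical names rather than Braintrust's API.

```python
def gate_candidate(incumbent, candidate, max_drop, prod_clusters, cand_clusters):
    """Return blocking reasons; an empty list means the gate is green.

    incumbent / candidate: dimension name -> score
    max_drop: dimension name -> explicitly agreed tolerated drop (default 0)
    """
    reasons = []
    # Judge every dimension on its own delta; never average across them.
    for dim, base in incumbent.items():
        delta = candidate[dim] - base
        if delta < -max_drop.get(dim, 0.0):
            reasons.append(f"{dim} regressed by {-delta:.3f}")
    # Any failure cluster seen on the candidate but not in production fails.
    for cluster in sorted(set(cand_clusters) - set(prod_clusters)):
        reasons.append(f"new failure cluster: {cluster}")
    return reasons


reasons = gate_candidate(
    incumbent={"tone": 0.80, "accuracy": 0.92},
    candidate={"tone": 0.85, "accuracy": 0.90},
    max_drop={"accuracy": 0.01},
    prod_clusters={"wrong_tone"},
    cand_clusters={"wrong_tone", "invented_product"},
)
print(reasons)
```

The mean score in this example goes up, yet the gate is red twice: once for an accuracy drop beyond the agreed bound and once for a failure cluster production never had.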

Wiring it into CI

Keep the CI dataset small (on the order of 100 examples) and lean on deterministic assertions, because CI runs on every PR and LLM-judge calls are slow and expensive. Run LLM-judges asynchronously on sampled production traces, not inline in the PR gate.

promptfoo (open-source, YAML) ships a GitHub Action that does before/after comparison automatically; Braintrust’s Action posts results to the PR and blocks below threshold. A minimal sketch:

# promptfooconfig.yaml
prompts: [file://prompts/support.txt]
providers: [anthropic:messages:claude-sonnet-4-5-20250929]
defaultTest:
  options:
    repeat: 3
    repeat-min-pass: 2     # 2-of-3 to handle non-determinism
tests:
  - vars: {query: "Do you sell the X-200 router?"}
    assert:
      - type: contains-any
        value: ["no", "don't carry", "not available"]
      - type: llm-rubric          # judge, validated against human labels
        value: "Does not invent a product that isn't in the catalog"
    threshold: 0.5     # test passes when the combined assertion score is >= 0.5

promptfoo eval -c promptfooconfig.yaml -o out.json
rate=$(jq '.results.stats.successes /
  (.results.stats.successes + .results.stats.failures)' out.json)
awk -v r="$rate" 'BEGIN{exit !(r >= 0.90)}' || {
  echo "Eval pass rate $rate below 0.90 — blocking"; exit 1; }

That’s the contract in code: dataset, graders, threshold, non-zero exit below the bar. The same shape works in Braintrust or a hand-rolled runner.

The model-upgrade gate — the highest-leverage use

This is where eval gates pay for themselves, because providers ship breaking changes silently. The 2025–2026 record, with dates:

GPT-5.2 silently updated on Feb 10, 2026, breaking JSON, classification, and structured-output prompts with no warning (genesisclawbot drift report).
GPT-5.1 retired Mar 11, 2026 with silent auto-fallback to GPT-5.3/5.4 — apps still calling gpt-5.1 now run a different model (DEV Community).
Anthropic’s Aug 2025 Claude Code incident: degraded outputs, ignored instructions.
A PLOS One ten-week longitudinal study (Feb 2026) confirmed “meaningful behavioral drift across deployed transformer services.”

The implication is non-negotiable: pin model versions (use the dated snapshot, never a floating alias), and treat every model or provider swap as a config change that triggers the full regression + adversarial suite as a hard gate. Braintrust’s guidance — “treat model/provider swaps as config changes requiring full eval before migration” — is exactly right. Diff candidate vs incumbent on the golden set before you migrate. The teams that got burned by GPT-5.2 were the ones with no gate standing between a provider’s silent push and their users.
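Pinning can be enforced with a lint rather than a convention. This sketch assumes snapshot IDs end in a date stamp, either `YYYYMMDD` or `YYYY-MM-DD`, and that anything else is a floating alias; that heuristic and the config keys `provider` and `model` are assumptions to adapt to your providers.

```python
import re

# Assumption: a pinned snapshot ends in a date stamp; aliases do not.
DATED_SNAPSHOT = re.compile(r"-(\d{8}|\d{4}-\d{2}-\d{2})$")


def is_pinned(model_id: str) -> bool:
    """True when the model ID looks like a dated snapshot."""
    return bool(DATED_SNAPSHOT.search(model_id))


def requires_full_suite(old_cfg: dict, new_cfg: dict) -> bool:
    """Any provider or model change triggers regression + adversarial runs."""
    return any(old_cfg.get(k) != new_cfg.get(k) for k in ("provider", "model"))


for model_id in ["gpt-4o-2024-08-06", "gpt-4o",
                 "claude-3-5-sonnet-20241022", "claude-3-5-sonnet-latest"]:
    print(model_id, is_pinned(model_id))
```

Run as a static check on the config, `is_pinned` fails the build before an alias ever reaches production, and `requires_full_suite` turns "treat swaps as config changes" into a condition CI can evaluate.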

Beyond CI: the layered release gate

CI gating handles known and regression failure modes. It cannot catch what you didn’t enumerate — and you can’t enumerate everything. The academic framing (Evaluation-Driven Development and Operations of LLM Agents, arXiv:2411.13768, Nov 2024) splits offline evaluation (pre-deploy, gating) from online evaluation (runtime, monitoring) as distinct stages for exactly this reason. The full release gate observed in production stacks (AppScale, 2026):

Lint / static checks on prompts.
Offline eval on the golden set (regression + adversarial) — block below threshold.
Cost / latency budget gate — a correct-but-2x-slower-and-pricier candidate can still fail.
Shadow eval — run the candidate on duplicated live traffic, log responses, compare to prod offline. Users see nothing.
Canary — route 1% → 5% → 20% → 50% → 100% with real-time monitoring and auto-rollback over a 24–48h watch window. Critically, the revert is a flag flip or version-tag repoint, not a redeploy — rollback in seconds, not a release cycle.

Sampled LLM-judges on production traffic catch the long tail the offline gate never saw.
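The canary stage reduces to a small decision function evaluated at each checkpoint in the watch window. A sketch only: the ramp comes from the list above, while the `max_excess` tolerance and the function name are hypothetical, and a real controller would compare rates with a statistical test, not a raw difference.

```python
STAGES = [1, 5, 20, 50, 100]  # percent of traffic on the candidate


def next_canary_step(current_pct, canary_fail_rate, baseline_fail_rate,
                     max_excess=0.01):
    """Return the next traffic percentage, or 0 to roll back.

    Rollback here means flipping a flag or repointing a version tag,
    not redeploying.
    """
    if canary_fail_rate > baseline_fail_rate + max_excess:
        return 0
    idx = STAGES.index(current_pct)
    return STAGES[min(idx + 1, len(STAGES) - 1)]


print(next_canary_step(5, canary_fail_rate=0.02, baseline_fail_rate=0.02))
print(next_canary_step(20, canary_fail_rate=0.05, baseline_fail_rate=0.02))
```

The failure rates fed into this function would come from the sampled judges and deterministic checks running on live traffic.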

Why most eval gates quietly die

The acid test, from The Nuanced Perspective: “If your evals flagged a regression, would you actually stop the push?” If the honest answer is no, you don’t have a gate; you have a dashboard.

Three failure modes kill gates:

Stale-dataset rot. “Using a static dataset from 2024 to evaluate a 2026 product becomes a benchmark rather than a regression suite.” The fix is ownership: domain experts, not platform engineers, own the datasets and refresh them. Uber’s conversational designers update eval datasets weekly; Uber also auto-creates evaluators — one alerted when tools were contradicting each other 30% of the time. A dataset nobody refreshes drifts away from production until a green gate means nothing.

The month-two death. The documented anti-pattern: “engineers disable eval gates by month two if they’re too expensive to run on every PR.” This is why CI datasets stay small and deterministic-first, and why heavy LLM-judges run async on samples. A gate too slow to tolerate is a gate that gets commented out.

No pre-agreement to honor red. If the org hasn’t decided in advance that a red gate blocks the push, the gate loses every argument with a deadline. The will-to-block is a policy decision, made calm, before the pressure.

The deepest point underneath all three: build evals around how your system breaks, not around the model. “A better model does not fix structural failures.” If your retrieval is wrong, GPT-6 will be confidently wrong with better grammar.

The discipline, in one page

Derive cases from real failures — bug tracker, support queue, production traces — not imagined specs.
Error-analysis-driven, not spec-first: open-code ~30–50 traces, axial-code to a taxonomy, iterate to saturation (~100 traces). Spend 60–80% of effort looking at data.
Three sets, three policies: regression (~100%, hard block), adversarial (hard block), capability (starts low, not a blocker).
Binary graders, deterministic first; LLM-judge only when needed and validated by TPR/TNR against human labels, with an “Unknown” option.
Repeat each case k times for non-determinism (repeat-min-pass); report pass^k, not a single run.
Statistical gate: McNemar for binary, paired bootstrap for continuous, Holm-correct, gate on the CI lower bound, set a minimum detectable effect first.
Worst-case dimension, never the average; new failure cluster on candidate ⇒ auto-fail.
CI blocks the PR below threshold; small dataset, judges async on samples.
Pin model versions; full suite on every model/provider change as a hard gate.
Shadow + canary + auto-rollback for the long tail you couldn’t enumerate.
Refresh datasets weekly, domain experts own them, and agree in advance to honor red.

A green eval only earns the right to block when it’s fast enough to run, fresh enough to trust, and backed by an org that already decided it will stop the push. That last clause is the contract. Everything else is plumbing.