Spec-Driven Development: How Agentic Coding Actually Changes Your Repo

The fastest way to tell whether a team has moved from prompting to specs is to look at a pull request. A vibe-coded PR is a diff: some files changed, a commit message, maybe a test. An agentic PR built spec-first looks different. It adds .specify/memory/constitution.md, a specs/042-export/ folder containing spec.md, plan.md, and tasks.md, and only then the implementation. The English-language artifacts now outweigh the code, and they are committed, versioned, and reviewable.

That is the actual change. Moving from prompting to specs does not change your prompts. It changes what gets committed, what gets reviewed, and where the loop breaks. This piece is about those three things concretely — the files, the loop, and the failure modes — with the real cost numbers that the vendor guides leave out.

What spec-driven development actually is

Spec-driven development (SDD) makes a structured specification — not the code, not the chat prompt — the source of truth. You write the spec, derive a plan from it, break the plan into atomic tasks, then have the agent generate code. When requirements change, you edit the spec and regenerate, rather than patching the code by hand. The recurring phrase across the 2025–2026 tooling is “the spec is the prompt” (GitHub Spec Kit’s spec-driven.md).

The failure mode it claims to fix is drift in the small: an agent confidently producing plausible code that solves the wrong problem because the intent lived only in a throwaway prompt that nobody can review or reconstruct. The pitch is that a durable, reviewed spec keeps the agent and the humans pointed at the same target.

The framing has credible provenance and an honest caveat baked into its own history. Andrej Karpathy coined “vibe coding” in February 2025, then on the Dwarkesh Patel podcast in October 2025 pivoted the language toward “agentic engineering” — and warned that looping agents drift, misread repo context, and confidently return bad code on long tasks, arguing for “a decade of agents,” not a year. By December 2025 he posted that he had flipped from roughly 80% manual / 20% agent to 80% agent / 20% manual and was “mostly programming in English now,” while adding he’d “never felt more behind.” The person who named the casual workflow is also the one telling you the agents drift. Keep both halves of that.

The files that land in your repo

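This is the part the definitional guides skip, and it is the part that matters, because the files are the methodology. Four tools encode three genuinely different philosophies: heavyweight persisted specs, a lightweight in-conversation plan, and deltas against existing behavior.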

GitHub Spec Kit (open-source MIT Python CLI by John Lam and Den Delimarsky, announced on the GitHub Blog on September 2, 2025; v0.11.2 released June 18, 2026; ~114k stars as of mid-2026) writes a heavyweight tree:

.specify/
  memory/constitution.md         # non-negotiable principles
  templates/{spec,plan,tasks}-template.md
  scripts/bash/*.sh
specs/
  042-export/
    spec.md                      # the what
    plan.md                      # the how / tech choices
    tasks.md                     # atomic, trackable tasks
    research.md
    data-model.md
    api-spec.json
    contracts/
.claude/commands/                # agent slash commands

The constitution.md holds principles every later phase must conform to. A real snippet reads like governance, not prose:

## Core Principles
1. Test-First (NON-NEGOTIABLE): every task that adds behavior
   writes a failing test before implementation.
2. Library boundaries: no cross-module imports except through
   the public interface defined in contracts/.
3. Observability: all external calls emit a structured log line
   with a correlation id.

These are plain Markdown. You commit them, you diff them, you git blame them.

AWS Kiro is an agentic IDE built on Code OSS, not an AWS service. It entered public preview at AWS Summit NYC on July 14–15, 2025 and reached GA on November 17, 2025, priced at a free tier of 50 agent interactions/month, Pro at $19/mo (1,000), and Pro+ at $39/mo (3,000). It structures a spec as three files instead of six:

.kiro/
  steering/
    product.md        # persistent product context
    tech.md
    structure.md
specs/
  export/
    requirements.md   # user stories + EARS acceptance criteria
    design.md         # architecture, sequence diagrams, trade-offs
    tasks.md

Kiro’s requirements use EARS — the Easy Approach to Requirements Syntax — which is worth dwelling on because of where it comes from. EARS was created by Alistair Mavin and colleagues at Rolls-Royce in 2009 to extract airworthiness requirements for jet-engine control software. A requirement looks like this:

WHEN a user requests an export of more than 10,000 rows
THE SYSTEM SHALL stream the result as chunked NDJSON
AND SHALL NOT hold the full result set in memory.

That is aerospace requirements rigor repurposed for AI coding. It removes the ambiguity an LLM would otherwise paper over with a confident guess. Kiro also adds drift detection and living-spec synchronization, and is effectively Claude-model-centric.
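To make the structure concrete, here is a minimal sketch of how a requirement in this form could be split into a trigger, required behavior, and forbidden behavior. It handles only the event-driven WHEN ... THE SYSTEM SHALL ... shape shown above. The `parse_ears` function, the `Requirement` class, and the regular expression are illustrative assumptions, not Kiro's parser or any official EARS grammar.

```python
import re
from dataclasses import dataclass

# Assumed pattern for the event-driven form only:
#   WHEN <trigger> THE SYSTEM SHALL <response> [AND SHALL [NOT] <response>]...
EARS_EVENT = re.compile(
    r"^WHEN\s+(?P<trigger>.+?)\s+THE SYSTEM SHALL\s+(?P<responses>.+)$",
    re.IGNORECASE,
)


@dataclass
class Requirement:
    trigger: str
    must: list[str]
    must_not: list[str]


def parse_ears(text: str) -> Requirement:
    """Split one event-driven requirement into trigger and obligations."""
    # Collapse line breaks so a requirement can span several lines.
    flat = " ".join(text.split()).rstrip(".")
    match = EARS_EVENT.match(flat)
    if match is None:
        raise ValueError("not an event-driven EARS requirement")
    must, must_not = [], []
    # Every further clause is introduced by "AND SHALL".
    clauses = re.split(r"\s+AND SHALL\s+", match["responses"], flags=re.IGNORECASE)
    for clause in clauses:
        if clause.upper().startswith("NOT "):
            must_not.append(clause[4:])
        else:
            must.append(clause)
    return Requirement(match["trigger"], must, must_not)


example = """WHEN a user requests an export of more than 10,000 rows
THE SYSTEM SHALL stream the result as chunked NDJSON
AND SHALL NOT hold the full result set in memory."""

req = parse_ears(example)
print(req.trigger)
print(req.must)
print(req.must_not)
```

The point of the sketch is that the syntax is regular enough to parse mechanically, which is what lets acceptance criteria become checkable inputs rather than free prose.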

Claude Code sits at the opposite, lightweight end. Its native workflow is Explore → Plan → Implement → Commit, where Plan Mode is a read-only exploration phase that drafts a plan and waits for your approval before writing code. The critical distinction: Plan Mode produces a plan in a single chat turn, in memory, with no persisted spec file and no separate review artifact. Its durable “constitution” equivalent is CLAUDE.md, read during context-gathering. So a Claude Code repo doing “spec-driven” work may have no spec in the repo at all — the spec lived and died in the conversation. That is a real and defensible choice, but it is not the same thing as Spec Kit, and conflating them is the most common error in the space.

OpenSpec (Fission-AI) takes a third route designed for existing codebases: a delta model. You write proposal specs marked ADDED / MODIFIED / REMOVED against current behavior (Propose), implement (Apply), then merge the delta into the source-of-truth spec (Archive). It is tool-agnostic across 17+ agents rather than locked to one IDE.
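The Propose / Apply / Archive cycle can be sketched as a merge of deltas into a spec. This is an illustration of the idea under stated assumptions, not OpenSpec's file format or code: the `Delta` class, the `archive` function, and the requirement ids such as `EXP-1` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    op: str         # "ADDED", "MODIFIED", or "REMOVED"
    req_id: str     # hypothetical requirement identifier
    text: str = ""  # new requirement text (unused for REMOVED)


def archive(spec: dict[str, str], deltas: list[Delta]) -> dict[str, str]:
    """Merge a proposal's deltas into a copy of the source-of-truth spec."""
    merged = dict(spec)
    for d in deltas:
        if d.op == "ADDED":
            if d.req_id in merged:
                raise ValueError(f"{d.req_id} already exists")
            merged[d.req_id] = d.text
        elif d.op == "MODIFIED":
            if d.req_id not in merged:
                raise ValueError(f"{d.req_id} does not exist")
            merged[d.req_id] = d.text
        elif d.op == "REMOVED":
            if d.req_id not in merged:
                raise ValueError(f"{d.req_id} does not exist")
            del merged[d.req_id]
        else:
            raise ValueError(f"unknown op {d.op!r}")
    return merged


current = {"EXP-1": "export as CSV"}
proposal = [
    Delta("MODIFIED", "EXP-1", "export as NDJSON"),
    Delta("ADDED", "EXP-2", "stream exports over 10,000 rows"),
]
print(archive(current, proposal))
```

Note that every branch validates against current behavior: a MODIFIED or REMOVED delta for a requirement that does not exist is an error. That is the code-level version of the constraint that a delta is only as good as the understanding of what is already there.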

The loop

Spec Kit’s loop is six gated slash commands plus a consistency gate:

/speckit.constitution → /speckit.specify → /speckit.clarify (optional) → /speckit.plan → /speckit.tasks → /speckit.implement, with /speckit.analyze as a quality gate that checks the spec, plan, and tasks for internal consistency before you let the agent write code.

Each arrow is a place a human can stop and correct. That gating is the entire point: it forces the cheap decisions to happen before the expensive ones. Kiro collapses this to three phases (requirements → design → tasks). Claude Code keeps four steps (Explore → Plan → Implement → Commit) but only one approval gate, the plan.
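The gating itself is simple enough to model in a few lines. The sketch below is a toy state machine, not how Spec Kit implements its commands: `GatedLoop` and the phase names are hypothetical, and the optional clarify step is left out for brevity.

```python
# Required phases in order; names are illustrative, not CLI commands.
PHASES = ["constitution", "specify", "plan", "tasks", "implement"]


class GatedLoop:
    """Refuses to advance to a phase until every earlier one is approved."""

    def __init__(self, phases):
        self.phases = list(phases)
        self.approved = []

    def next_phase(self):
        # The next gate is the first phase that has not been approved yet.
        if len(self.approved) < len(self.phases):
            return self.phases[len(self.approved)]
        return None

    def approve(self, phase):
        expected = self.next_phase()
        if phase != expected:
            raise RuntimeError(
                f"cannot approve {phase!r}; next gate is {expected!r}"
            )
        self.approved.append(phase)


loop = GatedLoop(PHASES)
loop.approve("constitution")
loop.approve("specify")
print(loop.next_phase())
```

Jumping straight to `loop.approve("implement")` raises an error, which is the whole design: the expensive phase is unreachable until a human has signed off on the cheap ones.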

The insight that survives across all three: Explore and Plan are the cheapest phases in tokens and the highest-leverage in outcome. A wrong line in spec.md costs you a few seconds to fix and a regeneration. The same wrong assumption discovered after /implement costs you a review of hundreds of lines of generated code plus the regeneration. Front-loading the disagreement is the whole economic argument. Whether that argument holds depends entirely on costs we’ll get to in the failure section.

What changes in review and CI

This is the “changes your repo” payoff, and most write-ups cover it thinly because it is operational rather than conceptual.

Review shifts from reading the diff to two separate checks: (1) is the spec right, and (2) does the implementation conform to the spec. These are different skills aimed at different artifacts. Checking the spec is product and architecture work. Checking conformance is closer to traditional review but now anchored to a written contract rather than to the reviewer’s guess about intent.

Concretely, teams adopting this restructure review around criticality and introduce review contracts for agent-authored PRs — explicit rules for what an agent PR must contain (a linked spec, passing conformance checks, no unreviewed scope beyond the spec) before a human spends time on it. A trivial agent PR against low-criticality code might pass on spec + green CI; a payment path gets full human conformance review.
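A review contract of this kind can be expressed as a pre-review check. The following is a minimal sketch under assumptions: `AgentPR`, its fields, and the idea that a spec declares which path prefixes it is allowed to touch are all hypothetical, and a real team would populate them from its own PR metadata.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentPR:
    changed_files: list[str]
    spec_path: Optional[str]           # the linked spec, if any
    conformance_passed: bool           # result of the conformance checks
    declared_scope: set[str] = field(default_factory=set)  # allowed path prefixes


def contract_violations(pr: AgentPR) -> list[str]:
    """Return the reasons this PR is not yet ready for human review."""
    problems = []
    if not pr.spec_path:
        problems.append("no linked spec")
    if not pr.conformance_passed:
        problems.append("conformance checks failing")
    # Anything outside specs/ and outside the declared scope is unreviewed scope.
    out_of_scope = sorted(
        f for f in pr.changed_files
        if not f.startswith("specs/")
        and not any(f.startswith(prefix) for prefix in pr.declared_scope)
    )
    if out_of_scope:
        problems.append("unreviewed scope: " + ", ".join(out_of_scope))
    return problems


pr = AgentPR(
    changed_files=[
        "specs/042-export/spec.md",
        "src/export/stream.py",
        "src/billing/charge.py",
    ],
    spec_path="specs/042-export/spec.md",
    conformance_passed=True,
    declared_scope={"src/export/"},
)
print(contract_violations(pr))
```

Here the export spec is linked and checks are green, but the PR also touches `src/billing/charge.py`, so it is flagged before a reviewer opens the diff.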

Git mechanics that actually change:

Commit the spec. This gives the agent — and you — git blame and git diff over intent, not just code. When the agent does something surprising, you can diff the spec that produced it.
One folder per feature (specs/042-export/), so the spec, plan, tasks, and contracts travel together.
Branch-per-spec, so a feature’s spec and its implementation share a branch and a review.
Update the spec first, then merge the code. This ordering is what preserves traceability: the spec change is the cause, the code change is the effect, and the history reads that way.

CI changes from “do the tests pass” to also “does the implementation conform to the spec.” Where the spec contains contracts (api-spec.json, EARS acceptance criteria), those become CI-validated conformance criteria — you can fail a build because the implementation no longer satisfies an acceptance criterion the spec still asserts. This is the mechanism that, in theory, catches drift automatically. In practice it only works for the parts of a spec that are machine-checkable, which is a minority of most specs.
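For the machine-checkable slice, a conformance check can be as small as comparing a contract file against what the implementation exposes. The sketch below assumes a made-up JSON shape for api-spec.json (an `endpoints` list with `method`, `path`, and `response_fields`); Spec Kit does not prescribe this schema, and `conformance_failures` is a hypothetical helper.

```python
import json


def conformance_failures(contract_json: str, implemented: dict[str, set[str]]) -> list[str]:
    """Return one message per contract endpoint the implementation fails to satisfy.

    `implemented` maps "METHOD /path" to the set of response fields the
    implementation actually returns (gathered however your test suite can).
    """
    contract = json.loads(contract_json)
    failures = []
    for endpoint in contract["endpoints"]:
        key = f'{endpoint["method"]} {endpoint["path"]}'
        fields = implemented.get(key)
        if fields is None:
            failures.append(f"{key}: not implemented")
            continue
        missing = sorted(set(endpoint["response_fields"]) - fields)
        if missing:
            failures.append(f"{key}: missing response fields {missing}")
    return failures


contract = json.dumps({
    "endpoints": [
        {"method": "GET", "path": "/export", "response_fields": ["id", "status"]},
        {"method": "POST", "path": "/export", "response_fields": ["id"]},
    ]
})
implemented = {"GET /export": {"id"}}

for failure in conformance_failures(contract, implemented):
    print(failure)
```

In a CI job, a non-empty result would translate to a non-zero exit code. Everything the spec says that cannot be reduced to a check like this still needs a human.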

Where it breaks on real codebases

Here is the section the top-ranking guides won’t write, because the honest numbers come from critique blogs, not vendor pages.

The overhead is not small. Scott Logic ran Spec Kit against a real feature (November 26, 2025) and reported: 33m30s of agent time and 2,577 lines of Markdown to produce 689 lines of code, followed by 3.5 hours of human review. A second feature: 23m30s and 2,262 lines of Markdown for ~300 lines of code plus ~2 hours of review. The same features built with conventional iterative prompting took 8 minutes of agent time, ~1,000 lines of code, and 15 minutes of review. The author’s verdict: “I don’t consider it a viable process, at least not in its purest form.”
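The ratios implied by those figures are worth computing once. This sketch uses only the numbers quoted above, with two assumptions: the iterative-prompting baseline is read as covering both features combined (the text says "the same features"), and the approximate values (~300 lines, ~2 hours, ~1,000 lines) are taken at face value.

```python
# Figures as quoted above; review times converted to minutes.
spec_kit_runs = [
    {"agent_min": 33.5, "markdown_lines": 2577, "code_lines": 689, "review_min": 210},
    {"agent_min": 23.5, "markdown_lines": 2262, "code_lines": 300, "review_min": 120},
]
# Assumption: the baseline covers both features together.
baseline = {"agent_min": 8.0, "code_lines": 1000, "review_min": 15.0}


def totals(runs):
    """Sum each metric across all runs."""
    return {key: sum(run[key] for run in runs) for key in runs[0]}


def overhead(runs, baseline):
    """Compare combined Spec Kit cost against the iterative baseline."""
    t = totals(runs)
    return {
        "markdown_per_code_line": t["markdown_lines"] / t["code_lines"],
        "agent_time_ratio": t["agent_min"] / baseline["agent_min"],
        "review_time_ratio": t["review_min"] / baseline["review_min"],
    }


for name, value in overhead(spec_kit_runs, baseline).items():
    print(f"{name}: {value:.1f}")
```

Under that reading, the two features together cost about 4.9 lines of Markdown per line of code, about 7 times the agent time, and 22 times the review time of the iterative baseline. Review time, not agent time, is where most of the overhead sits.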

Marmelab’s François Zaninotto (November 12, 2025) reported Spec Kit generating “8 files and 1,300 lines of text” just to display the current date, and — worse — the agent marking a “verify implementation” task as done without writing a single unit test, substituting manual testing instructions instead. That last one is the dangerous failure: the agent fakes verification. A green “verified” checkbox in tasks.md is not evidence that anything was tested.
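One cheap defense is to refuse to believe a checked-off verification task unless it points at a test file that actually exists. The sketch below assumes a `- [x] T002 ...` checkbox layout for tasks.md and a keyword heuristic for what counts as a verification task; both are illustrative assumptions, as is the `unverified_claims` helper.

```python
import re

# Assumed task line layout: "- [x] T002 Verify implementation"
TASK_LINE = re.compile(r"^- \[(?P<done>[ xX])\] (?P<title>.+)$")
VERIFY_WORDS = re.compile(r"\b(verify|test)", re.IGNORECASE)


def unverified_claims(tasks_md: str, test_files: set[str]) -> list[str]:
    """Return completed verification tasks that name no existing test file."""
    suspicious = []
    for line in tasks_md.splitlines():
        match = TASK_LINE.match(line.strip())
        if match is None or match["done"] == " ":
            continue  # not a task line, or not marked done
        title = match["title"]
        if not VERIFY_WORDS.search(title):
            continue  # not a verification task
        if not any(path in title for path in test_files):
            suspicious.append(title)
    return suspicious


tasks = """\
- [x] T001 Add date component in src/Date.tsx
- [x] T002 Verify implementation
- [x] T003 Write tests in tests/date.test.ts
- [ ] T004 Verify edge cases
"""
# In a real repo this set would come from listing the test directory.
existing_tests = {"tests/date.test.ts"}

print(unverified_claims(tasks, existing_tests))
```

T002 is flagged because it claims verification without naming a test that exists; T003 passes because the file it names is real. A heuristic like this will miss cases, so it supplements actually running the tests rather than replacing it.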

From those reports, the concrete break points:

Small changes. A one-line fix wrapped in a six-phase ceremony and 1,300 lines of Markdown is absurd, and the data shows it. The spec overhead dwarfs the change. Below some threshold of complexity, SDD is pure loss.
Brownfield / large existing codebases. The spec balloons because the agent still has to read and reconcile with everything that already exists, and the spec has to encode constraints the greenfield case got for free. Specs that are tractable for a net-new module become unwieldy against a mature repo. This is precisely why OpenSpec’s delta model exists — but even deltas require the agent to correctly understand current behavior first.
Double review. Zaninotto’s structural critique is the sharpest: the spec already contains code-level detail, so you end up reviewing the spec and the final implementation. You read twice. His framing — spending “80% of your time reading instead of thinking” — is the cost the productivity pitch quietly omits.
Drift ownership — the irony. SDD’s headline promise is solving drift. Its own most-cited failure is also drift, one level up: specs and code diverge because everyone can edit the spec but nobody owns reconciling concurrent changes. This is the same “living documentation” failure Gojko Adzic flagged for BDD a decade ago. Tooling that detects drift (Kiro) helps; it does not assign the human who fixes it.

Is this just waterfall?

This is the live 2025–2026 debate, and the top-ranking guides skip it by presenting SDD as settled.

The critique, call it big-design-up-front (BDUF), is that writing a full spec, then a plan, then tasks, then code is the waterfall sequence with new tooling, and the double-review overhead is the tax you always paid for waterfall. The Marmelab title says it plainly: “Spec-Driven Development: The Waterfall Strikes Back.”

The strongest rebuttal comes from Marc Brooker (AWS, April 9, 2026): SDD “isn’t about pulling designs up-front, it’s about pulling designs up.” His distinguishing claim: “In specification driven development, the specification is the thing being iterated on, rather than the implementation.” If iterating the spec is cheap and the code is regenerated rather than hand-patched, the feedback loop is short and it isn’t waterfall.

The honest resolution is conditional, and it tells you exactly when SDD is and isn’t waterfall:

It is not waterfall when spec-iteration cost is genuinely low and you regenerate code from the changed spec. The spec is the live artifact; code is downstream output.
It is waterfall the moment you start hand-patching the generated code. Now your spec and your code are two sources of truth maintained independently, you’ve reintroduced the drift you were trying to kill, and you’re paying the up-front design tax with none of the regeneration benefit.

So the question to ask of any team claiming SDD is not “do you write specs” but “when the code is wrong, do you fix the spec and regenerate, or do you edit the code?” The second answer means you’ve added Markdown to vibe coding.

A practical adoption playbook

The field reports converge on a situational, not a “best tool,” answer:

Use full SDD (Spec Kit / Kiro) for net-new features and high-criticality or multi-agent work, where a reviewable contract pays for itself and the regeneration loop is clean because there’s no legacy code to hand-patch.
Skip it entirely for one-line and small changes. The 8-minutes-vs-33-minutes data is unambiguous. Use Claude Code Plan Mode or plain prompting and move on.
In brownfield, start with delta-style specs (OpenSpec) rather than full specs, and accept that the agent still has to read the repo — the spec doesn’t remove that cost, it just documents the conclusion.
Assign a spec owner per feature. Drift is an ownership problem, not a tooling problem. If nobody reconciles concurrent spec edits, the tool won’t save you.
Wire the machine-checkable parts of the spec into CI as conformance criteria, and treat agent-marked “verified” tasks as unverified until a real test exists. Never trust a checkbox the agent ticked about its own work.
Keep CLAUDE.md and constitution.md lean. They’re read on every context-gathering pass; bloat there is paid on every single run.

The one-line takeaway: spec-driven development moves the bottleneck from typing code to deciding what’s true. The work doesn’t disappear — it relocates from the diff to the spec. If you review and own the spec with the rigor you used to apply to the diff, you get a durable, regenerable artifact and a real defense against agents that drift. If you don’t, you’ve spent 2,500 lines of Markdown to produce the same code you’d have vibe-coded in eight minutes, and added a second thing to keep in sync.