AI · Software Engineering · Apr 19, 2026

The Spec Grew Up: What Happened After 542 Lines Became a Product

Two months ago I published a 542-line spec and said 'the code is a side effect.' The spec is now 4,400 lines. The code is over 100,000 lines. Here's what happened when the spec stopped being a document and became the product.

Nikola Novoselec


Founder & CTO


Two months ago, I wrote about building a production system from a 542-line specification - an ephemeral cloud IDE called Codeflare. The thesis was simple: spec-driven development paired with test-driven development turns autonomous AI development from a gamble into an engineering discipline. The code is a side effect.

That thesis survived contact with reality. What I didn’t expect was how much it would evolve.

The Numbers

The specification grew from 542 lines to over 4,400. The codebase grew from 70,000 lines to over 100,000. The test suite grew from 2,199 to over 3,200 across 189 test files. The system is live at codeflare.ch - real users, real sessions, real production traffic.

But the numbers are not the story. The real story is what production did to the process.

The Spec Became the Product

In the original post, the specification was a document. A contract between me and the agent. Phase-based, prescriptive, static. It described the system once and the agent executed.

That worked for v1. It doesn’t work for a living product.

A static specification describes what you intended to build. A living specification describes what you actually built - and why. When a bug surfaces, sometimes the specification tightens. Often, the bug changes nothing in the spec - only the execution was wrong. When a feature request arrives, the specification absorbs the intent before a single line of code changes. The spec doesn’t capture how to build something. It captures what to build and why.

The specification is no longer a starting document. It is the single source of truth for the entire product.

152 requirements across 11 domains. 46 terms in the glossary. 12 architectural constraints. 44 architecture decision records capturing the trade-offs behind every non-obvious design choice. Every requirement has a unique ID, acceptance criteria, and traceability to the domain it belongs to. The agent consults the spec before changing anything, validates its work against acceptance criteria, and updates the spec when the system’s behaviour changes. For engineering leaders, this means every behavioural change starts as a traceable requirement - not an undocumented code edit.
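To make the shape of such an entry concrete, here is a minimal sketch of one traceable requirement as a Python dataclass. The field names, the ID format, and the domain name are my illustrative assumptions, not Codeflare's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Requirement:
    """One traceable spec entry (illustrative schema, not the real one)."""
    req_id: str                  # unique ID, e.g. "SESS-012" (hypothetical format)
    domain: str                  # which of the 11 domains it belongs to
    statement: str               # "the system MUST / MUST NEVER ..."
    acceptance_criteria: list = field(default_factory=list)

# An entry the agent could validate its work against:
req = Requirement(
    req_id="SESS-012",
    domain="session-lifecycle",
    statement="User-configured idle timeout MUST survive container restarts.",
    acceptance_criteria=[
        "Timeout is written to durable storage whenever it changes",
        "Timeout is restored when a new container starts",
    ],
)
```

The point of the structure is traceability: every behavioural change can be diffed against a concrete entry rather than reconstructed from commit messages.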

The Workflow That Emerged

The original process was linear: write the spec, generate a plan, execute with TDD. Clean. Simple. One-directional.

The production workflow is circular:

  1. New idea arrives - feature request, bug report, user feedback, or my own observation while using the product
  2. Spec first - the requirement gets captured with a unique ID, acceptance criteria, and domain assignment. Not code. Not a ticket. A specification entry
  3. Plan only the delta - the agent reads the spec, identifies what’s new or changed, and produces a targeted implementation plan
  4. TDD execution - tests first, then implementation, then verification against acceptance criteria
  5. Spec updated - if the implementation revealed edge cases or design decisions not captured in the spec, they get added. The spec grows to reflect reality, not just intent
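The five steps above can be sketched as one function. Every name and data shape here is an illustrative stand-in for the real tooling; the only claim is the ordering - spec entry first, delta plan second, tests before code, spec updated last:

```python
def delta_cycle(spec: dict, request: str) -> dict:
    """One pass of the circular workflow (sketch, not the real pipeline)."""
    # 2. Spec first: the request becomes a requirement with an ID, not a ticket.
    req_id = f"REQ-{len(spec['requirements']) + 1:03d}"
    requirement = {"id": req_id, "intent": request, "criteria": [request]}
    spec["requirements"].append(requirement)

    # 3. Plan only the delta: restrict attention to what is new or changed.
    plan = [r for r in spec["requirements"] if r["id"] == req_id]

    # 4. TDD execution: tests derived from acceptance criteria, then code.
    tests = [f"test::{c}" for c in requirement["criteria"]]
    implemented = bool(plan) and all(t.startswith("test::") for t in tests)

    # 5. Spec updated: record what execution taught us, so the next
    #    delta plan works from reality rather than stale intent.
    requirement["verified"] = implemented
    return requirement

spec = {"requirements": []}
result = delta_cycle(spec, "persist idle timeout across restarts")
```

Note that the spec dict is mutated before any "code" runs: step 5 feeding back into step 2 is what makes the loop circular rather than linear.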

This loop runs continuously and autonomously. Every push triggers three review agents - code quality, documentation, and specification - that reconcile intent with reality. The specification and the codebase co-evolve. Neither leads permanently.

When something breaks in production, the flow inverts - fix first, spec after. The discipline is that the fix isn’t done until the spec reflects it, otherwise the next delta plan risks overwriting it.

What Changes When You Ship

The spec became defensive. Before launch, every requirement was aspirational - “the system shall do X.” After launch, requirements split into two categories: things the system must do, and things the system must never do. Edge cases that users found became constraints. Behaviours that worked in testing but failed in production became hardened requirements. The spec grew faster from bugs than from features.

One example: a user’s session was killed mid-work because the idle timeout preference was stored only in memory. When the container recycled, the preference vanished and the system fell back to a short default. That became a defensive requirement: “User-configured idle timeout MUST be persisted to durable storage and survive container restarts.” The spec doesn’t just describe what the system does - it encodes what went wrong and why it must never happen again.
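The class of fix that requirement forces can be sketched in a few lines. This is my illustrative version, assuming a JSON file as the durable store and a 300-second fallback; the real system's storage and default are not specified in the post:

```python
import json
import os
import tempfile

DEFAULT_IDLE_TIMEOUT = 300  # the short fallback that bit the user (assumed value)

def save_timeout(path: str, seconds: int) -> None:
    # Persist on every change, so process memory is never the only copy.
    with open(path, "w") as f:
        json.dump({"idle_timeout": seconds}, f)

def load_timeout(path: str) -> int:
    # A fresh container reads only from disk; fall back only if nothing was saved.
    try:
        with open(path) as f:
            return json.load(f)["idle_timeout"]
    except (FileNotFoundError, KeyError, json.JSONDecodeError):
        return DEFAULT_IDLE_TIMEOUT

prefs = os.path.join(tempfile.mkdtemp(), "prefs.json")
save_timeout(prefs, 3600)
restored = load_timeout(prefs)  # simulated restart: state recovered from disk
```

The acceptance criterion is directly testable: write, "restart", read back, and the preference must survive.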

Context management became specification management. In the original post, I described the developer as a “memory management unit” - curating what the agent knows across sessions. With a living specification, the spec itself becomes the memory. New session? The agent reads the spec. Context window filling up? The spec has the canonical state. The specification replaced half of my context management work.

What Specs Can’t Capture

Specifications are excellent at behaviour. They are poor at taste.

“The dashboard should feel responsive” is not a useful requirement - it’s a subjective judgement that changes based on context, device, and the user’s mood at 7 AM on a train versus 2 PM at a desk. “This button placement feels wrong” is not an acceptance criterion. It’s judgement, not specification.

These are the things you end up re-explaining every session. The intent behind a design choice. Why this layout works and that one doesn’t. Why this error message sounds helpful and that one sounds condescending. Every new session, the context resets.

To solve this, I built a persistent knowledge graph. Not raw transcripts of past conversations - a structured graph of observations and relations that the agent maintains across sessions. When I say “the subscribe page should feel premium, not desperate,” that observation gets captured, linked to the relevant domain, and injected into the agent’s context the next time it touches that page. The agent doesn’t just remember what the system does. It remembers what I care about and why.
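A minimal sketch of that structure might look like the following. The class and method names are my assumptions, as is the domain label; the real graph presumably stores richer relations than a domain index:

```python
from collections import defaultdict

class KnowledgeGraph:
    """Structured observations and relations, not raw transcripts (sketch)."""

    def __init__(self) -> None:
        self.observations: list[str] = []          # judgement, not behaviour
        self.by_domain = defaultdict(list)         # relation: domain -> observations

    def observe(self, text: str, domain: str) -> None:
        # Capture the observation and link it to the relevant domain.
        self.observations.append(text)
        self.by_domain[domain].append(text)

    def context_for(self, domain: str) -> list:
        # What gets injected the next time the agent touches this domain.
        return self.by_domain[domain]

kg = KnowledgeGraph()
kg.observe("subscribe page should feel premium, not desperate", "billing-ui")
ctx = kg.context_for("billing-ui")
```

The key property is persistence across sessions: the observation outlives the context window that produced it.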

The specification captures behaviour. The knowledge graph captures judgement. Together, they give the agent something no context window can provide: continuity. Tasks that required careful guidance in week two became routine by week seven.

From Process to Product

The SDD workflow that built this system - hooks, commands, agents, skills, rules, and MCP plugins - is now a feature of the system itself. Not because it was planned that way - because a living specification and a persistent knowledge graph are architectural consequences, not optional add-ons. Once the specification becomes the control system for change, productising that loop becomes the obvious next step.

The agent takes natural language input - “I want user authentication with GitHub OAuth” - and proposes a structured requirement with acceptance criteria, domain assignment, and dependencies. The user confirms, refines, or rejects. Execution follows the same loop: plan, TDD, verify, update spec. The knowledge graph runs alongside it, preserving preferences and design instincts so the agent improves without forcing you to restate the same judgement every session.

A senior architect riding a train can describe a system in plain language and have a structured, traceable specification before the train reaches the next station. From there, the agent executes against that specification with TDD enforcing correctness at every step. That is a leverage ratio that changes what one person can build.

The Numbers Nobody Talks About

I showed the system to engineering managers and DevOps specialists and asked them to estimate what it would take to build traditionally. The consensus was 300 to 400 engineering days. At Swiss market rates, that’s somewhere between CHF 300K and 600K in engineering costs - before project management, coordination overhead, and the inevitable scope negotiations.

I spent about 150 hours and roughly $600 in API tokens. Most of those hours went into designing the SDD system itself - the rules, the workflows, the guardrails. Not the product. The system that builds products.

Where This Doesn’t Work

This process excels where behaviour can be specified and tested - APIs, business logic, infrastructure, data flows. It is weaker when the problem itself is still fluid - novel UX exploration, ambiguous product strategy, or greenfield discovery where the requirements are what you’re trying to learn. SDD reduces implementation uncertainty. It does not reduce market uncertainty. If you can’t articulate what the system should do, no specification will save you.

But here’s the thing: vibe code it first. Prototype, explore, throw things at the wall until you understand the intent, purpose, and goal. Once you know what you’re building and why - that’s when SDD takes over and turns the clarity into production-grade software.

The Bottom Line

The question everyone keeps asking is whether AI can replace developers. It’s the wrong question. The right question is: what happens when the bottleneck shifts from writing code to defining intent?

A specification precise enough to be executed mechanically. A test suite that keeps execution honest. A knowledge graph that preserves judgement across sessions. That combination didn’t just build a product - it built a product that ships the combination itself.

The next team that operates this way won’t need 150 hours of setup. They’ll need clarity about what they want to build. The system handles the rest.

542 lines was a proof of concept. Over 100,000 lines of production code later, it’s an answer.


The system is live at codeflare.ch if you want to try the workflow without bootstrapping it yourself.

