AI · Software Engineering · Mar 15, 2026

The Code is a Side Effect: Spec-Driven Development in the Coding Agent Era

Everyone's talking about AI writing code. Almost nobody's talking about why most of it ends up in the trash. ~70,000 lines of code later, I accidentally built my favorite development environment.

Nikola Novoselec

Founder & CTO


I wanted to answer a simple question: can an AI coding agent build a production-grade system from a specification alone, without writing code myself - intervening only when summoned, to review plans and manage context? To test that, I decided to build the most complex thing I could think of in a hold-my-beer moment: not a todo app. Not a CRUD demo. An ephemeral cloud IDE designed for the era of coding agents. And so Codeflare was born.

A real system - browser-based terminals, WebSocket proxying, Durable Objects for session lifecycle, R2 storage with bidirectional sync, a SolidJS dashboard, a setup wizard that configures DNS and Cloudflare Access, user management with RBAC, and a container image that pre-warms terminal sessions before the user clicks Open.

The answer to my own question is: yes - and it massively exceeded my expectations. The how is where it gets interesting.

1. The Method: Specifications, Not Prompts

Let me start with the biggest misconception in AI-assisted development. Most people hand an agent a prompt - “build me an app that does X” - and hope for the best. This is vibe coding. It works for prototypes. It does not work for systems.

The difference between a prompt and a specification is the difference between a wish and a contract.

A specification defines what the system does, what technology it uses, how components interact, what the data looks like, what edge cases exist, and what acceptance looks like. It’s specific enough that the agent can execute without asking you a single question. If the agent has to guess, the spec isn’t done.

For Codeflare, the specification runs 542 lines. It defines 11 phases - from the Worker entry point and security headers to the container Dockerfile and E2E tests. Every route, every KV schema, every WebSocket protocol detail, every CSS architecture decision. The agent reads this document and produces a detailed implementation plan; I review and approve it, and then the agent executes.
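To make "specific enough" concrete, here is a hypothetical excerpt in the shape such a phase might take - the routes, schema, and criteria below are illustrative, not taken from the actual Codeflare specification:

```markdown
### Phase 4: Session Lifecycle (illustrative)

- Route: GET /api/sessions/:id → 200 with { id, status, createdAt }; 404 if unknown
- Route: POST /api/sessions → creates a session; 429 if the user exceeds 5 active sessions
- KV schema: session:{id} → { userId, containerId, status: "warm" | "active" | "dead" }
- Edge case: a WebSocket reconnect within 30s resumes the same terminal; after 30s, a new one is created
- Acceptance: all of the above covered by failing-first tests before implementation begins
```

Note what is absent: no "make it nice," no "handle errors appropriately." Every line is checkable, so the agent never has to guess.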

The specification is the product. The code is a side effect.

This is spec-driven development - and in the coding agent era, it redefines the job. The highest-leverage skill is no longer writing code. It’s translating requirements into specifications precise enough that execution becomes mechanical.

2. Test-Driven Development as a Live Guardrail

Here’s where most autonomous AI development falls apart. An agent without feedback loops doesn’t just drift - it hallucinates with the confidence of a keynote speaker.

Test-Driven Development isn’t a style preference. It’s the mechanism that keeps the agent honest.

The workflow is simple: the agent writes failing tests first, then implements until they pass. Every npm test run injects your expectations back into the agent’s context. If it drifts off course - wrong return type, missing validation, broken edge case - the failing test tells it exactly what went wrong and what was expected. The agent course-corrects without you lifting a finger.

Without tests, the agent has no feedback loop. It writes code, assumes it works, and moves on like a consultant who won’t be here next quarter. By the time you notice something is wrong, it’s three features deep into a broken foundation.

Test-driven development turns your spec into a live guardrail. The agent can’t declare victory unless the tests agree. And if the tests are weak, that’s on you - not the model.

Codeflare has 1,974 unit tests and 225 end-to-end tests. The agents wrote tests first, then implemented to green. I enforced that loop ruthlessly.

3. The Infrastructure: Ephemeral by Design

Process solves correctness. Infrastructure solves safety.

Once I trusted the process, the next question was: where can I safely let an agent run wild?

This part is personal. I needed to solve a problem that most cloud IDEs avoid.

To be truly autonomous - capable of dealing with whatever obstacle they hit mid-build - AI coding agents need untethered root access. They install packages. They modify system files. They run arbitrary commands. They need to operate without permission prompts slowing them down - what some affectionately call “YOLO mode.” On a local machine, this is terrifying. On a shared server, it’s a resignation letter waiting to happen.

Ephemeral containers change the equation entirely.

Give the agent root. Let it rm -rf / if it wants. The only victim is a container that was going to die anyway. Your files are safe in R2 storage, synced continuously. Container dies unexpectedly and you didn’t commit to git? Start a new session, storage is automatically restored from R2, sync your repo with the remote, and continue. The blast radius is contained to an ephemeral container - not your laptop, not your shared infrastructure.

This isn’t just about safety - it’s about mobility. A fully cloud-delivered development environment works from anywhere, from any device, instantly. Your $3,000 MacBook and your $300 Chromebook provide the same experience - because neither one is doing the compute. They’re just windows into an environment that exists somewhere on Cloudflare’s edge.

I’ve written before about Zero Trust at national scale - no entity trusted by default, every access decision verified. This is the same principle applied to a single developer. The container has no standing trust. It lives briefly. It dies. Identity gates access via Cloudflare Access. Storage persists separately in R2. Everything else is disposable. The entire environment costs about $5 a month.

4. What Actually Breaks

Here’s what nobody tells you about autonomous AI development. The failures are not where you expect them.

Context is the bottleneck, not intelligence. Large language models are smart enough to build complex systems. They’re not smart enough to remember what they built three hours ago. Context windows fill up. Decisions made in Phase 1 evaporate by Phase 5. You become the memory management unit - curating what the agent knows, persisting critical decisions across sessions, pruning stale context before it causes drift. This is the real job.

Agents ignore what they can’t see. Mobile layouts suffer. Accessibility gets skipped. Edge cases in WebSocket reconnection logic get hand-waved. The agent optimizes for what’s in the spec and the tests - if you didn’t specify it, it doesn’t exist. The specification’s completeness determines the system’s quality. No exceptions.

Dead code accumulates silently. Over multi-day development sessions, agents leave behind unused imports, orphaned helper functions, and commented-out experiments. They don’t clean up after themselves - they’re builders, not janitors. You need periodic audits - or a second agent whose only job is to find and remove dead code.

Confidence exceeds competence on edge cases. The agent will implement 95% of a feature correctly and then hallucinate the remaining 5% with absolute conviction. Rate limiting logic that doesn’t actually rate limit. WebSocket reconnection that doubles every character. Circuit breakers that never trip. The tests catch these - but only if you wrote tests for the edge cases.

5. Surviving the Last 5%

So how do you close that gap? Not with a better prompt. With better instincts.

Coding agents like Claude Code and Codex support rules, skills, hooks, specialised subagents, and commands that fire outside the specification itself. Think of them as the engineering instincts a senior developer carries - the kind that make you write the error handler before someone asks, or run the security check before you ship.

My setup runs over 90 learned behavioural instincts that evolve across sessions. It dispatches 9 specialised agents in parallel - architecture review, security analysis, dead code detection. Automated hooks enforce discipline on every file edit. A memory subsystem persists decisions across context windows. CLI commands trigger entire workflows - a full multi-perspective code review, a cross-model verification panel, a test-driven development enforcement loop - each from a single slash command.
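For a flavor of what one automated hook looks like, here is a sketch in the shape of Claude Code's settings-based hook configuration - the matcher and command are assumptions for illustration, and the exact schema should be checked against current Claude Code documentation:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npm test --silent" }
        ]
      }
    ]
  }
}
```

Every file edit then triggers the test suite - the feedback loop closes without anyone remembering to run it.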

This didn’t appear fully formed. After several thousand commits to GitHub in the past year, it’s an accumulation of guardrails added every time the agent failed in a new and creative way - 15 years of architectural scar tissue and cybersecurity know-how, encoded as constraints on the agent’s work.

The specification tells the agent what to build. The instincts tell it how to think while building it.

6. The Accidental Product

Here’s the part I didn’t plan.

Codeflare was supposed to be a proof of concept. A demonstration that spec-driven, test-driven AI development produces production-grade software. I’d write the specification, hand it to an agent, and measure the result.

The first time I opened it on a train to work, on a Samsung Galaxy Fold, it felt identical to my desktop - that’s when I realized this wasn’t a demo. Pre-warmed sessions that load before you click Open. R2 storage that makes your files immortal. A setup wizard that configures DNS, Cloudflare Access, and storage credentials in seconds.

Browser-based terminals with tiling layouts. Support for Claude, Codex, Gemini, or plain Bash - because the best agent is the one that fits your task.

I didn’t write the code. I wrote the contract. And if a specification can produce production-grade tools, the bottleneck in engineering isn’t implementation anymore. It’s clarity.

The Bottom Line

Here’s what it comes down to. “Vibe coding” - giving an agent a loose prompt and hoping for the best - produces demos. Spec-driven development paired with test-driven development is the winning combination for the coding agent era. The specification defines the system. The tests keep the agent honest. Together, they turn autonomous AI development from a gamble into an engineering discipline.

The specification is a forcing function. It makes you think through every route, every schema, every edge case before a single line of code exists. And when the agent executes against that specification with test-driven development enforcing correctness at every step, something remarkable happens: the code is often cleaner than what most teams produce under deadline pressure. Not because the AI is smarter, but because the process doesn’t allow shortcuts.

The future of software engineering isn’t about writing code or watching AI write code. It’s about writing specifications precise enough that execution becomes mechanical - and then letting the machines do what machines do best.

I spent about 10 hours writing the specification, five hours testing, and 10 hours adding features I didn’t know I needed until I used the thing - plus fine-tuning UI details, because Samsung Internet has opinions about CSS that no specification can anticipate. Total engineering time: about 25 hours. ~70,000 lines of code. That is a leverage ratio I have never seen in my career.

I wrote a document. An AI agent built my favorite IDE from it. I’d call it a proof of concept, but it’s hard to call something a demo when you refuse to use anything else.

