The lonely sprint
Last Tuesday I shipped 9 refactor sprints. In old-shop arithmetic — 1 PM, 1 designer, 2 engineers, 1 QA — that is 5 people × 5 weeks ≈ 25 person-weeks of work. I did it in 1 day. This is not a flex. It is the prompt for an honest accounting: where does the AI start, where does the human end, and what disappears in between? I have 28 days of usage data sitting on disk — 3,514 messages across 175 sessions, 7,939 Bash calls, 1,072 sub-agent invocations — to grade myself against, and I want to lay it out without the marketing varnish.

What "context engineering" actually means
Prompt engineering is what you write into the box. Context engineering is what is already there before you write anything. The slogan I keep coming back to is this: build the system so you never have to re-explain the same context twice. Every decision you persist — a convention, a guardrail, a workflow shape — pays compound interest the next time you sit down. The first sprint is slow because you are pouring the foundation. The ninth is fast because you are not.
A second-order definition: prompt engineering optimizes one turn. Context engineering optimizes the project's memory across all turns. It is closer to onboarding a new hire than to crafting a clever sentence. You do not write better prompts forever. You build a project that requires fewer of them.
The 4-layer context stack
The stack I use is layered and additive. Each layer answers a different question: what is true, what is forbidden, how do we work, what did we ship.
Layer 1   CLAUDE.md            Auto-injected every prompt = "context that never forgets"
Layer 2   .claude/rules/       Hooks that block violations before they happen
Layer 3   .claude/skills/      Slash commands wrapping reusable workflows
          .claude/agents/      36 domain experts the orchestrator can call
          .claude/commands/    Composed primitives ("/pdca plan", "/sprint-start")
Layer 4   docs/01-plan/        Source-of-truth artifacts. The output of one phase
          docs/02-design/      becomes the input of the next. Every sprint
          docs/04-report/      adds to the stack — the stack grows with you.
          docs/05-qa/

Layer 1 is the constitution. Three CLAUDE.md files (root, common, sprint-master) inject ~470 lines of "this is what the project is, here is the 30-sprint plan, here is what you absolutely must not do" into every single prompt at zero cost to me. I never re-explain conventions across sessions. Layer 2 is enforcement, not advice — architecture-rules.md, design-system-rules.md, safety-rules.md are wired into pre-commit and PreToolUse:Write hooks that refuse a write if it violates the rule. Layer 3 is the workflow library. Layer 4 is what every sprint produces and the next sprint reads.
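To make Layer 2 concrete, here is a minimal sketch of what a PreToolUse:Write hook can look like. It assumes Claude Code's hook contract (the pending tool call arrives as JSON on stdin, a non-zero exit blocks it), and the hex-color rule is an invented stand-in for whatever design-system-rules.md actually forbids; it is not the project's real hook.

// check-design-rules.ts - illustrative only, not the project's actual hook
import { readFileSync } from 'node:fs'

interface PreToolUsePayload {
  tool_name: string
  tool_input: { file_path?: string; content?: string }
}

// the pending tool call arrives as JSON on stdin
const payload: PreToolUsePayload = JSON.parse(readFileSync(0, 'utf8'))
const { file_path = '', content = '' } = payload.tool_input

// hypothetical rule: component files must use design tokens, never raw hex colors
const isComponentFile = /src\/components\/.*\.tsx?$/.test(file_path)
const rawHexColor = /#[0-9a-fA-F]{3,8}\b/

if (isComponentFile && rawHexColor.test(content)) {
  console.error(`BLOCKED: ${file_path} uses a raw hex color - use design-system tokens instead`)
  process.exit(2) // non-zero exit refuses the Write; the agent sees the reason and retries
}
process.exit(0) // rule satisfied, the Write goes through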
The crucial property: every sprint that closes makes the stack richer. New scripts go into scripts/qa/, new agents into .claude/agents/, refined templates into .claude/templates/. Sprint N+1 starts on a stronger floor than sprint N did.
Where AI takes over (a real day)
The 9-sprint day was not heroic concentration on my side. It was orchestration. The Claude Code usage report tells the story plainly: Bash 7,939 calls, Edit 2,837, Read 2,676, TaskUpdate 1,997, Write 1,327, Agent 1,072. Multi-clauding (parallel sessions overlapping in time) accounts for 36% of all messages. At one point I dispatched 26 sub-agents in parallel for an architecture audit. The shape is not "I write code with AI helping" — it is "I run a small studio and the studio is AI."
A single sprint, end to end, looks like this from my side:
/sprint-start refactor-H # bkit maps it from the Sprint Master Plan
/pdca plan sprint-refactor-H # plan.md auto-generated, 12 sections
/pdca design sprint-refactor-H # design.md + L1-L5 test plan matrix
/pdca do sprint-refactor-H # api-expert + frontend-expert agents
/pdca analyze sprint-refactor-H # gap-detector compares design vs code
/pdca iterate sprint-refactor-H # auto-fix loop until Match Rate ≥ 90%
/pdca qa sprint-refactor-H # qa-test-generator + qa-monitor
/pdca report sprint-refactor-H # completion report with quality gates

Eight commands. Per sprint. Nine times — seventy-two prompts to drive the day. Most of them were ceremony rather than decision. That friction is exactly the thing that produced the wrapper I will describe in two sections. For now: the bkit state machine refuses to advance if a quality gate fails — typecheck, architecture violations, matrix-to-code sync, L1-L5 tests. I do not run tsc --noEmit myself anymore; the gate does. I do not measure architecture violations myself; measure-arch-violations.ts does, and the result is checked in as arch-baseline.json so the next sprint cannot regress without me being told.
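A sketch of how that baseline check might work, assuming arch-baseline.json stores a plain violation count and the current count is piped in from measure-arch-violations.ts — the real script and file format may well differ:

// check-arch-baseline.ts - illustrative gate, not the repo's actual script
import { readFileSync, writeFileSync } from 'node:fs'

interface ArchBaseline {
  violations: number // assumed shape; the real file may carry per-rule detail
  updatedAt: string
}

const baseline: ArchBaseline = JSON.parse(readFileSync('arch-baseline.json', 'utf8'))
const current = Number(process.argv[2]) // count produced by measure-arch-violations.ts
const delta = current - baseline.violations

if (delta > 0) {
  console.error(`Gate FAILED: ${delta} new architecture violation(s) vs baseline ${baseline.violations}`)
  process.exit(1) // the state machine refuses to advance past a red gate
}
if (delta < 0) {
  // ratchet down: a sprint that removes violations becomes the new floor
  const next = { violations: current, updatedAt: new Date().toISOString() }
  writeFileSync('arch-baseline.json', JSON.stringify(next, null, 2))
}
console.log(`Gate passed: ${current} violations (baseline ${baseline.violations})`)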
From one sprint to thirty
A single sprint is not the unit of work. The four-month roadmap is. The plan lives in docs/01-plan/sprint-master/ — thirteen documents covering thirty sprints across seven phases, each sprint with explicit dependencies on its predecessors. One slash command, /sprint-status, reads four sources at once — bkit's pdca-status.json, the architecture-violation arch-baseline.json, the GitHub issue list, and the master plan markdown — and renders one screen:
Sprint Master Plan — Progress
─────────────────────────────────
Phase 0 — Merge baseline [done]
Sprint 00 — Merge Ready [✓] Match Rate 96%
Phase 1 — Beta (Apr 28 – May 2)
Sprint 01 — Beta P1.1 [in progress] design phase
Sprint 02 — Beta P1.2 [blocked by 01]
Cumulative KPIs
GitHub issues: 27 → 25 (2 closed)
Architecture violations: 79 → 79 (Sprint 27 target)
Matrix sync: 19.4% → 19.4%
Next: /sprint-start 02-beta-p1-2
The point is not the layout. It is that "where am I?" never means scrolling the chat. It means one slash command, four data sources, one screen. Dependency enforcement lives in the plan too: Sprint 11 (cutover) has requires: sprint-03 (data migration); Sprint 30 (final E2E) has requires: sprint-12..29. Trying to start out of order triggers a confirmation prompt. The plan refuses to be ignored quietly.
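The dependency check itself is small. Here is a sketch, assuming the completed-sprint list lives in bkit's pdca-status.json and each sprint doc carries a requires: line; the actual paths, field names, and range handling belong to bkit and may differ:

// missing-dependencies.ts - illustrative, not bkit's real implementation
import { readFileSync } from 'node:fs'

function completedSprints(): Set<string> {
  // assumed shape: { "completed": ["sprint-00", "sprint-03", ...] }
  const status = JSON.parse(readFileSync('pdca-status.json', 'utf8'))
  return new Set<string>(status.completed ?? [])
}

function requiredSprints(planPath: string): string[] {
  // reads a line like "requires: sprint-03"
  // (range syntax like sprint-12..29 would need expanding; omitted here)
  const match = readFileSync(planPath, 'utf8').match(/^requires:\s*(.+)$/m)
  return match ? match[1].split(',').map((s) => s.trim()) : []
}

export function missingDependencies(planPath: string): string[] {
  const done = completedSprints()
  return requiredSprints(planPath).filter((dep) => !done.has(dep))
}

// /sprint-start 11 asks for confirmation when missingDependencies(...) returns ["sprint-03"]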
Wrapping generic PDCA in project context
There is a tension between two truths. bkit's /pdca plan / design / do / analyze / iterate / qa / report works for any project — that is its strength. But "any project" means it knows nothing about my project. It does not know that Sprint 17 is the payment epic, that Sprint 24 is the SEO push, or that the matrix files under docs/qa/matrices/ must update before the report can pass.
The fix is a thin wrapper. sprint-orchestrator is a project-local skill (.claude/skills/sprint-orchestrator.md) that intercepts /sprint-start NN, auto-injects the project context, and then runs /pdca underneath:
user                 project layer             bkit (vendor plugin)
────                 ─────────────             ────────────────────
/sprint-start 17  →  sprint-orchestrator   →   /pdca plan
                     (auto-loads payment       /pdca design
                      matrices, payment-       /pdca do
                      expert, gates,           /pdca analyze
                      baselines)               /pdca iterate
                                               /pdca qa
                                               /pdca report

Sprint 17 starts → payment-expert joins automatically. Sprint 24 → seo-expert. Sprint 27 → arch-auditor. I never re-explain area context. The wrapper collapses the eight-command-per-sprint sequence I described above into one command. The nine-sprint day cost me seventy-two prompts: 8 × 9. Today the same nine sprints would cost nine. The skill that closed that gap was itself born from the friction of typing seventy-two of them.
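What the skill injects is essentially a lookup table from sprint to context. A sketch of the shape, with only the three sprints this post names filled in and the matrix file names invented for illustration:

// sprint-context.ts - illustrative shape, not the actual skill definition
interface SprintContext {
  experts: string[]      // .claude/agents/ names pulled into the session
  matrices: string[]     // docs/qa/matrices/ files that must stay in sync
  extraGates?: string[]  // gates beyond the default six
}

const SPRINT_CONTEXT: Record<string, SprintContext> = {
  '17': { experts: ['payment-expert'], matrices: ['payment.matrix.md'] },
  '24': { experts: ['seo-expert'], matrices: ['seo.matrix.md'] },
  '27': { experts: ['arch-auditor'], matrices: [], extraGates: ['arch-delta-zero'] },
}

// /sprint-start 17 loads SPRINT_CONTEXT['17'], prepends it to the session,
// then drives /pdca plan through /pdca report underneath - one command, not eight
export const contextFor = (sprint: string): SprintContext | undefined => SPRINT_CONTEXT[sprint]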
Six gates that refuse to advance
Generic PDCA does not tell you when to stop. Project context does. sprint-orchestrator has six quality gates wired into the phase transitions, each measured by a script committed to the repo:
Gate                           Threshold      Tool
─────────────────────────      ───────────    ──────────────────────────
Match Rate                     >= 95%         gap-detector
Architecture violation delta   = 0            measure-arch-violations.ts
Matrix sync                    >= 95%         verify-api-matrix.ts
typecheck                      0 errors       tsc --noEmit
lint                           0 warnings     eslint --max-warnings 0
test                           PASS           pnpm test

Match Rate below 95%? /pdca iterate runs automatically, up to five cycles. Architecture delta non-zero? The sprint stops; the fix is forced before the next phase. Matrix sync below 95%? The matrix-synchronizer agent is dispatched. There is no path forward that bypasses a red gate.
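Composed, the gates are nothing more exotic than a loop that refuses to report success while anything is red. A sketch with the thresholds from the table; the exact command lines (and flags like --min) are assumptions about how this repo invokes each tool:

// run-gates.ts - illustrative composition, not sprint-orchestrator's real code
import { execSync } from 'node:child_process'

const pass = (cmd: string): boolean => {
  try { execSync(cmd, { stdio: 'pipe' }); return true } catch { return false }
}

const gates: Array<{ name: string; cmd: string }> = [
  { name: 'typecheck', cmd: 'pnpm exec tsc --noEmit' },
  { name: 'lint', cmd: 'pnpm exec eslint . --max-warnings 0' },
  { name: 'test', cmd: 'pnpm test' },
  { name: 'match-rate', cmd: 'pnpm exec tsx scripts/qa/gap-detector.ts --min 95' },       // flag assumed
  { name: 'arch-delta', cmd: 'pnpm exec tsx scripts/qa/measure-arch-violations.ts' },
  { name: 'matrix-sync', cmd: 'pnpm exec tsx scripts/qa/verify-api-matrix.ts --min 95' }, // flag assumed
]

const red = gates.filter((g) => !pass(g.cmd)).map((g) => g.name)
if (red.length > 0) {
  console.error(`Red gates: ${red.join(', ')} - the sprint does not advance`)
  process.exit(1)
}
console.log('All six gates green. Advancing phase.')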
The behaviour change is not on the AI side. The AI ran without gates before; it would happily declare done. The gates are for me — the human cannot forget the checklist because the system does not let the human forget.
Where I still hold the line
The 26-agent fan-out is impressive. The matching report on outcomes is also revealing: 49 sessions "fully achieved", 34 "mostly", 14 "partially", 4 "not achieved". The same data shows 235 "likely satisfied" sessions versus 34 "dissatisfied". The dissatisfied ones cluster around a recognizable failure mode: AI declared done before it actually was. Anthropic's own analysis of my sessions named the pattern — premature completion claims — and noted I "have zero tolerance for theater." That sentence I will keep.
These are the five things I never delegate:
What AI does                         What I keep
─────────────────────────────────    ─────────────────────────────────
Bash, Edit, Read, Write at scale     Direction ("ship refactor-H next")
1,072 sub-agent invocations          Quality bar ("real E2E, not probes")
26 parallel agents in one fan-out    Convention trade-offs (Q1–Q4 picks)
Quality gates, matrix sync, lint     Choice of verification tool
27 plan/design/report markdowns      Noticing what is missing

The convention trade-offs are the most underrated. In the 9-sprint day, the orchestrator stopped four times to ask me an explicit question: an Auth Guard naming choice (consistency vs back-compat), a poll-vote business rule (one-shot vs editable), a component name (two reasonable options), an SDK namespace (match the existing convention or break it). None of those are technical questions an AI can decide on the merits. They are taste and product decisions. The harness is good enough to know which is which, and to ask.
The 682GB lesson
Here is mine. During an unattended Sprint 4.5 audit, the audit-logger Claude wrote had an infinite recursion bug — logger.info() called itself inside its own hook. By the time I noticed, it had filled 682 GB of disk. No one was hurt; the laptop was. The pattern in the report is named — self-inflicted bugs and resource issues during long autonomous runs — and the lesson is not "AI is unreliable." The lesson is that long-running autonomy without circuit breakers is the unreliability. With them, the same agent is fine.
The fix was small enough to fit in one block:
import pino from 'pino'

const logger = pino({
  hooks: {
    logMethod(args, method) {
      // BEFORE: re-entered logger.info() via the hook → unbounded recursion
      // AFTER: pass through the underlying method directly
      return method.apply(this, args)
    },
  },
})

The lesson lives in two places now. One: the code, with the comment. Two: the hooks. safety-rules.md got an addendum, PreToolUse:Bash started watching du -sh ~/.cache style indicators, and the QA stage runs disk checks before declaring success. The cost of one 682GB mistake was a guardrail that prevents the next ten.
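The QA-stage disk check is simple enough to sketch. This version assumes a bare free-space threshold read from df -k; the real guardrail also watches du -sh style indicators from the PreToolUse:Bash hook and may use different limits:

// check-disk.ts - illustrative circuit breaker; the threshold is an assumption
import { execSync } from 'node:child_process'

const MIN_FREE_GB = 20 // assumed floor for an unattended run

// last line of `df -k .`; column 4 is available space in 1K blocks
const dfLine = execSync('df -k .', { encoding: 'utf8' }).trim().split('\n').pop() ?? ''
const availableKb = Number(dfLine.split(/\s+/)[3] ?? 0)
const freeGb = availableKb / (1024 * 1024)

if (freeGb < MIN_FREE_GB) {
  console.error(`Circuit breaker: ${freeGb.toFixed(1)} GB free (< ${MIN_FREE_GB} GB). Stopping the run - check for runaway logs or caches.`)
  process.exit(2) // non-zero exit fails the QA stage / blocks the Bash call
}
console.log(`Disk OK: ${freeGb.toFixed(1)} GB free`)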
What this changes about being a founder
5 × 5 = 25 person-weeks became 1 × 0.2 = 0.2. The arithmetic is real. What it changed about my day is also real, and unevenly distributed. The hours I used to spend typing are gone. They were not replaced by leisure. They were replaced by hours of judgment density — deciding what to ship, what counts as done, which of two reasonable options is right for this product, where to push back on an output that smelled fine but was not. AX-redesign was not optional. It was forced on me by the math: if I do not redesign my own role, the leverage I just gained is wasted on more typing.
The other quiet shift is patience around the foundation. The same instinct that makes a founder ship fast — "I'll do it once, ugly, and move on" — fights the discipline that context engineering rewards. The right move was the opposite. Spend a week writing rules and templates that make the next ten weeks compound. The first sprint felt slow. By the ninth it was no longer a question of effort.
The discipline before the leverage
Without context engineering, a faster model is just a faster way to repeat the same mistake. The 4-layer stack is not glamorous, and there is nothing in it that you cannot reproduce in your own repo this afternoon. The leverage is on the other side of that hour. If you are vibe-coding without rules, without skills, without docs/* artifacts and without hooks, you are not getting AI's help. You are getting AI's noise. The fix is not a better prompt. It is a better project.
That is what tene and bkit are, in the end — opinionated context engineering for two specific problems (secret-handling and PDCA delivery). I keep building them because the leverage I described in this post is conditional. It exists when the stack exists. It is gone the moment it does not.
Related reading: