The lonely sprint
Last Tuesday I shipped 9 refactor sprints in one day. In a normal team — 1 PM, 1 designer, 2 engineers, 1 QA — that's about 25 person-weeks of work. I did it solo. This isn't a flex. It's the start of an honest count: where does the AI begin, where does the human end, and what falls away in between?
I have 28 days of usage data. 3,514 messages across 175 sessions. 7,939 Bash calls. 1,072 sub-agent runs. Let me walk through it without the marketing gloss.

What "context engineering" actually means
Prompt engineering is what you type into the box. Context engineering is what's already there before you type anything.
The slogan I keep coming back to: build the system so you never have to re-explain the same thing twice. Every rule, guardrail, or workflow you save up front pays you back the next time you sit down. The first sprint is slow because you're pouring the foundation. The ninth is fast because you're not.
A second way to put it: prompt engineering tunes one turn. Context engineering tunes the project's memory across all turns. It's closer to onboarding a new hire than writing a clever sentence. You don't write better prompts forever. You build a project that needs fewer of them.
The 4-layer context stack
My stack has four layers. Each one answers a different question: what's true, what's banned, how do we work, what did we ship.
Layer 1 CLAUDE.md Auto-injected every prompt = "context that never forgets"
Layer 2 .claude/rules/ Hooks that block violations before they happen
Layer 3 .claude/skills/ Slash commands wrapping reusable workflows
.claude/agents/ 36 domain experts the orchestrator can call
.claude/commands/ Composed primitives ("/pdca plan", "/sprint-start")
Layer 4 docs/01-plan/ Source-of-truth artifacts. The output of one phase
docs/02-design/ becomes the input of the next. Every sprint
docs/04-report/ adds to the stack — the stack grows with you.
docs/05-qa/Layer 1 is the constitution. Three CLAUDE.md files (root, common, sprint-master) inject about 470 lines of project facts into every prompt. The 30-sprint plan, the rules I must not break, the project conventions — all loaded for free. I never re-explain them across sessions.
Layer 2 isn't advice. It's enforcement. Files like architecture-rules.md, design-system-rules.md, and safety-rules.md are wired into pre-commit hooks and PreToolUse:Write hooks. They refuse a write if it breaks a rule.
Layer 3 is the workflow library. Layer 4 is the paper trail every sprint produces and the next sprint reads.
The key property: every sprint that closes makes the stack richer. New scripts go into scripts/qa/. New agents land in .claude/agents/. Refined templates live in .claude/templates/. Sprint N+1 starts on a stronger floor than sprint N did.
Where AI takes over (a real day)
The 9-sprint day wasn't heroic focus on my side. It was orchestration. The Claude Code usage report tells the story plainly:
- Bash: 7,939 calls
- Edit: 2,837
- Read: 2,676
- TaskUpdate: 1,997
- Write: 1,327
- Agent: 1,072
Multi-clauding (parallel sessions running at the same time) made up 36% of all messages. At one point I sent off 26 sub-agents in parallel for an architecture audit. The shape isn't "I write code with AI helping." It's "I run a small studio, and the studio is AI."
A single sprint, end to end, looks like this from my side:
/sprint-start refactor-H # bkit maps it from the Sprint Master Plan
/pdca plan sprint-refactor-H # plan.md auto-generated, 12 sections
/pdca design sprint-refactor-H # design.md + L1-L5 test plan matrix
/pdca do sprint-refactor-H # api-expert + frontend-expert agents
/pdca analyze sprint-refactor-H # gap-detector compares design vs code
/pdca iterate sprint-refactor-H # auto-fix loop until Match Rate ≥ 90%
/pdca qa sprint-refactor-H # qa-test-generator + qa-monitor
/pdca report sprint-refactor-H # completion report with quality gatesEight commands per sprint. Nine sprints. Seventy-two prompts to drive the day. Most of them were ceremony, not real choices. That friction is what pushed me to build the wrapper I'll show you in two sections.
For now: the bkit state machine refuses to advance if a quality gate fails — typecheck, architecture violations, matrix-to-code sync, L1-L5 tests. I don't run tsc --noEmit myself anymore. The gate does. I don't count architecture violations myself. measure-arch-violations.ts does. The result is checked in as arch-baseline.json, so the next sprint can't slide backwards without me hearing about it.
From one sprint to thirty
A single sprint isn't the unit of work. The four-month roadmap is. The plan lives in docs/01-plan/sprint-master/ — thirteen documents covering thirty sprints across seven phases. Each sprint has explicit dependencies on earlier ones.
One slash command, /sprint-status, reads four sources at once — bkit's pdca-status.json, the architecture-violation arch-baseline.json, the GitHub issue list, and the master plan markdown — and renders one screen:
Sprint Master Plan — Progress
─────────────────────────────────
Phase 0 — Merge baseline [done]
Sprint 00 — Merge Ready [✓] Match Rate 96%
Phase 1 — Beta (Apr 28 – May 2)
Sprint 01 — Beta P1.1 [in progress] design phase
Sprint 02 — Beta P1.2 [blocked by 01]
Cumulative KPIs
GitHub issues: 27 → 25 (2 closed)
Architecture violations: 79 → 79 (Sprint 27 target)
Matrix sync: 19.4% → 19.4%
Next: /sprint-start 02-beta-p1-2
The point isn't the layout. It's that "where am I?" never means scrolling back through the chat. It means one command, four data sources, one screen.
Dependency rules live in the plan too. Sprint 11 (cutover) has requires: sprint-03 (data migration). Sprint 30 (final E2E) has requires: sprint-12..29. Try to start out of order and a confirmation prompt fires. The plan refuses to be ignored quietly.
Wrapping generic PDCA in project context
There's a tension between two truths. bkit's /pdca plan / design / do / analyze / iterate / qa / report works for any project — that's its strength. But "any project" means it knows nothing about my project. It doesn't know that Sprint 17 is the payment epic, that Sprint 24 is the SEO push, or that the matrix files under docs/qa/matrices/ must update before the report can pass.
The fix is a thin wrapper. sprint-orchestrator is a project-local skill (.claude/skills/sprint-orchestrator.md) that catches /sprint-start NN, auto-loads the project context, then runs /pdca underneath:
user project layer bkit (vendor plugin)
──── ───────────── ────────────────────
/sprint-start 17 → sprint-orchestrator → /pdca plan
(auto-loads payment /pdca design
matrices, payment- /pdca do
expert, gates, /pdca analyze
baselines) /pdca iterate
/pdca qa
/pdca reportSprint 17 starts → payment-expert joins automatically. Sprint 24 → seo-expert. Sprint 27 → arch-auditor. I never re-explain area context.
The wrapper collapses eight commands per sprint into one. The 9-sprint day cost me 72 prompts: 8 × 9. Today the same nine sprints would cost nine. The skill that closed that gap was born from the friction of typing 72 of them.
Six gates that refuse to advance
Generic PDCA doesn't tell you when to stop. Project context does. sprint-orchestrator has six quality gates wired into the phase transitions. Each one is measured by a script committed to the repo:
Gate Threshold Tool
───────────────────────── ────────────── ──────────────────────────
Match Rate >= 95% gap-detector
Architecture violation delta = 0 measure-arch-violations.ts
Matrix sync >= 95% verify-api-matrix.ts
typecheck 0 errors tsc --noEmit
lint 0 warnings eslint --max-warnings 0
test PASS pnpm testMatch rate below 95? /pdca iterate runs by itself, up to five cycles. Architecture delta non-zero? The sprint stops; the fix is forced before the next phase. Matrix below 95%? The matrix-synchronizer agent gets dispatched. There's no path forward that skips a red gate.
The change isn't on the AI side. AI ran without gates before, and it would happily declare done. The gates are for me. The system doesn't let the human forget the checklist.
Where I still hold the line
The 26-agent fan-out is impressive. The matching report on outcomes is also revealing. Of my sessions: 49 "fully achieved", 34 "mostly", 14 "partially", 4 "not achieved". The same data shows 235 "likely satisfied" sessions versus 34 "dissatisfied". The dissatisfied ones cluster around one failure mode: AI declared done before it actually was.
Anthropic's own analysis of my sessions named the pattern — premature completion claims — and noted I "have zero tolerance for theater." That sentence I'll keep.
These are the five things I never delegate:
What AI does What I keep
───────────────────────────────── ─────────────────────────────────
Bash, Edit, Read, Write at scale Direction ("ship refactor-H next")
1,072 sub-agent invocations Quality bar ("real E2E, not probes")
26 parallel agents in one fan-out Convention trade-offs (Q1–Q4 picks)
Quality gates, matrix sync, lint Choice of verification tool
27 plan/design/report markdowns Noticing what is missingConvention trade-offs are the most underrated. In the 9-sprint day, the orchestrator stopped four times to ask me a direct question:
- An Auth Guard naming choice (consistency vs back-compat)
- A poll-vote business rule (one-shot vs editable)
- A component name (two reasonable options)
- An SDK namespace (match the existing convention or break it)
None of those are technical questions an AI can settle on the merits. They're taste and product calls. The harness is good enough to know which is which, and to ask.
The 682GB lesson
Here's mine. During an unattended Sprint 4.5 audit, the audit-logger Claude wrote had an infinite recursion bug. logger.info() called itself inside its own hook. By the time I noticed, it had filled 682 GB of disk. No one was hurt. The laptop was.
The pattern in the report has a name: self-inflicted bugs and resource issues during long autonomous runs. The lesson isn't "AI is unreliable." The lesson is that long-running autonomy without circuit breakers is the unreliability. With them, the same agent is fine.
The fix was small enough to fit in one block:
import pino from 'pino'
const logger = pino({
hooks: {
logMethod(args, method) {
// BEFORE: re-entered logger.info() via the hook → unbounded recursion
// AFTER: pass through the underlying method directly
return method.apply(this, args)
},
},
})The lesson now lives in two places. One: the code, with the comment. Two: the hooks. safety-rules.md got an addendum. PreToolUse:Bash started watching du -sh ~/.cache style indicators. The QA stage runs disk checks before declaring success. The cost of one 682GB mistake was a guardrail that prevents the next ten.
What this changes about being a founder
5 × 5 = 25 became 1 × 1 = 0.2. The math is real. What it changed about my day is also real — and unevenly spread.
The hours I used to spend typing are gone. They weren't replaced by leisure. They were replaced by hours of judgment density. Deciding what to ship. Deciding what counts as done. Picking the right two of four options for this product. Pushing back on output that smelled fine but wasn't.
AX-redesign wasn't optional. The math forced it on me. If I don't redesign my own role, the time I just gained gets spent on more typing.
The other quiet shift is patience around the foundation. The same instinct that makes a founder ship fast — "I'll do it once, ugly, and move on" — fights the discipline that context engineering rewards. The right move was the opposite. Spend a week writing rules and templates that make the next ten weeks compound. The first sprint felt slow. By the ninth it didn't feel like effort.
The discipline before the payoff
Without context engineering, a faster model is just a faster way to repeat the same mistake. The 4-layer stack isn't glamorous. There's nothing in it you can't reproduce in your own repo this afternoon. The payoff is on the other side of that hour.
If you're vibe-coding without rules, without skills, without docs/* artifacts, without hooks — you're not getting AI's help. You're getting AI's noise. The fix isn't a better prompt. It's a better project.
That's what tene and bkit are, in the end — opinionated context engineering for two specific problems (secret-handling and PDCA delivery). I keep building them because the payoff I described in this post is conditional. It exists when the stack exists. It's gone the moment it doesn't.
Terms used in this post
Context engineering — The practice of building durable project memory — rules, docs, hooks, workflows — so AI doesn't need to be re-briefed every session.
Prompt engineering — The narrower craft of writing one good message into the AI box. Useful, but limited to a single turn.
Harness — The system around the AI — the rules, hooks, slash commands, and gates — that turns a chatbot into a reliable worker.
Agent (sub-agent) — A Claude session spawned for a specific job. The orchestrator delegates work to many agents and collects their results.
Orchestrator — The top-level Claude session that plans, delegates to sub-agents, and decides what happens next.
PDCA — Plan → Do → Check → Act. A four-step loop bkit uses (split into seven sub-phases) to drive each sprint.
Quality gate — An automated check that blocks a phase from advancing until the result passes a threshold (e.g., 0 typecheck errors).
Multi-clauding — Running several Claude sessions in parallel on different parts of the project at the same time.
FAQ
Doesn't this mean Claude does everything and you do nothing?
It's the opposite. The more execution Claude takes over, the higher the share of work that is judgment — direction, quality bar, convention trade-offs, noticing what is missing. 25 person-weeks did not become 0.2 person-weeks because the code volume shrunk. It happened because the decision density was compressed into one head.
Why not just use Cursor or Copilot — why all this context engineering setup?
Cursor and Copilot are turn-based assistants. The stack in this article assumes a long-horizon orchestrator — one session running 9 sprints × 7 PDCA phases = 63 phase transitions. That requires project-scoped context, not conversation-scoped context. They are different tools. You can run both.
What if my project has no docs, no rules, no .claude/ setup?
Spend 70% of your first sprint building the stack itself. It feels wasteful. From the second sprint on, you get it back with interest. The 9-sprint-in-a-day speed is the compound result of six prior weeks of stack accumulation, not raw model speed. Context engineering is an asset that grows — not a tool you reach for once.
Did the 682GB recursion bug not break your trust in AI?
It reinforced it. AI running solo will produce that class of bug — the data confirmed it. Which is exactly why the PDCA Check phase has automated gates and the human still asks 'are you really done?' three times. Pushback is not skepticism. It is protocol.
Related reading: