9 min read · by tomo-kay

bkit: PDCA methodology for Claude Code

bkit encodes PDCA methodology into Claude Code: Skills, Agents, Hooks, MCP, and a state machine with quality gates from plan to report.

What bkit is, in one paragraph

bkit is a Claude Code plugin that adds a methodology layer on top of the CC harness. Concretely: 39 Skills, 36 Agents, 21 hook events, 2 MCP servers, and a PDCA (Plan-Do-Check-Act) state machine with 20 guarded transitions. It is not a rewrite of Claude Code or a replacement for it — it installs into CC via /plugin install bkit and hangs off every extension point CC already exposes. The source is open on GitHub. If you agree with the thesis that workflow around a model matters more than the model itself, bkit is an opinionated implementation of that workflow.

[Image: bkit wordmark with the tagline '하면 돼' (Korean, roughly "just do it")]
bkit — PDCA methodology, made executable inside Claude Code.

The short version: you type /pdca plan user-auth instead of "please write a plan for user auth," and from there every phase becomes a state transition with acceptance criteria, not a free-text prompt.

The four building blocks

bkit is assembled from four primitives Claude Code already understands, but composed into a methodology rather than left as raw parts.

Skills are reusable prompts exposed as slash commands. Each has a trigger vocabulary (multilingual), a phase gate (which PDCA stage it belongs to), and a set of allowed tools. /pdca, /control, /enterprise, /starter, /dynamic are all skills. A skill's frontmatter declares its interface:

---
name: pdca
triggers: [pdca, plan, design, analyze, report, 계획, 설계]
allowedTools: [Read, Write, Edit, Bash, Task]
classification: Workflow
phaseGate: all
---
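
To make the frontmatter concrete, here is a minimal sketch of how a loader might read that block and test a prompt against the trigger vocabulary. It is not bkit's actual loader: `parse_frontmatter` and `matches_trigger` are hypothetical names, and the parser handles only the flat `key: value` and `[a, b]` forms shown above rather than full YAML.

```python
SKILL_FRONTMATTER = """\
---
name: pdca
triggers: [pdca, plan, design, analyze, report, 계획, 설계]
allowedTools: [Read, Write, Edit, Bash, Task]
classification: Workflow
phaseGate: all
---
"""


def parse_frontmatter(text: str) -> dict:
    """Parse a flat frontmatter block wrapped in '---' lines (not full YAML)."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if len(lines) < 2 or lines[0] != "---" or lines[-1] != "---":
        raise ValueError("frontmatter must be wrapped in '---' lines")
    meta = {}
    for ln in lines[1:-1]:
        if not ln:
            continue
        key, _, value = ln.partition(":")
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            # Inline list: split on commas and drop empty entries.
            meta[key.strip()] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = value
    return meta


def matches_trigger(meta: dict, prompt: str) -> bool:
    """Hypothetical matcher: true if any whitespace-separated word is a trigger."""
    words = prompt.lower().split()
    return any(trigger.lower() in words for trigger in meta.get("triggers", []))


meta = parse_frontmatter(SKILL_FRONTMATTER)
```

With this sketch, `matches_trigger(meta, "please plan user-auth")` is true because "plan" is in the trigger list, and the Korean triggers match the same way.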

Agents are role-based subagents with per-role model assignments. cto-lead runs on Opus for architectural decisions. gap-detector runs on Sonnet for cheap, fast comparisons. code-analyzer is read-only. Each agent defines its own memory scope, max turns, and disallowed tools — so "review the design" and "critique the implementation" are not the same model call with different prompts, they are different agents with different capabilities and cost profiles.
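
One way to picture "different agents with different capabilities" is as a small record per role. The sketch below is illustrative only: `AgentSpec` and its fields are hypothetical, the model names come from the paragraph above, and the specific tools denied to code-analyzer are an assumption about what "read-only" means. Memory scope and max turns are left out because the article gives no values for them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical per-role record; not bkit's real agent schema."""
    name: str
    model: Optional[str]  # None where the article does not name a model
    disallowed_tools: frozenset = frozenset()

    def can_use(self, tool: str) -> bool:
        return tool not in self.disallowed_tools


AGENTS = {
    "cto-lead": AgentSpec("cto-lead", model="opus"),
    "gap-detector": AgentSpec("gap-detector", model="sonnet"),
    # Assumption: "read-only" is modeled as denying the write-capable tools.
    "code-analyzer": AgentSpec(
        "code-analyzer",
        model=None,
        disallowed_tools=frozenset({"Write", "Edit", "Bash"}),
    ),
}
```

The point of the structure is that capability and cost live on the role, so swapping a prompt never silently grants write access.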

Hooks intercept lifecycle events. 21 events across 6 layers: SessionStart, PreToolUse, PostToolUse, UserPromptSubmit, TaskCompleted, PreCompact, and more. bkit uses hooks for context injection (so every session starts with PDCA state in scope), audit logging, destructive-op blocking, and token ledger accounting.
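
As an illustration of destructive-op blocking, the sketch below shows the decision logic a PreToolUse hook script could run. It assumes, as I understand the Claude Code hooks documentation, that the hook receives a JSON event on stdin carrying `tool_name` and `tool_input`, and that exit code 2 blocks the tool call. The pattern list and the function names are hypothetical and are not bkit's real rules.

```python
import json
import re

# Hypothetical deny-list; a real hook would carry a longer, reviewed set.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bgit\s+reset\s+--hard\b"),
    re.compile(r"\bgit\s+push\b.*--force"),
]

ALLOW = 0
BLOCK = 2  # assumed: exit code 2 tells Claude Code to block the tool call


def exit_code_for(event: dict) -> int:
    """Return the exit code a PreToolUse hook should use for this event."""
    if event.get("tool_name") != "Bash":
        return ALLOW
    command = event.get("tool_input", {}).get("command", "")
    if any(pattern.search(command) for pattern in DESTRUCTIVE_PATTERNS):
        return BLOCK
    return ALLOW


def run_hook(raw_json: str) -> int:
    """Entry point for the sketch: parse the event text and decide."""
    return exit_code_for(json.loads(raw_json))
```

A real hook script would end with something like `sys.exit(run_hook(sys.stdin.read()))`. It is left out here so the sketch can be imported and tested without reading stdin.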

MCP servers (two of them) expose structured tools so the model does not reason about state from memory alone. bkit-pdca offers bkit_pdca_status, bkit_plan_read, bkit_design_read, bkit_metrics_get. bkit-analysis offers bkit_gap_analysis, bkit_code_quality, bkit_regression_rules. They read from durable files in .bkit/state/, so session restarts never lose PDCA state.
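
The durability claim comes down to reading and writing files rather than relying on conversation memory. A minimal sketch follows, assuming one JSON file per feature under .bkit/state/. The file naming, the JSON shape, and the function names are all hypothetical, since the article only names the directory.

```python
import json
import os
from pathlib import Path


def state_path(project_root: Path, feature: str) -> Path:
    # Hypothetical layout: .bkit/state/<feature>.json
    return project_root / ".bkit" / "state" / f"{feature}.json"


def save_state(project_root: Path, feature: str, state: dict) -> None:
    """Write state to a temp file, then swap it in with os.replace."""
    path = state_path(project_root, feature)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    os.replace(tmp, path)  # a crash mid-write cannot leave a half-written file


def load_state(project_root: Path, feature: str) -> dict:
    """Return the saved state, or an empty record if none exists yet."""
    path = state_path(project_root, feature)
    if not path.is_file():
        return {"feature": feature, "phase": None}
    return json.loads(path.read_text(encoding="utf-8"))
```

Because a new session calls `load_state` instead of asking the model to remember, a restart picks up exactly where the last write left off.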

The composition is what makes it work. Skills expose verbs, agents execute them, hooks guard the edges, MCP servers persist state. No single piece is exotic; the leverage is in how they fit together.

Install and first run

Assuming you already have Claude Code installed:

/plugin marketplace add popup-studio-ai/bkit-claude-code
/plugin install bkit
/output-style bkit-learning

Three commands. The first registers bkit's marketplace source, the second installs the plugin (Skills, Agents, Hooks, MCP servers, and output styles, all wired at once), and the third picks an output style that teaches you as you go. Four output styles ship with bkit: bkit-learning, bkit-pdca-guide, bkit-enterprise, and bkit-pdca-enterprise. bkit-learning is the gentle one, recommended for your first project.

After install, /bkit help lists what is available. A typical session opens with bkit injecting the current PDCA state into context via the SessionStart hook, so Claude knows which feature you were working on and which phase it had reached without you retyping it.

The PDCA state machine in action

The flagship user-facing command is /pdca. It moves a single feature through seven phases, each a guarded transition with its own acceptance criteria:

/pdca pm user-auth         # requirements -> PRD
/pdca plan user-auth       # plan doc with acceptance criteria
/pdca design user-auth     # 3 architectural arcs; pick one
/pdca do user-auth         # implementation guided by the design
/pdca analyze user-auth    # gap-detector: design vs impl match-rate
/pdca iterate user-auth    # auto-fix until match-rate >= 90%
/pdca report user-auth     # completion doc with metrics
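
The phrase "guarded transition" can be sketched as a lookup table plus one guard. The table below is a simplified assumption built from the seven commands above. It is not bkit's actual set of 20 transitions, and `can_transition` is a hypothetical name. The only guard modeled is the match-rate gate in front of the report phase.

```python
from typing import Optional

# Simplified, assumed transition table over the seven phases.
ALLOWED = {
    "pm": {"plan"},
    "plan": {"design"},
    "design": {"do"},
    "do": {"analyze"},
    "analyze": {"iterate", "report"},
    "iterate": {"analyze", "report"},
    "report": set(),
}

MATCH_THRESHOLD = 90.0


def can_transition(current: str, target: str, match_rate: Optional[float] = None) -> bool:
    """Allow a move only if it is in the table and its guard passes."""
    if target not in ALLOWED.get(current, set()):
        return False
    if target == "report":
        # Guard: reporting requires a measured match rate at or above the gate.
        return match_rate is not None and match_rate >= MATCH_THRESHOLD
    return True
```

Under this sketch, skipping from plan straight to do is rejected, and a feature at an 85% match rate can move to iterate but not to report.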

At each phase, bkit writes a document to disk: docs/00-pm/user-auth.prd.md, docs/01-plan/user-auth.plan.md, docs/02-design/user-auth.design.md, and so on. These are not throwaway artifacts; they are the ground truth that later phases reference. When /pdca analyze runs, it compares the design doc against the implementation diff and produces a match rate.

Along the way there are five interactive checkpoints, each a small pause with a structured question:

  • CP1 (after PM analysis): do the requirements match what you meant?
  • CP2 (after plan): acceptance criteria look right?
  • CP3 (after design): three arcs drafted — which one?
  • CP4 (before do): scope fits the design?
  • CP5 (after analyze): ship, iterate, or rework the design?

The checkpoints are the human-in-the-loop safety valve. You do not get steamrolled; you get asked. In L0 or L1 automation, checkpoints are mandatory. In L2+, the routine ones auto-confirm and only the key ones stay human-gated.
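
The gating rule in that paragraph is small enough to write down. This sketch follows only what is stated here: every checkpoint is human-gated at L0 and L1, and from L2 upward only the key ones are. Treating CP3 and CP5 as the key checkpoints, and the name `needs_human`, are assumptions of the sketch. Any further loosening at L3 or L4 is not modeled.

```python
CHECKPOINTS = ("CP1", "CP2", "CP3", "CP4", "CP5")
KEY_CHECKPOINTS = {"CP3", "CP5"}  # assumed: design selection and ship/iterate/rework


def needs_human(checkpoint: str, level: int) -> bool:
    """Return True if this checkpoint should pause for a person at this level."""
    if checkpoint not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {checkpoint}")
    if not 0 <= level <= 4:
        raise ValueError(f"automation level must be 0..4, got {level}")
    if level <= 1:
        return True  # L0 and L1: every checkpoint is mandatory
    return checkpoint in KEY_CHECKPOINTS  # L2+: routine ones auto-confirm
```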

For a first-time user, the full flow of a small feature takes about ten minutes of typing and fifteen of the agent actually working. What comes out is four documents plus implementation plus metrics — not just code.

Quality gates and auto-iterate

The crucial agent in this cycle is gap-detector. When /pdca analyze runs, gap-detector:

  1. Reads the design doc (docs/02-design/user-auth.design.md).
  2. Walks the implementation diff for the feature.
  3. Lines up design intent against implementation reality.
  4. Produces a structured match rate (0–100%) and a list of specific gaps.
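
The arithmetic behind step 4 can be illustrated with a toy version. The real gap-detector is a model-driven agent, so this is only a sketch of the output shape: it assumes the design has already been reduced to a list of named items and the implementation to the set of items found, both of which are hypothetical representations.

```python
from typing import Iterable, List, Tuple


def match_rate(design_items: List[str], implemented: Iterable[str]) -> Tuple[float, List[str]]:
    """Return (match rate in percent, design items missing from the implementation)."""
    if not design_items:
        return 100.0, []
    found = set(implemented)
    gaps = [item for item in design_items if item not in found]
    rate = 100.0 * (len(design_items) - len(gaps)) / len(design_items)
    return rate, gaps


design = ["login endpoint", "password hashing", "session tokens", "rate limiter"]
built = {"login endpoint", "password hashing", "session tokens"}
rate, gaps = match_rate(design, built)
# Three of four design items are present, so rate is 75.0 and the one gap
# is the rate limiter.
```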

The threshold is 90%. Below that, /pdca iterate kicks off a loop: read gaps, patch each gap via an implementation agent, re-run gap-detector, repeat. The loop is capped at five iterations. If it still fails after five, bkit stops and asks a human — it does not silently ship a 60% match.
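
The loop described above can be sketched in a few lines. The 90% threshold and the cap of five come from the text. The function names and the callback shapes (`analyze` returning a rate plus a gap list, `patch` taking one gap) are assumptions made for illustration, not bkit's internal interfaces.

```python
from typing import Callable, List, Tuple


def iterate_until_match(
    analyze: Callable[[], Tuple[float, List[str]]],
    patch: Callable[[str], None],
    threshold: float = 90.0,
    max_iterations: int = 5,
) -> Tuple[float, int, bool]:
    """Patch reported gaps and re-analyze until the gate passes or the cap is hit."""
    rate, gaps = analyze()
    iterations = 0
    while rate < threshold and iterations < max_iterations:
        for gap in gaps:
            patch(gap)
        iterations += 1
        rate, gaps = analyze()
    return rate, iterations, rate >= threshold


def simulate(patch_works: bool) -> Tuple[float, int, bool]:
    """Toy run: 10 design items, 3 open gaps, patches either work or do nothing."""
    total = 10
    open_gaps = ["gap-1", "gap-2", "gap-3"]

    def analyze() -> Tuple[float, List[str]]:
        return 100.0 * (total - len(open_gaps)) / total, list(open_gaps)

    def patch(gap: str) -> None:
        if patch_works:
            open_gaps.remove(gap)

    return iterate_until_match(analyze, patch)
```

When patches work, the toy run passes after one iteration. When they do nothing, it stops at the cap with `passed` set to False, which is the point where a human would be asked instead of a 70% match shipping silently.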

This is the Evaluator-Optimizer pattern from the multi-agent literature: two roles, the generator and the critic, where the critic has an explicit spec (the design doc) to compare against. What makes this beat a bigger single-pass model is boring: most failure modes come from the generator forgetting a constraint the design doc made explicit. A second pass with the critic noticing "you never wired the rate limiter" is a cheaper fix than a smarter single-shot generator.

The same gap-detector pattern shows up in bkit_regression_rules — eight modules across the cc-regression library that detect reintroduced bugs after a CC or model upgrade. Same idea, different target.

L0–L4 automation and the trust score

One bkit detail that surprises new users is the automation level system. Your current level shows at the top of every session, and you can inspect it at any time:

/control status
# Level: L2 (Semi-Auto)
# Trust Score: 0.78 (23 PDCA cycles, 91% avg match-rate)
# Routine transitions auto · key decisions gated
# Next escalation: L3 available at Trust Score >= 0.85

The five levels:

  • L0 Manual — every action requires explicit approval. Useful for a first session where you do not trust the agent yet.
  • L1 Guided — routine actions proceed; every checkpoint is mandatory.
  • L2 Semi-Auto (default) — routine checkpoints auto-confirm; key ones (CP3 design selection, CP5 ship/iterate/rework) stay human-gated.
  • L3 Auto — most transitions automatic; only destructive operations and level escalations are gated.
  • L4 Full-Auto — fully autonomous PDCA cycles. Reserved for well-scoped features where you have repeatedly proven the loop works for you.

Graduation is not free. The trust score is a weighted function of your track record: completed cycles, average match rate, destructive-op count, interrupt frequency. It escalates slowly and comes with a cooldown so a lucky streak does not unlock autonomy before the system has seen you work.
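
A weighted function of those four inputs might look like the sketch below. The inputs are the ones named above, but the weights, the normalization, and the 30-cycle saturation point are invented for illustration. This is not bkit's formula, it will not reproduce the score in the status output shown earlier, and the cooldown is not modeled.

```python
def trust_score(
    completed_cycles: int,
    avg_match_rate: float,   # percent, 0..100
    destructive_ops: int,
    interrupts: int,
) -> float:
    """Hypothetical trust score in [0, 1]; all weights are illustrative."""
    cycles = max(completed_cycles, 1)  # avoid dividing by zero on a fresh project
    experience = min(completed_cycles / 30.0, 1.0)  # assumed to saturate at 30 cycles
    quality = max(0.0, min(avg_match_rate / 100.0, 1.0))
    destructive_rate = min(destructive_ops / cycles, 1.0)
    interrupt_rate = min(interrupts / cycles, 1.0)
    score = (
        0.3 * experience
        + 0.5 * quality
        + 0.1 * (1.0 - destructive_rate)
        + 0.1 * (1.0 - interrupt_rate)
    )
    return max(0.0, min(score, 1.0))
```

The shape matters more than the numbers: more completed cycles and higher match rates raise the score, while destructive operations and interrupts pull it down.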

/control level 3 escalates; /control level 0 always works as a panic brake. The intent is that automation is earned per project, not set once in global config.

Extending bkit + when to use it

Everything in bkit is overridable. Drop a pdca.skill.md into .claude/skills/ in your project and bkit's priority chain picks yours first. The resolution order:

Priority      Location                       Role
1 (highest)   .claude/skills/*.skill.md      project override (repo-committed, team-shared)
2             ~/.claude/skills/*.skill.md    user defaults (personal, cross-project)
3 (lowest)    {plugin}/skills/*/SKILL.md     bkit shipped defaults
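
The resolution order amounts to "first existing file wins". Here is a minimal sketch, assuming the `*` in each location stands for the skill name. The function names and the explicit `home` and `plugin_root` parameters are hypothetical, chosen so the lookup can be tested without touching a real home directory.

```python
from pathlib import Path
from typing import List, Optional


def candidate_paths(name: str, project_root: Path, home: Path, plugin_root: Path) -> List[Path]:
    """Return skill locations in priority order, highest first."""
    return [
        project_root / ".claude" / "skills" / f"{name}.skill.md",  # 1: project override
        home / ".claude" / "skills" / f"{name}.skill.md",          # 2: user defaults
        plugin_root / "skills" / name / "SKILL.md",                # 3: shipped defaults
    ]


def resolve_skill(name: str, project_root: Path, home: Path, plugin_root: Path) -> Optional[Path]:
    """Return the highest-priority skill file that exists, or None."""
    for path in candidate_paths(name, project_root, home, plugin_root):
        if path.is_file():
            return path
    return None
```

Committing a file at the first location is therefore enough to override the shipped default for everyone who clones the repo.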

The same override chain applies to agents, hooks, templates, and output styles. You can ship a team-specific qa-lead with your own KPIs, or swap out cto-lead entirely. The skill-create command walks you through authoring a new skill interactively. pm-lead-skill-patch is an example of non-invasive extension — it hooks into pm-lead's Phase 4 without editing the upstream file.

So when should you reach for bkit versus raw Claude Code? A rough heuristic:

  • Small scripts / one-off fixes — raw CC is fine. PDCA overhead is not worth it for a ten-line change.
  • Anything with a design intent you might forget by Thursday — bkit starts paying off at the design doc.
  • Features that cross multiple files or require review — gap-detector is where the real leverage is.
  • Team projects — the persistent docs/ artifacts (PRD, plan, design, report) become shared ground truth rather than private chat logs that evaporate when the window closes.

The five-line summary:

  • bkit is a Claude Code plugin that encodes PDCA, quality gates, and graduated automation into reusable commands.
  • Install is three commands; the first session teaches itself via the bkit-learning output style.
  • The /pdca flow takes one feature from requirements to completion with documents at every step.
  • gap-detector + 90% match-rate + max-five iterate is the core quality loop.
  • L0–L4 automation lets you graduate trust per project, not flip a global switch.

If Article 1 argued that workflow beats model choice, this is the concrete shape of that workflow. Install it, run one /pdca cycle on something small, and decide for yourself whether the methodology layer is worth it.
