bkit + harness engineering: workflow beats model choice

The model churn trap

Every week another flagship model ships. Opus 4.8. GPT-5.1. A new Cursor mode. The demo videos look wild. Yet the code you push on Friday looks a lot like the code you pushed six months ago. The pattern I keep seeing in vibe-coding sessions: developers upgrade the model, not the system around it. A better model inside a thin workflow just makes the same mistakes faster, with more confidence. The real lift is not where most people are looking.

bkit — '하면 돼' (just do it). Methodology layer for Claude Code.

What a harness actually is

In the AI systems crowd, "harness engineering" names the layer between a raw model call and a system that works. Context curation, tool orchestration, state machines, guardrails, retry and iterate loops, audit trails — all of it. Prompt engineering tunes one request. Harness engineering tunes the loop that request lives inside.

Claude Code, Cursor, and Codex are harnesses. They decide what goes into your context window, when to spawn subagents, which tools are on offer, and how sessions persist. A raw API call to Claude 4.7 has none of that by default. When people complain that "the agent hallucinated my file paths," they're usually describing a harness failure, not a model failure.
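To make the distinction concrete, here is a minimal sketch of a harness around a single model call: context curation, a guardrail, a retry loop, and an audit trail. Everything in it is illustrative. `run_harness`, `fake_model`, and the file-path guardrail are hypothetical names, and the model is a stub rather than a real API call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class HarnessResult:
    output: str
    attempts: int
    passed: bool
    audit_log: list[str] = field(default_factory=list)


def run_harness(
    task: str,
    call_model: Callable[[str], str],
    curate_context: Callable[[str], str],
    passes_guardrail: Callable[[str], bool],
    max_attempts: int = 3,
) -> HarnessResult:
    """Wrap one model call in context curation, a guardrail, and retries."""
    audit_log: list[str] = []
    prompt = curate_context(task)
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = call_model(prompt)
        audit_log.append(f"attempt {attempt}: {output!r}")
        if passes_guardrail(output):
            return HarnessResult(output, attempt, True, audit_log)
        # Feed the rejection back so the next attempt sees what went wrong.
        prompt = f"{prompt}\nPrevious answer was rejected: {output}"
    return HarnessResult(output, max_attempts, False, audit_log)


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call: it invents a path on the first try
    # and only answers with a real one after being told it was rejected.
    return "src/auth.py" if "rejected" in prompt else "src/missing.py"


known_files = {"src/auth.py", "src/db.py"}
result = run_harness(
    task="Which file holds the login handler?",
    call_model=fake_model,
    curate_context=lambda t: f"Files in repo: {sorted(known_files)}\nTask: {t}",
    passes_guardrail=lambda out: out in known_files,
)
print(result.passed, result.attempts, result.output)  # True 2 src/auth.py
```

The stub model is equally wrong on its first attempt with or without the harness. The difference is that the guardrail catches the invented path and the loop gives the model a second try.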

Claude Code's built-in harness is the floor

Claude Code ships with real harness features. SubAgents for isolated parallel work. MCP servers for outside tools. Hooks that fire on lifecycle events. Slash commands and skills for reusable prompts. Memory files saved to disk. All of that was engineered. It's enough to feel the productivity jump from raw API calls.

But three things are still missing by default:

Method: there's no enforced Plan / Design / Do / Check cycle. You either keep one in your head, or you drift.
Quality gates: nothing checks whether the code matches the design doc at a rate of 90% or better.
Trust-graduated automation: you get binary allow / deny, not a slider that matches how much you trust the agent.

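Here's a tiny hooks.json in Claude Code's hook format. It's real, but it gets built per project, per developer. The script path is a placeholder for a check you write yourself, one that exits with code 2 when the command is an rm -rf:

{
  "hooks": {
    "PreToolUse": [
      { "matcher": "Bash", "hooks": [{ "type": "command", "command": "python3 hooks/block_rm_rf.py" }] }
    ]
  }
}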

That's harness engineering at its rawest. Useful, but tiny.
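The decision behind a rule like this can live in a small script. Below is a minimal sketch in Python. It assumes the hook receives the pending tool call as JSON on stdin, with `tool_name` and `tool_input.command` fields, and that exit code 2 signals a block. The regex and the `should_block` and `decide` names are illustrative and not part of Claude Code or bkit.

```python
import json
import re
import sys

# Illustrative pattern: "rm" followed by one flag group containing both r and f
# (for example -rf, -fr, -Rf). It does not catch split flags like "rm -r -f".
RM_RF = re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+", re.IGNORECASE)


def should_block(payload: dict) -> bool:
    """Return True when the pending tool call is a recursive force delete."""
    if payload.get("tool_name") != "Bash":
        return False
    command = payload.get("tool_input", {}).get("command", "")
    return bool(RM_RF.search(command))


def decide(raw_json: str) -> int:
    """Map a hook payload to an exit code: 2 blocks the call, 0 allows it."""
    if should_block(json.loads(raw_json)):
        print("Blocked: recursive force delete", file=sys.stderr)
        return 2
    return 0


# In an actual hook script the last line would be:
#     sys.exit(decide(sys.stdin.read()))
sample = json.dumps({"tool_name": "Bash", "tool_input": {"command": "rm -rf /tmp/build"}})
print(decide(sample))  # 2
```

Keeping the decision in a pure function (`should_block`) means the rule can be unit-tested without launching an agent session.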

bkit: a method layer on top of the harness

bkit is a Claude Code plugin — 39 Skills, 36 Agents, 21 hook events, 128 lib modules — that treats CC as its base and adds a method layer above it. It doesn't compete with Claude Code. It plugs into every extension point CC already exposes.

The stack reads top down, highest layer first:

Layer	Components	Role
bkit	PDCA · Quality gates · L0–L4 trust score	methodology
Claude Code	SubAgents · MCP · Hooks · Skills	primitives
Claude API	Inference + tools	model

The user-facing surface is a small set of slash commands. Each one moves a single feature through a state machine. Each transition has rules to pass — not a free-text prompt:

/pdca pm user-auth         # requirements -> PRD
/pdca plan user-auth       # plan doc with acceptance criteria
/pdca design user-auth     # 3 architectural options; pick one
/pdca do user-auth         # implementation guided by the design
/pdca analyze user-auth    # gap-detector: design vs impl match-rate
/pdca iterate user-auth    # auto-fix until match-rate >= 90%
/pdca report user-auth     # completion doc with metrics
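A gated state machine of this kind can be sketched in a few lines. This is an illustration, not bkit's implementation. The phase names mirror the commands above, while the gate conditions, the artifact keys, and the `FeatureWorkflow` class are assumptions. The iterate step is left out because it loops inside the analyze phase instead of being a phase of its own.

```python
from enum import Enum


class Phase(Enum):
    PM = "pm"
    PLAN = "plan"
    DESIGN = "design"
    DO = "do"
    ANALYZE = "analyze"
    REPORT = "report"


ORDER = list(Phase)

# Hypothetical gates: what must exist before a feature may ENTER each phase.
GATES = {
    Phase.PLAN: lambda a: "prd" in a,
    Phase.DESIGN: lambda a: bool(a.get("acceptance_criteria")),
    Phase.DO: lambda a: "chosen_design" in a,
    Phase.ANALYZE: lambda a: "diff" in a,
    Phase.REPORT: lambda a: a.get("match_rate", 0.0) >= 0.90,
}


class FeatureWorkflow:
    """Tracks one feature; a transition succeeds only if its gate passes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.phase = Phase.PM
        self.artifacts: dict = {}

    def advance(self) -> Phase:
        idx = ORDER.index(self.phase)
        if idx == len(ORDER) - 1:
            raise ValueError("already at the final phase")
        target = ORDER[idx + 1]
        if not GATES[target](self.artifacts):
            raise PermissionError(f"gate for {target.value!r} not satisfied")
        self.phase = target
        return target


wf = FeatureWorkflow("user-auth")
try:
    wf.advance()  # no PRD yet, so the plan gate refuses
except PermissionError as err:
    print(err)  # gate for 'plan' not satisfied
wf.artifacts["prd"] = "PRD text"
print(wf.advance().value)  # plan
```

The point of the structure is that a transition is a checked function call. A free-text prompt cannot skip a phase, because the gate reads artifacts and not intentions.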

The gap-detector agent compares the design doc to the actual code diff and reports a match rate. Below 90%, /pdca iterate loops on its own, capped at five tries. The model never changed. The workflow did.
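The loop itself is simple once the match rate is a number. The sketch below assumes a deliberately crude metric, the share of design items that appear in the implementation, which is a stand-in for whatever the gap-detector actually measures. `match_rate`, `iterate_until_match`, and `fix_one_gap` are hypothetical names. Only the 90% threshold and the cap of five tries come from the description above.

```python
from typing import Callable


def match_rate(design_items: set[str], implemented: set[str]) -> float:
    """Share of design items present in the implementation (assumed metric)."""
    if not design_items:
        return 1.0
    return len(design_items & implemented) / len(design_items)


def iterate_until_match(
    design_items: set[str],
    implemented: set[str],
    fix_gaps: Callable[[set[str], set[str]], set[str]],
    threshold: float = 0.90,
    max_iterations: int = 5,
) -> tuple[set[str], list[float]]:
    """Re-run the fixer until the match rate clears the threshold or the cap hits."""
    history = [match_rate(design_items, implemented)]
    while history[-1] < threshold and len(history) - 1 < max_iterations:
        gaps = design_items - implemented
        implemented = fix_gaps(implemented, gaps)
        history.append(match_rate(design_items, implemented))
    return implemented, history


design = {f"req-{i}" for i in range(10)}
built = {f"req-{i}" for i in range(6)}


def fix_one_gap(implemented: set[str], gaps: set[str]) -> set[str]:
    # Stand-in for a model-driven fix: closes one gap per pass.
    return implemented | {sorted(gaps)[0]}


final, history = iterate_until_match(design, built, fix_one_gap)
print(history)  # [0.6, 0.7, 0.8, 0.9]
```

The cap matters as much as the threshold. A fixer that makes no progress stops after five passes instead of looping forever.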

Why this beats another model upgrade

Three concrete cases where the harness layer pays off more than picking a bigger model.

Evaluator-Optimizer on the same model. The 90% match-rate loop calls the same model twice — once to build, once to critique against the design doc. In practice this closes gaps a single Opus pass misses. You didn't need a smarter model. You needed a second look.

Sentinels that watch upstream CC. Two agents — cc-version-researcher and bkit-impact-analyst — track Claude Code releases and auto-check whether anything in your workflow broke. When CC v2.1.64 closed four memory leaks in Agent Teams, bkit surfaced it without anyone reading release notes.

Automation you can graduate. /control level moves the system from L0 (manual) through L4 (full auto) based on a trust score built from your track record. The same model behaves differently based on what it has earned in your project:

/control status
# Level: L2 (Semi-Auto)
# Trust Score: 0.78 (23 PDCA cycles, 91% avg match-rate)
# Routine transitions auto · key decisions gated
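One way to picture the slider is as a mapping from trust score to level, plus a rule for which actions each level still gates. The cut-offs below are invented for illustration, chosen only so that a score of 0.78 lands on L2 as in the status output above. The approval rules for L1 and L3 are likewise assumptions, as are the names `level_for` and `needs_approval`.

```python
# Hypothetical cut-offs: (minimum trust score, level), highest first.
LEVEL_FLOORS = [(0.95, 4), (0.85, 3), (0.60, 2), (0.30, 1), (0.0, 0)]


def level_for(trust_score: float) -> int:
    """Map a 0-1 trust score to an automation level L0-L4."""
    if not 0.0 <= trust_score <= 1.0:
        raise ValueError("trust score must be between 0 and 1")
    for floor, level in LEVEL_FLOORS:
        if trust_score >= floor:
            return level
    return 0


def needs_approval(level: int, action_kind: str) -> bool:
    """Assumed policy: routine steps run on their own from L2, key decisions only at L4."""
    if action_kind not in ("routine", "key"):
        raise ValueError(f"unknown action kind: {action_kind!r}")
    if action_kind == "routine":
        return level < 2
    return level < 4


level = level_for(0.78)
print(level)                             # 2
print(needs_approval(level, "routine"))  # False
print(needs_approval(level, "key"))      # True
```

Under this sketch the same agent, on the same model, asks for sign-off on every step at L0 and on nothing at L4. What changes is the score, not the model.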

Build your harness, not your prompt

The model is a leaf. The workflow is the tree.
Claude Code's harness is the ground floor. bkit adds a method layer on top of it, not an alternative to it.
A 90% match-rate gate plus auto-iterate works because the check is structural, not because the model is cleverer.
Trust-graduated automation (L0–L4) beats binary allow / deny once you're running dozens of cycles a week.
The next model release won't fix your workflow. Your workflow decides how much you can squeeze out of the next model.

If you're vibe-coding every day, your real lift isn't the next flagship. It's how tight the harness is around whatever model you call. Tighten that, and each model upgrade starts compounding instead of evaporating.

Terms used in this post

Harness engineering — Building the loop a model lives inside — context, tools, state, retries, guardrails. Bigger than just prompt tuning.

Agent / subagent — A model call that runs in its own context, often with a single job (review code, draft a plan). Agents can call other agents.

MCP server — A small server that hands a model structured tools to call. Cleaner than putting everything in the prompt.

Gap-detector — A bkit agent that diffs the design doc against the code and reports a match rate plus a list of gaps.

Match rate — How well the code matches the design doc. 0–100%. Below 90%, the iterate loop kicks in.

Trust score — A 0–1 number based on your project track record. Drives which L0–L4 automation level you can run at.

PDCA — Plan, Do, Check, Act. A four-step cycle. bkit turns it into a state machine with clear gates between phases.

FAQ

Does the model choice not matter at all?

It matters, but as a second-order effect. Inside a thin workflow, a better model just repeats the same mistakes faster. The real power is in the loop: what context the model sees, how its output is evaluated, and what happens when it's wrong. bkit's gap-detector and auto-iterate close failure modes that a bigger model alone does not.

Can I get the same result from plain Claude Code, without bkit?

Yes, if you are willing to hand-roll the workflow every session: write plan and design docs, review iterations, set your own guardrails. bkit encodes that discipline as a state machine and a set of agents so you do not have to rebuild it per project. It is the difference between a framework and a convention.

Is 'harness engineering' an established term?

It became common in 2025 as multi-agent systems matured. The idea: LLM inference is a leaf, and the surrounding engineering — context curation, tool orchestration, state, guardrails, retry loops — is what determines real-world reliability. Prompt engineering is one piece inside a harness.