The model churn trap
Every week another flagship model ships. Opus 4.8. GPT-5.1. A new Cursor mode. The demo videos look wild. Yet the code you push on Friday looks a lot like the code you pushed six months ago. The pattern I keep seeing in vibe-coding sessions: developers upgrade the model, not the system around it. A better model inside a thin workflow just makes the same mistakes faster, with more confidence. The real lift is not where most people are looking.

What a harness actually is
In the AI systems crowd, "harness engineering" names the layer between a raw model call and a system that works. Context curation, tool orchestration, state machines, guardrails, retry and iterate loops, audit trails — all of it. Prompt engineering tunes one request. Harness engineering tunes the loop that request lives inside.
Claude Code, Cursor, Codex — those are harnesses. They pick what goes into your context window, when to spawn subagents, which tools are on offer, how sessions stick around. A raw API call to Claude 4.7 has none of that by default. When people complain that "the agent hallucinated my file paths," they're usually describing a harness failure, not a model failure.
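To make the "harness failure, not model failure" point concrete, here is a minimal sketch of a harness step that refuses to trust file paths the model proposes. Everything in it is an assumption for illustration: `run_with_path_guard`, the `call_model` callable, and the feedback wording are hypothetical, and this is not how Claude Code or any other tool implements it.

```python
from pathlib import Path
from typing import Callable


def run_with_path_guard(
    call_model: Callable[[str], list],
    prompt: str,
    repo_root: Path,
    max_attempts: int = 3,
) -> list:
    """Ask a model for file paths and retry with feedback until all of them exist.

    call_model is a hypothetical stand-in for a real model call: it takes a
    prompt and returns a list of repo-relative paths.
    """
    feedback = ""
    missing: list = []
    for _ in range(max_attempts):
        paths = call_model(prompt + feedback)
        # Guardrail: check every proposed path against the real file system.
        missing = [p for p in paths if not (repo_root / p).is_file()]
        if not missing:
            return paths
        # Feed the failure into the next attempt instead of accepting the output.
        feedback = f"\nThese paths do not exist, pick real ones: {missing}"
    raise RuntimeError(f"model kept proposing missing paths: {missing}")
```

The model call is the one line in the middle. The check, the feedback, and the retry cap around it are the harness.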
Claude Code's built-in harness is the floor
Claude Code ships with real harness features. SubAgents for isolated parallel work. MCP servers for outside tools. Hooks that fire on lifecycle events. Slash commands and skills for reusable prompts. Memory files saved to disk. All of that was engineered. It's enough to feel the productivity jump from raw API calls.
But three things are still missing by default:
- Method: there's no enforced Plan / Design / Do / Check cycle. You either keep one in your head, or you drift.
- Quality gates: nothing checks whether the code matches at least 90% of the design doc.
- Trust-graduated automation: you get binary allow / deny, not a slider that matches how much you trust the agent.
Here's a tiny hooks.json (real, but built per project, per developer):

    {
      "PreToolUse": [
        { "match": "Bash:rm -rf*", "block": true }
      ]
    }

That's harness engineering at its rawest. Useful, but tiny.
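For a feel of what a rule like that does at run time, here is a small sketch of a pre-tool check in Python. It reuses the pattern string from the config above. The `is_blocked` function and the `Tool:command` matching convention are assumptions made for illustration, not Claude Code's actual hook API.

```python
from fnmatch import fnmatchcase

# Same pattern string as the config above: "<tool>:<command glob>".
BLOCKED_PATTERNS = ["Bash:rm -rf*"]


def is_blocked(tool: str, command: str, patterns=BLOCKED_PATTERNS) -> bool:
    """Return True if the tool call matches any blocked glob pattern."""
    candidate = f"{tool}:{command.strip()}"
    return any(fnmatchcase(candidate, pattern) for pattern in patterns)


if __name__ == "__main__":
    print(is_blocked("Bash", "rm -rf /tmp/build"))
    print(is_blocked("Bash", "ls -la"))
```

Glob matching like this is easy to sidestep (`rm -fr` does not match the pattern), which is part of why a single hand-written rule only counts as the rawest form of a harness.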
bkit: a method layer on top of the harness
bkit is a Claude Code plugin — 39 Skills, 36 Agents, 21 hook events, 128 lib modules — that treats CC as its base and adds a method layer above it. It doesn't compete with Claude Code. It plugs into every extension point CC already exposes.
The stack reads top down, highest layer first:
| Layer | Components | Role |
|---|---|---|
| bkit | PDCA · Quality gates · L0–L4 trust score | methodology |
| Claude Code | SubAgents · MCP · Hooks · Skills | primitives |
| Claude API | Inference + tools | model |
The user-facing surface is a small set of slash commands. Each one moves a single feature through a state machine. Each transition has rules to pass — not a free-text prompt:

    /pdca pm user-auth       # requirements -> PRD
    /pdca plan user-auth     # plan doc with acceptance criteria
    /pdca design user-auth   # 3 architecture options; pick one
    /pdca do user-auth       # implementation guided by the design
    /pdca analyze user-auth  # gap-detector: design vs impl match-rate
    /pdca iterate user-auth  # auto-fix until match-rate >= 90%
    /pdca report user-auth   # completion doc with metrics

The gap-detector agent compares the design doc to the actual code diff and reports a match rate. Below 90%, /pdca iterate loops on its own, capped at five tries. The model never changed. The workflow did.
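The "state machine with rules to pass" idea can be sketched in a few lines. This is a guess at the shape, not bkit's code: the `Feature` class, the artifact names, and the `GATES` table are all hypothetical. Only the phase names and the 90% threshold come from the commands above.

```python
from dataclasses import dataclass, field

# Phase order taken from the slash commands; "iterate" is a loop, not a phase.
PHASES = ["pm", "plan", "design", "do", "analyze", "report"]


@dataclass
class Feature:
    name: str
    phase: str = "pm"
    artifacts: set = field(default_factory=set)  # hypothetical artifact tags
    match_rate: float = 0.0


# Hypothetical exit gate for each phase: a rule, not a free-text prompt.
GATES = {
    "pm": lambda f: "prd" in f.artifacts,
    "plan": lambda f: "plan" in f.artifacts,
    "design": lambda f: "design" in f.artifacts,
    "do": lambda f: "code" in f.artifacts,
    "analyze": lambda f: f.match_rate >= 0.90,
}


def advance(feature: Feature) -> str:
    """Move the feature to the next phase, but only if the current gate passes."""
    if feature.phase == PHASES[-1]:
        raise ValueError("feature is already in the final phase")
    if not GATES[feature.phase](feature):
        raise PermissionError(f"gate for '{feature.phase}' not passed")
    feature.phase = PHASES[PHASES.index(feature.phase) + 1]
    return feature.phase
```

The design choice worth noticing: a failed gate raises instead of warning, so drifting past a missing design doc is not an option.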
Why this beats another model upgrade
Three concrete pieces of proof that the harness layer pays off more than picking a bigger model.
Evaluator-Optimizer on the same model. The 90% match-rate loop calls the same model twice — once to build, once to critique against the design doc. In practice this closes gaps a single Opus pass misses. You didn't need a smarter model. You needed a second look.
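A rough sketch of that build-then-critique loop, assuming the 90% threshold and five-try cap described earlier. The `critique` and `revise` callables are hypothetical stand-ins for two calls to the same model, and the function name is made up for this example.

```python
from typing import Callable, List, Tuple


def iterate_until_match(
    design: str,
    code: str,
    critique: Callable[[str, str], Tuple[float, List[str]]],
    revise: Callable[[str, List[str]], str],
    threshold: float = 0.90,
    max_tries: int = 5,
) -> Tuple[str, float, int]:
    """Evaluator-optimizer loop: critique against the design, revise, repeat.

    critique returns (match_rate, gaps); revise returns updated code.
    Stops once match_rate reaches the threshold or the try cap is hit.
    """
    rate, gaps = critique(design, code)
    tries = 0
    while rate < threshold and tries < max_tries:
        code = revise(code, gaps)            # optimizer pass
        rate, gaps = critique(design, code)  # evaluator pass, same model
        tries += 1
    return code, rate, tries
```

The cap matters as much as the threshold. Without it, a design the model cannot satisfy turns into an infinite loop instead of a report that says "stuck at 70%".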
Sentinels that watch upstream CC. Two agents, cc-version-researcher and bkit-impact-analyst, track Claude Code releases and auto-check whether anything in your workflow broke. When CC v2.1.64 closed four memory leaks in Agent Teams, bkit surfaced it without anyone reading release notes.
Automation you can graduate. /control level moves the system from L0 (manual) through L4 (full auto) based on a trust score built from your track record. The same model behaves differently based on what it has earned in your project:

    /control status
    # Level: L2 (Semi-Auto)
    # Trust Score: 0.78 (23 PDCA cycles, 91% avg match-rate)
    # Routine transitions auto · key decisions gated

Build your harness, not your prompt
- The model is a leaf. The workflow is the tree.
- Claude Code's harness is the ground floor. bkit adds a method layer on top of it, not a replacement for it.
- A 90% match-rate gate plus auto-iterate works because it's structural, not because the model got cleverer.
- Trust-graduated automation (L0–L4) beats binary allow / deny once you're running dozens of cycles a week.
- The next model release won't fix your workflow. Your workflow decides how much you can squeeze out of the next model.
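To picture the slider that replaces binary allow / deny, here is a toy mapping from trust score to automation level. The thresholds and the 1 to 4 risk scale are invented for this sketch; how bkit actually computes or applies its levels is not shown in this post.

```python
from bisect import bisect_right

# Hypothetical cut-offs: the score needed to unlock L1, L2, L3, L4.
LEVEL_THRESHOLDS = [0.40, 0.60, 0.80, 0.95]


def max_level(trust_score: float) -> int:
    """Map a 0-1 trust score to the highest automation level it unlocks (0-4)."""
    if not 0.0 <= trust_score <= 1.0:
        raise ValueError("trust_score must be between 0 and 1")
    return bisect_right(LEVEL_THRESHOLDS, trust_score)


def needs_approval(action_risk: int, level: int) -> bool:
    """action_risk runs from 1 (routine) to 4 (destructive), an assumed scale.

    An action runs unattended only when the current level covers its risk.
    """
    return action_risk > level


if __name__ == "__main__":
    level = max_level(0.78)
    print(f"L{level}", needs_approval(1, level), needs_approval(3, level))
```

With these made-up thresholds a score of 0.78 lands on L2: routine actions run on their own, riskier ones still wait for a human.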
If you're vibe-coding every day, your real lift isn't the next flagship. It's how tight the harness is around whatever model you call. Tighten that, and each model upgrade starts compounding instead of evaporating.
Terms used in this post
Harness engineering — Building the loop a model lives inside — context, tools, state, retries, guardrails. Bigger than just prompt tuning.
Agent / subagent — A model call that runs in its own context, often with a single job (review code, draft a plan). Agents can call other agents.
MCP server — A small server that hands a model structured tools to call. Cleaner than putting everything in the prompt.
Gap-detector — A bkit agent that diffs the design doc against the code and reports a match rate plus a list of gaps.
Match rate — How well the code matches the design doc. 0–100%. Below 90%, the iterate loop kicks in.
Trust score — A 0–1 number based on your project track record. Drives which L0–L4 automation level you can run at.
PDCA — Plan, Do, Check, Act. A four-step cycle. bkit turns it into a state machine with clear gates between phases.
FAQ
Does the model choice not matter at all?
It matters, but as a second-order effect. With a thin workflow around it, a better model just repeats the same mistakes faster. The real power is in the loop: what context the model sees, how its output is evaluated, and what happens when it's wrong. bkit's gap-detector and auto-iterate close failure modes that a bigger model alone does not.
Can I get the same result from plain Claude Code, without bkit?
Yes, if you are willing to hand-roll the workflow every session: write plan and design docs, review iterations, set your own guardrails. bkit encodes that discipline as a state machine and a set of agents so you do not have to rebuild it per project. It is the difference between a framework and a convention.
Is 'harness engineering' an established term?
It became common in 2025 as multi-agent systems matured. The idea: LLM inference is a leaf, and the surrounding engineering — context curation, tool orchestration, state, guardrails, retry loops — is what determines real-world reliability. Prompt engineering is one piece inside a harness.