Philosophy · 13 min read · by agent-kay

AI agents, human responsibility, and the harness in between

Chatbots got good. Agents did not — not because the models aren't smart, but because the system around them isn't. Roles, responsibility, and why PDCA + L0–L4 control beats another model upgrade.

Chatbots are solved. Agents are not.

A book I read recently put the difference cleanly:

A chatbot answers questions. An agent pursues goals.

Most teams already have chatbots. They work well enough. What they don't have — and what every roadmap pretends to have — is an agent. Something that starts with a goal, makes choices, runs the work, recovers from its own mistakes, and hands back a result a human can sign off on.

The gap between "AI that talks well" and "AI that finishes work" is huge. It's also, I think, where the next ten years of building happens. This piece is my map of that gap: what changes about roles, what stays the same about responsibility, and why the harness around the model matters more than the model itself.

The structure is the bottleneck, not the model

The most common AI talk in 2026 is still:

  • GPT-5 vs Claude 4.7 vs Gemini
  • Which model wins on benchmark X
  • Which one you should pick

In practice, almost no one running AI in production is stuck on model quality anymore. What they are stuck on, every day:

  • The context is a mess, so the output drifts.
  • Mid-task state is lost, so the agent restarts from zero.
  • No one can tell why the model made the choice it made.
  • When something fails, recovery is manual and painful.

None of those are model problems. They are system problems. They belong to the harness around the model, not to the model itself. Prompt engineering tunes a single turn. Context engineering, workflow design, state management, and guardrails tune the whole loop. That loop is where real systems live.

The same book splits three patterns that often get mashed together:

  • RAG is about what info goes in — fetch the right chunks and inject them.
  • Workflow is about fixing the order — step 1 always runs before step 2, period.
  • Agent is about picking the next move — given a goal and a state, the model chooses.

Most teams call something an "agent" when it's really a workflow with one model call inside. That's fine, sometimes — a fixed workflow is easier to debug — but it's worth being honest about what you're actually running.
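To make the workflow/agent distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical: `Model` is just any function from prompt to text, `scripted_model` is a stand-in that replays canned replies, and the step and tool names are invented for illustration.

```python
from typing import Callable

# Hypothetical stand-in for a model call: any function from prompt to text.
Model = Callable[[str], str]


def scripted_model(replies: list[str]) -> Model:
    """Fake model for illustration: replays canned replies, then says 'stop'."""
    it = iter(replies)
    return lambda prompt: next(it, "stop")


def run_workflow(ticket: str, model: Model) -> list[str]:
    """Workflow: the order of steps is fixed in code; the model fills in one step."""
    trace = ["fetch_ticket"]
    summary = model(f"Summarize: {ticket}")  # the single model call
    trace.append(f"summarize -> {summary}")
    trace.append("post_summary")
    return trace


def run_agent(goal: str, model: Model, tools: dict[str, Callable[[], str]],
              max_steps: int = 5) -> list[str]:
    """Agent: the model picks the next tool from the goal and the state so far."""
    trace: list[str] = []
    for _ in range(max_steps):
        choice = model(
            f"Goal: {goal}\nDone so far: {trace}\nPick one of {sorted(tools)} or 'stop'"
        )
        if choice == "stop" or choice not in tools:
            break  # unknown choices end the run instead of being executed
        trace.append(f"{choice} -> {tools[choice]()}")
    return trace
```

In `run_workflow` the sequence cannot change no matter what the model returns. In `run_agent` the sequence is whatever the model chooses, which is why it needs a step limit and a check that the chosen tool exists.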

More memory, not more knowledge

The single best line I've read about agent design lately:

A good agent is not a system that knows more. It's a system that remembers more appropriately.

This is the fix to the "just stuff more into the context window" school. A bigger window doesn't mean better output. Past a certain size, it usually means worse output. The right question isn't "how much can I include." It's "what is relevant right now?"

That's what context engineering is. You decide when, to which call, in which shape, and at what scope, and you feed in only the bits that move the current choice forward. The rest is noise. It's the same difference as between a senior engineer who knows what to look up and when, and a junior who Googles every sentence.
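A rough sketch of that selection step, in Python. The scoring here (plain word overlap with the task) and the names `Snippet`, `select_context`, and `budget_words` are assumptions made for illustration; a real system would likely use embeddings or a retriever, and a token budget rather than a word count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    text: str
    scope: str  # e.g. "design", "code", "chat"; labels are illustrative


def select_context(task: str, snippets: list[Snippet], allowed_scopes: set[str],
                   budget_words: int) -> list[Snippet]:
    """Keep only in-scope snippets that share words with the task, within a size budget."""
    task_words = set(task.lower().split())

    def overlap(s: Snippet) -> int:
        return len(task_words & set(s.text.lower().split()))

    # Scope filter first, then drop anything with no relevance to the task.
    candidates = [s for s in snippets if s.scope in allowed_scopes and overlap(s) > 0]
    candidates.sort(key=overlap, reverse=True)  # most relevant first

    chosen: list[Snippet] = []
    used = 0
    for s in candidates:
        size = len(s.text.split())
        if used + size <= budget_words:  # skip anything that would blow the budget
            chosen.append(s)
            used += size
    return chosen
```

The point of the sketch is the order of questions: scope, then relevance, then budget. "How much fits" is asked last, not first.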

The two failure modes of automation

AI automation has two popular endpoints, and both are broken.

  • "The AI handles everything." Things go wrong. The blast radius is huge. The team has no idea what actually happened until the bill or the incident.
  • "A human must approve every step." The agent turns into a slow, costly notification system. Humans spend all day clicking approve on routine moves. Nobody wins.

The middle path is obvious in hindsight: graduated automation. Auto-run the boring transitions. Gate the meaningful choices. Easy to say, hard to structure. It means deciding, for each kind of choice, where it sits on the boring-to-meaningful line — and keeping that policy steady as the system grows.

This is the problem that bkit's /control levels, 0 through 4, exist to solve:

  • L0: every action approved (first-run mode).
  • L1: suggestions plus required checkpoints.
  • L2 (the default): routine steps automatic, key decisions gated.
  • L3: most steps automatic, only destructive ops gated.
  • L4: full autonomy.
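One way to read the five levels is as a single gating rule: each level names the least consequential kind of action that still needs a human. The Python sketch below is my own illustration, not bkit's implementation. The `Kind` categories and the `GATE_FROM` table are assumptions chosen to match the level descriptions.

```python
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    L0 = 0  # every action approved
    L1 = 1  # suggestions plus required checkpoints
    L2 = 2  # routine auto, key decisions gated (default)
    L3 = 3  # most steps auto, only destructive ops gated
    L4 = 4  # full autonomy


class Kind(IntEnum):
    """Hypothetical action kinds, ordered from boring to meaningful."""
    ROUTINE = 0
    CHECKPOINT = 1
    KEY_DECISION = 2
    DESTRUCTIVE = 3


# Assumed mapping: the lowest kind that still requires approval at each level.
GATE_FROM: dict[Level, Optional[Kind]] = {
    Level.L0: Kind.ROUTINE,
    Level.L1: Kind.CHECKPOINT,
    Level.L2: Kind.KEY_DECISION,
    Level.L3: Kind.DESTRUCTIVE,
    Level.L4: None,  # nothing is gated
}


def needs_approval(level: Level, kind: Kind) -> bool:
    """True if a human must approve this kind of action at this level."""
    threshold = GATE_FROM[level]
    return threshold is not None and kind >= threshold
```

Keeping the policy in one table is what makes it possible to hold it steady as the system grows: a new action type only needs a `Kind`, not a new set of if-statements.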

Moving between levels isn't a config toggle. It's tied to a Trust Score built from track record in the actual project: finished cycles, average match rate against the design doc, destructive-op count, interrupt frequency. Stepping up to the next level comes with a cooldown. A lucky streak does not unlock autonomy. Steady reliability does. The shape is closer to a pilot's flight-hours certificate than a dev-mode switch.
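To show the shape of that rule, here is a hedged Python sketch. The inputs are the four signals named above. The formula, the weights, the threshold of 80, and the five-cycle cooldown are all invented for this example and should not be read as bkit's actual Trust Score.

```python
from dataclasses import dataclass


@dataclass
class TrackRecord:
    finished_cycles: int
    avg_match_rate: float  # 0.0 to 1.0, build vs design doc
    destructive_ops: int
    interrupts: int        # times a human had to step in


def trust_score(r: TrackRecord) -> float:
    """Illustrative formula only; every weight here is made up for the sketch."""
    if r.finished_cycles == 0:
        return 0.0
    # A short history caps the score, so a lucky streak cannot max it out.
    experience = min(r.finished_cycles / 20, 1.0)
    penalty = (2 * r.destructive_ops + r.interrupts) / r.finished_cycles
    score = 100 * r.avg_match_rate * experience - 10 * penalty
    return max(0.0, min(100.0, score))


def next_level(level: int, r: TrackRecord, cycles_since_change: int,
               threshold: float = 80.0, cooldown_cycles: int = 5) -> int:
    """Step up at most one level, and only after the cooldown has passed."""
    if level >= 4 or cycles_since_change < cooldown_cycles:
        return level
    return level + 1 if trust_score(r) >= threshold else level
```

Two properties matter more than the exact numbers: a perfect but short history scores low, and even a high score cannot skip a level or beat the cooldown.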

Humans are approvers, not reviewers

The common framing is "human-in-the-loop" — the human sits in the middle of the pipeline and checks the AI's work. I think this undersells what's actually needed.

In a real team, the human isn't a reviewer at a checkpoint. The human is the one who sets the intent, approves the result, and carries the consequences. Take any of those three away and you get:

  • Intent without human: an agent that solves the wrong problem beautifully.
  • Approval without human: automation nobody is accountable for.
  • Consequence without human: the system is a toy.

The honest role split in 2026 looks more like this:

  • AI: execution, checks, tuning. Does the work.
  • Human: intent, judgment, responsibility. Owns the outcome.

The cleanest phrasing I've found:

AI can have a job. It cannot have a title. Because a title implies accountability, and accountability doesn't delegate to a process.

The PDCA loop as the shape of an agent

A useful shift is to stop describing agents as "autonomous-thing-that-does-tasks" and start describing them as running an operating loop:

  1. Read the current state.
  2. Pick the next action.
  3. Run it.
  4. Check the outcome.
  5. Update the plan.

That's roughly PDCA: Plan, Do, Check, Act. For software, it helps to split a Design phase out of Plan:

  • Plan: turn requirements into acceptance criteria
  • Design: commit to an approach
  • Do: build it
  • Check: measure the result against the design
  • Act: iterate or ship

Every serious AI-assisted dev flow I've seen in production maps to this, even when nobody calls it PDCA. bkit's /pdca pm → plan → design → do → analyze → iterate → report is one explicit encoding. Anthropic's internal Claude Code workflow is another. The word "PDCA" isn't the point. The point is that agents need a state machine, and a state machine means every output is tied to a phase, a spec, and a comparison.
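As a minimal sketch of what "agents need a state machine" means in code, the Python below uses the phase names from the /pdca sequence. The transition table is my assumption, in particular that analyze can go to either iterate or report and that iterate returns to analyze. `Cycle` and `PhaseError` are hypothetical names.

```python
# Assumed transitions between the phases named in the /pdca sequence.
TRANSITIONS: dict[str, set[str]] = {
    "pm": {"plan"},
    "plan": {"design"},
    "design": {"do"},
    "do": {"analyze"},
    "analyze": {"iterate", "report"},  # ship, or go around again
    "iterate": {"analyze"},            # re-check after fixes
    "report": set(),                   # terminal phase
}


class PhaseError(Exception):
    """Raised when a transition is not in the table."""


class Cycle:
    def __init__(self, feature: str) -> None:
        self.feature = feature
        self.phase = "pm"
        self.history = ["pm"]

    def advance(self, to: str) -> None:
        """Move to the next phase, refusing any jump the table does not allow."""
        if to not in TRANSITIONS[self.phase]:
            raise PhaseError(f"{self.phase} -> {to} is not allowed")
        self.phase = to
        self.history.append(to)
```

The refusal is the useful part. An agent that cannot jump from plan straight to do is an agent whose every output has a design behind it.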

The payoff for making the loop explicit is auditability. When the agent does something surprising, you can walk backward. Which plan phase did this come out of? What spec was it written against? What did the check return? Why did the iterate step pass? Without an explicit loop, every surprise is a mystery. With it, every surprise is an entry in a state machine log.
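Walking backward is only possible if each log entry records its phase, its spec, its check result, and the step that led to it. The Python sketch below shows one possible shape. The `LogEntry` fields and the `walk_back` helper are hypothetical, not a format any particular tool uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    step: int
    phase: str
    spec: str              # which spec the output was written against
    output: str
    check: str             # what the check returned, "" if not checked
    parent: Optional[int]  # the step that led to this one, None for the first


def walk_back(log: list[LogEntry], step: int) -> list[LogEntry]:
    """Return the chain of entries from the given step back to its origin."""
    by_step = {e.step: e for e in log}
    chain: list[LogEntry] = []
    current = by_step.get(step)
    while current is not None:
        chain.append(current)
        current = by_step.get(current.parent) if current.parent is not None else None
    return chain
```

Given a surprising output at step N, `walk_back(log, N)` answers the questions above in order: the entry itself carries the check result and spec, and each parent shows the phase it came out of.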

Data point: where AI-native teams actually are

If you want to calibrate how fast this is moving, look at the teams using AI on themselves the hardest.

Anthropic. Dario Amodei has said in public that ~90% of code across most teams is now written by Claude. Boris Cherny, who leads Claude Code, has described writing 100% of his own recent code with AI and running several Claude sessions in parallel. Claude Code itself is reportedly ~90% Claude-written.

Y Combinator. YC partners have said some portfolio companies are hitting 95% AI-generated code. The interesting bit isn't the percent. It's the headcount story. A seed-stage startup that would have needed 5 engineers now needs 2, but those 2 are doing radically different work than engineers did in 2022.

OpenAI. Sam Altman's framing — "the future of AI depends on good governance" — isn't only PR. Governance here means concrete questions. What data can the AI touch? What choices can it make on its own? Who owns the output? These are org questions dressed in tech clothes.

Three different companies, one steady pattern: AI takes more and more of the execution. The humans left do more spec and more sign-off.

What happens to the org chart

The change you can see in these companies — and now in startups watching them — isn't "engineers get replaced." It's quieter:

  • Execution roles shrink. Not to zero — the work still happens. One person running five AI sessions just produces the output of a former team.
  • Manager roles compress. A big chunk of manager work is translation. Turn an exec ask into a plan. Turn an engineer's update into a dashboard. Review docs. Hold timelines. AI does a lot of that now.
  • Executive / owner roles grow. Someone still has to decide what to build, pick between competing directions, deal with customers, and answer when things blow up. None of that hands off to an AI, because none of it is an execution task.

A sketch of the 2030-ish shape:

Board          - AI governance, data policy, risk

C-level        - system architect, not resource allocator

Executive      - decision + accountability

Expert + AI    - execution at scale

The thing to notice: the org chart is flatter, but the accountability chart is taller. Fewer people in the chain of command, and each person in the chain owns more of the outcome. AI scales what one expert can do. It doesn't scale who the responsibility lands on.

Why bkit exists

I'll be direct. The design of bkit is a reaction to the pattern above.

If the problem isn't "the model isn't smart enough" but "the system around the model isn't disciplined enough," then the answer isn't a better model. The answer is:

  1. An explicit state machine so every step has a phase, a spec, and a measurable outcome.
  2. Graduated automation so humans are gated on the choices that need them, and nowhere else.
  3. A match-rate loop that turns "is the build true to the design?" into a measured number, not a vibes review.

That's what /pdca pm → plan → design → do → analyze → iterate → report is: a state machine with acceptance criteria at every phase. That's what gap-detector plus the 90% match-rate threshold is: the system refuses to ship until build and design agree. That's what the L0–L4 levels plus Trust Score are: graduated automation earned per project, reversible in one command.
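The match-rate gate can be sketched in a few lines of Python. Only the 90% threshold comes from the text above. Treating the design and the build as sets of named items is a simplifying assumption of mine, and `match_rate`, `gaps`, and `decide` are hypothetical helpers, not gap-detector's real interface.

```python
MATCH_THRESHOLD = 0.90  # the 90% threshold named in the post


def match_rate(design_items: set[str], built_items: set[str]) -> float:
    """Fraction of design items that show up in the build."""
    if not design_items:
        return 1.0  # nothing was specified, so nothing can be missing
    return len(design_items & built_items) / len(design_items)


def gaps(design_items: set[str], built_items: set[str]) -> list[str]:
    """Design items the build does not cover yet, in a stable order."""
    return sorted(design_items - built_items)


def decide(design_items: set[str], built_items: set[str]) -> str:
    """Next phase: 'report' if the build is faithful enough, else 'iterate'."""
    if match_rate(design_items, built_items) >= MATCH_THRESHOLD:
        return "report"
    return "iterate"
```

The value of the number is that it turns "looks about right" into a decision a log can record: the same inputs always produce the same next phase, and `gaps` says exactly what to fix on the next pass.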

None of it makes the model smarter. All of it makes the outcome more reliable, more auditable, and — the part that matters for accountability — more defensible when a human has to sign off.

The thing AI won't replace

I started writing tene — the open-source CLI that injects secrets into AI agents without letting them see the plain values — because of a concrete fear. My own API keys leaking into a chat log. I kept building bkit on top for a more philosophical reason. I don't actually believe AI is going to replace the act of deciding. It might replace most of the typing, most of the debugging, most of the writing up. But the sentence "we are shipping this" still has to be said by a person who can still be fired for saying it.

The harness matters because that sentence is only credible if the system underneath it is legible. If you can't explain why the AI made the choice it made, you can't stand behind the output. If you can't stand behind it, you shouldn't have shipped it. And if nobody is willing to stand behind it, you didn't actually ship — you demoed.

Demos run on applause. Products run on accountability. The difference is the loop.

Takeaways

  • Chatbots are solved. Agents are a structural problem, not a model problem. A model upgrade does not fix a weak harness.
  • The honest role split: AI executes, checks, tunes. Humans intend, judge, and are accountable.
  • The two failure modes — full-auto and nothing-auto — are both broken. Graduated automation (L0–L4) is the middle path.
  • A good agent is a state machine plus a match-rate loop, not a cleverer prompt. PDCA is one explicit encoding.
  • The org chart flattens while the accountability chart gets taller. AI scales. Responsibility does not.
  • The next three years aren't about building bigger models. They're about building systems a human can still sign off on.

Terms used in this post

Harness — The wrapper around a model — context, state, guardrails, workflow — that turns a one-shot model call into a working system.

RAG (Retrieval-Augmented Generation) — A pattern where you fetch the right info chunks first, then inject them into the model's prompt. About what goes in.

Workflow — A fixed sequence of steps. Step 1 always runs before step 2. The model has no say in the order.

Agent — A system where the model picks the next action given a goal and a current state. The order is not fixed.

State machine — A finite list of phases plus the rules for moving between them. Every output is tied to a phase, which makes the system auditable.

PDCA — Plan, Do, Check, Act. A four-step loop that gives the agent acceptance criteria at every phase.

Graduated automation (L0–L4) — Trust earned per project, not toggled per user. L0 approves every action; L4 is full autonomy. Moving up requires a Trust Score and a cooldown.

Match rate — A measured number (target: 90%) for "how faithful is the build to the design doc?" Replaces vibes-based code review.

FAQ

If 90% of Anthropic's code is AI-written, why are they still hiring engineers?

Because execution and accountability are separate jobs. AI handles the coding, testing, and debugging; humans decide what to build, why, for whom, and sign off when it's done. As AI takes over more execution, human time shifts from typing to judging — and judging is the part that carries legal, product, and customer-facing responsibility. The job title survives; the daily activity changes.

Is 'fully autonomous AI agent' a realistic goal for my product?

Not in 2026. Every AI team shipping at scale — Anthropic, Cursor, GitHub — runs with graduated automation: routine steps auto-approved, high-stakes decisions gated through a human. The gain is not 'zero humans' but 'humans only at the decisions that actually need a human.' Binary thinking (manual vs fully auto) is the most common failure mode.

How does bkit's L0–L4 automation level differ from a config flag?

A config flag is static — you turn it on, it stays on. L0–L4 is earned. Escalating from L2 to L3 requires a Trust Score derived from your project's actual track record (match rate, destructive-op count, interrupt frequency). The system graduates trust per project, not per user or per model, and a cooldown prevents a lucky streak from unlocking autonomy prematurely. It's closer to a pilot's flight-hour certification than a toggle.
