Topic · A6

Agent harness engineering matters more than model shopping

If your agent is fed the wrong context, uses its tools poorly, and never verifies its own work, a model upgrade will not save you. The harness is where reliability comes from.

If your agent has a bad harness, a better model mostly gives you faster mistakes.

That sounds harsher than it is. The point is simple: models reason, but harnesses decide what they can see, when they can act, which tools they touch, how much context they carry, and whether anyone checks the result before it leaves the terminal. OpenAI now writes about "harness engineering" directly as a discipline for getting reliable output from Codex in real software environments (OpenAI harness engineering). Anthropic's Claude Code docs say the same thing from another angle: use CLI tools because they are more context-efficient, isolate work in subagents, run multiple sessions in parallel, and use hooks when a rule must happen deterministically instead of relying on the model to remember it (Claude Code best practices, subagents docs, hooks guide).

That is harness engineering. Not the model weights. The wrapper around them.

What the harness actually includes

People often use "prompting" as a catch-all word for this. That is too small.

The harness includes:

  • project instruction loading
  • memory rules
  • skill discovery and activation
  • tool interfaces and permissions
  • subagent routing
  • checkpoint logic
  • verification and review passes
  • failure handling

OpenAI's Codex materials make the instruction side concrete: AGENTS files layer from global to repo to directory scope, and skills load implicitly based on the description or explicitly by name (OpenAI Codex docs, OpenAI Codex skills docs). Anthropic's hook docs make the enforcement side concrete: hooks fire at deterministic lifecycle points such as PreToolUse, PostToolUse, and Stop, and they can block or transform behavior before the model drifts into the wrong action (Claude Code hooks reference).
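
To make the layering idea concrete, here is a small sketch in plain Python. It is not Codex's actual loader; the file locations and the `collect_instructions` helper are illustrative. The point is the ordering: broader scopes load first, narrower scopes load last and can refine them.

```python
from pathlib import Path

# Illustrative sketch only -- not Codex's loader. It shows the layering idea from the
# docs: global guidance first, then the repo-level AGENTS.md, then any directory-scoped
# AGENTS.md between the repo root and the working directory, so the most specific
# instructions arrive last.
def collect_instructions(global_file: Path, repo_root: Path, work_dir: Path) -> str:
    layers = [global_file, repo_root / "AGENTS.md"]
    for scope in reversed(work_dir.relative_to(repo_root).parents):
        layers.append(repo_root / scope / "AGENTS.md")
    layers.append(work_dir / "AGENTS.md")

    seen, merged = set(), []
    for path in layers:
        if path.is_file() and path not in seen:
            seen.add(path)
            merged.append(f"# scope: {path}\n{path.read_text()}")
    return "\n\n".join(merged)
```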

That is why "we upgraded the model" often disappoints people. If the harness still feeds the model the wrong slice of the repo, or lets it edit without verification, the upgrade just changes the flavor of the failure.

The model is the engine. The harness is the drivetrain.

A strong engine with a broken drivetrain does not move the car well. Same story here.

Three harness mistakes show up over and over:

1. Stuffing too much into the main context

Anthropic's subagent docs are blunt about this. Use subagents when a side task would flood the main conversation with logs, search results, or file contents you will not need later. Each subagent runs in its own context window and returns only the summary (Claude Code subagents docs).

That is a harness win, not a model win. The same model becomes more useful because you stopped making it carry irrelevant baggage.
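
The contract is easy to sketch. This is not Claude Code's subagent API; `run_agent` is a hypothetical stand-in for whatever call executes an isolated model session. What matters is which side of the boundary the bulky material stays on.

```python
# Sketch of the isolation contract, not Claude Code's actual API. `run_agent` is a
# hypothetical stand-in for a call that runs a model turn in its own context window.

def run_side_task(run_agent, task: str, bulky_material: list[str], limit: int = 2000) -> str:
    """Do a file-heavy side task in an isolated context; return only a short summary."""
    result = run_agent(
        instructions=f"Complete this side task and report back briefly: {task}",
        context=bulky_material,    # logs, search results, large files stay in here...
    )
    return result["summary"][:limit]   # ...and only the compressed answer crosses back
```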

2. Using tool surfaces that waste tokens

Anthropic's best-practices page recommends CLI tools because they are the most context-efficient way to interact with external services (Claude Code best practices). A browser tool that sends giant page snapshots for every tiny action is a worse harness than a tool that lets the model ask narrowly for what it needs. Same model, better wrapper, better output.
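
To feel the difference, compare what each surface pushes into context per action. Neither function below is a real Claude Code or Codex tool; they are hypothetical wrappers around standard Unix commands, shown only to contrast how much the model has to carry.

```python
import subprocess

# Hypothetical tool surfaces (not real Claude Code or Codex tools), assuming a
# Unix-like system with cat and grep available. The model pays context for every
# byte a tool returns, so the narrow surface is the better harness.

def dump_file(path: str) -> str:
    """Wasteful surface: the whole file lands in context for every question."""
    return subprocess.run(["cat", path], capture_output=True, text=True).stdout

def search_file(path: str, pattern: str) -> str:
    """Narrow surface: only the matching lines, with line numbers, land in context."""
    result = subprocess.run(["grep", "-n", pattern, path], capture_output=True, text=True)
    return result.stdout or "(no matches)"
```

A 5,000-line log answered with three matching lines costs the model roughly three lines of context instead of 5,000.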

3. Leaving validation to the model's conscience

Claude's hooks guide says it plainly: use hooks when certain actions must always happen, rather than relying on the LLM to choose to do them (Claude Code hooks guide). That is the heart of deterministic checkpoints. If formatting, linting, tests, schema validation, or policy checks matter, wire them into the harness.

Once you do that, the model no longer has to "remember" to be disciplined. The harness supplies the discipline.
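
A small example of what "wiring it in" can look like: a PostToolUse hook script that formats Python files after the model edits them. The stdin-JSON shape and the settings-level registration with an Edit/Write matcher follow Anthropic's hooks reference as I read it; the formatter command and file layout are just an example, so treat this as a sketch rather than copy-paste config.

```python
#!/usr/bin/env python3
# Sketch of a PostToolUse hook script, registered in Claude Code settings with a
# matcher such as "Edit|Write". Per the hooks reference, the tool call arrives as
# JSON on stdin; the formatter used here (ruff) is only an example.
import json
import subprocess
import sys

payload = json.load(sys.stdin)
file_path = payload.get("tool_input", {}).get("file_path", "")

if file_path.endswith(".py"):
    # Runs on every edit, whether or not the model "remembers" that formatting matters.
    subprocess.run(["ruff", "format", file_path], check=False)

sys.exit(0)  # exit 0: nothing to block, the turn continues normally
```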

Parallel subagents are not a party trick

This is one of the clearest examples of harness leverage. Anthropic documents subagents as separate assistants with their own prompts, tool access, and context windows, useful for file-heavy or specialized side tasks that should not clutter the main thread (Claude Code subagents docs). Anthropic also recommends running multiple Claude sessions in parallel when you want isolated experiments or faster development (Claude Code best practices).

Why this matters:

  • research can happen without polluting implementation context
  • review can happen without freezing the main lane
  • different tools can be limited to different workers
  • summaries come back compressed instead of raw

That is why senior workflows feel different from demo workflows. A demo asks one agent to do everything. A real harness splits work into lanes and decides what returns to the main lane.
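
The lane idea itself needs nothing exotic. The sketch below uses plain Python threads rather than any vendor SDK; `run_agent` is a hypothetical stand-in for an isolated agent session, and the tool lists show how different lanes can be limited to different capabilities.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of parallel lanes with plain Python threads -- not an Anthropic or OpenAI SDK.
# `run_agent` is a hypothetical call that runs one isolated agent session.

def run_lanes(run_agent, lanes: list[dict], summary_limit: int = 1500) -> dict[str, str]:
    """Run independent lanes in parallel; only compressed summaries come back."""
    def run_one(lane: dict) -> tuple[str, str]:
        result = run_agent(
            instructions=lane["instructions"],
            allowed_tools=lane["tools"],   # e.g. a research lane gets read-only tools
        )
        return lane["name"], result["summary"][:summary_limit]

    with ThreadPoolExecutor(max_workers=len(lanes)) as pool:
        return dict(pool.map(run_one, lanes))

# Usage sketch:
# run_lanes(run_agent, [
#     {"name": "research", "instructions": "Map the auth code paths", "tools": ["read", "grep"]},
#     {"name": "review",   "instructions": "Review the open diff",    "tools": ["read"]},
# ])
```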

Deterministic checkpoints are where reliability starts

A deterministic checkpoint is anything the agent cannot wish past.

Examples:

  • a PreToolUse hook that blocks destructive shell commands
  • a PostToolUse hook that formats code after edits
  • a Stop hook that refuses task completion until tests or checks pass
  • a required artifact like a diff, a citation list, or a failing-to-passing test record

Claude's hook lifecycle is built exactly for this kind of control. Hooks can run once per session, once per turn, or around every tool call in the agentic loop (Claude Code hooks reference). OpenAI's harness engineering piece makes the same cultural point from the Codex side: in mature deployments, a large share of review effort moves agent-to-agent, with humans stepping in where it matters most (OpenAI harness engineering).
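
Here is what the third bullet can look like as a script: a Stop hook that refuses to let the session end while the test suite is red. It assumes the blocking behavior described in Anthropic's hooks reference (exit code 2 blocks, and stderr is fed back to the model); the test command is whatever your project actually runs.

```python
#!/usr/bin/env python3
# Sketch of a Stop hook: the agent may not declare the task finished while tests fail.
# Assumes the documented blocking contract (exit code 2 blocks; stderr goes back to
# the model). The test command is project-specific -- pytest is just the example here.
import subprocess
import sys

result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)

if result.returncode != 0:
    # Blocking exit: the model sees the failing output and keeps working.
    print("Tests are failing; do not stop yet.\n" + result.stdout[-2000:], file=sys.stderr)
    sys.exit(2)

sys.exit(0)  # tests pass, so stopping is allowed
```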

The practical lesson is not "remove humans." It is "stop making humans do work the harness should have blocked automatically."

Skills are also harness components

Anthropic's engineering post on skills explains the progressive-disclosure model clearly: the short metadata for installed skills is present up front, and the full instructions load only when the task matches (Anthropic engineering post). Codex mirrors this: skills can be invoked explicitly by name or picked up implicitly based on their descriptions (OpenAI Codex skills docs).

That changes more than ergonomics. It changes the context budget.

A good skill harness:

  • keeps rare but important workflows off the hot path
  • loads detail only when needed
  • makes agent behavior more repeatable
  • lets teams refine procedures independently of the base prompt

If you have ever seen a "smart" agent fail because it forgot a long internal process halfway through a task, you have seen a missing skills harness.
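
The context arithmetic behind that list is easy to see in a toy loader. This is not Anthropic's or Codex's skill implementation; in particular, the keyword match in `activate` is a crude stand-in for the model choosing a skill from its description. The shape is what matters: one metadata line per skill is always in context, and the full body loads only on a match.

```python
from dataclasses import dataclass
from pathlib import Path

# Toy progressive-disclosure loader -- not Claude's or Codex's implementation. The
# keyword match below is a crude stand-in for the model picking a skill from its
# description; real selection is model-driven.

@dataclass
class Skill:
    name: str
    description: str   # one short line, always visible
    body_path: Path    # full instructions, loaded only when the skill is activated

def metadata_block(skills: list[Skill]) -> str:
    """What the model carries on every turn: one line per installed skill."""
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)

def activate(skills: list[Skill], task: str) -> str:
    """Load full instructions only for skills whose description overlaps the task."""
    hits = [s for s in skills
            if any(word in task.lower() for word in s.description.lower().split())]
    return "\n\n".join(s.body_path.read_text() for s in hits)
```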

Autoresearch is a harness, not a writing style

RuleSell's own /topic/autoresearch page is useful here because it encodes the pattern directly: collect sources first, verify URLs, separate fact from synthesis, and make uncertainty explicit. That is not just "researching carefully." It is a harness that forces a verification pass before the final output.

This pattern matters because agents are unusually good at sounding finished before they are verified. An autoresearch harness fixes that by requiring:

  • source gathering before confident prose
  • URL verification before publication
  • explicit handling of unknowns
  • a final synthesis stage after evidence is collected

Once you see it that way, a lot of "AI writing problems" are actually harness problems. The model is doing what the loop rewarded.
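
A sketch of that loop, with `gather_sources` and `synthesize` as hypothetical stand-ins for the collection and writing steps: URLs are checked before anything is allowed into the final page, unverified claims are surfaced instead of silently dropped, and synthesis runs only over what survived.

```python
import urllib.request

# Sketch of an autoresearch-style verification pass. `gather_sources` and `synthesize`
# are hypothetical stand-ins; the verification gate and the ordering are the point.

def url_is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Check a URL before it is allowed into the final output."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except Exception:
        return False

def autoresearch(topic: str, gather_sources, synthesize) -> dict:
    sources = gather_sources(topic)                         # evidence first, prose later
    verified = [s for s in sources if url_is_reachable(s["url"])]
    unverified = [s["claim"] for s in sources if s not in verified]
    draft = synthesize(topic, verified)                     # synthesis only over verified evidence
    return {"draft": draft, "sources": verified, "unverified_claims": unverified}
```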

About the MindStudio 25.7-point claim

RuleSell's SEO map flags a MindStudio benchmark summarized as "same model, 25.7 percentage-point swing from harness alone." That claim is directionally plausible and consistent with everything above. But I was not able to verify a stable primary public URL for that exact benchmark on May 13, 2026.

So the honest position is:

  • the principle is well supported by OpenAI's and Anthropic's published docs
  • the exact 25.7-point figure should be treated as provisional until a primary source is produced

That is not evasive. It is what a verification harness looks like in practice.

Where this fails

Harnesses can become bureaucracy. If every tiny task triggers five hooks, three approval gates, and two review agents, you will smother the workflow.

They can also calcify around one tool. A harness that only makes sense inside Claude hooks or one private plugin is powerful but less portable.

And there is still a ceiling set by the model. A clean harness will not make a weak model good at architecture, debugging, or judgment-heavy tasks. It just stops you from wasting strong models on avoidable failure modes.

Frequently asked

What counts as the harness in an agent workflow?
Everything around the model: instruction loading, skills, tool wrappers, permission policy, checkpoints, verification passes, subagent routing, and how outputs get reviewed or blocked.
Why say the harness matters more than the model?
Because the same strong model can still fail if the wrapper gives it noisy context, the wrong tools, no isolation, and no validation loop. Better model quality helps, but bad orchestration wastes it.
What is a deterministic checkpoint?
A step the agent must pass outside of its own judgment, such as a test command, lint pass, schema check, or hook that blocks dangerous actions until the condition is met.
Are subagents part of the harness or part of the model?
They are harness design. The model may power them, but the choice to isolate context, limit tools, and summarize back into the main thread is orchestration.
Where does autoresearch fit?
Autoresearch is a verification harness. It turns "write a page" into a source-checked workflow with explicit evidence collection, uncertainty handling, and final synthesis.
Did MindStudio really show a 25.7-point harness swing?
That number is cited in RuleSell's research map, but I was not able to verify a stable primary public URL for it on May 13, 2026. Treat the exact figure as provisional unless a primary source is produced.
