Topic · A6
Agent harness engineering matters more than model shopping
If your agent handles context badly, uses tools badly, and never verifies its own work, a model upgrade will not save you. The harness is where reliability comes from.
If your agent has a bad harness, a better model mostly gives you faster mistakes.
That sounds harsher than it is. The point is simple: models reason, but harnesses decide what they can see, when they can act, which tools they touch, how much context they carry, and whether anyone checks the result before it leaves the terminal. OpenAI now writes about "harness engineering" directly as a discipline for getting reliable output from Codex in real software environments (OpenAI harness engineering). Anthropic's Claude Code docs say the same thing from another angle: use CLI tools because they are more context-efficient, isolate work in subagents, run multiple sessions in parallel, and use hooks when a rule must happen deterministically instead of relying on the model to remember it (Claude Code best practices, subagents docs, hooks guide).
That is harness engineering. Not the model weights. The wrapper around them.
What the harness actually includes
People often use "prompting" as a catch-all word for this. That is too small.
The harness includes:
- project instruction loading
- memory rules
- skill discovery and activation
- tool interfaces and permissions
- subagent routing
- checkpoint logic
- verification and review passes
- failure handling
Hooks fire at lifecycle events such as PreToolUse, PostToolUse, and Stop, and they can block or transform behavior before the model drifts into the wrong action (Claude Code hooks reference).
That is why "we upgraded the model" often disappoints people. If the harness still feeds the model the wrong slice of the repo, or lets it edit without verification, the upgrade just changes the flavor of the failure.
The model is the engine. The harness is the drivetrain.
A strong engine with a broken drivetrain does not move the car well. Same story here.
Three harness mistakes show up over and over:
1. Stuffing too much into the main context
Anthropic's subagent docs are blunt about this. Use subagents when a side task would flood the main conversation with logs, search results, or file contents you will not need later. Each subagent runs in its own context window and returns only the summary (Claude Code subagents docs).
That is a harness win, not a model win. The same model becomes more useful because you stopped making it carry irrelevant baggage.
2. Using tool surfaces that waste tokens
Anthropic's best-practices page recommends CLI tools because they are the most context-efficient way to interact with external services (Claude Code best practices). A browser tool that sends giant page snapshots for every tiny action is a worse harness than a tool that lets the model ask narrowly for what it needs. Same model, better wrapper, better output.
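The gap is easy to quantify. A minimal sketch, where the len-based token estimate is a crude heuristic (not a real tokenizer) and both tool shapes are invented for illustration:

```python
# Rough illustration of why narrow tool surfaces cost less context.
# The len(text) // 4 estimate and both tool shapes are made up for this sketch.

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic, not a real tokenizer

# A snapshot-style browser tool returns the whole page on every action.
page_snapshot = "<html>" + "x" * 40_000 + "</html>"

def snapshot_tool() -> str:
    return page_snapshot  # all of it lands in the model's context

# A query-style tool returns only the answer the model asked for.
def query_tool(selector: str) -> str:
    return "price: $19.99"  # a few tokens instead of thousands

snapshot_cost = estimate_tokens(snapshot_tool())
query_cost = estimate_tokens(query_tool("#price"))
print(snapshot_cost, query_cost)  # the snapshot costs thousands of tokens more
```

Every action multiplies that difference, because the snapshot is resent for each step in the loop.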
3. Leaving validation to the model's conscience
Claude's hooks guide says it plainly: use hooks when certain actions must always happen, rather than relying on the LLM to choose to do them (Claude Code hooks guide). That is the heart of deterministic checkpoints. If formatting, linting, tests, schema validation, or policy checks matter, wire them into the harness.
Once you do that, the model no longer has to "remember" to be disciplined. The harness supplies the discipline.
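A minimal sketch of what that looks like for a PreToolUse checkpoint. In Claude Code, a hook receives the event as JSON on stdin and a blocking exit code stops the tool call; the deny-list below is invented for illustration, so check the hooks reference for the exact event schema and exit-code semantics:

```python
import re

# Illustrative deny-list; a real policy would be project-specific.
BLOCKED_PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bgit\s+push\s+--force\b",
    r"\bgit\s+reset\s+--hard\b",
]

def check(event: dict) -> tuple[int, str]:
    """Return (exit_code, message) for a PreToolUse-style event."""
    if event.get("tool_name") != "Bash":
        return 0, ""  # only shell commands are inspected here
    command = event.get("tool_input", {}).get("command", "")
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            # A blocking exit code stops the tool call; the message is written
            # to stderr so the model sees why it was refused.
            return 2, f"blocked by policy: matched {pattern!r}"
    return 0, ""

# A real hook would read the event with json.load(sys.stdin) and call
# sys.exit(code); here we demo with a sample event instead.
sample = {"tool_name": "Bash", "tool_input": {"command": "rm -rf build/"}}
code, message = check(sample)
print(code, message)
```

The point is not the specific patterns. It is that the block happens in code the model cannot talk its way around.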
Parallel subagents are not a party trick
This is one of the clearest examples of harness leverage. Anthropic documents subagents as separate assistants with their own prompts, tool access, and context windows, useful for file-heavy or specialized side tasks that should not clutter the main thread (Claude Code subagents docs). Anthropic also recommends running multiple Claude sessions in parallel when you want isolated experiments or faster development (Claude Code best practices).
Why this matters:
- research can happen without polluting implementation context
- review can happen without freezing the main lane
- different tools can be limited to different workers
- summaries come back compressed instead of raw
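The shape of the pattern, stripped to a sketch. `run_model` here is a stand-in for whatever model call your harness makes, not an Anthropic API:

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
ModelFn = Callable[[List[Message]], str]

def run_subagent(task: str, run_model: ModelFn) -> str:
    """Run a side task in its own message list; only a summary crosses back."""
    side_context: List[Message] = [{"role": "user", "content": task}]
    # A real harness would loop over tool calls here, appending logs, search
    # results, and file dumps to side_context. None of that leaks out.
    side_context.append(
        {"role": "user", "content": "Summarize your findings in under 10 lines."}
    )
    return run_model(side_context)

# Demo with a stub model: the main thread gains one compact message,
# not the subagent's whole transcript.
def stub_model(messages: List[Message]) -> str:
    return f"summary of a {len(messages)}-message side conversation"

main_context: List[Message] = [{"role": "user", "content": "Implement feature X"}]
report = run_subagent("Find every caller of parse_config()", stub_model)
main_context.append({"role": "assistant", "content": report})
print(len(main_context))  # still 2: the research happened, the context stayed small
```

The design choice that matters is the return type: a string summary, never the raw side context.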
Deterministic checkpoints are where reliability starts
A deterministic checkpoint is anything the agent cannot wish past.
Examples:
- a PreToolUse hook that blocks destructive shell commands
- a PostToolUse hook that formats code after edits
- a Stop hook that refuses task completion until tests or checks pass
- a required artifact like a diff, a citation list, or a failing-to-passing test record
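A sketch of that test-gate idea as a standalone script, not the literal Claude Code Stop-hook contract. The check commands are assumed, project-specific stand-ins, and the blocking exit code follows the convention described in the hooks reference:

```python
import subprocess
import sys

def gate(checks: list) -> int:
    """Return 0 if every check passes, 2 (a blocking exit) on the first failure."""
    for cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # The failure goes to stderr so the agent can read it and keep
            # working, instead of declaring the task done.
            print(f"check failed: {' '.join(cmd)}", file=sys.stderr)
            return 2
    return 0

# CHECKS is project-specific; this is a harmless stand-in for something like
# `pytest -q` or a lint command.
CHECKS = [
    [sys.executable, "-c", "import sys; sys.exit(0)"],
]

status = gate(CHECKS)
print(status)  # 0: all checks passed, the agent may stop
```

A real hook would end with `sys.exit(gate(CHECKS))` so the nonzero status blocks completion.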
The practical lesson is not "remove humans." It is "stop making humans do work the harness should have blocked automatically."
Skills are also harness components
Anthropic's engineering post on skills explains the progressive-disclosure model clearly: the short metadata for installed skills is present up front, and the full instructions load only when the task matches (Anthropic engineering post). Codex mirrors this with explicit and implicit skill invocation based on the description (OpenAI Codex skills docs).
That changes more than ergonomics. It changes the context budget.
A good skill harness:
- keeps rare but important workflows off the hot path
- loads detail only when needed
- makes agent behavior more repeatable
- lets teams refine procedures independently of the base prompt
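The mechanics behind that list can be sketched in a few lines. The `Skill` shape and the keyword matching below are invented for illustration; real skill files and activation logic differ:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Skill:
    name: str
    description: str                      # always present in the prompt
    load_instructions: Callable[[], str]  # full body, read from disk only on match

def prompt_metadata(skills: List[Skill]) -> str:
    """All the up-front context the installed skills cost."""
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)

def activate(skills: List[Skill], task: str) -> Optional[str]:
    """Naive substring match standing in for the model deciding a skill applies."""
    for s in skills:
        if s.name in task.lower():
            return s.load_instructions()  # the detail enters context only now
    return None

skills = [
    Skill("release notes", "Draft release notes from merged PRs",
          lambda: "...long checklist loaded from disk..."),
]
print(prompt_metadata(skills))                              # one short line per skill
print(activate(skills, "write the release notes for v2.1")) # full body, on demand
```

The context budget math is the point: a rarely used workflow costs one metadata line until the moment it is actually needed.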
Autoresearch is a harness, not a writing style
RuleSell's own /topic/autoresearch page is useful here because it encodes the pattern directly: collect sources first, verify URLs, separate fact from synthesis, and make uncertainty explicit. That is not just "researching carefully." It is a harness that forces a verification pass before the final output.
This pattern matters because agents are unusually good at sounding finished before they are verified. An autoresearch harness fixes that by requiring:
- source gathering before confident prose
- URL verification before publication
- explicit handling of unknowns
- a final synthesis stage after evidence is collected
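Reduced to a sketch, that harness is a gate over claims and their sources. The `Claim` shape and the `url_ok` checker are hypothetical; a real verifier would fetch each URL:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Claim:
    text: str
    sources: List[str]  # URLs backing the claim

def gate_draft(
    claims: List[Claim], url_ok: Callable[[str], bool]
) -> Tuple[List[Claim], List[Claim]]:
    """Split claims into publishable and provisional before any prose ships."""
    publishable, provisional = [], []
    for claim in claims:
        if claim.sources and all(url_ok(u) for u in claim.sources):
            publishable.append(claim)
        else:
            # No verified source: the claim can only ship labeled as uncertain.
            provisional.append(claim)
    return publishable, provisional

# Demo with a stub verifier and invented claims.
known_good = {"https://example.com/docs"}
ok, unsure = gate_draft(
    [
        Claim("documented behavior", ["https://example.com/docs"]),
        Claim("unverified benchmark figure", []),
    ],
    url_ok=lambda u: u in known_good,
)
print(len(ok), len(unsure))  # 1 1
```

Nothing about this is sophisticated. Its value is that the synthesis stage literally cannot see unverified claims as facts.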
About the MindStudio 25.7-point claim
RuleSell's SEO map flags a MindStudio benchmark summarized as "same model, 25.7 percentage-point swing from harness alone." That claim is directionally plausible and consistent with everything above. But I was not able to verify a stable primary public URL for that exact benchmark on May 13, 2026.
So the honest position is:
- the principle is well-supported by OpenAI and Anthropic's published docs
- the exact 25.7-point figure should be treated as provisional until a primary source is produced
Where this fails
Harnesses can become bureaucracy. If every tiny task triggers five hooks, three approval gates, and two review agents, you will smother the workflow.
They can also calcify around one tool. A harness that only makes sense inside Claude hooks or one private plugin is powerful but less portable.
And there is still a ceiling set by the model. A clean harness will not make a weak model good at architecture, debugging, or judgment-heavy tasks. It just stops you from wasting strong models on avoidable failure modes.
What to read next
- /topic/autoresearch for a source-first verification harness
- /topic/agentic-engineering for the broader shift from vibe coding to process
- /topic/subagents for context isolation patterns
- /topic/claude-code-hooks-cookbook for deterministic control recipes
- /for/claude-code for assets built around hooks, skills, and review workflows
- /for/aider for adjacent terminal-agent patterns
Sources
- OpenAI. Harness engineering: leveraging Codex in an agent-first world
- OpenAI. Custom instructions with AGENTS.md
- OpenAI. Agent Skills for Codex
- Anthropic. Best practices for Claude Code
- Anthropic. Create custom subagents
- Anthropic. Automate workflows with hooks
- Anthropic. Hooks reference
- Anthropic. Equipping agents for the real world with Agent Skills
Frequently asked
- What counts as the harness in an agent workflow?
- Everything around the model: instruction loading, skills, tool wrappers, permission policy, checkpoints, verification passes, subagent routing, and how outputs get reviewed or blocked.
- Why say the harness matters more than the model?
- Because the same strong model can still fail if the wrapper gives it noisy context, the wrong tools, no isolation, and no validation loop. Better model quality helps, but bad orchestration wastes it.
- What is a deterministic checkpoint?
- A step the agent must pass outside of its own judgment, such as a test command, lint pass, schema check, or hook that blocks dangerous actions until the condition is met.
- Are subagents part of the harness or part of the model?
- They are harness design. The model may power them, but the choice to isolate context, limit tools, and summarize back into the main thread is orchestration.
- Where does autoresearch fit?
- Autoresearch is a verification harness. It turns "write a page" into a source-checked workflow with explicit evidence collection, uncertainty handling, and final synthesis.
- Did MindStudio really show a 25.7-point harness swing?
- That number is cited in RuleSell's research map, but I was not able to verify a stable primary public URL for it on May 13, 2026. Treat the exact figure as provisional unless a primary source is produced.