Key Takeaways
- Claude Code’s entire toolset, including Agent Skills, programmatic tool calling, and the memory tool, is composed from just two primitives: bash and a text editor
- Letting Claude write code to orchestrate its own tool calls keeps intermediate results out of the context window, cutting token spend on outputs the model would mostly ignore
- Compaction automatically summarizes and replaces older context when a session approaches a configurable threshold, enabling indefinitely long agentic tasks
- Cached tokens cost 10% of standard input token pricing; static prompt content placed first maximizes cache hit rate per turn
86.8%. That’s the BrowseComp score Anthropic reports for Claude Opus 4.6 running with a multi-agent harness, web search, compaction triggered at 50,000 tokens, and maximum reasoning effort. Run the same model without the multi-agent harness and the score drops. The gap between those two configurations isn’t a model difference. It’s a harness design difference.
Anthropic’s blog post “Harnessing Claude’s Intelligence” frames this as the central challenge for every developer building on Claude today: harnesses encode beliefs about what Claude can’t do on its own, and those beliefs expire faster than most teams realize. Three structural patterns now separate lean, high-performing agent builds from bloated ones that fight the model instead of trusting it.
Use What Claude Already Knows
Start with the tool selection decision most developers overthink. Anthropic’s own guidance is blunt: “bash is all you need.”
Claude Code, Anthropic’s most capable production agent, is built on bash and a text editor as its two foundational tools. Agent Skills, programmatic tool calling, and the memory tool are all composed from those two primitives. Claude 3.5 Sonnet hit 49% on SWE-bench Verified using a simple prompt and two general-purpose tools, a result that was state-of-the-art at the time of release in October 2024.
Building on familiar tools compounds over time. Bash maps directly to how frontier models are trained: Claude has processed enormous volumes of shell usage and improves at it with each model version. Teams that build elaborate custom tool schemas teach Claude a new language per project, while teams using bash inherit Anthropic’s training investment automatically. And rather than calling separate tools for search, file linting, and code execution, Claude can chain those steps in a single piped bash command.
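As a concrete illustration, one piped command can collapse what would otherwise be several tool-call round trips into a single small result. A minimal Python sketch, assuming a POSIX shell; `run_bash` is a hypothetical stand-in for a harness’s bash tool:

```python
import subprocess

def run_bash(command: str) -> str:
    """Execute a shell command the way a bash tool would, returning stdout."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout

# Instead of three separate tool calls (generate, sort, count), Claude can
# emit one piped command and receive a single small result.
piped = run_bash("printf 'alpha\\nbeta\\nalpha\\n' | sort | uniq -c | sort -rn")
print(piped)
```

The harness sees one command and one result, instead of shuttling three intermediate outputs through the model’s context.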
Why Claude Should Orchestrate Its Own Actions
Here’s the harness assumption that quietly costs the most. When every tool result returns through Claude’s context window before the next action fires, you pay token costs for data the model mostly ignores.
Giving Claude a code execution environment (a bash tool or a language-specific REPL) breaks that pattern. Claude writes code that expresses the full chain of tool calls and the logic between them. Rather than the harness routing every result back as tokens, Claude decides what to filter, pass through, or pipe into the next step without touching the context window at all. Only the final output of code execution reaches Claude’s context.
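A minimal sketch of the pattern, with illustrative data: the ten thousand intermediate lines exist only inside the execution environment, and only the one-line summary would reach the context window:

```python
# Sketch of code Claude might write inside a code-execution tool: process a
# large intermediate result locally and surface only a small summary, so the
# model's context never sees the raw data. The log lines are illustrative.
lines = [f"INFO request {i}" for i in range(10_000)] + ["ERROR timeout on request 7"]

errors = [line for line in lines if line.startswith("ERROR")]

# Only this final, filtered output would be returned to Claude's context.
summary = f"{len(errors)} error line(s); first: {errors[0]}"
print(summary)
```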
But this doesn’t replace declarative tools entirely. Hard-to-reverse actions, external API calls, file overwrites, and anything crossing a security boundary still belong in dedicated typed tools. Those tools give the harness an action-specific hook with typed arguments it can intercept, gate, render, or audit. Bash provides broad leverage; declarative tools provide control surfaces. The distinction is security-driven, not performance-driven.
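For contrast, a declarative tool gives the harness typed fields it can inspect before anything runs. A sketch using the Messages API tool-definition format; the tool name, its fields, and the `gate` hook are illustrative, not a real Claude Code API:

```python
# A declarative tool definition in the Anthropic Messages API tool format:
# typed arguments give the harness a hook to intercept, gate, or audit the
# action before it runs. Tool name and fields here are illustrative.
delete_record_tool = {
    "name": "delete_record",
    "description": "Permanently delete a record. Irreversible.",
    "input_schema": {
        "type": "object",
        "properties": {
            "record_id": {"type": "string", "description": "ID of the record"},
            "reason": {"type": "string", "description": "Audit-log reason"},
        },
        "required": ["record_id", "reason"],
    },
}

def gate(tool_name: str, args: dict) -> bool:
    """Harness-side hook: approve or block a call before execution."""
    # With bash, the harness sees only an opaque command string; with a typed
    # tool it can inspect specific fields, such as requiring an audit reason.
    return tool_name == "delete_record" and bool(args.get("reason"))

print(gate("delete_record", {"record_id": "r-123", "reason": "GDPR request"}))
```

This is the control surface the bash tool can’t provide: a bash command is one opaque string, while `input_schema` exposes exactly the fields a policy check needs.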
Context Management: 3 Capabilities Claude Now Handles Better Than Your Harness
Anthropic identifies three distinct context decisions developers typically hard-code into harnesses but that Claude handles more effectively when given the right primitives.
1. Assembling its own context: Agent Skills are markdown files and scripts stored on the filesystem. Claude sees only short names and descriptions for each skill and pulls the full content only when a task requires it. This stops developers from padding system prompts with rarely-used instructions that consume attention budget on every single turn.
2. Editing out stale context: Context editing lets developers selectively remove old tool results and thinking blocks that have become irrelevant. Once a tool has been called deep in message history, Claude rarely needs to see the raw result again. Tool result clearing is among the safest and lightest-touch forms of compaction available on the API.
3. Persisting across long-horizon runs: Compaction in the Claude Agent SDK automatically summarizes previous messages when the context limit approaches, so agents don’t hit hard stops mid-task. The Claude Agent SDK’s subagent architecture uses a 200,000-token context window per subagent, with compaction triggering when a subagent’s context reaches 50,000 tokens. Subagents serve two distinct purposes: parallelization across independent tasks, and context isolation, where each subagent sends only relevant conclusions back to the orchestrator rather than its full context history.
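The compaction behavior in item 3 can be sketched in a few lines. Everything here is a simplification: `count_tokens` approximates tokens by word count, and `summarize` is a placeholder for what would be a model call in the Agent SDK:

```python
# Minimal sketch of threshold-triggered compaction. Hypothetical helpers:
# real token counting and summarization would go through the model.
COMPACTION_THRESHOLD = 50_000  # tokens, per the Agent SDK subagent default

def count_tokens(messages: list[str]) -> int:
    return sum(len(m.split()) for m in messages)  # crude word-count stand-in

def summarize(messages: list[str]) -> str:
    return f"[summary of {len(messages)} earlier messages]"  # placeholder

def maybe_compact(messages: list[str]) -> list[str]:
    """Once the running context approaches the threshold, replace all but
    the most recent turns with a summary so the task can continue."""
    if count_tokens(messages) < COMPACTION_THRESHOLD:
        return messages
    recent = messages[-2:]  # keep the latest turns verbatim
    return [summarize(messages[:-2])] + recent

history = ["word " * 30_000, "tool result " + "x " * 25_000, "latest question"]
compacted = maybe_compact(history)
print(len(compacted), compacted[0])
```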
Context management isn’t a single toggle. It’s a layered system, and Claude Code itself runs all three in production simultaneously.
Where the Harness Still Matters
Handing more decisions to Claude doesn’t eliminate harness engineering. It redefines it.
Claude Code’s auto-mode security boundary uses a second Claude instance to evaluate bash commands for safety before execution. That adds latency and cost, making it unsuitable for high-volume or already-trusted workflows. Compaction introduces a summarization layer that, in edge cases, can lose nuanced context a human engineer would flag as critical, though Anthropic’s production use of it in Claude Code reflects confidence in its reliability for most tasks.
Most reviewers focus on model benchmark scores when evaluating Claude upgrades. Harness debt, the scaffolding written for an older model’s limitations that actively bottlenecks a newer one, is the variable they consistently ignore, and it matters more than the benchmark gap.
Prompt Caching: The 10% Cost Floor Most Teams Miss
The Anthropic Messages API is stateless. Every turn requires repackaging the full conversation history, tool descriptions, and system prompt. Prompt caching directly addresses this.
Cached tokens cost 10% of standard input token pricing, a reduction of up to 90% on cached content. The cache matches on prefixes, so static content (system instructions, tool definitions) placed first in the prompt maximizes the hit rate. Dynamic content (new messages and changing context) appends at the end without invalidating the cached prefix.
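Putting the prefix rule into practice, a cache-friendly request can be assembled like this. A sketch only: the model ID and instruction text are placeholders, though `cache_control: {"type": "ephemeral"}` is the API’s actual cache-breakpoint marker:

```python
# Sketch of a cache-friendly Messages API payload: static content (system
# prompt, tool definitions) leads and carries a cache breakpoint; dynamic
# messages append after it. Strings here are illustrative placeholders.
def build_request(history: list[dict], new_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # keep one model: caches are model-specific
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": "You are a coding agent. <long static instructions>",
                # Everything up to and including this block can be cached.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic turns append after the cached prefix; never edit the prefix.
        "messages": history + [{"role": "user", "content": new_message}],
    }

req = build_request([], "Refactor utils.py")
print(req["system"][0]["cache_control"])
```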
| Caching Principle | What It Does | Why It Matters |
|---|---|---|
| Static content first | Stable content leads the prompt | Maximizes cache hit rate per turn |
| Append, don’t edit messages | Preserves cached prefix | Avoids cache invalidation mid-session |
| Don’t switch models mid-session | Caches are model-specific | Switching resets the cache entirely |
| Use tool search for dynamic tools | Appends without breaking prefix | Add tools without invalidating cached content |
Considerations
The bash-first approach gives Claude broad programmatic leverage but hands the harness only a command string, the same shape for every action, which reduces observability compared to typed declarative tools. Compaction’s summarization can silently drop context a developer intended to persist. And the multi-agent architecture that lifted BrowseComp scores to 86.8% carries orchestration overhead that may not justify itself on simpler, single-task workloads.

