Building an agent with the Claude Agent SDK is deceptively easy. You wire up a system prompt, hand it a few tools, and watch it reason its way through a first task. The trouble starts around the tenth turn — when the transcript balloons, relevant facts get buried, and the agent starts confidently repeating work it already finished. The model did not get worse. Its context did.
Context is the set of tokens the model can see at this particular step: the system prompt, tool definitions, retrieved documents, prior messages, memory, and tool results. It is finite, it is expensive, and — unlike a single prompt — it changes at every turn. Treating it as a static thing you design once, instead of a live resource you curate, is the single most common reason agents fail after the demo.
Why context is the bottleneck
For a single-turn query, prompt engineering is almost all that matters: you write one good instruction, you attach one good example, the model answers. For an agent, the problem is fundamentally different. Every tool call produces new tokens. Every retrieved document competes with every other retrieved document for space. A sloppy search tool that returns ten thousand tokens of HTML when four hundred tokens of markdown would do is not just wasteful — it actively degrades the model's attention for everything that follows.
Put another way: prompt engineering asks "what should I say?" Context engineering asks "what should the model be allowed to see, right now, in order to take the next good action?"
Gather, curate, prune: a loop
The mental model I use — and the one the Agent SDK is quietly designed around — is that every turn of an agent runs through three phases. It gathers new material (tool calls, retrieval, file reads). It curates what actually belongs in the window (system prompt, selected docs, pinned memory, recent messages). And it prunes what no longer pays its own weight (stale tool output, superseded plans, duplicated summaries).
The diagram below is the one I sketch on a whiteboard every time someone asks me to review their agent. Getting this loop right is the first thing to do.
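The gather–curate–prune loop can be sketched in plain TypeScript, independent of any SDK. Everything here — the `ContextItem` shape, the `turn` function, the phase names — is illustrative, not Agent SDK API:

```typescript
// One conceptual turn of an agent, expressed as three phases.
type ContextItem = { text: string; tokens: number; stale: boolean };

function turn(
  context: ContextItem[],
  gather: () => ContextItem[], // tool calls, retrieval, file reads
  budget: number               // hard cap on tokens admitted this turn
): ContextItem[] {
  // 1. Gather: new material enters as candidates, not as context.
  const candidates = [...context, ...gather()];

  // 2. Prune: drop what no longer pays its own weight.
  const live = candidates.filter((item) => !item.stale);

  // 3. Curate: admit items until the token budget is spent.
  const curated: ContextItem[] = [];
  let spent = 0;
  for (const item of live) {
    if (spent + item.tokens > budget) break;
    curated.push(item);
    spent += item.tokens;
  }
  return curated;
}
```

The point of the sketch is the shape, not the policy: gathering never writes straight into the window, and curation is a budgeted admission step rather than an append.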
The four sources of context
Before we can curate, we have to know what's on the table. In practice, almost everything the Agent SDK puts into a context window falls into one of four buckets:
- System context — the durable instructions that define the agent's role, constraints, and available tools. Written once, rarely changed per-turn.
- Retrieved context — documents, code, tickets, or pages fetched from an external store because they are probably relevant to this task.
- Tool context — the return values of actions the agent has taken: a shell output, an API response, a file the agent just wrote.
- Memory context — durable notes the agent has written to itself across turns, sessions, or users. Useful, dangerous, and usually over-trusted.
The first mistake most teams make is blending these together into one undifferentiated stream of messages. The Agent SDK lets you keep them separate — as typed fields on a structured session — and you should take it up on the offer. Separation is what makes pruning possible.
A typed session, sketched
```typescript
// A minimal Claude Agent SDK session with explicit context slots.
import { Agent, tool } from "@anthropic-ai/claude-agent-sdk";

const agent = new Agent({
  model: "claude-sonnet-4-5",
  system: load("prompts/system.md"),
  tools: [searchDocs, readFile, runShell],
  context: {
    memory: "memory/hendrik.md",     // durable, small
    retrieved: { maxTokens: 4_000 }, // budgeted
    tools: { keepLast: 6 },          // window of results
  },
  onTurn: async (ctx) => {
    await ctx.curate();  // score + re-rank retrieved docs
    await ctx.compact(); // summarize stale tool output
  },
});
```
The important line isn't any one API call — it's that the context is a thing with named parts, each with its own budget. When a turn goes sideways, you can point at which bucket overspent.
The agent loop in practice
Zoom in on a single turn and the picture gets more interesting. The model plans, it calls a tool, it reads the result, and — crucially — it reflects. Most agents that fall over in production fall over because they skip that last step. They accumulate tool output, they never summarize, and by turn twelve the plan from turn three is so far up the transcript that the model can no longer see it.
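One mechanical way to make the reflect step concrete: after each tool call, fold older raw results into a short running note and keep only the most recent ones verbatim. A minimal sketch, with illustrative names throughout — in practice the summary would come from a model call, not string slicing:

```typescript
// Illustrative reflect step: replace old raw tool output with one note.
type Message = { role: "tool" | "note"; text: string };

function reflect(transcript: Message[], maxToolMessages: number): Message[] {
  const toolMsgs = transcript.filter((m) => m.role === "tool");
  if (toolMsgs.length <= maxToolMessages) return transcript;

  // Summarize the oldest tool results into a single compact note.
  const toSummarize = toolMsgs.slice(0, toolMsgs.length - maxToolMessages);
  const note: Message = {
    role: "note",
    text:
      `Summary of ${toSummarize.length} earlier tool results: ` +
      toSummarize.map((m) => m.text.slice(0, 40)).join(" | "),
  };

  // Keep only the most recent tool messages verbatim.
  const keep = new Set(toolMsgs.slice(-maxToolMessages));
  return [note, ...transcript.filter((m) => m.role !== "tool" || keep.has(m))];
}
```

Run this after every tool call and the plan from turn three stays within sight, because the transcript between it and the current turn never grows unboundedly.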
Retrieval, compression, and hand-off
Three mechanical moves buy you most of the improvement. None of them are glamorous.
Retrieval with a budget. Rank, don't dump. If you are pulling documents from a vector store, cap the total tokens you'll admit per turn and keep the top-k by score, not the top-k by count. A single three-thousand-token document can be worth six shorter ones — or none of them.
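The rank-don't-dump move fits in a dozen lines. A sketch with assumed names (`Doc`, `admitWithinBudget` are mine, not SDK API); note that overflowing documents are skipped rather than breaking the loop, so a smaller lower-ranked document can still fit:

```typescript
// Illustrative budgeted retrieval: admit top-scored docs until a token cap.
type Doc = { id: string; score: number; tokens: number };

function admitWithinBudget(docs: Doc[], maxTokens: number): Doc[] {
  const ranked = [...docs].sort((a, b) => b.score - a.score);
  const admitted: Doc[] = [];
  let spent = 0;
  for (const doc of ranked) {
    if (spent + doc.tokens > maxTokens) continue; // skip; a smaller doc may still fit
    admitted.push(doc);
    spent += doc.tokens;
  }
  return admitted;
}
```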
Compression of tool output. The Agent SDK exposes a per-tool post-processor precisely for this. If a tool returns HTML, strip it. If it returns a 400-line stack trace, keep the first frame and the exception. You are not lying to the model; you are making room for the work it actually has to do.
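Two post-processors of this kind, written as plain standalone functions — how you attach them to a tool is up to your harness, and both implementations are simplistic sketches (the regex tag-stripper, for instance, is not a real HTML parser):

```typescript
// Illustrative post-processors for two common sources of context bloat.

// Strip HTML tags and collapse whitespace.
function stripHtml(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Keep only the exception line and the first stack frame of a trace.
function trimTrace(trace: string): string {
  const lines = trace.split("\n");
  const firstFrame = lines.findIndex((l) => l.trimStart().startsWith("at "));
  return firstFrame === -1 ? lines[0] : [lines[0], lines[firstFrame]].join("\n");
}
```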
Hand-off via memory. When a sub-task is finished, the agent should write a short summary to its memory file and then drop the transcript of that sub-task from the active context. This is the move that lets long-running agents stay coherent across hundreds of turns without ballooning their own window.
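The hand-off is a single transactional move: summarize the finished sub-task into memory, drop its turns from the transcript. A sketch with illustrative types; in a real agent `summarize` would be a model call, not a pure function:

```typescript
// Illustrative hand-off: close a sub-task by writing a short memory note
// and dropping its transcript from the active context.
type SubTaskTurn = { taskId: string; text: string };

function handOff(
  transcript: SubTaskTurn[],
  memory: string[],
  finishedTask: string,
  summarize: (turns: SubTaskTurn[]) => string // in practice, a model call
): { transcript: SubTaskTurn[]; memory: string[] } {
  const done = transcript.filter((t) => t.taskId === finishedTask);
  return {
    transcript: transcript.filter((t) => t.taskId !== finishedTask),
    memory: [...memory, summarize(done)],
  };
}
```

Doing both halves together is what matters: summarizing without dropping just duplicates content, and dropping without summarizing loses the work.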
Autonomy vs. reliability
Now the uncomfortable part. Everything above is a set of constraints. Constraints cost you autonomy. The more aggressively you curate, compress, and gate, the more "behaved" the agent becomes — and the less room it has to discover the solution you didn't think of. The more you let it cook — long context, open-ended tool access, no pruning — the more you see the flashes of genuine capability, and the more you see the spectacular failures.
The matrix below is how I think about the trade-off when I'm picking a mode for a given task. Let it cook is where interesting agentic behavior lives. Curated is where shippable agentic behavior lives. The work is knowing which task sits where — and, increasingly, moving tasks leftward as you learn to trust them.
What to measure
Context work is invisible until you measure it. Three numbers I put on every dashboard:
- Context utilization — for each turn, what fraction of the window was actually referenced in the model's next action? Under 20% for more than a few turns is a sign you are over-retrieving.
- Redundancy ratio — how much of the current context duplicates content already seen earlier in the session? If it creeps above 30%, your compaction isn't firing.
- Time-to-first-action — how long between the user's message and the agent's first tool call? When this grows, it is almost always because the model is wading through irrelevant context before it can decide.
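The redundancy ratio is the easiest of the three to compute mechanically. A sketch using exact matching on normalized chunks — a crude proxy for the real thing, which would need fuzzy or semantic matching; the function name and chunk representation are mine:

```typescript
// Illustrative redundancy ratio: fraction of current-context chunks
// already seen earlier in the session. `seen` holds chunks that were
// normalized the same way when they were first recorded.
function redundancyRatio(current: string[], seen: Set<string>): number {
  if (current.length === 0) return 0;
  const norm = (s: string) => s.replace(/\s+/g, " ").trim().toLowerCase();
  const dup = current.filter((c) => seen.has(norm(c))).length;
  return dup / current.length;
}
```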
None of these are in the model card. All of them will tell you more about how your agent is actually doing than the benchmarks will.
A closing note
The Claude Agent SDK gives you the primitives — typed context slots, per-tool post-processors, memory hand-off, structured sessions. What it does not give you is judgment about when to use them. That judgment comes from sitting with your own agent for a few days, watching its transcripts, and asking, every time it stumbles, the same question: what, in this window, is the model paying for that it didn't need? Most of the craft is in the answers.
— Hendrik Krack, krackedtools.dev