Context Engineering
Prompt engineering is about the words you choose. Context engineering is about the workspace you hand the model — what's in it, what order it's in, and what you deliberately left out.
The distinction matters because a context window is not a notepad. It's a limited, expensive, attentional resource. How you fill it changes what the model focuses on, how much it costs you, and whether it stays useful as sessions grow.
The context budget
Every model has a maximum context size — a hard ceiling measured in tokens. Think of it as a budget. You spend it on:
- Your system prompt and standing instructions
- Retrieved documents, codebase snippets, tool definitions
- Conversation history
- The model's output (which also counts against the window in multi-turn sessions)
When you run out, something has to give. Either old content gets dropped, or the session hits a wall.
Most beginner guides treat the context window as "more is better." Context engineering treats it as a resource to allocate carefully: spend it on what the model actually needs for this turn, not on everything that might be relevant.
Context rot and "lost in the middle"
There is a well-documented phenomenon in long-context LLMs: models pay disproportionate attention to content near the beginning and end of their context, and their recall of content buried in the middle degrades. Researchers studying this effect called it "lost in the middle."
The practical consequence: if you stuff a 100,000-token context with documents and bury the most critical instruction at position 60,000, the model may effectively ignore it — not because it's incapable of reading that far, but because attention is not evenly distributed across the window.
"Context rot" is the broader pattern: as a session grows, the quality of responses tends to drift. Early instructions get diluted. Repeated back-and-forth crowds out the original task. The model starts hedging, repeating itself, or losing the thread of what you actually asked for.
These are not bugs you can fully fix with a better prompt. They are structural properties of how attention works at scale. The engineering response is to keep the context smaller and sharper, not to fill it and hope.
Ordering matters
Where you place content is as important as what you include. Established good practice:
| Position | What to put there |
|---|---|
| Very top (system prompt) | Stable, durable instructions. Persona, rules, format requirements. |
| After system prompt | The current task, in plain terms. |
| Just before the last user turn | The most critical, specific context for this exact request. |
| Middle | Supporting documents, retrieved chunks — ordered by relevance, not chronology. |
| Conversation history | Only what's necessary for continuity. Prune aggressively. |
The general rule: the closer to the current turn, the more attention it gets. Critical instructions that live only in the middle of a long history are at risk.
Retrieval over stuffing
The temptation is to put everything in: all the docs, the full codebase, the entire conversation. Resist it.
The better approach is selective retrieval: identify what the model actually needs for this specific request, and inject only that. A well-retrieved 2,000-token chunk of the right document outperforms a 40,000-token dump where the answer is somewhere in the middle.
This is why retrieval-augmented generation (RAG) exists — not just to overcome context limits, but to improve quality by keeping the context curated.
For interactive sessions, the same logic applies: instead of accumulating everything, periodically compact or clear history to remove content that is no longer relevant to the current task. Claude Code's /compact and /clear commands are context engineering tools, not just session management.
The cost angle
Tokens you send are tokens you pay for — both in money and latency. Stuffing context with loosely relevant material inflates both. Context engineering and cost efficiency are the same problem.
More concretely:
- A bloated system prompt you copy-paste from a template is paid for on every single call.
- Old conversation history you carry forward because "it might be useful" is paid for on every single call.
- Documents you inject "just in case" are paid for on every single call.
Trimming what doesn't need to be there is simultaneously better for quality and cheaper to run.
Practical tactics for Claude users
In Claude.ai:
- Use distinct conversations for distinct tasks. Don't let an afternoon of tangents pollute the context of a focused project.
- Summarize long threads before asking a complex question that depends on them. An explicit summary is often more useful than the raw history.
- Put the specific thing you want at the end of a long message, not buried in the middle.
In Claude Code:
- Keep your
CLAUDE.mdfile lean. Every line in it is injected into every session. See CLAUDE.md and Context Management. - Use
/clearwhen switching to a genuinely different task. Use/compactwhen you want to continue but the session is growing. - Reference files by path rather than pasting their contents when the full file isn't needed for the current step.
At the API level:
- Design system prompts to contain only what every request truly needs. Move task-specific instructions into the user turn.
- For document-heavy use cases, retrieve and inject the relevant chunks rather than uploading an entire corpus.
- Structure the prompt so the stable, reusable prefix comes first — this also enables prompt caching, which is a natural companion to context engineering.
The shift in mindset
Prompt engineering asks: "What should I say?" Context engineering asks: "What should the model see, in what order, and what should I deliberately keep out?"
The second question is harder, but it's the one that actually determines quality at scale.