Harnesses for Long-Running Agents

Most guides optimize the model and the prompt. But once an agent's work is too big to finish in one context window — a multi-day refactor, a research project, a migration across hundreds of files — the thing that decides whether it succeeds is rarely the model. It's the harness: the scaffolding around the model that carries state forward, checks that progress is real, and recovers when a step goes wrong.

A useful definition: a harness encodes everything the model can't reliably do on its own. The model reasons and acts within a turn; the harness makes those turns add up to something across sessions.

The session wall

A context window is a hard ceiling, and most non-trivial projects can't be completed inside one. So a long-running agent doesn't run as one giant conversation. It runs as a sequence of bounded sessions, each starting with a nearly empty context.

That creates the central problem: how does session N+1 know what session N did? Without an answer, the agent either re-discovers everything (slow, expensive, error-prone) or contradicts its earlier self. The harness's first job is to bridge that gap.

What the harness carries between sessions

The bridge is built from two things the model doesn't have natively:

  • Artifacts — durable outputs left in the world: committed code, a written plan, a progress log, updated tests. The next session reads them instead of re-deriving them.
  • External memory — a deliberately curated store (files, a scratchpad, a task list) holding the high-signal state: what's done, what's next, what was decided and why. Before the window fills, the agent summarizes the completed phase into memory and lets the raw turn-by-turn history fall away. This is compaction — trading verbose history for a compressed, reloadable summary.

The discipline is the same as context engineering: not "remember everything," but "carry forward the smallest set of high-signal tokens that lets the next session continue correctly."
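
As a rough illustration of compaction, the sketch below folds older turns into a summary stored in a JSON file and keeps only the most recent turns in the working history. The names `compact`, `naive_summarize`, and the `summaries` key are hypothetical, and the summarizer is a stand-in for what would normally be a model call.

```python
import json
from pathlib import Path


def naive_summarize(turns):
    # Stand-in for a model call (assumption): keep only turns flagged as decisions.
    return [t["text"] for t in turns if t.get("decision")]


def compact(turns, summarize, memory_path, keep_last=2):
    """Fold all but the last `keep_last` turns into external memory.

    Returns the turns that stay in the working history.
    """
    memory_path = Path(memory_path)
    if memory_path.exists():
        memory = json.loads(memory_path.read_text())
    else:
        memory = {"summaries": []}

    # Compute the split explicitly so keep_last=0 also behaves correctly.
    split = max(len(turns) - keep_last, 0)
    old, recent = turns[:split], turns[split:]

    if old:
        memory["summaries"].append(summarize(old))
        memory_path.write_text(json.dumps(memory, indent=2))
    return recent
```

The next session would load the memory file instead of the raw history. What counts as high-signal is the real design decision here; the decision flag is only one possible heuristic.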

The initializer-and-worker pattern

A pattern Anthropic describes for the Claude Agent SDK splits the job in two:

  • An initializer runs once, on the first session: it sets up the environment — dependencies, config, a map of the codebase, the plan — so later sessions don't pay that cost repeatedly.
  • A worker runs every session after: it makes one increment of progress and, crucially, leaves clear artifacts for the next worker before it stops.

The insight is that "start the work" and "continue the work" are different jobs with different context needs. Conflating them wastes the window re-initializing every time.
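
A minimal sketch of that split, assuming the harness decides which role to run by checking whether a state file already exists. This is not the Claude Agent SDK's API: `STATE_FILE`, `run_initializer`, `run_worker`, and `run_session` are hypothetical names, and the bodies are placeholders for real agent sessions.

```python
import json
from pathlib import Path

STATE_FILE = "harness_state.json"  # hypothetical file name


def run_initializer(workdir: Path) -> dict:
    # Placeholder: a real initializer session would set up dependencies,
    # map the codebase, and write the plan.
    state = {"plan": ["step-1", "step-2", "step-3"], "done": []}
    (workdir / STATE_FILE).write_text(json.dumps(state))
    return state


def run_worker(workdir: Path) -> dict:
    # A worker starts from artifacts alone: it reads the state file,
    # makes one increment of progress, and records it before stopping.
    state = json.loads((workdir / STATE_FILE).read_text())
    remaining = [s for s in state["plan"] if s not in state["done"]]
    if remaining:
        state["done"].append(remaining[0])  # placeholder for real work
        (workdir / STATE_FILE).write_text(json.dumps(state))
    return state


def run_session(workdir: Path) -> str:
    """Run the initializer on the first session and a worker on every later one."""
    if not (workdir / STATE_FILE).exists():
        run_initializer(workdir)
        return "initializer"
    run_worker(workdir)
    return "worker"
```

Calling `run_session` repeatedly on the same directory runs the initializer once and a worker every time after that, which is the property the pattern is after.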

Verification is part of the harness, not an afterthought

A long-running agent compounds its own errors: a wrong assumption in session 2 becomes the foundation for sessions 3 through 30. The harness has to make progress checkable — tests it can run, a build it can break, a linter, an explicit "definition of done" — so the agent catches drift early instead of confidently building on sand.

This is the capability–reliability gap in operational form: the model is capable of the step, but only a verification loop makes a sequence of steps reliable. Where you let the agent act without that check is exactly where you've placed it on the trust ladder.
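
One way to make that loop concrete is a checkpoint that runs a verification command and refuses to mark a step done unless it exits cleanly. The sketch below assumes the project has such a command. `checkpoint`, `finish_step`, and the `mark_done` callback are hypothetical names.

```python
import subprocess


def checkpoint(command, timeout=600):
    """Run the verification command and return (passed, combined output)."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False, "verification timed out"
    except OSError as exc:
        # For example, the command is not installed in this environment.
        return False, f"could not run verification: {exc}"
    return result.returncode == 0, result.stdout + result.stderr


def finish_step(step_name, command, mark_done):
    """Record a step as done only if the checkpoint passes."""
    passed, output = checkpoint(command)
    if passed:
        mark_done(step_name)
    return passed, output
```

In a project that uses pytest, for instance, the command might be `["pytest", "-q"]`. The failure output is returned rather than discarded so the agent can read it and correct course within the same session.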

Designing your own harness

A practical checklist when work will outlast one window:

  • Externalize state early. Decide now where progress lives — a PROGRESS.md, a task list, commits — not when the context is already full.
  • Make sessions resumable. Each session should be able to start from artifacts alone. If it needs the previous chat to make sense, the harness is leaking state.
  • Compact deliberately. Summarize finished phases into memory before the window forces a messy truncation.
  • Build in a checkpoint. Give the agent a command that answers "is the work still correct?" — and have it run before declaring a step done.
  • Plan for recovery. Assume a session will fail mid-step. The next one should detect the half-done state and continue, not duplicate or corrupt it.
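
The recovery item can be sketched with a task list that marks work as in progress before it starts and is written atomically. The file layout, the status values, and the function names below are assumptions for illustration, not a prescribed format.

```python
import json
import os
import tempfile
from pathlib import Path


def save_tasks(path, tasks):
    """Write the task list atomically so a crash mid-write keeps the old file."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(tasks, f, indent=2)
    os.replace(tmp, path)  # atomic rename over the previous version


def load_tasks(path):
    return json.loads(Path(path).read_text())


def next_task(tasks):
    """Prefer a task an earlier session left half-done, then the first pending one."""
    for status in ("in_progress", "pending"):
        for task in tasks:
            if task["status"] == status:
                return task
    return None


def claim(path, tasks, task):
    # Persist the marker before doing any work, so a failed session
    # leaves evidence of exactly which step was interrupted.
    task["status"] = "in_progress"
    save_tasks(path, tasks)


def complete(path, tasks, task):
    task["status"] = "done"
    save_tasks(path, tasks)
```

A session that finds a task already marked `in_progress` knows it is resuming rather than starting, and can inspect the workspace before acting instead of repeating the step blindly.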

The takeaway: for short tasks, optimize the prompt. For work that spans sessions, optimize the harness — it's the part that turns a capable model into a dependable one.
