Summarization

Long conversations eventually outgrow a model's context window. On 32k-token on-device models this causes an overflow; on cloud models it silently inflates cost and latency. DeepAgents handles this with SummarizationMiddleware - a context compaction mechanism that replaces older turns with a compact summary generated by the same model, keeping the most recent turns verbatim.

How it works

SummarizationMiddleware hooks into wrapModelCall. Before every model call it estimates the token count of the full request - including the system prompt and tool schemas, because the model re-reads both on every turn. Once the estimate crosses the trigger threshold, it compacts in place before the call proceeds.

The rewritten history has this shape:

[ summary turn (human, marked)  ]   <- condensed summary of the evicted older turns
[ acknowledgment (assistant)    ]   <- short synthetic "understood, continuing" turn
[ recent tail, verbatim         ]   <- the most recent turns kept as-is
  • The summary is a .human turn (not a second .system message - some providers reject mid-history system messages, and the local LFM2 chat template does not support them). The turn carries AgentMessage.source == "summarization" so a later compaction can fold it rather than treating it as an original user message.
  • The acknowledgment keeps roles alternating (human - assistant - human ...), which Anthropic and some other backends require.
  • A later compaction folds the prior summary plus the next block into a fresh summary, so the history stays [summary] + tail indefinitely - the "rolling" model.
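The shape above can be sketched in a few lines. This is only an illustration: `Role`, `Message`, `rewriteHistory`, and `isPriorSummary` are hypothetical stand-ins, not the library's types, and the acknowledgment text is invented for the example.

```swift
// Hypothetical stand-ins for the library's message types (illustration only).
enum Role { case system, human, assistant, tool }

struct Message {
    var role: Role
    var text: String
    var source: String? = nil   // "summarization" marks a synthetic summary turn
}

/// Builds [summary turn] + [acknowledgment] + tail.
func rewriteHistory(summary: String, tail: [Message]) -> [Message] {
    // The summary is a human turn, not a mid-history system message.
    let summaryTurn = Message(role: .human, text: summary, source: "summarization")
    // The synthetic assistant turn keeps roles alternating before the tail.
    let acknowledgment = Message(role: .assistant, text: "Understood, continuing.")
    return [summaryTurn, acknowledgment] + tail
}

/// A later compaction can recognize an earlier summary by its marker and fold it.
func isPriorSummary(_ message: Message) -> Bool {
    message.role == .human && message.source == "summarization"
}

let tail = [
    Message(role: .human, text: "Now add tests."),
    Message(role: .assistant, text: "Adding tests."),
]
let history = rewriteHistory(summary: "Summary of earlier turns.", tail: tail)
```

Because the tail here starts on a human turn, the rewritten history alternates human, assistant, human, assistant.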

Token counting is approximate

Token counts are estimated at ~4 characters per token across all backends, because none of them exposes a live tokenizer count. The 15% headroom between the trigger threshold and the full window is sized to absorb this imprecision. The estimate includes system prompt and tool schema bytes, not just message text, so a deep agent with many tools compacts at the right time even when the message-only count looks comfortable.
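A minimal sketch of this kind of estimate, assuming UTF-8 byte length as the character count and rounding up per part. The function names and the rounding rule are assumptions for illustration, not the middleware's actual code.

```swift
/// Rough estimate at ~4 characters per token, rounded up (the rounding is an assumption).
func estimateTokens(_ text: String) -> Int {
    (text.utf8.count + 3) / 4
}

/// Estimates the whole request: system prompt, tool schemas, and message text.
func estimateRequestTokens(systemPrompt: String, toolSchemas: [String], messages: [String]) -> Int {
    let parts = [systemPrompt] + toolSchemas + messages
    return parts.reduce(0) { $0 + estimateTokens($1) }
}

// A short conversation behind a large tool surface.
let systemPrompt = String(repeating: "s", count: 400)                            // ~100 tokens
let toolSchemas = [String](repeating: String(repeating: "t", count: 2000), count: 2)  // ~1000 tokens
let messages = [String(repeating: "m", count: 800)]                              // ~200 tokens

let messageOnly = messages.reduce(0) { $0 + estimateTokens($1) }
let fullRequest = estimateRequestTokens(
    systemPrompt: systemPrompt,
    toolSchemas: toolSchemas,
    messages: messages
)
```

In this made-up example the message-only count is 200 tokens while the full request is 1300, which is why counting only message text would trigger compaction too late.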

When it fires

Automatically

The middleware measures the request before every model call. When the estimate crosses triggerFraction (default 0.85) of the model's contextWindowTokens (or of the fallback window if the model does not report one), it compacts before the model call proceeds.
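The trigger check reduces to one comparison. The sketch below uses a hypothetical `TriggerPolicy` type and assumes that "crosses" means greater than or equal to the truncated threshold; the real comparison may differ. With the defaults and no reported window, the threshold is 0.85 × 32768 = 27852.8, truncated to 27852 tokens.

```swift
// Hypothetical helper mirroring the two relevant config values (illustration only).
struct TriggerPolicy {
    var triggerFraction: Double = 0.85
    var fallbackContextWindow: Int = 32768

    /// Token count at which automatic compaction fires.
    func threshold(contextWindowTokens: Int?) -> Int {
        // Use the fallback when the model does not report a window.
        let window = contextWindowTokens ?? fallbackContextWindow
        return Int(Double(window) * triggerFraction)
    }

    /// Assumption: compaction fires when the estimate reaches the threshold.
    func shouldCompact(estimatedTokens: Int, contextWindowTokens: Int?) -> Bool {
        estimatedTokens >= threshold(contextWindowTokens: contextWindowTokens)
    }
}

let policy = TriggerPolicy()
let fallbackThreshold = policy.threshold(contextWindowTokens: nil)
let quietTurn = policy.shouldCompact(estimatedTokens: 27_000, contextWindowTokens: nil)
let busyTurn = policy.shouldCompact(estimatedTokens: 28_000, contextWindowTokens: nil)
```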

Manually: compact(threadId:)

You can trigger compaction at any time by calling compact on the ReactAgent:

let outcome: CompactionOutcome? = await agent.compact(threadId: "session-42")
if let outcome {
    print("Compacted \(outcome.tokensBefore) -> \(outcome.tokensAfter) tokens")
}

compact returns nil if no checkpointer was given (there is no persistent history to compact) or if compaction was a no-op (the summary was not smaller than what it would replace). The manual path compacts regardless of the current size, so you can trigger it proactively before starting an expensive tool-heavy run.

CompactionOutcome carries:

| Field | Type | Meaning |
| --- | --- | --- |
| tokensBefore | Int | Estimated tokens before compaction |
| tokensAfter | Int | Estimated tokens after compaction |
| archivePath | String? | Where evicted originals were saved (if an archive was configured) |
| messages | [AgentMessage] | The rewritten history, so the host can refresh its transcript |

The kept tail

The tail retains the most recent turns subject to two independent ceilings; whichever keeps fewer messages applies:

  • keepRecentMessages (default 6) - an upper bound on the number of messages.
  • keepRecentFraction (default 0.25) - a token budget as a fraction of the window. This prevents a handful of large recent messages (for example, big tool results) from leaving the tail oversized.

After both ceilings are applied, the cut is snapped to a user-turn boundary so the tail never begins on an orphan tool result, and an assistant tool-call is never split from its paired tool-result message.
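One way to picture the selection is a backward walk followed by a forward snap. The sketch below uses hypothetical types and per-message token estimates, and it assumes the snap moves forward (keeping fewer messages) so that neither ceiling is exceeded; the middleware's real algorithm may differ in detail.

```swift
// Hypothetical stand-ins; `tokens` is a pre-computed estimate per message.
enum Role { case human, assistant, tool }

struct Message {
    var role: Role
    var tokens: Int
}

/// Returns the index where the kept tail starts, or nil if no valid cut exists.
func tailStart(messages: [Message], keepRecentMessages: Int, tokenBudget: Int) -> Int? {
    var start = messages.count
    var usedTokens = 0
    // Walk backwards until either ceiling would be exceeded.
    while start > 0,
          messages.count - start < keepRecentMessages,
          usedTokens + messages[start - 1].tokens <= tokenBudget {
        start -= 1
        usedTokens += messages[start].tokens
    }
    // Snap forward to a human turn so the tail never opens on an orphan
    // tool result and no tool call is separated from its result.
    while start < messages.count, messages[start].role != .human {
        start += 1
    }
    // start == 0: nothing to evict. start == count: no human turn to snap to.
    return (start > 0 && start < messages.count) ? start : nil
}

let conversation = [
    Message(role: .human, tokens: 100),
    Message(role: .assistant, tokens: 200),   // tool call
    Message(role: .tool, tokens: 5000),       // large tool result
    Message(role: .assistant, tokens: 300),
    Message(role: .human, tokens: 100),
    Message(role: .assistant, tokens: 200),   // tool call
    Message(role: .tool, tokens: 3000),
    Message(role: .assistant, tokens: 400),
]
// Defaults: 6 messages, and 0.25 x 32768 = 8192 tokens.
let cut = tailStart(messages: conversation, keepRecentMessages: 6, tokenBudget: 8192)

// A single user turn driving a tool chain offers no boundary to snap to.
let singleTurn = [
    Message(role: .human, tokens: 100),
    Message(role: .assistant, tokens: 200),
    Message(role: .tool, tokens: 500),
    Message(role: .assistant, tokens: 200),
    Message(role: .tool, tokens: 500),
    Message(role: .assistant, tokens: 200),
]
let noCut = tailStart(messages: singleTurn, keepRecentMessages: 3, tokenBudget: 8192)
```

In the first conversation the token ceiling stops the walk at index 3 (the 5000-token tool result would not fit), and the snap moves the cut to the human turn at index 4. The second conversation has no later human turn, so no cut is possible.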

Originals are preserved

Before the summary replaces the evicted block, the original messages are offloaded to a per-session archive file. The summary turn embeds a pointer to the archive directory so all parts are discoverable from the current summary after multiple rolling compactions.

The offload happens only after a usable, shrinking summary has been produced, so a failed or no-op compaction never leaves an orphan archive part behind.
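The ordering can be sketched as "summarize, check that it shrinks, then archive". Everything below is hypothetical: `compactBlock`, the closure parameters, the byte-length size check, and the archive path are illustrative and not the library's API.

```swift
/// Returns the summary with an archive pointer, or nil when compaction is a no-op.
func compactBlock(
    evicted: [String],
    summarize: ([String]) -> String?,
    archive: ([String]) -> String
) -> String? {
    // 1. Summarize first; a failed summary aborts before anything is written.
    guard let summary = summarize(evicted) else { return nil }
    // 2. Accept the summary only if it is smaller than what it replaces.
    let originalSize = evicted.reduce(0) { $0 + $1.utf8.count }
    guard summary.utf8.count < originalSize else { return nil }
    // 3. Only now offload the originals and embed a pointer to the archive.
    let archiveDirectory = archive(evicted)
    return summary + "\n(originals archived under \(archiveDirectory))"
}

// Counts archive writes so the ordering is observable.
final class ArchiveLog { var writes = 0 }
let log = ArchiveLog()
let archiveStub: ([String]) -> String = { _ in
    log.writes += 1
    return "/tmp/archive/session-42"   // made-up path for the example
}

let evicted = ["user: please refactor the parser", "assistant: done, see Parser.swift"]
let failed = compactBlock(evicted: evicted, summarize: { _ in nil }, archive: archiveStub)
let writesAfterFailure = log.writes
let compacted = compactBlock(evicted: evicted, summarize: { _ in "Parser refactored." }, archive: archiveStub)
```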

Nothing is lost

The live history held by the checkpointer contains [summary] + tail. The full pre-compaction transcript is preserved in the archive. Replaying the archive parts in order reconstructs the original conversation exactly.
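As a rough illustration of replaying parts in order, the sketch below assumes a made-up archive layout (one JSON file per compaction, named part-0001.json, part-0002.json, and so on, each holding an array of strings). The real archive format is not described here, so treat the file names and encoding as assumptions.

```swift
import Foundation

/// Writes one archive part (hypothetical layout: zero-padded, sortable file names).
func writePart(_ messages: [String], index: Int, in directory: URL) throws {
    let name = String(format: "part-%04d.json", index)
    let data = try JSONEncoder().encode(messages)
    try data.write(to: directory.appendingPathComponent(name))
}

/// Reads every part in file-name order and concatenates the messages.
func replayArchive(in directory: URL) throws -> [String] {
    let files = try FileManager.default
        .contentsOfDirectory(at: directory, includingPropertiesForKeys: nil)
        .filter { $0.pathExtension == "json" }
        .sorted { $0.lastPathComponent < $1.lastPathComponent }
    return try files.flatMap { url in
        try JSONDecoder().decode([String].self, from: Data(contentsOf: url))
    }
}

// Two rolling compactions, each offloading one block of evicted turns.
let directory = FileManager.default.temporaryDirectory
    .appendingPathComponent(UUID().uuidString)
try FileManager.default.createDirectory(at: directory, withIntermediateDirectories: true)
try writePart(["user: hi", "assistant: hello"], index: 1, in: directory)
try writePart(["user: next", "assistant: ok"], index: 2, in: directory)
let replayed = try replayArchive(in: directory)
```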

Configuration: SummarizationConfig

public struct SummarizationConfig: Sendable {
    public var triggerFraction: Double        // default 0.85
    public var fallbackContextWindow: Int     // default 32768
    public var keepRecentMessages: Int        // default 6
    public var keepRecentFraction: Double     // default 0.25
    public var reservedOutputTokens: Int      // default 4096
    public var summaryPrompt: String          // built-in default

    public static let `default` = SummarizationConfig()
}

| Knob | Default | Meaning |
| --- | --- | --- |
| triggerFraction | 0.85 | Fraction of the window at which automatic compaction fires |
| fallbackContextWindow | 32768 | Window assumed when the model does not report one |
| keepRecentMessages | 6 | Upper bound on kept recent messages |
| keepRecentFraction | 0.25 | Token ceiling on the tail, as a fraction of the window |
| reservedOutputTokens | 4096 | Tokens reserved for the summary's own generation |
| summaryPrompt | (built-in) | Instruction sent to the model when asking it to summarize |

The built-in summary prompt instructs the model to preserve: the user's goals and constraints, key decisions and their reasoning, files and artifacts created or modified (with paths), important facts learned from tool calls, and the current state plus what remains to do.

The summarization: factory parameter

SummarizationMiddleware is wired in by default when you call createDeepAgent. Control it via the summarization: parameter:

// The three calls below are alternatives; distinct names let them coexist in one scope.

// Default: summarization enabled with default config
let defaultAgent = createDeepAgent(model: model)

// Custom config: trigger earlier, keep more recent turns
let customAgent = createDeepAgent(
    model: model,
    summarization: SummarizationConfig(
        triggerFraction: 0.75,
        keepRecentMessages: 10
    )
)

// Disabled: pass nil to turn off summarization entirely
let unsummarizedAgent = createDeepAgent(model: model, summarization: nil)

createAgent does not wire SummarizationMiddleware automatically. To add it to a plain agent, include it in the middleware: array or call compact(threadId:) manually.

See the agent loop for where SummarizationMiddleware fits in the full pipeline.

Known limitations

  • Approximate token accounting. Counts are char-based estimates. The 15% trigger headroom is sized to absorb this, but a model with unusual verbosity or very long tool schemas may still overflow on a single round.
  • Message granularity. Compaction cuts at message boundaries and cannot split a single message, so one very large message resists compaction. The agent's built-in tool-result truncation mitigates the most common cause.
  • Single-turn runs. Compaction snaps to a user-turn boundary, so a conversation with only one user turn (for example, a single request that drives a long tool-calling chain) cannot compact until another user turn exists.
  • Images. The summarizer renders text only. Image context from earlier turns is not carried into the summary, and images are not counted toward the trigger. This affects long vision conversations.