⚙️ System Architecture

Context Summarization Pipeline

Intelligent context compression that preserves conversation fidelity while preventing LLM context-window overflow, supporting unlimited follow-up questions.

Python · OpenAI · Anthropic · Google · Self-Hosted
Core Pipeline Flow
💬
Chat Interface
Incoming message — human / ai / system / tool_call / tool_result
Entry Point
🧠
Context Manager
Appends message to buffer · orchestrates the entire pipeline
Orchestrator
🔢
Token Counter
Counts all tokens across message types · queries Model Registry for limit
Evaluator
Threshold Evaluator
Is token count ≥ 80% of this model's context limit?  ·  Runs on: new message · after tool call · before LLM call
Decision
✓ Below threshold
🤖
Primary LLM Call
Full buffer passed as-is
AI Response
Appended to buffer · tool results loop back to Token Counter
⚠ At / above threshold
🔥 Compaction Pipeline
✂️
Segmenter
Divide into 3 zones
🗜️
Tool Formatter
JSON → readable
Compact LLM
Cheap, fast model
🔧
Assembler
Rebuild buffer
🤖
Primary LLM Call
Compacted buffer passed
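The core flow above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: `count_tokens` uses a crude characters-per-token heuristic (real systems use tiktoken or the provider SDK, per the Model Registry), and `compact` is a placeholder for the full Compaction Pipeline.

```python
# Minimal sketch of the core pipeline flow. HARD_THRESHOLD follows the
# Threshold Evaluator node (>= 80% of the model's context limit).

HARD_THRESHOLD = 0.80

def count_tokens(buffer):
    # Stand-in counter: roughly 4 characters per token. Real systems
    # query a proper tokenizer per the Model Registry.
    return sum(len(str(m["content"])) // 4 for m in buffer)

def compact(buffer):
    # Placeholder for the Compaction Pipeline: segment into zones,
    # format tool JSON, summarize Zone C with a cheap model, reassemble.
    return buffer[-5:]  # this sketch keeps Zone A only

def prepare_context(buffer, context_limit):
    """Run the threshold check before a primary LLM call."""
    if count_tokens(buffer) >= HARD_THRESHOLD * context_limit:
        buffer = compact(buffer)  # at/above threshold: compact first
    return buffer  # below threshold: full buffer passed as-is
```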


Message Buffer — 3-Zone Segmentation

When compaction triggers, the buffer is segmented into three zones based on recency. Each zone receives different treatment to balance token savings vs. fidelity.

🔴 Zone C — Full Compaction · Everything before the last user query
→ Compaction LLM → 1 summary chunk
🟣 Zone B — High Fidelity · From last user query → now
Messages kept · tool calls compressed
🟢 Zone A — Preserved · Latest 5 messages
Kept exactly as-is
SUMMARY
🔧
🔧
← Oldest messages (fully compacted into 1 chunk)
Current session (tool calls compressed)
Latest 5 unchanged →
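The segmentation can be sketched as follows. This assumes a buffer of message dicts with a `"role"` key (`human`, `ai`, `tool_call`, ...) and takes the Zone A size of 5 from the diagram; the Zone B boundary is the last human message before Zone A.

```python
# Sketch of the 3-zone segmentation by recency.
ZONE_A_SIZE = 5  # per the diagram: latest 5 messages preserved as-is

def segment(buffer):
    """Split buffer into (zone_c, zone_b, zone_a)."""
    zone_a = buffer[-ZONE_A_SIZE:]  # kept exactly as-is
    rest = buffer[:-ZONE_A_SIZE] if len(buffer) > ZONE_A_SIZE else []
    # Zone B starts at the last human/user message before Zone A.
    last_user = max(
        (i for i, m in enumerate(rest) if m["role"] == "human"),
        default=0,
    )
    zone_c = rest[:last_user]   # fully compacted into 1 summary chunk
    zone_b = rest[last_user:]   # messages kept; tool calls compressed
    return zone_c, zone_b, zone_a
```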
Trigger Points — When Threshold Checks Run
🔨
After Every Tool Call
Token count re-evaluated after each tool result is appended. Tool outputs can be verbose JSON that eats tokens fast.
📤
Before Every LLM Call
Final guard before sending context to the primary LLM. Prevents context overrun mid-response.
💬
On New User Message
Each new message is counted immediately after it is appended, catching threshold breaches early.
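Wiring the three trigger points into the Context Manager might look like this. The class and method names here are illustrative (the spec names the component but not its API), and the token count reuses a crude characters-per-token stand-in.

```python
# Sketch of the three threshold trigger points on a Context Manager.
class ContextManager:
    def __init__(self, limit, threshold=0.80):
        self.buffer, self.limit, self.threshold = [], limit, threshold
        self.checks = 0  # how many times the evaluator has run

    def _check_threshold(self):
        self.checks += 1
        tokens = sum(len(m["content"]) // 4 for m in self.buffer)
        return tokens >= self.threshold * self.limit

    def on_user_message(self, msg):   # trigger 1: new user message
        self.buffer.append(msg)
        return self._check_threshold()

    def on_tool_result(self, msg):    # trigger 2: after every tool call
        self.buffer.append(msg)
        return self._check_threshold()

    def before_llm_call(self):        # trigger 3: final guard
        return self._check_threshold()
```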
💡 Recommended Improvements

Beyond the core spec — these significantly increase robustness, cost-efficiency, and quality.

🏷️ Message Importance Scoring
Tag messages at write-time as pinned. System prompts, key decisions, and errors survive even Zone C full compaction. Prevents silent loss of critical context.
High Impact
📈 Async Pre-emptive Compaction
At 70% threshold, trigger compaction in a background thread. When 80% is hit, compaction is already done — zero latency spike mid-conversation.
Latency Win
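One way to sketch the pre-emptive path, assuming a `compact_fn` callable and a background thread (the spec doesn't prescribe a concurrency model; a task queue or asyncio would work equally well):

```python
# Sketch of async pre-emptive compaction: start the work at the 70%
# soft threshold so the 80% hard threshold finds it already finished.
import threading

class PreemptiveCompactor:
    def __init__(self, compact_fn):
        self.compact_fn = compact_fn
        self._result = None
        self._thread = None

    def maybe_start(self, buffer, ratio, soft=0.70):
        """Kick off compaction in the background once past the soft band."""
        if ratio >= soft and self._thread is None:
            self._thread = threading.Thread(
                target=self._run, args=(list(buffer),), daemon=True
            )
            self._thread.start()

    def _run(self, snapshot):
        self._result = self.compact_fn(snapshot)

    def take(self):
        """At the hard threshold: wait (usually already done) and consume."""
        if self._thread is None:
            return None
        self._thread.join()
        result, self._result, self._thread = self._result, None, None
        return result
```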
🔄 Rolling Summary State
Maintain a persistent rolling summary updated incrementally. Each compaction only processes the delta since the last summary — dramatically cheaper than re-compacting the whole buffer every time.
Cost Reduction
💾 Summary Versioning
Store each compaction's input, output, token delta, model used, and timestamp. Enables debugging, quality auditing, rollback, and conversation replay.
Observability
🛡️ Compaction Circuit Breaker
If the compaction LLM fails or exceeds latency SLA, fall back to hard truncation of Zone C (oldest messages). Prevents full system failure when the compaction path is unhealthy.
Resilience
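The breaker can be sketched with a timeout around the compaction call; `compact_fn`, `sla_seconds`, and `keep_last` are illustrative names, not part of the spec:

```python
# Sketch of the compaction circuit breaker: if the compaction LLM call
# raises or blows the latency SLA, fall back to hard truncation.
import concurrent.futures

def compact_with_breaker(buffer, compact_fn, sla_seconds=5.0, keep_last=5):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(compact_fn, buffer)
    try:
        return future.result(timeout=sla_seconds)
    except Exception:
        # Breaker tripped (error or timeout): drop the oldest messages
        # (Zone C) outright instead of failing the whole request.
        return buffer[-keep_last:]
    finally:
        pool.shutdown(wait=False)  # don't block on a stuck compaction
```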
📊 Compression Ratio Metrics
Track tokens-in vs tokens-out per compaction event, per model, per zone. Helps tune thresholds, evaluate compaction quality, and surface regressions over time.
Monitoring
Model Registry

Source of truth for context limits, thresholds, and compaction model assignments. Checked on every trigger.

🟢
OpenAI
gpt-4o · 128k · primary
gpt-4o-mini · 128k · compaction
o3-mini · 200k · primary
🟣
Anthropic
claude-sonnet-4-6 · 200k · primary
claude-opus-4-6 · 200k · primary
claude-haiku-4-5 · 200k · compaction
🟡
Google
gemini-2.0-flash · 1M · primary
gemini-2.0-pro · 2M · primary
gemini-flash-lite · 1M · compaction
🔵
Self-Hosted / Open Source
llama-3.3-70b · configurable · primary
qwen2.5-7b · configurable · compaction
Registry Config Schema
MODEL_REGISTRY = {
    "openai/gpt-4o": {
        "context_limit": 128_000,
        "soft_threshold": 0.70,  # async
        "hard_threshold": 0.80,  # block
        "compaction_model": "openai/gpt-4o-mini",
        "token_counter": "tiktoken",
        "zone_a_size": 5,
    },
    "anthropic/claude-sonnet-4-6": {
        "context_limit": 200_000,
        "soft_threshold": 0.70,
        "hard_threshold": 0.80,
        "compaction_model": "anthropic/claude-haiku-4-5",
        "token_counter": "anthropic-sdk",
        "zone_a_size": 5,
    },
    "self-hosted/*": {
        "context_limit": "env:MODEL_CTX",
        "hard_threshold": 0.80,
        "compaction_model": "env:COMPACT_MODEL",
        "token_counter": "transformers",
    },
}
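A lookup against this schema needs to handle the `"self-hosted/*"` wildcard and the `env:` references. The helper below is a hedged sketch of that resolution logic; `lookup` and its fallback behavior are assumptions, not part of the spec.

```python
# Sketch of a registry lookup with provider-wildcard fallback and
# env-var resolution for "env:..." values in the config schema.
import os

def lookup(registry, model_id):
    cfg = registry.get(model_id)
    if cfg is None:
        # Fall back to a provider wildcard entry, e.g. "self-hosted/*".
        provider = model_id.split("/")[0]
        cfg = registry.get(f"{provider}/*", {})
    resolved = {}
    for key, value in cfg.items():
        if isinstance(value, str) and value.startswith("env:"):
            # Resolve env references; keep the raw value if unset.
            value = os.environ.get(value[4:], value)
        resolved[key] = value
    return resolved
```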
Threshold Bands
0–70%
Normal — no action
70–80%
Soft — async pre-compact
80–95%
Hard — blocking compaction
95%+
Circuit breaker — truncate
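The bands above map directly to a dispatch function. A minimal sketch, with band edges taken from the table (70 / 80 / 95 percent) and illustrative action names:

```python
# Sketch mapping a token count to the threshold-band action.
def band(tokens, context_limit):
    ratio = tokens / context_limit
    if ratio >= 0.95:
        return "truncate"      # circuit breaker: hard truncation
    if ratio >= 0.80:
        return "hard-compact"  # blocking compaction
    if ratio >= 0.70:
        return "soft-compact"  # async pre-emptive compaction
    return "normal"            # no action
```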