⚙️ System Architecture

Context Summarization Pipeline

Intelligent context compression that preserves conversation fidelity while preventing LLM context-window overflow, supporting unlimited follow-up questions.

Python · OpenAI · Anthropic · Google · Self-Hosted
Core Pipeline Flow
💬
Chat Interface
Incoming message — human / ai / system / tool_call / tool_result
Entry Point
🧠
Context Manager
Appends message to buffer · orchestrates the entire pipeline
Orchestrator
🔢
Token Counter
Counts all tokens across message types · queries Model Registry for limit
Evaluator
Threshold Evaluator
Is token count ≥ 80% of this model's context limit?  ·  Runs on: new message · after tool call · before LLM call
Decision
✓ Below threshold
🤖
Primary LLM Call
Full buffer passed as-is
AI Response
Appended to buffer · tool results loop back to Token Counter
⚠ At / above threshold
🔥 Compaction Pipeline
✂️
Segmenter
Divide into 3 zones
🗜️
Tool Formatter
JSON → readable
Compact LLM
Cheap, fast model
🔧
Assembler
Rebuild buffer
🤖
Primary LLM Call
Compacted buffer passed
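The core flow above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: `count_tokens` uses a crude characters-per-token heuristic (real systems use tiktoken or the provider SDK, per the Model Registry), and `compact` is a placeholder for the full Compaction Pipeline.

```python
# Minimal sketch of the core pipeline flow. HARD_THRESHOLD follows the
# Threshold Evaluator node (>= 80% of the model's context limit).

HARD_THRESHOLD = 0.80

def count_tokens(buffer):
    # Stand-in counter: roughly 4 characters per token. Real systems
    # query a proper tokenizer per the Model Registry.
    return sum(len(str(m["content"])) // 4 for m in buffer)

def compact(buffer):
    # Placeholder for the Compaction Pipeline: segment into zones,
    # format tool JSON, summarize Zone C with a cheap model, reassemble.
    return buffer[-5:]  # this sketch keeps Zone A only

def prepare_context(buffer, context_limit):
    """Run the threshold check before a primary LLM call."""
    if count_tokens(buffer) >= HARD_THRESHOLD * context_limit:
        buffer = compact(buffer)  # at/above threshold: compact first
    return buffer  # below threshold: full buffer passed as-is
```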


Message Buffer — 3-Zone Segmentation

When compaction triggers, the buffer is segmented into three zones based on recency. Each zone receives different treatment to balance token savings vs. fidelity.

🔴 Zone C — Full Compaction · Everything before the last user query
→ Compaction LLM → 1 summary chunk
🟣 Zone B — High Fidelity · From last user query → now
Messages kept · tool calls compressed
🟢 Zone A — Preserved · Latest 5 messages
Kept exactly as-is
SUMMARY
🔧
🔧
← Oldest messages (fully compacted into 1 chunk)
Current session (tool calls compressed)
Latest 5 unchanged →
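The segmentation can be sketched as follows. This assumes a buffer of message dicts with a `"role"` key (`human`, `ai`, `tool_call`, ...) and takes the Zone A size of 5 from the diagram; the Zone B boundary is the last human message before Zone A.

```python
# Sketch of the 3-zone segmentation by recency.
ZONE_A_SIZE = 5  # per the diagram: latest 5 messages preserved as-is

def segment(buffer):
    """Split buffer into (zone_c, zone_b, zone_a)."""
    zone_a = buffer[-ZONE_A_SIZE:]  # kept exactly as-is
    rest = buffer[:-ZONE_A_SIZE] if len(buffer) > ZONE_A_SIZE else []
    # Zone B starts at the last human/user message before Zone A.
    last_user = max(
        (i for i, m in enumerate(rest) if m["role"] == "human"),
        default=0,
    )
    zone_c = rest[:last_user]   # fully compacted into 1 summary chunk
    zone_b = rest[last_user:]   # messages kept; tool calls compressed
    return zone_c, zone_b, zone_a
```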
Trigger Points — When Threshold Checks Run
🔨
After Every Tool Call
Token count re-evaluated after each tool result is appended. Tool outputs can be verbose JSON that eats tokens fast.
📤
Before Every LLM Call
Final guard before sending context to the primary LLM. Prevents context overrun mid-response.
💬
On New User Message
Each new message is counted immediately after it is appended, catching threshold breaches early.
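Wiring the three trigger points into the Context Manager might look like this. The class and method names here are illustrative (the spec names the component but not its API), and the token count reuses a crude characters-per-token stand-in.

```python
# Sketch of the three threshold trigger points on a Context Manager.
class ContextManager:
    def __init__(self, limit, threshold=0.80):
        self.buffer, self.limit, self.threshold = [], limit, threshold
        self.checks = 0  # how many times the evaluator has run

    def _check_threshold(self):
        self.checks += 1
        tokens = sum(len(m["content"]) // 4 for m in self.buffer)
        return tokens >= self.threshold * self.limit

    def on_user_message(self, msg):   # trigger 1: new user message
        self.buffer.append(msg)
        return self._check_threshold()

    def on_tool_result(self, msg):    # trigger 2: after every tool call
        self.buffer.append(msg)
        return self._check_threshold()

    def before_llm_call(self):        # trigger 3: final guard
        return self._check_threshold()
```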
💡 Recommended Improvements

Beyond the core spec — these significantly increase robustness, cost-efficiency, and quality.

🏷️ Message Importance Scoring
Tag messages at write-time as pinned. System prompts, key decisions, and errors survive even Zone C full compaction. Prevents silent loss of critical context.
High Impact
📈 Async Pre-emptive Compaction
At 70% threshold, trigger compaction in a background thread. When 80% is hit, compaction is already done — zero latency spike mid-conversation.
Latency Win
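One way to sketch the pre-emptive path, assuming a `compact_fn` callable and a background thread (the spec doesn't prescribe a concurrency model; a task queue or asyncio would work equally well):

```python
# Sketch of async pre-emptive compaction: start the work at the 70%
# soft threshold so the 80% hard threshold finds it already finished.
import threading

class PreemptiveCompactor:
    def __init__(self, compact_fn):
        self.compact_fn = compact_fn
        self._result = None
        self._thread = None

    def maybe_start(self, buffer, ratio, soft=0.70):
        """Kick off compaction in the background once past the soft band."""
        if ratio >= soft and self._thread is None:
            self._thread = threading.Thread(
                target=self._run, args=(list(buffer),), daemon=True
            )
            self._thread.start()

    def _run(self, snapshot):
        self._result = self.compact_fn(snapshot)

    def take(self):
        """At the hard threshold: wait (usually already done) and consume."""
        if self._thread is None:
            return None
        self._thread.join()
        result, self._result, self._thread = self._result, None, None
        return result
```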
🔄 Rolling Summary State
Maintain a persistent rolling summary updated incrementally. Each compaction only processes the delta since the last summary — dramatically cheaper than re-compacting the whole buffer every time.
Cost Reduction
💾 Summary Versioning
Store each compaction's input, output, token delta, model used, and timestamp. Enables debugging, quality auditing, rollback, and conversation replay.
Observability
🛡️ Compaction Circuit Breaker
If the compaction LLM fails or exceeds latency SLA, fall back to hard truncation of Zone C (oldest messages). Prevents full system failure when the compaction path is unhealthy.
Resilience
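The breaker can be sketched with a timeout around the compaction call; `compact_fn`, `sla_seconds`, and `keep_last` are illustrative names, not part of the spec:

```python
# Sketch of the compaction circuit breaker: if the compaction LLM call
# raises or blows the latency SLA, fall back to hard truncation.
import concurrent.futures

def compact_with_breaker(buffer, compact_fn, sla_seconds=5.0, keep_last=5):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(compact_fn, buffer)
    try:
        return future.result(timeout=sla_seconds)
    except Exception:
        # Breaker tripped (error or timeout): drop the oldest messages
        # (Zone C) outright instead of failing the whole request.
        return buffer[-keep_last:]
    finally:
        pool.shutdown(wait=False)  # don't block on a stuck compaction
```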
📊 Compression Ratio Metrics
Track tokens-in vs tokens-out per compaction event, per model, per zone. Helps tune thresholds, evaluate compaction quality, and surface regressions over time.
Monitoring
Model Registry

Source of truth for context limits, thresholds, and compaction model assignments. Checked on every trigger.

🟢
OpenAI
gpt-4o · 128k · primary
gpt-4o-mini · 128k · compaction
o3-mini · 200k · primary
🟣
Anthropic
claude-sonnet-4-6 · 200k · primary
claude-opus-4-6 · 200k · primary
claude-haiku-4-5 · 200k · compaction
🟡
Google
gemini-2.0-flash · 1M · primary
gemini-2.0-pro · 2M · primary
gemini-flash-lite · 1M · compaction
🔵
Self-Hosted / Open Source
llama-3.3-70b · configurable · primary
qwen2.5-7b · configurable · compaction
Registry Config Schema
MODEL_REGISTRY = {
    "openai/gpt-4o": {
        "context_limit": 128_000,
        "soft_threshold": 0.70,  # async
        "hard_threshold": 0.80,  # block
        "compaction_model": "openai/gpt-4o-mini",
        "token_counter": "tiktoken",
        "zone_a_size": 5,
    },
    "anthropic/claude-sonnet-4-6": {
        "context_limit": 200_000,
        "soft_threshold": 0.70,
        "hard_threshold": 0.80,
        "compaction_model": "anthropic/claude-haiku-4-5",
        "token_counter": "anthropic-sdk",
        "zone_a_size": 5,
    },
    "self-hosted/*": {
        "context_limit": "env:MODEL_CTX",
        "hard_threshold": 0.80,
        "compaction_model": "env:COMPACT_MODEL",
        "token_counter": "transformers",
    },
}
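A lookup against this schema needs to handle the `"self-hosted/*"` wildcard and the `env:` references. The helper below is a hedged sketch of that resolution logic; `lookup` and its fallback behavior are assumptions, not part of the spec.

```python
# Sketch of a registry lookup with provider-wildcard fallback and
# env-var resolution for "env:..." values in the config schema.
import os

def lookup(registry, model_id):
    cfg = registry.get(model_id)
    if cfg is None:
        # Fall back to a provider wildcard entry, e.g. "self-hosted/*".
        provider = model_id.split("/")[0]
        cfg = registry.get(f"{provider}/*", {})
    resolved = {}
    for key, value in cfg.items():
        if isinstance(value, str) and value.startswith("env:"):
            # Resolve env references; keep the raw value if unset.
            value = os.environ.get(value[4:], value)
        resolved[key] = value
    return resolved
```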
Threshold Bands
0–70%
Normal — no action
70–80%
Soft — async pre-compact
80–95%
Hard — blocking compaction
95%+
Circuit breaker — truncate
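The bands above map directly to a dispatch function. A minimal sketch, with band edges taken from the table (70 / 80 / 95 percent) and illustrative action names:

```python
# Sketch mapping a token count to the threshold-band action.
def band(tokens, context_limit):
    ratio = tokens / context_limit
    if ratio >= 0.95:
        return "truncate"      # circuit breaker: hard truncation
    if ratio >= 0.80:
        return "hard-compact"  # blocking compaction
    if ratio >= 0.70:
        return "soft-compact"  # async pre-emptive compaction
    return "normal"            # no action
```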