Context Compression (BETA)
Context compression is currently in beta. It is disabled by default and requires explicit configuration to enable.
Automatically compresses conversation context when the token count exceeds a configurable threshold, reducing costs while preserving conversation quality.
Features
- Automatic compression — Triggered when the token count exceeds the configured threshold
- Smart summarization — Uses a cheap model (claude-3-haiku) to summarize older messages
- Recent message preservation — Keeps recent messages intact for context continuity
- Token estimation — Fast, approximate token counting before API calls
- Statistics tracking — Monitor compression effectiveness
- Transparent operation — Works seamlessly with all AI clients
How It Works
1. Token estimation — Count tokens in the conversation history
2. Threshold check — Compare against the configured threshold (default: 50,000)
3. Message selection — Identify older messages for compression
4. Summarization — Use the cheap model to create a concise summary
5. Context replacement — Replace the old messages with the summary
6. Request forwarding — Send the compressed context to the target model
Configuration
Enable Compression
```json
{
  "compression": {
    "enabled": true,
    "threshold_tokens": 50000,
    "target_tokens": 20000,
    "summarizer_model": "claude-3-haiku-20240307",
    "preserve_recent_messages": 5,
    "tokens_per_char": 0.25
  }
}
```
Options:
| Option | Default | Description |
|---|---|---|
| `enabled` | `false` | Enable context compression |
| `threshold_tokens` | `50000` | Trigger compression when context exceeds this |
| `target_tokens` | `20000` | Target token count after compression |
| `summarizer_model` | `claude-3-haiku-20240307` | Model used for summarization |
| `preserve_recent_messages` | `5` | Number of recent messages to keep intact |
| `tokens_per_char` | `0.25` | Estimation ratio for token counting |
Per-Profile Configuration
Enable compression for specific profiles:
```json
{
  "profiles": {
    "long-context": {
      "providers": ["anthropic"],
      "compression": {
        "enabled": true,
        "threshold_tokens": 100000,
        "target_tokens": 40000
      }
    },
    "short-context": {
      "providers": ["openai"],
      "compression": {
        "enabled": false
      }
    }
  }
}
```
Token Estimation
GoZen uses character-based estimation for fast token counting:
estimated_tokens = character_count * tokens_per_char
Default ratio: 0.25 tokens per character (1 token ≈ 4 characters)
Accuracy: ±10% for English text, may vary for other languages
For exact token counting, GoZen uses the tiktoken-go library when available.
Compression Strategy
Message Selection
- System messages — Always preserved
- Recent messages — Last N messages preserved (default: 5)
- Older messages — Candidates for compression
Summarization Prompt
```
Summarize the following conversation history concisely while preserving key information, decisions, and context:

[older messages]

Provide a brief summary that captures the essential points.
```
Result
Original: 45,000 tokens (30 messages)
After compression: 22,000 tokens (summary + 5 recent messages)
Savings: 23,000 tokens (51%)
Web UI
Access compression settings at http://localhost:19840/settings:
- Navigate to "Compression" tab (marked with BETA badge)
- Toggle "Enable Compression"
- Adjust threshold and target tokens
- Select summarizer model
- Set number of recent messages to preserve
- Click "Save"
Statistics Dashboard
View compression statistics:
- Total compressions — Number of times compression was triggered
- Tokens saved — Total tokens saved across all compressions
- Average savings — Average token reduction per compression
- Compression rate — Percentage of requests that triggered compression
API Endpoints
Get Compression Stats
```
GET /api/v1/compression/stats
```

Response:

```json
{
  "enabled": true,
  "total_compressions": 42,
  "tokens_saved": 1250000,
  "average_savings": 29761,
  "compression_rate": 0.15,
  "last_compression": "2026-03-05T10:30:00Z"
}
```
Update Compression Settings
```
PUT /api/v1/compression/settings
Content-Type: application/json

{
  "enabled": true,
  "threshold_tokens": 60000,
  "target_tokens": 25000
}
```
Reset Statistics
```
POST /api/v1/compression/stats/reset
```
Use Cases
Long Coding Sessions
Scenario: Multi-hour coding session with Claude Code
Configuration:
```json
{
  "compression": {
    "enabled": true,
    "threshold_tokens": 80000,
    "target_tokens": 30000,
    "preserve_recent_messages": 10
  }
}
```
Benefit: Maintain conversation continuity without hitting context limits
Batch Processing
Scenario: Processing multiple documents with AI
Configuration:
```json
{
  "compression": {
    "enabled": true,
    "threshold_tokens": 40000,
    "target_tokens": 15000,
    "preserve_recent_messages": 3
  }
}
```
Benefit: Reduce costs while processing large document sets
Research & Analysis
Scenario: Long research sessions with multiple topics
Configuration:
```json
{
  "compression": {
    "enabled": true,
    "threshold_tokens": 100000,
    "target_tokens": 40000,
    "preserve_recent_messages": 8
  }
}
```
Benefit: Keep conversation focused on recent topics while preserving earlier context
Best Practices
- Start with defaults — Default settings work well for most use cases
- Monitor statistics — Check compression rate and savings regularly
- Adjust threshold — Increase it for long-context models (e.g., Claude Opus); decrease it for short-context models
- Preserve enough messages — Keep 5-10 recent messages for context continuity
- Use cheap summarizer — Haiku is fast and cost-effective for summarization
- Test before production — Verify compression quality with your specific use case
Limitations
- Quality loss — Summarization may lose nuanced details
- Latency increase — Adds summarization API call overhead
- Cost trade-off — Summarization costs vs. token savings
- Language support — Works best with English, may vary for other languages
- Context window — Cannot exceed model's maximum context window
Troubleshooting
Compression not triggering
- Verify `compression.enabled` is `true`
- Check that the token count exceeds the threshold
- Ensure the conversation has enough messages to compress
- Review daemon logs for compression errors
Poor summarization quality
- Try different summarizer model (e.g., claude-3-sonnet)
- Increase `preserve_recent_messages` to keep more context
- Adjust `target_tokens` to allow longer summaries
- Check that the summarizer model is available and working
Increased latency
- Compression adds one extra API call (summarization)
- Use faster summarizer model (haiku is fastest)
- Increase threshold to compress less frequently
- Consider disabling for latency-sensitive applications
Unexpected costs
- Monitor summarization costs in usage dashboard
- Compare savings vs. summarization costs
- Adjust threshold to compress less frequently
- Use cheapest available model for summarization
Performance Impact
- Token estimation — ~1ms per request (negligible)
- Summarization — 1-3 seconds (depends on model and message count)
- Memory overhead — Minimal (~1KB per compression)
- Cost savings — Typically 30-50% token reduction
Advanced Configuration
Custom Summarization Prompt
```json
{
  "compression": {
    "enabled": true,
    "custom_prompt": "Create a technical summary of the following conversation, focusing on code changes, decisions, and action items:\n\n{messages}\n\nSummary:"
  }
}
```
Conditional Compression
Enable compression only for specific scenarios:
```json
{
  "profiles": {
    "default": {
      "scenarios": {
        "longContext": {
          "providers": ["anthropic"],
          "compression": {
            "enabled": true,
            "threshold_tokens": 100000
          }
        },
        "default": {
          "providers": ["anthropic"],
          "compression": {
            "enabled": false
          }
        }
      }
    }
  }
}
```
Multi-Stage Compression
Compress multiple times for very long conversations:
```json
{
  "compression": {
    "enabled": true,
    "stages": [
      {
        "threshold_tokens": 50000,
        "target_tokens": 30000
      },
      {
        "threshold_tokens": 80000,
        "target_tokens": 40000
      }
    ]
  }
}
```
Future Enhancements
- Semantic similarity matching for intelligent message selection
- Multi-model summarization for quality comparison
- Compression quality metrics and feedback
- Custom compression strategies per use case
- Integration with RAG for external context storage