Token Compression Strategies for Marianne¶
Status: Research / Not Implemented — This document describes strategies under consideration. None have been implemented.
Research Date: 2026-01-05
Priority: High (cost/latency impact)
Executive Summary¶
This document captures research on algorithmic token compression strategies applicable to Marianne AI Compose. The goal is reducing token usage (and thus cost/latency) while maintaining output quality.
Key Finding: Marianne's batch-oriented, declarative YAML model aligns naturally with prompt caching and prefix sharing strategies. The highest-impact, lowest-complexity strategy is Anthropic's native prompt caching, which can deliver 90% cost reduction with minimal code changes.
Who This Is For¶
Explicit Audience¶
- Developers running batch AI operations (code review, data processing, content generation)
- Teams needing cost optimization for high-volume Claude usage
- Marianne contributors implementing new optimization features
Implicit Audience (P5 Recognition)¶
- Marianne itself - as a self-evolving system, Marianne may use this research to inform its own evolution decisions. The opus evolution cycles (v1→v4) demonstrate Marianne's capacity to improve its own score based on discovered patterns.
Current State Analysis¶
Existing Compression Patterns in Marianne¶
Marianne already implements a basic form of compression:
# src/marianne/prompts/templating.py:222-224
if len(original_context) > 3000:
    truncation_msg = "\n\n[... original prompt truncated for brevity ...]"
    original_context = original_context[:3000] + truncation_msg
Problem: Naive truncation loses potentially important context. Smart compression would preserve semantics while reducing tokens.
Token Flow in Marianne¶
YAML Config → PromptBuilder → [COMPRESSION POINT] → Backend → Claude API/CLI
↓
Template + Stakes + Thinking Method + Variables
↓
MZT_OPERATOR_IMPERATIVE injection (claude_cli.py:51-104)
↓
Final prompt (~1000-5000 tokens typical)
Key Insight: The MZT_OPERATOR_IMPERATIVE (53 lines, ~800 tokens) is injected into EVERY prompt. This is a prime candidate for caching.
Strategy Ranking¶
| Rank | Strategy | Cost Savings | Latency Savings | Complexity | Marianne Fit |
|---|---|---|---|---|---|
| 1 | Anthropic Prompt Caching | 90% | 85% | Low | Excellent |
| 2 | Sheet Prefix Sharing | 30-50% | 20-40% | Medium | Excellent |
| 3 | Completion Prompt Summarization | 40-60% | 10-20% | Medium | Good |
| 4 | Semantic Deduplication | 30-40% | 5-10% | High | Moderate |
| 5 | LLMLingua/Extractive Compression | Up to 95% | Variable | High | Low |
Strategy 1: Anthropic Prompt Caching (Recommended First)¶
What It Is¶
Anthropic's native prompt caching allows marking portions of a prompt as cacheable. Cached content is reused across API calls for as long as the cache stays warm: the default ephemeral cache has a 5-minute TTL (refreshed on each cache hit), with an optional 1-hour TTL for longer-running jobs.
Why It Fits Marianne¶
Marianne's batch model has high cache potential:
| Prompt Component | Per-Sheet Variation | Cache Candidate |
|---|---|---|
| MZT_OPERATOR_IMPERATIVE | None | YES (100%) |
| Template body | None | YES (100%) |
| Stakes | None | YES (100%) |
| Thinking method | None | YES (100%) |
| Variables (custom) | Rarely | YES (95%) |
| sheet_num, start_item, end_item | Always | NO |
Estimated cacheable portion: 85-95% of tokens
Pricing Impact¶
| Operation | Cost (vs base input) |
|---|---|
| Write to cache | 125% (one-time) |
| Read from cache | 10% |
| Uncached content | 100% |
10-sheet job example (1,000 cacheable tokens plus 50 variable tokens per sheet):

- Without caching: 10 × 1,050 tokens = 10,500 tokens billed
- With caching: 1 × 1,250 (cache write) + 9 × 100 (cache reads) + 10 × 50 (variable portion) = 2,650 token-equivalents
- Savings: ≈75%
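As a sanity check, the same arithmetic as a small helper. The 1.25 write and 0.10 read multipliers come from the pricing table above; the token counts are the illustrative values from the example, not measurements:

def estimate_caching_savings(
    sheets: int,
    static_tokens: int,
    variable_tokens: int,
    write_mult: float = 1.25,  # cache write: 125% of base input price
    read_mult: float = 0.10,   # cache read: 10% of base input price
) -> float:
    """Fractional cost savings for a multi-sheet job using prompt caching."""
    uncached = sheets * (static_tokens + variable_tokens)
    cached = (
        static_tokens * write_mult                  # first sheet writes the cache
        + static_tokens * read_mult * (sheets - 1)  # remaining sheets read it
        + variable_tokens * sheets                  # variable portion is always billed in full
    )
    return 1 - cached / uncached

# 10 sheets, 1,000 cacheable tokens and 50 variable tokens per sheet
print(f"{estimate_caching_savings(10, 1000, 50):.0%}")  # ~75%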
Implementation Plan¶
# src/marianne/backends/anthropic_api.py (enhancement)
class AnthropicApiBackend(Backend):
    def __init__(self, ..., enable_prompt_caching: bool = True):
        self.enable_prompt_caching = enable_prompt_caching

    def _build_messages_with_caching(self, prompt: str) -> list[dict]:
        """Build messages with cache breakpoints."""
        if not self.enable_prompt_caching:
            return [{"role": "user", "content": prompt}]

        # Split prompt into cacheable and variable portions
        static_portion, variable_portion = self._split_prompt(prompt)

        return [{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": static_portion,
                    "cache_control": {"type": "ephemeral"},  # 5-min TTL
                },
                {
                    "type": "text",
                    "text": variable_portion,
                },
            ],
        }]
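The _split_prompt helper above is left undefined. A minimal sketch, assuming the PromptBuilder emits a known delimiter between the static body and the per-sheet variables; the delimiter string is hypothetical, not an existing Marianne constant:

# Hypothetical marker the PromptBuilder would place before per-sheet content
SHEET_DELIMITER = "\n\n--- SHEET CONTEXT ---\n"

def _split_prompt(self, prompt: str) -> tuple[str, str]:
    """Split a prompt into a cacheable static portion and a per-sheet variable tail."""
    if SHEET_DELIMITER not in prompt:
        # No recognizable variable tail: treat the whole prompt as cacheable.
        # Callers should then omit the (empty) second content block, since
        # empty text blocks may be rejected by the API.
        return prompt, ""
    static, variable = prompt.rsplit(SHEET_DELIMITER, 1)
    return static, SHEET_DELIMITER + variable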
Config Addition¶
# New config option
backend:
  type: anthropic_api
  prompt_caching: true
  cache_ttl: "5m"  # or "1h" for longer jobs
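On the code side, these keys might map onto the backend config roughly as follows. This assumes a dataclass-style BackendConfig; the actual structure in Marianne may differ:

from dataclasses import dataclass

@dataclass
class BackendConfig:
    type: str = "anthropic_api"
    prompt_caching: bool = True
    cache_ttl: str = "5m"  # "5m" (default ephemeral TTL) or "1h" for longer jobs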
Strategy 2: Sheet Prefix Sharing¶
What It Is¶
When processing multiple sheets, identify the common prompt prefix and optimize how it's sent to the model. This is related to but distinct from API-level caching—it's about structuring prompts for optimal KV cache reuse at inference time.
Why It Fits Marianne¶
All sheets in a Marianne job share:

- the same system instructions
- the same template structure
- the same stakes/thinking method

Only the sheet numbers (and item ranges) vary between prompts.
Implementation Approach¶
# src/marianne/execution/runner.py (enhancement)
class JobRunner:
    def _extract_common_prefix(self, prompts: list[str]) -> tuple[str, list[str]]:
        """Extract common prefix from all sheet prompts.

        Returns:
            (common_prefix, list_of_suffixes)
        """
        if not prompts:
            return "", prompts

        # Find longest common prefix
        prefix = prompts[0]
        for prompt in prompts[1:]:
            while not prompt.startswith(prefix):
                prefix = prefix[:-1]
                if not prefix:
                    return "", prompts

        suffixes = [p[len(prefix):] for p in prompts]
        return prefix, suffixes
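A brief usage sketch of the helper; the prompts are illustrative and runner is assumed to be a JobRunner instance:

prompts = [
    "SYSTEM...\nTEMPLATE...\nProcess sheet 1 (items 1-50)",
    "SYSTEM...\nTEMPLATE...\nProcess sheet 2 (items 51-100)",
    "SYSTEM...\nTEMPLATE...\nProcess sheet 3 (items 101-150)",
]
common_prefix, suffixes = runner._extract_common_prefix(prompts)
# common_prefix == "SYSTEM...\nTEMPLATE...\nProcess sheet "
# suffixes == ["1 (items 1-50)", "2 (items 51-100)", "3 (items 101-150)"]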
Batch API Optimization¶
When using the Anthropic API, batch multiple sheet requests:
# Batch request structure
requests = [
    {
        "custom_id": f"sheet-{i}",
        "params": {
            "messages": [{"role": "user", "content": suffix}],
            "system": common_prefix,  # Shared across batch
        },
    }
    for i, suffix in enumerate(suffixes)
]
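A minimal sketch of submitting such requests through Anthropic's Message Batches API via the official anthropic Python SDK. The model name and max_tokens are placeholders, and each entry's params must include them:

import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"sheet-{i}",
            "params": {
                "model": "claude-sonnet-4-20250514",  # placeholder model ID
                "max_tokens": 4096,
                "system": common_prefix,              # shared across the batch
                "messages": [{"role": "user", "content": suffix}],
            },
        }
        for i, suffix in enumerate(suffixes)
    ]
)
print(batch.id, batch.processing_status)  # poll until processing ends, then fetch results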
References¶
- BatchLLM Paper - 1.3x-10.8x throughput improvement
- SGLang RadixAttention - KV cache reuse patterns
Strategy 3: Completion Prompt Summarization¶
What It Is¶
Replace naive truncation with intelligent summarization when building completion prompts for partial failures.
Current Problem¶
# Current naive approach (templating.py:222)
if len(original_context) > 3000:
    original_context = original_context[:3000] + "[truncated]"
This loses potentially crucial context from the end of the prompt.
Proposed Solution¶
# Enhanced approach
async def _summarize_context(self, original: str, max_tokens: int = 500) -> str:
    """Use a small model to summarize long context."""
    if len(original) < 3000:
        return original

    # Use Haiku for cost-efficient summarization
    summary = await self.summary_backend.execute(
        f"Summarize this task context in {max_tokens} tokens, "
        f"preserving key requirements and constraints:\n\n{original}"
    )
    return summary.stdout
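At the call site in build_completion_prompt(), the naive truncation could then be replaced along these lines. This is a sketch: it assumes the surrounding function can be made async, and it falls back to the current truncation behavior if summarization fails:

if len(original_context) > 3000:
    try:
        original_context = await self._summarize_context(original_context)
    except Exception:
        # Fall back to the existing truncation behavior
        truncation_msg = "\n\n[... original prompt truncated for brevity ...]"
        original_context = original_context[:3000] + truncation_msg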
Adaptive Focus Memory (AFM) Pattern¶
Research shows a three-tier fidelity approach works well:
| Fidelity | When to Use | Compression |
|---|---|---|
| FULL | Recent/critical content | 0% |
| COMPRESSED | Important but not critical | 50-70% |
| PLACEHOLDER | Old/tangential content | 90%+ |
Implementation Hooks¶
# src/marianne/prompts/compression.py (new file)
from enum import Enum
from dataclasses import dataclass


class Fidelity(Enum):
    FULL = "full"
    COMPRESSED = "compressed"
    PLACEHOLDER = "placeholder"


@dataclass
class ContentBlock:
    content: str
    fidelity: Fidelity
    importance: float  # 0.0-1.0


class ContextCompressor:
    """Compress context using adaptive fidelity levels."""

    def compress(
        self,
        blocks: list[ContentBlock],
        target_tokens: int
    ) -> str:
        """Compress blocks to fit target token budget."""
        # Sort by importance
        sorted_blocks = sorted(blocks, key=lambda b: b.importance, reverse=True)
        result = []
        current_tokens = 0
        for block in sorted_blocks:
            block_tokens = self._estimate_tokens(block.content)
            if current_tokens + block_tokens <= target_tokens:
                result.append(block.content)
                current_tokens += block_tokens
            elif block.fidelity != Fidelity.PLACEHOLDER:
                # Try to compress
                compressed = self._compress_block(block, target_tokens - current_tokens)
                result.append(compressed)
                current_tokens += self._estimate_tokens(compressed)
        return "\n\n".join(result)
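The _estimate_tokens and _compress_block helpers are referenced but not defined above. A rough sketch of what they could look like as ContextCompressor methods; the 4-characters-per-token heuristic and the head-truncation fallback are assumptions, not measured behavior:

# Possible ContextCompressor helper methods (sketch)
def _estimate_tokens(self, text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def _compress_block(self, block: ContentBlock, budget_tokens: int) -> str:
    """Shrink a block to roughly fit the remaining token budget.

    A real implementation could summarize via a small model (see the
    summarization approach above); this fallback keeps the head of the block.
    """
    max_chars = max(0, budget_tokens * 4)
    if len(block.content) <= max_chars:
        return block.content
    return block.content[:max_chars] + "\n[... compressed ...]"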
Strategy 4: Semantic Deduplication¶
What It Is¶
Detect and remove semantically redundant content before sending to the model.
Where Redundancy Occurs in Marianne¶
| Source | Example | Typical Redundancy |
|---|---|---|
| Learned patterns | Similar patterns from multiple sheets | 20-40% |
| Stakes + template | Repeated themes/instructions | 10-20% |
| Completion prompts | Original task echoed in failed validations | 30-50% |
Implementation Approach¶
# src/marianne/prompts/deduplication.py (new file)
from typing import Protocol


class EmbeddingProvider(Protocol):
    """Protocol for embedding providers."""
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class SemanticDeduplicator:
    """Remove semantically redundant content."""

    def __init__(
        self,
        embeddings: EmbeddingProvider,
        similarity_threshold: float = 0.85
    ):
        self.embeddings = embeddings
        self.threshold = similarity_threshold

    def deduplicate(self, segments: list[str]) -> list[str]:
        """Remove redundant segments using cosine similarity."""
        if len(segments) <= 1:
            return segments

        # Get embeddings
        vectors = self.embeddings.embed(segments)

        # Keep first occurrence, skip similar ones
        kept = [segments[0]]
        kept_vectors = [vectors[0]]
        for segment, vector in zip(segments[1:], vectors[1:]):
            if not self._is_redundant(vector, kept_vectors):
                kept.append(segment)
                kept_vectors.append(vector)
        return kept

    def _is_redundant(
        self,
        vector: list[float],
        existing: list[list[float]]
    ) -> bool:
        """Check if vector is too similar to any existing vector."""
        for existing_vec in existing:
            similarity = self._cosine_similarity(vector, existing_vec)
            if similarity > self.threshold:
                return True
        return False
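The _cosine_similarity helper is referenced but not shown. A dependency-free sketch of the method:

import math

def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)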
Simpler Alternative: MinHash¶
For environments without embedding models, MinHash provides fast approximate deduplication:
from datasketch import MinHash, MinHashLSH


def minhash_deduplicate(segments: list[str], threshold: float = 0.5) -> list[str]:
    """Deduplicate using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, segment in enumerate(segments):
        mh = MinHash(num_perm=128)
        for word in segment.split():
            mh.update(word.encode('utf8'))

        # Check for similar existing segments
        result = lsh.query(mh)
        if not result:
            lsh.insert(f"seg_{i}", mh)
            kept.append(segment)
    return kept
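A quick usage sketch with illustrative segments; with word-level MinHash and a 0.5 threshold, the near-duplicate will usually be dropped, though the result is probabilistic:

segments = [
    "Prefer small composable functions where possible.",
    "Prefer small composable functions where practical.",  # near-duplicate
    "Always validate the YAML config before execution.",
]
print(minhash_deduplicate(segments, threshold=0.5))
# Typically keeps the first and third segments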
Strategy 5: LLMLingua / Extractive Compression¶
What It Is¶
Use a smaller language model (GPT-2 Small, LLaMA-7B) to identify and remove non-essential tokens while preserving meaning.
The LLMLingua Family¶
| Version | Focus | Compression | Speed |
|---|---|---|---|
| LLMLingua v1 | General prompts | Up to 20x | Baseline |
| LongLLMLingua | Long contexts | 4-6x (quality focus) | Baseline |
| LLMLingua-2 | Speed + accuracy | 3-6x | 3-6x faster |
Why Lower Priority for Marianne¶
| Consideration | Impact |
|---|---|
| Requires external model | Added complexity, latency |
| Best for RAG/documents | Marianne prompts are structured, not documents |
| Diminishing returns | Prompt caching already gives 90% savings |
| Dependency management | Additional Python packages |
If Implementing¶
# src/marianne/prompts/llmlingua.py (optional)
try:
    from llmlingua import PromptCompressor
    LLMLINGUA_AVAILABLE = True
except ImportError:
    LLMLINGUA_AVAILABLE = False


class LLMLinguaCompressor:
    """Optional LLMLingua integration for aggressive compression."""

    def __init__(self, model_name: str = "microsoft/llmlingua-2-xlm-roberta-large"):
        if not LLMLINGUA_AVAILABLE:
            raise ImportError("llmlingua not installed. Run: pip install llmlingua")
        self.compressor = PromptCompressor(model_name=model_name)

    def compress(
        self,
        prompt: str,
        target_ratio: float = 0.5,
        force_tokens: list[str] | None = None
    ) -> str:
        """Compress prompt to target ratio.

        Args:
            prompt: Original prompt text
            target_ratio: Target compression (0.5 = 50% of original)
            force_tokens: Tokens that must be preserved
        """
        result = self.compressor.compress_prompt(
            prompt,
            rate=target_ratio,
            force_tokens=force_tokens or [],
        )
        return result["compressed_prompt"]
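A brief usage sketch, assuming llmlingua is installed; the prompt and force_tokens values are illustrative:

long_prompt = "..."  # a fully built Marianne prompt

compressor = LLMLinguaCompressor()
compressed = compressor.compress(
    long_prompt,
    target_ratio=0.5,
    force_tokens=["sheet_num", "start_item", "end_item"],  # must-keep identifiers
)
print(len(long_prompt), "->", len(compressed))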
References¶
- Microsoft LLMLingua - EMNLP'23, ACL'24
- LongLLMLingua Paper
- PCToolkit - Unified compression toolkit
Implementation Roadmap¶
Phase 1: Anthropic Prompt Caching (Low Effort, High Impact)¶
Effort: 1-2 days
Impact: 70-90% cost reduction for API backend users
Tasks:
1. Add prompt_caching config option to BackendConfig
2. Implement _build_messages_with_caching() in AnthropicApiBackend
3. Add cache metrics to execution result (cache_hit, tokens_saved)
4. Update outcome tracking to record caching effectiveness
5. Document in README
Phase 2: Smart Completion Summarization (Medium Effort, Medium Impact)¶
Effort: 2-3 days
Impact: 40-60% reduction in completion prompt tokens
Tasks:
1. Create src/marianne/prompts/compression.py module
2. Implement ContextCompressor with fidelity levels
3. Replace naive truncation in build_completion_prompt()
4. Add summarization backend option (Haiku for cost)
5. A/B test completion success rates
Phase 3: Prefix Optimization (Medium Effort, Variable Impact)¶
Effort: 3-5 days
Impact: 20-40% latency reduction for multi-sheet jobs
Tasks:
1. Implement _extract_common_prefix() in JobRunner
2. Restructure prompt building to separate static/variable
3. Optimize batch API calls to share system prompts
4. Add prefix caching metrics
Phase 4: Optional Advanced Compression (High Effort, Diminishing Returns)¶
Effort: 1-2 weeks
Impact: Additional 10-30% on top of previous phases
Tasks:
1. Add optional llmlingua dependency
2. Create SemanticDeduplicator protocol
3. Implement MinHash fallback for no-embedding environments
4. Add compression config section to YAML (a possible shape is sketched below)
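A possible shape for that compression section; the key names are a proposal, not an implemented schema:

compression:
  deduplication:
    enabled: true
    method: minhash            # or "embeddings" when an embedding provider is configured
    similarity_threshold: 0.85
  llmlingua:
    enabled: false             # requires the optional llmlingua dependency
    target_ratio: 0.5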
Metrics to Track¶
When implementing compression, track these metrics:
from dataclasses import dataclass


@dataclass
class CompressionMetrics:
    """Metrics for compression effectiveness."""
    original_tokens: int
    compressed_tokens: int
    compression_ratio: float  # compressed / original
    cache_hits: int = 0
    cache_misses: int = 0
    cache_hit_rate: float = 0.0
    tokens_saved: int = 0
    estimated_cost_saved_usd: float = 0.0
    compression_latency_ms: float = 0.0
    quality_score: float | None = None  # If validated
Add to outcome recording (in .marianne-outcomes.json):

{
  "job_id": "my-job",
  "sheet_num": 1,
  "compression": {
    "strategy": "prompt_caching",
    "original_tokens": 4500,
    "compressed_tokens": 500,
    "cache_hit": true,
    "tokens_saved": 4000
  }
}
TDF Analysis Summary¶
| Domain | Key Insight |
|---|---|
| COMP | Clear integration points exist (PromptBuilder, Backend). Prompt caching requires API backend, not CLI. |
| SCI | Benchmarks show 90% savings achievable. RAG systems report 30-40% redundancy. |
| CULT | Anthropic's caching is Claude-native (no external dependencies). DSPy philosophy differs (signatures vs prompts). |
| EXP | Prompt caching "feels right" for Marianne's declarative model. LLMLingua feels like overkill. |
| META | This research itself demonstrates P5—Marianne documenting how to improve Marianne. |
Document created: 2026-01-05
Research conducted by: Claude (via Marianne parallel session)
For use by: Marianne evolution cycles, human contributors