
Building a Semantic Cache for LLM Responses

LLM API calls are slow and expensive. When your AI agent asks the same kind of question twice — "what does this function do?" or "generate a test for this interface" — you're paying full price for a response you already have. I built a semantic cache for Anvil that serves repeated queries in 80 microseconds instead of 2-3 seconds.

The three-stage lookup

Not all cache hits are equal. An exact text match is free. A semantic similarity search requires embedding computation. I structured the cache as a funnel:

Stage 1 — Exact match (~80 microseconds): Normalize whitespace and compare the full prompt text. This catches repeated operations like "run tests" or identical code explanations.

Stage 2 — Prefix match (~15 milliseconds): Compare the first 60 characters, then verify with embedding similarity. This catches prompts that start the same way but differ in trailing context (like "explain this function: [same function, different surrounding code]").

Stage 3 — Semantic match (~40 milliseconds): Full 384-dimensional cosine similarity against all cached entries. This catches rephrased questions about the same topic.

pub fn lookup(&self, prompt: &str) -> Option<CachedResponse> {
    // Stage 1: exact (fastest)
    if let Some(hit) = self.exact_match(prompt) {
        return Some(hit);
    }
    
    // Stage 2: prefix + embedding verification
    if let Some(hit) = self.prefix_match(prompt) {
        return Some(hit);
    }
    
    // Stage 3: full semantic search
    self.semantic_match(prompt)
}
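
Stage 2 isn't shown above; here's a rough sketch of how it could look. The by_prefix index, the embed helper, and the entry fields are assumptions for illustration, and cosine_similarity is defined in the next section:

fn prefix_match(&self, prompt: &str) -> Option<CachedResponse> {
    let prefix: String = prompt.chars().take(60).collect();
    let candidates = self.by_prefix.get(&prefix)?;  // bucket of entries sharing this prefix

    // Verify with embedding similarity so a shared prefix alone can't produce a hit
    let query = self.embed(prompt);
    candidates.iter()
        .map(|entry| (entry, cosine_similarity(&query, &entry.embedding)))
        .filter(|(_, sim)| *sim >= 0.80)  // Stage 2 threshold, see below
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(entry, _)| entry.response.clone())
}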

Embeddings without a server

I didn't want the cache to require an API call of its own. I use the all-MiniLM-L6-v2 model via ONNX, an ~80MB model that runs inference in-process with the ort crate. At 384 dimensions, the embeddings are small enough for fast cosine similarity scans while capturing enough semantic meaning to be useful.

The embedding step adds ~5ms per cache write. Since LLM responses take 2-10 seconds, this is negligible.
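
The similarity metric itself is nothing exotic: a dot product divided by the vector norms. A minimal version of the comparison used in Stages 2 and 3:

fn cosine_similarity(a: &[f32; 384], b: &[f32; 384]) -> f32 {
    // Dot product of the two embeddings
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    // Vector magnitudes
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

If the embeddings are normalized at write time, the division disappears and the Stage 3 scan reduces to a plain dot product per entry.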

The classifier problem

Not all prompts should match at the same threshold. A code generation prompt needs high precision — returning cached code for a slightly different spec would be wrong. A knowledge question ("what is CORS?") can match more aggressively since the answer is stable.

I built a simple classifier that categorizes prompts:

fn classify(prompt: &str) -> PromptCategory {
    let code_signals = ["implement", "write", "generate", "create", "fix"];
    let knowledge_signals = ["what is", "explain", "how does", "why"];

    // Word-boundary matching to avoid "buggy" matching "bug": pad with spaces
    // so a signal only counts when it appears as a whole word or phrase
    let padded = format!(" {} ", prompt.to_lowercase());
    let hits = |signals: &[&str]| signals.iter().filter(|s| padded.contains(&format!(" {} ", s))).count();

    // Score and classify
    if hits(code_signals.as_slice()) > hits(knowledge_signals.as_slice()) { PromptCategory::Code } else { PromptCategory::Knowledge }
}

The base similarity threshold is 0.85, but it adjusts by stage and category (the selection logic is sketched after this list):

  • Stage 3 semantic search for knowledge prompts: threshold drops to 0.82 (more lenient since knowledge answers are stable)
  • Stage 2 prefix matching: threshold drops to 0.80 (the prefix already provides strong signal)
  • Code prompts at Stage 3: stays at the full 0.85 (high precision needed)
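
Put together, picking a threshold is a small lookup. The Stage enum here is an assumption for illustration, but the values match the list above:

enum Stage { Exact, Prefix, Semantic }

fn similarity_threshold(stage: Stage, category: PromptCategory) -> f32 {
    match (stage, category) {
        (Stage::Prefix, _) => 0.80,                            // prefix already provides strong signal
        (Stage::Semantic, PromptCategory::Knowledge) => 0.82,  // stable answers, match more leniently
        _ => 0.85,                                             // base threshold, including code at Stage 3
    }
}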

Git-aware invalidation

Here's the problem with caching code-related responses: the code changes. If I cached "explain parse_config()" yesterday and then refactored the function today, the cached explanation is wrong.

The solution: include the git tree hash in the cache key.

struct CacheKey {
    prompt_hash: u64,
    git_tree_hash: String,    // HEAD tree hash
    dirty: bool,              // uncommitted changes exist
}

When the tree hash changes (new commit), code-related caches invalidate. Knowledge caches survive across commits since they're not code-dependent.
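
Populating the key takes two git calls. A sketch, assuming the cache shells out to git (current_git_state is a hypothetical helper, not Anvil's actual API):

use std::process::Command;

fn current_git_state(repo: &str) -> (String, bool) {
    // Tree hash of HEAD: changes whenever a new commit lands
    let tree = Command::new("git")
        .args(["-C", repo, "rev-parse", "HEAD^{tree}"])
        .output()
        .expect("failed to run git");
    let git_tree_hash = String::from_utf8_lossy(&tree.stdout).trim().to_string();

    // Any porcelain output means uncommitted changes in the working tree
    let status = Command::new("git")
        .args(["-C", repo, "status", "--porcelain"])
        .output()
        .expect("failed to run git");
    let dirty = !status.stdout.is_empty();

    (git_tree_hash, dirty)
}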

Quality tiers

I run multiple models in Anvil — Opus for complex tasks, Sonnet for standard work, Haiku for simple operations. A cached response from Opus can serve a Haiku request (higher quality is fine), but not vice versa.

struct CachedResponse {
    content: String,
    model_tier: ModelTier,  // Opus > Sonnet > Haiku
    timestamp: Instant,
}
 
fn is_compatible(cached: ModelTier, requested: ModelTier) -> bool {
    cached >= requested  // Opus serves all, Haiku serves only Haiku
}
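
The comparison only works if the tier ordering is explicit. With a derived Ord, Rust orders variants by declaration order, so the enum presumably looks something like this:

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ModelTier {
    Haiku,   // lowest tier: can only serve Haiku requests
    Sonnet,
    Opus,    // highest tier: can serve any request
}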

Eviction

The cache can't grow forever. I use a hybrid eviction strategy (the per-entry check is sketched after the list):

  • TTL: 7 days for code responses, 30 days for knowledge
  • Size cap: LRU eviction when total cache exceeds 500MB
  • Staleness: Entries not hit in 14 days get evicted regardless of TTL
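
A sketch of the per-entry check. The category and last_hit fields are assumptions (the struct shown earlier omits them), and the 500MB LRU cap is enforced separately:

use std::time::{Duration, Instant};

const DAY: Duration = Duration::from_secs(24 * 60 * 60);

fn is_expired(entry: &CachedResponse, now: Instant) -> bool {
    // TTL depends on category: code goes stale with the repo, knowledge doesn't
    let ttl = match entry.category {
        PromptCategory::Code => 7 * DAY,
        PromptCategory::Knowledge => 30 * DAY,
    };
    now.duration_since(entry.timestamp) > ttl
        // Staleness: evict anything not hit in 14 days, regardless of TTL
        || now.duration_since(entry.last_hit) > 14 * DAY
}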

Results

On a typical development session with an AI agent:

  • Cache hit rate: ~35-40% (higher for iterative debugging sessions)
  • Average savings per hit: 2.8 seconds and ~$0.003
  • Over an 8-hour session: ~$2-4 saved, ~15 minutes of wait time eliminated

The biggest wins come during iterative debugging where the agent re-examines the same files and asks similar questions about code structure. Those Stage 1 exact matches are essentially free.

What I'd do differently

The prefix matching (Stage 2) is the weakest link. Prompts that share a 60-character prefix aren't necessarily semantically similar — it's a heuristic that works 80% of the time. If I rebuilt this, I'd skip Stage 2 and invest in faster Stage 3 (approximate nearest neighbor with an index like HNSW instead of brute-force scan).