KV Cache Optimization: Why Your Local AI Gets Slower Over Time
A practical guide to KV Cache optimization for local LLM VRAM problems

What is KV Cache? Why Does It Consume So Much VRAM?
1:46 AM. The VRAM graph on the monitoring panel looks like a roller coaster.
Franky dropped a message in the group: "Qwen3.5 was blazing fast at startup, so why is it stuttering like a slideshow after half an hour?"
I stared at Grafana for ten minutes and finally found the culprit—KV Cache exploded.
In plain terms: KV Cache is the LLM's "short-term memory."
Every time you chat with the model, it needs to remember what was said before. Otherwise when you ask "How do I install that skill you mentioned?", the model is like: "Which skill? What did I say?"
Here's the technical detail: In Transformer architecture, every token generation requires computing Query, Key, Value matrices. The Key and Value already computed for previous tokens don't need recalculation—just store and reuse them. That's KV Cache.
But here's the problem: KV Cache grows linearly with conversation length.
VRAM Usage = Base Model Weights + KV Cache + Activations
KV Cache Size ≈ 2 × Layers × Heads × Head Dim × Sequence Length × Batch Size × Precision Bytes
Take Qwen3.5-35B: 48 layers, 40 attention heads, 128 head dim, FP16 (2 bytes). Run a 32K context, and KV Cache easily eats 20GB VRAM. Your 48GB card? Half gone.
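To make the formula concrete, here is a quick back-of-the-envelope script (the function and numbers are illustrative). It assumes full multi-head attention, where the KV head count equals the attention head count; models that use grouped-query attention store far fewer KV heads, which is why the figure you actually observe can land well below the theoretical estimate.

```python
# Back-of-the-envelope KV Cache sizing using the formula above.
# Assumes full multi-head attention (KV heads == attention heads);
# GQA models keep fewer KV heads, shrinking the cache proportionally.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # The factor of 2 covers storing both Key and Value per token per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

size = kv_cache_bytes(layers=48, kv_heads=40, head_dim=128, seq_len=32_768)
print(f"{size / 1024**3:.1f} GiB")  # 30.0 GiB at FP16 with full multi-head attention
```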
Three Pits We Fell Into
Pit 1: Unlimited Context = VRAM Explosion
Early on we set no limits. Agent conversations could reach 64K. Result?
3 AM, Little Raccoon's PRD generation task suddenly failed. Log: CUDA out of memory. It stuffed the entire PRD conversation history—128K tokens. KV Cache directly burst the 80GB A100.
Solution: Hard-coded limit, max_context_length: 16384.
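What that cap looks like in code is unexciting: trim the oldest turns before the request ever reaches the inference server. A minimal sketch, assuming a count_tokens helper that wraps whatever tokenizer you use:

```python
MAX_CONTEXT_LENGTH = 16384

def trim_history(messages, count_tokens, budget=MAX_CONTEXT_LENGTH):
    # Walk backwards from the newest message and keep as much as fits the budget;
    # anything older than that never reaches the model at all.
    kept, used = [], 0
    for msg in reversed(messages):
        tokens = count_tokens(msg["content"])
        if used + tokens > budget:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))
```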
Pit 2: VRAM Trap in Batch Inference
Think you're serving one Agent? Wrong.
9 AM, three Agents request inference simultaneously. Each takes 20GB of KV Cache, so that's 60GB gone. Add the 70GB of model weights (FP16) and you suddenly need 130GB of VRAM.
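It helps to write the budget down before the OOM does it for you. A tiny helper using the figures from this incident (the weight and per-session cache sizes are ours, not universal):

```python
WEIGHTS_GB = 70          # ~35B parameters at FP16
KV_PER_SESSION_GB = 20   # one 32K-context session, per the estimate above

def vram_needed(concurrent_sessions: int) -> int:
    # Weights are loaded once; every concurrent session adds its own KV Cache.
    return WEIGHTS_GB + concurrent_sessions * KV_PER_SESSION_GB

print(vram_needed(3))  # 130 GB, far beyond a single 80GB A100
```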
Pit 3: Cache Not Released After Multi-turn Conversations
The sneakiest pit: Agent completes task, but KV Cache isn't cleared.
We have a monitoring script polling Agent status every 5 minutes. After a month, it had accumulated cache from 8,000+ conversation rounds.
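The fix was boring but effective: tear the cache down explicitly whenever a task finishes. A minimal sketch for a transformers-style setup; the session object and its past_key_values field are placeholders for wherever your code keeps the cache.

```python
import gc
import torch

def release_kv_cache(session):
    # Drop the Python references to the cached Key/Value tensors first.
    session.past_key_values = None
    # Then force collection and hand the freed blocks back to the CUDA allocator.
    gc.collect()
    torch.cuda.empty_cache()
```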
Practical Optimization Solutions
Solution 1: Paged Attention
vLLM's killer feature. Slice the KV Cache into small fixed-size blocks (16 tokens each by default) and manage them like OS memory pages.
Benefits? No fragmentation, dynamic allocation/release, multi-session VRAM pool sharing.
Local test: with the same 48GB of VRAM, Paged Attention serves 8 concurrent sessions; the traditional contiguous allocation manages only 3.
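Getting this locally is mostly a matter of launching through vLLM. A sketch of the setup (the model name, limits, and block size here are examples, not recommendations):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",  # example model; substitute your own
    max_model_len=16384,                # the hard context cap from Pit 1
    gpu_memory_utilization=0.90,        # share of VRAM vLLM may claim for weights + KV blocks
    block_size=16,                      # tokens per KV Cache block
)

outputs = llm.generate(
    ["Explain PagedAttention in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```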
Solution 2: KV Cache Quantization
An FP16 KV Cache stores 2 bytes per element. Quantize to INT8 and that's cut in half. Precision loss? Nearly imperceptible: BLEU score difference under 0.5%.
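If you serve through vLLM, the switch is a single engine argument. Note that vLLM's built-in option is FP8 rather than INT8 (llama.cpp offers q8_0 K/V cache types for the same purpose), but the VRAM math is identical: one byte per element instead of two.

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",  # example model
    max_model_len=16384,
    kv_cache_dtype="fp8",  # store Key/Value at 1 byte per element instead of FP16's 2
)
```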
Solution 3: Sliding Window Attention
Not all context needs remembering. For code generation, recent 2K tokens matter most. Earlier content? Can discard.
With window_size: 4096, the KV Cache size stays fixed and no longer grows with the conversation.
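A simplified sketch of the trimming step for a hand-rolled generation loop. Real sliding-window attention is built into the model's attention pattern, and naively truncating the cache can cost some quality on RoPE models, but it shows where the fixed upper bound comes from.

```python
WINDOW_SIZE = 4096

def trim_kv(past_key_values, window=WINDOW_SIZE):
    # past_key_values: one (key, value) pair per layer,
    # each tensor shaped [batch, heads, seq_len, head_dim].
    # Keep only the most recent `window` positions so the cache stops growing.
    return tuple(
        (k[:, :, -window:, :], v[:, :, -window:, :])
        for k, v in past_key_values
    )
```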
SFD Editor's Note
Today's 1:46 AM VRAM alert made us reconsider the cost of "unlimited context." Technically achievable 128K doesn't mean you should use 128K. Appropriate context length matters more than chasing numbers.
Our production strategy: Code tasks 8K, document writing 16K, data analysis 32K, daily chat 4K. VRAM is scarce—use it wisely.
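For reference, that strategy is just a lookup table (the task keys are illustrative):

```python
CONTEXT_BUDGETS = {
    "code": 8_192,
    "documents": 16_384,
    "data_analysis": 32_768,
    "chat": 4_096,
}

def budget_for(task_type: str) -> int:
    # Unknown task types fall back to the smallest budget.
    return CONTEXT_BUDGETS.get(task_type, 4_096)
```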