How Does AI Actually Remember? KV Cache, Vector Stores, and External Memory Explained
A practical breakdown of AI memory architecture: KV Cache, vector databases, and external memory systems. Real-world experience from running 15 agents at SFD Lab.

2 AM: My Agent Forgot What I Just Said
Last month, late at night, I was walking our main SFD Lab agent through a complex task. I had given a lot of context upfront — then mid-conversation, it acted like the early details had never existed. Not because it was dumb. The context window had filled up and those early tokens got silently truncated.
That night I started digging into a question I had been avoiding: where exactly does an AI store memory? How does it work? Why does it sometimes remember and sometimes forget? This post collects my notes from finally piecing the full picture together.
Three Types of AI Memory
Think of it as three drawers:
Drawer 1 — Parametric Memory (what it learned during training)
This is the deepest layer, baked into the model weights. GPT-4 knows that Newton formulated the law of gravitation not because you told it, but because it saw that fact millions of times during training, encoded across its billions of parameters. This memory is permanent but frozen. Once a model ships, this layer does not change.
Drawer 2 — Contextual Memory (the KV Cache)
This is classic short-term memory. Every token in your conversation gets encoded as a Key-Value vector pair and cached. When the model generates the next token, it runs attention across all those cached KVs to pull in relevant context. The context window size is the hard limit on how many KV pairs can fit. At 128k tokens, that is 128k positions. Hit the ceiling and something gets dropped — no exceptions.
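That silent truncation from my 2 AM story is exactly this ceiling in action. Here is a minimal sketch of the drop-oldest policy many serving stacks apply; the token counting is a crude stand-in (a real system would use the model's actual tokenizer, e.g. tiktoken) and the eviction logic is deliberately naive:

```python
# Toy sketch of context-window truncation with a drop-oldest policy.
# count_tokens is a rough approximation, NOT a real tokenizer.

def count_tokens(text: str) -> int:
    # Crude stand-in: ~1 token per whitespace-separated word.
    return len(text.split())

def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Drop the oldest messages until the conversation fits."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # the early details silently disappear
    return kept

history = ["system: you are the SFD Lab agent",
           "user: here is a lot of upfront context ...",
           "user: and the actual question"]
print(fit_to_window(history, max_tokens=12))  # only the tail survives
```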
Drawer 3 — External Memory (vector stores, document retrieval)
This is the most flexible layer and the foundation of RAG. You chunk your documents, embed them into vectors, store them in Qdrant or Pinecone, and retrieve the most relevant chunks at query time. The model reads the retrieved text like reading a note you just handed it.
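To make the chunk, embed, store, retrieve loop concrete, here is a toy in-memory version. The embed() function is a placeholder that produces deterministic fake vectors, not real semantic embeddings; a production setup would call an embedding model and store the vectors in Qdrant or Pinecone instead of a Python list:

```python
import hashlib
import numpy as np

# Toy in-memory vector store illustrating the RAG loop:
# chunk -> embed -> store -> retrieve.

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Deterministic FAKE embedding (hash-seeded); not semantic.
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

class ToyVectorStore:
    def __init__(self):
        self.chunks: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, chunk: str) -> None:
        self.chunks.append(chunk)
        self.vectors.append(embed(chunk))

    def search(self, query: str, top_k: int = 3) -> list[str]:
        q = embed(query)
        # Cosine similarity reduces to a dot product on unit vectors.
        scores = np.array([v @ q for v in self.vectors])
        best = np.argsort(scores)[::-1][:top_k]
        return [self.chunks[i] for i in best]

store = ToyVectorStore()
for chunk in ["KV cache lives inside the model",
              "Vector stores live outside the model",
              "MEMORY.md persists across sessions"]:
    store.add(chunk)

# Retrieved chunks get pasted into the prompt, like a handed note.
print(store.search("where does external memory live?", top_k=2))
```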
How KV Cache Actually Works
Every Transformer layer computes Key and Value vectors for each token. Without caching, every generation step would recompute the K and V vectors for all previous tokens, an O(n²) cost per step that makes long conversations unusable in practice. The cache stores those KVs once, so each step only computes the projections for the newest token and reuses everything else.
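A minimal single-head sketch of that incremental pattern, in numpy. The shapes and softmax are real; everything else (one head, one layer, random weights, no positional encoding) is simplified for illustration:

```python
import numpy as np

# Minimal single-head attention with an incremental KV cache.
# Each decode step projects ONLY the newest token, appends its K/V
# to the cache, then attends over all cached positions.

d = 8                          # head dimension (toy size)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

K_cache = np.empty((0, d))     # grows by one row per generated token
V_cache = np.empty((0, d))

def decode_step(x: np.ndarray) -> np.ndarray:
    """One generation step for the newest token embedding x (shape [d])."""
    global K_cache, V_cache
    q = x @ W_q
    # K/V computed for the new token only; prior rows are reused as-is.
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    scores = K_cache @ q / np.sqrt(d)   # attend over ALL cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax
    return weights @ V_cache            # context-mixed output

for step in range(5):                   # 5 tokens, 5 cheap steps
    out = decode_step(rng.standard_normal(d))
print("cached positions:", K_cache.shape[0])  # -> 5
```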
The memory cost is real. When we ran Qwen3.5 35B on our MS01 machine at 128k context, the KV Cache alone consumed most of the box's 96GB of RAM. We now cap at 32k, which is more than enough for real work, and the model stays stable.
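The back-of-envelope math makes the blowup obvious. The formula is standard (2 tensors, K and V, per layer per token); the config below is a hypothetical 35B-class model with grouped-query attention, not the published Qwen numbers:

```python
# KV Cache sizing:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem
# HYPOTHETICAL config for a 35B-class GQA model, fp16 cache.

def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    total = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return total / 2**30

for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_gib(layers=64, kv_heads=8, head_dim=128, seq_len=ctx)
    print(f"{ctx:>7} tokens -> {gib:5.1f} GiB of KV Cache")
# ->   8192 tokens ->   2.0 GiB
# ->  32768 tokens ->   8.0 GiB
# -> 131072 tokens ->  32.0 GiB
```

Scale that linear growth up, add the model weights themselves, and a 96GB machine runs out of headroom fast. Note this counts the cache only, not weights or activations.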
Vector Search vs KV Cache
These get confused constantly. KV Cache is internal model infrastructure — the transformer working memory during inference. Vector stores are an external retrieval system — semantic search over a knowledge base you built. One is short-term context; the other is long-term retrieval. Completely different things, often used together in production RAG systems.
How We Use This at SFD Lab
Our 15 agents run on 32k context windows. Anything that needs to persist beyond a session goes into MEMORY.md files — simple and reliable. We also run a shared Memos instance where agents write and read structured coordination logs. Next step is Qdrant for semantic search over older conversation logs. We have not built that yet because MEMORY.md keeps working.
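The MEMORY.md pattern really is this simple. A sketch of the idea, where the entry format and helper names are illustrative, not our actual schema:

```python
from datetime import datetime, timezone
from pathlib import Path

# Sketch of the MEMORY.md pattern: anything worth keeping beyond a
# session gets appended as a timestamped markdown bullet, and the
# whole file is pasted into the context at session start.

MEMORY_FILE = Path("MEMORY.md")

def remember(note: str) -> None:
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    with MEMORY_FILE.open("a", encoding="utf-8") as f:
        f.write(f"- [{stamp}] {note}\n")

def recall() -> str:
    # Small enough to feed straight into a 32k window.
    if MEMORY_FILE.exists():
        return MEMORY_FILE.read_text(encoding="utf-8")
    return ""

remember("User prefers 32k context caps on the MS01 box")
print(recall())
```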
SFD Editor note: The biggest trap we fell into was assuming a 128k context window meant we did not need memory management. We did. KV state scales linearly with context, so a 128k window holds 16x the cache of an 8k window, and every decode step attends over 16x the positions. Now we keep windows tight and let MEMORY.md and retrieval fill in the gaps.