The "Inference Acceleration Puzzle" of Modern AI: Evolution from KV Cache to PagedAttention
In LLM inference optimization, discussions often revolve around PagedAttention or Speculative Decoding. Much of that work, however, comes back to one underlying problem: how to efficiently manage and use the KV Cache (Key-Value Cache). If you want to understand why large model inference consumes so much VRAM and why longer contexts run slower, the KV Cache is the place to start.
1. Why is KV Cache Needed?
LLM text generation is autoregressive: each new token is predicted from the prompt plus every token generated so far, so without caching the whole sequence has to be fed back through the model at every step.
In the Transformer's attention mechanism, each token produces three vectors: Query (Q), Key (K), and Value (V).
- Query: What the current token is "looking for."
- Key: What historical tokens "can provide."
- Value: The "actual content" contained in historical tokens.
The calculation process is: $\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$.
The key point is that, under causal attention, the $K$ and $V$ vectors of tokens that have already been processed do not change in later steps. If we re-ran the model over the full sequence every time a new token is generated, each step would recompute $K$ and $V$ for all $n$ previous tokens and redo the full attention computation, whose cost grows quadratically with sequence length, i.e., $\mathcal{O}(n^2)$ per step.
To avoid redundant calculations, we store the $K$ and $V$ produced at each step in VRAM; this is the KV Cache. Thus, at each step, we only need to calculate the $Q, K, V$ for the current new token and then directly read the previous $K, V$ from the cache for matrix multiplication. This reduces the attention cost of each step to $\mathcal{O}(n)$ (one query against $n$ cached keys), although the total over a whole generation still grows as $\mathcal{O}(n^2)$.
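The equivalence between recomputing and caching can be checked on a toy example. The sketch below assumes a single attention head, a single layer, random projection matrices, and random vectors standing in for token embeddings; none of these values come from a real model. It only illustrates that appending each new token's $K$ and $V$ to a cache gives the same attention output as re-projecting the whole prefix.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: (d,), K and V: (t, d). One query against t keys/values.
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8                                   # toy head dimension (assumption)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((5, d))    # stand-in for 5 token embeddings

# Without a cache: re-project every previous token at every step.
no_cache = []
for t in range(1, len(tokens) + 1):
    x = tokens[:t]
    K, V = x @ Wk, x @ Wv               # recomputed from scratch each step
    no_cache.append(attend(x[-1] @ Wq, K, V))

# With a cache: project only the new token and append its K and V.
k_cache, v_cache, with_cache = [], [], []
for x_t in tokens:
    k_cache.append(x_t @ Wk)
    v_cache.append(x_t @ Wv)
    with_cache.append(attend(x_t @ Wq, np.stack(k_cache), np.stack(v_cache)))

# True when both strategies produce the same outputs (up to float rounding).
same = np.allclose(no_cache, with_cache)
```

In the cached loop the per-step projection work is constant, and only the attention read over the cache grows with the sequence length.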
2. The Cost of KV Cache: The VRAM Black Hole
While KV Cache improves speed, it imposes significant pressure on VRAM.
Calculation Formula
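The size of a model's KV Cache is: $2 \times \text{batch\_size} \times \text{seq\_len} \times \text{num\_layers} \times \text{num\_kv\_heads} \times \text{head\_dim} \times \text{bytes\_per\_element}$, where the factor of 2 covers one copy each for Key and Value, and $\text{num\_kv\_heads}$ equals the number of attention heads in standard multi-head attention.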
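The formula translates directly into a small helper. This is a minimal sketch: the function name `kv_cache_bytes` and its parameter names are made up for illustration, with `num_kv_heads` standing for the number of heads whose K and V are actually cached and `bytes_per_elem=2` corresponding to FP16.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads,
                   head_dim, bytes_per_elem=2):
    """Bytes needed to hold K and V for every token, layer, and KV head."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_elem  # K and V
    return batch_size * seq_len * num_layers * per_token_per_layer

# One token of one sequence: 32 layers, 32 cached heads, head_dim 128, FP16.
per_token = kv_cache_bytes(1, 1, 32, 32, 128)
print(per_token / 1024, "KiB per token")
```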
Practical Quantification
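Take Llama-3-8B in FP16 as an example, first treating it as if it used plain multi-head attention with one KV head per query head:
- Layers: 32, Heads: 32, Head Dimension: 128.
- KV size per token per layer = $2 \times 32 \times 128 \times 2\text{ bytes} = 16\text{ KB}$.
- KV size per token for the entire model = $32\text{ layers} \times 16\text{ KB} = 512\text{ KB}$.
- The released Llama-3-8B actually uses GQA with 8 KV heads, so its real footprint is 4x smaller: $4\text{ KB}$ per token per layer and $128\text{ KB}$ per token for the whole model.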
If batch_size=32 and seq_len=4096:
$32 \times 4096 \times 512\text{ KB} = 64\text{ GiB} \approx 69\text{ GB}$.
This means that even though the model weights themselves occupy only about 16 GB in FP16, a KV Cache of this size would not fit next to them on a single A100 (80GB). High-concurrency long-text inference can require additional GPUs solely to store these "memory fragments."
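The same arithmetic can be turned around to ask how many full-length sequences fit on one card. The sketch below is a rough estimate under stated assumptions: an 80 GB GPU, about 16 GB of FP16 weights, and no allowance for activations or framework overhead. The helper name `max_sequences` is hypothetical.

```python
def max_sequences(gpu_bytes, weight_bytes, seq_len, num_layers,
                  num_kv_heads, head_dim, bytes_per_elem=2):
    """How many full-length sequences fit in the memory left after weights."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    per_sequence = seq_len * per_token
    return (gpu_bytes - weight_bytes) // per_sequence

GPU = 80 * 10**9       # assumed 80 GB card
WEIGHTS = 16 * 10**9   # assumed FP16 weights for an 8B-parameter model

mha = max_sequences(GPU, WEIGHTS, 4096, 32, 32, 128)  # 32 cached KV heads
gqa = max_sequences(GPU, WEIGHTS, 4096, 32, 8, 128)   # 8 cached KV heads
print(mha, gqa)
```

Under these assumptions a batch of 32 sequences at 4096 tokens does not fit with 32 cached heads, while cutting the cached heads to 8 leaves room for several times as many sequences.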
3. Evolution from "Brute-Force Storage" to "Smart Management"
Facing this VRAM pressure, the industry has developed three mainstream solutions:
A. GQA (Grouped-Query Attention) — Reducing Load at the Structural Level
Traditional MHA (multi-head attention) gives every Query head its own Key head and Value head. GQA divides the Query heads into groups, and each group shares a single Key/Value head (as used in Llama-3). The KV Cache shrinks by the group size, for example 4x when 32 Query heads share 8 KV heads, with little loss in model quality.
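A minimal NumPy sketch of the sharing, assuming 32 query heads, 8 KV heads, and random inputs (the causal mask is omitted for brevity). Only the 8 KV heads would be stored in the cache; they are expanded to match the query heads at compute time. The function name `gqa_attention` is illustrative, not a library API.

```python
import numpy as np

def gqa_attention(q, k, v):
    """q: (n_q_heads, t, d); k, v: (n_kv_heads, t, d). No causal mask."""
    n_q_heads, _, d = q.shape
    group = n_q_heads // k.shape[0]          # query heads per KV head
    k = np.repeat(k, group, axis=0)          # expand at compute time only
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
t, d = 6, 16
q = rng.standard_normal((32, t, d))          # 32 query heads
k = rng.standard_normal((8, t, d))           # only 8 KV heads are cached
v = rng.standard_normal((8, t, d))
out = gqa_attention(q, k, v)

# Cached K size relative to a cache holding one K head per query head.
cache_ratio = k.nbytes / np.repeat(k, 4, axis=0).nbytes
```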
B. PagedAttention — Optimizing Memory Management
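Traditional serving systems store each request's KV Cache in one contiguous region, pre-allocated for the maximum possible length, which wastes memory through internal and external fragmentation. PagedAttention, introduced by vLLM, stores the KV Cache in fixed-size blocks (similar to pages in operating system virtual memory) that need not be contiguous and are allocated on demand. Waste is then limited to the last, partially filled block of each sequence, which pushes KV Cache memory utilization close to 100%.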
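The bookkeeping behind this idea can be sketched as a block table that maps each sequence to a list of physical block ids. The class below is a toy illustration, not vLLM's implementation: the name `PagedKVAllocator`, its methods, and the block sizes are all assumptions, and it tracks only which slot a token would occupy rather than storing any tensors.

```python
class PagedKVAllocator:
    """Toy block-table allocator; names and sizes are illustrative only."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (block_id, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:          # last block full, or none yet
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=4)
for _ in range(5):
    alloc.append_token("a")       # 5 tokens occupy 2 blocks
for _ in range(3):
    alloc.append_token("b")       # 3 tokens occupy 1 block
blocks_a = alloc.tables["a"]
alloc.release("a")                # both of a's blocks become reusable
```

Because a block is claimed only when the previous one fills up, a sequence never holds more than one partially used block, and released blocks can be handed to any other sequence regardless of where they sit in memory.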
C. Quantization — Compressing via Precision
KV Cache quantization stores the cached $K$ and $V$ in INT8 or FP8 instead of FP16. This halves the cache's VRAM usage (lower-bit formats such as INT4 save more), usually with an acceptable impact on generation quality.
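As a rough illustration of the precision trade-off, the sketch below applies symmetric absmax INT8 quantization to a single vector, using one scale per vector. This is one simple scheme chosen for clarity; production systems typically quantize whole tensors with per-channel, per-token, or per-group scales, and the function names and sample values here are made up.

```python
def quantize_int8(vec):
    """Symmetric absmax quantization of one K or V vector to int8 codes."""
    scale = max(abs(x) for x in vec) / 127 or 1.0   # fall back to 1.0 for all-zero input
    return [round(x / scale) for x in vec], scale

def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]

k = [0.12, -1.5, 0.73, 0.0, 2.54]      # made-up K vector
codes, scale = quantize_int8(k)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(k, restored))
```

Each stored element drops from 2 bytes to 1, at the cost of one extra scale per vector and a reconstruction error of at most half a step.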
Summary
KV Cache is a prime example of "trading space for time" in LLM inference. It solves the redundant computation problem of autoregressive generation but has also become one of the biggest bottlenecks limiting throughput and context length. From GQA to PagedAttention and quantization, the evolution of AI systems engineering is essentially a struggle against this massive block of "memory fragments."