The "Memory Wall" in Modern AI Systems: Pressure from KV Cache and Optimization Paths
During Large Language Model (LLM) inference, the scarcest resource is often not compute (FLOPs) but memory bandwidth. A central concept in any discussion of these bottlenecks is the KV Cache (Key-Value Cache). This article explains what the KV Cache is, how it creates a "memory wall," and the mainstream optimization strategies the industry currently uses.
What is KV Cache?
The core of the Transformer architecture is the Attention mechanism. During text generation, the model operates in an autoregressive manner: for every new token generated, it must attend to all previously generated tokens.
Without caching, generating the $N$-th token would require recomputing the Key and Value vectors for all $N-1$ preceding tokens. The projection work per step therefore grows linearly with position, and the total over a whole sequence grows quadratically with its length. To avoid this redundant computation, the Key and Value vectors produced in earlier steps are kept in GPU memory; this store is the KV Cache.
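The difference can be sketched with a toy single-head attention layer in plain Python. This is only an illustration under simplifying assumptions: the helper names (`matvec`, `attend`, `decode_without_cache`, `decode_with_cache`) are hypothetical, the "tokens" are random vectors rather than model outputs, and a real model has many layers and heads.

```python
import math
import random

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all given positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def decode_without_cache(tokens, wq, wk, wv):
    """Recompute K and V for every earlier token at every step."""
    outputs, projections = [], 0
    for t in range(len(tokens)):
        keys = [matvec(wk, x) for x in tokens[: t + 1]]
        values = [matvec(wv, x) for x in tokens[: t + 1]]
        projections += 2 * (t + 1)
        outputs.append(attend(matvec(wq, tokens[t]), keys, values))
    return outputs, projections

def decode_with_cache(tokens, wq, wk, wv):
    """Project each token once and append its K and V to the cache."""
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        k_cache.append(matvec(wk, x))
        v_cache.append(matvec(wv, x))
        projections += 2
        outputs.append(attend(matvec(wq, x), k_cache, v_cache))
    return outputs, projections

random.seed(0)
DIM, STEPS = 4, 6

def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

wq, wk, wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(STEPS)]

slow_out, slow_proj = decode_without_cache(tokens, wq, wk, wv)
fast_out, fast_proj = decode_with_cache(tokens, wq, wk, wv)
print(slow_proj, fast_proj)  # 42 vs 12 K/V projections for 6 steps
```

Both functions produce the same attention outputs. The cached version trades the repeated projections for memory that grows by one Key and one Value vector per step.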
Causes of the "Memory Wall"
While KV Cache eliminates computational redundancy, it introduces significant memory pressure:
- Substantial Space Consumption: The size of the KV Cache is proportional to the number of layers, the number of KV heads, the per-head dimension, the sequence length, and the batch size. For a Llama-3-70B model at FP16 precision, the KV Cache for a single request can reach several gigabytes in long-context scenarios.
- Memory-Bound Decoding: The decoding phase of inference is typically memory-bound. GPU compute units can process data far faster than memory can deliver it. For every generated token, the model must stream the weights and the entire KV Cache from HBM (High Bandwidth Memory) into on-chip SRAM, so the GPU spends most of its time "waiting for data" rather than computing.
- Fragmentation Issues: Conventional allocation reserves one contiguous region per request, often sized for the maximum possible sequence length. This wastes memory inside each reservation (internal fragmentation) and leaves unusable gaps between reservations (external fragmentation), so usable memory falls well below the physical limit and caps the maximum concurrency (and therefore throughput).
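A back-of-the-envelope calculation makes the first two points concrete. The function name `kv_cache_bytes` is hypothetical. The shape used (80 layers, 8 KV heads, head dimension 128) is assumed to match the published Llama-3-70B configuration, and the 2 TB/s bandwidth figure is a round number chosen for illustration, not a measurement of any particular GPU.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and token."""
    # Leading factor of 2: one tensor for keys, one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Assumed Llama-3-70B-style shape at FP16 (2 bytes per element).
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
per_request = kv_cache_bytes(80, 8, 128, seq_len=8192)
print(per_token / 1024)      # 320.0 KiB per token
print(per_request / 2**30)   # 2.5 GiB for one 8192-token request

# Illustrative HBM bandwidth (assumption): 2 TB/s.
hbm_bytes_per_s = 2e12
step_ms = per_request / hbm_bytes_per_s * 1e3
print(round(step_ms, 2))     # about 1.34 ms just to read this cache once
```

Every decode step has to read the whole cache, so the last figure is a lower bound on per-token latency contributed by cache traffic alone, before the model weights are counted.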
Industry Optimization Paths
To break through this "wall," current optimization efforts focus on three dimensions: reducing storage volume, improving memory access efficiency, and optimizing memory management.
1. Reducing Storage Volume: MQA and GQA
Standard Multi-Head Attention (MHA) gives every Query head its own Key head and Value head, so the cache grows with the number of attention heads.
- MQA (Multi-Query Attention): All Query heads share a single Key head and a single Value head. This shrinks the cache by a factor equal to the number of Query heads, but may sacrifice some accuracy.
- GQA (Grouped-Query Attention): A compromise between the two. Query heads are divided into groups, and each group shares one Key head and one Value head. Mainstream models such as Llama-3 adopt GQA to balance cache size against accuracy.
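The three schemes differ only in how Query heads map onto KV heads, which a few lines of Python can show. The function names and the choice of 64 Query heads are assumptions for illustration, and mapping consecutive Query heads to the same group is one common convention rather than a requirement.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head that a given Query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("Query heads must divide evenly into KV groups")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def cache_ratio_vs_mha(num_query_heads, num_kv_heads):
    """KV Cache size relative to MHA with the same number of Query heads."""
    return num_kv_heads / num_query_heads

NUM_Q = 64  # assumed Query-head count, for illustration only
summary = {}
for name, num_kv in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    mapping = [kv_head_for(q, NUM_Q, num_kv) for q in range(NUM_Q)]
    summary[name] = (len(set(mapping)), cache_ratio_vs_mha(NUM_Q, num_kv))
    print(name, "distinct KV heads:", summary[name][0],
          "cache vs MHA:", summary[name][1])
```

With these numbers, GQA with 8 KV heads stores one eighth of what MHA would, and MQA stores one sixty-fourth.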
2. Optimizing Memory Management: PagedAttention (vLLM)
Drawing inspiration from operating system virtual memory, vLLM introduced PagedAttention. Instead of reserving one contiguous region of GPU memory per request, it splits the KV Cache into fixed-size blocks (pages) that can live anywhere in physical memory, with a per-request block table mapping logical positions to physical blocks.
- Eliminating Fragmentation: External fragmentation disappears because all blocks are the same size, and internal fragmentation is limited to the unfilled part of each request's last block.
- Dynamic Sharing: Requests with an identical prefix (e.g., the same System Prompt) can point to the same physical blocks, which saves memory and significantly boosts throughput.
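The bookkeeping behind this idea can be sketched as a toy block-table allocator. This is not vLLM's implementation or API: the class `PagedKVAllocator` and all of its methods are hypothetical, it tracks only block ownership (no tensor data), and the copy-on-write step is an assumption about how a shared, partially filled block would need to be handled.

```python
class PagedKVAllocator:
    """Toy block-table allocator; names and layout are illustrative only."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}     # physical block id -> number of requests using it
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def _take_block(self):
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def add_request(self, request_id):
        self.block_tables[request_id] = []
        self.lengths[request_id] = 0

    def append_token(self, request_id):
        """Reserve room for one more token, taking a block only when needed."""
        table = self.block_tables[request_id]
        offset = self.lengths[request_id] % self.block_size
        if offset == 0:
            table.append(self._take_block())
        elif self.ref_count[table[-1]] > 1:
            # Copy-on-write: the partly filled last block is shared, so this
            # request gets a private block (a real system would copy the data).
            self.ref_count[table[-1]] -= 1
            table[-1] = self._take_block()
        self.lengths[request_id] += 1

    def fork(self, parent_id, child_id):
        """Start a new request that shares the parent's blocks (common prefix)."""
        table = list(self.block_tables[parent_id])
        for block in table:
            self.ref_count[block] += 1
        self.block_tables[child_id] = table
        self.lengths[child_id] = self.lengths[parent_id]

    def free(self, request_id):
        """Release a request; blocks return to the pool when unreferenced."""
        for block in self.block_tables.pop(request_id):
            self.ref_count[block] -= 1
            if self.ref_count[block] == 0:
                del self.ref_count[block]
                self.free_blocks.append(block)
        del self.lengths[request_id]


alloc = PagedKVAllocator(num_blocks=8, block_size=4)
alloc.add_request("system")
for _ in range(10):          # a 10-token shared prefix occupies 3 blocks
    alloc.append_token("system")
alloc.fork("system", "a")
alloc.fork("system", "b")
alloc.append_token("a")      # each child diverges after the prefix
alloc.append_token("b")
print(alloc.block_tables)
print(len(alloc.free_blocks))  # 3 blocks still free
```

In this run the three requests occupy five physical blocks in total, whereas three independent copies of the prefix would need nine. At most one partially filled block per request is ever wasted.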
3. Quantization and Compression
The KV Cache can be quantized from FP16/BF16 to INT8 or FP8, or to even lower bit-widths such as 4-bit. Moving to 8 bits halves memory usage, and 4 bits cuts it to roughly a quarter (plus a small overhead for scaling factors), which also relieves bandwidth pressure. The impact on generation quality is typically small at 8 bits and grows as the bit-width drops.
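As a minimal sketch of the idea, the following applies symmetric per-vector INT8 quantization to a single Key vector. The function names and sample values are hypothetical. Production systems choose their own granularity (per token, per channel, per group) and number formats, so this should be read as one simple scheme rather than what any given framework does.

```python
def quantize_int8(vector):
    """Symmetric per-vector INT8 quantization: integer codes plus one scale."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate floating-point values from the integer codes."""
    return [c * scale for c in codes]

key = [0.12, -1.5, 0.731, 0.0, 2.04, -0.333]  # made-up sample values
codes, scale = quantize_int8(key)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key, restored))
print(codes)
print(scale, max_error)
```

Each element now takes one byte instead of two, at the cost of one extra scale per vector and a rounding error of at most half the scale per element.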
Conclusion
The KV Cache is an unavoidable part of efficient LLM inference and a primary battleground for performance optimization. From algorithmic improvements like GQA to system-level innovations like PagedAttention and hardware-friendly quantization techniques, the core objective is the same: keeping data movement in step with computation. For developers, understanding this mechanism supports better-informed decisions when selecting a deployment framework and configuring concurrency parameters.