KV Cache Optimization in Modern AI Systems: From the Memory Wall to PagedAttention

During LLM inference, one of the most critical performance bottlenecks is not compute capacity (compute-bound), but memory bandwidth (memory-bound). When we dis

Illustration
KV Cache Optimization in Modern AI Systems: From the Memory Wall to PagedAttention

KV Cache Optimization in Modern AI Systems: From the Memory Wall to PagedAttention

During LLM inference, one of the most critical performance bottlenecks is not compute capacity (compute-bound), but memory bandwidth (memory-bound). When we discuss the inference speed of generative AI, we are essentially discussing how to efficiently manage the KV Cache (Key-Value Cache).

What is KV Cache?

In the decoding phase of a Transformer, every time the model generates a new token, it must perform attention calculations with all previously generated tokens. If the Key and Value vectors for all historical tokens were recalculated from scratch each time, the computational complexity would grow quadratically with sequence length.

To avoid this redundant computation, we store the K and V vectors produced in previous steps in memory; this is the KV Cache. This means that when generating the $n$-th token, we only need to compute the K and V for the current token and append them to the cache.

The Memory Wall and Fragmentation Issues

Although the KV Cache eliminates computational redundancy, it introduces significant memory pressure. For a typical Llama-3-70B model, the memory consumed per token by the KV Cache at FP16 precision is staggering. As concurrent requests (batch size) and context length increase, GPU memory quickly becomes saturated.

Traditional KV Cache management pre-allocates a contiguous block of memory. This approach suffers from two fatal flaws:
1. Internal Fragmentation: To accommodate the maximum possible sequence length, the system must reserve the maximum amount of space for each request. If the actual generated length is far shorter than the maximum, a large amount of GPU memory is wasted.
2. External Fragmentation: Due to varying lifecycles and lengths of requests, memory space becomes fragmented, making it impossible to allocate large contiguous blocks for new requests.

PagedAttention: Borrowing from Virtual Memory Mechanisms

PagedAttention, proposed by vLLM, fundamentally changes this landscape. Its core idea is to shift KV Cache storage from "contiguous arrays" to "paged storage," similar to virtual memory management in operating systems.

How It Works

PagedAttention divides the KV Cache into fixed-size "blocks." Each block can store a fixed number of tokens (e.g., 16).
- Logical Blocks $\rightarrow$ Physical Blocks: The model logically perceives the KV Cache as contiguous, but physically, it is distributed across non-contiguous GPU memory blocks.
- Block Table: The system maintains a mapping table that records the physical address corresponding to each logical block.
- Dynamic Allocation: A new physical block is allocated for a request only when the current physical block is full.

Benefits

  1. Near-Zero Waste: Except for minor slack in the last block, all GPU memory is utilized efficiently.
  2. Efficient Sharing (Copy-on-Write): In parallel sampling or multi-turn conversations, multiple requests can share the same set of physical blocks for common prefixes. Copy-on-Write operations to duplicate a block are triggered only when a specific request needs to modify its content.

Trade-offs in Engineering Practice

While PagedAttention significantly boosts throughput, several dimensions require attention during actual deployment:
- Choice of Block Size: Blocks that are too small increase the overhead of the mapping table; blocks that are too large reintroduce internal fragmentation. Typically, 16 or 32 serves as the balance point.
- Quantization Compression: To further reduce pressure, the industry has adopted FP8 or INT8 quantization for the KV Cache ($\text{KV_quant}$), and even uses GQA (Grouped Query Attention) to reduce the number of K and V heads.

Conclusion

The evolution of KV Cache has moved from "simple caching" $\rightarrow$ "static pre-allocation" $\rightarrow$ "dynamic paged management." PagedAttention is not merely an algorithmic optimization, but a profound restructuring of underlying hardware resource management. For building high-performance AI systems, understanding and optimizing memory layout is more critical than simply stacking raw compute power.

Comments

Share your thoughts!

Leave a Comment

0/500

Loading comments…