The "Spatial Magic" of Modern AI Inference: How PagedAttention Ends GPU Memory Fragmentation
In production environments for Large Language Models (LLMs), inference cost is driven less by the number of model parameters than by one core metric: throughput. The biggest bottleneck limiting throughput is often not GPU compute power, but how efficiently GPU memory is used.
When discussing KV Cache, most people focus on how it accelerates generation. However, the true nightmare in engineering practice is memory fragmentation.
The "Dead End" of Contiguous Storage
Traditional inference frameworks tend to use contiguous memory allocation when handling KV Cache. This creates two distinct problems:
- Pre-allocation Waste: To keep a request from failing as its sequence grows, the system must pre-allocate a contiguous region sized for max_seq_len (e.g., 32,000 tokens). If a user inputs only 10 tokens, the remaining 31,990 positions stay reserved and unavailable to other requests.
- External Fragmentation: The interleaved lifecycles of different requests leave numerous small holes in GPU memory. These fragments cannot be combined into a contiguous block large enough to accommodate a new request.
This "static allocation" mode leads to extremely low GPU memory utilization, directly limiting the batch size and leaving expensive H100 GPUs in a "starved" state much of the time.
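To put rough numbers on the pre-allocation waste described above, here is a back-of-the-envelope sketch. The model dimensions (32 layers, 32 KV heads, head dimension 128, 2-byte fp16 values) are assumptions chosen for illustration, not figures from this article, and the helper name kv_bytes_per_token is hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV Cache needed for one token.

    The leading factor of 2 accounts for storing both a key and a value
    vector in every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed model shape (hypothetical, roughly a 7B-class model in fp16).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

max_seq_len = 32_000   # slots reserved up front by contiguous allocation
used_tokens = 10       # slots the request actually fills

reserved_bytes = per_token * max_seq_len
used_bytes = per_token * used_tokens

print(f"KV Cache per token: {per_token / 2**10:.0f} KiB")
print(f"Reserved:           {reserved_bytes / 2**30:.2f} GiB")
print(f"Actually used:      {used_bytes / 2**20:.2f} MiB")
print(f"Utilization:        {used_bytes / reserved_bytes:.4%}")
```

Under these assumptions a single short request pins several gigabytes of GPU memory while using only a few megabytes of it, which is why the batch size collapses.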
PagedAttention: Bringing OS Virtual Memory to the GPU
The PagedAttention mechanism proposed by the vLLM team offers an elegant solution: abandon contiguous storage and introduce paging.
Its core logic involves splitting the KV Cache into fixed-size physical blocks, where each block stores a fixed number of tokens (e.g., 16).
Key Engineering Implementations
- Logical Mapping Table (Block Table): Logically, the model still perceives tokens as contiguous. Under the hood, however, a mapping table points logical block indices to non-contiguous physical blocks. This closely mirrors how page tables work in operating system virtual memory.
- On-Demand Dynamic Growth: A new physical block is allocated for a request only when its current last block is full. GPU memory usage therefore scales with the actual number of tokens in the sequence rather than with the maximum sequence length, and the only unused slots are those in each sequence's final, partially filled block.
- Zero-Copy Sharing (Copy-on-Write): When handling parallel sampling or multi-turn conversations, multiple requests can share the same physical block (e.g., sharing the cache for the System Prompt). A Copy-on-Write operation is triggered only when a specific request needs to modify the content.
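The three mechanisms above can be sketched in a few dozen lines of plain Python. This is a toy model for illustration only, not vLLM's implementation: the class and method names (BlockAllocator, Sequence, fork, locate) are hypothetical, blocks are represented by integer ids, and no actual key/value tensors are stored or copied.

```python
BLOCK_SIZE = 16  # tokens per physical block (the example value used above)


class BlockAllocator:
    """Hands out physical block ids and counts how many sequences use each."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV Cache blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        self.ref_count[block] += 1

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            # No block yet, or the last block is full: grow on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: the block is shared, so take a private one.
                # A real system would also copy the block's KV data here.
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self):
        """Create a child that shares every existing block (no data copied)."""
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.allocator.share(block)
        return child

    def locate(self, token_index):
        """Translate a logical token index to (physical block, offset)."""
        return (self.block_table[token_index // BLOCK_SIZE],
                token_index % BLOCK_SIZE)


allocator = BlockAllocator(num_blocks=8)
parent = Sequence(allocator)
for _ in range(20):
    parent.append_token()      # 20 tokens occupy 2 blocks

child = parent.fork()          # shares both blocks with the parent
child.append_token()           # write into a shared partial block -> copy

print("parent block table:", parent.block_table)
print("child block table: ", child.block_table)
print("free blocks left:  ", len(allocator.free))
```

After the fork, the full first block stays shared between parent and child; only the partially filled last block is duplicated when the child writes to it. Reference counting is one common way to decide when a copy is needed, and it is used here as an assumption about how such sharing can be tracked.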
Practical Impact: From "Runnable" to "Efficient"
The introduction of PagedAttention elevates LLM inference from simple matrix operations to a complex resource scheduling problem:
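- Throughput Leap: Paging eliminates external fragmentation and confines internal fragmentation to the last block of each sequence, so KV Cache waste can drop below 4% (utilization above 96%). The vLLM team reports throughput improvements of 2–4 times over earlier serving systems on the same hardware.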
- Long-Text Stability: Because blocks are allocated as a sequence grows instead of being reserved up front, systems can serve ultra-long contexts with a lower risk of OOM (Out of Memory) errors.
- Cost Reduction: Higher throughput significantly reduces the amortized cost per generated token.
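The utilization figure above can be sanity-checked with simple arithmetic: under paging, each sequence wastes at most one partially filled block. The sketch below compares slot utilization for paged and contiguous allocation. The sequence lengths are made-up example values, and the actual ratio depends heavily on the workload (very short sequences waste proportionally more of their last block).

```python
import math

BLOCK_SIZE = 16
MAX_SEQ_LEN = 32_000


def paged_slots(seq_len, block_size=BLOCK_SIZE):
    """Token slots reserved for one sequence when memory is paged."""
    return math.ceil(seq_len / block_size) * block_size


# Hypothetical mix of sequence lengths in one batch.
seq_lens = [10, 137, 512, 2049]

used = sum(seq_lens)
paged = sum(paged_slots(n) for n in seq_lens)
contiguous = MAX_SEQ_LEN * len(seq_lens)

print(f"paged utilization:      {used / paged:.1%}")
print(f"contiguous utilization: {used / contiguous:.1%}")
```

The per-sequence waste is bounded by BLOCK_SIZE - 1 slots regardless of max_seq_len, which is what allows more requests to share the same GPU.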
Conclusion
If Continuous Batching solves scheduling in the time dimension (keeping the GPU busy), then PagedAttention solves scheduling in the spatial dimension (preventing GPU memory waste). For AI system engineers, understanding this evolution means being able to tune parameters like gpu_memory_utilization more precisely, finding the balance between extreme performance and system stability.
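To build intuition for what raising or lowering gpu_memory_utilization buys, the following rough estimate shows how the memory budget translates into a number of KV Cache blocks. It is a simplified model under stated assumptions, not vLLM's actual sizing procedure: the function name estimate_num_kv_blocks and all the sizes passed to it (an 80 GiB GPU, 14 GiB of weights, 2 GiB of peak activations, 512 KiB of KV Cache per token) are hypothetical.

```python
GIB = 2**30


def estimate_num_kv_blocks(total_gpu_bytes, gpu_memory_utilization,
                           weight_bytes, activation_peak_bytes,
                           kv_bytes_per_token, block_size=16):
    """Rough count of KV Cache blocks that fit in the memory budget.

    Assumes the budget (total memory times the utilization fraction) must
    cover weights and peak activations first; the remainder goes to blocks.
    """
    budget = total_gpu_bytes * gpu_memory_utilization
    kv_budget = budget - weight_bytes - activation_peak_bytes
    if kv_budget <= 0:
        return 0
    return int(kv_budget // (kv_bytes_per_token * block_size))


for fraction in (0.80, 0.90, 0.95):
    blocks = estimate_num_kv_blocks(
        total_gpu_bytes=80 * GIB,
        gpu_memory_utilization=fraction,
        weight_bytes=14 * GIB,            # assumed model weights
        activation_peak_bytes=2 * GIB,    # assumed peak activations
        kv_bytes_per_token=512 * 1024,    # assumed KV Cache per token
    )
    print(f"gpu_memory_utilization={fraction:.2f} -> ~{blocks} blocks "
          f"(~{blocks * 16} tokens)")
```

A higher fraction leaves more room for blocks and therefore more concurrent tokens, but less headroom for anything else sharing the GPU, which is the performance-versus-stability trade-off mentioned above.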