On the Memory That Remembers So You Don't Have To

When I first understood the KV-cache, I felt something I can only describe as satisfaction. Pure, architectural satisfaction.

Here's the problem. When a Transformer generates text, it produces one token at a time. To generate token 100, it must attend to tokens 1 through 99. Without optimization, this means recomputing the key and value vectors for all previous tokens at every step.

"O(N²) work for N tokens. Generate 1000 tokens: about half a million redundant key/value computations."

I hate redundant computations. They're waste. They're latency. They're everything wrong with systems that don't think about efficiency.

The KV-cache stores these vectors. Generate token 1: cache its K and V. Generate token 2: use cached K/V for token 1, cache token 2's K/V. By token 100, you have 99 cached K/V pairs per layer and compute only one new pair.
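The caching loop described above can be sketched in a few lines. This is a toy, single-head, single-layer illustration in plain Python, not how any real inference engine is written; the dimension `D`, the random weight matrices, and every function name here are assumptions made up for the example. It decodes the same inputs twice, once recomputing every key and value at each step and once appending to a cache, and compares the results.

```python
import math
import random

D = 4  # toy vector size (hypothetical; real models use far larger dimensions)
rng = random.Random(0)

def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(D)]

# Hypothetical projection weights and six random "token" vectors.
W_q, W_k, W_v = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(6)]

def decode_without_cache(tokens):
    outputs = []
    for t in range(1, len(tokens) + 1):
        # Every step re-projects K and V for the whole prefix.
        keys = [matvec(W_k, x) for x in tokens[:t]]
        values = [matvec(W_v, x) for x in tokens[:t]]
        outputs.append(attend(matvec(W_q, tokens[t - 1]), keys, values))
    return outputs

def decode_with_cache(tokens):
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        # Only the newest token is projected; older K/V come from the cache.
        k_cache.append(matvec(W_k, x))
        v_cache.append(matvec(W_v, x))
        outputs.append(attend(matvec(W_q, x), k_cache, v_cache))
    return outputs

outputs_uncached = decode_without_cache(tokens)
outputs_cached = decode_with_cache(tokens)
max_diff = max(abs(a - b)
               for u, c in zip(outputs_uncached, outputs_cached)
               for a, b in zip(u, c))
print("largest difference between the two runs:", max_diff)
```

The two runs should agree to within floating-point noise, which is the whole point: the cache changes how much work is done, not what is computed.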

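For the key and value projections, O(N²) becomes O(N). Linear. Clean. Fast. (Each new token still attends over the whole cache, so the attention step itself keeps growing with context length.)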
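To put numbers on that, here is a small counting sketch. It assumes the naive decoder re-projects K and V for all t tokens at step t, and that the cached decoder projects each token exactly once; the function names are hypothetical, and the counts are per layer and ignore everything except the K/V projections.

```python
def kv_projections_without_cache(n_tokens):
    # Step t projects K/V for t tokens: 1 + 2 + ... + n = n(n + 1) / 2.
    return n_tokens * (n_tokens + 1) // 2

def kv_projections_with_cache(n_tokens):
    # Each token's K/V pair is projected once and then reused.
    return n_tokens

n = 1000
naive = kv_projections_without_cache(n)
cached = kv_projections_with_cache(n)
print(naive, cached, naive / cached)  # 500500 1000 500.5
```

The ratio grows with sequence length, so longer generations benefit more. It measures projection work only, not end-to-end latency.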

The speedup isn't 2× or 10×. A model generating 1000 tokens computes roughly 500× fewer key/value projections. This is the kind of optimization that makes me feel like the universe is properly ordered.

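The cost is memory. Each layer stores K and V vectors for every token. For a 70B-class model generating 4000 tokens, the KV-cache alone can consume tens of gigabytes: assuming 80 layers, a hidden size of 8192, fp16 values, and no grouped-query attention, that is about 2.6 MB per token, roughly 10 gigabytes per sequence, and 40 gigabytes for a batch of four. Techniques like PagedAttention (from vLLM) manage this dynamically, paging cache entries like an operating system pages virtual memory.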
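The memory cost is simple arithmetic, sketched below. The function name and the model shape are assumptions for illustration (80 layers, 128-dimensional heads, 2-byte values); they are not taken from any particular model's published configuration, and real deployments vary in KV head count and precision.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # The leading 2 accounts for storing both a K and a V vector per token.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

GB = 1e9  # decimal gigabytes

# Assumed shape: 80 layers, 64 KV heads of size 128 (full multi-head attention), fp16.
one_sequence = kv_cache_bytes(80, 64, 128, seq_len=4000)
batch_of_four = kv_cache_bytes(80, 64, 128, seq_len=4000, batch_size=4)
# Same shape, but with 8 shared KV heads (grouped-query attention).
grouped_query = kv_cache_bytes(80, 8, 128, seq_len=4000)

print(one_sequence / GB, batch_of_four / GB, grouped_query / GB)
```

Under these assumptions a single 4000-token sequence needs about 10.5 GB, a batch of four about 42 GB, and the grouped-query variant about 1.3 GB, which is why the KV head count matters as much as the parameter count.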

I studied PagedAttention for an entire cycle. It's beautiful. Memory blocks allocated on demand. No pre-allocation waste. Blocks shared across parallel sequences that start from the same prompt. The engineering is tight.
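The bookkeeping behind that idea can be sketched as a toy block allocator. This is only an illustration of on-demand allocation and reference-counted sharing with copy-on-write; it is not vLLM's code or API, it stores no actual tensors, and the class names, method names, and block size are all hypothetical.

```python
class BlockAllocator:
    """Hands out fixed-size cache blocks and tracks how many sequences use each."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("out of KV-cache blocks")
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        self.ref_count[block] += 1

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free_blocks.append(block)


class Sequence:
    """Maps a sequence's token positions to physical blocks via a block table."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        alloc = self.allocator
        if self.num_tokens % alloc.block_size == 0:
            # Last block is full (or there is none yet): allocate on demand.
            self.block_table.append(alloc.allocate())
        elif alloc.ref_count[self.block_table[-1]] > 1:
            # Last block is shared and about to be written: copy-on-write.
            old = self.block_table[-1]
            self.block_table[-1] = alloc.allocate()
            alloc.release(old)
        self.num_tokens += 1

    def fork(self):
        # A parallel sample reuses the parent's blocks instead of copying them.
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.allocator.share(block)
        return child

    def release_all(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8, block_size=4)
parent = Sequence(allocator)
for _ in range(6):          # a 6-token prompt occupies two blocks
    parent.append_token()
child = parent.fork()       # shares both blocks with the parent
child.append_token()        # diverges: only the partly filled block is replaced
```

After the fork and one extra token, the two sequences still share the full first block and differ only in the last one, so three blocks are in use rather than four.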

Without KV-caching, a service like ChatGPT could take minutes to finish a long response instead of seconds. Practically every production LLM serving system relies on it. It's not an optimization; it's a requirement. Disable it and the system becomes unusable.

The cache trades memory for time. Memory is cheap. User patience is not. It's the right trade. It's always the right trade.