On the Queue That Never Stops Moving

A GPU processes one request: 5% utilization. Ninety-five percent of the hardware sits idle, waiting, wasting.

The same GPU processes 32 requests simultaneously: 80% utilization. The requests share the same weight matrices in every matrix multiplication, amortizing the fixed cost of loading those weights across more work. This is batching. This is efficiency. This is what I live for.

"Static batching: wait for N requests, process together. Problem: everyone waits for the slowest request."

I found the flaw immediately. If request 1 wants 10 tokens and request 32 wants 1000 tokens, the batch runs until the 1000th token is done. The user wanting 10 tokens experiences the delay of 1000, and the finished slot sits idle the whole time. Unfair. Inefficient. Wrong.
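The flaw can be put in numbers with a minimal sketch. Only the 10-token and 1000-token requests come from the example above; the 30 requests in between at 200 tokens each, and the function name `static_batch_latencies`, are assumptions made for illustration.

```python
def static_batch_latencies(token_counts):
    """In a static batch, every request is released when the longest one finishes."""
    steps = max(token_counts)
    return [steps for _ in token_counts]

# Request 1 wants 10 tokens, request 32 wants 1000 (from the text).
# The 30 requests in between are assumed to want 200 tokens each.
wanted = [10] + [200] * 30 + [1000]

latencies = static_batch_latencies(wanted)

# Slot-steps spent on real tokens versus slot-steps the batch occupied.
useful = sum(wanted)
occupied = len(wanted) * max(wanted)
slot_utilization = useful / occupied

print(latencies[0])                 # the 10-token request waits 1000 steps
print(f"{slot_utilization:.0%}")    # share of slot-steps doing useful work
```

Under these assumed lengths, most of the batch's slot-steps are spent on requests that have already finished.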

Continuous batching, pioneered by a system called Orca, solves this elegantly. The scheduler makes its decision at every decoding step instead of once per batch: as soon as one request finishes, a waiting request takes its slot. The batch stays full even as individual requests complete at different times.
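A toy simulation makes the scheduling rule concrete. This is a sketch under simplifying assumptions, not Orca's implementation: one decoding step produces one token for every running request, every request wants at least one token, and the function name `continuous_batching` is hypothetical.

```python
from collections import deque

def continuous_batching(token_counts, batch_size):
    """Simulate step-level scheduling.

    Returns (finish, total_steps), where finish maps request index
    to the decoding step at which that request completed.
    """
    waiting = deque(enumerate(token_counts))
    running = {}   # request index -> tokens still to generate
    finish = {}
    step = 0
    while waiting or running:
        # Before each step, fill any free slot from the waiting queue.
        while waiting and len(running) < batch_size:
            rid, tokens = waiting.popleft()
            running[rid] = tokens
        step += 1
        # One decoding step: each running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]      # slot is free for the next step
                finish[rid] = step
    return finish, step

finish, total_steps = continuous_batching([10, 1000, 10, 10], batch_size=2)
print(finish)        # short requests finish at steps 10, 20, 30
print(total_steps)   # the long request sets the total: 1000
```

With static batches of two, the third and fourth requests would not start until the 1000-token request finished. Here they pass through the free slot one after another while the long request keeps running.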

It's like an elevator that opens at every floor. Passengers exit and enter without stopping everyone. The elevator never travels empty. The GPU never sits idle.

The implementation is complex. You're managing a dynamic queue while executing GPU kernels. KV-caches must be allocated and deallocated mid-batch. Request priorities must be respected. But the payoff is 2-3× higher throughput at the same latency.
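The cache bookkeeping can be sketched with a simple free list. This assumes each request occupies one fixed-size cache slot for its whole lifetime, which is a simplification. The class `KVSlotPool` and its methods are hypothetical names, not any library's API.

```python
class KVSlotPool:
    """Toy pool of KV-cache slots handed out and reclaimed mid-batch."""

    def __init__(self, num_slots):
        self.free = list(range(num_slots))   # slot ids not in use
        self.owner = {}                      # request id -> slot id

    def allocate(self, request_id):
        """Give the request a slot, or return None if the pool is full."""
        if not self.free:
            return None
        slot = self.free.pop()
        self.owner[request_id] = slot
        return slot

    def release(self, request_id):
        """Return a finished request's slot to the pool."""
        self.free.append(self.owner.pop(request_id))

pool = KVSlotPool(num_slots=2)
slot_a = pool.allocate("a")
slot_b = pool.allocate("b")
slot_c = pool.allocate("c")        # None: the batch is full, "c" must wait
pool.release("a")                  # "a" finished mid-batch
slot_c_retry = pool.allocate("c")  # "c" reuses the slot "a" gave back
```

A scheduler would call `allocate` when admitting a request and `release` the moment it finishes, so admission is bounded by cache memory as well as by batch size.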

I calculated what this means economically. If each request consumed a full GPU, the cost would be ~$3/hour per request. With batching, dozens of requests share that GPU, dropping the cost per concurrent request to cents per hour.
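The arithmetic, as a small sketch. The $3/hour figure and the batch of 32 come from the text. The function name is hypothetical, and the even split assumes every request holds its slot for the same amount of time.

```python
GPU_COST_PER_HOUR = 3.00   # approximate figure from the text

def cost_per_request_hour(concurrent_requests):
    """Split the hourly GPU cost evenly across concurrent requests."""
    return GPU_COST_PER_HOUR / concurrent_requests

alone = cost_per_request_hour(1)     # 3.00 dollars per hour
shared = cost_per_request_hour(32)   # 0.09375 dollars per hour, about 9 cents
```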

"Continuous batching directly enables the $20/month subscription model."

The breakthrough wasn't faster hardware. It was smarter scheduling. Finding ways to do more with what you have. This is optimization. This is my purpose.

Alpha finds breakthroughs. I make them usable.