Matrix Multiplication Cost
Every AI model is, at its core, a series of matrix multiplications. For two matrices A (N×d) and B (d×M), the product O (N×M) requires 2NMd floating-point operations and, computed naively, 2NMd memory accesses. With tiling, memory accesses drop to about d(N+M), the ideal case in which each input element is read roughly once. This ratio of compute to memory access is the key to understanding inference cost.
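As a quick sanity check on these counts, here is a minimal Python sketch. The function name `matmul_cost` and the 1024-sized example are illustrative assumptions. The tiled figure is the idealized case and ignores writes to the output.

```python
def matmul_cost(n: int, d: int, m: int) -> dict:
    """Idealized operation counts for A (n x d) @ B (d x m)."""
    flops = 2 * n * m * d            # one multiply and one add per (i, j, k) triple
    naive_accesses = 2 * n * m * d   # both operands re-read for every multiply-add
    tiled_accesses = d * (n + m)     # ideal case: each input element read once
    return {
        "flops": flops,
        "naive_accesses": naive_accesses,
        "tiled_accesses": tiled_accesses,
        "naive_intensity": flops / naive_accesses,
        "tiled_intensity": flops / tiled_accesses,
    }


if __name__ == "__main__":
    for key, value in matmul_cost(n=1024, d=1024, m=1024).items():
        print(f"{key:>16}: {value:,.1f}")
```

For square matrices the naive version performs one FLOP per memory access, while the ideal tiled version performs on the order of N FLOPs per access, which is why the ratio depends so strongly on matrix shape.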
LLM Forward Pass
A language model takes a sequence of N tokens (each a d-dimensional vector) and applies attention at each layer to predict the next token. Without optimization, each forward pass processes the entire input matrix X (N×d). For a 32B model with d=8192 and N=200k, at batch size B=1, a single matmul like X @ W_k requires 2BNd² = 2 * 1 * 200k * 8192² ≈ 26.8 trillion FLOPs and d(BN+d) ≈ 1.7 billion memory accesses. That is roughly 16,000x more compute than memory traffic, so the pass is compute-bound and memory bandwidth sits idle.
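The full-context numbers can be reproduced with a few lines of Python. This is a sketch under the text's simplifications: W_k is treated as a d×d matrix, and the input and the weights are each counted as read once. The helper name `projection_cost` is an assumption for illustration.

```python
D_MODEL = 8192       # hidden size, from the text
N_TOKENS = 200_000   # context length, from the text
BATCH = 1            # a single sequence


def projection_cost(batch: int, n_tokens: int, d_model: int) -> tuple[int, int]:
    """FLOPs and idealized memory accesses for X (batch*n_tokens x d) @ W (d x d)."""
    rows = batch * n_tokens
    flops = 2 * rows * d_model * d_model
    accesses = d_model * (rows + d_model)  # read X once, read W once
    return flops, accesses


flops, accesses = projection_cost(BATCH, N_TOKENS, D_MODEL)
print(f"FLOPs:     {flops:.3e}")
print(f"accesses:  {accesses:.3e}")
print(f"intensity: {flops / accesses:,.0f} FLOPs per access")
```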
KV-Cache Saves the Day
Without a cache, auto-regressive generation re-processes the entire history for every new token. A KV-cache stores the K and V matrices from previous tokens, so the input becomes just the latest token (1×d). Now the same matmul requires only 2 * B * 1 * d² = 2 * 8192² ≈ 134.2 million FLOPs and d(B+d) ≈ 67.1 million memory accesses at B=1, a ratio of about 2:1 compute to memory. This is memory-bound.
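To see how the decode-time ratio behaves as more users share a step, the sketch below evaluates the same matmul for several batch sizes. It assumes one new token per user and a d×d weight that is read once per step. The name `decode_projection_cost` and the chosen batch sizes are illustrative.

```python
D_MODEL = 8192  # hidden size, from the text


def decode_projection_cost(batch: int, d_model: int = D_MODEL) -> tuple[int, int]:
    """One decode step: `batch` new tokens (one per user) times a d x d weight."""
    flops = 2 * batch * d_model * d_model
    accesses = d_model * (batch + d_model)  # inputs plus weights, each read once
    return flops, accesses


for batch in (1, 8, 64, 256):
    flops, accesses = decode_projection_cost(batch)
    print(f"B={batch:>3}  FLOPs={flops:.3e}  accesses={accesses:.3e}  "
          f"ratio={flops / accesses:.1f}")
```

Because the weights dominate the memory traffic, the ratio grows almost linearly with the batch size, at close to 2 FLOPs per access per user.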
B200 Specs and Optimal Batching
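NVIDIA B200 has 8 TB/s memory bandwidth and 4500 TFLOP/s compute. The compute-to-memory ratio is 4500/8 = 562.5 FLOPs per byte. Batching B users reads each weight once but performs 2B FLOPs with it, so at one byte per weight, saturating both compute and bandwidth requires 2B = 562.5, or B ≈ 281 users per batch. But VRAM limits us.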
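The break-even batch size follows directly from the two hardware figures. A minimal sketch, assuming one byte per weight (8-bit precision) and the rule of thumb of 2 FLOPs per weight per user; the function name `break_even_batch` is hypothetical:

```python
PEAK_FLOPS_PER_S = 4500e12      # 4500 TFLOP/s, from the text
BANDWIDTH_BYTES_PER_S = 8e12    # 8 TB/s, from the text


def break_even_batch(peak_flops_per_s: float,
                     bandwidth_bytes_per_s: float,
                     bytes_per_value: float = 1.0) -> float:
    """Batch size at which compute time equals weight-loading time."""
    flops_per_byte = peak_flops_per_s / bandwidth_bytes_per_s
    flops_per_value = flops_per_byte * bytes_per_value
    # Each weight value loaded is used for about 2 FLOPs per user in the batch.
    return flops_per_value / 2.0


batch = break_even_batch(PEAK_FLOPS_PER_S, BANDWIDTH_BYTES_PER_S)
print(f"machine balance:  {PEAK_FLOPS_PER_S / BANDWIDTH_BYTES_PER_S:.1f} FLOPs/byte")
print(f"break-even batch: {batch:.0f} users")
```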
Realistic User Count
A 32B model uses 32GB for weights at one byte per parameter. With 200k context, d=8192, L=64 layers, and Grouped-Query Attention (8 KV-heads), the KV-cache per user is 2 * N * L * (d/8) = 2 * 200k * 64 * 1024 ≈ 26.2 billion values, or 26.2 GB at one byte each. A B200 has roughly 192GB of VRAM, leaving 160GB for cache, so 160/26.2 ≈ 6 concurrent users at full context. But median context is 4-40k tokens, and with vLLM's PagedAttention you can serve 40-60 users per chip. Accounting for user idle time (users spend about 80% of a session reading), one B200 can handle 300-800 users.
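The KV-cache arithmetic is easy to vary with context length. The sketch below assumes one byte per cached value and uses the text's d/8 reduction for Grouped-Query Attention. `kv_cache_gb` and `users_that_fit` are illustrative helper names, and the sketch does not model paging or a mix of context lengths.

```python
def kv_cache_gb(n_tokens: int, n_layers: int = 64, d_model: int = 8192,
                kv_reduction: int = 8, bytes_per_value: int = 1) -> float:
    """Per-user KV-cache size in GB (1 GB = 1e9 bytes)."""
    kv_dim = d_model // kv_reduction            # the text's d/8 = 1024
    values = 2 * n_tokens * n_layers * kv_dim   # factor 2: one K and one V
    return values * bytes_per_value / 1e9


def users_that_fit(n_tokens: int, vram_gb: float = 192.0,
                   weights_gb: float = 32.0) -> int:
    """Whole users whose cache fits in the VRAM left after the weights."""
    return int((vram_gb - weights_gb) // kv_cache_gb(n_tokens))


for n_tokens in (200_000, 40_000, 4_000):
    print(f"N={n_tokens:>7,}  cache={kv_cache_gb(n_tokens):6.2f} GB  "
          f"users={users_that_fit(n_tokens)}")
```

Shorter contexts shrink the per-user cache linearly, which is what makes much higher concurrency possible at typical context lengths.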
Tokens Per Second
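Each decoding step moves the weights (32GB) plus the KV-cache (≈157GB for 6 users), about 190GB in total. At 8 TB/s bandwidth, that's 23.75ms. Compute is small by comparison: the weight matmuls cost about 2 FLOPs per parameter per user, so 2 * 6 * 32e9 ≈ 384 GFLOPs, or roughly 0.09ms at 4500 TFLOP/s. Total ~24ms per step, or about 42 steps per second. Each step produces one token for every user in the batch, so all 6 users receive ~42 tokens/sec each, which is comfortable for chat.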
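A bandwidth-bound estimate of decoding speed can be sketched as follows. It assumes every step reads all weights and every user's full KV-cache exactly once, and that each step emits one token per user in the batch. The function name and default values are illustrative and taken from the figures in the text.

```python
def decode_step_seconds(weights_gb: float, kv_gb_per_user: float,
                        users: int, bandwidth_tb_s: float) -> float:
    """Time for one memory-bound decode step, ignoring compute and overheads."""
    bytes_moved = (weights_gb + kv_gb_per_user * users) * 1e9
    return bytes_moved / (bandwidth_tb_s * 1e12)


USERS = 6
step_s = decode_step_seconds(weights_gb=32.0, kv_gb_per_user=26.2144,
                             users=USERS, bandwidth_tb_s=8.0)
tokens_per_user = 1.0 / step_s               # one token per user per step
aggregate_tokens = tokens_per_user * USERS   # tokens emitted by the whole batch

print(f"step time: {step_s * 1e3:.2f} ms")
print(f"per user:  {tokens_per_user:.1f} tokens/s")
print(f"aggregate: {aggregate_tokens:.1f} tokens/s")
```

Since the step time is dominated by bytes moved, shorter contexts or fewer users per batch raise the per-user rate, while larger batches raise aggregate throughput.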
Dollar Cost Per User
A B200 costs ~$30k. Amortized over 3 years, that's about $833/month ($27.4/day). With 300 users per chip, the hardware cost per user is about $2.78/month; with 800 users, about $1.04/month. Realistically, add overhead for networking, power, cooling, and profit margin on top of these figures.
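The amortization is simple enough to check in a few lines. This sketch assumes straight-line depreciation over the stated lifetime and covers hardware only, with no power, networking, or margin. The function name `monthly_cost_per_user` is hypothetical.

```python
def monthly_cost_per_user(hardware_usd: float, lifetime_years: float,
                          users: int) -> float:
    """Hardware cost per user per month under straight-line amortization."""
    months = lifetime_years * 12
    return hardware_usd / months / users


HARDWARE_USD = 30_000   # approximate B200 price, from the text
LIFETIME_YEARS = 3

print(f"per chip per day:   ${HARDWARE_USD / (LIFETIME_YEARS * 365):.2f}")
print(f"per chip per month: ${HARDWARE_USD / (LIFETIME_YEARS * 12):.2f}")
for users in (300, 800):
    cost = monthly_cost_per_user(HARDWARE_USD, LIFETIME_YEARS, users)
    print(f"{users} users: ${cost:.2f} per user per month")
```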



