Lossless KV Cache Compression Hits ~4× on Top of FP8

LLM context windows keep growing. KV caching makes long contexts affordable by trading compute for memory, but as agentic workflows push contexts longer, storing and moving the cache dominates. Lossy compression like TurboQuant drops bit-width but risks quality loss you can't control upfront. Lossless compression sidesteps that entirely.

Speculative KV coding, introduced in a recent blog post, losslessly compresses the KV cache of a large target model using a cheaper predictor model, reaching up to ~4× relative to a raw FP8 cache. Stacked on lossy bf16 → FP8 quantization, the combined saving over bf16 reaches ~8×.

How It Works

By analogy with speculative decoding (Leviathan et al., 2022), a faster predictor model runs in parallel on both encode and decode sides. An arithmetic coder then encodes the true cache at a bitrate set by how well the predictor fits the target.

The KV cache is deterministic given the prompt and weights, so the "true" distribution is a delta with zero entropy. Everything the coder spends is therefore pure KL divergence: -ln q(KV_true) nats, or -log2 q(KV_true) bits. The bitrate directly measures how much probability the model q assigns to the correct KV cache.

The method models each scalar as a Gaussian centered on the predictor's output μ with variance σ². The cost per scalar splits into two terms:

  • Spread cost: ½ ln(2πσ²)
  • Miss cost: (KV_full - μ)² / (2σ²)

The optimal σ² is the expected squared error, yielding bitrate ½ ln(2πe σ²). Better μ buys bits directly; miscalibrated σ² wastes them.
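The two-term split can be checked numerically. The sketch below uses synthetic data (the residual scale of 0.1 and the helper names are illustrative assumptions, not values from the post) and confirms that setting σ² to the mean squared error reproduces the closed form, while a miscalibrated σ² costs more. Costs are in nats and are relative to a density, so they are only comparable with each other, not absolute bit counts.

```python
import numpy as np

def gaussian_cost_nats(kv_full, mu, sigma2):
    """Split -ln N(kv_full; mu, sigma2) into its spread and miss terms."""
    spread = 0.5 * np.log(2 * np.pi * sigma2)
    miss = (kv_full - mu) ** 2 / (2 * sigma2)
    return spread, miss

# Synthetic stand-ins: mu plays the predictor output, kv_full the target cache.
rng = np.random.default_rng(0)
mu = rng.normal(size=100_000)
kv_full = mu + 0.1 * rng.normal(size=mu.shape)  # assumed residual std of 0.1

mse = np.mean((kv_full - mu) ** 2)

def mean_cost(sigma2):
    spread, miss = gaussian_cost_nats(kv_full, mu, sigma2)
    return float(np.mean(spread + miss))

best = mean_cost(mse)                                  # sigma^2 set to the MSE
closed_form = 0.5 * np.log(2 * np.pi * np.e * mse)     # 1/2 ln(2*pi*e*sigma^2)
too_wide = mean_cost(4 * mse)                          # overestimated variance
too_narrow = mean_cost(mse / 4)                        # underestimated variance
```

Here `best` matches `closed_form` up to floating-point error, and both `too_wide` and `too_narrow` come out larger, which is the "miscalibrated σ² wastes bits" effect.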

Predictor Choice

The natural predictor is an optimized version of the same model: same architecture, same prompt, with a structure-preserving optimization (e.g., quantization) applied. The residual KV_full - KV_opt is then small and structured. For instance, with the FP8 version of the target as predictor, μ = KV_quant, and σ² is fitted once on training data as the per-(kv, head, channel) empirical residual variance, pooled across positions.
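A minimal sketch of that calibration step follows. The array layout (examples, kv, heads, positions, channels), the function name, and the small variance floor are assumptions for illustration; the post does not specify them. Because the Gaussian is centered on μ, the sketch uses the mean squared residual rather than the variance about the residual's own mean.

```python
import numpy as np

def fit_residual_variance(kv_full, kv_quant, eps=1e-12):
    """Per-(kv, head, channel) mean squared residual, pooled over examples and positions.

    Both inputs are assumed to be shaped (examples, kv, heads, positions, channels).
    Returns an array shaped (kv, heads, channels).
    """
    resid = kv_full.astype(np.float64) - kv_quant.astype(np.float64)
    sigma2 = np.mean(resid ** 2, axis=(0, 3))  # pool examples and positions
    return np.maximum(sigma2, eps)             # floor avoids division by zero later

# Synthetic check: plant a known per-channel noise level and recover it.
rng = np.random.default_rng(0)
n, kv, heads, positions, channels = 4, 2, 3, 64, 8
true_std = rng.uniform(0.01, 0.1, size=(kv, heads, channels))
kv_quant = rng.normal(size=(n, kv, heads, positions, channels))
noise = rng.normal(size=kv_quant.shape)
kv_full = kv_quant + true_std[None, :, :, None, :] * noise

sigma2 = fit_residual_variance(kv_full, kv_quant)
```

The fitted `sigma2` is then frozen and shared by encoder and decoder, so it never has to be transmitted with the cache.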

A three-component mixture further boosts compression:

q(x) = 0.95 * N(x; μ, σ²) + 0.03 * N(x; μ, (3σ)²) + 0.02 * p_bf16(x)

where p_bf16 is the empirical bf16-symbol distribution. The wide Gaussian covers moderate mispredictions; the empirical marginal absorbs outliers.
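To see why the third component matters, the sketch below evaluates the mixture as a probability mass over a discrete symbol grid. Several things here are assumptions for illustration: a toy 65-point uniform grid stands in for the bf16 symbol set, a uniform table stands in for the empirical p_bf16, and the Gaussians are discretized by integrating over nearest-neighbour bins so they can be mixed with a mass function.

```python
import math
import numpy as np

def gaussian_bin_mass(edges, mu, sigma):
    """Mass of N(mu, sigma^2) in each interval [edges[i], edges[i+1])."""
    z = (np.asarray(edges) - mu) / (sigma * math.sqrt(2))
    cdf = 0.5 * (1 + np.vectorize(math.erf)(z))
    return np.diff(cdf)

def mixture_mass(edges, mu, sigma, p_empirical, weights=(0.95, 0.03, 0.02)):
    """Narrow Gaussian + 3x-wide Gaussian + empirical marginal, per symbol."""
    narrow = gaussian_bin_mass(edges, mu, sigma)
    wide = gaussian_bin_mass(edges, mu, 3 * sigma)
    w_narrow, w_wide, w_emp = weights
    return w_narrow * narrow + w_wide * wide + w_emp * np.asarray(p_empirical)

# Toy grid (assumption): 65 uniformly spaced symbols instead of real bf16 values.
symbols = np.linspace(-8.0, 8.0, 65)
midpoints = (symbols[:-1] + symbols[1:]) / 2
edges = np.concatenate([[-np.inf], midpoints, [np.inf]])
p_empirical = np.full(symbols.size, 1 / symbols.size)  # placeholder marginal

mu, sigma = 0.1, 0.05
narrow_only = gaussian_bin_mass(edges, mu, sigma)
q = mixture_mass(edges, mu, sigma, p_empirical)

# Cost of an extreme outlier (the largest symbol) under the mixture.
outlier_bits = -math.log2(q[-1])
```

Under the single Gaussian the outlier's mass underflows to zero, so it could not be coded at all; under the mixture its cost is capped near -log2(0.02 × p_empirical), here under 12 bits.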

Results on Qwen3

The experiments use the Qwen3 model family (0.6B, 1.7B, 4B, 8B, 14B, 32B) with off-the-shelf FP8 block-quants, calibrated on 128 training examples from C4 at 1024 tokens each. Results on held-out C4 validation data:

Target   N(μ,σ) bitrate   Mixture bitrate   Ratio vs bf16
0.6B     6.86             6.74              2.37×
1.7B     6.64             6.53              2.45×
4B       6.42             6.33              2.53×
8B       6.26             6.18              2.59×
14B      6.08             6.01              2.66×
32B      5.98             5.92              2.70×

Bitrate falls monotonically with target size: bigger models compress better.
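The reported ratios are consistent with dividing the 16 bits of a bf16 element by the mixture bitrate. That reading is an inference from the numbers rather than something the post states, and it can be checked directly:

```python
# Mixture bitrates copied from the results above (bits per element).
mixture_bits = {"0.6B": 6.74, "1.7B": 6.53, "4B": 6.33,
                "8B": 6.18, "14B": 6.01, "32B": 5.92}

BF16_BITS = 16
ratios = {size: round(BF16_BITS / bits, 2) for size, bits in mixture_bits.items()}

# Monotonicity check: bitrate should fall as the target grows.
bitrates = list(mixture_bits.values())
is_monotone = all(a > b for a, b in zip(bitrates, bitrates[1:]))
```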

Stacking with FP8 Caches

FP8 KV caches are increasingly the default (vLLM, SGLang, TRT-LLM, DeepSeek V4). The method is actually more effective on pre-quantized caches. Encoding FP8 symbols under the bin-integrated N(μ,σ²) predictor gives:

Target   b/FP8 element   vs raw FP8 (8 b)
0.6B     2.59            3.08×
1.7B     2.45            3.26×
4B       2.32            3.44×
8B       2.22            3.60×
14B      2.11            3.79×
32B      2.05            3.90×
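"Bin-integrated" can be read as follows: each FP8 symbol owns the interval of real values that round to it, and its probability is the Gaussian mass over that interval. The sketch below assumes the E4M3 variant of FP8 with round-to-nearest bins and ignores block scales and saturation; the post does not say which FP8 format or binning was used, so treat these as illustrative choices.

```python
import math
import numpy as np

def e4m3_grid():
    """Sorted distinct finite values of FP8 E4M3 (assumed format), +0 and -0 merged."""
    positive = []
    for exponent in range(16):
        for mantissa in range(8):
            if exponent == 15 and mantissa == 7:
                continue  # this bit pattern encodes NaN in E4M3
            if exponent == 0:
                positive.append(mantissa / 8 * 2.0 ** -6)      # subnormals
            else:
                positive.append((1 + mantissa / 8) * 2.0 ** (exponent - 7))
    positive = np.array(positive)
    return np.unique(np.concatenate([-positive, positive]))

def bin_masses(grid, mu, sigma):
    """Gaussian mass in each symbol's rounding interval (midpoint edges)."""
    midpoints = (grid[:-1] + grid[1:]) / 2
    edges = np.concatenate([[-np.inf], midpoints, [np.inf]])
    z = (edges - mu) / (sigma * math.sqrt(2))
    cdf = 0.5 * (1 + np.vectorize(math.erf)(z))
    return np.diff(cdf)

def ideal_bits(grid, mu, sigma):
    """Entropy of the discretized Gaussian: the rate if the model were exactly right."""
    p = bin_masses(grid, mu, sigma)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

grid = e4m3_grid()
tight = ideal_bits(grid, mu=1.0, sigma=0.05)  # confident predictor
loose = ideal_bits(grid, mu=1.0, sigma=0.5)   # uncertain predictor
```

A tighter σ concentrates the mass on one or two FP8 symbols, which is how the per-element cost can drop well below the raw 8 bits.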

Composed with the 2× from bf16 → FP8 quantization, that's roughly 6× to 8× total compression relative to bf16.
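The composition is a product of two ratios: the lossy 16 → 8 bit step and the lossless step on the FP8 symbols. A short check using the per-element bitrates reported above:

```python
# Bits per FP8 element after lossless coding, copied from the results above.
fp8_bits = {"0.6B": 2.59, "1.7B": 2.45, "4B": 2.32,
            "8B": 2.22, "14B": 2.11, "32B": 2.05}

lossy_ratio = 16 / 8  # bf16 -> FP8 quantization
total = {size: lossy_ratio * (8 / bits) for size, bits in fp8_bits.items()}

smallest, largest = min(total.values()), max(total.values())
```

The totals span roughly 6.2× for the 0.6B target to 7.8× for the 32B target.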

Implementation Details

The pipeline: both sides re-run the predictor on the prompt to reconstruct (μ, σ). The encoder also runs the target model, feeds (KV_full, μ, σ) into the arithmetic coder, and emits bits. The decoder combines the bitstream with its locally reconstructed (μ, σ) to recover KV_full exactly. The scheme is lossless because arithmetic coding is lossless and both sides reconstruct (μ, σ) deterministically.
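The round trip can be illustrated with a toy coder. This is a sketch under heavy simplifications, not the post's implementation: it uses exact rational arithmetic instead of a practical fixed-precision arithmetic coder, a small integer alphabet instead of bf16 or FP8 symbols, and hypothetical names throughout. It does show the two properties the scheme relies on: the decoder recovers the values exactly from the code plus its own copy of (μ, σ), and a better predictor yields a shorter code.

```python
import math
from fractions import Fraction

ALPHABET = list(range(-8, 9))  # toy integer symbols standing in for cache values

def model_freqs(mu, sigma, total=1 << 12):
    """Deterministic integer frequency table from a Gaussian; every symbol gets >= 1."""
    weights = [math.exp(-0.5 * ((s - mu) / sigma) ** 2) for s in ALPHABET]
    norm = sum(weights)
    return [1 + int(total * w / norm) for w in weights]

def encode(symbols, mus, sigma):
    """Return (k, n): the code is the n-bit dyadic fraction k / 2^n."""
    low, width = Fraction(0), Fraction(1)
    for s, mu in zip(symbols, mus):
        freqs = model_freqs(mu, sigma)
        tot = sum(freqs)
        i = ALPHABET.index(s)
        low += width * Fraction(sum(freqs[:i]), tot)
        width *= Fraction(freqs[i], tot)
    n = 0
    while True:  # shortest dyadic fraction inside [low, low + width)
        k = math.ceil(low * (1 << n))
        if Fraction(k, 1 << n) < low + width:
            return k, n
        n += 1

def decode(code, mus, sigma):
    """Invert encode() using only the code and the locally rebuilt (mu, sigma)."""
    k, n = code
    x = Fraction(k, 1 << n)
    low, width = Fraction(0), Fraction(1)
    out = []
    for mu in mus:
        freqs = model_freqs(mu, sigma)
        tot = sum(freqs)
        target = (x - low) / width * tot  # position inside the current interval
        cum = 0
        for i, f in enumerate(freqs):
            if cum + f > target:
                break
            cum += f
        out.append(ALPHABET[i])
        low += width * Fraction(cum, tot)
        width *= Fraction(f, tot)
    return out

mus = [0.2, -1.1, 3.0, 2.7, -4.9, 0.0]  # predictor outputs, rebuilt on both sides
kv_true = [0, -1, 3, 3, -5, 1]          # target values, seen only by the encoder
code = encode(kv_true, mus, sigma=0.7)
recovered = decode(code, mus, sigma=0.7)
bad_code = encode(kv_true, [0.0] * len(kv_true), sigma=0.7)  # uninformative predictor
```

`recovered` equals `kv_true`, and `bad_code` needs several times more bits than `code`. The decoder must know how many symbols to expect; here that is implied by the length of `mus`.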

For the predictor, quantized variants of major open-weights models ship alongside their full-precision counterparts, so both sides can pull the same artifact off the shelf. No training step is needed: the per-channel statistics of the residual are measured once on a small calibration set and then frozen.

What This Means

This approach directly addresses the growing memory bottleneck in long-context LLM serving. By combining a cheap predictor with arithmetic coding, it achieves substantial compression without quality loss. The method is complementary to existing quantization techniques and works best when stacked with them.