Speculative KV Coding Compresses LLM Cache ~4× Losslessly

Speculative KV coding uses a smaller predictor model to losslessly compress the KV cache of a large language model by up to 4× (on top of fp8 quantization). The method leverages an arithmetic coder and a Gaussian mixture model to encode residuals between the target and predictor caches, achieving bitrates as low as 2.05 bits per element for a 32B model.

Jun 7, 2026

Lossless KV Cache Compression Hits ~4× on Top of FP8

LLM context windows keep growing. KV caching makes long contexts affordable by trading compute for memory, but as agentic workflows push contexts longer, storing and moving the cache becomes the dominant cost. Lossy compression like TurboQuant drops bit-width but risks quality loss you can't control upfront. Lossless compression sidesteps that entirely.

Speculative KV coding, introduced in a recent blog post, losslessly compresses the KV cache of a large target model by up to ~4× using a cheaper predictor model. Stacked on lossy bf16 → fp8 quantization, the total reduction relative to bf16 reaches ~8×.

How It Works

By analogy with speculative decoding (Leviathan et al., 2022), a faster predictor model runs in parallel on both encode and decode sides. An arithmetic coder then encodes the true cache at a bitrate set by how well the predictor fits the target.

The KV cache is deterministic given the prompt and weights. The "true" distribution is a delta, which has zero entropy. Every bit the coder spends is pure KL divergence: -ln q(KV_true), measured in nats (divide by ln 2 for bits). The bitrate directly measures how much weight the model q gives the correct KV cache.

The method models each scalar as a Gaussian centered on the predictor's output μ with variance σ². The cost per scalar splits into two terms:

Spread cost: ½ ln(2πσ²)
Miss cost: (KV_full - μ)² / (2σ²)

The optimal σ² is the expected squared error, yielding bitrate ½ ln(2πe σ²). Better μ buys bits directly; miscalibrated σ² wastes them.
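The split above can be checked numerically. The following is a minimal sketch, not the post's code: the function names and the toy values are made up for illustration, and costs are differential (density) costs in nats, so they can be negative when σ² is small. Actual bit counts for discrete symbols come from integrating the density over each quantization bin.

```python
import math

def gaussian_cost_nats(x, mu, sigma2):
    """-ln N(x; mu, sigma2), written as spread cost + miss cost."""
    spread = 0.5 * math.log(2.0 * math.pi * sigma2)
    miss = (x - mu) ** 2 / (2.0 * sigma2)
    return spread + miss

def mean_cost_nats(xs, mus, sigma2):
    """Average per-scalar cost under a shared variance."""
    return sum(gaussian_cost_nats(x, m, sigma2) for x, m in zip(xs, mus)) / len(xs)

# Toy data (made up): true cache values and predictor outputs.
kv_full = [0.52, -1.10, 0.03, 2.41, -0.77]
mu = [0.50, -1.00, 0.00, 2.50, -0.80]

# The fitted variance is the mean squared prediction error.
mse = sum((x - m) ** 2 for x, m in zip(kv_full, mu)) / len(kv_full)
best = mean_cost_nats(kv_full, mu, mse)
closed_form = 0.5 * math.log(2.0 * math.pi * math.e * mse)

# Miscalibrated variances cost more than the fitted one, in either direction.
too_narrow = mean_cost_nats(kv_full, mu, mse / 4.0)
too_wide = mean_cost_nats(kv_full, mu, mse * 4.0)

print(best, closed_form, too_narrow, too_wide)
```

With σ² set to the mean squared error, the averaged miss term is exactly ½, which is where the extra factor of e in ½ ln(2πe σ²) comes from.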

Predictor Choice

The natural predictor is an optimized version of the same model: same architecture, same prompt, with structure-preserving optimization (e.g., quantization). The residual KV_full - KV_opt is small and structured. For instance, when the FP8 version of the target serves as the predictor, μ = KV_quant, and σ² is fitted once on training data as the per-(kv, head, channel) empirical residual variance, pooled across positions.
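The variance fit might look like the sketch below. It assumes residuals are laid out as nested lists indexed [kv][head][position][channel], takes "empirical residual variance" to mean the mean squared residual (consistent with the optimal σ² being the expected squared error), and adds a small floor so a constant channel never produces σ² = 0. The layout, the function name `fit_sigma2`, and the floor are all assumptions for illustration.

```python
def fit_sigma2(residuals, floor=1e-12):
    """residuals[kv][head][pos][channel] -> sigma2[kv][head][channel].

    Mean squared residual per (kv, head, channel), pooled over positions.
    `floor` is a hypothetical safeguard against zero variance.
    """
    sigma2 = []
    for per_kv in residuals:              # index 0 = keys, 1 = values
        kv_out = []
        for per_head in per_kv:           # per_head is a list of positions
            n_pos = len(per_head)
            n_ch = len(per_head[0])
            kv_out.append([
                max(floor, sum(per_head[p][c] ** 2 for p in range(n_pos)) / n_pos)
                for c in range(n_ch)
            ])
        sigma2.append(kv_out)
    return sigma2

# Toy residuals KV_full - KV_quant: 2 (k/v) x 1 head x 2 positions x 2 channels.
residuals = [
    [[[0.1, -0.2], [-0.1, 0.2]]],   # keys
    [[[0.3, 0.0], [-0.3, 0.0]]],    # values
]
sigma2 = fit_sigma2(residuals)
print(sigma2)
```

Because the statistics are frozen after calibration, encoder and decoder can both load the same table and never need to exchange it per request.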

A three-component mixture further boosts compression:

q(x) = 0.95 * N(x; μ, σ²) + 0.03 * N(x; μ, (3σ)²) + 0.02 * p_bf16(x)

where p_bf16 is the empirical bf16-symbol distribution. The wide Gaussian covers moderate mispredictions; the empirical marginal absorbs outliers.
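To see why the extra components help, here is a rough sketch of the mixture's code length for one symbol. It assumes each symbol corresponds to a value bin [lo, hi), that the two Gaussians are integrated over that bin, and that `p_emp` is the symbol's probability under the empirical marginal. The helper names and the numbers are illustrative, not taken from the post.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_prob(lo, hi, mu, sigma):
    """Probability mass a Gaussian assigns to the bin [lo, hi)."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

def mixture_bits(lo, hi, mu, sigma, p_emp):
    """Code length in bits under the 0.95 / 0.03 / 0.02 mixture."""
    narrow = bin_prob(lo, hi, mu, sigma)
    wide = bin_prob(lo, hi, mu, 3.0 * sigma)
    q = 0.95 * narrow + 0.03 * wide + 0.02 * p_emp
    return -math.log2(q)

mu, sigma, p_emp = 1.0, 0.1, 0.01     # made-up values

# A well-predicted symbol: its bin straddles the predictor output.
typical_bits = mixture_bits(0.9, 1.1, mu, sigma, p_emp)

# An outlier far from mu: the narrow Gaussian gives it essentially no mass,
# so only the empirical marginal keeps the cost bounded.
outlier_narrow_mass = bin_prob(4.9, 5.1, mu, sigma)
outlier_bits = mixture_bits(4.9, 5.1, mu, sigma, p_emp)

print(typical_bits, outlier_narrow_mass, outlier_bits)
```

The trade is small and one-sided: giving up 5% of the mass costs well-predicted symbols a fraction of a bit, while an outlier costs a bounded number of bits instead of an effectively unbounded one.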

Results on Qwen3

The experiments use the Qwen3 model family (0.6B, 1.7B, 4B, 8B, 14B, 32B) with off-the-shelf fp8 block-quants as predictors, calibrated on 128 train examples from C4 at 1024 tokens each. Results on held-out C4 validation data:

Target	N(μ,σ) bitrate (bits/element)	Mixture bitrate (bits/element)	Ratio vs bf16 (16 bits)
0.6B	6.86	6.74	2.37×
1.7B	6.64	6.53	2.45×
4B	6.42	6.33	2.53×
8B	6.26	6.18	2.59×
14B	6.08	6.01	2.66×
32B	5.98	5.92	2.70×

Bitrate falls monotonically with target size — bigger models compress better.

Stacking with FP8 Caches

FP8 KV caches are increasingly default (vLLM, SGLang, TRT-LLM, DeepSeek V4). The method is actually more effective on pre-quantized caches. Encoding FP8 symbols under the bin-integrated N(μ,σ²) predictor:

Target	Bits per FP8 element	Ratio vs raw FP8 (8 bits)
0.6B	2.59	3.08×
1.7B	2.45	3.26×
4B	2.32	3.44×
8B	2.22	3.60×
14B	2.11	3.79×
32B	2.05	3.90×

Composed with bf16 → FP8 quantization, that's 6× to 8× total compression.
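The composed figure is plain arithmetic: a bf16 element is 16 bits, so the end-to-end ratio is 16 divided by the coded bits per FP8 element. A quick check against the table values above (the dictionary and function names are just for illustration):

```python
# Bits per FP8 element, copied from the table above.
fp8_bitrates = {
    "0.6B": 2.59, "1.7B": 2.45, "4B": 2.32,
    "8B": 2.22, "14B": 2.11, "32B": 2.05,
}

def total_ratio_vs_bf16(bits_per_element):
    """End-to-end ratio: 16-bit bf16 element down to the coded bitrate."""
    return 16.0 / bits_per_element

totals = {name: round(total_ratio_vs_bf16(b), 2) for name, b in fp8_bitrates.items()}
for name, ratio in totals.items():
    print(f"{name}: {ratio}x vs bf16")
```

Only the 2× from bf16 → FP8 is lossy; the remaining factor comes from the lossless coder.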

Implementation Details

The pipeline: both sides re-run the predictor on the prompt to reconstruct (μ, σ). The encoder runs the target model, feeds (KV_full, μ, σ) into the arithmetic coder, and emits bits. The decoder consumes the bitstream and the locally reconstructed (μ, σ) to recover KV_full exactly. The scheme is lossless because the arithmetic coder itself is lossless and both sides reconstruct (μ, σ) deterministically.
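The round trip can be illustrated with a toy arithmetic coder. This sketch makes several simplifying assumptions that are not from the post: a 16-symbol integer alphabet stands in for FP8 codes, one shared σ is used, and the coder works in exact rational arithmetic via `fractions.Fraction` (a production coder would use fixed-precision integers with renormalization). It also assumes both sides compute bit-identical (μ, σ), which is the property the scheme depends on.

```python
import math
from fractions import Fraction

SYMBOLS = list(range(-8, 8))   # toy alphabet standing in for FP8 codes
TOTAL = 1 << 16                # resolution of the integer frequency table

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def freq_table(mu, sigma):
    """Bin-integrated N(mu, sigma^2) as integer counts, each at least 1."""
    return [
        max(1, int((normal_cdf(s + 0.5, mu, sigma) - normal_cdf(s - 0.5, mu, sigma)) * TOTAL))
        for s in SYMBOLS
    ]

def encode(symbols, mus, sigma):
    """Narrow [low, low + width) once per symbol, then emit a short binary fraction."""
    low, width = Fraction(0), Fraction(1)
    for s, mu in zip(symbols, mus):
        f = freq_table(mu, sigma)
        total = sum(f)
        i = SYMBOLS.index(s)
        low += width * Fraction(sum(f[:i]), total)
        width *= Fraction(f[i], total)
    nbits = 0
    while True:                # shortest code k / 2^nbits inside the interval
        nbits += 1
        k = math.ceil(low * (1 << nbits))
        if Fraction(k, 1 << nbits) < low + width:
            return k, nbits

def decode(code, nbits, mus, sigma):
    """Replay the same interval narrowing, using only the predictor side info."""
    x = Fraction(code, 1 << nbits)
    low, width = Fraction(0), Fraction(1)
    out = []
    for mu in mus:
        f = freq_table(mu, sigma)
        total = sum(f)
        target = (x - low) / width
        cum = 0
        for i, fi in enumerate(f):
            if target < Fraction(cum + fi, total):
                break
            cum += fi
        out.append(SYMBOLS[i])
        low += width * Fraction(cum, total)
        width *= Fraction(fi, total)
    return out

# Made-up example: true symbols and the predictor's (close) guesses.
targets = [3, -1, 0, 2, -4]
mus = [2.6, -0.8, 0.3, 2.1, -3.7]
sigma = 0.5

code, nbits = encode(targets, mus, sigma)
decoded = decode(code, nbits, mus, sigma)
print(nbits, "bits vs", 4 * len(targets), "raw;", decoded == targets)
```

The decoder never sees the targets, only the code and the same (μ, σ) the encoder used. If the two sides computed even slightly different predictor outputs, the frequency tables would diverge and decoding would fail, which is why deterministic reconstruction matters.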

For the predictor, quantized variants of major open-weights models ship alongside their full-precision counterparts, so both sides can pull the same artifact off the shelf. No training step needed; per-channel statistics of the residual can be measured once on a small calibration set and frozen.

What This Means

This approach directly addresses the growing memory bottleneck in long-context LLM serving. By combining a cheap predictor with arithmetic coding, it achieves substantial compression without quality loss. The method is complementary to existing quantization techniques and works best when stacked with them.

Editor's Take

I've been watching KV cache compression for a while, and most lossy methods make me nervous about eval degradation. This lossless approach using a quantized predictor is elegant — it reuses artifacts we already have (fp8 model variants). I'm skeptical about the overhead of running the predictor on both sides, but the ~4× on top of fp8 is compelling enough to prototype. I'd love to see real-world latency benchmarks.

— DevDigest Editorial

Key Takeaways

• Stack speculative KV coding on top of fp8 quantization for up to 8× total compression.
• Use the quantized variant of the same model as the predictor: no extra training is needed, just calibrate per-channel variance on a small dataset.
• The three-component mixture (two Gaussians plus the empirical bf16 marginal) handles outliers better than a single Gaussian, so implement it if you adapt the method.

Why It Matters

KV cache memory is the bottleneck for long-context LLM serving. This technique offers lossless compression up to 4× on top of fp8, which could significantly reduce memory costs and enable larger context windows without quality degradation. Developers building agentic workflows or serving long-context models should evaluate this approach.

