The GPU Bubble Problem
Moondream's Photon inference engine achieves near-realtime VLM inference at ~33ms on an NVIDIA B200. The key insight: GPUs often sit idle not because they lack work, but because the CPU hasn't told them what to do next. This idle time is called a GPU bubble.
In autoregressive decoding, each token depends on the previous one, so generation is sequential. The GPU runs the model forward to produce logits, but the CPU must then sample a token, update state, and plan the next step. For one token, the GPU work is small, but the CPU housekeeping is a fixed cost. In a naive loop, the GPU waits for the CPU to finish before starting the next forward pass.
Pipelined Decoding: Overlap CPU and GPU Work
Photon hides these bubbles by overlapping CPU bookkeeping with GPU compute. The technique is called pipelined decoding: launch the next forward pass while the current step's token is still being copied back and committed.
Why can we do this? The sampled token doesn't have to leave the GPU immediately. The next forward reads it directly from GPU memory as its input. The CPU eventually needs a copy for detokenization and streaming, but that copy can happen in the background while the next forward already runs. That is the move that removes the bubble.
Mechanism 1: Ping-Pong Slots
To run a decode step, the GPU needs a working set of buffers: input, output (logits), sampled token, and KV cache bookkeeping. Photon keeps pinned (page-locked) host buffers so copies run as background DMA transfers. These buffers are allocated once and reused; fixed addresses also allow CUDA graph capture for reduced kernel launch overhead.
But overlapping two steps requires two independent working sets. Photon keeps two slots and alternates between them ping-pong style. Both slots put their forwards onto the same compute stream (they don't run in parallel), but each step's device-to-host copy goes on a separate copy stream, so it can run while the GPU is busy with the next forward. A slot is only freed once its results have been read, preventing corruption from overwriting an in-flight copy.
Mechanism 2: Forward Now, Sample Later
Constrained decoding (e.g., returning structured output like coordinates or boxes) imposes dependencies: the mask for step t+1 depends on the token sampled at t. But the dependency is in sampling, not in the forward pass.
Photon's scheduler uses a three-phase loop:
- Launch the forward for t+1 (no mask dependency).
- Commit step t: wait on the in-flight copy and advance decode state.
- Finalize sampling for t+1: with state current, build the mask and sample.
This "commit-before-finalize" ordering means the GPU runs the t+1 forward through steps 2 and 3, so the commit disappears from the critical path. For plain text (no mask), forward and sampling can both run a step ahead. For constrained sequences, sampling waits on the previous commit, but the forward still runs ahead.
Mechanism 3: Zombies — Finalize Early, Release Late
When a sequence hits its stop token at step t, it may already be baked into step t+1's forward. You can't un-launch GPU work. Photon handles this with two per-sequence fields: finalized and inflight_refs. At step t's commit, the sequence is marked finalized and its result is emitted, but it remains in the batch until inflight_refs drops to zero. The zombie is harmlessly along for the ride — it occupies its slot and writes KV that nobody reads. Only when inflight_refs hits 0 are its KV pages released.
Prefill Rides the Same Pipeline
Photon doesn't separate prefill and decode. A prefill is just another launch in the same two-slot pipeline. Because the pipeline only cares that a slot is free, a prefill forward can be launched into one slot while a decode step from the other slot is being committed. This is crucial for workloads with many short requests, where prefill dominates: the pipeline overlaps prefill's CPU bookkeeping with decode's GPU work.
A Cost Model for the Bubble
A decode step comprises three pieces: forward (GPU compute), sampling (GPU work + device-to-host copy), and bookkeeping (CPU planning, launch, commit). In a blocking loop, these run in series, so the GPU sits idle during bookkeeping. With pipelining, forward and sampling overlap with bookkeeping from the next step, eliminating the bubble. The speedup depends on the ratio of bookkeeping time to forward+sampling time. For small models or fast memory, bookkeeping becomes a larger fraction, making pipelining more valuable.
Conclusion
Photon's pipelined decoding is a practical, production-ready technique to maximize GPU utilization during autoregressive inference. By overlapping CPU bookkeeping with GPU compute using ping-pong slots, forward-now/sample-later ordering, and zombie handling, Moondream achieves up to 35% higher decode throughput. The same pipeline handles prefill and decode seamlessly, making it ideal for real-world workloads with mixed-length requests.
If you're building an inference server, consider implementing pipelined decoding. Start with two slots, move device-to-host copies to a separate stream, and decouple sampling from the forward pass. The gains are real and the complexity is manageable.



