Running Gemma 4 26B-A4B on a 2016 Xeon Without a GPU

A developer has successfully run Gemma 4's 26B-A4B mixture-of-experts (MoE) model on a 10-year-old Intel Xeon E5-2620 v4 with 128 GB of DDR3 RAM and no GPU. The hardware specs: 8 cores (16 threads), AVX2 but no AVX-512, 20 MiB L3 cache, DDR3 memory. The source article details the exact llama-cli command and flags that made this possible.

The key insight: single-stream token generation is memory-bandwidth-bound, not compute-bound. Each generated token requires streaming gigabytes of active model weights from RAM through the CPU caches, so throughput is capped by how fast memory can deliver them. DDR3 is 5-6x slower than modern laptop RAM, and the Xeon's single memory channel only exacerbates the problem. The remedy is a carefully tuned invocation of ik_llama.cpp.
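The bandwidth argument can be turned into a rough ceiling on decode speed. The sketch below is only a back-of-envelope model: it assumes about 4B active parameters per token (from the "A4B" in the model name), Q8_0 storage at 34 bytes per block of 32 weights, and two made-up bandwidth figures. None of the numbers are measurements from the article.

```python
# Back-of-envelope ceiling on decode speed when weight streaming dominates.
# All numeric inputs are illustrative assumptions, not measurements.

# Q8_0 stores each block of 32 weights as 32 int8 values plus one fp16 scale.
Q8_0_BYTES_PER_WEIGHT = 34 / 32


def bytes_streamed_per_token(active_params: float,
                             bytes_per_weight: float = Q8_0_BYTES_PER_WEIGHT) -> float:
    """Bytes of weights that must be read from RAM to produce one token."""
    return active_params * bytes_per_weight


def max_tokens_per_second(active_params: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s if memory bandwidth is the only limit."""
    return bandwidth_bytes_per_s / bytes_streamed_per_token(active_params)


if __name__ == "__main__":
    active = 4e9  # assumed active parameters per token
    for gb_per_s in (10, 50):  # hypothetical sustained bandwidths in GB/s
        ceiling = max_tokens_per_second(active, gb_per_s * 1e9)
        print(f"{gb_per_s} GB/s -> at most {ceiling:.1f} tokens/s")
```

The bound ignores KV-cache reads, cache hits, and compute time, so real throughput will sit below it. Its use is to show why reducing bytes moved per token matters more than adding arithmetic throughput.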

The Magic Command

llama-cli \
--model gemma-4-26B-A4B-it-Q8_0.gguf \
--model-draft gemma-4-26B-A4B-it-assistant-GGUF/\
wikitext-2-raw_ik-llama-mtp_drafter-conservative/\
gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
--spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune \
-cnv --color --jinja --special \
-sm graph -smgs -sas -mea 256 --split-mode-f32 \
--temp 0.7 -t 8 --parallel 8 \
--cpu-moe --merge-up-gate-experts \
--flash-attn on --mla-use 3 \
--mlock --run-time-repack --no-kv-offload

Speculative Decoding on CPU

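The --spec-type mtp flag pairs the 26B verifier with a small multi-token-prediction (MTP) drafter model. The verifier activates only about 3.8B parameters per token, but at Q8_0 that is still several gigabytes, far more than any cache can hold. The drafter's working set, by contrast, fits in the 20 MiB L3 cache, which is what makes speculative decoding pay off on CPU: the drafter proposes up to three tokens (--draft-max 3) and the verifier checks them in a single pass over its weights. The author notes: "CPU compute is cheap relative to the cost of streaming the verifier's weights through cache, so spending extra cycles on a tiny drafter whose active layers easily fit in L3 buys tokens at very little marginal cost." The --spec-autotune flag dynamically adjusts the draft length.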
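To see why the draft length matters, a simplified model of speculative decoding helps. The sketch below assumes each drafted token is accepted independently with the same probability, which real workloads only approximate. The function name and the acceptance rates are hypothetical and do not come from the article or from ik_llama.cpp.

```python
def expected_tokens_per_verifier_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verifier pass under an i.i.d. acceptance model.

    The verifier always contributes one token (a correction, or a bonus token
    when every draft is accepted), and draft i survives only if all drafts
    before it were accepted. That gives the geometric sum
    1 + p + p^2 + ... + p^draft_len.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(accept_prob ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    # draft_len=3 mirrors --draft-max 3; the acceptance rates are made up.
    for p in (0.3, 0.6, 0.9):
        print(p, round(expected_tokens_per_verifier_pass(p, 3), 3))
```

Under this model each verifier pass, the expensive part on a bandwidth-limited machine, yields between 1 and draft_len + 1 tokens. A drafter that is cheap to run is therefore worthwhile even at modest acceptance rates.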

MoE Routing Optimizations

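--cpu-moe keeps the mixture-of-experts weights in CPU memory rather than offloading them to an accelerator. It does not change which experts the router selects; on a GPU-less machine it mainly makes the weight placement explicit. The model has 128 experts with 8 active per token, so each token touches only a small, changing subset of the expert weights, and that subset rarely stays in cache from one token to the next. --merge-up-gate-experts fuses two per-expert matrix multiplications (the up and gate projections) into one, so the input activations are read once instead of twice and per-operation overhead drops. The logs confirm fused_up_gate = 1.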
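The idea behind fusing the up and gate projections can be shown with a toy example: stacking the two weight matrices gives one larger multiplication that produces both outputs in a single pass over the input. This is a pure-Python illustration of the concept with hypothetical function names, not the kernel ik_llama.cpp actually uses.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product, one output element per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def separate_up_gate(w_up: Matrix, w_gate: Matrix, x: Vector) -> Tuple[Vector, Vector]:
    """Two independent multiplications: x is traversed twice."""
    return matvec(w_up, x), matvec(w_gate, x)


def fused_up_gate(w_up: Matrix, w_gate: Matrix, x: Vector) -> Tuple[Vector, Vector]:
    """One multiplication over the row-stacked matrix, then split the result."""
    fused = w_up + w_gate  # row stacking; a real engine would do this once at load
    out = matvec(fused, x)
    n_up = len(w_up)
    return out[:n_up], out[n_up:]


if __name__ == "__main__":
    w_up = [[1.0, 2.0], [3.0, 4.0]]
    w_gate = [[0.5, 0.0], [0.0, 0.5]]
    x = [2.0, 4.0]
    print(separate_up_gate(w_up, w_gate, x))
    print(fused_up_gate(w_up, w_gate, x))
```

Both functions return the same pair of vectors. The fused form simply issues one operation where the unfused form issues two.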

Threading and Memory

-t 8 matches the 8 physical cores, not the 16 SMT threads. On memory-bound workloads, oversubscribing threads adds scheduling overhead without a throughput gain. --mlock pins the 27 GB model in RAM to prevent swapping. The author warns about the ulimit footgun: if the locked-memory limit (ulimit -l) is lower than the model size, the lock fails with "warning: failed to mlock... Cannot allocate memory", a common issue that black-box tools silently ignore.
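The locked-memory limit can be checked before launching. The sketch below is a minimal example assuming a Linux or other Unix system where Python's resource module exposes RLIMIT_MEMLOCK. The 27 GB figure is the model size quoted above, and the helper names are hypothetical.

```python
import resource


def mlock_limit_allows(model_bytes: int, soft_limit: int) -> bool:
    """Return True if a locked-memory soft limit is large enough for the model."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= model_bytes


def current_memlock_soft_limit() -> int:
    """Soft RLIMIT_MEMLOCK for this process, in bytes (RLIM_INFINITY if unlimited)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft


if __name__ == "__main__":
    model_bytes = 27 * 10**9  # approximate size of the Q8_0 model file
    soft = current_memlock_soft_limit()
    if mlock_limit_allows(model_bytes, soft):
        print("locked-memory limit is high enough for --mlock")
    else:
        # Privileged processes may bypass the limit; otherwise raise it first.
        print(f"limit is {soft} bytes; raise it (e.g. ulimit -l) before using --mlock")
```

Running this in the same shell that will launch llama-cli matters, because the limit is inherited per process.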

Graph Layout and Split Mode

The -sm graph flag requests graph split mode, which attempts tensor parallelism across memory regions. The engine downgrades the request to layer split, however, because it does not yet support graph split for the more complex graph that external MTP (multi-token prediction) creates. The logs show: Split mode 'graph' is not supported for Gemma4 external MTP => changing split mode to 'layer'. In this run, then, the requested mode is not the one that actually executes.

Repacking

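--run-time-repack repacks weight tensors at load time into interleaved-row variants where one is available. This layout lets the CPU's SIMD kernels work on several rows per pass and makes better use of each cache line fetched. The logs confirm ============ Repacked 265 tensors.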
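One common way to reorganize weights for CPU kernels is to interleave several rows, so that the blocks a kernel needs together sit next to each other in memory. The sketch below is a toy illustration of that idea only. The group and block sizes are hypothetical, and it does not reproduce the actual tensor formats ik_llama.cpp writes.

```python
from typing import List, Sequence


def interleave_rows(rows: Sequence[Sequence[int]], group: int, block: int) -> List[int]:
    """Flatten rows so that blocks from `group` consecutive rows alternate.

    Row-major order stores row 0 fully, then row 1, and so on. The interleaved
    order stores block 0 of each row in the group, then block 1 of each row,
    so a kernel computing `group` outputs at once reads memory sequentially.
    """
    if not rows or len(rows) % group != 0:
        raise ValueError("row count must be a positive multiple of group")
    n_cols = len(rows[0])
    if n_cols % block != 0 or any(len(r) != n_cols for r in rows):
        raise ValueError("rows must share a length that is a multiple of block")

    packed: List[int] = []
    for g in range(0, len(rows), group):
        for c in range(0, n_cols, block):
            for row in rows[g:g + group]:
                packed.extend(row[c:c + block])
    return packed


if __name__ == "__main__":
    weights = [[0, 1, 2, 3], [10, 11, 12, 13]]
    print(interleave_rows(weights, group=2, block=2))
```

With two rows and a block size of two, the first four packed values hold the leading block of both rows, which is the access pattern a multi-row dot-product kernel wants.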

Why This Matters

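This is a proof-of-concept that even decade-old hardware can run state-of-the-art models with enough optimization. It also highlights the gap between black-box tools like Ollama and the raw control that ik_llama.cpp provides. Not every flag carries equal weight (the -sm graph request, for instance, is downgraded at load time), but the core combination of speculative decoding, fused expert projections, repacking, and correct thread and memory settings is what makes the run practical.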

Next Steps

If you have similar hardware, clone ik_llama.cpp, download the quantized Gemma 4 models, and run the command above. Expect around 1-2 tokens per second — usable for experimentation. The author hints at future posts covering multi-socket NUMA optimizations.