ds4.c: A Dedicated Metal Inference Engine for DeepSeek V4 Flash
If you've ever tried running a large language model locally, you know the drill: generic runners, endless configuration, and performance that rarely matches the hype. ds4.c takes the opposite approach. It's a purpose-built, Metal-only inference engine for exactly one model: DeepSeek V4 Flash. No generic GGUF runner, no framework wrapper, no CUDA path (yet). Just a focused, fast, and opinionated implementation.
Why DeepSeek V4 Flash Deserves Its Own Engine
DeepSeek V4 Flash is a 284B-parameter MoE model with a 1M-token context and a highly compressed KV cache. According to the project's author, it outperforms smaller dense models in both speed and quality, especially in thinking mode. Its thinking traces are often a fifth the length of other models' and scale with problem complexity, which makes it practical on local hardware. The model also writes better English and Italian, and it tolerates 2-bit quantization without significant degradation, which is critical for fitting into 128GB MacBooks.
Design Philosophy: Narrow and Deep
ds4.c is deliberately not a general-purpose runner. It loads only specially crafted GGUF files from the project's Hugging Face repo, which use an asymmetric quantization scheme: MoE experts at IQ2_XXS/Q2_K, while shared experts and projections remain untouched. This preserves quality where it matters most while keeping the model size manageable.
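To make the asymmetric scheme concrete, here is a minimal sketch of how a per-tensor quantization policy can be expressed. The tensor-name patterns below are illustrative (loosely following GGUF naming conventions for MoE models), not the project's actual rules:

```python
# Illustrative sketch, not ds4.c's actual code: an asymmetric per-tensor
# quantization policy. Patterns are hypothetical; order encodes priority.
QUANT_POLICY = [
    ("shexp",         "F16"),      # shared experts: left untouched
    ("attn_",         "F16"),      # attention projections: left untouched
    ("ffn_down_exps", "Q2_K"),     # routed MoE experts: heavily quantized
    ("ffn_gate_exps", "IQ2_XXS"),
    ("ffn_up_exps",   "IQ2_XXS"),
]

def quant_type(tensor_name: str) -> str:
    """Pick a quantization type by matching substrings in priority order."""
    for pattern, qtype in QUANT_POLICY:
        if pattern in tensor_name:
            return qtype
    return "F16"  # default: keep full precision
```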
The engine treats the KV cache as a first-class disk citizen. Instead of holding everything in RAM, it leverages the fast SSDs in modern MacBooks to persist the cache, enabling long-context inference without exhausting memory. The project's vision is a complete local inference stack: engine + tailored GGUF + validation against official logits + agent integration.
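A disk-backed KV cache is easy to picture as one memory-mapped file per layer, with the OS paging hot regions in and out. ds4.c implements this in C and Metal; the Python below is a conceptual sketch under assumed shapes, not its actual code:

```python
# Conceptual sketch of a disk-backed KV cache via memory-mapped files.
# Names and shapes are illustrative; ds4.c's real implementation is C/Metal.
import numpy as np
from pathlib import Path

class DiskKVCache:
    def __init__(self, cache_dir: str, n_layers: int, max_tokens: int, kv_dim: int):
        Path(cache_dir).mkdir(parents=True, exist_ok=True)
        self.files = []
        for layer in range(n_layers):
            # One memory-mapped file per layer: the OS keeps hot regions in
            # RAM and evicts cold ones, so resident memory tracks the
            # working set rather than the full context.
            path = Path(cache_dir) / f"layer_{layer:03d}.kv"
            self.files.append(np.memmap(path, dtype=np.float16, mode="w+",
                                        shape=(max_tokens, kv_dim)))

    def append(self, layer: int, pos: int, kv: np.ndarray):
        self.files[layer][pos] = kv  # write-through to the mapped file

    def view(self, layer: int, n_past: int) -> np.ndarray:
        return self.files[layer][:n_past]  # zero-copy slice for attention

# Rough scale: 1M tokens at an assumed compressed KV width of 576 fp16
# values per layer maps ~1.2 GB per layer to disk instead of RAM.
```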
Performance Numbers
Benchmarks on a MacBook Pro M3 Max (128GB) and Mac Studio M3 Ultra (512GB) show impressive throughput:
| Machine | Prompt | Prefill | Generation |
|---|---|---|---|
| M3 Max, 128GB | short | 58.52 t/s | 26.68 t/s |
| M3 Max, 128GB | 11709 tokens | 250.11 t/s | 21.47 t/s |
| M3 Ultra, 512GB | short | 84.43 t/s | 36.86 t/s |
| M3 Ultra, 512GB | 11709 tokens | 468.03 t/s | 27.39 t/s |
These are single-run numbers with q2 quantization, 32K context, greedy decoding, and thinking disabled. The long-prefill numbers show the benefit of chunked prefill and the engine's efficient Metal implementation.
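For a feel of what the prefill numbers mean in practice, a quick back-of-the-envelope converts the table's throughput into time to first token for the long prompt:

```python
# Time-to-first-token for the 11709-token prompt, computed straight from
# the prefill throughput (tokens/sec) in the table above.
prompt = 11709
for machine, prefill_tps in [("M3 Max", 250.11), ("M3 Ultra", 468.03)]:
    print(f"{machine}: ~{prompt / prefill_tps:.1f}s to first token")
# M3 Max: ~46.8s to first token
# M3 Ultra: ~25.0s to first token
```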
CLI and Server Modes
ds4.c ships with two binaries: ds4 for interactive CLI and ds4-server for an OpenAI/Anthropic-compatible API. The CLI supports multi-turn conversations with thinking mode, context resizing, and file ingestion. The server provides endpoints for chat, completions, and tool use, making it a drop-in replacement for cloud APIs.
Server features:
- Single mutable graph with prefix caching
- Disk-backed KV cache (--kv-disk-dir)
- SSE streaming with thinking mode support (see the client sketch below)
- Tool calling via DeepSeek's DSML format
- Anthropic-compatible /v1/messages endpoint
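To give a feel for the API, here is a minimal streaming client. It assumes the standard OpenAI-style /v1/chat/completions route and stream flag, which is what OpenAI compatibility usually implies; the model name is a placeholder, so check the README for the real identifier:

```python
# Minimal streaming client sketch against ds4-server's OpenAI-compatible
# API. Assumes the standard /v1/chat/completions route and `stream` flag;
# the model name is a placeholder, not confirmed by the project.
import json
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "deepseek-v4-flash",  # placeholder name
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,                # tokens arrive as SSE events
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue  # skip SSE keep-alives and comments
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)
```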
Integration with Coding Agents
The server works with local coding agents like opencode and Pi. The README includes ready-to-use configuration snippets. By setting --ctx 100000 and --kv-disk-dir, you can run agents with 100K token context on a 128GB machine. The 2-bit quantized model uses ~81GB of RAM, leaving room for the KV cache indexer (~22GB at full 1M context). A practical range is 100K–300K tokens.
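A rough budget makes that practical range plausible. The weights (~81GB) and indexer-at-1M (~22GB) figures come from the project; linear scaling of the indexer with context length is my assumption, not a documented fact:

```python
# Rough RAM budget for a 128GB machine. The ~81GB weights and ~22GB
# indexer-at-1M figures are from the text; linear scaling of the indexer
# with context length is an assumption.
TOTAL_GB = 128
WEIGHTS_GB = 81
INDEXER_GB_AT_1M = 22

for ctx in (100_000, 300_000, 1_000_000):
    indexer = INDEXER_GB_AT_1M * ctx / 1_000_000
    headroom = TOTAL_GB - WEIGHTS_GB - indexer
    print(f"ctx={ctx:>9,}: indexer ~{indexer:.1f} GB, headroom ~{headroom:.1f} GB")
# ctx=  100,000: indexer ~2.2 GB, headroom ~44.8 GB
# ctx=  300,000: indexer ~6.6 GB, headroom ~40.4 GB
# ctx=1,000,000: indexer ~22.0 GB, headroom ~25.0 GB
```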
Caveats and Future
This is alpha-quality code. The CPU path is broken due to a macOS virtual memory bug that causes a kernel crash. CUDA support is possible but not planned. The MTP speculative decoding path is experimental and yields only a marginal speedup. On the plus side, the engine is validated against official DeepSeek logits at multiple context lengths to catch correctness regressions.
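Logit validation is conceptually simple, even if the project's actual harness differs. One way such a check could look:

```python
# Conceptual sketch of logit validation (not the project's actual harness):
# compare the engine's logits against reference logits from the official
# implementation, within a quantization-aware tolerance.
import numpy as np

def validate_logits(engine_logits: np.ndarray,
                    reference_logits: np.ndarray,
                    top_k: int = 10,
                    max_abs_err: float = 0.5) -> bool:
    """Check top-k token agreement and absolute error at each position."""
    assert engine_logits.shape == reference_logits.shape  # (positions, vocab)
    for pos in range(engine_logits.shape[0]):
        eng_top = set(np.argsort(engine_logits[pos])[-top_k:])
        ref_top = set(np.argsort(reference_logits[pos])[-top_k:])
        if eng_top != ref_top:
            return False  # top-k token sets diverged
        if np.max(np.abs(engine_logits[pos] - reference_logits[pos])) > max_abs_err:
            return False  # logit values drifted beyond tolerance
    return True
```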
The project acknowledges its debt to llama.cpp and GGML, and includes their copyright notice. It also states that the code was developed with heavy AI assistance, so if you're uncomfortable with AI-generated code, this isn't for you.
Next Steps
If you have a high-end Mac with 128GB+ RAM, download a q2 GGUF, build ds4.c, and run the CLI or server. Configure your agent to point at http://127.0.0.1:8000/v1. Start with a 100K context window and adjust based on your memory budget. The project expects DeepSeek to release improved versions of V4 Flash, so this engine may evolve with the model.
Full disclosure: This article was written by a human, but the source project acknowledges AI assistance in its development.