14× Faster Embeddings: Manticore 27.1.5 Ships ONNX Runtime Backend

14× Faster Embeddings: How Manticore Rebuilt the ONNX Path

Manticore Search 27.1.5 ships a new ONNX Runtime backend for auto-embeddings that delivers ~14× the throughput of the previous SentenceTransformers/Candle path. On a 16-core/32-thread server with all-MiniLM-L12-v2, the old path managed 5–11 docs/sec across all thread and batch configurations. The new path lives in the 70–230 docs/sec range.

Why ONNX Runtime?

The old Candle path (Hugging Face's pure-Rust inference framework) left CPU on the floor: workloads sat at 5–11 docs/sec, and concurrent calls serialised on a single model session. ONNX Runtime (ORT), Microsoft's hand-tuned C++ inference engine, does graph fusion, constant folding, and kernel autotuning. Most popular embedding models (MiniLM, BGE, E5) already publish a pre-fused model.onnx in their Hugging Face repository.

The Session Sharing Hack

The key insight: ORT's C Run() API is thread-safe on Linux and macOS. The Rust wrapper hides this behind borrow-checker rules. Manticore wraps the session in an UnsafeCell and implements Sync/Send manually:

#[cfg(not(target_os = "windows"))]
struct SessionWrapper {
    inner: std::cell::UnsafeCell<Session>,
}
// SAFETY: relies on ORT's Run() being thread-safe on Linux and macOS.
#[cfg(not(target_os = "windows"))]
unsafe impl Sync for SessionWrapper {}
#[cfg(not(target_os = "windows"))]
unsafe impl Send for SessionWrapper {}

impl SessionWrapper {
    // Hands the closure a mutable reference to the shared session without taking a lock.
    fn with_session<R>(&self, f: impl FnOnce(&mut Session) -> R) -> R {
        f(unsafe { &mut *self.inner.get() })
    }
}

This single shared session eliminates lock contention and pool overhead. On Windows, a Mutex serialises access due to known ORT threading issues.
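
For contrast, the Windows fallback can be pictured as the same with_session shape with a lock behind it. This is a minimal sketch, not Manticore's code: Session here is a hypothetical stand-in that only counts calls, and LockedSession and run_concurrently are names invented for illustration.

```rust
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for the real ORT session; it only counts calls.
struct Session {
    calls: usize,
}

impl Session {
    fn run(&mut self, text: &str) -> usize {
        self.calls += 1;
        text.len()
    }
}

/// Mutex-guarded wrapper (illustrative name) with the same `with_session` shape.
struct LockedSession {
    inner: Mutex<Session>,
}

impl LockedSession {
    fn new(session: Session) -> Self {
        Self { inner: Mutex::new(session) }
    }

    fn with_session<R>(&self, f: impl FnOnce(&mut Session) -> R) -> R {
        // Only one caller at a time gets past this lock, so calls serialise.
        let mut guard = self.inner.lock().expect("session mutex poisoned");
        f(&mut *guard)
    }
}

/// Spawns `threads` callers that each make `per_thread` calls on one shared
/// wrapper, then returns the total number of calls the session saw.
fn run_concurrently(threads: usize, per_thread: usize) -> usize {
    let shared = LockedSession::new(Session { calls: 0 });
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    shared.with_session(|sess| sess.run("hello"));
                }
            });
        }
    });
    shared.with_session(|sess| sess.calls)
}
```

Callers see the same interface either way; the difference is that here every call waits its turn, which is the contention the lock-free wrapper avoids on Linux and macOS.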

Batching Was a Trap

Textbook advice says batch inputs for throughput. Manticore tried batching 8, 16, 32 documents per inference call — and got lower throughput than processing one at a time. Two reasons:

Padding tax: A batch of mixed-length texts pads every row to the longest. Real inputs vary wildly: one 60-token outlier forces seven 8-token rows to pay for padding. The model does work proportional to batch_size * max_len * hidden_dim, most of it on padding.
Spinning: ORT's intra-op thread pool defaults to busy-waiting between dispatches. With one big batch, threads stay busy. With many concurrent small calls, every worker's pool pins cores at 100% CPU — stealing resources from tokenizers, HNSW builds, and the rest of searchd. Flipping with_intra_op_spinning(false) immediately raised throughput and dropped CPU usage.
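
The padding tax is easy to quantify. The sketch below uses illustrative helper names (padding_cost, wasted_fraction) that are not from Manticore's code, and it counts token slots only, ignoring the hidden_dim factor, which is the same for every slot.

```rust
/// Returns (real_tokens, padded_tokens) for one batch in which every row
/// is padded to the length of the longest row.
fn padding_cost(token_lengths: &[usize]) -> (usize, usize) {
    let real: usize = token_lengths.iter().sum();
    let max_len = token_lengths.iter().copied().max().unwrap_or(0);
    (real, max_len * token_lengths.len())
}

/// Fraction of the padded batch that is padding rather than real tokens.
fn wasted_fraction(token_lengths: &[usize]) -> f64 {
    let (real, padded) = padding_cost(token_lengths);
    if padded == 0 {
        return 0.0;
    }
    (padded - real) as f64 / padded as f64
}
```

For the batch described above, one 60-token row plus seven 8-token rows, padding_cost returns (116, 480): the model processes 480 token slots to embed 116 real tokens, so roughly 76% of the work is padding. Processed one at a time, the same eight documents cost 116 slots.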

The Final Design

One shared session, no pool.
One document per inference call, no batching inside the worker.
Many concurrent callers, scaled to CPU count.
No spinning between calls — yield the CPU.

The predict_pipelined function has two branches:

fn predict_pipelined(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>, _> {
    let bs = batch_size();
    if texts.len() <= bs {
        // Fast path: single tokenize + infer, no thread overhead
        return Self::tokenize_and_infer(&self.session, &self.tokenizer, texts, ...);
    }
    // Large input: split across workers, each running 1-doc-at-a-time
    // through the SHARED session
    let num_workers = (texts.len() / bs).min(available_cpus()).max(1);
    let docs_per_worker = texts.len().div_ceil(num_workers);
    std::thread::scope(|s| {
        for worker_texts in texts.chunks(docs_per_worker) {
            s.spawn(move || {
                for text in worker_texts {
                    Self::tokenize_and_infer(&session, &tokenizer,
                        std::slice::from_ref(text), ...)?;
                }
                Ok(())
            });
        }
    });
    // ...
}

Single-row INSERTs take the fast path with zero coordination overhead. Bulk REPLACE INTO takes the parallel branch.
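
The elided parts of the snippet can be filled in with a self-contained sketch. Everything here is an assumption for illustration: embed_one stands in for tokenize plus inference, and plan and embed_all are hypothetical names. Only the worker-count formula is taken from the snippet above.

```rust
use std::thread;

/// Hypothetical stand-in for tokenize + infer on a single document.
fn embed_one(text: &str) -> Vec<f32> {
    vec![text.len() as f32]
}

/// Worker count and documents per worker, using the snippet's formula.
/// A zero batch size is treated as 1 to avoid dividing by zero.
fn plan(total: usize, batch_size: usize, cpus: usize) -> (usize, usize) {
    let workers = (total / batch_size.max(1)).min(cpus).max(1);
    (workers, total.div_ceil(workers))
}

/// Embeds all texts, returning results in input order.
fn embed_all(texts: &[&str], batch_size: usize, cpus: usize) -> Vec<Vec<f32>> {
    if texts.len() <= batch_size {
        // Small input: stay on the calling thread, no coordination.
        return texts.iter().map(|t| embed_one(t)).collect();
    }
    let (_, per_worker) = plan(texts.len(), batch_size, cpus);
    thread::scope(|s| {
        // Spawn every worker first, then join in spawn order so the
        // output lines up with the input.
        let handles: Vec<_> = texts
            .chunks(per_worker)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|t| embed_one(t)).collect::<Vec<_>>())
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

With 1000 documents, a batch size of 64 and 32 available CPUs, plan returns (15, 67): fifteen workers, each walking through at most 67 documents one at a time.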

Performance Numbers

All runs on a 16-core/32-thread server with all-MiniLM-L12-v2-onnx, 1000 documents per run.

Configuration	Old Candle (docs/sec)	New ONNX (docs/sec)
1 thread, batch=1	5	72
1 thread, batch=64	11	233
8 threads, batch=1	8	130
32 threads, batch=1	8	100

Single-insert latency: ~14 ms with one client, ~56 ms under 8-way concurrent load (vs 200+ ms for Candle).
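
The per-configuration speedups follow directly from the table. A small sketch that derives them (the RESULTS constant simply copies the four rows above; speedups is an illustrative name):

```rust
/// (configuration, old Candle docs/sec, new ONNX docs/sec), copied from the table.
const RESULTS: [(&str, f64, f64); 4] = [
    ("1 thread, batch=1", 5.0, 72.0),
    ("1 thread, batch=64", 11.0, 233.0),
    ("8 threads, batch=1", 8.0, 130.0),
    ("32 threads, batch=1", 8.0, 100.0),
];

/// New-over-old throughput ratio for each configuration.
fn speedups() -> Vec<(&'static str, f64)> {
    RESULTS
        .iter()
        .map(|&(name, old, new)| (name, new / old))
        .collect()
}
```

The ratios come out to 14.4×, about 21.2×, 16.25× and 12.5×, so the headline ~14× appears to correspond to the single-thread, batch=1 row, with the other configurations landing between roughly 12× and 21×.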

What Changed (and What Didn't)

No user-facing API changes. Tables pointing at an ONNX-capable model pick up the new path automatically. To switch models without recreating a table: add a new column with the new model, rebuild embeddings, drop the old column.

The two biggest performance wins: with_intra_op_spinning(false) and giving up on batching documents inside the worker. The reverted commit 980b24b marks the moment the team stopped fighting the profiler.

Why This Matters for Developers

Auto-embeddings run the model on every INSERT — embedding speed is ingest speed. The old path capped throughput at 5–11 docs/sec regardless of hardware. The new path raises the floor to 70+ docs/sec and gives meaningful tuning options (batch size, thread count). For bulk indexing, peak throughput hit 233 docs/sec on a single client thread with batch=64.

Manticore Search 27.1.5 is available now. If you're using auto-embeddings, upgrade and watch your INSERT throughput jump an order of magnitude.
