blqsort: Branchless Quicksort Beats std::sort by 30% on M1 and Ryzen

blqsort is a new branchless quicksort implementation that outperforms std::sort and pdqsort on both Apple M1 and AMD Ryzen. It uses sorting networks for small arrays and a buffer-based branchless partitioning technique. Single-header C and C++ libraries are available on GitHub.

3 min read · Jun 5, 2026

A new quicksort implementation called blqsort claims to outperform both std::sort and pdqsort by using branchless programming and sorting networks. In the reported benchmarks on Apple M1 and AMD Ryzen, it sorts 50 million doubles in roughly 27% less time than pdqsort on both machines, and the gap to std::sort on Ryzen is larger still.

The Numbers

On an Apple M1 with Clang, sorting 50 million doubles takes:

std::sort: 1.33s
pdqsort: 1.33s
blqsort (single-threaded): 0.97s

On an AMD Ryzen system with GCC:

std::sort: 5.56s
pdqsort: 2.81s
blqsort: 2.06s

The multi-threaded versions of blqsort are 3-4x faster on M1.

How It Works

Branchless programming avoids CPU branch mispredictions by replacing conditional branches with arithmetic. For example, instead of:

if (numbers[i] < 500) {
    small_numbers[smlen] = numbers[i];
    smlen += 1;
}

the branchless version does:

small_numbers[smlen] = numbers[i];
smlen += (numbers[i] < 500);

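This eliminates the data-dependent branch. The store always happens, and the index only advances when the comparison is true, so a rejected value is simply overwritten by the next store. The payoff comes when the comparison outcome is hard to predict, as it is with random data; on highly predictable input the branching version can be just as fast.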
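
To make the comparison concrete, here is a minimal, self-contained sketch of both variants as complete functions. The names filter_branchy and filter_branchless are made up for this illustration and are not part of blqsort.

```cpp
#include <cstddef>
#include <vector>

// Branching version: the store and the increment only run when the test passes.
std::size_t filter_branchy(const std::vector<int>& numbers, std::vector<int>& out) {
    out.resize(numbers.size());
    std::size_t len = 0;
    for (int x : numbers) {
        if (x < 500) {
            out[len] = x;
            len += 1;
        }
    }
    out.resize(len);
    return len;
}

// Branchless version: always store, then advance the index by 0 or 1.
std::size_t filter_branchless(const std::vector<int>& numbers, std::vector<int>& out) {
    out.resize(numbers.size());  // len never exceeds the read position, so this is enough room
    std::size_t len = 0;
    for (int x : numbers) {
        out[len] = x;                              // a rejected value is overwritten by the next store
        len += static_cast<std::size_t>(x < 500);  // bool converts to 0 or 1
    }
    out.resize(len);
    return len;
}
```

Both functions produce the same output; whether the second one is actually faster depends on the compiler, the CPU, and how predictable the data is.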

blqsort uses an auxiliary buffer (a 1024-element stack array, not heap memory) for branchless partitioning. It copies 1024 elements to the buffer, then alternately copies blocks to the left or right based on comparisons, incrementing pointers branchlessly. This doubles the copy operations but is cheaper than branch mispredictions for cheap-to-copy types like doubles.
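
The core idea of partitioning through a buffer can be sketched in a few lines. This is a simplified illustration, not blqsort's code: it assumes a heap-allocated buffer as large as the input instead of the fixed 1024-element stack buffer processed block by block, and the function name buffer_partition is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper. Returns k such that a[0..k) < pivot and a[k..n) >= pivot.
std::size_t buffer_partition(double* a, std::size_t n, double pivot) {
    std::vector<double> buf(n);  // simplification: full-size buffer, not a 1024-element block
    std::size_t lo = 0;          // next free slot on the left side (in place)
    std::size_t hi = 0;          // next free slot in the buffer
    for (std::size_t i = 0; i < n; ++i) {
        const double v = a[i];
        a[lo] = v;               // safe: lo <= i, so only already-read slots are overwritten
        buf[hi] = v;             // every element is written twice
        const std::size_t less = v < pivot;
        lo += less;              // exactly one of the two cursors advances
        hi += 1 - less;
    }
    for (std::size_t j = 0; j < hi; ++j) {
        a[lo + j] = buf[j];      // append the ">= pivot" elements after the left part
    }
    return lo;
}
```

Each element is stored twice and one of the two stores is thrown away, which is the doubled copy cost the paragraph above mentions. The loop body itself contains no data-dependent branch.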

For small arrays (2-12 elements), blqsort uses custom sorting networks. These are hardcoded sequences of comparisons and swaps that sort with minimal operations, using a branchless sort-2 primitive. The source for sorting networks is included.
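
As an illustration of the technique, here is the well-known five-comparator network for four elements, built on a sort-2 primitive written without an explicit if. The names sort2 and sort4 are made up for this sketch, and blqsort's own networks may differ.

```cpp
#include <array>

// Sort-2 primitive: after the call, a <= b. Written with selects so that
// compilers can emit min/max or conditional-move instructions (not guaranteed).
inline void sort2(double& a, double& b) {
    const double lo = (b < a) ? b : a;
    const double hi = (b < a) ? a : b;
    a = lo;
    b = hi;
}

// Fixed sequence of five compare-exchange steps that sorts any four values.
inline void sort4(std::array<double, 4>& v) {
    sort2(v[0], v[1]);
    sort2(v[2], v[3]);
    sort2(v[0], v[2]);  // v[0] now holds the minimum
    sort2(v[1], v[3]);  // v[3] now holds the maximum
    sort2(v[1], v[2]);  // order the two middle values
}
```

The sequence of comparisons is the same for every input, so there is nothing for the branch predictor to get wrong.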

To avoid O(n²) worst-case behavior on bad input, blqsort groups equal elements together and switches to heapsort if partitioning is severely imbalanced. It also detects already-sorted partitions. For larger partitions, it uses median-of-medians pivot selection and explicitly unrolls critical partitioning loops.
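
The safeguards can be sketched with standard-library building blocks. This is a rough illustration under stated assumptions, not blqsort's implementation: the names guarded_quicksort and guarded_sort, the naive middle-element pivot, the one-eighth imbalance threshold, and the budget of about log2(n) bad partitions are all choices made for this example. Sorted-run detection is left out, and the input is assumed to contain no NaN values.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: quicksort that tolerates a limited number of lopsided
// partitions and then finishes the current range with heapsort.
void guarded_quicksort(double* first, double* last, int bad_budget) {
    while (last - first > 1) {
        if (bad_budget <= 0) {
            std::make_heap(first, last);   // heapsort: O(n log n) in the worst case
            std::sort_heap(first, last);
            return;
        }
        const std::ptrdiff_t n = last - first;
        const double pivot = first[n / 2]; // naive pivot choice, for brevity
        // Three-way split: [first, lt) < pivot, [lt, gt) == pivot, [gt, last) > pivot.
        double* lt = std::partition(first, last, [pivot](double x) { return x < pivot; });
        double* gt = std::partition(lt, last, [pivot](double x) { return !(pivot < x); });
        const std::ptrdiff_t left = lt - first;
        const std::ptrdiff_t right = last - gt;
        if (std::max(left, right) > n - n / 8) {
            --bad_budget;                  // one side kept almost everything
        }
        // Recurse into the smaller side and loop on the larger one.
        if (left < right) {
            guarded_quicksort(first, lt, bad_budget);
            first = gt;
        } else {
            guarded_quicksort(gt, last, bad_budget);
            last = lt;
        }
    }
}

void guarded_sort(double* first, double* last) {
    int budget = 1;
    for (std::ptrdiff_t n = last - first; n > 1; n >>= 1) {
        ++budget;                          // roughly log2(n) + 1
    }
    guarded_quicksort(first, last, budget);
}
```

Because elements equal to the pivot end up in the middle group and are never revisited, inputs with many duplicates shrink quickly instead of degrading the recursion.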

API and Usage

blqsort is available as four single-header files:

blqs.h: C++ single-threaded
blqs_thr.h: C++ multi-threaded (uses C++ threads)
blqsort.h: C single-threaded
blqsort_thr.h: C multi-threaded (uses POSIX threads)

C++ usage is identical to std::sort:

#include "blqs.h"
double data[SIZE];
blqs::sort(data, data + SIZE);

For C, you define the comparison macro and type before including the header:

#define BLQS_CMP(a, b) ((a) < (b))
#define BLQS_TYPE double
#include "blqsort.h"
double data[SIZE];
blqsort(data, SIZE);

Custom structs work too. For C++:

struct entry {
    int id;
    int value;
    bool operator<(const entry& other) const { return id < other.id; }
};
blqs::sort(data, data + SIZE);

For C:

#define BLQS_CMP(a, b) (((a).id) < ((b).id))
#define BLQS_TYPE struct entry
#include "blqsort.h"
blqsort(data, SIZE);

Benchmarks for sorting 50 million entry structs:

Apple M1: std::sort 3.46s, pdqsort 3.46s, blqsort 0.96s
AMD Ryzen: std::sort 4.75s, pdqsort 4.72s, blqsort 2.20s

When to Use It

blqsort shines for trivially copyable types (e.g., numbers, small structs). For types that are expensive to copy, such as std::string, the buffer-based branchless approach is less efficient because every element is copied twice. In those cases, blqsort falls back to a BlockQuicksort variant (Edelkamp and Weiß) that records element indices branchlessly and then moves the data with fewer swaps, borrowing ideas from pdqsort.

Get the Code

The full source is on GitHub, and the author's write-up is at tiki.li/blog/blqsort. The author, Christof Käser, also links to an interactive sorting demo and to the paper on branchless partitioning by Edelkamp and Weiß.

Bottom Line

If you sort large arrays of numbers or simple structs in C or C++, blqsort is worth a try. It's a drop-in replacement for std::sort with a significant speedup—no external dependencies, just a single header. Test it on your own hardware and data types.

Editor's Take

I've spent years tuning sort routines in C++ and have seen many 'faster than std::sort' claims fall flat on real-world data. blqsort's benchmarks are compelling, especially the struct sorting results—that's where most of my work lives. I'm skeptical of the branchless approach for non-trivial types, but the fallback to BlockQuicksort gives me confidence. I'll be testing this on our production workload next week.

— DevDigest Editorial

Key Takeaways

• Replace std::sort with blqs::sort for trivially copyable types to get a 30% speedup on M1 and Ryzen.
• Use the C single-header version (blqsort.h) for C projects: define BLQS_TYPE and BLQS_CMP, then call blqsort().
• For multi-threaded sorting, include blqs_thr.h (C++) or blqsort_thr.h (C) to leverage parallelism without extra effort.

Why It Matters

Sorting is a fundamental operation in countless applications, from databases to game engines. blqsort offers a drop-in replacement for std::sort that can cut sorting time by 30% or more on modern hardware, with no external dependencies. For performance-critical code, this is a free speedup.
