blqsort: Branchless Quicksort Beats std::sort by 30% on M1 and Ryzen
A new quicksort implementation called blqsort claims to outperform both std::sort and pdqsort by using branchless programming and sorting networks. Benchmarks on Apple M1 and AMD Ryzen show consistent speedups of 30-45% for sorting 50 million doubles.
The Numbers
On an Apple M1 with Clang, sorting 50 million doubles takes:
std::sort: 1.33spdqsort: 1.33sblqsort(single-threaded): 0.97s
On an AMD Ryzen system with GCC:
std::sort: 5.56spdqsort: 2.81sblqsort: 2.06s
The multi-threaded versions of blqsort are 3-4x faster on M1.
How It Works
Branchless programming avoids CPU branch mispredictions by replacing conditional branches with arithmetic. For example, instead of:
if (numbers[i] < 500) {
small_numbers[smlen] = numbers[i];
smlen += 1;
}
the branchless version does:
small_numbers[smlen] = numbers[i];
smlen += (numbers[i] < 500);
This eliminates the branch, making the code faster on modern CPUs.
blqsort uses an auxiliary buffer (a 1024-element stack array, not heap memory) for branchless partitioning. It copies 1024 elements to the buffer, then alternately copies blocks to the left or right based on comparisons, incrementing pointers branchlessly. This doubles the copy operations but is cheaper than branch mispredictions for cheap-to-copy types like doubles.
For small arrays (2-12 elements), blqsort uses custom sorting networks. These are hardcoded sequences of comparisons and swaps that sort with minimal operations, using a branchless sort-2 primitive. The source for sorting networks is included.
To avoid O(n²) worst-case behavior on bad input, blqsort groups equal elements together and switches to heapsort if partitioning is severely imbalanced. It also detects already-sorted partitions. For larger partitions, it uses median-of-medians pivot selection and explicitly unrolls critical partitioning loops.
API and Usage
blqsort is available as four single-header files:
blqs.h: C++ single-threadedblqs_thr.h: C++ multi-threaded (uses C++ threads)blqsort.h: C single-threadedblqsort_thr.h: C multi-threaded (uses POSIX threads)
C++ usage is identical to std::sort:
#include "blqs.h"
double data[SIZE];
blqs::sort(data, data + SIZE);
For C, you define the comparison macro and type before including the header:
#define BLQS_CMP(a, b) ((a) < (b))
#define BLQS_TYPE double
#include "blqsort.h"
double data[SIZE];
blqsort(data, SIZE);
Custom structs work too. For C++:
struct entry {
int id;
int value;
bool operator<(const entry& other) const { return id < other.id; }
};
blqs::sort(data, data + SIZE);
For C:
#define BLQS_CMP(a, b) (((a).id) < ((b).id))
#define BLQS_TYPE struct entry
#include "blqsort.h"
blqsort(data, SIZE);
Benchmarks for sorting 50 million entry structs:
- Apple M1:
std::sort3.46s,pdqsort3.46s,blqsort0.96s - AMD Ryzen:
std::sort4.75s,pdqsort4.72s,blqsort2.20s
When to Use It
blqsort shines for trivially copyable types (e.g., numbers, small structs). For high-copy-cost types like std::string, the buffer-based branchless approach is less efficient. In those cases, blqsort falls back to a BlockQuicksort variant (Edelkamp and Weiß) that processes indices branchlessly and moves data with fewer swaps—borrowing ideas from pdqsort.
Get the Code
Full source is on GitHub at tiki.li/blog/blqsort. The author, Christof Käser, provides links to an interactive sorting demo and a paper on branchless partitioning by Edelkamp and Weiß.
Bottom Line
If you sort large arrays of numbers or simple structs in C or C++, blqsort is worth a try. It's a drop-in replacement for std::sort with a significant speedup—no external dependencies, just a single header. Test it on your own hardware and data types.



