Bazel Adds Content-Defined Chunking to Remote Cache

BuildBuddy's remote cache now uses Content-Defined Chunking (CDC) to make large build outputs behave more incrementally. When a binary, bundle, or archive is mostly unchanged, BuildBuddy can reuse chunks it has already seen instead of re-uploading or re-downloading the entire file.

In the Bazel chunking implementation PR, BuildBuddy observed 40% less data uploaded and a 40% smaller disk cache when benchmarking on BuildBuddy's own repo. To enable client-side CDC with BuildBuddy, use Bazel 8.7 or 9.1+ and pass --experimental_remote_cache_chunking.

The Problem: Transitive Actions and Large Outputs

Build caching has moved builds from O(size of repo) toward O(size of change). But "size of change" can be misleading. A small source change can ripple into many binaries, packages, and bundles, even when only a small part of each output actually changes.

Transitive actions—linking, bundling, packaging, archiving—are where this hits hardest. They combine many transitive inputs into one output. A typical compile action might compile one source file using a smaller set of direct inputs:

ctx.actions.run(
    inputs = [src] + direct_headers,
    outputs = [obj],
    executable = compiler,
    arguments = ["-c", src.path, "-o", obj.path],
)

A bundling or packaging action often looks more like this:

transitive_inputs = depset(
    direct = direct_files,
    transitive = [dep[MyInfo].files for dep in ctx.attr.deps],
)
ctx.actions.run(
    inputs = transitive_inputs,
    outputs = [bundle],
    executable = bundler,
    arguments = ["--output", bundle.path],
)

The second shape is where small source changes fan out into large output changes. The source edit might only change a small sequence of bytes in the final output, but the output digest is still new.

Without CDC, the cache treats that as a completely new blob, even when most of the binary is byte-for-byte identical to the previous version. If many final outputs depend on that changed input, they can all get new digests.

This creates two problems:

  • Uploads and downloads move the whole blob, even when only a small part changed.
  • Storage keeps another whole blob, even when most bytes are duplicates.
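To see why whole-blob caching is all-or-nothing, here is a short illustrative Python sketch: flipping a handful of bytes in a 1 MiB blob produces a completely different digest, even though nearly every byte survives.

```python
import hashlib

# A deterministic stand-in for a large build output.
old = bytes((i * 31) % 256 for i in range(1 << 20))  # 1 MiB
# A "small source change": overwrite 16 bytes in the middle.
new = old[:500_000] + b"X" * 16 + old[500_016:]

# Digest-keyed caching sees a brand-new blob...
print(hashlib.sha256(old).hexdigest()[:16])
print(hashlib.sha256(new).hexdigest()[:16])

# ...even though almost all bytes are unchanged.
unchanged = sum(a == b for a, b in zip(old, new))
print(f"{unchanged}/{len(old)} bytes identical")
```

A cache keyed only on the whole-blob digest has no way to exploit that overlap; chunk-level addressing does.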

Case Study: Go Tests

A common example is a shared go_library, say foo, that is imported by many other libraries. Each go_test needs a test binary, produced by a GoLink action. If foo.a changes, many downstream test binaries can get new digests even when their source and compile actions did not change. Those test binaries are often large, and many of them are mostly the same bytes as before.

Content-Defined Chunking: How It Works

CDC is a repeatable process for splitting a file into chunks based on its contents rather than fixed byte offsets. The algorithm runs a rolling hash over a small window of bytes and splits when the hash matches a rare pattern. The hash behaves randomly enough that this happens only occasionally, but the process is deterministic: the same content produces the same chunk boundaries.

For example, if you want chunks around 512 KiB on average, choose a pattern that matches with probability about 1 in 524,288 (2^19) at each byte, so cuts land roughly every 512 KiB. If the pattern does not match, slide the window forward one byte and try again.

For a toy example with a 4-byte window that splits when the hash ends in 00:

original:  aaaabbbbccccdddd
cuts:      aaaa|bbbb|cccc|dddd

Here each 4-byte window of identical bytes (aaaa, bbbb, cccc, dddd) happens to hash to a value ending in 00, so the algorithm cuts after each run.

If we insert a few bytes inside bbbb, the nearby windows change, so that chunk changes:

updated:   aaaabbXXbbccccdddd

But once the rolling window moves past the inserted bytes and reaches cccc again, it sees the same 4-byte sequence as before. That sequence produces the same hash, so the algorithm finds the same cut point again. The later chunks can keep the same boundaries and hashes.

Real CDC uses a larger rolling window and a much rarer split pattern. One common algorithm is FastCDC.
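As a sketch of the idea (not BuildBuddy's or FastCDC's actual implementation), the following Python chunker uses a toy Rabin-style rolling hash with a 12-bit mask, giving roughly 4 KiB average chunks. All constants and names here are invented for illustration; real FastCDC adds minimum/maximum chunk sizes and a precomputed gear table for speed.

```python
import hashlib
import random

WINDOW = 16           # rolling-hash window size in bytes
MASK = (1 << 12) - 1  # cut when hash & MASK == 0 -> ~4 KiB average chunks
MOD = 1 << 32
TOP = pow(257, WINDOW - 1, MOD)  # weight of the byte leaving the window

def chunk(data: bytes) -> list[bytes]:
    """Toy content-defined chunker: cut wherever the rolling hash of the
    last WINDOW bytes ends in 12 zero bits. Deterministic: the same
    content always produces the same boundaries."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD  # drop outgoing byte
        h = (h * 257 + b) % MOD                     # add incoming byte
        if i - start + 1 >= WINDOW and h & MASK == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Insert 5 bytes into a 200 KB blob: only the chunk containing the edit
# changes, because boundaries resynchronize once the window passes it.
rng = random.Random(42)
old = bytes(rng.getrandbits(8) for _ in range(200_000))
new = old[:50_000] + b"PATCH" + old[50_000:]

seen = {hashlib.sha256(c).hexdigest() for c in chunk(old)}
new_digests = [hashlib.sha256(c).hexdigest() for c in chunk(new)]
reused = sum(d in seen for d in new_digests)
print(f"{reused} of {len(new_digests)} chunks already in the cache")
```

Running this shows that all but one or two of the new blob's chunks are already cached, which is exactly the reuse the remote cache exploits.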

Results: 85% Deduplication on Eligible Writes

In production, BuildBuddy's CDC deduplicated about 85% of written bytes across eligible cache writes. Over a two-week window, CDC skipped uploading ~300 TiB of duplicate chunk data on the write path, with peaks over 4 TiB per hour.

BuildBuddy currently applies chunking only to blobs larger than 2 MiB; in one test, just 4.2% of objects cleared that threshold. Within that eligible subset, CDC deduplicated about 85% of written bytes; across all cache traffic, savings typically land in the 20 to 40% range, since those few large objects carry much of the byte volume.

As a rule of thumb, CDC works best for outputs that are large and byte-stable across revisions. Linking and packaging tend to be good fits. Bundling is also a good fit when the output is not compressed, obfuscated, or randomized. Compressed formats like tar.gz and Docker image layers are often less chunkable because a small input change can rewrite more of the compressed byte stream.

Implementation: SplitBlob and SpliceBlob

To make CDC work end to end, the change lands in three places:

  • Remote APIs define the shared SplitBlob / SpliceBlob protocol so clients and caches can talk about chunks.
  • BuildBuddy implements the server-side cache behavior and executor-side chunked uploads and downloads.
  • Bazel implements the client-side combined cache path so the local disk cache and remote cache can share chunks.

SplitBlob is the read-side API: given the digest of a large blob, the client asks the cache if it already knows the chunk layout. If it does, the client can download only the chunks it does not already have.

SpliceBlob is the write-side API: after an action creates a large output, Bazel or the executor uploads any missing chunks and tells the cache how to reassemble the blob.
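The round trip can be sketched with a toy in-memory cache. The class and method names below are invented to mirror the SplitBlob/SpliceBlob ideas; the real protocol is a gRPC API defined in the remote-apis repository, with different message shapes.

```python
import hashlib

def digest(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

class ToyChunkCache:
    """In-memory stand-in for a CDC-aware cache (names invented here)."""
    def __init__(self):
        self.chunks = {}    # chunk digest -> chunk bytes
        self.layouts = {}   # blob digest  -> ordered chunk digests

    def split_blob(self, blob_digest):
        """Read side: return the blob's chunk layout, or None if unknown."""
        return self.layouts.get(blob_digest)

    def missing_chunks(self, chunk_digests):
        return [d for d in chunk_digests if d not in self.chunks]

    def splice_blob(self, blob_digest, chunk_digests, uploads):
        """Write side: store the missing chunks plus the reassembly recipe."""
        self.chunks.update(uploads)
        self.layouts[blob_digest] = chunk_digests

def upload(cache, blob, chunker):
    parts = chunker(blob)
    ds = [digest(p) for p in parts]
    need = set(cache.missing_chunks(ds))
    cache.splice_blob(digest(blob), ds,
                      {d: p for d, p in zip(ds, parts) if d in need})
    return len(need), len(ds)   # chunks actually sent vs. total

def download(cache, blob_digest):
    layout = cache.split_blob(blob_digest)
    return b"".join(cache.chunks[d] for d in layout)

# Fixed-size chunking keeps the demo tiny; a real client would use CDC.
chunker = lambda b: [b[i:i + 4] for i in range(0, len(b), 4)]
cache = ToyChunkCache()
upload(cache, b"aaaabbbbccccdddd", chunker)
sent, total = upload(cache, b"aaaaXXXXccccdddd", chunker)
print(f"second upload sent {sent} of {total} chunks")
assert download(cache, digest(b"aaaaXXXXccccdddd")) == b"aaaaXXXXccccdddd"
```

The second upload only transfers the one chunk that changed; the splice recipe lets the cache reassemble the full blob on demand.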

Why This Matters

This approach treats the problem as a generic output problem rather than requiring changes to individual tools like linkers or bundlers. It works with any action that produces large outputs, making builds more efficient without modifying build rules or toolchains.

Getting Started

To enable CDC with BuildBuddy, upgrade to Bazel 8.7 or 9.1+ and add --experimental_remote_cache_chunking to your .bazelrc. BuildBuddy handles the rest server-side. If you run your own remote cache, check the BuildBuddy blog for implementation details.
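As a minimal sketch, a .bazelrc might look like the following; the cache endpoint shown is an example placeholder, so substitute your own BuildBuddy (or compatible) cache URL:

```
# Requires Bazel 8.7 or 9.1+.
build --remote_cache=grpcs://remote.buildbuddy.io   # example endpoint
build --experimental_remote_cache_chunking
```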
