Cloudflare's Images Binding Hit by Hyper Race Condition

Cloudflare engineers spent six weeks debugging a race condition in the hyper HTTP library that caused truncated responses for large images in the Images binding. The bug, fixed with four lines of code, only manifested under specific concurrency conditions on production systems.

The Setup

The Images binding, built in Rust on Workers, runs across Cloudflare's edge network. It uses hyper (v0.14.x at the time, later tested on 1.7 and 1.8) to manage HTTP connections. In December 2025, the team replaced the intermediary service FL with a local worker binding using Unix sockets to reduce latency and decouple release cycles.

Shortly after rollout, customers reported that transformation requests for large images failed intermittently. Responses returned HTTP 200 with a Content-Length header promising several megabytes, but the body was truncated: for example, 200 KB out of 3.3 MB. No errors were logged.
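Because the status code and headers looked healthy, the only way to catch this failure on the receiving side is to count the bytes that actually arrive. The following is a minimal sketch of such a check, not code from the Images binding; `BodyCheck` and `check_body` are hypothetical names, and the body is assumed to be available as any `std::io::Read`.

```rust
use std::io::{self, Read};

/// Result of comparing a declared Content-Length with the bytes received.
#[derive(Debug, PartialEq)]
enum BodyCheck {
    Complete,
    Truncated { expected: u64, received: u64 },
}

/// Drains `body` and reports whether it delivered `content_length` bytes.
/// Hypothetical helper: reads at most `content_length` bytes, so a body that
/// is longer than declared is still reported as `Complete`.
fn check_body<R: Read>(body: R, content_length: u64) -> io::Result<BodyCheck> {
    let mut limited = body.take(content_length);
    // Copying into a sink counts bytes without holding the image in memory.
    let received = io::copy(&mut limited, &mut io::sink())?;
    if received == content_length {
        Ok(BodyCheck::Complete)
    } else {
        Ok(BodyCheck::Truncated { expected: content_length, received })
    }
}
```

With a body that ends early, `check_body(&b"he"[..], 5)` yields `Truncated { expected: 5, received: 2 }`, which is the kind of signal that was missing from the logs.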

Debugging Journey

  1. Reproduction: Engineers built a worker mimicking the nested setup (inner binding pipeline compositing multiple images, outer URL pipeline for compression). Isolating the binding alone triggered the bug: 19 out of 25 requests failed in one batch.

  2. Timeouts ruled out: Truncation wasn't correlated with request duration.

  3. Hyper version updates: Tested 0.14, 1.7, and 1.8; the bug persisted in all three.

  4. Local reproduction failed: macOS and Debian VMs never triggered the bug, even under load. It only appeared on production with real concurrency and a Workers runtime client.

  5. Workers runtime cleared: No syscalls indicated unexpected closes. Other services using the same client had no issues.

  6. Distributed tracing: Confirmed truncation occurred within the inner pipeline (binding path through Images service).

  7. Intermediary instrumentation: Body sizes were already truncated leaving the Images service.

  8. Images service tracing: Service processed requests correctly, encoded images, and sent HTTP 200.

The only consistent signal: timing-dependent, production-only, large images.

Strace Reveals the Bug

Using strace, the team captured syscalls. A successful request showed multiple sendto calls followed by shutdown:

sendto(42, "HTTP/1.1 200 OK\r\nContent-Length: 14991808\r\n...", ...) = 219264
sendto(42, "\xff\xd8\xff\xe0...", 292352) = 292352
// ... more writes ...
shutdown(42, SHUT_WR) = 0

A failing request showed only one write before shutdown:

sendto(42, "HTTP/1.1 200 OK\r\nContent-Length: 14991808\r\n...", ...) = 219264
shutdown(42, SHUT_WR) = 0

Only about 219 KB of the roughly 15 MB response was sent. The race: hyper began flushing its internal buffer to the kernel's socket buffer, the socket accepted only part of the data, and hyper called shutdown without confirming that the flush had finished. The remaining data never left hyper's buffer.
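The comparison between the two traces can be automated. The sketch below is an illustration rather than the team's tooling: it sums the return values of `sendto` calls up to the first `shutdown` and pairs the total with the Content-Length quoted in the trace. The function names are hypothetical, and it assumes strace was run with a string size large enough to show the header. Since the total includes header bytes, a total below the declared length proves truncation.

```rust
/// Returns the byte count a `sendto` line reports as written, if any.
fn sendto_written(line: &str) -> Option<u64> {
    let line = line.trim();
    if !line.starts_with("sendto(") {
        return None;
    }
    // strace prints the return value after the final " = ".
    let (_, ret) = line.rsplit_once(" = ")?;
    // Failed calls print "-1 EAGAIN ..."; parsing as u64 rejects them.
    ret.split_whitespace().next()?.parse().ok()
}

/// Extracts the Content-Length value from a line quoting the response head.
fn declared_length(line: &str) -> Option<u64> {
    let (_, rest) = line.split_once("Content-Length: ")?;
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}

/// Returns (declared Content-Length, bytes written before the first shutdown).
fn account(trace: &str) -> (Option<u64>, u64) {
    let mut declared = None;
    let mut written = 0;
    for line in trace.lines() {
        if line.trim_start().starts_with("shutdown(") {
            break;
        }
        if let Some(n) = sendto_written(line) {
            written += n;
            if declared.is_none() {
                declared = declared_length(line);
            }
        }
    }
    (declared, written)
}
```

Applied to the failing trace above, `account` returns `(Some(14991808), 219264)`.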

The Fix

The fix was four lines of code: after flushing, hyper now checks whether the flush actually completed (i.e., all bytes were written to the kernel buffer) before issuing shutdown. If not, it retries.
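The shape of the fix can be modelled without hyper or an async runtime. The sketch below is not hyper's code: `MockSocket`, `Conn`, and both `poll_shutdown_*` methods are hypothetical stand-ins, using only `std::task::Poll` to mimic a socket whose buffer fills up mid-flush. The buggy variant ignores a pending flush and shuts down anyway; the fixed variant shuts down only once the flush reports completion.

```rust
use std::io;
use std::task::Poll;

/// Hypothetical non-blocking socket that accepts `capacity` bytes, then
/// reports WouldBlock until `drain` simulates the peer reading.
struct MockSocket {
    capacity: usize,
    room: usize,
    sent: Vec<u8>,
    shut_down: bool,
}

impl MockSocket {
    fn new(capacity: usize) -> Self {
        MockSocket { capacity, room: capacity, sent: Vec::new(), shut_down: false }
    }

    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        if self.room == 0 {
            return Err(io::ErrorKind::WouldBlock.into());
        }
        let n = data.len().min(self.room);
        self.sent.extend_from_slice(&data[..n]);
        self.room -= n;
        Ok(n)
    }

    fn drain(&mut self) {
        self.room = self.capacity;
    }

    fn shutdown(&mut self) {
        self.shut_down = true;
    }
}

/// Hypothetical connection holding a user-space write buffer.
struct Conn {
    sock: MockSocket,
    buf: Vec<u8>,
}

impl Conn {
    /// Moves buffered bytes to the socket; Pending if the socket fills first.
    fn poll_flush(&mut self) -> Poll<io::Result<()>> {
        while !self.buf.is_empty() {
            match self.sock.write(&self.buf) {
                Ok(n) => drop(self.buf.drain(..n)),
                Err(e) if e.kind() == io::ErrorKind::WouldBlock => return Poll::Pending,
                Err(e) => return Poll::Ready(Err(e)),
            }
        }
        Poll::Ready(Ok(()))
    }

    /// Buggy: discards the flush result, so buffered bytes can be stranded.
    fn poll_shutdown_buggy(&mut self) -> Poll<io::Result<()>> {
        let _ = self.poll_flush();
        self.sock.shutdown();
        Poll::Ready(Ok(()))
    }

    /// Fixed: shuts down only after the flush has completed; otherwise the
    /// caller is expected to poll again later.
    fn poll_shutdown_fixed(&mut self) -> Poll<io::Result<()>> {
        match self.poll_flush() {
            Poll::Ready(Ok(())) => {
                self.sock.shutdown();
                Poll::Ready(Ok(()))
            }
            Poll::Ready(Err(e)) => Poll::Ready(Err(e)),
            Poll::Pending => Poll::Pending,
        }
    }
}
```

With a 4-byte socket and a 10-byte buffer, the buggy path shuts down after 4 bytes, mirroring the single `sendto` in the failing trace. The fixed path returns Pending until the caller drains the socket and polls again, and all 10 bytes are sent before shutdown.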

Why It Matters

This bug highlights the subtlety of I/O race conditions in async Rust libraries. For developers using hyper or similar HTTP libraries, it's a reminder that flush and shutdown semantics can be non-trivial, especially with kernel buffering. The issue also underscores the value of strace for debugging timing-dependent bugs that don't reproduce locally.

Key Takeaway

When dealing with large payloads and async I/O, verify that all data has actually been handed to the kernel before closing the connection. Treat an incomplete flush as unfinished work and retry it until it completes, rather than proceeding straight to shutdown.