The Inspection Paradox: Why Your Users See Slower Latency Than You

Marc Brooker explains why your service's mean latency or MTTR can misrepresent what users experience. Because of the inspection paradox, users see a duration-weighted (t-weighted) distribution, so the tail dominates. In a log-normal simulation, a 30-minute median TTR with a 10-hour p99 gives an MTTR of just over an hour but a mean user-experienced recovery time of about 6 hours.

2 min read · Jun 21, 2026

Meet Alice. She uses your web service. You tell her the mean request time is 100ms. She says her mean wait time is 1s. You're both right.

Meet Alex. He uses your service during outages. You tell him MTTR is under 1 minute. He says the mean outage he sees lasts 1 hour. Again, both right.

What's happening? You're averaging over requests or outages. Alice and Alex are averaging over the seconds they spend waiting. A long request counts once in your metrics, but it takes up far more of their time.

Technically, this is the inspection paradox. Users don't experience your latency distribution (f(t)); they experience a t-weighted version, in which each duration counts in proportion to its length. If your mean request time is (\mathbb{E}[X]), the mean that users experience is (\mathbb{E}_a[X] = \frac{\mathbb{E}[X^2]}{\mathbb{E}[X]} = \mathbb{E}[X] + \frac{\mathrm{Var}(X)}{\mathbb{E}[X]}). The gap between the two means grows with the variance.
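
A short derivation sketch of this identity, assuming X is a non-negative duration with density (f(t)) and a finite second moment, and that a user is equally likely to be waiting at any instant. The symbol (f_a) is a label introduced here for the density users see; it is not taken from Brooker's post.

```latex
\begin{align*}
f_a(t) &= \frac{t\, f(t)}{\int_0^\infty s\, f(s)\, ds} = \frac{t\, f(t)}{\mathbb{E}[X]} \\
\mathbb{E}_a[X] &= \int_0^\infty t\, f_a(t)\, dt
  = \frac{1}{\mathbb{E}[X]} \int_0^\infty t^2 f(t)\, dt
  = \frac{\mathbb{E}[X^2]}{\mathbb{E}[X]} \\
&= \frac{\mathrm{Var}(X) + \mathbb{E}[X]^2}{\mathbb{E}[X]}
  = \mathbb{E}[X] + \frac{\mathrm{Var}(X)}{\mathbb{E}[X]}
\end{align*}
```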

Simulation: Latency vs. User Experience

Marc Brooker provides an interactive simulation. Plug in median and p99 latency (or recovery time), and it fits a log-normal distribution. It then plots what your service sees versus what users experience.

Example: median time to recovery (TTR) = 30 minutes, p99 TTR = 600 minutes (10 hours). Your MTTR is just over an hour. Users experience a mean recovery time of around 6 hours.
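
The example can be reproduced in a few lines. This is a minimal sketch, not Brooker's simulation code: it assumes the fit simply pins the log-normal's median and 99th percentile to the two inputs, and the function names (`fit_lognormal`, `service_mean`, `user_mean`) are made up for illustration.

```python
import math
from statistics import NormalDist


def fit_lognormal(median, p99):
    """Return (mu, sigma) of the log-normal with this median and p99.

    For a log-normal, ln(X) is normal, so median = exp(mu) and
    p99 = exp(mu + z99 * sigma), where z99 is the standard normal
    99th percentile.
    """
    z99 = NormalDist().inv_cdf(0.99)
    mu = math.log(median)
    sigma = (math.log(p99) - mu) / z99
    return mu, sigma


def service_mean(mu, sigma):
    """Mean as the service sees it: E[X] = exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


def user_mean(mu, sigma):
    """Mean as users see it: E[X^2] / E[X] = exp(mu + 1.5 * sigma^2)."""
    return math.exp(mu + 1.5 * sigma ** 2)


# Inputs from the example above, in minutes.
mu, sigma = fit_lognormal(median=30.0, p99=600.0)
mttr = service_mean(mu, sigma)
experienced = user_mean(mu, sigma)
print(f"MTTR: {mttr:.1f} min, user-experienced mean: {experienced:.1f} min")
```

Under these assumptions the MTTR comes out just over an hour and the user-experienced mean at roughly six hours, in line with the figures quoted above.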

Why Tail Latency Matters

For service times, timeout-and-retry can hide tail latency, unless the running request holds locks or other exclusive resources. For recovery time, no such hiding is possible, so the heaviness of the tail dominates.

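This is why Brooker dislikes trimmed measurements such as trimmed means: they discard the right tail, which is exactly the part that drives user experience. Another reason relates to Little's Law and capacity usage: average concurrency is arrival rate times mean time in system, so capacity depends on the full, untrimmed mean.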

Technical Note on Log-Normal

Brooker chose log-normal for numerical convenience: under t-weighting, (\mathrm{lognormal}(\mu, \sigma^2)) becomes (\mathrm{lognormal}(\mu + \sigma^2, \sigma^2)), so the distribution users see is again log-normal. The log-normal is also well-behaved around zero. However, he doesn't recommend fitting a log-normal to real latency or recovery time metrics; prefer non-parametric approaches.
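
The shift in the first parameter can be checked by completing the square in the exponent. A sketch, writing (f) for the (\mathrm{lognormal}(\mu, \sigma^2)) density and dropping constant factors that do not depend on (t):

```latex
\begin{align*}
f(t) &= \frac{1}{t\,\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(\ln t - \mu)^2}{2\sigma^2}\right) \\
t\, f(t) &\propto \frac{1}{t}
  \exp\!\left(\ln t - \frac{(\ln t - \mu)^2}{2\sigma^2}\right) \\
&= \frac{1}{t}
  \exp\!\left(-\frac{\bigl(\ln t - (\mu + \sigma^2)\bigr)^2}{2\sigma^2}\right)
  \exp\!\left(\mu + \frac{\sigma^2}{2}\right)
\end{align*}
```

The last factor is (\mathbb{E}[X]), which is exactly the normalizing constant, and what remains is the (\mathrm{lognormal}(\mu + \sigma^2, \sigma^2)) density. Its mean is (\exp(\mu + \tfrac{3}{2}\sigma^2)), which agrees with (\mathbb{E}[X^2]/\mathbb{E}[X]).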

What You Should Do

Stop reporting only mean latency or MTTR. Include tail metrics (p99, p99.9) and communicate the user-weighted experience. For recovery, track the distribution of outage durations as users see them, not just system averages. To estimate the user-perceived mean, consider the formula (\mathbb{E}[X] + \mathrm{Var}(X)/\mathbb{E}[X]), computed from untrimmed data.
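
A non-parametric version of these metrics can be computed directly from recorded durations. The sketch below is one possible approach, not something prescribed by the source: the function names and the sample data are hypothetical, and it assumes a non-empty list of positive durations.

```python
def request_weighted_mean(durations):
    """Ordinary mean: every request or outage counts once."""
    return sum(durations) / len(durations)


def user_weighted_mean(durations):
    """Sample analogue of E[X^2] / E[X]: each item is weighted by its duration."""
    return sum(d * d for d in durations) / sum(durations)


def user_weighted_quantile(durations, q):
    """Smallest duration d such that items of length <= d account for
    at least a fraction q of the total time spent waiting."""
    total = sum(durations)
    running = 0.0
    for d in sorted(durations):
        running += d
        if running >= q * total:
            return d
    return max(durations)


# Hypothetical outage durations in minutes, made up for illustration.
outages = [5, 5, 10, 10, 15, 20, 30, 45, 60, 600]

print("system mean:", request_weighted_mean(outages))
print("user-weighted mean:", user_weighted_mean(outages))
print("user-weighted median:", user_weighted_quantile(outages, 0.5))
```

With this made-up data the system mean is 80 minutes, while the user-weighted mean is about 459 minutes and the user-weighted median is the 600-minute outage, because that single incident accounts for three quarters of all downtime.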

Key Takeaway

Your users are Alice and Alex. They don't care about your average; they care about the time they spend waiting. The inspection paradox ensures that the long tail dominates their experience. Measure and optimize for that.

Editor's Take

I've seen teams celebrate single-digit millisecond averages while users complained about slowness. Now I know why: the variance was huge. I'm switching to reporting p99 and the user-weighted mean. This article gave me the formula and the intuition to argue for it. Every SRE should bookmark this.

— DevDigest Editorial

Key Takeaways

• Report p99 and the user-weighted mean latency, not just the average.
• Use the formula E[X] + Var(X)/E[X] to estimate the user-perceived mean.
• For recovery time, track outage durations as seen by users, not just system MTTR.

Why It Matters

If you report only mean latency or MTTR, you're misleading yourself and your users. The inspection paradox means users experience a heavier-tailed distribution. Understanding this helps you prioritize tail latency and recovery time improvements that actually impact customer satisfaction.
