The Inspection Paradox: Why Your Users See Higher Latency Than You Do

Meet Alice. She uses your web service. You tell her the mean request time is 100ms. She says her mean wait time is 1s. You're both right.

Meet Alex. He uses your service during outages. You tell him MTTR is under 1 minute. He says the mean outage he sees lasts 1 hour. Again, both right.

What's happening? You're averaging per request or per outage. Alice and Alex average per second spent waiting. A long request counts once in your metrics, but it fills many seconds of their experience.

Technically, this is the inspection paradox. Users don't experience your latency distribution (f(t)); they experience a t-weighted version, (f_a(t) = \frac{t\,f(t)}{\mathbb{E}[X]}), because the chance that a user is caught inside a given request grows in proportion to its duration t. If your mean request time is (\mathbb{E}[X]), users experience (\mathbb{E}_a[X] = \frac{\mathbb{E}[X^2]}{\mathbb{E}[X]} = \mathbb{E}[X] + \frac{\mathrm{Var}(X)}{\mathbb{E}[X]}).
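To make the weighting concrete, here is a minimal simulation sketch. The latency mixture (99% of requests at 70 ms, 1% at 3.1 s) is a made-up assumption, chosen only because it lands near Alice's numbers. The function names are hypothetical too. The sketch lays requests end to end, drops observers at random instants, and compares what they see with the formula above.

```python
import bisect
import itertools
import random


def request_mean(latencies):
    """Mean per request: what the service's own metrics report."""
    return sum(latencies) / len(latencies)


def time_weighted_mean(latencies):
    """E[X^2] / E[X]: each request weighted by its own duration."""
    return sum(x * x for x in latencies) / sum(latencies)


def observed_mean(latencies, n_observers, rng):
    """Lay requests end to end, drop observers at uniformly random
    instants, and average the length of the request each one lands in."""
    ends = list(itertools.accumulate(latencies))  # end time of each request
    total = ends[-1]
    seen = 0.0
    for _ in range(n_observers):
        t = rng.random() * total
        # First request whose end time is after t; the clamp guards
        # against a floating-point edge case at the very end.
        idx = min(bisect.bisect_right(ends, t), len(latencies) - 1)
        seen += latencies[idx]
    return seen / n_observers


rng = random.Random(42)
# Hypothetical mixture: 99% of requests take 70 ms, 1% take 3.1 s.
latencies = [3.1 if rng.random() < 0.01 else 0.07 for _ in range(100_000)]

print(f"per-request mean:    {request_mean(latencies):.3f} s")
print(f"E[X^2] / E[X]:       {time_weighted_mean(latencies):.3f} s")
print(f"random-instant mean: {observed_mean(latencies, 20_000, rng):.3f} s")
```

Under these assumptions the per-request mean should come out near 100 ms, while both the formula and the random-instant observers should land near 1 s.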

Simulation: Latency vs. User Experience

Marc Brooker provides an interactive simulation. Plug in median and p99 latency (or recovery time), and it fits a log-normal distribution. It then plots what your service sees versus what users experience.

Example: median TTR = 30 minutes, p99 TTR = 600 minutes (10 hours). Your MTTR is just over an hour. Users experience a mean recovery time of around 6 hours.
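The arithmetic behind this example can be sketched in a few lines. This is not Brooker's code. It is a reconstruction that assumes the fit takes the median as exp(mu) and solves for sigma from the p99, using the standard log-normal moment formulas. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def fit_lognormal(median, p99):
    """Solve for (mu, sigma) of a log-normal with the given median and p99."""
    mu = math.log(median)               # the median of a log-normal is exp(mu)
    z99 = NormalDist().inv_cdf(0.99)    # standard normal 99th percentile
    sigma = (math.log(p99) - mu) / z99
    return mu, sigma


def system_mean(mu, sigma):
    """E[X] for a log-normal: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


def user_mean(mu, sigma):
    """E[X^2] / E[X] for a log-normal: exp(mu + 3 * sigma^2 / 2)."""
    return math.exp(mu + 1.5 * sigma ** 2)


mu, sigma = fit_lognormal(median=30.0, p99=600.0)  # minutes
print(f"MTTR as the service sees it: {system_mean(mu, sigma):.0f} min")
print(f"mean recovery users see:     {user_mean(mu, sigma):.0f} min")
```

With median 30 and p99 600, this gives an MTTR of roughly 69 minutes and a user-side mean of roughly 361 minutes, in line with the figures quoted above.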

Why Tail Latency Matters

For service times, timeout-and-retry can hide tail latency, unless the slow request holds locks or other exclusive resources while it runs. Recovery time offers no such escape: users wait out the whole outage, so the weight of the tail dominates.

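This is why Brooker dislikes trimmed measurements (e.g., trimmed means). They discard exactly the part of the right tail that drives user experience. Another reason relates to Little's Law and capacity usage: the mean number of requests in flight equals the arrival rate times the untrimmed mean latency, so a trimmed mean understates load as well.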
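A small sketch shows how much a trimmed mean can hide. Everything here is an assumption for illustration: the latencies are synthetic log-normal samples (median 50 ms, sigma 1.0), the arrival rate is invented, and `trimmed_mean` is a hypothetical helper that drops the slowest 1% of samples. The in-flight figures use Little's Law (mean concurrency = arrival rate × mean latency). Whether that is the exact capacity argument Brooker has in mind is also an assumption.

```python
import math
import random


def trimmed_mean(xs, drop_top=0.01):
    """Mean after discarding the largest `drop_top` fraction of samples."""
    kept = sorted(xs)[: len(xs) - int(len(xs) * drop_top)]
    return sum(kept) / len(kept)


rng = random.Random(1)
# Hypothetical latencies in seconds: log-normal, median 50 ms, sigma 1.0.
latencies = [rng.lognormvariate(math.log(0.05), 1.0) for _ in range(100_000)]

full = sum(latencies) / len(latencies)
trimmed = trimmed_mean(latencies)
user = sum(x * x for x in latencies) / sum(latencies)  # E[X^2] / E[X]

arrival_rate = 1000.0  # hypothetical requests per second
print(f"trimmed mean:       {trimmed:.4f} s -> {arrival_rate * trimmed:.0f} in flight")
print(f"full mean:          {full:.4f} s -> {arrival_rate * full:.0f} in flight")
print(f"user-weighted mean: {user:.4f} s")
```

The ordering is the point: the trimmed mean sits below the full mean, which in turn sits well below the user-weighted mean.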

Technical Note on Log-Normal

Brooker chose log-normal for numerical convenience: under the t-weighting, (\mathrm{lognormal}(\mu, \sigma^2)) becomes (\mathrm{lognormal}(\mu + \sigma^2, \sigma^2)), so the distribution users experience has a closed form. It's also well-behaved around zero. However, he doesn't recommend log-normal for real latency or recovery time metrics; prefer non-parametric approaches.
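For readers who want to check the shift in (\mu), here is a short derivation sketch. It uses the standard log-normal density and mean, with (f_a) denoting the t-weighted density (notation assumed here, not taken from Brooker).

```latex
\begin{align*}
f(t) &= \frac{1}{t \sigma \sqrt{2\pi}}
        \exp\!\left(-\frac{(\ln t - \mu)^2}{2\sigma^2}\right),
\qquad \mathbb{E}[X] = e^{\mu + \sigma^2/2} \\
f_a(t) &= \frac{t\, f(t)}{\mathbb{E}[X]}
        = \frac{1}{\sigma \sqrt{2\pi}}
          \exp\!\left(-\frac{(\ln t - \mu)^2}{2\sigma^2} - \mu - \frac{\sigma^2}{2}\right) \\
       &= \frac{1}{t \sigma \sqrt{2\pi}}
          \exp\!\left(-\frac{(\ln t - \mu)^2}{2\sigma^2} + (\ln t - \mu) - \frac{\sigma^2}{2}\right) \\
       &= \frac{1}{t \sigma \sqrt{2\pi}}
          \exp\!\left(-\frac{(\ln t - \mu - \sigma^2)^2}{2\sigma^2}\right)
\end{align*}
```

The last line is the density of (\mathrm{lognormal}(\mu + \sigma^2, \sigma^2)). Its mean is (e^{\mu + 3\sigma^2/2} = \mathbb{E}[X]\, e^{\sigma^2}), which agrees with (\mathbb{E}[X] + \mathrm{Var}(X)/\mathbb{E}[X]) because (\mathrm{Var}(X) = \mathbb{E}[X]^2 (e^{\sigma^2} - 1)) for a log-normal.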

What You Should Do

Stop reporting only mean latency or MTTR. Include tail metrics (p99, p99.9) and communicate the user-weighted experience. For recovery, track the distribution of outage durations as seen by users, not just system averages. Consider using the formula (\mathbb{E}[X] + \mathrm{Var}(X)/\mathbb{E}[X]) to estimate user-perceived mean.
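One non-parametric way to act on this is to compute the summary directly from recorded samples, with no distribution fitted. The sketch below is only an illustration: `latency_report` and `percentile` are hypothetical helpers, the percentile uses the simple nearest-rank rule, and the outage durations are invented.

```python
import math
import statistics


def percentile(sorted_xs, p):
    """Nearest-rank percentile of an already sorted list, p in (0, 100]."""
    rank = math.ceil(p / 100 * len(sorted_xs))
    rank = min(max(rank, 1), len(sorted_xs))  # keep the rank inside the list
    return sorted_xs[rank - 1]


def latency_report(samples):
    """Summarize samples with tail percentiles and the user-weighted mean."""
    xs = sorted(samples)
    mean = statistics.fmean(xs)
    return {
        "mean": mean,
        "p50": percentile(xs, 50),
        "p99": percentile(xs, 99),
        "p99.9": percentile(xs, 99.9),
        # E[X] + Var(X) / E[X], using the population variance of the samples.
        "user_mean": mean + statistics.pvariance(xs) / mean,
    }


# Hypothetical outage durations in minutes.
outages = [5, 8, 12, 20, 30, 45, 600]
for name, value in latency_report(outages).items():
    print(f"{name:>9}: {value:.1f}")
```

For these made-up outages the plain mean is about 103 minutes, while the user-weighted mean is about 505 minutes, because the single 600-minute outage accounts for most of the time users spent waiting.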

Key Takeaway

Your users are Alice and Alex. They don't care about your average—they care about the time they spend waiting. The inspection paradox ensures that the long tail dominates their experience. Measure and optimize for that.