The RED Method: Request Rate, Errors, and Duration as Your Core SLIs

Monitoring distributed systems is a mess. You've got dashboards with a hundred charts, alerts firing for every blip, and everyone arguing over what matters. The RED method cuts through that noise. It says: track three things. That's it.

Rate. Errors. Duration.

Tom Wilkie (now at Grafana Labs) popularized this approach after his time at Google and Weaveworks. It's inspired by Google's SRE practices but simplified for the rest of us. Think of it as the three vital signs for any service.

Why Three Metrics?

Because everything else is vanity. CPU usage? It matters, but it's a resource metric, not a service health metric. Disk I/O? Same deal. The RED method focuses on what your users actually experience: how many requests are coming in, how many of them are failing, and how fast the rest are being handled.

Rate tells you traffic volume. Spikes might mean a DDoS attack or a successful product launch. Drops could indicate a routing failure or a dead service.

Errors tell you something's wrong. HTTP 500s, timeouts, failed database queries. If errors go up, you have a problem.

Duration tells you about performance. High latency frustrates users. Tail latencies (like the 99th percentile) reveal issues that averages hide.

How to Implement RED

Start by instrumenting every service with these three metrics. Use a tool like Prometheus to collect them. Then set up dashboards that show RED for each service. Don't add more metrics unless you have a specific reason.
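To make the instrumentation concrete, here's a minimal sketch of what recording all three signals per service could look like. It uses a plain in-memory dictionary and a hypothetical `red_instrumented` decorator rather than a real metrics library, purely for illustration; in practice you'd use a client library such as prometheus_client, where the same idea maps to a Counter for requests, a Counter for errors, and a Histogram for duration.

```python
import time
from collections import defaultdict
from functools import wraps

# Illustrative in-memory store; a real setup would export these
# through a metrics client instead of keeping them in a dict.
metrics = {
    "requests": defaultdict(int),    # Rate: request count per service
    "errors": defaultdict(int),      # Errors: failed requests per service
    "durations": defaultdict(list),  # Duration: latencies (seconds) per service
}

def red_instrumented(service):
    """Record Rate, Errors, and Duration for every call to the handler."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            metrics["requests"][service] += 1
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                metrics["errors"][service] += 1
                raise
            finally:
                metrics["durations"][service].append(time.perf_counter() - start)
        return wrapper
    return decorator

# Hypothetical handler for demonstration.
@red_instrumented("auth")
def authenticate(user):
    if user is None:
        raise ValueError("missing user")
    return True
```

The point of the decorator shape is that every service handler gets the same three measurements with one line of code, which keeps the dashboard uniform across services.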

Here's the cynical developer take: you'll be tempted to add "just one more" metric. Don't. I've seen teams add database connection pool size and cache hit ratios to their RED dashboard. Before long, you're back to the same mess. Stick to the three. If you need more, create a separate dashboard for deep dives.

RED in Practice

Let's say you have a microservice that handles user authentication. Your RED dashboard shows:

  • Rate: 1000 requests/second
  • Errors: 0.1% (that's 1 in 1000 failing)
  • Duration: p50=50ms, p95=200ms, p99=500ms
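Those p50/p95/p99 numbers come from the recorded latencies. As a sketch, a nearest-rank percentile over a window of duration samples looks like this (real systems usually use histogram buckets instead of raw samples, since storing every latency doesn't scale):

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile of a list of samples, q in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: the ceil(q% * n)-th smallest value, 1-indexed.
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[rank - 1]
```

Reporting p50, p95, and p99 side by side is what exposes the gap between the typical request and the slow tail.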

That's healthy. Now imagine errors jump to 5%. You immediately know something's broken. You don't need to check CPU or memory first. You go straight to the service logs.

Or duration p99 shoots up to 2 seconds. Users are waiting. Time to look at dependencies like the database or external API.
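Both of those scenarios are exactly what an alert rule over a RED snapshot catches. Here's a sketch of such a check; the threshold values are illustrative assumptions for the authentication example above, not universal defaults:

```python
def check_red(rate, error_ratio, p99_seconds,
              max_error_ratio=0.01, max_p99=1.0, min_rate=1.0):
    """Return alert messages for a RED snapshot; empty list means healthy.

    Thresholds are example values; derive real ones from your baselines.
    """
    alerts = []
    if error_ratio > max_error_ratio:
        alerts.append(f"error ratio {error_ratio:.1%} above {max_error_ratio:.1%}")
    if p99_seconds > max_p99:
        alerts.append(f"p99 {p99_seconds:.2f}s above {max_p99:.2f}s limit")
    if rate < min_rate:
        alerts.append(f"rate {rate}/s below {min_rate}/s: traffic blocked?")
    return alerts
```

Note the rate check alerts on a drop, not a spike, matching the point above that vanishing traffic can be worse than an error spike.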

RED vs. USE

You might have heard of the USE method (Utilization, Saturation, Errors) for infrastructure. RED is for services. USE is for resources. They work together: USE tells you if your server is overwhelmed, RED tells you if your service is healthy. Don't confuse them.

Common Pitfalls

  • Over-aggregation: Don't average across all services. Each service gets its own RED.
  • Ignoring rate changes: A sudden drop in rate might be worse than an error spike. It could mean traffic is being blocked or redirected.
  • Not setting thresholds: Without baselines, you won't know what 'normal' is. Start with historical data and adjust.
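One simple way to turn historical data into a baseline is a "normal band" of mean ± k standard deviations over recent samples. The k=3 default below is an assumption to tune, and real systems also need to account for daily and weekly seasonality, which this sketch ignores:

```python
import statistics

def baseline_band(history, k=3.0):
    """Compute a (low, high) 'normal' band from historical samples."""
    mean = statistics.fmean(history)
    sd = statistics.pstdev(history)
    return (mean - k * sd, mean + k * sd)

def is_anomalous(value, history, k=3.0):
    """True if value falls outside the historical normal band."""
    low, high = baseline_band(history, k)
    return value < low or value > high
```

This catches both over-threshold spikes and the rate drops called out above, since the band has a floor as well as a ceiling.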

Is RED Enough?

For most services, yes. But not for everything. If you're running a batch processing system, rate might not apply. Use the method that fits. The key is to be intentional about what you measure.

Final Thoughts

The RED method is a tool, not a religion. It simplifies monitoring by forcing you to focus on what matters. If you're drowning in metrics, start here. You'll thank yourself when an alert goes off and you instantly know what's happening.

Now go instrument your services. And for the love of all that is holy, don't add a fourth metric.