The Problem: One API is a Single Point of Failure

Two months ago, a 503 error from an AI API provider killed user sessions mid-conversation. The developer’s app relied solely on OpenAI’s GPT-4 for real-time responses. When the outage hit, requests timed out, then failed. The manual fix—updating code and redeploying—took an hour. That’s an hour of downtime.

The Naive First Attempt

The first fix was a simple try-except fallback: try OpenAI, if it fails, try Anthropic. Here’s the code:

import openai
import anthropic

def generate_response(prompt):
    try:
        return openai.ChatCompletion.create(model="gpt-4", messages=[{"role": "user", "content": prompt}])
    except:
        try:
            return anthropic.complete(prompt=prompt, model="claude-v1")
        except:
            raise Exception("Both providers failed")

This approach had four flaws:

  • No retries for transient errors.
  • Fixed fallback order: if OpenAI is down, Anthropic absorbs the entire load, and if it also fails there is nowhere left to go.
  • No timeouts: a slow provider could hang the entire system.
  • No insight into failure rates.
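
The timeout flaw is the easiest one to address in isolation. The following is a minimal sketch, assuming the provider call is an async function that takes a prompt and returns text. The name call_with_timeout and the 10-second default are illustrative and do not come from the original code.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_timeout(
    provider_call: Callable[[str], Awaitable[str]],
    prompt: str,
    timeout_s: float = 10.0,  # hypothetical default, tune per provider
) -> str:
    """Run one provider call and raise TimeoutError if it takes too long."""
    try:
        # wait_for cancels the underlying call once the timeout expires.
        return await asyncio.wait_for(provider_call(prompt), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise TimeoutError(f"provider call exceeded {timeout_s}s") from exc
```

With a wrapper like this, a slow provider turns into an ordinary exception that the fallback logic can catch, instead of a request that hangs.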

The Solution: Weighted Multi-Provider Router

The developer built a Python library with three mechanisms:

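  1. Weighted selection – Assign weights to providers (e.g., 3 for GPT-4, 1 for Claude, 1 for a free model). Requests are distributed in proportion to those weights; the listing below does this with a weighted random pick rather than a strict round-robin. A provider that fails repeatedly is temporarily taken out of rotation by the circuit breaker.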
  2. Exponential backoff with jitter – Retry failed requests with increasing delays, randomized to avoid thundering herd.
  3. Circuit breaker – If a provider fails 3 times in 60 seconds, stop sending requests to it for a cooldown period.

Here’s the core implementation:

import asyncio
import random
import time
from typing import Awaitable, Callable, List

class AIProvider:
    def __init__(self, name: str, weight: int, callable: Callable[[str], Awaitable[str]]):
        self.name = name
        self.weight = weight
        self.callable = callable  # async function: prompt -> response text
        self.failures = 0  # consecutive failures since the last success
        self.last_failure_time = 0.0
        self.circuit_open = False

class MultiProviderRouter:
    def __init__(self, providers: List[AIProvider], circuit_breaker_threshold: int = 3, circuit_breaker_timeout: int = 60):
        self.providers = providers
        self.circuit_breaker_threshold = circuit_breaker_threshold
        self.circuit_breaker_timeout = circuit_breaker_timeout  # cooldown in seconds
        self._reset_tasks = set()  # hold references so reset tasks are not garbage-collected

    def _select_provider(self):
        # Weighted random choice among providers whose circuit is closed.
        available = [p for p in self.providers if not p.circuit_open]
        if not available:
            raise RuntimeError("All providers are in circuit breaker mode")
        total_weight = sum(p.weight for p in available)
        r = random.uniform(0, total_weight)
        cumulative = 0
        for p in available:
            cumulative += p.weight
            if r <= cumulative:
                return p
        return available[-1]

    async def call(self, prompt: str, max_retries: int = 3):
        last_error = None
        for attempt in range(max_retries):
            provider = self._select_provider()
            try:
                result = await provider.callable(prompt)
                provider.failures = 0
                return result
            except Exception as e:
                last_error = e
                provider.failures += 1
                provider.last_failure_time = time.time()
                if provider.failures >= self.circuit_breaker_threshold and not provider.circuit_open:
                    provider.circuit_open = True
                    task = asyncio.create_task(self._reset_circuit(provider))
                    self._reset_tasks.add(task)
                    task.add_done_callback(self._reset_tasks.discard)
                if attempt < max_retries - 1:
                    # Exponential backoff with jitter; no wait after the final attempt.
                    await asyncio.sleep((2 ** attempt) + random.random())
        raise RuntimeError("All retries exhausted") from last_error

    async def _reset_circuit(self, provider):
        # Close the circuit again once the cooldown has elapsed.
        await asyncio.sleep(self.circuit_breaker_timeout)
        provider.circuit_open = False
        provider.failures = 0
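
The listing above opens the circuit after three consecutive failures and records last_failure_time without using it, so it does not strictly enforce the "3 failures in 60 seconds" rule. One possible way to count failures inside a time window is sketched here. The class name SlidingWindowBreaker and its parameters are hypothetical and are not part of the developer's library.

```python
import time
from collections import deque
from typing import Deque, Optional


class SlidingWindowBreaker:
    """Opens when `threshold` failures occur within `window_s` seconds."""

    def __init__(self, threshold: int = 3, window_s: float = 60.0, cooldown_s: float = 60.0):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._failures: Deque[float] = deque()  # timestamps of recent failures
        self._opened_at: Optional[float] = None

    def record_failure(self, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._failures.append(now)
        # Drop failures that have fallen out of the window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.threshold:
            self._opened_at = now

    def is_open(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self._opened_at is None:
            return False
        if now - self._opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the circuit and forget old failures.
            self._opened_at = None
            self._failures.clear()
            return False
        return True
```

The `now` argument exists only to make the behavior easy to test. In a router, each provider would own one breaker, and the selection step would skip any provider whose is_open() returns True.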

To use the router, wrap each API call in an async function and pass the wrapped providers to the router:

```python
import asyncio

async def call_openai(prompt: str) -> str:
    # your real implementation
    ...

async def call_anthropic(prompt: str) -> str:
    ...

router = MultiProviderRouter([
    AIProvider("openai", weight=3, callable=call_openai),
    AIProvider("anthropic", weight=2, callable=call_anthropic),
    # AIProvider("local", weight=1, callable=call_local_small_model),
])

async def main():
    # await is only valid inside a coroutine, so run the call from main()
    result = await router.call("Explain quantum entanglement like I'm 5")
    print(result)

asyncio.run(main())
```

The developer also added Prometheus metrics (counters and histograms) to track success/failure rates and adjust weights based on real data.
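
The post does not show the metrics code. As a rough, dependency-free stand-in for the counters and histograms, the sketch below records outcomes per provider and derives a weight from the observed success rate. ProviderStats, adjusted_weight, and the 0.1 floor are assumptions made for illustration; a real deployment would export these numbers through a Prometheus client instead of keeping them in memory.

```python
from collections import defaultdict
from typing import Dict, List


class ProviderStats:
    """In-memory stand-in for per-provider counters and latency samples."""

    def __init__(self) -> None:
        self.successes: Dict[str, int] = defaultdict(int)
        self.failures: Dict[str, int] = defaultdict(int)
        self.latencies_s: Dict[str, List[float]] = defaultdict(list)

    def record(self, provider: str, ok: bool, latency_s: float) -> None:
        if ok:
            self.successes[provider] += 1
        else:
            self.failures[provider] += 1
        self.latencies_s[provider].append(latency_s)

    def success_rate(self, provider: str) -> float:
        total = self.successes[provider] + self.failures[provider]
        # With no data yet, assume the provider is healthy.
        return self.successes[provider] / total if total else 1.0


def adjusted_weight(base_weight: float, success_rate: float, floor: float = 0.1) -> float:
    """Scale a base weight by success rate, never dropping below `floor`."""
    return max(floor, base_weight * success_rate)
```

Because the router's selection uses random.uniform over the summed weights, fractional weights produced this way work without further changes.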

Lessons Learned

  • Quality vs. cost: Weighting GPT-4 higher kept quality high, but during slowdowns, the router used cheaper models, saving money. Trade-off: occasional lower-quality responses during outages.
  • Circuit breaker tuning: Too sensitive (low threshold) switches too often, losing context. Too lenient keeps hitting a dead provider. Settled on 3 failures in 60 seconds.
  • Idempotency: The router doesn’t guarantee exactly-once delivery. If a request times out but actually succeeded, downstream may get a duplicate. Handle that on your end.
  • Debugging is harder: When a response looks weird, you need to check which provider served it. The developer added an X-Provider header to responses.
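
For the idempotency point, one common approach is to attach a request ID to every request and have the downstream consumer ignore IDs it has already processed. This is a minimal sketch under that assumption. ResponseDeduper and the request_id field are hypothetical names, and the in-memory set is unbounded, so a real system would need expiry or a shared store.

```python
from typing import Set


class ResponseDeduper:
    """Tracks request IDs so a duplicated response is processed only once."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def accept(self, request_id: str) -> bool:
        """Return True the first time an ID is seen, False for duplicates."""
        if request_id in self._seen:
            return False
        self._seen.add(request_id)
        return True


def handle_response(deduper: ResponseDeduper, request_id: str, text: str, sink: list) -> None:
    # Only forward the response if this request ID has not been handled yet.
    if deduper.accept(request_id):
        sink.append(text)
```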

What to Do Differently Next Time

Start with a simple fallback and add metrics before building the full router. The circuit breaker and weights came from seeing real failure patterns. Also consider using a hosted service that does this for you (e.g., ai.interwestinfo.com, though the author hasn’t used it). The technique is the same whether you build or buy.

Currently, the router handles 10,000+ requests a day with zero manual intervention. During a 6-hour outage, users barely noticed because the router silently switched to Anthropic, then to a local model.

The Real Takeaway

Resilience isn’t about eliminating failures—it’s about surviving them gracefully. A smart fallback strategy is cheap to implement and pays for itself the first time your primary API goes down. Don’t wait until your phone buzzes with angry users.

What’s your backup plan for AI API failures? Share your setup—simple fallback, multi-provider, or something totally different?