American Express Core Payments Goes Cell-Based

American Express's core payments ecosystem processes live transactions every day. To achieve high availability, low latency, and predictable performance, the company uses a cell-based architecture. The approach isolates failures within independent cells, prevents cascading outages, and scales capacity without expanding the failure domain.

What Is Cell-Based Architecture?

Cell-based architecture groups microservices, databases, and other components into independent instances called cells. Each cell can function independently. If one cell fails, the blast radius stays within that cell. The trade-off is increased management overhead and architectural complexity.

For mission-critical payments, the benefits outweigh the complexity. Cells also reduce latency by minimizing external dependencies and network hops. Scaling is achieved by adding more cells.

How American Express Implements Cells

Each instance of the core payments ecosystem is a cell. A cell is defined by its failure boundaries, not by a specific infrastructure construct. Cells never span multiple regions. Everything required to process transactions (DNS, databases, microservices, supporting services) stays local to the cell.

Data and Processing Locality

Processing payments requires data like currency rates and merchant category codes. Static and semi-static data is replicated to each cell ahead of time. This avoids cache-miss latency and preserves critical-path isolation.

For dynamic data that changes per transaction, American Express uses deterministic routing. The Global Transaction Router (GTR) routes transactions to the cell where the authoritative data already resides. For example, routing can be based on partner, market, or payment type.
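
To make deterministic routing concrete, here is a minimal Python sketch. It assumes a static rule table keyed on market and payment type; the table contents, cell names, and the `route` function are hypothetical and are not taken from American Express's actual GTR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    transaction_id: str
    partner: str
    market: str
    payment_type: str


# Hypothetical rule table: (market, payment_type) -> cell that holds the
# authoritative data. The real GTR rules are not described in the article.
ROUTING_RULES = {
    ("US", "card_authorization"): "cell-us-1",
    ("US", "account_transfer"): "cell-us-2",
    ("UK", "card_authorization"): "cell-uk-1",
}


def route(txn: Transaction) -> str:
    """Pick a cell from transaction attributes only, so the same
    transaction always lands on the same cell."""
    key = (txn.market, txn.payment_type)
    try:
        return ROUTING_RULES[key]
    except KeyError:
        # No rule means no cell owns this data; fail loudly instead of guessing.
        raise LookupError(f"no routing rule for {key}") from None
```

Because the decision depends only on attributes of the transaction, no lookup against shared state is needed to find the owning cell. A real router could also key on partner, which this sketch carries but does not use.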

Microservice communication is restricted to pod-to-pod interactions within the cell's Kubernetes network. Failover data is synchronized asynchronously via message-based replication, outside the transaction path.

Enforced Boundaries via the Global Transaction Router

The GTR enforces local-only processing. All transactions enter a cell through the GTR. If a transaction needs to be rerouted, it must go through the GTR again. The GTR acts as a payments mesh, connecting cells globally.

By tightly controlling cross-cell communication, American Express prevents cells from forming strong dependencies. Cells cannot communicate directly; only the GTR can route across cells. This occasionally leads to duplicated services, but it preserves cell independence and reduces latency.

Observability follows the same principle. Each cell publishes logs, metrics, and traces to local observability components first. Global aggregation happens asynchronously, outside the critical path.
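
The local-first pattern can be sketched in Python as follows. This is an illustration under assumptions, not the actual observability stack: the `CellTelemetry` class, its bounded queue, and the drop-on-full policy are all hypothetical choices made to show that recording an event never waits on global aggregation.

```python
import queue


class CellTelemetry:
    """Hypothetical per-cell telemetry buffer (names are illustrative)."""

    def __init__(self, cell_name, max_pending=1000):
        self.cell_name = cell_name
        self.local_events = []  # stands in for the cell-local store
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # events not forwarded because the queue was full

    def record(self, event):
        """Called on the transaction path: local write, non-blocking enqueue."""
        stamped = {"cell": self.cell_name, **event}
        self.local_events.append(stamped)
        try:
            self._pending.put_nowait(stamped)
        except queue.Full:
            # Assumption: losing a global copy is preferable to blocking a payment.
            self.dropped += 1

    def drain_to(self, global_sink):
        """Called by a background forwarder, outside the transaction path."""
        forwarded = 0
        while True:
            try:
                global_sink.append(self._pending.get_nowait())
            except queue.Empty:
                return forwarded
            forwarded += 1
```

If the global aggregator is slow or unreachable, `record` still returns immediately and the cell keeps its own full copy of the data.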

Failure Isolation and Rerouting

When a cell fails, its impact stays contained. New and in-flight transactions are rerouted to a healthy cell. The Payments Processing subsystem uses an orchestrator microservice. If a downstream service fails, the orchestrator detects it, halts processing, and sends the transaction back to the GTR for rerouting.

American Express does not resume partially processed transactions across cells. Instead, it restarts processing in another cell with the original transaction data. This restart is safe only while the transaction is still within the core payments ecosystem. Once a transaction is sent to an external system (e.g., card issuer), it cannot be rerouted.
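
The restart model can be sketched as below. This is a simplified Python illustration, assuming a fixed list of processing steps and a single point of no return; the step names, the `Cell` class, and the `failing_step` fault-injection parameter are hypothetical and exist only for the demonstration.

```python
from __future__ import annotations

from dataclasses import dataclass


class RerouteRequested(Exception):
    """A cell halted before the transaction left the ecosystem."""


class NotReroutable(Exception):
    """A failure occurred after the transaction reached an external system."""


# Hypothetical processing steps; the article does not list the real ones.
STEPS = ("validate", "enrich", "send_to_issuer", "record_response")
POINT_OF_NO_RETURN = "send_to_issuer"


@dataclass
class Cell:
    name: str
    failing_step: str | None = None  # fault injection for this sketch only

    def process(self, txn: dict) -> dict:
        # Work on a copy so the original data can be replayed elsewhere.
        state = dict(txn)
        sent_externally = False
        for step in STEPS:
            if step == self.failing_step:
                if sent_externally:
                    raise NotReroutable(f"{self.name} failed at {step}")
                raise RerouteRequested(f"{self.name} failed at {step}")
            if step == POINT_OF_NO_RETURN:
                sent_externally = True
            state[step] = "done"
        state["processed_by"] = self.name
        return state


def process_with_reroute(txn: dict, cells: list[Cell]) -> dict:
    """Stand-in for the router: restart from the original data in the next cell."""
    for cell in cells:
        try:
            return cell.process(txn)
        except RerouteRequested:
            continue  # partial state from the failed cell is discarded
    raise RuntimeError("no cell could process the transaction")
```

Note that nothing from the failed cell's partial `state` is carried over: the next cell receives the untouched `txn`, which is why the cells need no shared state. A `NotReroutable` failure is deliberately left to propagate.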

For card authorizations, processing is structured so that the point of no return falls toward the end of the flow, which keeps the window for a safe restart as wide as possible. For other payment types, idempotency is managed through unique transaction identifiers. Downstream systems use these identifiers to detect and suppress duplicate requests.
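
Duplicate suppression by transaction identifier can be illustrated with a short Python sketch. The `DownstreamSystem` class and its in-memory dictionary are assumptions for illustration; a real downstream system would need a durable store and is not described in the source.

```python
class DownstreamSystem:
    """Hypothetical downstream system that deduplicates by transaction ID."""

    def __init__(self):
        self._results = {}  # transaction_id -> result of the first request
        self.applied_count = 0  # how many times the side effect actually ran

    def handle(self, transaction_id, amount):
        # A repeated ID means the request was restarted or retried upstream:
        # return the earlier result without applying the effect again.
        if transaction_id in self._results:
            return self._results[transaction_id]
        self.applied_count += 1
        result = f"applied:{transaction_id}:{amount}"
        self._results[transaction_id] = result
        return result
```

If a restart in another cell sends the same transaction a second time, the second call returns the stored result and `applied_count` stays at 1.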

The restart model avoids shared state between cells. Each cell has its own database clusters. Microservices communicate only with the local database cluster. When a cell fails, other cells continue normally.

Scaling and Latency Benefits

Cell-based architecture allows American Express to scale by adding cells. Each cell operates independently, so scaling doesn't increase the failure domain. Latency improves because processing stays local within the cell, reducing network hops.

Conclusion

American Express's cell-based architecture shows how a payment system can stay resilient at global scale. Developers designing high-availability systems should consider cell-based patterns to isolate failures and maintain predictable performance. Start by defining clear failure boundaries and enforcing strict data locality.