American Express Core Payments Goes Cell-Based
American Express's core payments ecosystem processes live transactions every day. To achieve high availability, low latency, and predictable performance, the company uses a cell-based architecture. The approach isolates failures within independent cells, prevents cascading outages, and scales capacity without expanding the failure domain.
What Is Cell-Based Architecture?
Cell-based architecture groups microservices, databases, and other components into independent instances called cells. Each cell can function independently. If one cell fails, the blast radius stays within that cell. The trade-off is increased management overhead and architectural complexity.
For mission-critical payments, the benefits outweigh the complexity. Cells also reduce latency by minimizing external dependencies and network hops. Scaling is achieved by adding more cells.
How American Express Implements Cells
Each instance of the core payments ecosystem is a cell. A cell:
- Is an independently deployable unit that can process payments on its own.
- Has its own microservices, databases, and other components.
- Is a single failure domain.
- Can be taken out of rotation for maintenance or failure without impacting the overall system.
- Has no synchronous cross-cell dependencies in the critical path.
A cell is defined by its failure boundaries, not by a specific infrastructure construct. Cells never span multiple regions. Everything required to process transactions (DNS, databases, microservices, supporting services) stays local to the cell.
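The properties above can be pictured with a small model. This is a minimal sketch, not American Express's implementation: the `Cell` and `CellRegistry` classes, the cell names, and the region labels are all hypothetical. It only illustrates that a cell belongs to one region and can be taken out of rotation while the others stay routable.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cell:
    name: str
    region: str               # exactly one region: a cell never spans regions
    in_rotation: bool = True  # False while drained for maintenance or after a failure


@dataclass
class CellRegistry:
    """Hypothetical view of which cells may currently receive traffic."""
    cells: Dict[str, Cell] = field(default_factory=dict)

    def add(self, cell: Cell) -> None:
        self.cells[cell.name] = cell

    def drain(self, name: str) -> None:
        # Take one cell out of rotation; no other cell is modified.
        self.cells[name].in_rotation = False

    def restore(self, name: str) -> None:
        self.cells[name].in_rotation = True

    def routable(self) -> List[str]:
        return sorted(n for n, c in self.cells.items() if c.in_rotation)


registry = CellRegistry()
for name, region in [("cell-a", "us-east"), ("cell-b", "us-east"), ("cell-c", "eu-west")]:
    registry.add(Cell(name, region))

registry.drain("cell-b")
print(registry.routable())  # ['cell-a', 'cell-c']
```

Draining a cell only flips its own flag, which mirrors the idea that a cell is a single failure domain.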
Data and Processing Locality
Processing payments requires reference data such as currency rates and merchant category codes. This static and semi-static data is replicated to each cell ahead of time, which avoids cache-miss latency and keeps remote lookups out of the critical path.
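One way to picture ahead-of-time replication is a versioned snapshot that is pushed into each cell and then read from local memory only. The sketch below is an assumption for illustration: the `LocalReferenceData` class, the snapshot format, and the sample values are hypothetical and do not come from American Express.

```python
class LocalReferenceData:
    """Hypothetical in-cell store for replicated static and semi-static data."""

    def __init__(self):
        self._version = 0
        self._tables = {}

    def apply_snapshot(self, version, tables):
        # Called by replication, ahead of time and outside the transaction path.
        if version <= self._version:
            return False  # ignore stale or duplicate snapshots
        self._tables = {name: dict(rows) for name, rows in tables.items()}
        self._version = version
        return True

    def lookup(self, table, key):
        # Critical path: a local read only. A missing key raises KeyError
        # instead of triggering a fetch from outside the cell.
        return self._tables[table][key]


store = LocalReferenceData()
store.apply_snapshot(1, {
    "currency_rates": {"USD/EUR": 0.90},   # illustrative value only
    "merchant_categories": {"5411": "grocery"},
})
print(store.lookup("merchant_categories", "5411"))  # grocery
```

Failing loudly on a missing key, rather than falling back to a remote call, is the design choice that keeps the critical path local in this sketch.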
For dynamic data that changes per transaction, American Express uses deterministic routing. The Global Transaction Router (GTR) routes transactions to the cell where the authoritative data already resides. For example, routing can be based on partner, market, or payment type.
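Deterministic routing means the same transaction attributes always resolve to the same cell. The following sketch assumes a simple ordered rule table. The rule order, the attribute names, and the cell names are hypothetical, and the real GTR logic is not described in this article.

```python
# Hypothetical rules, checked in order: a partner rule wins over a market rule.
ROUTING_RULES = [
    ("partner", {"acme-bank": "cell-us-2"}),
    ("market", {"US": "cell-us-1", "UK": "cell-uk-1"}),
]


def home_cell(txn):
    """Return the cell that holds the authoritative data for this transaction."""
    for attribute, table in ROUTING_RULES:
        value = txn.get(attribute)
        if value in table:
            return table[value]
    raise LookupError(f"no routing rule matches transaction {txn.get('id')!r}")


print(home_cell({"id": "t1", "partner": "acme-bank", "market": "UK"}))  # cell-us-2
print(home_cell({"id": "t2", "market": "UK"}))                          # cell-uk-1
```

Because the function depends only on the transaction's own attributes and a fixed table, repeated calls with the same input always return the same cell.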
Microservice communication is restricted to pod-to-pod interactions within the cell's Kubernetes network. Failover data is synchronized asynchronously via message-based replication, outside the transaction path.
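The separation between the transaction path and replication can be sketched with an outbox: a commit writes locally and queues a message, and a separate step ships queued messages to another cell later. The `CellStore` class and the `replicate` function are hypothetical names, and an in-memory `deque` stands in for a real message broker.

```python
from collections import deque


class CellStore:
    """Hypothetical stand-in for a cell's local database plus replication outbox."""

    def __init__(self, name):
        self.name = name
        self.db = {}           # local database cluster stand-in
        self.outbox = deque()  # replication messages not yet shipped

    def commit(self, txn_id, record):
        # Transaction path: a local write and a local enqueue, with no wait on other cells.
        self.db[txn_id] = record
        self.outbox.append((txn_id, record))
        return "committed"


def replicate(source, target, batch_size=100):
    """Out-of-band step: ship queued messages from source to target."""
    moved = 0
    while source.outbox and moved < batch_size:
        txn_id, record = source.outbox.popleft()
        target.db[txn_id] = record
        moved += 1
    return moved


primary = CellStore("cell-a")
standby = CellStore("cell-b")
primary.commit("txn-1", {"amount": 100})
print(standby.db)                   # {}  (nothing replicated yet)
print(replicate(primary, standby))  # 1
print(standby.db)                   # {'txn-1': {'amount': 100}}
```

The commit returns before the standby has seen the record, which is the trade-off of asynchronous replication: the transaction path stays fast and isolated, while the failover copy lags slightly.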
Enforced Boundaries via the Global Transaction Router
The GTR enforces local-only processing. All transactions enter a cell through the GTR. If a transaction needs to be rerouted, it must go through the GTR again. The GTR acts as a payments mesh, connecting cells globally.
By tightly controlling cross-cell communication, American Express prevents cells from forming strong dependencies. Cells cannot communicate directly; only the GTR can route across cells. This occasionally leads to duplicated services, but it preserves cell independence and reduces latency.
Observability follows the same principle. Each cell publishes logs, metrics, and traces to local observability components first. Global aggregation happens asynchronously, outside the critical path.
Failure Isolation and Rerouting
When a cell fails, its impact stays contained. New and in-flight transactions are rerouted to a healthy cell. The Payments Processing subsystem uses an orchestrator microservice. If a downstream service fails, the orchestrator detects it, halts processing, and sends the transaction back to the GTR for rerouting.
American Express does not resume partially processed transactions across cells. Instead, it restarts processing in another cell with the original transaction data. This restart is safe only while the transaction is still within the core payments ecosystem. Once a transaction is sent to an external system (e.g., card issuer), it cannot be rerouted.
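The restart model can be sketched as follows. This is an illustration under stated assumptions, not the actual orchestrator or GTR: `CellFailure`, `process_in_cell`, `gtr_dispatch`, the step names, and the `sent_externally` flag are all hypothetical. The sketch shows two points from the text: a retry starts from the original transaction data rather than from partial state, and no reroute happens once the transaction has left for an external system.

```python
import copy


class CellFailure(Exception):
    """Raised by a cell's orchestrator when a downstream service fails."""

    def __init__(self, cell, sent_externally):
        super().__init__(f"{cell} failed")
        self.cell = cell
        self.sent_externally = sent_externally


def process_in_cell(cell, txn, failing_cells):
    state = copy.deepcopy(txn)  # partial state stays private to this cell
    state["steps"] = ["validated", "risk-checked"]
    if cell in failing_cells:
        # Failure while still inside the ecosystem: nothing was sent externally.
        raise CellFailure(cell, sent_externally=False)
    state["steps"].append("sent-to-issuer")  # point of no return
    state["processed_by"] = cell
    return state


def gtr_dispatch(txn, cells, failing_cells=frozenset()):
    original = copy.deepcopy(txn)
    for cell in cells:
        try:
            return process_in_cell(cell, original, failing_cells)
        except CellFailure as failure:
            if failure.sent_externally:
                raise  # past the point of no return: do not reroute
            # Otherwise discard the partial work and restart in the next cell.
    raise RuntimeError("no healthy cell available")


txn = {"id": "txn-1", "amount": 100}
result = gtr_dispatch(txn, ["cell-a", "cell-b"], failing_cells={"cell-a"})
print(result["processed_by"])  # cell-b
print(result["steps"])         # ['validated', 'risk-checked', 'sent-to-issuer']
```

The second cell receives a fresh copy of the original data, so nothing computed in the failed cell is carried over.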
For card authorizations, the point of no return is placed toward the end of processing, which keeps the window in which a restart is safe as long as possible. For other payment types, idempotency is managed through unique transaction identifiers. Downstream systems use these identifiers to detect and suppress duplicate requests.
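Duplicate suppression by transaction identifier can be sketched in a few lines. The `DownstreamLedger` class below is hypothetical and keeps its state in memory. It only shows the general idempotency pattern, in which a repeated identifier returns the earlier result without applying the effect a second time.

```python
class DownstreamLedger:
    """Hypothetical downstream system that deduplicates by transaction ID."""

    def __init__(self):
        self._results = {}  # txn_id -> result of the first successful apply
        self.total = 0

    def apply(self, txn_id, amount):
        if txn_id in self._results:
            # Duplicate request (for example after a restart in another cell):
            # return the stored result and apply no second effect.
            return self._results[txn_id]
        self.total += amount
        result = {"txn_id": txn_id, "status": "applied", "amount": amount}
        self._results[txn_id] = result
        return result


ledger = DownstreamLedger()
first = ledger.apply("txn-42", 250)
second = ledger.apply("txn-42", 250)  # same ID sent again after a reroute
print(ledger.total)     # 250
print(first == second)  # True
```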
The restart model avoids shared state between cells. Each cell has its own database clusters. Microservices communicate only with the local database cluster. When a cell fails, other cells continue normally.
Scaling and Latency Benefits
Cell-based architecture allows American Express to scale by adding cells. Each cell operates independently, so scaling doesn't increase the failure domain. Latency improves because processing stays local within the cell, reducing network hops.
Conclusion
American Express's core payments platform shows how a cell-based architecture can support resilient payment processing at global scale. Teams designing high-availability systems can apply the same pattern to isolate failures and maintain predictable performance: define clear failure boundaries, enforce strict data locality, and keep cross-cell communication out of the critical path.
