On September 4, 2024, a significant Azure outage affected multiple services in the East US region. Companies that had built for single-region deployments experienced production downtime. Companies with properly designed multi-region architectures failed over automatically. Their customers noticed nothing.
Resilient architecture doesn't eliminate failure — it eliminates the impact of failure. Multi-region design is how you build systems that survive the inevitable.
Understanding the Resilience Hierarchy
Azure provides resilience at three levels. Understanding which level protects against which failure mode is the foundation of resilience design.
| Level | Protects Against | Azure Mechanism |
|---|---|---|
| Availability Sets | Single hardware rack failure | Fault domains + update domains |
| Availability Zones | Data centre failure within a region | Deploy across AZs (typically 3 per region) |
| Region Pairs | Full regional outage | Deploy to two paired regions |
Availability Sets are the oldest and weakest mechanism — they're only relevant for VM deployments and protect only against hardware-level failures within a data centre. Modern workloads should use Availability Zones instead.
Availability Zones protect against the failure of an entire Azure data centre within a region. Each AZ is a separate physical data centre with independent power, cooling, and networking. An AZ failure is rare but has occurred. This should be your minimum for any production workload.
Regional pairs protect against the failure of an entire Azure region. This is the most extreme failure scenario — it has occurred — and it requires genuinely distributed architecture across geographically separate regions.
Availability Zones vs. multi-region
Many companies skip straight to asking about multi-region before ensuring they're properly using Availability Zones. For most workloads, a properly architected multi-AZ deployment within a single region delivers 99.99% availability — which is likely your SLA requirement. Multi-region adds cost and complexity and should only be added when you have a specific requirement for regional fault tolerance.
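The 99.99% figure falls out of straightforward probability arithmetic: components in series multiply their availabilities, while independent redundant replicas multiply their failure probabilities. A minimal sketch, using illustrative numbers rather than contractual Azure SLA figures:

```python
# Composite availability: serial components multiply availabilities;
# independent replicas multiply *failure* probabilities.
# All figures are illustrative, not Azure SLA values.

def serial(*availabilities):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant(availability, copies):
    """Availability of N independent replicas where any one suffices."""
    return 1.0 - (1.0 - availability) ** copies

# A single 99.9% instance behind a 99.99% load balancer:
single_az = serial(0.9999, 0.999)               # ~99.89%

# The same instance replicated across three Availability Zones:
multi_az = serial(0.9999, redundant(0.999, 3))  # ~99.99%
```

The zone-redundant variant recovers almost the full availability of the load balancer itself, which is why multi-AZ alone is usually enough to meet a 99.99% SLA.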
Multi-Region Architecture Patterns
There are three fundamentally different approaches to multi-region architecture, each with different cost, complexity, and RTO/RPO characteristics.
Pattern 1: Active-Passive (Warm Standby)
One region handles all traffic. A second region runs a reduced-capacity deployment that is continuously synchronised but only activated during a regional failure.
Characteristics:
- Lower cost than active-active (standby region runs at reduced capacity)
- Higher RTO than active-active (15–30 minutes for failover activation)
- Suitable when your SLA permits 15–30 minute recovery time
Data synchronisation: databases replicate asynchronously to the standby region, for example via Azure SQL Active Geo-Replication, Azure Cosmos DB with secondary read regions, or Azure Cache for Redis geo-replication.
Failover trigger: Azure Front Door or Traffic Manager performs health checks and automatically redirects traffic when the primary region is unavailable.
On Azure:
- Azure Front Door + priority routing (primary region priority, standby as failover)
- Azure SQL Active Geo-Replication (up to 4 readable secondaries)
- Azure Site Recovery for VM-based workloads
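Front Door's priority routing behaviour amounts to "send all traffic to the lowest-priority-number backend that passes its health probe". A minimal sketch of that selection logic (the hostnames are hypothetical, not a real deployment):

```python
# Sketch of priority-based routing as applied by Azure Front Door:
# traffic goes to the healthy backend with the lowest priority number;
# the standby only receives traffic when the primary is unhealthy.

def select_backend(backends, is_healthy):
    """Return the healthy backend endpoint with the best (lowest) priority."""
    for backend in sorted(backends, key=lambda b: b["priority"]):
        if is_healthy(backend["endpoint"]):
            return backend["endpoint"]
    raise RuntimeError("no healthy backend available")

backends = [
    {"endpoint": "primary.westeurope.example.com", "priority": 1},
    {"endpoint": "standby.northeurope.example.com", "priority": 2},
]

# Primary healthy: all traffic stays in the primary region.
primary_choice = select_backend(backends, lambda e: True)

# Primary unhealthy: traffic fails over to the standby automatically.
failover_choice = select_backend(backends, lambda e: e.startswith("standby"))
```

The key property is that failover requires no human action: the routing decision is re-evaluated on every health probe cycle.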
Pattern 2: Active-Active (Both Regions Live)
Both regions handle traffic simultaneously. Load is distributed between regions. If one region fails, the other absorbs 100% of traffic with no failover delay.
Characteristics:
- Highest cost (full capacity in both regions)
- Near-zero RTO — no failover step; the healthy region absorbs traffic automatically as health probes react
- RPO of zero with synchronous replication; near-zero (seconds of potential data loss) with asynchronous multi-master replication
- Suitable for mission-critical workloads where any downtime is unacceptable
Data synchronisation: Writes must be consistent across regions. This requires either synchronous replication (introduces cross-region latency into the write path) or eventual consistency with conflict resolution (Cosmos DB multi-master).
On Azure:
- Azure Front Door + weighted routing (50/50 distribution)
- Azure Cosmos DB multi-region multi-write (natively active-active)
- Azure Service Bus Geo-Disaster Recovery for messaging (note: it replicates namespace metadata, not the messages themselves)
The data consistency challenge in active-active
Active-active with consistent writes is one of the hardest problems in distributed systems. If a user writes data to Region A and simultaneously reads from Region B, they may see stale data unless synchronous replication is used (which adds 5–20ms of cross-region latency to every write). Understand your consistency requirements before committing to active-active.
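The stale-read window is easiest to see in a toy model of asynchronous replication: the primary acknowledges the write immediately and ships it to the other region later, so a read routed to the second region inside that window misses the update. This is purely illustrative, not a model of any specific Azure service:

```python
# Toy model of asynchronous geo-replication. Writes are acknowledged by
# the primary region immediately and applied to the secondary later, so
# reads from the secondary can observe stale data in the interim.

class Region:
    def __init__(self):
        self.data = {}

primary, secondary = Region(), Region()
replication_queue = []

def write(key, value):
    primary.data[key] = value               # acknowledged immediately
    replication_queue.append((key, value))  # shipped asynchronously

def replicate():
    """Drain the queue, as the background replication process would."""
    while replication_queue:
        key, value = replication_queue.pop(0)
        secondary.data[key] = value

write("profile", "v2")
stale = secondary.data.get("profile")   # read lands in the lag window
replicate()
fresh = secondary.data.get("profile")   # replication has caught up
```

Synchronous replication closes this window by not acknowledging the write until both regions have it, which is exactly where the cross-region latency penalty comes from.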
Pattern 3: Active-Passive (Cold Standby / Pilot Light)
The standby region is mostly dormant — perhaps with just the data layer running and the application layer shut down. In a failure event, you spin up the application layer from scratch.
Characteristics:
- Lowest cost (minimal standby resources running)
- Highest RTO (30–60 minutes or more to bring up application layer)
- Acceptable when your SLA permits 1+ hour recovery time
- Suitable for less critical workloads where cost is a primary constraint
Reference Architecture: Active-Passive on Azure
For a typical web application with SLA requirements of 99.9–99.99%:
Global entry point:
- Azure Front Door Premium with health probes to both regions
- Priority routing (primary first, standby as failover)
- WAF policy applied globally

Primary region (active):
- AKS cluster spread across 3 Availability Zones
- Azure SQL Database (zone-redundant)
- Azure Cache for Redis
- Azure Service Bus

Secondary region (warm standby):
- AKS cluster at reduced capacity
- Azure SQL geo-replica (read-only until promoted)
- Azure Cache for Redis (passive geo-replication)
- Service Bus geo-DR namespace

Cross-region replication:
- Azure SQL asynchronous geo-replication
- Storage accounts with GRS (geo-redundant storage)
- Free egress between paired regions
Traffic flows through Azure Front Door to the primary region. On regional failure, Front Door automatically routes to the secondary.
Designing for Failover
Multi-region architecture only works if failover is automated, tested, and understood. Many organisations build multi-region infrastructure and then discover during an actual failure that it doesn't work as designed.
Failover Checklist
DNS and routing:
- Azure Front Door or Traffic Manager configured with health probes to both regions
- Health probe interval: ≤ 30 seconds
- Failover is automatic (no human required to initiate)
- TTL on DNS records is low enough that failover propagates quickly
Data:
- Database geo-replication is enabled and monitored
- Failover to read replica promotes it to primary automatically or with documented manual steps
- RTO and RPO targets are documented and validated in drills
Application:
- Application is stateless (session state in Redis, not in-process)
- All configuration is environment-variable-driven (no region-specific hardcoding)
- Connection strings, endpoints, and secrets are resolved from Key Vault (same vault, accessible from both regions)
Automation:
- Failover runbook is written, reviewed, and accessible during an incident
- Failover drill has been run in the last 6 months
- Alerting notifies on-call when failover occurs
Failure Drills
The only way to know your multi-region architecture works is to test it. Run a failover drill quarterly:
- Pre-drill: confirm both regions are healthy and replication is current
- Simulate regional failure: update Front Door routing rules to exclude the primary region, or disable the primary region's health endpoint
- Observe: does traffic move automatically? How long does it take?
- Measure: is application performance acceptable in the failover state?
- Failback: restore primary region routing, verify data consistency
- Post-drill: document findings, fix issues, update the runbook
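The observe and measure steps above can be scripted rather than eyeballed: poll the public endpoint throughout the drill and record how long requests fail before traffic settles on the standby. A sketch, where the `probe` callable stands in for a real HTTP health check:

```python
# Measure observed failover time during a drill: the gap between the
# first failed probe and the first successful probe after it.
import time

def measure_failover(probe, interval=5.0, timeout=1800.0):
    """Return outage duration in seconds, or None if no outage (or no
    recovery) was observed within the timeout."""
    first_failure = None
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        healthy = probe()
        now = time.monotonic()
        if not healthy and first_failure is None:
            first_failure = now           # outage begins
        elif healthy and first_failure is not None:
            return now - first_failure    # traffic settled on the standby
        time.sleep(interval)
    return None
```

In a real drill the probe would issue an actual request against the Front Door endpoint, for example a short-timeout HTTP GET, and the measured duration goes straight into the post-drill report against your RTO target.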
The failover you don't test doesn't work
I've reviewed failover architectures that looked correct on paper but failed during a real incident because a firewall rule hadn't been updated, a certificate had expired in the standby region, or a database credential only existed in Key Vault in the primary region. The only way to find these issues is to test regularly. Treat failover drills with the same seriousness as production deployments.
Cost Management for Multi-Region
Multi-region architecture inherently increases cost. The strategies to manage it:
Right-size the standby: For active-passive, the standby region doesn't need to run at full production capacity continuously. Use smaller SKUs or fewer replicas in the standby, sized to handle traffic for the expected failover duration (during which you can scale up if needed).
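One hedged way to put numbers on "sized to handle traffic for the expected failover duration": provision just enough standby replicas to absorb the traffic you must serve immediately after failover, with a safety margin, and let autoscaling grow from there. The figures below are illustrative:

```python
# Illustrative standby sizing: enough replicas to absorb failover
# traffic with headroom, relying on autoscale for anything beyond that.
import math

def standby_replicas(failover_rps, rps_per_replica, headroom=0.2):
    """Replicas needed to absorb failover traffic with a safety margin."""
    return math.ceil(failover_rps * (1 + headroom) / rps_per_replica)

# Must absorb 5,000 req/s at failover; each replica handles ~400 req/s:
needed = standby_replicas(5000, 400)  # 15 replicas, vs. more in the primary
```

The saving comes from the gap between this number and full production capacity; the cost of that saving is the assumption that autoscaling can close the gap before traffic grows further.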
Leverage Azure's free paired-region egress: Data replication between Azure paired regions (e.g., West Europe ↔ North Europe) incurs no egress charges. Design your replication topology to use paired regions.
Identify your truly multi-region requirements: Not every component needs multi-region coverage. Your customer-facing API might need it; your internal admin portal probably doesn't. Apply multi-region only to components with SLA requirements that justify the cost.
Consider Azure Availability Zones first: For many workloads, multi-AZ within a single region achieves the availability SLA at a fraction of the cost of full multi-region. Reserve multi-region for components where a full regional outage would be unacceptable.
Cloud architecture and Azure resilience design are core areas of my practice. If you're designing a multi-region architecture or need to assess your current availability posture, let's talk.