On September 4, 2024, a significant Azure outage affected multiple services in the East US region. Companies that had built for single-region deployments experienced production downtime. Companies with properly designed multi-region architectures failed over automatically. Their customers noticed nothing.
Resilient architecture doesn't eliminate failure — it eliminates the impact of failure. Multi-region design is how you build systems that survive the inevitable.
Understanding the Resilience Hierarchy
Azure provides resilience at three levels. Understanding which level protects against which failure mode is the foundation of resilience design.
| Level | Protects Against | Azure Mechanism |
|---|---|---|
| Availability Sets | Single hardware rack failure | Fault domains + update domains |
| Availability Zones | Data centre failure within a region | Deploy across AZs (typically 3 per region) |
| Region Pairs | Full regional outage | Deploy to two paired regions |
Availability Sets are the oldest and weakest mechanism — they're only relevant for VM deployments and protect only against hardware-level failures within a data centre. Modern workloads should use Availability Zones instead.
Availability Zones protect against the failure of an entire Azure data centre within a region. Each AZ is a separate physical data centre with independent power, cooling, and networking. An AZ failure is rare but has occurred. This should be your minimum for any production workload.
Regional pairs protect against the failure of an entire Azure region. This is the most extreme failure scenario — it has occurred — and it requires genuinely distributed architecture across geographically separate regions.
Availability Zones vs. multi-region
Many companies skip straight to asking about multi-region before ensuring they're properly using Availability Zones. For most workloads, a properly architected multi-AZ deployment within a single region delivers 99.99% availability — which is likely your SLA requirement. Multi-region adds cost and complexity and should only be added when you have a specific requirement for regional fault tolerance.
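The 99.99% figure falls out of straightforward probability arithmetic: components in series multiply their availabilities, while independent redundant replicas multiply their failure probabilities. A minimal sketch, using illustrative numbers rather than contractual Azure SLA figures:

```python
# Composite availability: serial components multiply availabilities;
# independent replicas multiply *failure* probabilities.
# All figures are illustrative, not Azure SLA values.

def serial(*availabilities):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant(availability, copies):
    """Availability of N independent replicas where any one suffices."""
    return 1.0 - (1.0 - availability) ** copies

# A single 99.9% instance behind a 99.99% load balancer:
single_az = serial(0.9999, 0.999)               # ~99.89%

# The same instance replicated across three Availability Zones:
multi_az = serial(0.9999, redundant(0.999, 3))  # ~99.99%
```

The zone-redundant variant recovers almost the full availability of the load balancer itself, which is why multi-AZ alone is usually enough to meet a 99.99% SLA.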
Multi-Region Architecture Patterns
There are three fundamentally different approaches to multi-region architecture, each with different cost, complexity, and RTO/RPO characteristics.
Pattern 1: Active-Passive (Warm Standby)
One region handles all traffic. A second region runs a reduced-capacity deployment that is continuously synchronised but only activated during a regional failure.
Characteristics:
- Lower cost than active-active (standby region runs at reduced capacity)
- Higher RTO than active-active (15–30 minutes for failover activation)
- Suitable when your SLA permits 15–30 minute recovery time
Data synchronisation: databases replicate asynchronously to the standby region, for example via Azure SQL Active Geo-Replication, Azure Cosmos DB with secondary read regions, or Azure Cache for Redis geo-replication.
Failover trigger: Azure Front Door or Traffic Manager performs health checks and automatically redirects traffic when the primary region is unavailable.
On Azure:
- Azure Front Door + priority routing (primary region priority, standby as failover)
- Azure SQL Active Geo-Replication (up to 4 readable secondaries)
- Azure Site Recovery for VM-based workloads
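Front Door's priority routing behaviour amounts to "send all traffic to the lowest-priority-number backend that passes its health probe". A minimal sketch of that selection logic (the hostnames are hypothetical, not a real deployment):

```python
# Sketch of priority-based routing as applied by Azure Front Door:
# traffic goes to the healthy backend with the lowest priority number;
# the standby only receives traffic when the primary is unhealthy.

def select_backend(backends, is_healthy):
    """Return the healthy backend endpoint with the best (lowest) priority."""
    for backend in sorted(backends, key=lambda b: b["priority"]):
        if is_healthy(backend["endpoint"]):
            return backend["endpoint"]
    raise RuntimeError("no healthy backend available")

backends = [
    {"endpoint": "primary.westeurope.example.com", "priority": 1},
    {"endpoint": "standby.northeurope.example.com", "priority": 2},
]

# Primary healthy: all traffic stays in the primary region.
primary_choice = select_backend(backends, lambda e: True)

# Primary unhealthy: traffic fails over to the standby automatically.
failover_choice = select_backend(backends, lambda e: e.startswith("standby"))
```

The key property is that failover requires no human action: the routing decision is re-evaluated on every health probe cycle.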
Pattern 2: Active-Active (Both Regions Live)
Both regions handle traffic simultaneously. Load is distributed between regions. If one region fails, the other absorbs 100% of traffic with no failover delay.
Characteristics:
- Highest cost (full capacity in both regions)
- Near-zero RTO — no failover step; the healthy region absorbs traffic automatically as health probes react
- RPO of zero with synchronous replication; near-zero (seconds of potential data loss) with asynchronous multi-master replication
- Suitable for mission-critical workloads where any downtime is unacceptable
Data synchronisation: Writes must be consistent across regions. This requires either synchronous replication (introduces cross-region latency into the write path) or eventual consistency with conflict resolution (Cosmos DB multi-master).
On Azure:
- Azure Front Door + weighted routing (50/50 distribution)
- Azure Cosmos DB multi-region multi-write (natively active-active)
- Azure Service Bus Geo-Disaster Recovery for messaging (note: it replicates namespace metadata, not the messages themselves)
The data consistency challenge in active-active
Active-active with consistent writes is one of the hardest problems in distributed systems. If a user writes data to Region A and simultaneously reads from Region B, they may see stale data unless synchronous replication is used (which adds 5–20ms of cross-region latency to every write). Understand your consistency requirements before committing to active-active.
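The stale-read window is easiest to see in a toy model of asynchronous replication: the primary acknowledges the write immediately and ships it to the other region later, so a read routed to the second region inside that window misses the update. This is purely illustrative, not a model of any specific Azure service:

```python
# Toy model of asynchronous geo-replication. Writes are acknowledged by
# the primary region immediately and applied to the secondary later, so
# reads from the secondary can observe stale data in the interim.

class Region:
    def __init__(self):
        self.data = {}

primary, secondary = Region(), Region()
replication_queue = []

def write(key, value):
    primary.data[key] = value               # acknowledged immediately
    replication_queue.append((key, value))  # shipped asynchronously

def replicate():
    """Drain the queue, as the background replication process would."""
    while replication_queue:
        key, value = replication_queue.pop(0)
        secondary.data[key] = value

write("profile", "v2")
stale = secondary.data.get("profile")   # read lands in the lag window
replicate()
fresh = secondary.data.get("profile")   # replication has caught up
```

Synchronous replication closes this window by not acknowledging the write until both regions have it, which is exactly where the cross-region latency penalty comes from.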
Pattern 3: Active-Passive (Cold Standby / Pilot Light)
The standby region is mostly dormant — perhaps with just the data layer running and the application layer shut down. In a failure event, you spin up the application layer from scratch.
Characteristics:
- Lowest cost (minimal standby resources running)
- Highest RTO (30–60 minutes or more to bring up application layer)
- Acceptable when your SLA permits 1+ hour recovery time
- Suitable for less critical workloads where cost is a primary constraint
Reference Architecture: Active-Passive on Azure
For a typical web application with SLA requirements of 99.9–99.99%:
Global entry point:
- Azure Front Door Premium with health probes to both regions
- Priority routing (primary first, standby as failover)
- WAF policy applied globally

Primary region (active):
- AKS cluster spread across 3 Availability Zones
- Azure SQL Database (zone-redundant)
- Azure Cache for Redis
- Azure Service Bus

Secondary region (warm standby):
- AKS cluster at reduced capacity
- Azure SQL geo-replica (read-only until promoted)
- Azure Cache for Redis (passive geo-replication)
- Service Bus geo-DR namespace

Cross-region replication:
- Azure SQL asynchronous geo-replication
- Storage accounts with GRS (geo-redundant storage)
- Free egress between paired regions
Traffic flows through Azure Front Door to the primary region. On regional failure, Front Door automatically routes to the secondary.
Designing for Failover
Multi-region architecture only works if failover is automated, tested, and understood. Many organisations build multi-region infrastructure and then discover during an actual failure that it doesn't work as designed.
Failover Checklist
DNS and routing:
- Azure Front Door or Traffic Manager configured with health probes to both regions
- Health probe interval: ≤ 30 seconds
- Failover is automatic (no human required to initiate)
- TTL on DNS records is low enough that failover propagates quickly
Data:
- Database geo-replication is enabled and monitored
- Failover to read replica promotes it to primary automatically or with documented manual steps
- RTO and RPO targets are documented and validated in drills
Application:
- Application is stateless (session state in Redis, not in-process)
- All configuration is environment-variable-driven (no region-specific hardcoding)
- Connection strings, endpoints, and secrets are resolved from Key Vault (same vault, accessible from both regions)
Automation:
- Failover runbook is written, reviewed, and accessible during an incident
- Failover drill has been run in the last 6 months
- Alerting notifies on-call when failover occurs
Failure Drills
The only way to know your multi-region architecture works is to test it. Run a failover drill quarterly:
- Pre-drill: confirm both regions are healthy and replication is current
- Simulate regional failure: update Front Door routing rules to exclude the primary region, or disable the primary region's health endpoint
- Observe: does traffic move automatically? How long does it take?
- Measure: is application performance acceptable in the failover state?
- Failback: restore primary region routing, verify data consistency
- Post-drill: document findings, fix issues, update the runbook
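The observe and measure steps above can be scripted rather than eyeballed: poll the public endpoint throughout the drill and record how long requests fail before traffic settles on the standby. A sketch, where the `probe` callable stands in for a real HTTP health check:

```python
# Measure observed failover time during a drill: the gap between the
# first failed probe and the first successful probe after it.
import time

def measure_failover(probe, interval=5.0, timeout=1800.0):
    """Return outage duration in seconds, or None if no outage (or no
    recovery) was observed within the timeout."""
    first_failure = None
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        healthy = probe()
        now = time.monotonic()
        if not healthy and first_failure is None:
            first_failure = now           # outage begins
        elif healthy and first_failure is not None:
            return now - first_failure    # traffic settled on the standby
        time.sleep(interval)
    return None
```

In a real drill the probe would issue an actual request against the Front Door endpoint, for example a short-timeout HTTP GET, and the measured duration goes straight into the post-drill report against your RTO target.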
The failover you don't test doesn't work
I've reviewed failover architectures that looked correct on paper but failed during a real incident because a firewall rule hadn't been updated, a certificate had expired in the standby region, or a database credential only existed in Key Vault in the primary region. The only way to find these issues is to test regularly. Treat failover drills with the same seriousness as production deployments.
Cost Management for Multi-Region
Multi-region architecture inherently increases cost. The strategies to manage it:
Right-size the standby: For active-passive, the standby region doesn't need to run at full production capacity continuously. Use smaller SKUs or fewer replicas in the standby, sized to handle traffic for the expected failover duration (during which you can scale up if needed).
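One hedged way to put numbers on "sized to handle traffic for the expected failover duration": provision just enough standby replicas to absorb the traffic you must serve immediately after failover, with a safety margin, and let autoscaling grow from there. The figures below are illustrative:

```python
# Illustrative standby sizing: enough replicas to absorb failover
# traffic with headroom, relying on autoscale for anything beyond that.
import math

def standby_replicas(failover_rps, rps_per_replica, headroom=0.2):
    """Replicas needed to absorb failover traffic with a safety margin."""
    return math.ceil(failover_rps * (1 + headroom) / rps_per_replica)

# Must absorb 5,000 req/s at failover; each replica handles ~400 req/s:
needed = standby_replicas(5000, 400)  # 15 replicas, vs. more in the primary
```

The saving comes from the gap between this number and full production capacity; the cost of that saving is the assumption that autoscaling can close the gap before traffic grows further.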
Leverage Azure's free paired-region egress: Data replication between Azure paired regions (e.g., West Europe ↔ North Europe) incurs no egress charges. Design your replication topology to use paired regions.
Identify your truly multi-region requirements: Not every component needs multi-region coverage. Your customer-facing API might need it; your internal admin portal probably doesn't. Apply multi-region only to components with SLA requirements that justify the cost.
Consider Azure Availability Zones first: For many workloads, multi-AZ within a single region achieves the availability SLA at a fraction of the cost of full multi-region. Reserve multi-region for components where a full regional outage would be unacceptable.
Cloud architecture and Azure resilience design are core areas of my practice. If you're designing a multi-region architecture or need to assess your current availability posture, let's talk.