This document outlines the high availability (HA) and resilience architecture for establishing secure connectivity between Contoso’s Azure environment and Fabrikam-Customer’s infrastructure (both Azure and on-premises). The solution addresses both inbound connectivity (Fabrikam-Customer Azure → Contoso Azure) via Private Link and outbound connectivity (Contoso Azure → Fabrikam-Customer on-premises) via VPN with NAT.
| Component | Without HA | With HA | Availability Gain |
|---|---|---|---|
| Application Gateway | Single zone | Zone-redundant | 99.95% → 99.99% |
| VPN Gateway | Single tunnel | Active-Active | 99.9% → 99.95% |
| Private Link | N/A | Built-in HA | 99.99% (managed) |
| AKS Cluster | Single zone | Multi-zone | 99.9% → 99.95% |
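The per-component figures above can be combined into an end-to-end estimate. The sketch below multiplies the "With HA" values along each traffic path, assuming the components sit in series and fail independently; the path membership and the dictionary keys are illustrative assumptions, not part of the original design.

```python
from math import prod

# Availability figures copied from the "With HA" column of the table above.
WITH_HA = {
    "private_link": 0.9999,
    "application_gateway": 0.9999,
    "vpn_gateway": 0.9995,
    "aks": 0.9995,
}

def composite_availability(components):
    """Availability of a serial chain: every component must be up at once."""
    return prod(WITH_HA[name] for name in components)

# Assumed paths: inbound traverses Private Link -> AGW -> AKS,
# outbound traverses AKS -> VPN Gateway.
inbound = composite_availability(["private_link", "application_gateway", "aks"])
outbound = composite_availability(["aks", "vpn_gateway"])

print(f"inbound:  {inbound:.4%}")
print(f"outbound: {outbound:.4%}")
```

Because the result is a product of values below 1, a serial chain is always less available than its weakest component, which is worth keeping in mind when setting an end-to-end target.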
Architecture Components:
Configuration Requirements:
Application Gateway:
  SKU: Standard_v2 or WAF_v2
  Zones: [1, 2, 3]  # Deploy across all three zones
  Minimum Instances: 2
  Maximum Instances: 10
  Autoscale: Enabled
Private Link Service:
  Load Balancer: Standard SKU
  Frontend IPs: Multiple for redundancy
  NAT IP Configuration: Static allocation
  Visibility: Restricted to Fabrikam-Customer subscription
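The Application Gateway requirements above lend themselves to an automated pre-deployment check. The following is a minimal sketch: `AppGatewayConfig` and `ha_violations` are hypothetical names introduced here for illustration, and the rules simply restate the values listed above rather than any Azure-enforced validation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AppGatewayConfig:
    """Hypothetical container mirroring the Application Gateway settings above."""
    sku: str
    zones: List[int]
    min_instances: int
    max_instances: int
    autoscale: bool

def ha_violations(cfg: AppGatewayConfig) -> List[str]:
    """Return human-readable problems; an empty list means all checks pass."""
    problems = []
    if cfg.sku not in ("Standard_v2", "WAF_v2"):
        problems.append(f"SKU {cfg.sku!r} is not one of the v2 SKUs listed above")
    if sorted(cfg.zones) != [1, 2, 3]:
        problems.append(f"zones {cfg.zones} do not cover zones 1, 2 and 3")
    if cfg.min_instances < 2:
        problems.append("minimum instance count is below 2")
    if cfg.max_instances < cfg.min_instances:
        problems.append("maximum instance count is below the minimum")
    if not cfg.autoscale:
        problems.append("autoscale is disabled")
    return problems

documented = AppGatewayConfig("WAF_v2", [1, 2, 3], 2, 10, True)
single_zone = AppGatewayConfig("WAF_v2", [1], 1, 10, True)

print(ha_violations(documented))
print(ha_violations(single_zone))
```

A check like this could run in a deployment pipeline before the template is applied, so that a single-zone or single-instance configuration is caught early.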
Deployment Process:
CRITICAL CLARIFICATION: When we say “deploy AGW in all 3 zones,” we mean deploying ONE Application Gateway (AGW) resource whose compute instances Azure distributes across the selected availability zones, NOT three separate AGW resources.
What You Create:
- ONE Application Gateway resource (e.g., agw-Contoso-prod)
- Single management plane
- Single configuration set
- One public/private IP address
What Azure Does Behind the Scenes:
- Creates multiple compute instances
- Distributes these instances across zones 1, 2, and 3
- Manages health monitoring and failover
- Synchronizes configuration automatically
- Handles traffic distribution transparently
         Single AGW Resource (agw-Contoso-prod)
            Management View in Azure Portal
                           │
                           │ Azure Manages Distribution
                           ▼
┌──────────────────────────────────────────────────────┐
│                  Availability Zones                  │
│                                                      │
│      Zone 1            Zone 2            Zone 3      │
│    ┌────────┐        ┌────────┐        ┌────────┐    │
│    │Instance│        │Instance│        │Instance│    │
│    │   #1   │        │   #2   │        │   #3   │    │
│    └────────┘        └────────┘        └────────┘    │
│                                                      │
│  All instances share:                                │
│  - Same configuration (rules, backends, probes)      │
│  - Same public/private IP (traffic distributed)      │
│  - Same backend pools and routing rules              │
│  - Cookie-based session affinity (when enabled)      │
└──────────────────────────────────────────────────────┘
| Scenario | What Happens | User Impact |
|---|---|---|
| Normal Operation | Traffic distributed across all 3 zones | Optimal performance |
| Zone 2 Fails | Traffic automatically redirects to Zone 1 & 3 | No downtime, slight capacity reduction |
| Zones 2 & 3 Fail | All traffic handled by Zone 1 | No downtime, reduced capacity |
| Zone 2 Recovers | Traffic automatically rebalances | Performance improves |
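The "reduced capacity" entries in the scenarios above can be quantified. This is a rough sketch that assumes one gateway instance per zone and ignores autoscaling, which may add instances in the surviving zones; `surviving_capacity` is a hypothetical helper, not an Azure API.

```python
def surviving_capacity(instances_per_zone, failed_zones):
    """Fraction of gateway instances still serving traffic after zone failures."""
    total = sum(instances_per_zone.values())
    healthy = sum(
        count for zone, count in instances_per_zone.items()
        if zone not in failed_zones
    )
    return healthy / total

# Assumption: one instance in each of zones 1, 2 and 3.
layout = {1: 1, 2: 1, 3: 1}

for failed in (set(), {2}, {2, 3}):
    share = surviving_capacity(layout, failed)
    print(f"failed zones {sorted(failed)}: {share:.0%} of capacity remains")
```

With this layout, losing one zone leaves two thirds of the baseline capacity and losing two zones leaves one third, so the minimum instance count should be sized for the load that must survive a zone failure.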
| Incorrect Understanding | Correct Understanding |
|---|---|
| “I need 3 separate AGW resources” | “I need 1 AGW resource configured for 3 zones” |
| “Each zone has different configuration” | “All zones share the same configuration” |
| “I must manually manage failover” | “Azure handles failover automatically” |
| “Each zone needs its own IP” | “Single IP address serves all zones” |
| “Zone deployment is complex” | “It’s just a configuration parameter” |
Cost Breakdown:
Single Zone AGW:
- Instances: 1 (minimum)
- Cost: ~$0.25/hour
- Monthly: ~$180
Zone-Redundant AGW (3 zones):
- Instances: 3 (minimum, 1 per zone)
- Cost: ~$0.75/hour
- Monthly: ~$540
- Additional cost: ~$360/month
- Benefit: 99.99% SLA vs 99.95%
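The arithmetic behind the breakdown above can be reproduced in a few lines. The sketch assumes a 30-day (720-hour) month, which is what turns roughly $0.25/hour into roughly $180/month; the hourly rate is the approximate figure from this document, not a price quote.

```python
HOURS_PER_MONTH = 720            # assumption: 30 days x 24 hours
HOURLY_RATE_PER_INSTANCE = 0.25  # approximate figure from the breakdown above

def monthly_cost(instances, hourly_rate=HOURLY_RATE_PER_INSTANCE):
    """Approximate monthly cost for a given number of gateway instances."""
    return instances * hourly_rate * HOURS_PER_MONTH

single_zone = monthly_cost(1)
zone_redundant = monthly_cost(3)

print(f"single zone:    ${single_zone:.0f}/month")
print(f"zone-redundant: ${zone_redundant:.0f}/month")
print(f"additional:     ${zone_redundant - single_zone:.0f}/month")
```

Changing `instances` or `hourly_rate` shows how the premium scales if the minimum instance count or the regional price differs from the figures used here.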
Important Distinction:
Current Zone-Redundant Setup:
Protection Against:
[Yes] Single availability zone failure
[Yes] Multiple zone failures (if at least 1 zone survives)
[Yes] Zone-level maintenance and updates
[No] Complete regional failure
[No] Region-wide Azure outage
Scope: Single Region (e.g., East US)
SLA: 99.99% within the region
| Scenario | Zone-Redundant Sufficient? | Need Multi-Region? |
|---|---|---|
| Zone maintenance | Yes | No |
| Data center failure | Yes | No |
| Regional disaster | No | Yes |
| Compliance requirements | Depends | Maybe |
| Global user base | Partially | Yes |
Architecture Components:
VPN Gateway Configuration:
Gateway:
  SKU: VpnGw2AZ or higher
  VPN Type: RouteBased
  Active-Active: Enabled
  Zones: [1, 2, 3]
Connections:
  Tunnel1:
    Public IP: pip-vpn-primary
    BGP: Enabled
    ASN: 65001
    Peer IP: Customer-A-Firewall-1
  Tunnel2:
    Public IP: pip-vpn-secondary
    BGP: Enabled
    ASN: 65001
    Peer IP: Customer-A-Firewall-2
NAT Rules:
  Type: Static  # Static NAT maps addresses 1:1, so both ranges below should be the same size
  Mode: EgressSnat
  Internal Subnet: 10.0.0.0/16
  External Mapping: 192.168.1.0/24
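Static NAT translates each internal address to a fixed external address, which generally requires the internal and external ranges to have the same prefix length. The sketch below illustrates that one-to-one mapping and rejects mismatched ranges such as the /16 and /24 shown in the NAT rule; `static_nat_map` is a hypothetical helper and the 10.0.1.0/24 range is an illustrative value, not part of the original configuration.

```python
import ipaddress

def static_nat_map(internal_cidr, external_cidr, address):
    """Translate one internal address using a 1:1 (static) range mapping."""
    internal = ipaddress.ip_network(internal_cidr)
    external = ipaddress.ip_network(external_cidr)
    if internal.prefixlen != external.prefixlen:
        raise ValueError(
            f"static NAT needs equal-sized ranges, got /{internal.prefixlen} "
            f"and /{external.prefixlen}"
        )
    addr = ipaddress.ip_address(address)
    if addr not in internal:
        raise ValueError(f"{addr} is outside {internal}")
    # Keep the host offset and swap the network part.
    offset = int(addr) - int(internal.network_address)
    return str(external.network_address + offset)

# Illustrative equal-sized ranges (assumption): the host part is preserved.
print(static_nat_map("10.0.1.0/24", "192.168.1.0/24", "10.0.1.25"))

# The ranges from the NAT rule differ in size, so a 1:1 mapping is rejected.
try:
    static_nat_map("10.0.0.0/16", "192.168.1.0/24", "10.0.1.25")
except ValueError as err:
    print(f"rejected: {err}")
```

If the whole /16 really has to be presented behind a /24, a dynamic (many-to-one) NAT rule or a narrower internal range would be the options to evaluate.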
┌─────────────────────────────────────────────────────────────┐
│                   Fabrikam-Customer Azure                   │
│                              │                              │
│                      Private Endpoint                       │
│                              │                              │
└──────────────────────────────┬──────────────────────────────┘
                               │ Private Link
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Contoso Hub VNet                       │
│  ┌───────────────────────────────────────────────────────┐  │
│  │         Zone-Redundant Application Gateway v2         │  │
│  │      Zones: 1, 2, 3 | Autoscale: 2-10 instances       │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │               Azure Firewall (Optional)               │  │
│  │       Zones: 1, 2, 3 | For additional security        │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │               Active-Active VPN Gateway               │  │
│  │      2 Public IPs | BGP Enabled | Zone-redundant      │  │
│  └───────────────────────────────────────────────────────┘  │
└──────────────────────────────┬──────────────────────────────┘
                               │ Dual IPsec Tunnels
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                Fabrikam-Customer On-Premises                │
│               Redundant Firewalls/VPN Devices               │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                     Contoso Spoke VNets                     │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                Multi-Zone AKS Cluster                 │  │
│  │            Node Pools across Zones 1, 2, 3            │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
Inbound Flow (Fabrikam-Customer → Contoso):
Outbound Flow (Contoso → Fabrikam-Customer On-Prem):
| Failure Type | Impact | Recovery Time | Automatic? |
|---|---|---|---|
| AGW Zone Failure | Traffic redirects to healthy zones | < 30 seconds | Yes |
| VPN Gateway Failure | Traffic switches to secondary tunnel | < 60 seconds | Yes |
| AKS Node Failure | Pods rescheduled to healthy nodes | < 2 minutes | Yes |
| Private Link Failure | Managed by Microsoft | < 30 seconds | Yes |
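The recovery times above can be set against the downtime that an availability target allows. The sketch below converts an availability figure into a monthly downtime budget and counts how many worst-case failovers fit into it; it assumes a 30-day month and treats the "<" values in the table as upper bounds, and both function names are hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month

def downtime_budget_seconds(availability):
    """Seconds of downtime per month permitted by an availability target."""
    return (1 - availability) * SECONDS_PER_MONTH

def failovers_within_budget(availability, recovery_seconds):
    """Whole number of worst-case failovers that fit in the monthly budget."""
    return int(downtime_budget_seconds(availability) // recovery_seconds)

# Targets and worst-case recovery times taken from the tables in this document.
scenarios = [
    ("AGW zone failure", 0.9999, 30),
    ("VPN gateway failure", 0.9995, 60),
    ("AKS node failure", 0.9995, 120),
]

for name, availability, recovery in scenarios:
    budget = downtime_budget_seconds(availability)
    count = failovers_within_budget(availability, recovery)
    print(f"{name}: budget {budget:.0f}s/month, about {count} failovers")
```

This kind of estimate helps decide how often failover drills can run in production without consuming the whole monthly budget.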
Monthly HA Validation Tests:
To reach 99.95%+ composite SLA:
This section addresses a common point of confusion:
“How can we achieve VPN High Availability—do we need two VPN gateways in one VNet?”
The short and correct answer is:
You only deploy ONE Azure VPN Gateway.
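Azure runs TWO gateway instances behind that single resource; enabling Active-Active mode makes both instances active, each with its own public IP, instead of keeping one on standby.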
This gives full HA without needing two separate gateways.
Many engineers assume that HA requires two separate VPN gateways deployed in the same VNet.
But Azure does not allow multiple VPN Gateways in a single VNet.
Instead, the HA is provided inside the gateway resource itself.
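When you deploy one VPN Gateway and enable Active-Active mode, Azure automatically provisions two active gateway instances, each with its own public IP address and its own tunnel to the on-premises devices.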
Both exist inside the same GatewaySubnet in one VNet.
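The "one gateway resource, two active instances" idea can be modelled in a few lines. This is a conceptual sketch only: the `Tunnel` class and `usable_tunnels` function are hypothetical, the names reuse the tunnel and public IP labels from the VPN configuration earlier in this document, and real path selection is handled by BGP rather than by application code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Tunnel:
    """One IPsec tunnel terminated on one instance of the single VPN Gateway."""
    name: str
    gateway_public_ip: str
    peer_device: str
    up: bool = True

def usable_tunnels(tunnels: List[Tunnel]) -> List[Tunnel]:
    """Tunnels that can still carry traffic."""
    return [t for t in tunnels if t.up]

# One gateway resource, two active instances, one tunnel per instance.
tunnels = [
    Tunnel("Tunnel1", "pip-vpn-primary", "Customer-A-Firewall-1"),
    Tunnel("Tunnel2", "pip-vpn-secondary", "Customer-A-Firewall-2"),
]

# Simulate the loss of the instance behind the first public IP.
tunnels[0].up = False
remaining = usable_tunnels(tunnels)
print([t.name for t in remaining])
```

The point of the model is that connectivity survives the loss of either instance without a second gateway resource ever being deployed.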