user-guides

High Availability Architecture for Contoso-Fabrikam-Customer Connectivity

Executive Summary

This document outlines the high availability (HA) and resilience architecture for establishing secure connectivity between Contoso’s Azure environment and Fabrikam-Customer’s infrastructure (both Azure and on-premises). The solution addresses both inbound connectivity (Fabrikam-Customer Azure → Contoso Azure) via Private Link and outbound connectivity (Contoso Azure → Fabrikam-Customer on-premises) via VPN with NAT.


1. Current Architecture Overview

1.1 What We Have

1.2 Key Challenges


2. What Constitutes True HA

Component Without HA With HA Availability Gain
Application Gateway Single zone Zone-redundant 99.9% → 99.99%
VPN Gateway Single tunnel Active-Active 99.9% → 99.95%
Private Link N/A Built-in HA 99.99% (managed)
AKS Cluster Single zone Multi-zone 99.9% → 99.95%

3.1 What: Zone-Redundant Application Gateway

Architecture Components:

3.2 Why: Benefits of This Approach

  1. Automatic Failover: No manual intervention required during zone failures
  2. Zero Downtime Maintenance: Rolling updates without service interruption
  3. Load Distribution: Traffic automatically distributed across zones
  4. Cost Efficiency: Pay for single resource, get multi-zone redundancy

3.3 How: Implementation Steps

Configuration Requirements:
  Application Gateway:
    SKU: Standard_v2 or WAF_v2
    Zones: [1, 2, 3]  # Deploy across all three zones
    Minimum Instances: 2
    Maximum Instances: 10
    Autoscale: Enabled
    
  Private Link Service:
    Load Balancer: Standard SKU
    Frontend IPs: Multiple for redundancy
    NAT IP Configuration: Static allocation
    Visibility: Restricted to Fabrikam-Customer subscription

Deployment Process:

  1. Create zone-redundant AGW v2 in Contoso subscription
  2. Configure Private Link Service on AGW
  3. Share Private Link Service alias with Fabrikam-Customer
  4. Fabrikam-Customer creates Private Endpoint in their VNet
  5. Configure DNS for private endpoint resolution
  6. Test failover scenarios

3.4 Understanding Zone-Redundant AGW Deployment

What Does “Deploy AGW in All 3 Zones” Mean?

CRITICAL CLARIFICATION: When we say “deploy AGW in all 3 zones,” we mean deploying ONE Application Gateway resource that automatically distributes its compute instances across multiple availability zones, NOT three separate AGW resources.

Technical Implementation

What You Create:
  - ONE Application Gateway resource (e.g., agw-Contoso-prod)
  - Single management plane
  - Single configuration set
  - One public/private IP address

What Azure Does Behind the Scenes:
  - Creates multiple compute instances
  - Distributes these instances across zones 1, 2, and 3
  - Manages health monitoring and failover
  - Synchronizes configuration automatically
  - Handles traffic distribution transparently

Visual Representation

                    Single AGW Resource (agw-Contoso-prod)
                    Management View in Azure Portal
                              │
                              │ Azure Manages Distribution
                              ▼
    ┌──────────────────────────────────────────────────────┐
    │                 Availability Zones                   │
    │                                                      │
    │  Zone 1              Zone 2              Zone 3      │
    │  ┌────────┐          ┌────────┐          ┌────────┐  │
    │  │Instance│          │Instance│          │Instance│  │
    │  │   #1   │          │   #2   │          │   #3   │  │
    │  └────────┘          └────────┘          └────────┘  │
    │                                                      │
    │  All instances share:                                │
    │  - Same configuration (rules, backends, probes)      │
    │  - Same public/private IP (traffic distributed)      │
    │  - Same backend pools and routing rules              │
    │  - Synchronized state and session data               │
    └──────────────────────────────────────────────────────┘

How Zone Failover Works

Scenario What Happens User Impact
Normal Operation Traffic distributed across all 3 zones Optimal performance
Zone 2 Fails Traffic automatically redirects to Zone 1 & 3 No downtime, slight capacity reduction
Zones 2 & 3 Fail All traffic handled by Zone 1 No downtime, reduced capacity
Zone 2 Recovers Traffic automatically rebalances Performance improves

Common Misconceptions Clarified

Incorrect Understanding Correct Understanding
“I need 3 separate AGW resources” “I need 1 AGW resource configured for 3 zones”
“Each zone has different configuration” “All zones share the same configuration”
“I must manually manage failover” “Azure handles failover automatically”
“Each zone needs its own IP” “Single IP address serves all zones”
“Zone deployment is complex” “It’s just a configuration parameter”

Cost Implications of Zone Redundancy

Cost Breakdown:
  Single Zone AGW:
    - Instances: 1 (minimum)
    - Cost: ~$0.25/hour
    - Monthly: ~$180
    
  Zone-Redundant AGW (3 zones):
    - Instances: 3 (minimum, 1 per zone)
    - Cost: ~$0.75/hour
    - Monthly: ~$540
    - Additional cost: ~$360/month
    - Benefit: 99.99% SLA vs 99.95%

Key Takeaways

  1. Single Resource: You manage ONE AGW, not three
  2. Automatic Distribution: Azure handles instance placement
  3. Transparent Failover: No manual intervention needed
  4. Unified Configuration: One set of rules for all zones
  5. Higher SLA: 99.99% availability with zone redundancy
  6. Increased Cost: Approximately 3x the single-zone cost for 3-zone redundancy

3.5 Regional Considerations for AGW

Zone Redundancy vs Regional Redundancy

Important Distinction:

Current Zone-Redundant Setup:
  Protection Against:
    [Yes] Single availability zone failure
    [Yes] Multiple zone failures (if at least 1 zone survives)
    [Yes] Zone-level maintenance and updates
    [No] Complete regional failure
    [No] Region-wide Azure outage
    
  Scope: Single Region (e.g., East US)
  SLA: 99.99% within the region

When to Consider Multi-Region

Scenario Zone-Redundant Sufficient? Need Multi-Region?
Zone maintenance Yes No
Data center failure Yes No
Regional disaster No Yes
Compliance requirements Depends Maybe
Global user base Partially Yes

4. VPN High Availability Solution

4.1 What: Active-Active VPN Gateway Configuration

Architecture Components:

4.2 Why: Advantages of Active-Active VPN

  1. Continuous Availability: Both tunnels active simultaneously
  2. Automatic Failover: Sub-second convergence with BGP
  3. Load Balancing: Distribute traffic across tunnels
  4. Maintenance Windows: Update one tunnel while other handles traffic
  5. SLA Guarantee: Microsoft-backed 99.95% availability

4.3 How: Implementation Architecture

VPN Gateway Configuration:
  Gateway:
    SKU: VpnGw2AZ or higher
    Type: RouteBased
    VPN Type: Active-Active
    Zones: [1, 2, 3]
    
  Connections:
    Tunnel1:
      Public IP: pip-vpn-primary
      BGP: Enabled
      ASN: 65001
      Peer IP: Customer-A-Firewall-1
      
    Tunnel2:
      Public IP: pip-vpn-secondary
      BGP: Enabled
      ASN: 65001
      Peer IP: Customer-A-Firewall-2
      
  NAT Rules:
    Type: Static
    Mode: EgressSnat
    Internal Subnet: 10.0.0.0/16
    External Mapping: 192.168.1.0/24

5.1 Hub-Spoke Topology with Gateway Transit

┌─────────────────────────────────────────────────────────────┐
│                     Fabrikam-Customer Azure                 │
│                           │                                 │
│                    Private Endpoint                         │
│                           │                                 │
└───────────────────────┬─────────────────────────────────────┘
                        │ Private Link
                        ▼
┌─────────────────────────────────────────────────────────────┐
│                    Contoso Hub VNet                         │
│  ┌─────────────────────────────────────────────────────┐    │
│  │   Zone-Redundant Application Gateway v2             │    │
│  │   Zones: 1, 2, 3 | Autoscale: 2-10 instances        │    │
│  └─────────────────────────────────────────────────────┘    │
│                           │                                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │   Azure Firewall (Optional)                         │    │
│  │   Zones: 1, 2, 3 | For additional security          │    │
│  └─────────────────────────────────────────────────────┘    │
│                           │                                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │   Active-Active VPN Gateway                         │    │
│  │   2 Public IPs | BGP Enabled | Zone-redundant       │    │
│  └─────────────────────────────────────────────────────┘    │
└──────────────────────────┬──────────────────────────────────┘
                           │ Dual IPsec Tunnels
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              Fabrikam-Customer On-Premises                  │
│         Redundant Firewalls/VPN Devices                     │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                   Contoso Spoke VNets                       │
│  ┌─────────────────────────────────────────────────────┐    │
│  │   Multi-Zone AKS Cluster                            │    │
│  │   Node Pools across Zones 1, 2, 3                   │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

5.2 Traffic Flow Patterns

Inbound Flow (Fabrikam-Customer → Contoso):

  1. Fabrikam-Customer application → Private Endpoint
  2. Private Link connection → Contoso AGW
  3. AGW (with zone failover) → Backend AKS pods
  4. Response follows reverse path

Outbound Flow (Contoso → Fabrikam-Customer On-Prem):

  1. AKS pods → Hub VNet (via peering)
  2. NAT translation at VPN Gateway
  3. Active tunnel selection via BGP
  4. Encrypted traffic → Fabrikam-Customer on-premises

6. Failover Scenarios and Recovery

6.1 Zone Failure Scenarios

Failure Type Impact Recovery Time Automatic?
AGW Zone Failure Traffic redirects to healthy zones < 30 seconds Yes
VPN Gateway Failure Traffic switches to secondary tunnel < 60 seconds Yes
AKS Node Failure Pods rescheduled to healthy nodes < 2 minutes Yes
Private Link Failure Managed by Microsoft < 30 seconds Yes

6.2 Testing Procedures

Monthly HA Validation Tests:

  1. Zone Failure Simulation
    • Disable one availability zone
    • Verify traffic continues flowing
    • Measure failover time
  2. VPN Tunnel Failure
    • Disconnect primary tunnel
    • Verify BGP reconvergence
    • Check NAT translations
  3. Load Testing
    • Simulate peak traffic conditions
    • Verify autoscaling triggers
    • Monitor performance metrics

7. Achieving Higher SLA

To reach 99.95%+ composite SLA:

  1. Implement multi-region failover (Active-Passive)
  2. Use Traffic Manager for DNS-based routing
  3. Deploy redundant Private Link endpoints
  4. Consider Azure Front Door for global load balancing

Addendum

Understanding VPN Gateway High Availability in Azure

(Why You Don’t Need Two Gateways in a VNet)

This explains a common confusion:
“How can we achieve VPN High Availability—do we need two VPN gateways in one VNet?”

The short and correct answer is:

You only deploy ONE Azure VPN Gateway.
Azure creates TWO gateway instances behind it when Active-Active mode is enabled.
This gives full HA without needing two separate gateways.


1. The Source of Confusion

Many engineers assume that HA requires:

But Azure does not allow multiple VPN Gateways in a single VNet.
Instead, the HA is provided inside the gateway resource itself.


2. How Azure Actually Provides HA

One Logical Gateway → Two Physical Instances

When you deploy one VPN Gateway and enable Active-Active mode, Azure automatically creates:

Both exist inside the same GatewaySubnet in one VNet.