Response to Azure Front Door Incident (YKYN-BWZ) - October 29-30, 2025
Following the Azure Front Door global outage on October 29-30, 2025 (Tracking ID: YKYN-BWZ), this document provides comprehensive guidance on implementing a highly available, cost-optimized architecture for customers running 30-40 App Services behind Azure Front Door.
Incident Date: October 29-30, 2025
Duration: 15:41 UTC (Oct 29) to 00:05 UTC (Oct 30) - ~8.4 hours (8 h 24 min)
Impact: Global Azure Front Door and Azure CDN connectivity issues
A sequence of customer configuration changes across two different control plane build versions resulted in incompatible metadata that exposed a latent bug in the data plane. The configuration passed through validation safeguards because the crash occurred asynchronously (~5 minutes after deployment), allowing the problematic configuration to propagate globally.
Layer 1 (DNS):     Traffic Manager (orchestrator)
                                  |
                        ┌─────────┴─────────┐
                        |                   |
Layer 2 (CDN/WAF):     AFD         Application Gateway
                        |                   |
                        └─────────┬─────────┘
                                  |
Layer 3 (Origin):  Backend Apps (30-40 App Services)
| Component | Approx. Monthly Cost (USD) | Notes |
|---|---|---|
| Traffic Manager | $35-50 | Based on 10M DNS queries/month |
| Azure Front Door Standard | $550-800 | 30-40 routing rules, 500GB outbound |
| Application Gateway v2 (Standby) | $250-350 | 2 minimum instances, minimal traffic |
| Private Link | $15-30 | 30-40 Private Endpoints @ $0.01/hour |
| Total (Proposed) | $850-1,230 | Full redundancy |
Traffic Manager (Priority Routing)
├── Priority 1: Azure Front Door (active)
└── Priority 2: Application Gateway (warm standby - minimum capacity)
    └── Auto-scaling: min=2, max=10 instances
Pros:
Cons:
Traffic Manager (Priority Routing)
├── Priority 1: Azure Front Door (active)
└── Priority 2: Application Gateway (deployed on-demand via automation)
Pros:
Cons:
Azure Front Door Premium (No Traffic Manager needed)
├── Origin Group 1 (Priority 1): App Services via Private Link
└── Origin Group 2 (Priority 2): Application Gateway (standby) → App Services
    ├── Automatic health-based failover
    └── Application Gateway auto-scales based on traffic
Pros:
Cons:
| Scenario | Option 1 (TM + AFD + AppGW) | Option 3 (AFD Premium) | Savings |
|---|---|---|---|
| Normal Operation | $835-1,200 | $930-980 | -$95 to +$220 |
| During Failover | $835-1,200 | $1,130-1,230 | Similar |
Verdict: Option 3 provides comparable cost with significant operational benefits
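The "Savings" column above appears to compare the low ends with each other and the high ends with each other. A minimal sketch of that arithmetic follows; the function name `savings_range` is hypothetical, and the figures are copied from the table rather than from any pricing API.

```python
def savings_range(baseline, alternative):
    """Return (low, high) savings of `alternative` relative to `baseline`.

    Low ends are compared with low ends and high ends with high ends.
    A negative value means the alternative costs more than the baseline.
    """
    base_low, base_high = baseline
    alt_low, alt_high = alternative
    return base_low - alt_low, base_high - alt_high


# Monthly cost ranges in USD, taken from the comparison table.
option_1_normal = (835, 1200)  # TM + AFD + AppGW
option_3_normal = (930, 980)   # AFD Premium

low, high = savings_range(option_1_normal, option_3_normal)
print(f"Normal operation savings: {low:+d} to {high:+d} USD/month")
```

Read this way, Option 3 costs more than the cheapest Option 1 configuration and less than the most expensive one.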
Traffic Manager (Priority Routing)
├── Priority 1: Azure Front Door → App Services
└── Priority 2: Akamai CDN → App Services
Pros:
Cons:
| Option | Monthly Cost | RTO | Complexity | Recommendation |
|---|---|---|---|---|
| 1. TM + AFD + AppGW (warm) | $835-1,200 | <1 min | Medium | Good for immediate failover needs |
| 2. IaC Rapid Deployment | $585-850 | 10-15 min | Medium-High | Only if RTO allows |
| 3. AFD Premium + Origin Groups | $930-980 | <2 min | Low-Medium | **RECOMMENDED** |
| 4. AFD + Akamai | $1,085-2,850+ | <1 min | High | Mission-critical only |
Rationale:
Customer Environment:
Problem:
App1.contoso.com → Certificate 1
App2.contoso.com → Certificate 2
App3.contoso.com → Certificate 3
...
App40.contoso.com → Certificate 40
Total: 40 certificates to manage, renew, and sync
Option A: Wildcard Certificates (Recommended if all apps use subdomains)
*.contoso.com → Certificate 1 (covers all subdomains)
*.internal.contoso.com → Certificate 2 (if using second-level subdomains)
Total: 1-2 certificates
Option B: Subject Alternative Name (SAN) Certificates
Certificate 1: app1.contoso.com, app2.contoso.com, ..., app10.contoso.com (10 SANs)
Certificate 2: app11.contoso.com, app12.contoso.com, ..., app20.contoso.com (10 SANs)
Certificate 3: app21.contoso.com, ..., app30.contoso.com
Certificate 4: app31.contoso.com, ..., app40.contoso.com
Total: 4 certificates (10 domains per cert)
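The grouping in Option B can be sketched as a simple batching step. This is an illustration only: `batch_san_certificates` is a hypothetical helper, the hostnames follow the `appN.contoso.com` pattern used above, and the limit of 10 names per certificate is the figure from this document, not a CA requirement.

```python
def batch_san_certificates(hostnames, sans_per_cert=10):
    """Split hostnames into consecutive groups, one group per SAN certificate."""
    if sans_per_cert < 1:
        raise ValueError("sans_per_cert must be at least 1")
    return [hostnames[i:i + sans_per_cert]
            for i in range(0, len(hostnames), sans_per_cert)]


# Hostnames generated from the appN.contoso.com pattern (assumed naming).
hostnames = [f"app{n}.contoso.com" for n in range(1, 41)]
batches = batch_san_certificates(hostnames)

for index, batch in enumerate(batches, start=1):
    print(f"Certificate {index}: {batch[0]} ... {batch[-1]} ({len(batch)} SANs)")
```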
Option C: Mixed Approach
Certificate 1: *.contoso.com (wildcard for most apps)
Certificate 2: special-app.differentdomain.com (SAN for apps on different domains)
Certificate 3: *.partner.contoso.com (separate subdomain space)
Total: 3-5 certificates
Recommended Consolidation:
Total Certificates: 3-5 (from 40)
Certificate 1: *.contoso.com (wildcard)
- Covers: app1.contoso.com, app2.contoso.com, ..., app35.contoso.com
- Provider: DigiCert or Let's Encrypt (if acceptable for production)
Certificate 2: SAN certificate for special cases
- Covers: 5-10 apps with different domain patterns
Benefits:
- 90% reduction in certificate management overhead
- Single certificate to sync across platforms
- Easier renewal automation
- Cost savings (1 wildcard cert vs. 40 individual certs)
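Before consolidating, it helps to confirm which hostnames a wildcard actually covers, because a wildcard matches exactly one DNS label. The sketch below shows one way to check this. The helper names and `api.internal.contoso.com` are hypothetical; the other names come from the options above.

```python
def wildcard_covers(pattern, hostname):
    """Return True if a certificate name covers the hostname.

    A leading "*." matches exactly one DNS label, so "*.contoso.com" covers
    "app1.contoso.com" but not "api.internal.contoso.com" or "contoso.com".
    """
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if not pattern.startswith("*."):
        return pattern == hostname
    suffix = pattern[1:]  # e.g. ".contoso.com"
    if not hostname.endswith(suffix):
        return False
    first_label = hostname[:-len(suffix)]
    return bool(first_label) and "." not in first_label


def uncovered(hostnames, certificate_names):
    """Return the hostnames that none of the certificate names cover."""
    return [h for h in hostnames
            if not any(wildcard_covers(p, h) for p in certificate_names)]


certs = ["*.contoso.com", "*.internal.contoso.com"]
hosts = ["app1.contoso.com", "api.internal.contoso.com",
         "special-app.differentdomain.com"]
print(uncovered(hosts, certs))
```

Any hostname this check reports as uncovered is a candidate for the SAN certificate that handles special cases.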
Azure Key Vault (Central Source of Truth)
├── Certificate: wildcard-contoso-com
├── Certificate: san-special-apps
└── Certificate: wildcard-internal-contoso-com
Synchronized to ↓
├── Azure Front Door Premium
│ ├── Custom Domain: app1.contoso.com (uses wildcard-contoso-com)
│ ├── Custom Domain: app2.contoso.com (uses wildcard-contoso-com)
│ └── ... (all 40 domains configured)
│
├── Application Gateway
│ ├── Listener: app1.contoso.com (references Key Vault certificate)
│ ├── Listener: app2.contoso.com (references Key Vault certificate)
│ └── ... (all 40 listeners configured)
│
└── Akamai (if using Option 4)
    └── Certificates uploaded via API/manually
1. Create Azure Key Vault
2. Upload/Import Certificate to Key Vault
3. Grant AFD Premium access to Key Vault
4. Grant Application Gateway access to Key Vault
# For each of the 30-40 custom domains:
for domain in app1 app2 app3 ... app40; do
  # 1. Create the custom domain in AFD
  # 2. Associate the custom domain with its route
  :  # no-op placeholder so the loop body is valid shell
done
Reference Key Vault certificate
Create listeners for each app
Create backend pool (points to App Service)
Create rule
; Keep ALL validation records permanently in DNS
; This allows pre-validated certificates on all platforms
; Azure Front Door validation
_acme-challenge.contoso.com. TXT "afd-validation-token-12345"
_dnsauth.contoso.com. TXT "afd-validation-token-67890"
; Application Gateway / Let's Encrypt validation
_acme-challenge.contoso.com. TXT "letsencrypt-validation-abc123"
; Akamai validation (if used)
_acme-challenge.contoso.com. TXT "akamai-validation-xyz789"
; Note: Multiple TXT records for the same name are allowed; together they form one record set (RRset, RFC 2181)
Add permanent validation records
Result: All platforms can validate certificates at ANY time without DNS changes
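A quick way to confirm that the permanent records are all present is to parse a zone export and compare it with the tokens each platform expects. The sketch below works on zone-file text only and does not query DNS. The function names are hypothetical, the tokens are the placeholder values shown above, and the parser is simplified (it assumes no semicolons inside quoted values).

```python
from collections import defaultdict

ZONE_SNIPPET = '''
; Azure Front Door validation
_acme-challenge.contoso.com. TXT "afd-validation-token-12345"
_dnsauth.contoso.com. TXT "afd-validation-token-67890"
_acme-challenge.contoso.com. TXT "letsencrypt-validation-abc123"
_acme-challenge.contoso.com. TXT "akamai-validation-xyz789"
'''


def txt_records_by_name(zone_text):
    """Group TXT values by owner name from simple 'name TXT "value"' lines."""
    records = defaultdict(set)
    for line in zone_text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop zone-file comments
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        name, rtype, value = parts
        if rtype.upper() == "TXT":
            records[name.lower()].add(value.strip('"'))
    return records


def missing_tokens(zone_text, required):
    """Return {name: [tokens]} for required tokens absent from the zone."""
    records = txt_records_by_name(zone_text)
    missing = {}
    for name, tokens in required.items():
        absent = set(tokens) - records.get(name.lower(), set())
        if absent:
            missing[name] = sorted(absent)
    return missing


required = {
    "_acme-challenge.contoso.com.": {"afd-validation-token-12345",
                                     "akamai-validation-xyz789"},
    "_dnsauth.contoso.com.": {"afd-validation-token-67890"},
}
print(missing_tokens(ZONE_SNIPPET, required) or "all validation records present")
```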
Create Function App
Grant permissions
Deploy function code
Based on the October 29-30 Azure Front Door incident, Microsoft documented the following best practices:
Reference: https://learn.microsoft.com/azure/architecture/guide/networking/global-web-applications/mission-critical-global-http-ingress
Key Principles:
Microsoft migrated critical first-party services to active-active with fail-away:
Azure Portal
Azure Communication Services
Azure Marketplace
Linux Software Repository for Microsoft Products
Support ticket creation system
Architecture Pattern:
Primary: Azure Front Door (global edge)
Secondary: Independent infrastructure (different control plane)
Failover: Automated based on health signals
Recovery Time: < 5 minutes
Lesson: If Microsoft itself now uses multi-platform failover for critical services, customers should too
Layer 1 - Global Traffic Distribution (DNS):
Purpose: Route users to nearest/healthiest entry point
Options:
- Azure Traffic Manager (recommended for Azure-centric)
- AWS Route 53 with health checks
- Cloudflare Load Balancing
- NS1 Managed DNS
Configuration:
- Health Probe Interval: 10-30 seconds
- Failover Detection: 3 consecutive failures
- TTL: 60 seconds (balance between failover speed and DNS load)
- Multiple endpoints: 2-3 geographically diverse
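The probe interval, failure threshold, and TTL above combine into a rough upper bound on DNS-level failover time. The sketch below is a simplified model, not Traffic Manager's exact behavior: it assumes detection takes the probe interval multiplied by the failure count, that resolvers honor the TTL, and it ignores probe timeouts. The function name is hypothetical.

```python
def dns_failover_seconds(probe_interval_s, failures_to_trip, ttl_s):
    """Rough worst-case seconds until clients resolve to the standby endpoint.

    detection   = probe interval x consecutive failures required
    propagation = up to one full TTL for cached answers to expire
    """
    detection = probe_interval_s * failures_to_trip
    return detection + ttl_s


for interval in (10, 30):
    total = dns_failover_seconds(interval, failures_to_trip=3, ttl_s=60)
    print(f"{interval}s probe interval: up to {total}s before clients fail over")
```

Under these assumptions, shortening the probe interval shortens detection, but the TTL still sets a floor for clients that already hold a cached answer.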
Layer 2 - Edge Security and Acceleration (CDN/WAF):
Purpose: DDoS protection, WAF, SSL termination, caching
Primary Options:
- Azure Front Door Premium (recommended)
- Cloudflare
- Akamai
Secondary Options (for redundancy):
- Azure Application Gateway
- AWS CloudFront
- Different CDN provider
Configuration:
- WAF: OWASP Top 10 protection
- DDoS: L3/L4 (network) + L7 (application)
- Bot Protection: Challenge/rate limiting
- Caching: Based on Cache-Control headers
- Private Link: Direct connection to origins (no public internet)
Layer 3 - Regional Load Balancing (Optional):
Purpose: Distribute within region, handle origin failures
Options:
- Azure Load Balancer
- Application Gateway (if not in Layer 2)
Configuration:
- Health probes to backend instances
- Session affinity if stateful
Layer 4 - Origin (Application Tier):
Purpose: Actual application serving
Configuration:
- Multi-region deployment (2+ regions)
- Zone-redundant within region
- Auto-scaling based on demand
- Private endpoints (no public internet access)
- Accept traffic ONLY from Layer 2/3 (restrict by source IP/service tag)
Recommended Regional Distribution:
Primary Region: East US 2
- App Services: 20 apps (50% capacity)
- Configuration: Zone-redundant, Premium v3
- Private Endpoints: Connected to AFD Premium
Secondary Region: West US 2
- App Services: 20 apps (50% capacity)
- Configuration: Zone-redundant, Premium v3
- Private Endpoints: Connected to AFD Premium
Routing Strategy:
Normal Operation:
- AFD uses latency-based routing
- Users routed to nearest region
- Both regions active (true active-active)
Regional Failure:
- AFD automatically detects unhealthy region (health probes)
- Routes 100% traffic to healthy region
- Auto-scaling handles increased load
Capacity Planning:
- Each region sized for 60-70% of total traffic (N+1 redundancy)
- Auto-scale max: 150% of normal peak capacity
- Ensure sufficient quota in both regions
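The capacity figures above can be sanity-checked with a little arithmetic. The sketch below uses one possible reading: each region's baseline is a fraction of total traffic, and auto-scaling can raise that region to 150% of its baseline. Both the reading and the function name are assumptions, not statements from the sizing guidance.

```python
def survives_regional_failure(sized_fraction, autoscale_ceiling=1.5):
    """True if one region, scaled to its ceiling, can carry 100% of traffic.

    sized_fraction    -- share of total traffic the region is sized for (0-1)
    autoscale_ceiling -- maximum scale-out as a multiple of that baseline
    """
    return sized_fraction * autoscale_ceiling >= 1.0


for fraction in (0.60, 0.65, 0.70):
    ceiling = fraction * 1.5
    verdict = "OK" if survives_regional_failure(fraction) else "insufficient"
    print(f"sized for {fraction:.0%}: ceiling is {ceiling:.1%} of total traffic ({verdict})")
```

Under this reading, a region needs a baseline of roughly two thirds of total traffic before a 150% ceiling covers a full regional failover. That makes the upper end of the 60-70% range the safer choice.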
Based on all considerations (cost, complexity, reliability), here’s the recommended architecture:
┌─────────────────────────────────────────────────────────────────┐
│                     Azure DNS (contoso.com)                     │
│  - A records point to AFD                                       │
│  - Permanent TXT validation records                             │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                Azure Front Door Premium (Global)                │
│  - Origin Group 1 (Priority 1): App Services via Private Link   │
│  - Origin Group 2 (Priority 2): Application Gateway             │
│  - WAF Premium: OWASP 3.2, Bot Protection                       │
│  - 40 Custom Domains, Wildcard Certificates                     │
└───────────────────┬────────────────────┬────────────────────────┘
                    │                    │
        ┌───────────┴──────┐   ┌─────────┴──────────┐
        │                  │   │                    │
        ▼                  ▼   ▼                    ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐
│  East US 2   │  │  West US 2   │  │  Application Gateway     │
│  (Primary)   │  │  (Secondary) │  │  (East US 2)             │
│              │  │              │  │  - Auto-scale 0-10       │
│  App Service │  │  App Service │  │  - Standby failover      │
│  20 apps     │  │  20 apps     │  │  - WAF enabled           │
│  Zone-redun. │  │  Zone-redun. │  │  - 40 backend pools      │
│  Private EP  │  │  Private EP  │  └──────────┬───────────────┘
└──────┬───────┘  └──────┬───────┘             │
       │                 │                     │
       └─────────────────┴─────────────────────┘
                         │
                         ▼
             ┌───────────────────────┐
             │    Azure Key Vault    │
             │  - 3-5 Certificates   │
             │  - Automated Renewal  │
             └───────────────────────┘
Layer 1 - Global Edge (Azure Front Door Premium):
SKU: Premium_AzureFrontDoor
Endpoints: 1 global endpoint
Custom Domains: 40 domains
Certificates: 3-5 wildcard/SAN certs from Key Vault
Origin Group 1 (Primary - Direct to App Services):
Priority: 1
Origins: 40 App Services (20 in East US 2, 20 in West US 2)
Connection: Private Link
Health Probe: /health every 30 seconds
Load Balancing: Latency-based (route to nearest region)
Origin Group 2 (Secondary - Application Gateway):
Priority: 2
Origins: 1 Application Gateway (East US 2)
Health Probe: /health every 30 seconds
Activation: Only if Origin Group 1 unhealthy
WAF:
Mode: Prevention
Ruleset: Microsoft_DefaultRuleSet_2.1 + Microsoft_BotManagerRuleSet_1.0
Custom Rules: Rate limiting (1000 req/min/IP)
Caching:
Policy: Cache static assets (images, CSS, JS) for 24 hours
Query String: Include in cache key for API responses
Routing:
Rules: 40 routes (one per app/domain)
HTTPS Redirect: Enabled
HTTP/2: Enabled
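The priority-based failover between the two origin groups above can be modelled as "send traffic to the healthy origins with the lowest priority number". The sketch below illustrates that selection rule only. The `Origin` type, the function name, and the origin names are hypothetical, and none of this calls the Front Door API.

```python
from dataclasses import dataclass


@dataclass
class Origin:
    name: str
    priority: int  # lower number = preferred
    healthy: bool  # result of the most recent health probes


def select_origins(origins):
    """Return the healthy origins that share the lowest priority number."""
    healthy = [o for o in origins if o.healthy]
    if not healthy:
        return []
    best = min(o.priority for o in healthy)
    return [o for o in healthy if o.priority == best]


origins = [
    Origin("app-services-eastus2", priority=1, healthy=False),
    Origin("app-services-westus2", priority=1, healthy=False),
    Origin("application-gateway", priority=2, healthy=True),
]
print([o.name for o in select_origins(origins)])
```

With both priority-1 origins marked unhealthy, only the Application Gateway origin is selected. Once either App Service origin recovers, the selection moves back to priority 1.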
Layer 2 - Regional Compute (App Services):
Primary Region: East US 2
App Service Plans: 4 plans (10 apps each)
SKU: P1v3 (2 vCPU, 8 GB RAM)
Zone Redundancy: Enabled
Auto-Scale: Min 1, Max 5 instances per plan
Deployment Slots: 1 staging slot per app
Secondary Region: West US 2
Configuration: Identical to primary
Data Sync: Continuous (shared database/storage)
Security:
Public Access: Disabled
Private Endpoints: Enabled
VNet Integration: Enabled
Managed Identity: Enabled
Layer 3 - Standby Failover (Application Gateway):
Location: East US 2 (could be any region)
SKU: WAF_v2 (required because WAF is enabled on this gateway)
Auto-Scale: Min 0, Max 10 (saves cost during normal operation)
Configuration:
Listeners: 40 HTTPS listeners (one per app)
Backend Pools: 40 pools (pointing to App Services)
Health Probes: HTTPS /health every 30 seconds
SSL Certificates: From Key Vault (same as AFD)
WAF:
SKU: WAF_v2
Ruleset: OWASP 3.2 (matching AFD)
Activation:
Trigger: AFD detects Origin Group 1 unhealthy
Action: AFD automatically routes to Origin Group 2
Scale: Application Gateway auto-scales from 0 to needed capacity
Layer 4 - Certificate Management (Azure Key Vault):
Certificates:
1. wildcard-contoso-com (*.contoso.com)
2. wildcard-internal-contoso-com (*.internal.contoso.com)
3. san-special-apps (5-10 special case domains)
Auto-Renewal:
Azure Function: Checks weekly, renews at 30 days before expiry
Sync: Automatically updates AFD and App Gateway (via managed identity reference)
Access:
AFD Premium: Key Vault Secrets User (read certificates)
Application Gateway: Key Vault Secrets User (read certificates)
Azure Function: Key Vault Certificates Officer (renew certificates)
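The weekly renewal check reduces to a date comparison against the 30-day threshold. The sketch below shows that decision only. `needs_renewal` is a hypothetical helper, and reading the expiry date from Key Vault and triggering the renewal are left out.

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(expires_on, now=None, threshold_days=30):
    """Return True once the certificate is within `threshold_days` of expiry."""
    now = now or datetime.now(timezone.utc)
    return expires_on - now <= timedelta(days=threshold_days)


# Example dates are illustrative only.
expires = datetime(2026, 3, 1, tzinfo=timezone.utc)
print(needs_renewal(expires, now=datetime(2026, 1, 15, tzinfo=timezone.utc)))  # 45 days left
print(needs_renewal(expires, now=datetime(2026, 2, 5, tzinfo=timezone.utc)))   # 24 days left
```

Because the check runs weekly, a certificate is picked up at most seven days after it crosses the threshold. That still leaves more than three weeks before expiry.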
Layer 5 - Monitoring (Azure Monitor):
Application Insights: One per App Service (40 instances)
Log Analytics Workspace: Centralized logs from all resources
Availability Tests: 40 tests (one per app from 5 global locations)
Alerts:
Critical: AFD origin health, certificate expiry <7 days
Warning: Increased latency, CPU >80%
Info: Deployment events, scaling events
Availability:
Target SLA: 99.95% (composite SLA)
- AFD Premium SLA: 99.99%
- App Services (Zone-redundant) SLA: 99.95%
- Application Gateway SLA: 99.95%
Expected Downtime: ~4.4 hours/year maximum (0.05% of 8,760 hours)
Actual Expected: <1 hour/year (with proper failover)
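The availability figures above can be combined with standard composite-availability arithmetic. Components in series multiply, and independent redundant paths multiply their unavailabilities. The sketch below is a rough model: the function names are hypothetical, and treating the two regions as fully independent is an optimistic assumption.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def serial(*availabilities):
    """Availability of components that must all be up (multiply)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability of redundant, independent paths (any one suffices)."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1.0 - a)
    return 1.0 - unavailable


def downtime_hours_per_year(availability):
    return (1.0 - availability) * HOURS_PER_YEAR


afd, app_service = 0.9999, 0.9995  # SLA figures listed above

single_region = serial(afd, app_service)
two_regions = serial(afd, parallel(app_service, app_service))

for label, value in [("AFD + one region", single_region),
                     ("AFD + two regions", two_regions),
                     ("99.95% target", 0.9995)]:
    print(f"{label}: {value:.4%}, {downtime_hours_per_year(value):.2f} h/year")
```

In this model the edge tier dominates once two regions are active. That is why a second ingress path matters more for the overall figure than a third region would.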
Performance:
Latency (P95):
- Global users → AFD edge: 20-50ms
- AFD → App Service (via Private Link): 5-15ms
- Total E2E latency: 100-200ms (application dependent)
Throughput:
- AFD: Unlimited (global scale)
- Single App Service instance: ~2,000 req/sec
- Total capacity: 80,000+ req/sec (40 instances at maximum scale-out × ~2,000 req/sec)
Failover Characteristics:
AFD Origin Group Failover (Primary → Secondary):
- Detection time: 30-90 seconds (3 failed health probes @ 30s interval)
- Route update: <5 seconds (AFD edge updates)
- Total RTO: <2 minutes
Regional Failover (East US 2 → West US 2):
- Detection time: 30-90 seconds
- AFD routes to healthy region automatically
- Total RTO: <2 minutes
Certificate Failover:
- Validation delay: 0 seconds (pre-validated)
- HTTPS availability: Immediate
Scalability:
Auto-Scale Triggers:
- CPU > 70%: Scale out
- HTTP Queue Length > 100: Scale out
- CPU < 30% for 10 minutes: Scale in
Scale Limits:
- App Service: 1-5 instances per plan (8 plans = max 40 instances)
- Application Gateway: 0-10 instances
- Total concurrent users: 500,000+ (with caching)
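The triggers and limits above can be expressed as a single decision function, which is useful for testing the rules before encoding them in autoscale settings. This is a sketch under stated assumptions. The function and parameter names are hypothetical, it scales one instance at a time, and it lets scale-out win when both conditions apply.

```python
def scale_decision(cpu_percent, http_queue_length, minutes_below_30_cpu,
                   current_instances, min_instances=1, max_instances=5):
    """Return the target instance count for one App Service plan.

    Scale out: CPU > 70% or HTTP queue length > 100.
    Scale in:  CPU < 30% sustained for at least 10 minutes.
    """
    wants_out = cpu_percent > 70 or http_queue_length > 100
    wants_in = cpu_percent < 30 and minutes_below_30_cpu >= 10

    if wants_out:
        return min(current_instances + 1, max_instances)
    if wants_in:
        return max(current_instances - 1, min_instances)
    return current_instances


print(scale_decision(75, 0, 0, current_instances=1))    # CPU above 70%
print(scale_decision(20, 0, 10, current_instances=3))   # idle for 10 minutes
print(scale_decision(20, 150, 15, current_instances=2)) # long queue despite low CPU
```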
Cost (Monthly):
Normal Operation: $930-980
During Failover: $1,130-1,230
Per-App Cost: ~$23-25/app/month
Recovery Time Objective (RTO):
Scenario 1: AFD Complete Outage
- Failover to Application Gateway
- RTO: <2 minutes (automated)
Scenario 2: Single Region Failure (East US 2)
- Failover to West US 2
- RTO: <2 minutes (automated via AFD)
Scenario 3: Complete Azure Outage (hypothetical)
- Manual failover to Akamai (if implemented)
- RTO: 5-10 minutes (DNS TTL + manual activation)
Scenario 4: Certificate Expiration
- Automated renewal + sync
- RTO: 0 (no downtime)
Recovery Point Objective (RPO):
- Application data: 0 (active-active, shared database)
- Configuration: 0 (IaC in Git)
- Logs: <5 minutes (Log Analytics ingestion delay)
Backup Strategy:
Infrastructure:
- Bicep/Terraform code in Git (versioned)
- Daily automated deployment to staging (validation)
Certificates:
- Stored in Azure Key Vault (geo-redundant)
- Exported to secure storage monthly (offline backup)
Configuration:
- AFD/App Gateway configuration exported daily
- Stored in Azure Storage (GRS - geo-redundant)
Application Data:
- Database: Automated backups per Azure SQL SLA
- File storage: GRS or ZRS based on criticality
This architecture provides:
High Availability: 99.95%+ SLA with multi-region, multi-platform redundancy
Fast Failover: <2 minute RTO with automated health-based routing
Zero Certificate Delay: Pre-validated certificates across all platforms
Cost Optimized: approx $930-980/month ($24/app) with auto-scaling
Security Hardened: Zero-trust with Private Link, WAF, managed identities
Operationally Mature: Full monitoring, alerting, IaC, runbooks