Design systems that survive failures — from individual instance failures to entire region outages.
Availability and Recovery terms:
RTO (Recovery Time Objective): maximum acceptable time between an outage and service being restored
RPO (Recovery Point Objective): maximum acceptable data loss, measured as time since the last recoverable copy
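To make the two terms concrete, here is a minimal sketch that checks a periodic-backup setup against its objectives. It assumes the worst-case data loss equals one full backup interval and that recovery takes detection time plus restore time; the function name, parameters, and sample figures are hypothetical.

```python
from datetime import timedelta

def meets_objectives(backup_interval, detect_time, restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for a simple periodic-backup setup.

    Assumption: a failure just before the next backup loses one full
    interval of data, and recovery takes detection plus restore time.
    """
    worst_case_loss = backup_interval
    worst_case_recovery = detect_time + restore_time
    return worst_case_loss <= rpo, worst_case_recovery <= rto

# Illustrative numbers only: nightly backups, 4-hour RPO and RTO.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    detect_time=timedelta(minutes=15),
    restore_time=timedelta(hours=3),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=4),
)
print(result)
```

With these sample figures the RTO is met but the RPO is not, because a nightly backup can lose up to 24 hours of data.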
Availability targets (allowed downtime over a 365-day year):
- 99% = 87.6 hours downtime/year
- 99.9% = 8.76 hours downtime/year
- 99.99% = 52.6 minutes downtime/year
- 99.999% (five nines) = 5.26 minutes downtime/year
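The downtime figures above follow directly from the availability percentage. A short sketch of the arithmetic, assuming a 365-day year (8,760 hours); the function name is illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(availability_percent):
    """Allowed downtime in hours per year for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```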
Multi-AZ vs Multi-Region:
- Multi-AZ: protects against the failure of a single Availability Zone, i.e. one or more data centres (common practice)
- Multi-Region: protects against full region failure (for critical systems)
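Why redundancy helps can be shown with the standard formula for redundant deployments: if any one copy can serve traffic, the system is down only when all copies are down. The sketch below assumes failures are independent, which real AZs and regions only approximate, so treat the result as an upper bound. The function name and the 99% input are illustrative.

```python
def combined_availability(single, copies):
    """Availability of `copies` redundant deployments, any one of which can serve.

    Assumption: failures are independent, so the chance that all copies
    are down at once is (1 - single) ** copies.
    """
    return 1 - (1 - single) ** copies

# Illustrative: one deployment at 99% vs. the same deployment in two or three locations.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```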
DR Strategies (from cheap to expensive):
1. Backup & Restore: S3/Glacier backups, no running standby. High RTO (hours)
2. Pilot Light: data replicated and minimal core resources running in DR region; the rest is provisioned on failover. RTO: tens of minutes
3. Warm Standby: scaled-down but fully functional in DR region. RTO: minutes
4. Multi-Site Active-Active: both regions serve traffic. RTO: zero/near-zero
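Because the strategies are ordered from cheap to expensive, choosing one amounts to picking the first entry whose RTO fits the requirement. A minimal sketch of that selection follows; the RTO values in minutes are rough placeholders chosen for illustration, not AWS guarantees, and the function name is hypothetical.

```python
# Ordered from cheapest to most expensive, as in the list above.
# The RTO figures (minutes) are illustrative assumptions only.
DR_STRATEGIES = [
    ("Backup & Restore", 240),
    ("Pilot Light", 30),
    ("Warm Standby", 5),
    ("Multi-Site Active-Active", 0),
]

def cheapest_strategy(required_rto_minutes):
    """Return the cheapest strategy whose assumed RTO meets the requirement."""
    for name, rto_minutes in DR_STRATEGIES:
        if rto_minutes <= required_rto_minutes:
            return name
    # Unreachable for non-negative input, since active-active is modelled as 0.
    raise ValueError("RTO requirement must be non-negative")

print(cheapest_strategy(60))
```

In practice the choice also depends on RPO and budget, which this sketch leaves out.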
AWS services for HA:
- Route 53 health checks + DNS failover
- Aurora Global Database: cross-region replication, typically under 1 second of lag (RPO of about 1 second)
- S3 Cross-Region Replication
- DynamoDB Global Tables
- AWS Elastic Disaster Recovery (DRS, the successor to CloudEndure Disaster Recovery)
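As an illustration of the first item, Route 53 DNS failover is configured with a pair of records that share a name, one marked PRIMARY and tied to a health check, the other marked SECONDARY. The sketch below only builds the change batch as plain Python data; the domain, IP addresses, and health check ID are placeholders, the helper name is hypothetical, and the boto3 call is left as a comment because it needs real credentials and a hosted zone.

```python
def failover_record(name, role, ip, health_check_id=None):
    """Build one UPSERT change for a Route 53 failover A record.

    role is "PRIMARY" or "SECONDARY". All values passed in below are placeholders.
    """
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": role.lower(),
        "Failover": role,
        "TTL": 60,  # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

change_batch = {
    "Changes": [
        failover_record("app.example.com.", "PRIMARY", "192.0.2.10", "example-health-check-id"),
        failover_record("app.example.com.", "SECONDARY", "198.51.100.10"),
    ]
}

# Applying it would look roughly like this (not executed here):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZEXAMPLE", ChangeBatch=change_batch)
print(change_batch)
```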
Chaos Engineering:
- Deliberately inject failures (Netflix Chaos Monkey)
- AWS Fault Injection Service (FIS, formerly Fault Injection Simulator)
- GameDays: simulate real incidents
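The idea behind failure injection can be sketched locally: remove one instance at random and check whether the remaining fleet still meets the required capacity. This is a toy simulation, not the Chaos Monkey or FIS API; the instance IDs, function names, and capacity threshold are hypothetical.

```python
import random

def terminate_random_instance(instances, rng):
    """Remove and return one instance chosen at random (the injected failure)."""
    victim = rng.choice(sorted(instances))
    instances.remove(victim)
    return victim

def survives_single_failure(instances, required, rng):
    """True if the fleet still has `required` instances after losing one."""
    fleet = set(instances)  # copy, so the caller's set is untouched
    terminate_random_instance(fleet, rng)
    return len(fleet) >= required

fleet = {"i-aaa", "i-bbb", "i-ccc"}
rng = random.Random(0)  # seeded so the experiment is repeatable
print(survives_single_failure(fleet, required=2, rng=rng))
```

A real experiment would target running infrastructure and watch dashboards and alarms rather than a set in memory; the value of the sketch is the hypothesis-then-verify shape.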