RPO vs. RTO Confusion — AWS DevOps Engineer (DOP-C02)
You confused recovery point objective (data loss tolerance) with recovery time objective (downtime tolerance). Different requirements, different architectures.
The Recovery Target You're Actually Being Tested On
A scenario gives you an RTO of 15 minutes and an RPO of 1 hour. You spot Multi-AZ immediately — automatic failover, sub-minute cutover — and that instinct is correct for RTO. But Multi-AZ's synchronous replication addresses infrastructure failure, not data age: corruption or a logical error replicates instantly, so it is not by itself a recovery-point guarantee. The exam is asking whether you can distinguish the time-to-restore constraint from the data-loss-tolerance constraint. Choosing the strategy that satisfies one while ignoring the other is the designed failure.
The Scenario
A financial application requires RPO of 1 hour and RTO of 15 minutes. You design a Pilot Light strategy with Aurora read replicas in the DR region using asynchronous replication. Pilot Light infrastructure can spin up in 15 minutes (meets RTO). But asynchronous replication to a read replica can lag by several hours during peak loads — if the primary fails during a replication lag spike, you lose more than 1 hour of transactions (violates RPO). The correct answer is Warm Standby with Aurora Global Database, which replicates with typical lag under 1 second and provides a pre-scaled environment for fast failover. You satisfied RTO but forgot to verify RPO independently.
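You can check the RPO side of this design empirically: an Aurora Global Database secondary cluster reports the AuroraGlobalDBReplicationLag metric (in milliseconds) to CloudWatch. A minimal boto3 sketch, assuming a hypothetical secondary cluster named dr-secondary-cluster in us-west-2:

```python
import boto3
from datetime import datetime, timedelta, timezone

# Hypothetical identifiers; AuroraGlobalDBReplicationLag is reported in
# milliseconds by the secondary cluster of an Aurora Global Database.
cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="AuroraGlobalDBReplicationLag",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "dr-secondary-cluster"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Maximum"],
)

# A 1-hour RPO is only safe if the worst observed lag stays far below it.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Maximum"]:.0f} ms')
```

Running the same check against a plain cross-region read replica's ReplicaLag metric is what exposes the Pilot Light design: that lag is unbounded under load.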
How to Spot It
- RPO drives your replication strategy: RPO of 0 requires synchronous replication (Multi-AZ RDS, Aurora Multi-AZ). RPO of 1 hour can use asynchronous replication if the lag is bounded under 1 hour (Aurora Global Database typical lag < 1 second). RPO of 24 hours can use daily snapshots.
- RTO drives your failover infrastructure: RTO of minutes requires Warm Standby or Multi-Site Active-Active with pre-provisioned compute. RTO of hours allows Pilot Light (minimal infrastructure, scaled up on failover). RTO of days allows Backup and Restore from S3/snapshots.
- When a question gives both RPO and RTO, evaluate each independently against every answer option. An answer that meets RTO but fails RPO is wrong. The exam specifically designs options that satisfy one but not the other; the sketch after this list runs the two checks separately.
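To make the independent-evaluation habit concrete, here is a toy helper (thresholds are the rough tiers from the bullets above, not official AWS cutoffs) that scores each constraint on its own:

```python
# Illustrative only: evaluate RPO and RTO independently, the way the exam
# expects. An option must pass BOTH checks; passing one proves nothing.

def replication_for(rpo_minutes: float) -> str:
    """Map a data-loss tolerance to a replication tier."""
    if rpo_minutes == 0:
        return "synchronous replication (Multi-AZ RDS, Aurora Multi-AZ)"
    if rpo_minutes <= 60:
        return "asynchronous replication with bounded lag (Aurora Global Database)"
    return "periodic snapshots (Backup and Restore)"

def failover_for(rto_minutes: float) -> str:
    """Map a downtime tolerance to a failover-infrastructure tier."""
    if rto_minutes <= 60:
        return "Warm Standby or Multi-Site Active-Active"
    if rto_minutes <= 24 * 60:
        return "Pilot Light"
    return "Backup and Restore from S3/snapshots"

rpo, rto = 60, 15  # the scenario above: RPO 1 hour, RTO 15 minutes
print(f"RPO {rpo} min needs: {replication_for(rpo)}")
print(f"RTO {rto} min needs: {failover_for(rto)}")
```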
Decision Rules
When both zero-downtime and a sub-60-second rollback RTO are explicit requirements for an ECS workload, blue/green deployment (retaining the original task set for instant ALB listener re-route) must be chosen over rolling update (which requires launching new tasks to roll back, making rollback latency minutes rather than seconds).
Whether to use the ECS rolling update deployment type or CodeDeploy ECS blue/green — the 60-second rollback RTO is the deciding constraint because rolling update has no independent stable target group to atomically reroute to, while blue/green can flip the ALB listener rule back in seconds.
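The mechanism behind that speed: during a blue/green deployment the original (blue) task set keeps running behind its own target group, so rollback is one listener update rather than a task relaunch. A hedged sketch with placeholder ARNs; CodeDeploy performs this flip for you, the call just shows why it completes in seconds:

```python
import boto3

# Placeholder ARNs for illustration only.
BLUE_TG = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/blue/abc123"
LISTENER = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/prod/xyz789"

elbv2 = boto3.client("elbv2")

# Rollback = point the production listener back at the still-warm blue
# target group. No new tasks are launched, so latency is seconds.
elbv2.modify_listener(
    ListenerArn=LISTENER,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": BLUE_TG}],
)
```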
Whether a 90-second rollback RTO can be satisfied by CodeDeploy in-place rolling deployment on EC2 or whether blue/green deployment with ALB traffic rerouting to the original Auto Scaling group is required to meet both the zero-downtime and sub-90-second rollback constraints simultaneously.
Whether to use per-service CodeDeploy blue/green deployments with independent rollback groups — which atomically restore the prior known-good target group (satisfying both rollback initiation speed and clean rollback-destination state) — versus a shared rolling deployment that meets rollback initiation time but restores services to a partially-applied mixed state rather than a clean prior artifact version.
Whether to use per-service CodePipelines with CodeDeploy blue/green traffic shifting or a single shared CodePipeline with CodeDeploy rolling updates — determined by whether the rollback mechanism provides a deterministic sub-2-minute recovery time regardless of fleet size.
Whether to replicate artifacts using CodeArtifact cross-region replication plus ECR cross-region replication — preserving dependency-resolution graphs, package metadata, and image-scan results alongside binaries — or to consolidate onto S3 CRR, which replicates object bytes quickly enough to satisfy the RTO but cannot replicate the metadata layer required to satisfy the 5-minute RPO for full build reproducibility.
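The ECR half of that answer is a one-time, registry-level configuration. A sketch via boto3 (account ID and regions are placeholders):

```python
import boto3

ecr = boto3.client("ecr", region_name="us-east-1")

# Replicate every image pushed in us-east-1 to the DR region. Image bytes
# and tags replicate; note this does not cover CodeArtifact package
# metadata, which is the gap the S3 CRR option falls into as well.
ecr.put_replication_configuration(
    replicationConfiguration={
        "rules": [
            {"destinations": [{"region": "us-west-2", "registryId": "123456789012"}]}
        ]
    }
)
```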
Whether CloudFormation StackSets alone satisfies both a self-service provisioning RTO and a continuous drift-detection RPO, or whether two separate services — Service Catalog portfolios for RTO and AWS Config continuous evaluation for RPO — are required because StackSets drift detection is on-demand and provisioning is admin-gated.
Whether to enforce infrastructure parameter guardrails at authoring time (CDK constructs or CloudFormation modules, which harden defaults at synthesis or template-creation time) or at provisioning time (Service Catalog portfolios with template constraints, which block parameter overrides at the moment a developer initiates a stack deployment) — the scenario's 'no parameter override during provisioning' constraint disqualifies authoring-time approaches because a developer with template access can still override values at stack-create time.
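What the provisioning-time guardrail looks like: a Service Catalog template constraint whose CloudFormation rule restricts what a provisioning user can enter at launch. A sketch with hypothetical portfolio/product IDs, parameter name, and allowed values:

```python
import boto3
import json

sc = boto3.client("servicecatalog")

# The rule is evaluated when a developer provisions the product, so even a
# user who can see the template cannot supply an out-of-policy value.
rules = {
    "Rules": {
        "InstanceTypeRule": {
            "Assertions": [
                {
                    "Assert": {
                        "Fn::Contains": [["t3.micro", "t3.small"], {"Ref": "InstanceType"}]
                    },
                    "AssertDescription": "Instance type must be t3.micro or t3.small",
                }
            ]
        }
    }
}

sc.create_constraint(
    PortfolioId="port-abc123",
    ProductId="prod-abc123",
    Type="TEMPLATE",
    Parameters=json.dumps(rules),
)
```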
Whether detection latency (the 2-minute non-compliance visibility window, analogous to RPO) requires AWS Config change-triggered evaluation, independent of whether the chosen remediation path (analogous to RTO) can meet the 10-minute SLA.
Whether to retain EC2 Auto Scaling, which provides steady-state Multi-AZ HA but cannot deliver sub-60-second new-capacity provisioning during burst events, or migrate to Lambda fronted by API Gateway, which provides sub-second concurrent scaling that satisfies both the scale-out RTO and the zero-request-loss RPO in burst scenarios.
Whether to derive dashboard metrics through a continuous in-stream mechanism (CloudWatch Logs metric filter emitting a custom CloudWatch metric) or through a periodic batch query mechanism (CloudWatch Logs Insights scheduled query), given a hard sub-60-second dashboard freshness requirement and a lowest-operational-overhead constraint.
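The in-stream option in code: a metric filter emits the custom metric as log events arrive, so there is no query schedule to run or maintain. Log group, filter pattern, and metric names below are hypothetical:

```python
import boto3

logs = boto3.client("logs")

# Each matching log event increments the metric immediately, keeping a
# dashboard widget fresh within seconds rather than on a query interval.
logs.put_metric_filter(
    logGroupName="/app/orders",
    filterName="OrderFailures",
    filterPattern='{ $.level = "ERROR" && $.event = "order_failed" }',
    metricTransformations=[
        {
            "metricName": "OrderFailures",
            "metricNamespace": "App/Orders",
            "metricValue": "1",
            "defaultValue": 0,
        }
    ],
)
```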
Whether to use AWS Config continuous evaluation paired with a managed SSM Automation document via Config auto-remediation — satisfying both the detection-frequency constraint (no polling gap, RPO-equivalent) and the automated end-to-end remediation SLA (deterministic, code-free execution, RTO-equivalent) — versus a solution that captures compliance state reliably but routes remediation through a notification or custom-code path that cannot guarantee the time-bound automated fix.
Whether EventBridge's native retry policy plus a Lambda dead-letter queue satisfies the zero-event-loss delivery guarantee within the 90-second latency ceiling, or whether inserting SQS between EventBridge and Lambda is required to prevent event loss — recognizing that the SQS polling model trades deterministic push latency for buffered pull latency, potentially violating the 90-second processing window (RTO analog) while adding queue-management overhead the scenario does not justify.
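A sketch of the no-SQS answer, with placeholder names and ARNs: EventBridge's per-target retry policy keeps delivery push-based, and a dead-letter queue on the Lambda function catches anything that exhausts async retries, so nothing is silently dropped:

```python
import boto3

events = boto3.client("events")
awslambda = boto3.client("lambda")

# Bound retry age to the stated latency ceiling; delivery stays push-based.
events.put_targets(
    Rule="order-events",
    Targets=[
        {
            "Id": "process-order",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:process-order",
            "RetryPolicy": {
                "MaximumRetryAttempts": 5,
                "MaximumEventAgeInSeconds": 90,
            },
        }
    ],
)

# Events that still fail land in a DLQ instead of being lost.
awslambda.update_function_configuration(
    FunctionName="process-order",
    DeadLetterConfig={
        "TargetArn": "arn:aws:sqs:us-east-1:123456789012:process-order-dlq"
    },
)
```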
Choose between AWS Config auto-remediation invoking a managed SSM Automation document (declarative, no custom code, purpose-built MTTR path) versus routing Config compliance-change events through EventBridge to a custom Lambda function (imperative, adds cold-start latency, custom IAM surface, and bespoke failure-handling code), where the stated MTTR constraint plus the zero-custom-code requirement together disqualify the Lambda path.
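The declarative path in one call, assuming the managed restricted-ssh Config rule and the AWS-managed automation document (rule name, role ARN, and attempt counts are placeholders to adapt):

```python
import boto3

config = boto3.client("config")

# Change-triggered rule evaluation covers the detection window; automatic
# remediation through a managed SSM document covers the MTTR with no
# custom code, cold starts, or bespoke failure handling.
config.put_remediation_configurations(
    RemediationConfigurations=[
        {
            "ConfigRuleName": "restricted-ssh",
            "TargetType": "SSM_DOCUMENT",
            "TargetId": "AWS-DisablePublicAccessForSecurityGroup",
            "Automatic": True,
            "MaximumAutomaticAttempts": 3,
            "RetryAttemptSeconds": 60,
            "Parameters": {
                "GroupId": {"ResourceValue": {"Value": "RESOURCE_ID"}},
                "AutomationAssumeRole": {
                    "StaticValue": {
                        "Values": ["arn:aws:iam::123456789012:role/config-remediation"]
                    }
                },
            },
        }
    ]
)
```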
Match the rollback tool to the exact failure layer — application deployment (CodeDeploy) versus infrastructure stack (CloudFormation) — given a hard five-minute RTO that infrastructure-layer rollback cannot satisfy for running ECS task sets.
Whether the stated RPO of 30 seconds can be met by RDS cross-region read replica replication lag or requires Aurora Global Database's physical storage replication, which provides RPO under one second regardless of write throughput.
Given explicit RPO (15 min) and RTO (30 min) constraints against a large Aurora dataset with a cost-efficiency preference, determine which DR tier — backup-restore or continuous replication with automated promotion — satisfies both targets simultaneously, and recognize that snapshot recency addresses RPO but does not bound restore duration for the RTO.