Operational Complexity Underestimation — AWS DevOps Engineer (DOP-C02)
The trap answer is functionally correct but operationally expensive. The exam prefers managed services over self-managed when both meet the functional requirements.
More Features, More Failure Surface
The distractor introduces a flexible, multi-component design — Lambda triggers feeding an SQS queue driving an ECS task, say — and nothing in it is wrong. What the scenario quietly established is a small team with no dedicated platform engineering. Professional-level questions embed the operational overhead constraint in the context, not the question stem. Count the coordination points. If the architecture requires more operational attention than the scenario's team can provide, it fails the real test.
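The "count the coordination points" heuristic can be sketched as a toy scoring model. The component lists and the team-capacity budget below are illustrative assumptions, not numbers from any exam scenario:

```python
# Toy model of the "count the coordination points" heuristic.
# Component lists and the team's capacity budget are illustrative
# assumptions, not exam-supplied figures.

def coordination_points(components: list[str]) -> int:
    """Each hand-off between adjacent components is one coordination
    point the team must monitor, retry, and alarm on."""
    return max(len(components) - 1, 0)

def fits_team(components: list[str], max_points: int) -> bool:
    """Does this architecture's operational attention fit the team?"""
    return coordination_points(components) <= max_points

distractor = ["Lambda trigger", "SQS queue", "ECS task"]
managed    = ["ECS Fargate service"]

small_team_budget = 1  # hypothetical capacity: one hand-off, tops

print(coordination_points(distractor))           # 2 hand-offs to operate
print(fits_team(distractor, small_team_budget))  # False
print(fits_team(managed, small_team_budget))     # True
```

The point is not the arithmetic but the habit: tally the hand-offs the distractor introduces before judging whether its flexibility is worth anything to this team.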
The Scenario
A team of 3 developers needs to run a containerized application with auto-scaling. You recommend Kubernetes on EC2 with kops for cluster management. The correct answer is ECS on Fargate. The scenario said "small team" and "minimize operational burden." Self-managed Kubernetes requires managing the control plane (etcd backups, API server upgrades, certificate rotation), node group updates, CNI plugin configuration, and ingress controller maintenance. ECS on Fargate eliminates all of that — AWS manages compute, scaling, and patching. The trade-off is less customization, but the scenario never asked for Kubernetes-specific features like custom operators or CRDs.
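The scenario's comparison can be restated as a responsibility tally. The task list comes from the paragraph above; the ownership mapping is an illustrative simplification, not an official AWS shared-responsibility matrix:

```python
# Operational-responsibility tally for the scenario above.
# The task list is taken from the scenario text; the "team" vs "AWS"
# ownership split is an illustrative simplification.

TASKS = [
    "etcd backups",
    "API server upgrades",
    "certificate rotation",
    "node group updates",
    "CNI plugin configuration",
    "ingress controller maintenance",
]

# Under self-managed Kubernetes (kops on EC2) every task falls on the
# team; under ECS on Fargate, AWS absorbs all of them.
owner = {
    "kops on EC2":    {task: "team" for task in TASKS},
    "ECS on Fargate": {task: "AWS" for task in TASKS},
}

def team_burden(option: str) -> int:
    """Count the tasks the team itself must own for a given option."""
    return sum(1 for who in owner[option].values() if who == "team")

print(team_burden("kops on EC2"))     # 6 team-owned tasks
print(team_burden("ECS on Fargate"))  # 0 team-owned tasks
```

Six standing responsibilities versus zero is the gap a three-developer team cannot absorb, which is exactly what "minimize operational burden" is pointing at.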
How to Spot It
- "Minimize operational overhead," "small team," "reduce management burden" — these phrases are signals to choose the most managed option. ECS Fargate over EKS self-managed nodes. Aurora over self-managed PostgreSQL on EC2. Lambda over always-on containers for event-driven workloads.
- EKS managed node groups reduce operational burden compared to self-managed nodes, but you still manage node AMI updates, pod scaling, and cluster upgrades. EKS with Fargate eliminates node management entirely but loses DaemonSet support and some storage options. The exam tests these operational trade-offs at each level.
- Self-managed options (EC2, EKS self-managed, self-hosted databases, self-managed Kafka) are only correct when the scenario explicitly requires a capability that managed services cannot provide — custom kernel modules, specific OS versions, or unsupported database engines.
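The signal phrases above lend themselves to a simple scan. This is a minimal sketch; the phrase list mirrors the bullets, and the sample scenario text is invented for illustration:

```python
# Minimal keyword scan for the managed-service signal phrases listed
# above. The phrase list mirrors the bullets; the sample scenario
# string is an invented illustration.

SIGNALS = [
    "minimize operational overhead",
    "small team",
    "reduce management burden",
]

def prefers_managed(scenario: str) -> bool:
    """True when the scenario text carries a managed-service signal."""
    text = scenario.lower()
    return any(phrase in text for phrase in SIGNALS)

scenario = (
    "A small team of 3 developers needs container auto-scaling "
    "and wants to minimize operational overhead."
)
print(prefers_managed(scenario))  # True
```

When the scan fires, default to the most managed option on the answer sheet and make the self-managed choice justify itself with an explicit capability requirement.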
Decision Rules
- When a compliance constraint requires that member accounts cannot perform a specific action against a centralized resource, an SCP applied at the Organizations root or OU provides preventive, zero-per-account-overhead enforcement that is automatically inherited by every newly vended account, whereas per-account Config rules with Lambda auto-remediation introduce compounding deployment burden, Lambda failure modes, and only detective (post-violation) coverage.
- Whether native CodeBuild batch builds (or parallel CodeBuild actions in one pipeline stage) with CodeBuild test reports satisfy the parallel-execution, fail-fast, and dashboard requirements at lower operational cost than a custom Step Functions + Lambda orchestration layer.
- Whether to route each artifact type to its purpose-built repository service (CodeArtifact for Maven, ECR for Docker images) or consolidate both types into a single general-purpose store and layer custom tooling for dependency resolution, metadata indexing, and vulnerability scanning on top.
- Whether native CodeBuild test actions inside a CodePipeline stage satisfy the integration-test gating requirement through built-in exit-code-driven stage failure, or whether external Step Functions orchestration is necessary to meet the stated constraint.
- Whether to publish approved infrastructure patterns as Service Catalog portfolios with launch constraints (governed self-service with scoped IAM, built-in drift tracking) or share CloudFormation modules via a private registry (composition-time reusability that still requires broad IAM for each team and leaves drift detection as an independently wired operational burden per account).
- Whether native SSM Automation runbooks propagated via State Manager Organizations-level associations satisfy the compliance enforcement requirement with lower total operational cost than a custom Step Functions + Lambda orchestration deployed into each member account.
- Whether to enforce patch compliance at scale using native SSM Automation runbooks propagated as State Manager associations via Organizations — requiring no per-account code deployment — or to build a Step Functions + Lambda orchestration that can achieve the same remediation but incurs per-account deployment, IAM role provisioning, and code maintenance overhead the two-person team cannot sustainably absorb.
- Whether to implement real-time log-to-metric conversion via CloudWatch Logs metric filters feeding CloudWatch alarms, or route logs through a batch analytics pipeline (S3 export → Athena queries → SNS) to satisfy a strict sub-minute detection SLA.
- Whether to close the alert-to-remediation gap using a native managed orchestration path (CloudWatch composite alarm → EventBridge → SSM Automation runbook) or to accept an architecturally simpler alerting-only path (CloudWatch alarms → SNS) that delegates remediation to manual operator response and violates the stated SLA.
- Whether native CloudWatch Logs metric filters feeding CloudWatch alarms satisfy the sub-minute detection SLA at minimum operational cost, or whether an Athena-over-S3 batch query pipeline — which appears more analytically flexible — can meet the same constraint despite adding export scheduling, partition management, and query-orchestration burden.
- Whether the stated RPO target (sub-minute data loss tolerance across Regions) and the team's operational capacity justify a managed global replication service over a custom-scripted cross-Region promotion architecture that appears equivalent but shifts hidden coordination burden onto the team.
- Whether the chosen DR tier can simultaneously satisfy both the RPO ceiling and the RTO ceiling for a large dataset, with cost-efficiency ruling out full warm-standby parity and disqualifying backup-restore because restore duration for multi-TB snapshots far exceeds the 30-minute RTO regardless of backup frequency.
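The last rule's backup-restore disqualification is just arithmetic. A back-of-the-envelope sketch, where the sustained restore throughput is a hypothetical assumption rather than a published AWS figure:

```python
# Back-of-the-envelope check of why backup-restore misses a 30-minute
# RTO for a multi-TB dataset. The 500 MB/s sustained restore rate is a
# hypothetical assumption, not a published AWS number.

def restore_minutes(dataset_tb: float, throughput_mb_per_s: float) -> float:
    """Minutes to restore a dataset at a given sustained throughput."""
    dataset_mb = dataset_tb * 1024 * 1024  # TB -> MB
    return dataset_mb / throughput_mb_per_s / 60

rto_minutes = 30
t = restore_minutes(dataset_tb=5, throughput_mb_per_s=500)

print(round(t))          # 175 minutes for 5 TB
print(t <= rto_minutes)  # False: backup-restore cannot meet the RTO
```

No backup schedule fixes this: frequency improves RPO, but restore duration is a function of dataset size and throughput, so the RTO miss stands regardless of how recent the snapshot is.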