Observability Blind Spot — AWS DevOps Engineer (DOP-C02)
You missed a monitoring or logging requirement. The exam tests whether you know what to observe, not just what to build.
Visibility Into Symptoms Isn't Root Cause
CloudWatch Metrics show the spike. CloudWatch Logs capture the error. The scenario asks for a solution that traces a latency regression across three microservices to the specific downstream call causing it. Candidates reach for CloudWatch because it's already in the scenario. AWS X-Ray is the answer because its service map, per-subsegment timing, and trace annotations are what "root cause across services" requires. The exam distinguishes between knowing something broke and knowing where and why.
The Scenario
A serverless application built on API Gateway, Lambda, and DynamoDB returns slow responses. CloudWatch shows Lambda duration averaging 5 seconds with no errors. You recommended CloudWatch alarms on the duration metric. The correct answer is enabling X-Ray tracing to identify which downstream call is slow. X-Ray breaks the request into stages: API Gateway routing (200ms), Lambda cold start (800ms), DynamoDB call (3,200ms), response serialization (800ms). The bottleneck is a DynamoDB scan operation that should be a query. CloudWatch metrics tell you Lambda is slow; X-Ray tells you why. The scenario asked you to "diagnose latency issues", and that requires request-level tracing, not service-level metrics.
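The arithmetic behind that trace is worth doing once. A minimal sketch in Python, using only the stage timings quoted in the scenario; the `stages` dictionary and the `find_bottleneck` helper are illustrative names, not part of any X-Ray API.

```python
# Stage timings in milliseconds, copied from the scenario above.
# A real X-Ray trace would expose these as segments and subsegments;
# this dictionary is a hand-written stand-in.
stages = {
    "API Gateway routing": 200,
    "Lambda cold start": 800,
    "DynamoDB call": 3200,
    "Response serialization": 800,
}


def find_bottleneck(timings):
    """Return (stage, ms, share_of_total) for the slowest stage."""
    total = sum(timings.values())
    stage = max(timings, key=timings.get)
    return stage, timings[stage], timings[stage] / total


total_ms = sum(stages.values())
name, ms, share = find_bottleneck(stages)
print(f"total: {total_ms} ms")
print(f"bottleneck: {name} ({ms} ms, {share:.0%} of the request)")
```

The stages add up to the 5 seconds CloudWatch reports, but only the per-stage breakdown shows that the DynamoDB call accounts for roughly two thirds of it. An alarm on the duration metric would fire on the total and say nothing about the split.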
How to Spot It
- CloudWatch Metrics shows aggregated health (error rates, duration, throttles). CloudWatch Logs shows individual event details (error messages, stack traces). X-Ray shows request flow across services (where time is spent, which service is the bottleneck). Match the diagnostic tool to the question: "Is it broken?" = Metrics. "What happened?" = Logs. "Where is the bottleneck?" = X-Ray.
- CloudWatch Container Insights for ECS/EKS provides CPU, memory, and network metrics per container. But if the scenario asks why inter-service calls are slow, Container Insights shows resource utilization, not request-level latency. You need X-Ray or Application Signals for request tracing across services.
- Distributed architectures (Lambda calling SQS calling another Lambda calling DynamoDB) create blind spots at every service boundary. Per-service metrics cannot show that the bottleneck is in the SQS consumer, not the producer. The exam tests whether you recognize when distributed tracing is required.
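To see what request-level tracing gives you that per-service metrics do not, here is a rough sketch that walks a trace and reports the slowest leaf call. It assumes segments shaped like X-Ray segment documents (`name`, `start_time`, `end_time`, nested `subsegments`); the sample trace, its service names, and its timings are invented for illustration, and `slowest_leaf` is a hypothetical helper.

```python
# Hypothetical trace: one segment per service, times in seconds
# relative to the start of the request (real segment documents
# carry epoch timestamps).
trace = [
    {
        "name": "producer-lambda",
        "start_time": 0.0,
        "end_time": 0.12,
        "subsegments": [
            {"name": "SQS", "start_time": 0.02, "end_time": 0.10},
        ],
    },
    {
        "name": "consumer-lambda",
        "start_time": 0.5,
        "end_time": 3.9,
        "subsegments": [
            {"name": "Initialization", "start_time": 0.5, "end_time": 0.9},
            {"name": "DynamoDB", "start_time": 1.0, "end_time": 3.8},
        ],
    },
]


def slowest_leaf(segment, path=()):
    """Return (path, seconds) for the slowest leaf under this segment."""
    path = path + (segment["name"],)
    children = segment.get("subsegments", [])
    if not children:
        return path, segment["end_time"] - segment["start_time"]
    return max((slowest_leaf(c, path) for c in children), key=lambda r: r[1])


# Compare the slowest leaf of every service and keep the worst one.
worst_path, worst_seconds = max(
    (slowest_leaf(seg) for seg in trace), key=lambda r: r[1]
)
print(" > ".join(worst_path), f"{worst_seconds:.2f}s")
```

In this made-up trace the producer looks healthy and the time is lost in the consumer's DynamoDB call, which is the kind of answer a per-service duration metric cannot give.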
Decision Rules
When a scenario requires automated multi-step remediation within a hard MTTR window and forbids custom code, the correct answer is an EventBridge rule that routes CloudWatch alarm state-change events to an SSM Automation runbook. A CloudWatch alarm plus SNS terminates at notification and leaves the multi-step remediation unexecuted.
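The rule itself is configuration, not code. As a sketch, the event pattern below selects CloudWatch alarm state-change events that enter the ALARM state. The alarm name `checkout-latency-high` is a made-up example, and `matches` is a deliberately simplified stand-in for EventBridge's matching logic, included only so the pattern can be checked locally.

```python
import json

# Event pattern for an EventBridge rule. The alarm name is hypothetical.
pattern = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["checkout-latency-high"],
        "state": {"value": ["ALARM"]},
    },
}


def matches(pattern, event):
    """Simplified matcher: exact-value lists and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Trimmed-down sample events, hand-written for this check.
alarm_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "checkout-latency-high",
        "state": {"value": "ALARM"},
        "previousState": {"value": "OK"},
    },
}
recovery_event = json.loads(json.dumps(alarm_event))  # deep copy
recovery_event["detail"]["state"]["value"] = "OK"

print(json.dumps(pattern, indent=2))
print(matches(pattern, alarm_event), matches(pattern, recovery_event))
```

The rule's target would be the SSM Automation runbook, with an IAM role that EventBridge can assume to start it. Filtering on the ALARM state keeps the runbook from running again when the alarm recovers.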