Near-Right Architecture — AWS Data Engineer (DEA-C01)
Two options were architecturally valid — you picked the one that violates a constraint buried in the scenario. Read constraints before evaluating answers.
Technically sound architecture, wrong governing constraint
The exam frames these scenarios around 'cost-effective' or 'operationally efficient,' not raw throughput. Candidates over-focus on recognizing the service pairing — Glue plus S3 plus Athena looks correct — and miss that the answer choices differ on catalog refresh strategy or partition pruning behavior. Those details determine whether the stated query latency constraint is actually satisfied. Read the constraint wording first, then trace it forward to the architectural decision it implies.
The Scenario
A company needs a real-time analytics dashboard querying petabytes of log data. The answer choices include Athena over S3 and Redshift Serverless. Both query structured data at scale. But the scenario says "sub-second response times for repeated queries" — Athena scans S3 on every query (seconds to minutes), while Redshift Serverless caches results and returns sub-second on repeats. The constraint is latency on repeated queries, not raw query capability. You picked Athena because it is pay-per-query and looks cheaper, but both options are serverless, and the access pattern eliminates Athena.
How to Spot It
- When both answers use real AWS services that address the primary use case, re-read for the performance constraint. "Sub-second," "real-time," "single-digit millisecond" each eliminate different services. Athena is not sub-second. DynamoDB is not for complex joins. Aurora is not for petabyte-scale analytics.
- Look for protocol-level constraints. If the scenario says TCP traffic with client IP preservation, that eliminates CloudFront (HTTP/HTTPS only) and points to Global Accelerator + NLB. If it says HTTP with caching, that eliminates Global Accelerator.
- If you find yourself thinking "both could work," the exam is testing constraint reading. Check for: latency target, protocol, data volume, ordering requirement, or compliance region restriction.
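The elimination logic above can be sketched as a lookup table. This is a study mnemonic, not an exhaustive rule set — the keyword-to-service mappings simply restate the bullets above:

```python
# Illustrative sketch: map the constraint wording in a scenario to the
# services it eliminates. Mappings mirror the bullets above; they are a
# study aid, not a complete exam rule set.
ELIMINATED_BY = {
    "sub-second repeated queries": {"Athena"},          # Athena rescans S3 per query
    "complex joins": {"DynamoDB"},                      # key-value access only
    "petabyte-scale analytics": {"Aurora"},             # OLTP, not warehouse scale
    "tcp with client ip preservation": {"CloudFront"},  # CloudFront is HTTP/HTTPS only
    "http with caching": {"Global Accelerator"},        # no caching layer
}

def surviving_candidates(candidates: set, constraint: str) -> set:
    """Drop any candidate service the stated constraint eliminates."""
    return candidates - ELIMINATED_BY.get(constraint, set())

print(surviving_candidates({"Athena", "Redshift Serverless"},
                           "sub-second repeated queries"))
```

Applied to the scenario above, only Redshift Serverless survives the "sub-second repeated queries" constraint.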
Decision Rules
Choose between a streaming delivery pipeline (Kinesis Data Firehose) that is fully managed on the delivery side but requires a self-managed producer for SaaS sources, and a managed SaaS-native connector (AppFlow) that handles both the pull and delivery within a single no-code flow — when the dominant constraint is minimizing total operational overhead for periodic, low-volume SaaS ingestion.
When event-driven file payloads and execution durations fit within Lambda's constraints (15 min, 10 GB), choose Lambda deployed via SAM over EMR to satisfy the concurrency-vs-cost and deploy-repeatability constraints without incurring cluster lifecycle overhead.
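The Lambda-vs-EMR rule above reduces to checking the workload against Lambda's two hard limits (15 minutes, 10 GB of memory). A minimal sketch of that pre-check — the estimates passed in are hypothetical:

```python
# Lambda's hard limits: 900 seconds of execution, 10,240 MB of memory.
LAMBDA_MAX_SECONDS = 900
LAMBDA_MAX_MEMORY_MB = 10_240

def fits_lambda(est_duration_s: float, est_memory_mb: float) -> bool:
    """True when an event-driven job fits inside Lambda's hard limits;
    otherwise a cluster-based option such as EMR stays on the table."""
    return (est_duration_s <= LAMBDA_MAX_SECONDS
            and est_memory_mb <= LAMBDA_MAX_MEMORY_MB)

print(fits_lambda(300, 2_048))    # True: small file transform fits
print(fits_lambda(3_600, 4_096))  # False: hour-long job exceeds the 15-min cap
```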
Select AWS Glue over Amazon EMR when the transformation logic is limited to format conversion and field denormalization at moderate data volume, and the team has neither cluster-management capacity nor Spark expertise — Glue's serverless billing and zero cluster lifecycle win decisively on cost-performance balance.
Whether to maintain Glue Data Catalog freshness for predictable Hive-style partitions via scheduled Glue Crawlers or via programmatic catalog updates (BatchCreatePartition API or Lake Formation governed tables), given that the partition scheme is fully known at write time.
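Because the partition scheme is fully known at write time, the catalog entry can be built and registered programmatically instead of re-crawled. A sketch of constructing one BatchCreatePartition entry for a Hive-style `dt=` layout — the bucket, prefix, database, and table names are placeholders, and the SerDe shown assumes JSON data:

```python
from datetime import date

def partition_input(bucket: str, table_prefix: str, day: date) -> dict:
    """Build one PartitionInput entry for a Hive-style dt= partition.
    Structure follows the Glue BatchCreatePartition API; all resource
    names here are hypothetical."""
    dt = day.isoformat()
    return {
        "Values": [dt],  # must align with the table's partition keys
        "StorageDescriptor": {
            "Location": f"s3://{bucket}/{table_prefix}/dt={dt}/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": ("org.apache.hadoop.hive.ql.io."
                             "HiveIgnoreKeyTextOutputFormat"),
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    }

entry = partition_input("example-logs", "app/events", date(2024, 5, 1))
print(entry["StorageDescriptor"]["Location"])
# With credentials configured, registration would be a single call:
# boto3.client("glue").batch_create_partition(
#     DatabaseName="logs_db", TableName="events",
#     PartitionInputList=[entry])
```

Registering partitions this way removes both the crawler schedule and the re-crawl latency from the freshness equation.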
Which S3 Glacier storage class satisfies both the cost-reduction goal and the four-hour retrieval SLA — specifically, whether Deep Archive's 12-hour minimum standard retrieval disqualifies it despite offering the lowest per-GB price.
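The Glacier decision hinges on worst-case retrieval time, not price. A sketch of that check, using the retrieval-time ranges AWS publishes for each class and tier (the SLA-check function itself is illustrative):

```python
# Worst-case retrieval times in hours, per published AWS figures:
# Deep Archive standard completes within 12 hours, bulk within 48;
# Glacier Flexible Retrieval expedited takes 1-5 minutes, standard 3-5 hours.
RETRIEVAL_MAX_HOURS = {
    ("DEEP_ARCHIVE", "standard"): 12,
    ("DEEP_ARCHIVE", "bulk"): 48,
    ("FLEXIBLE", "expedited"): 5 / 60,
    ("FLEXIBLE", "standard"): 5,
}

def meets_sla(storage_class: str, tier: str, sla_hours: float) -> bool:
    """A class/tier pair satisfies the SLA only if its worst-case
    retrieval time fits inside it."""
    return RETRIEVAL_MAX_HOURS[(storage_class, tier)] <= sla_hours

print(meets_sla("DEEP_ARCHIVE", "standard", 4))  # False: 12 h worst case
print(meets_sla("FLEXIBLE", "expedited", 4))     # True
```

Under a four-hour SLA, Deep Archive's lowest per-GB price is irrelevant: its standard retrieval worst case disqualifies it.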
When schema evolution requires simultaneous partition-key changes and nested-attribute additions under a zero-downtime constraint, Lake Formation governed tables satisfy the constraint via transactional ACID commits and automatic compaction, whereas Glue crawlers require manual classifier tuning and can expose partially-updated catalog states to concurrent queries.
Whether to rely on scheduled Glue crawlers (which handle both partition discovery and schema-change detection automatically but impose full re-crawl overhead on predictable layouts) or Lake Formation governed tables (which provide native schema evolution tracking and partition registration through transactions, removing the crawler scheduling, IAM configuration, and re-crawl duration burden).
Whether the pipeline's sensor-wait dependencies and calendar scheduling requirements are best satisfied by a code-native DAG orchestrator (MWAA) or a managed state machine service (Step Functions), given the constraint of zero workflow infrastructure management.
Whether AWS Glue DataBrew's declarative quality-rule framework or a custom AWS Glue PySpark validation script better satisfies a data-quality SLA under a minimal-operational-overhead constraint.
Whether to satisfy cross-engine column-level least-privilege through centralized Lake Formation LF-tag policies or through per-engine IAM managed policies, when the dominant constraint is minimizing ongoing permission drift and policy duplication across multiple analytics services.
Select the observability layer that matches the audit requirement: CloudTrail captures control-plane API events required for compliance, while CloudWatch Logs captures application-level runtime diagnostics — the two layers are not interchangeable, and Athena enables cost-efficient ad-hoc querying of CloudTrail logs without provisioned infrastructure.
Determine whether column-level access restrictions (Lake Formation) satisfy the HIPAA requirement for irreversible PII masking before cross-account sharing, or whether a data-transformation service (Glue DataBrew) paired with a customer-managed KMS key is required to satisfy both the masking and the encryption-at-rest BYOK constraints simultaneously.
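The distinction in the last rule is that a column-level grant hides a value while leaving the raw data intact underneath, whereas irreversible masking transforms the value before sharing. One common masking technique is a salted one-way hash — a sketch under assumed inputs (the salt, field names, and record are hypothetical, and real deployments would keep the salt in a secret store):

```python
import hashlib

SALT = b"example-salt"  # hypothetical; keep out of the shared dataset

def mask_pii(value: str) -> str:
    """One-way hash: the output cannot be reversed to the original,
    unlike a column-level access grant, which merely hides raw data
    that still exists underneath."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()

record = {"patient_id": "P-1001", "ssn": "123-45-6789", "visits": 4}
shared = {**record, "ssn": mask_pii(record["ssn"])}
print(shared["ssn"] != record["ssn"])  # True: raw SSN absent from the share
```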