AWS DEA-C01 Trap Reference

Commonly Confused Services on DEA-C01

On the Data Engineer Associate exam, confusions cluster around the AWS data pipeline stack — services that all handle data movement, transformation, or storage, but that differ by query pattern, processing model, or operational overhead.

Each section below gives you the deciding signal, a quick check to run when you encounter the confusion, and why the wrong answer keeps looking right.

#1: Amazon Kinesis Data Streams vs. Amazon Kinesis Data Firehose vs. Amazon MSK

Custom real-time consumers vs. managed delivery vs. Apache Kafka

All three handle streaming data ingestion, so candidates pick based on whichever streaming service they remember first.

Deciding signal

Kinesis Data Streams is AWS-native real-time streaming with configurable retention and multiple consumers. Your application reads records and controls processing. Kinesis Data Firehose automatically delivers data to a destination (S3, Redshift, OpenSearch) without you building a consumer — it optionally transforms via Lambda. Amazon MSK (Managed Streaming for Apache Kafka) is a fully managed Apache Kafka service. It is the right answer when the scenario specifies Kafka compatibility, existing Kafka producer code, or the need for Kafka-specific features (topics, consumer groups, Kafka Streams). When an existing Kafka ecosystem is in scope, MSK. When the scenario requires automatic delivery to S3 or Redshift without consumer code, Firehose. When it requires custom consumer logic and replay in AWS-native streaming, Data Streams.
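As a memory aid, this deciding signal can be sketched as a toy decision function. The function name and flags are my own, illustrative only — this is not an AWS API:

```python
def pick_streaming_service(kafka_compat: bool, custom_consumers: bool) -> str:
    """Toy decision helper for trap #1 (illustrative only, not an AWS API).

    kafka_compat: the scenario mentions Kafka producers, topics, consumer
                  groups, or Kafka Streams.
    custom_consumers: the scenario needs your own consumer logic, configurable
                      retention, or record replay.
    """
    if kafka_compat:
        return "Amazon MSK"            # existing Kafka ecosystem in scope
    if custom_consumers:
        return "Kinesis Data Streams"  # custom consumers, retention, replay
    return "Kinesis Data Firehose"     # hands-off delivery to S3/Redshift/OpenSearch
```

With neither signal present, automatic delivery wins: `pick_streaming_service(False, False)` returns `"Kinesis Data Firehose"`.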

Quick check

Is this automatic delivery to a destination without consumer code (Firehose), custom AWS-native streaming with replay (Data Streams), or Kafka-compatible streaming (MSK)?

Why it looks right

Kinesis Data Streams and Firehose are both Kinesis services, so candidates conflate them. MSK is the correct answer when Apache Kafka is explicitly mentioned — a signal that candidates sometimes overlook.

#2: AWS Glue vs. Amazon EMR vs. Amazon Athena

Serverless ETL vs. managed Spark/Hadoop cluster vs. serverless SQL on S3

All three appear in data pipeline questions, and both Glue and EMR run Spark, so candidates blur them.

Deciding signal

Athena is a serverless query engine — it runs SQL directly on S3 data without loading it into a database. No infrastructure to manage, and you pay per query. It is the right answer when the scenario involves interactive or scheduled SQL queries on S3 without data transformation. Glue is serverless ETL: you write PySpark or Python scripts, and Glue manages the compute. It catalogs data via the Glue Data Catalog and is optimal for transformation jobs that load data into S3 or a data warehouse. EMR provisions a managed cluster (EC2 or EKS) running Hadoop, Spark, HBase, or Presto. It is the right answer when the workload requires full cluster control, custom frameworks not supported in Glue, or cost optimization via Reserved Instances on large persistent clusters.

Quick check

Is this SQL queries directly on S3 with no ETL (Athena), serverless managed ETL jobs with Spark (Glue), or a managed Hadoop/Spark cluster with full configuration control (EMR)?

Why it looks right

Glue and EMR both run Spark, so candidates treat them as equivalent. EMR is correct when the scenario requires custom cluster configuration, non-Glue frameworks, or persistent cluster workloads.

#3: Amazon Redshift vs. Amazon Athena vs. Amazon DynamoDB

Petabyte-scale data warehouse vs. ad-hoc S3 query vs. high-velocity NoSQL

All three handle large datasets, so candidates apply DynamoDB or Redshift as defaults without checking the query pattern.

Deciding signal

DynamoDB is a key-value and document NoSQL store optimized for high-throughput transactional access — single-digit millisecond reads and writes at any scale. It is not a query engine for complex analytics. Athena runs SQL on S3 data without loading it into a database — suited for ad-hoc exploration, infrequent queries, and log analysis where S3 is already the data store. Redshift is a columnar data warehouse for complex analytical queries over large structured datasets. It loads data into columnar storage, supports joins across billions of rows, and integrates with BI tools. When the scenario describes transactional key-based access with sub-millisecond latency, DynamoDB. When it describes complex SQL analytics on structured data loaded into a warehouse, Redshift. When it describes SQL queries on existing S3 data without loading, Athena.
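The access-pattern difference is easy to see in miniature. A sketch in plain Python with hypothetical data — a keyed point read (the DynamoDB shape) versus a scan-and-aggregate (the Redshift/Athena shape):

```python
# Key-value access (DynamoDB-style): fetch one item by its key.
# Single-digit-millisecond point reads at any scale are the sweet spot.
users = {"u1": {"name": "Ada", "plan": "pro"},
         "u2": {"name": "Lin", "plan": "free"}}
item = users["u1"]  # one item, by key, no scan

# Analytical query (Redshift/Athena-style): scan every row and aggregate.
# This is the SELECT region, SUM(amount) GROUP BY region shape.
orders = [{"region": "eu", "amount": 40},
          {"region": "us", "amount": 25},
          {"region": "eu", "amount": 10}]
totals: dict[str, int] = {}
for o in orders:
    totals[o["region"]] = totals.get(o["region"], 0) + o["amount"]
```

A scenario that looks like the first access pattern points at DynamoDB; one that looks like the second points at a SQL engine (Redshift if the data is loaded, Athena if it stays in S3).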

Quick check

Is this high-speed key-value transactional access (DynamoDB), SQL analytics on loaded structured data in a warehouse (Redshift), or SQL queries directly on S3 without loading (Athena)?

Why it looks right

Redshift and Athena both run SQL, so candidates conflate them. Athena is correct when data stays in S3; Redshift is correct when data is loaded into a warehouse for complex repeated analytics.

#4: AWS Lake Formation vs. AWS Glue Data Catalog

Governed data lake with fine-grained access vs. metadata catalog for discovery

Lake Formation uses the Glue Data Catalog internally, so candidates treat them as the same thing.

Deciding signal

The Glue Data Catalog is a central metadata store — it holds database and table definitions, schemas, and partitions discoverable by Athena, EMR, and Glue ETL jobs. It does not enforce access control beyond IAM. Lake Formation builds a governance layer on top of the Glue Data Catalog: it adds table-level, column-level, and row-level access control, data location registration, and an audit trail. Lake Formation permissions are evaluated in addition to IAM and are required when the scenario involves restricting which users can see specific columns or rows in the data lake. When the scenario involves cataloging and discovering metadata, Glue Data Catalog. When it involves fine-grained access control over who sees which data lake tables or columns, Lake Formation.
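To make "column-level access control" concrete, here is a toy model in plain Python of what Lake Formation enforces for you — this is not the Lake Formation API, and the data is hypothetical:

```python
def filter_columns(rows, granted_columns):
    """Toy model of column-level permissions: a principal sees only the
    columns granted to them. Lake Formation enforces this natively across
    Athena, Redshift Spectrum, and Glue; this sketch just shows the effect."""
    return [{col: row[col] for col in row if col in granted_columns}
            for row in rows]

patients = [{"id": 1, "name": "Ada", "diagnosis": "flu"},
            {"id": 2, "name": "Lin", "diagnosis": "ok"}]

# An analyst granted only the non-sensitive columns never sees 'diagnosis':
visible = filter_columns(patients, {"id", "name"})
```

The Glue Data Catalog alone cannot express this grant — it stores the table's schema, while Lake Formation decides which parts of it each principal may read.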

Quick check

Is this cataloging data schemas for discovery (Glue Data Catalog), or enforcing fine-grained column- or row-level access control over data lake tables (Lake Formation)?

Why it looks right

The Glue Data Catalog is the underlying metadata store for Lake Formation, so candidates reach for it in access-control scenarios. Lake Formation is the specific answer when the requirement involves restricting data access at the table, column, or row level.

#5: AWS Glue DataBrew vs. AWS Glue Studio vs. AWS Glue ETL Scripts

Visual data prep for analysts vs. visual ETL workflow builder vs. code-based ETL

All three are Glue products, so candidates treat them as equivalent data processing options.

Deciding signal

Glue ETL Scripts are Python or PySpark jobs you write directly — full control, maximum flexibility, required when the transformation logic is complex or custom. Glue Studio is a visual drag-and-drop interface for building ETL jobs that generates PySpark code — it reduces scripting for straightforward ETL patterns while still running on Glue infrastructure. DataBrew is a visual data preparation tool for data analysts and scientists — it provides 250+ built-in transforms for cleaning, normalizing, and enriching data without writing any code, oriented toward data exploration rather than production ETL pipeline construction.

Quick check

Is this a production ETL pipeline with custom logic (ETL Scripts), a visually constructed ETL job for engineers (Glue Studio), or no-code data cleaning and profiling for analysts (DataBrew)?

Why it looks right

DataBrew and Glue Studio both have visual interfaces, so candidates conflate them. DataBrew is for analysis and data quality exploration; Glue Studio is for building production ETL pipelines without scripting.

#6: AWS Database Migration Service vs. AWS Glue vs. AWS DataSync

Live database migration vs. ETL transformation vs. bulk file/storage transfer

All three move data between systems, so candidates apply Glue as the default data movement tool.

Deciding signal

DMS migrates relational databases — it performs a full load and then continuously captures changes (CDC) from the source database to the target with minimal downtime. It supports homogeneous (MySQL to MySQL) and heterogeneous (Oracle to Aurora) migrations. Glue is an ETL service for transforming and loading data between storage systems — it is not designed for live database replication with CDC. DataSync transfers large volumes of data between on-premises storage (NFS, SMB, HDFS) and AWS storage (S3, EFS, FSx) or between AWS storage services — it is optimized for bulk file and object transfers, not database rows. When the scenario involves migrating a live database with change capture, DMS. When it involves transforming and loading data in a batch ETL pattern, Glue. When it involves bulk transfer of files or objects between storage systems, DataSync.
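The full-load-then-CDC pattern can be mimicked in a few lines. This is a toy model of the idea, not DMS itself — snapshot the source, then replay an ordered stream of change events against the target:

```python
def migrate_with_cdc(source_rows, changes):
    """Toy model of DMS-style migration (illustrative only).

    source_rows: {pk: row} snapshot of the source table.
    changes: ordered (op, pk, row) change events captured after the snapshot.
    """
    target = dict(source_rows)          # 1) full load: copy the snapshot
    for op, pk, row in changes:         # 2) CDC: replay changes in order
        if op in ("insert", "update"):
            target[pk] = row
        elif op == "delete":
            target.pop(pk, None)
    return target

source = {1: {"name": "Ada"}, 2: {"name": "Lin"}}
cdc = [("update", 1, {"name": "Ada L."}),
       ("delete", 2, None),
       ("insert", 3, {"name": "Mo"})]
result = migrate_with_cdc(source, cdc)
```

Because changes keep flowing while the full load runs, the source database stays live — that continuous capture is exactly what a batch ETL job in Glue does not give you.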

Quick check

Is this migrating a live relational database with CDC (DMS), transforming and loading data between storage systems in an ETL pattern (Glue), or bulk-transferring files between on-premises and AWS storage (DataSync)?

Why it looks right

Glue is the familiar data movement service, so candidates apply it to database migration scenarios. DMS is specifically for live database replication and migration — a distinct use case that Glue does not support.

#7: Amazon Redshift Spectrum vs. Amazon Athena

Querying S3 from within Redshift vs. standalone serverless SQL on S3

Both query data in S3 using SQL, so candidates treat them as identical options.

Deciding signal

Athena is a standalone serverless query engine — it queries S3 directly with no Redshift dependency. It is the right answer when the scenario involves ad-hoc SQL queries on S3 data without any existing Redshift warehouse. Redshift Spectrum extends a Redshift cluster to query external S3 data alongside Redshift tables in the same SQL query. It requires an existing Redshift cluster. The key use case is joining warehouse data stored in Redshift with historical or raw data stored in S3 in a single query. When there is no Redshift and the goal is SQL on S3, Athena. When Redshift already exists and the goal is querying S3 data alongside warehouse data, Spectrum.

Quick check

Is there an existing Redshift cluster that needs to join warehouse data with S3 data in one query (Spectrum), or is this a standalone SQL query on S3 with no Redshift dependency (Athena)?

Why it looks right

Spectrum and Athena both query S3 with SQL, so candidates use them interchangeably. Spectrum requires a Redshift cluster; Athena is serverless and standalone. The presence or absence of Redshift is the deciding factor.

#8: Amazon QuickSight vs. Amazon Athena

BI visualization and dashboards vs. SQL query engine

Both interact with data and both have "query" capabilities, so candidates treat them as visualization alternatives.

Deciding signal

Athena is a query engine — it executes SQL and returns results as data. It does not render charts or dashboards. QuickSight is a BI service for creating interactive visualizations, dashboards, and reports that can be shared with business users. It connects to Athena (and Redshift, S3, RDS, and others) as a data source. The relationship is not either/or — they often work together: Athena queries the data; QuickSight visualizes it. When the scenario describes building dashboards or charts for business stakeholders, QuickSight. When it describes running SQL transformations or analysis on data, Athena.

Quick check

Is the goal to execute SQL queries and retrieve data results (Athena), or to build interactive charts, dashboards, and reports for business users (QuickSight)?

Why it looks right

Both are described as "analytics" services, so candidates confuse query capability with visualization. QuickSight renders visuals; Athena retrieves data. They are complementary, not competing.

#9: Amazon EventBridge vs. Amazon SQS vs. Amazon SNS

Rule-based event routing vs. durable queue vs. push fanout

All three trigger downstream processing in data pipelines, so candidates pick based on "events" language alone.

Deciding signal

SQS is a durable pull-based queue — messages are retained until consumed, enabling decoupled, worker-paced processing with guaranteed delivery. It suits pipeline stages where the consumer controls processing speed. SNS pushes messages simultaneously to all subscribers (Lambda, SQS, HTTP) — it is for fanout to multiple systems when every subscriber must receive every event. EventBridge routes events based on content-matching rules — it filters events by source, detail type, or specific fields and delivers them to different targets based on those rules. In data pipelines, EventBridge is the right answer when different types of data ingestion events should trigger different processing stages. SQS is the right answer when a processing stage must buffer and pace consumption independently.
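EventBridge's rule matching boils down to "does this JSON event satisfy this JSON pattern". A simplified matcher — my sketch of the semantics, covering only the exact-value-list case, not prefix or numeric filters — looks like:

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified EventBridge-style pattern matching (illustrative sketch):
    nested dicts descend into the event; a leaf list means 'the event's
    value must be one of these exact values'."""
    for key, expected in pattern.items():
        value = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(value, dict) or not matches(expected, value):
                return False
        else:  # leaf: list of allowed exact values
            if value not in expected:
                return False
    return True

# Hypothetical rule: only S3 events for the 'raw-ingest' bucket.
rule = {"source": ["aws.s3"],
        "detail": {"bucket": {"name": ["raw-ingest"]}}}
event = {"source": "aws.s3",
         "detail": {"bucket": {"name": "raw-ingest"}, "key": "x.csv"}}
```

A matching event is routed to that rule's targets; a Glue event, or an S3 event for a different bucket, simply does not match. Neither SQS nor SNS filters on event content this way out of the box.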

Quick check

Does each pipeline stage need to consume independently at its own pace (SQS), do multiple downstream systems need every event simultaneously (SNS), or do different event types need to route to different targets (EventBridge)?

Why it looks right

EventBridge is the modern event-routing service, so candidates apply it to all pipeline trigger scenarios. SQS is correct when pipeline stages need buffering and decoupled consumption — which EventBridge alone does not provide.

#10: Amazon Managed Service for Apache Flink vs. AWS Glue Streaming ETL

Continuous SQL/Flink analytics on streams vs. continuous Spark ETL on streams

Both process streaming data continuously, so candidates treat them as interchangeable streaming ETL options.

Deciding signal

Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) runs Apache Flink applications for real-time stream processing — it supports complex event processing, windowed aggregations, and stateful streaming analytics in Flink or SQL. It is optimized for continuous real-time analytics with low latency. AWS Glue Streaming ETL runs Spark Structured Streaming jobs for continuous ETL — reading from Kinesis or Kafka and writing transformed data to S3 or other destinations. Glue Streaming is better suited for continuous ETL workloads where Spark ecosystem compatibility matters; Flink is better suited for real-time analytics, stateful computations, and complex event-driven patterns.
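"Windowed aggregation" — the kind of stateful computation Flink excels at — can be illustrated in miniature. A plain-Python sketch with hypothetical timestamped events, not Flink code:

```python
def tumbling_window_counts(events, window_seconds):
    """Toy tumbling-window aggregation: bucket events by window start time
    and count per window. Flink maintains this state continuously over an
    unbounded stream; this sketch just shows the windowing arithmetic."""
    counts: dict[int, int] = {}
    for ts, _payload in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts

# Four events, 5-second tumbling windows: [0,5), [5,10), [10,15)
events = [(0, "a"), (3, "b"), (7, "c"), (11, "d")]
per_window = tumbling_window_counts(events, 5)
```

When a scenario asks for this kind of rolling, stateful computation on a live stream, that points at Managed Flink; a scenario that just transforms records and lands them in S3 points at Glue Streaming.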

Quick check

Is this real-time stateful streaming analytics or complex event processing in Flink (Managed Flink), or continuous Spark-based ETL transforming and loading stream data (Glue Streaming)?

Why it looks right

Both run streaming jobs continuously. Glue Streaming is often the default answer because Glue is familiar — but Managed Service for Apache Flink is the correct answer when real-time analytics, windowing, or Flink-native features are required.

Train these confusions, not just read them

10 DEA-C01 questions. Pattern-tagged with trap analysis. Free, no signup required.

Start DEA-C01 Mini-Trainer →