Spring Sale 65% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: exams65

ExamsBrite Dumps

AWS Certified Data Engineer - Associate (DEA-C01) Question and Answers

AWS Certified Data Engineer - Associate (DEA-C01)

Last Update Feb 28, 2026
Total Questions : 255

We are offering FREE Data-Engineer-Associate Amazon Web Services exam questions. All you do is to just go and sign up. Give your details, prepare Data-Engineer-Associate free exam questions and then go for complete pool of AWS Certified Data Engineer - Associate (DEA-C01) test questions that will help you more.

Data-Engineer-Associate pdf

Data-Engineer-Associate PDF

$36.75  $104.99
Data-Engineer-Associate Engine

Data-Engineer-Associate Testing Engine

$43.75  $124.99
Data-Engineer-Associate PDF + Engine

Data-Engineer-Associate PDF + Testing Engine

$57.75  $164.99
Questions 1

A company has an application that uses an Amazon API Gateway REST API and an AWS Lambda function to retrieve data from an Amazon DynamoDB instance. Users recently reported intermittent high latency in the application's response times. A data engineer finds that the Lambda function experiences frequent throttling when the company's other Lambda functions experience increased invocations.

The company wants to ensure the API's Lambda function operates without being affected by other Lambda functions.

Which solution will meet this requirement MOST cost-effectively?

Options:

A.  

Increase the number of read capacity unit (RCU) in DynamoDB.

B.  

Configure provisioned concurrency for the Lambda function.

C.  

Configure reserved concurrency for the Lambda function.

D.  

Increase the Lambda function timeout and allocated memory.

Discussion 0
Questions 2

A company needs a solution to manage costs for an existing Amazon DynamoDB table. The company also needs to control the size of the table. The solution must not disrupt any ongoing read or write operations. The company wants to use a solution that automatically deletes data from the table after 1 month.

Which solution will meet these requirements with the LEAST ongoing maintenance?

Options:

A.  

Use the DynamoDB TTL feature to automatically expire data based on timestamps.

B.  

Configure a scheduled Amazon EventBridge rule to invoke an AWS Lambda function to check for data that is older than 1 month. Configure the Lambda function to delete old data.

C.  

Configure a stream on the DynamoDB table to invoke an AWS Lambda function. Configure the Lambda function to delete data in the table that is older than 1 month.

D.  

Use an AWS Lambda function to periodically scan the DynamoDB table for data that is older than 1 month. Configure the Lambda function to delete old data.

Discussion 0
Questions 3

A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load (ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform transformations, and load the transformed data into Amazon Redshift for analytics. The data updates must occur every hour.

Which combination of tasks will meet these requirements with the LEAST operational overhead? (Choose two.)

Options:

A.  

Configure AWS Glue triggers to run the ETL jobs even/ hour.

B.  

Use AWS Glue DataBrewto clean and prepare the data for analytics.

C.  

Use AWS Lambda functions to schedule and run the ETL jobs even/ hour.

D.  

Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift.

E.  

Use the Redshift Data API to load transformed data into Amazon Redshift.

Discussion 0
Questions 4

A data engineer must orchestrate a series of Amazon Athena queries that will run every day. Each query can run for more than 15 minutes.

Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)

Options:

A.  

Use an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.

B.  

Create an AWS Step Functions workflow and add two states. Add the first state before the Lambda function. Configure the second state as a Wait state to periodically check whether the Athena query has finished using the Athena Boto3 get_query_execution API call. Configure the workflow to invoke the next query when the current query has finished running.

C.  

Use an AWS Glue Python shell job and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.

D.  

Use an AWS Glue Python shell script to run a sleep timer that checks every 5 minutes to determine whether the current Athena query has finished running successfully. Configure the Python shell script to invoke the next query when the current query has finished running.

E.  

Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the Athena queries in AWS Batch.

Discussion 0
Questions 5

A company stores customer data in an Amazon S3 bucket. The company must permanently delete all customer data that is older than 7 years.

Options:

A.  

Configure an S3 Lifecycle policy to permanently delete objects that are older than 7 years.

B.  

Use Amazon Athena to query the S3 bucket for objects that are older than 7 years. Configure Athena to delete the results.

C.  

Configure an S3 Lifecycle policy to move objects that are older than 7 years to S3 Glacier Deep Archive.

D.  

Configure an S3 Lifecycle policy to enable S3 Object Lock on all objects that are older than 7 years.

Discussion 0
Questions 6

A data engineer has a one-time task to read data from objects that are in Apache Parquet format in an Amazon S3 bucket. The data engineer needs to query only one column of the data.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Confiqure an AWS Lambda function to load data from the S3 bucket into a pandas dataframe- Write a SQL SELECT statement on the dataframe to query the required column.

B.  

Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects.

C.  

Prepare an AWS Glue DataBrew project to consume the S3 objects and to query the required column.

D.  

Run an AWS Glue crawler on the S3 objects. Use a SQL SELECT statement in Amazon Athena to query the required column.

Discussion 0
Questions 7

A data engineer runs Amazon Athena queries on data that is in an Amazon S3 bucket. The Athena queries use AWS Glue Data Catalog as a metadata table.

The data engineer notices that the Athena query plans are experiencing a performance bottleneck. The data engineer determines that the cause of the performance bottleneck is the large number of partitions that are in the S3 bucket. The data engineer must resolve the performance bottleneck and reduce Athena query planning time.

Which solutions will meet these requirements? (Choose two.)

Options:

A.  

Create an AWS Glue partition index. Enable partition filtering.

B.  

Bucket the data based on a column that the data have in common in a WHERE clause of the user query

C.  

Use Athena partition projection based on the S3 bucket prefix.

D.  

Transform the data that is in the S3 bucket to Apache Parquet format.

E.  

Use the Amazon EMR S3DistCP utility to combine smaller objects in the S3 bucket into larger objects.

Discussion 0
Questions 8

A company needs to store semi-structured transactional data in a serverless database.

The application writes data infrequently but reads it frequently, with millisecond retrieval required.

Options:

A.  

Store the data in an Amazon S3 Standard bucket. Enable S3 Transfer Acceleration.

B.  

Store the data in an Amazon S3 Apache Iceberg table. Enable S3 Transfer Acceleration.

C.  

Store the data in an Amazon RDS for MySQL cluster. Configure RDS Optimized Reads.

D.  

Store the data in an Amazon DynamoDB table. Configure a DynamoDB Accelerator (DAX) cache.

Discussion 0
Questions 9

During a security review, a company identified a vulnerability in an AWS Glue job. The company discovered that credentials to access an Amazon Redshift cluster were hard coded in the job script.

A data engineer must remediate the security vulnerability in the AWS Glue job. The solution must securely store the credentials.

Which combination of steps should the data engineer take to meet these requirements? (Choose two.)

Options:

A.  

Store the credentials in the AWS Glue job parameters.

B.  

Store the credentials in a configuration file that is in an Amazon S3 bucket.

C.  

Access the credentials from a configuration file that is in an Amazon S3 bucket by using the AWS Glue job.

D.  

Store the credentials in AWS Secrets Manager.

E.  

Grant the AWS Glue job 1AM role access to the stored credentials.

Discussion 0
Questions 10

A company is building a new application that ingests CSV files into Amazon Redshift. The company has developed the frontend for the application.

The files are stored in an Amazon S3 bucket. Files are no larger than 5 MB.

A data engineer is developing the extract, transform, and load (ETL) pipeline for the CSV files. The data engineer configured a Redshift cluster and an AWS Lambda function that copies the data out of the files into the Redshift cluster.

Which additional steps should the data engineer perform to meet these requirements?

Options:

A.  

Configure the bucket to send S3 event notifications to Amazon EventBridge. Configure an EventBridge rule that matches S3 new object created events. Set the Lambda function as the target.

B.  

Configure the S3 bucket to send S3 event notifications to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the Lambda function to process the queue.

C.  

Configure AWS Database Migration Service (AWS DMS) to stream new S3 objects to a data stream in Amazon Kinesis Data Streams. Set the Lambda function as the target of the data stream.

D.  

Configure an Amazon EventBridge rule that matches S3 new object created events. Set an Amazon Simple Queue Service (Amazon SQS) queue as the target of the rule. Configure the Lambda function to process the queue.

Discussion 0
Questions 11

A data engineer is building an automated extract, transform, and load (ETL) ingestion pipeline by using AWS Glue. The pipeline ingests compressed files that are in an Amazon S3 bucket. The ingestion pipeline must support incremental data processing.

Which AWS Glue feature should the data engineer use to meet this requirement?

Options:

A.  

Workflows

B.  

Triggers

C.  

Job bookmarks

D.  

Classifiers

Discussion 0
Questions 12

A company has a data processing pipeline that includes several dozen steps. The data processing pipeline needs to send alerts in real time when a step fails or succeeds. The data processing pipeline uses a combination of Amazon S3 buckets, AWS Lambda functions, and AWS Step Functions state machines.

A data engineer needs to create a solution to monitor the entire pipeline.

Which solution will meet these requirements?

Options:

A.  

Configure the Step Functions state machines to store notifications in an Amazon S3 bucket when the state machines finish running. Enable S3 event notifications on the S3 bucket.

B.  

Configure the AWS Lambda functions to store notifications in an Amazon S3 bucket when the state machines finish running. Enable S3 event notifications on the S3 bucket.

C.  

Use AWS CloudTrail to send a message to an Amazon Simple Notification Service (Amazon SNS) topic that sends notifications when a state machine fails to run or succeeds to run.

D.  

Configure an Amazon EventBridge rule to react when the execution status of a state machine changes. Configure the rule to send a message to an Amazon Simple Notification Service (Amazon SNS) topic that sends notifications.

Discussion 0
Questions 13

A company has several new datasets in CSV and JSON formats. A data engineer needs to make the data available to a team of data analysts who will analyze the data by using SQL queries.

Which solution will meet these requirements in the MOST cost-effective way?

Options:

A.  

Create an Amazon RDS for MySQL DB cluster. Use AWS Glue to transform and load the CSV and JSON files into database tables. Provide the data analysts access to the DB cluster.

B.  

Create an AWS Glue DataBrew project that contains the new data. Make the DataBrew project available to the data analysts.

C.  

Store the data in an Amazon S3 bucket. Use an AWS Glue crawler to catalog the S3 data as tables. Create an Amazon Athena workgroup that has a data usage threshold. Grant the data analysts access to the Athena workgroup.

D.  

Load the data into SPICE (Super-fast, Parallel, In-memory Calculation Engine) in Amazon QuickSight. Allow the data analysts to create analyses and dashboards in QuickSight.

Discussion 0
Questions 14

A company stores time-series data that is collected from streaming services in an Amazon S3 bucket. The company must ensure that only workloads that are deployed within the company's VPC can access the data.

Which solution will meet this requirement?

Options:

A.  

Create an S3 bucket policy that uses a condition to allow access only to traffic that originates from the company's VPC.

B.  

Apply a security group to the S3 bucket that allows connections only from the company's VPC CIDR block.

C.  

Define an IAM policy that denies access to all users unless the request originates from within the company's VP

C.  

D.  

Use a network ACL on the VPC subnets to allow only specific resources to access the S3 bucket.

Discussion 0
Questions 15

A company uses Amazon RDS for MySQL as the database for a critical application. The database workload is mostly writes, with a small number of reads.

A data engineer notices that the CPU utilization of the DB instance is very high. The high CPU utilization is slowing down the application. The data engineer must reduce the CPU utilization of the DB Instance.

Which actions should the data engineer take to meet this requirement? (Choose two.)

Options:

A.  

Use the Performance Insights feature of Amazon RDS to identify queries that have high CPU utilization. Optimize the problematic queries.

B.  

Modify the database schema to include additional tables and indexes.

C.  

Reboot the RDS DB instance once each week.

D.  

Upgrade to a larger instance size.

E.  

Implement caching to reduce the database query load.

Discussion 0
Questions 16

A company uses Amazon S3 buckets, AWS Glue tables, and Amazon Athena as components of a data lake. Recently, the company expanded its sales range to multiple new states. The company wants to introduce state names as a new partition to the existing S3 bucket, which is currently partitioned by date.

The company needs to ensure that additional partitions will not disrupt daily synchronization between the AWS Glue Data Catalog and the S3 buckets.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Use the AWS Glue API to manually update the Data Catalog.

B.  

Run an MSCK REPAIR TABLE command in Athena.

C.  

Schedule an AWS Glue crawler to periodically update the Data Catalog.

D.  

Run a REFRESH TABLE command in Athena.

Discussion 0
Questions 17

A data engineer maintains custom Python scripts that perform a data formatting process that many AWS Lambda functions use. When the data engineer needs to modify the Python scripts, the data engineer must manually update all the Lambda functions.

The data engineer requires a less manual way to update the Lambda functions.

Which solution will meet this requirement?

Options:

A.  

Store a pointer to the custom Python scripts in the execution context object in a shared Amazon S3 bucket.

B.  

Package the custom Python scripts into Lambda layers. Apply the Lambda layers to the Lambda functions.

C.  

Store a pointer to the custom Python scripts in environment variables in a shared Amazon S3 bucket.

D.  

Assign the same alias to each Lambda function. Call reach Lambda function by specifying the function's alias.

Discussion 0
Questions 18

A data engineer is building a new data pipeline that stores metadata in an Amazon DynamoDB table. The data engineer must ensure that all items that are older than a specified age are removed from the DynamoDB table daily.

Which solution will meet this requirement with the LEAST configuration effort?

Options:

A.  

Enable DynamoDB TTL on the DynamoDB table. Adjust the application source code to set the TTL attribute appropriately.

B.  

Create an Amazon EventBridge rule that uses a daily cron expression to trigger an AWS Lambda function to delete items that are older than the specified age.

C.  

Add a lifecycle configuration to the DynamoDB table that deletes items that are older than the specified age.

D.  

Create a DynamoDB stream that has an AWS Lambda function that reacts to data modifications. Configure the Lambda function to delete items that are older than the specified age.

Discussion 0
Questions 19

A company currently uses a provisioned Amazon EMR cluster that includes general purpose Amazon EC2 instances. The EMR cluster uses EMR managed scaling between one to five task nodes for the company's long-running Apache Spark extract, transform, and load (ETL) job. The company runs the ETL job every day.

When the company runs the ETL job, the EMR cluster quickly scales up to five nodes. The EMR cluster often reaches maximum CPU usage, but the memory usage remains under 30%.

The company wants to modify the EMR cluster configuration to reduce the EMR costs to run the daily ETL job.

Which solution will meet these requirements MOST cost-effectively?

Options:

A.  

Increase the maximum number of task nodes for EMR managed scaling to 10.

B.  

Change the task node type from general purpose EC2 instances to memory optimized EC2 instances.

C.  

Switch the task node type from general purpose EC2 instances to compute optimized EC2 instances.

D.  

Reduce the scaling cooldown period for the provisioned EMR cluster.

Discussion 0
Questions 20

A data engineer needs to debug an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The data engineer enabled the bookmark feature for the AWS Glue job. The data engineer has set the maximum concurrency for the AWS Glue job to 1.

The AWS Glue job is successfully writing the output to Amazon Redshift. However, the Amazon S3 files that were loaded during previous runs of the AWS Glue job are being reprocessed by subsequent runs.

What is the likely reason the AWS Glue job is reprocessing the files?

Options:

A.  

The AWS Glue job does not have the s3:GetObjectAcl permission that is required for bookmarks to work correctly.

B.  

The maximum concurrency for the AWS Glue job is set to 1.

C.  

The data engineer incorrectly specified an older version of AWS Glue for the Glue job.

D.  

The AWS Glue job does not have a required commit statement.

Discussion 0
Questions 21

An ecommerce company stores sales data in an AWS Glue table named sales_data. The company stores the sales_data table in an Amazon S3 Standard bucket. The table contains columns named order_id, customer_id, product_id, order_date, shipping_date, and order_amount.

The company wants to improve query performance by partitioning the sales_data table by order_date. The company needs to add the partition to the existing sales_data table in AWS Glue.

Which solution will meet these requirements?

Options:

A.  

Update the AWS Glue table’s schema to include the new partition.

B.  

Edit the AWS Glue table’s metadata file directly in Amazon S3.

C.  

Use the AWS Glue Data Catalog API to add the new partition to the table.

D.  

Manually modify the S3 bucket to use the new partition.

Discussion 0
Questions 22

A company stores server logs in an Amazon 53 bucket. The company needs to keep the logs for 1 year. The logs are not required after 1 year.

A data engineer needs a solution to automatically delete logs that are older than 1 year.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Define an S3 Lifecycle configuration to delete the logs after 1 year.

B.  

Create an AWS Lambda function to delete the logs after 1 year.

C.  

Schedule a cron job on an Amazon EC2 instance to delete the logs after 1 year.

D.  

Configure an AWS Step Functions state machine to delete the logs after 1 year.

Discussion 0
Questions 23

A media company wants to improve a system that recommends media content to customer based on user behavior and preferences. To improve the recommendation system, the company needs to incorporate insights from third-party datasets into the company's existing analytics platform.

The company wants to minimize the effort and time required to incorporate third-party datasets.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Use API calls to access and integrate third-party datasets from AWS Data Exchange.

B.  

Use API calls to access and integrate third-party datasets from AWS

C.  

Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories.

D.  

Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR).

Discussion 0
Questions 24

A company needs to collect logs for an Amazon RDS for MySQL database and make the logs available for audits. The logs must track each user that modifies data in the database or makes changes to the database instance.

Which solution will meet these requirements?

Options:

A.  

Enable Amazon CloudWatch Logs. Create metric filters to monitor database changes and instance-level changes. Configure automated notification systems to send near real-time alerts for suspicious database operations.

B.  

Configure an Amazon EventBridge rule to monitor database activity. Create an AWS Lambda function to process EventBridge events and store them in Amazon OpenSearch Service.

C.  

Configure AWS CloudTrail to log API calls. Use Amazon CloudWatch Logs for basic monitoring. Use IAM policies to control access to the logs. Set up scheduled reporting for log audits.

D.  

Enable and configure native Amazon RDS database audit logging. Enable Amazon CloudWatch Logs. Configure metric filters and alarms. Configure AWS CloudTrail audit logging.

Discussion 0
Questions 25

An airline company is collecting metrics about flight activities for analytics. The company is conducting a proof of concept (POC) test to show how analytics can provide insights that the company can use to increase on-time departures.

The POC test uses objects in Amazon S3 that contain the metrics in .csv format. The POC test uses Amazon Athena to query the data. The data is partitioned in the S3 bucket by date.

As the amount of data increases, the company wants to optimize the storage solution to improve query performance.

Which combination of solutions will meet these requirements? (Choose two.)

Options:

A.  

Add a randomized string to the beginning of the keys in Amazon S3 to get more throughput across partitions.

B.  

Use an S3 bucket that is in the same account that uses Athena to query the data.

C.  

Use an S3 bucket that is in the same AWS Region where the company runs Athena queries.

D.  

Preprocess the .csv data to JSON format by fetching only the document keys that the query requires.

E.  

Preprocess the .csv data to Apache Parquet format by fetching only the data blocks that are needed for predicates.

Discussion 0
Questions 26

A company stores petabytes of data in thousands of Amazon S3 buckets in the S3 Standard storage class. The data supports analytics workloads that have unpredictable and variable data access patterns.

The company does not access some data for months. However, the company must be able to retrieve all data within milliseconds. The company needs to optimize S3 storage costs.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Use S3 Storage Lens standard metrics to determine when to move objects to more cost-optimized storage classes. Create S3 Lifecycle policies for the S3 buckets to move objects to cost-optimized storage classes. Continue to refine the S3 Lifecycle policies in the future to optimize storage costs.

B.  

Use S3 Storage Lens activity metrics to identify S3 buckets that the company accesses infrequently. Configure S3 Lifecycle rules to move objects from S3 Standard to the S3 Standard-Infrequent Access (S3 Standard-IA) and S3 Glacier storage classes based on the age of the data.

C.  

Use S3 Intelligent-Tiering. Activate the Deep Archive Access tier.

D.  

Use S3 Intelligent-Tiering. Use the default access tier.

Discussion 0
Questions 27

A company’s data processing pipeline uses AWS Glue jobs and AWS Glue Data Catalog. All AWS Glue jobs must run in a custom VPC inside a private subnet. The company uses a NAT gateway to support outbound connections.

A data engineer needs to use AWS Glue to migrate data from an on-premises PostgreSQL database to Amazon S3. There is no current network connection between AWS and the on-premises environment. However, the data engineer has updated the on-premises database to allow traffic from the custom VPC.

Which solution will meet these requirements?

Options:

A.  

Create a JDBC connection in AWS Glue with the database JDBC URL, username, and password.

B.  

Create a Simple Authentication and Security Layer (SASL) connection in AWS Glue to the on-premises database.

C.  

Create a JDBC connection in AWS Glue with a security group that allows TCP traffic to and from itself.

D.  

Create a JDBC connection in AWS Glue that uses a JDBC driver stored in Amazon S3. Retrieve the database URL, username, and password from AWS Secrets Manager.

Discussion 0
Questions 28

A company uses Amazon S3 and AWS Glue Data Catalog to manage a data lake that contains contact information for customers. The company uses PySpark and AWS Glue jobs with a DynamicFrame to run a workflow that processes data within the data lake.

A data engineer notices that the workflow is generating errors as a result of how customer postal codes are stored in the data lake. Some postal codes include unnecessary numbers or invalid characters.

The data engineer needs a solution to address the errors and correct the postal codes in the data lake.

Which solution will meet these requirements?

Options:

A.  

Create a schema definition for PySpark that matches the format the processing workflow requires for postal codes. Pass the schema to the DynamicFrame during processing.

B.  

Use AWS Glue workflow properties to allow job state sharing. Configure the AWS Glue jobs to read values from the postal code column by using the properties from a previously successful run of the jobs.

C.  

Configure the columnPushDownPredicate setting and the catalogPartitionPredicate settings for the postal code column in the DynamicFrame.

D.  

Set the DynamicFrame additional options parameter useSSListImplementation to True.

Discussion 0
Questions 29

A company has a data lake in Amazon 53. The company uses AWS Glue to catalog data and AWS Glue Studio to implement data extract, transform, and load (ETL) pipelines.

The company needs to ensure that data quality issues are checked every time the pipelines run. A data engineer must enhance the existing pipelines to evaluate data quality rules based on predefined thresholds.

Which solution will meet these requirements with the LEAST implementation effort?

Options:

A.  

Add a new transform that is defined by a SQL query to each Glue ETL job. Use the SQL query to implement a ruleset that includes the data quality rules that need to be evaluated.

B.  

Add a new Evaluate Data Quality transform to each Glue ETL job. Use Data Quality Definition Language (DQDL) to implement a ruleset that includes the data quality rules that need to be evaluated.

C.  

Add a new custom transform to each Glue ETL job. Use the PyDeequ library to implement a ruleset that includes the data quality rules that need to be evaluated.

D.  

Add a new custom transform to each Glue ETL job. Use the Great Expectations library to implement a ruleset that includes the data quality rules that need to be evaluated.

Discussion 0
Questions 30

A company is creating a new data pipeline to populate a data lake. A data analyst needs to prepare and standardize the data before a data engineering team can perform advanced data transformations. The data analyst needs a solution to process the data that does not require writing new code.

Which solution will meet these requirements with the LEAST operational effort?

Options:

A.  

Use Python and Pandas in an AWS Glue Studio notebook. Ensure that the data engineers add additional transformations to complete the pipeline.

B.  

Use Amazon SageMaker Canvas and SageMaker Data Wrangler to write to a new dataset. Ensure that the data engineers add additional transformations to complete the pipeline by using AWS Glue.

C.  

Use AWS Glue Studio with data preparation recipe transformations. Ensure that the data engineers add additional transformations to complete the pipeline.

D.  

Create a document that includes the data preparation rules. Ensure that the data engineers implement the rules in AWS Glue.

Discussion 0
Questions 31

A company creates a new non-production application that runs on an Amazon EC2 instance. The application needs to communicate with an Amazon RDS database instance using Java Database Connectivity (JDBC). The EC2 instances and the RDS database instance are in the same subnet.

Which solution will meet this requirement?

Options:

A.  

Modify the IAM role that is assigned to the database instance to allow connections from the EC2 instances.

B.  

Modify the ec2_authorized_hosts parameter in the RDS parameter group to include the EC2 instances. Restart the database instance.

C.  

Update the database security group to allow connections from the EC2 instances.

D.  

Enable the Amazon RDS Data API and specify the Amazon Resource Name (ARN) of the database instance in the JDBC connection string.

Discussion 0
Questions 32

A company uses Amazon S3 as a data lake. The company sets up a data warehouse by using a multi-node Amazon Redshift cluster. The company organizes the data files in the data lake based on the data source of each data file.

The company loads all the data files into one table in the Redshift cluster by using a separate COPY command for each data file location. This approach takes a long time to load all the data files into the table. The company must increase the speed of the data ingestion. The company does not want to increase the cost of the process.

Which solution will meet these requirements?

Options:

A.  

Use a provisioned Amazon EMR cluster to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.

B.  

Load all the data files in parallel into Amazon Aurora. Run an AWS Glue job to load the data into Amazon Redshift.

C.  

Use an AWS Glue job to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.

D.  

Create a manifest file that contains the data file locations. Use a COPY command to load the data into Amazon Redshift.

Discussion 0
Questions 33

A retail company uses an Amazon Redshift data warehouse and an Amazon S3 bucket. The company ingests retail order data into the S3 bucket every day.

The company stores all order data at a single path within the S3 bucket. The data has more than 100 columns. The company ingests the order data from a third-party application that generates more than 30 files in CSV format every day. Each CSV file is between 50 and 70 MB in size.

The company uses Amazon Redshift Spectrum to run queries that select sets of columns. Users aggregate metrics based on daily orders. Recently, users have reported that the performance of the queries has degraded. A data engineer must resolve the performance issues for the queries.

Which combination of steps will meet this requirement with LEAST developmental effort? (Select TWO.)

Options:

A.  

Configure the third-party application to create the files in a columnar format.

B.  

Develop an AWS Glue ETL job to convert the multiple daily CSV files to one file for each day.

C.  

Partition the order data in the S3 bucket based on order date.

D.  

Configure the third-party application to create the files in JSON format.

E.  

Load the JSON data into the Amazon Redshift table in a SUPER type column.

Discussion 0
Questions 34

A company manages an Amazon Redshift data warehouse. The data warehouse is in a public subnet inside a custom VPC A security group allows only traffic from within itself- An ACL is open to all traffic.

The company wants to generate several visualizations in Amazon QuickSight for an upcoming sales event. The company will run QuickSight Enterprise edition in a second AW5 account inside a public subnet within a second custom VPC. The new public subnet has a security group that allows outbound traffic to the existing Redshift cluster.

A data engineer needs to establish connections between Amazon Redshift and QuickSight. QuickSight must refresh dashboards by querying the Redshift cluster.

Which solution will meet these requirements?

Options:

A.  

Configure the Redshift security group to allow inbound traffic on the Redshift port from the QuickSight security group.

B.  

Assign Elastic IP addresses to the QuickSight visualizations. Configure the QuickSight security group to allow inbound traffic on the Redshift port from the Elastic IP addresses.

C.  

Confirm that the CIDR ranges of the Redshift VPC and the QuickSight VPC are the same. If CIDR ranges are different, reconfigure one CIDR range to match the other. Establish network peering between the VPCs.

D.  

Create a QuickSight gateway endpoint in the Redshift VPC. Attach an endpoint policy to the gateway endpoint to ensure only specific QuickSight accounts can use the endpoint.

Discussion 0
Questions 35

A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded.

A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB.

How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?

Options:

A.  

Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events.

B.  

Use the Amazon Redshift Data API to publish an event to Amazon EventBridqe. Configure an EventBridge rule to invoke the Lambda function.

C.  

Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function.

D.  

Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.

Discussion 0
Questions 36

A company stores data from an application in an Amazon DynamoDB table that operates in provisioned capacity mode. The workloads of the application have predictable throughput load on a regular schedule. Every Monday, there is an immediate increase in activity early in the morning. The application has very low usage during weekends.

The company must ensure that the application performs consistently during peak usage times.

Which solution will meet these requirements in the MOST cost-effective way?

Options:

A.  

Increase the provisioned capacity to the maximum capacity that is currently present during peak load times.

B.  

Divide the table into two tables. Provision each table with half of the provisioned capacity of the original table. Spread queries evenly across both tables.

C.  

Use AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times. Schedule lower capacity during off-peak times.

D.  

Change the capacity mode from provisioned to on-demand. Configure the table to scale up and scale down based on the load on the table.

Discussion 0
Questions 37

A gaming company uses AWS Glue to perform read and write operations on Apache Iceberg tables for real-time streaming data. The data in the Iceberg tables is stored in Apache Parquet format. The company is experiencing slow query performance.

Which solutions will improve query performance? (Select TWO)

Options:

A.  

Use AWS Glue Data Catalog to generate column-level statistics for the Iceberg tables on a schedule.

B.  

Use AWS Glue Data Catalog to automatically compact the Iceberg tables.

C.  

Use AWS Glue Data Catalog to automatically optimize indexes for the Iceberg tables.

D.  

Use AWS Glue Data Catalog to enable copy-on-write for the Iceberg tables.

E.  

Use AWS Glue Data Catalog to generate views for the Iceberg tables.

Discussion 0
Questions 38

A company has three subsidiaries. Each subsidiary uses a different data warehousing solution. The first subsidiary hosts its data warehouse in Amazon Redshift. The second subsidiary uses Teradata Vantage on AWS. The third subsidiary uses Google BigQuery.

The company wants to aggregate all the data into a central Amazon S3 data lake. The company wants to use Apache Iceberg as the table format.

A data engineer needs to build a new pipeline to connect to all the data sources, run transformations by using each source engine, join the data, and write the data to Iceberg.

Which solution will meet these requirements with the LEAST operational effort?

Options:

A.  

Use native Amazon Redshift, Teradata, and BigQuery connectors to build the pipeline in AWS Glue. Use native AWS Glue transforms to join the data. Run a Merge operation on the data lake Iceberg table.

B.  

Use the Amazon Athena federated query connectors for Amazon Redshift, Teradata, and BigQuery to build the pipeline in Athena. Write a SQL query to read from all the data sources, join the data, and run a Merge operation on the data lake Iceberg table.

C.  

Use the native Amazon Redshift connector, the Java Database Connectivity (JDBC) connector for Teradata, and the open source Apache Spark BigQuery connector to build the pipeline in Amazon EMR. Write code in PySpark to join the data. Run a Merge operation on the data lake Iceberg table.

D.  

Use the native Amazon Redshift, Teradata, and BigQuery connectors in Amazon Appflow to write data to Amazon S3 and AWS Glue Data Catalog. Use Amazon Athena to join the data. Run a Merge operation on the data lake Iceberg table.

Discussion 0
Questions 39

A sales company uses AWS Glue ETL to collect, process, and ingest data into an Amazon S3 bucket. The AWS Glue pipeline creates a new file in the S3 bucket every hour. File sizes vary from 200 KB to 300 KB. The company wants to build a sales prediction model by using data from the previous 5 years. The historic data includes 44,000 files.

The company builds a second AWS Glue ETL pipeline by using the smallest worker type. The second pipeline retrieves the historic files from the S3 bucket and processes the files for downstream analysis. The company notices significant performance issues with the second ETL pipeline.

The company needs to improve the performance of the second pipeline.

Which solution will meet this requirement MOST cost-effectively?

Options:

A.  

Use a larger worker type.

B.  

Increase the number of workers in the AWS Glue ETL jobs.

C.  

Use the AWS Glue DynamicFrame grouping option.

D.  

Enable AWS Glue auto scaling.

Discussion 0
Questions 40

A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB.

Which solution will meet these requirements MOST cost-effectively?

Options:

A.  

Write a custom Python application. Host the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

B.  

Write a PySpark ETL script. Host the script on an Amazon EMR cluster.

C.  

Write an AWS Glue PySpark job. Use Apache Spark to transform the data.

D.  

Write an AWS Glue Python shell job. Use pandas to transform the data.

Discussion 0
Questions 41

Files from multiple data sources arrive in an Amazon S3 bucket on a regular basis. A data engineer wants to ingest new files into Amazon Redshift in near real time when the new files arrive in the S3 bucket.

Which solution will meet these requirements?

Options:

A.  

Use the query editor v2 to schedule a COPY command to load new files into Amazon Redshift.

B.  

Use the zero-ETL integration between Amazon Aurora and Amazon Redshift to load new files into Amazon Redshift.

C.  

Use AWS Glue job bookmarks to extract, transform, and load (ETL) load new files into Amazon Redshift.

D.  

Use S3 Event Notifications to invoke an AWS Lambda function that loads new files into Amazon Redshift.

Discussion 0
Questions 42

A data engineer needs to join data from multiple sources to perform a one-time analysis job. The data is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3.

Which solution will meet this requirement MOST cost-effectively?

Options:

A.  

Use an Amazon EMR provisioned cluster to read from all sources. Use Apache Spark to join the data and perform the analysis.

B.  

Copy the data from DynamoDB, Amazon RDS, and Amazon Redshift into Amazon S3. Run Amazon Athena queries directly on the S3 files.

C.  

Use Amazon Athena Federated Query to join the data from all data sources.

D.  

Use Redshift Spectrum to query data from DynamoDB, Amazon RDS, and Amazon S3 directly from Redshift.

Discussion 0
Questions 43

A company needs to implement a new inventory management system that provides near real-time updates and visibility across all AWS Regions. The new solution must provide centralized access control over data access and permissions. The company has a separate inventory management team assigned to each Region. Each inventory management team needs to update inventory levels.

A data engineer must implement Amazon Redshift data sharing with write capabilities. The solution must follow the principle of least privilege.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Configure a single Redshift datashare from the company's headquarters that provides read-only access for all Regions. Configure a separate AWS Glue ETL job to update data for each Region.

B.  

Configure three Regional Redshift datashares that provide full write access. Allow full self-managed access controls.

C.  

Configure a single Redshift datashare from the company's headquarters that has selective write permissions for inventory. Set up Regional namespace controls.

D.  

Configure separate Redshift datashares for multiple table types that provide full write access. Distribute the datashares across all Regional clusters. Allow self-managed Regional schema permissions.

Discussion 0
Questions 44

A company receives marketing campaign data from a vendor. The company ingests the data into an Amazon S3 bucket every 40 to 60 minutes. The data is in CSV format. File sizes are between 100 KB and 300 KB.

A data engineer needs to set-up an extract, transform, and load (ETL) pipeline to upload the content of each file to Amazon Redshift.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Create an AWS Lambda function that connects to Amazon Redshift and runs a COPY command. Use Amazon EventBridge to invoke the Lambda function based on an Amazon S3 upload trigger.

B.  

Create an Amazon Data Firehose stream. Configure the stream to use an AWS Lambda function as a source to pull data from the S3 bucket. Set Amazon Redshift as the destination.

C.  

Use Amazon Redshift Spectrum to query the S3 bucket. Configure an AWS Glue Crawler for the S3 bucket to update metadata in an AWS Glue Data Catalog.

D.  

Creates an AWS Database Migration Service (AWS DMS) task. Specify an appropriate data schema to migrate. Specify the appropriate type of migration to use.

Discussion 0
Questions 45

A data engineer is building a solution to detect sensitive information that is stored in a data lake across multiple Amazon S3 buckets. The solution must detect personally identifiable information (PII) that is in a proprietary data format.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Use the AWS Glue Detect PII transform with specific patterns.

B.  

Use Amazon Macie with managed data identifiers.

C.  

Use an AWS Lambda function with custom regular expressions.

D.  

Use Amazon Athena with a SQL query to match the custom formats.

Discussion 0
Questions 46

A data engineer needs to build an enterprise data catalog based on the company's Amazon S3 buckets and Amazon RDS databases. The data catalog must include storage format metadata for the data in the catalog.

Which solution will meet these requirements with the LEAST effort?

Options:

A.  

Use an AWS Glue crawler to scan the S3 buckets and RDS databases and build a data catalog. Use data stewards to inspect the data and update the data catalog with the data format.

B.  

Use an AWS Glue crawler to build a data catalog. Use AWS Glue crawler classifiers to recognize the format of data and store the format in the catalog.

C.  

Use Amazon Macie to build a data catalog and to identify sensitive data elements. Collect the data format information from Macie.

D.  

Use scripts to scan data elements and to assign data classifications based on the format of the data.

Discussion 0
Questions 47

A data engineer is processing a large amount of log data from web servers. The data is stored in an Amazon S3 bucket. The data engineer uses AWS services to process the data every day. The data engineer needs to extract specific fields from the raw log data and load the data into a data warehouse for analysis.

Options:

A.  

Use Amazon EMR to run Apache Hive queries on the raw log files in the S3 bucket to extract the specified fields. Store the output as ORC files in the original S3 bucket.

B.  

Use AWS Step Functions to orchestrate a series of AWS Batch jobs to parse the raw log files. Load the specified fields into an Amazon RDS for PostgreSQL database.

C.  

Use an AWS Glue crawler to parse the raw log data in the S3 bucket and to generate a schema. Use AWS Glue ETL jobs to extract and transform the data and to load it into Amazon Redshift.

D.  

Use AWS Glue DataBrew to run AWS Glue ETL jobs on a schedule to extract the specified fields from the raw log files in the S3 bucket. Load the data into partitioned tables in Amazon Redshift.

Discussion 0
Questions 48

A hotel management company receives daily data files from each of its hotels. The company wants to upload its data to AWS. The company plans to use Amazon Athena to access the files. The company needs to protect the files from accidental deletion. The company will develop an application on its on-premises servers to automatically forward the files to a fully managed AWS ingestion service.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Use AWS DataSync to replicate data from the on-premises servers to Amazon Elastic File System (Amazon EFS). Configure automatic backups in AWS Backup.

B.  

Use the Amazon Kinesis Agent on the on-premises servers to send data to Amazon Data Firehose. Store the data in an Amazon S3 bucket that has versioning enabled.

C.  

Use AWS Glue jobs to ingest data from the on-premises servers into Amazon RDS. Enable automated backups for data protection.

D.  

Use a self-managed Apache Kafka agent on the on-premises servers to stream data to Amazon Managed Streaming for Apache Kafka (Amazon MSK). Store the data in an Amazon S3 bucket with versioning enabled.

Discussion 0
Questions 49

A data engineer needs to optimize the performance of a data pipeline that handles retail orders. Data about the orders is ingested daily into an Amazon S3 bucket.

The data engineer runs queries once each week to extract metrics from the orders data based on the order date for multiple date ranges. The data engineer needs an optimization solution that ensures the query performance will not degrade when the volume of data increases.

Options:

A.  

Partition the data based on order date. Use Amazon Athena to query the data.

B.  

Partition the data based on order date. Use Amazon Redshift to query the data.

C.  

Partition the data based on load date. Use Amazon EMR to query the data.

D.  

Partition the data based on load date. Use Amazon Aurora to query the data.

Discussion 0
Questions 50

A company stores logs in an Amazon S3 bucket. When a data engineer attempts to access several log files, the data engineer discovers that some files have been unintentionally deleted.

The data engineer needs a solution that will prevent unintentional file deletion in the future.

Which solution will meet this requirement with the LEAST operational overhead?

Options:

A.  

Manually back up the S3 bucket on a regular basis.

B.  

Enable S3 Versioning for the S3 bucket.

C.  

Configure replication for the S3 bucket.

D.  

Use an Amazon S3 Glacier storage class to archive the data that is in the S3 bucket.

Discussion 0
Questions 51

A retail company is using an Amazon Redshift cluster to support real-time inventory management. The company has deployed an ML model on a real-time endpoint in Amazon SageMaker.

The company wants to make real-time inventory recommendations. The company also wants to make predictions about future inventory needs.

Which solutions will meet these requirements? (Select TWO.)

Options:

A.  

Use Amazon Redshift ML to generate inventory recommendations.

B.  

Use SQL to invoke a remote SageMaker endpoint for prediction.

C.  

Use Amazon Redshift ML to schedule regular data exports for offline model training.

D.  

Use SageMaker Autopilot to create inventory management dashboards in Amazon Redshift.

E.  

Use Amazon Redshift as a file storage system to archive old inventory management reports.

Discussion 0
Questions 52

A data engineer needs to query data from multiple sources to generate an annual report. The analytics team uses Amazon Redshift for analysis. The data engineer needs to integrate Amazon Redshift data with 10 years of historical data from Amazon RDS for PostgreSQL and RDS for MySQL. All the databases are in the same VPC. The data engineer needs a solution that provides seamless data integration with Amazon Redshift.

Which solution will meet these requirements in the MOST cost-effective way?

Options:

A.  

Use federated queries in Amazon Redshift to fetch data from RDS for PostgreSQL and RDS for MySQL. Apply the necessary transformations within Amazon Redshift.

B.  

Use the SELECT INTO OUTFILE S3 statement to export data from Amazon RDS to Amazon S3. Use the COPY command to load the data into Amazon Redshift.

C.  

Create a visual extract, transform, and load (ETL) job in AWS Glue to extract the required data and load it to Amazon Redshift.

D.  

Use AWS Database Migration Service (AWS DMS) to ingest data from RDS for PostgreSQL and RDS for MySQL. Implement the necessary transformations within Amazon Redshift.

Discussion 0
Questions 53

A company is building a data lake for a new analytics team. The company is using Amazon S3 for storage and Amazon Athena for query analysis. All data that is in Amazon S3 is in Apache Parquet format.

The company is running a new Oracle database as a source system in the company's data center. The company has 70 tables in the Oracle database. All the tables have primary keys. Data can occasionally change in the source system. The company wants to ingest the tables every day into the data lake.

Which solution will meet this requirement with the LEAST effort?

Options:

A.  

Create an Apache Sqoop job in Amazon EMR to read the data from the Oracle database. Configure the Sqoop job to write the data to Amazon S3 in Parquet format.

B.  

Create an AWS Glue connection to the Oracle database. Create an AWS Glue bookmark job to ingest the data incrementally and to write the data to Amazon S3 in Parquet format.

C.  

Create an AWS Database Migration Service (AWS DMS) task for ongoing replication. Set the Oracle database as the source. Set Amazon S3 as the target. Configure the task to write the data in Parquet format.

D.  

Create an Oracle database in Amazon RDS. Use AWS Database Migration Service (AWS DMS) to migrate the on-premises Oracle database to Amazon RDS. Configure triggers on the tables to invoke AWS Lambda functions to write changed records to Amazon S3 in Parquet format.

Discussion 0
Questions 54

A financial company recently added more features to its mobile app. The new features required the company to create a new topic in an existing Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster.

A few days after the company added the new topic, Amazon CloudWatch raised an alarm on the RootDiskUsed metric for the MSK cluster.

How should the company address the CloudWatch alarm?

Options:

A.  

Expand the storage of the MSK broker. Configure the MSK cluster storage to expand automatically.

B.  

Expand the storage of the Apache ZooKeeper nodes.

C.  

Update the MSK broker instance to a larger instance type. Restart the MSK cluster.

D.  

Specify the Target-Volume-in-GiB parameter for the existing topic.

Discussion 0
Questions 55

A company has a production AWS account that runs company workloads. The company's security team created a security AWS account to store and analyze security logs from the production AWS account. The security logs in the production AWS account are stored in Amazon CloudWatch Logs.

The company needs to use Amazon Kinesis Data Streams to deliver the security logs to the security AWS account.

Which solution will meet these requirements?

Options:

A.  

Create a destination data stream in the production AWS account. In the security AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the production AWS account.

B.  

Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the security AWS account.

C.  

Create a destination data stream in the production AWS account. In the production AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the security AWS account.

D.  

Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the production AWS account.

Discussion 0
Questions 56

A company uses an organization in AWS Organizations to manage multiple AWS accounts. The company uses an enhanced fanout data stream in Amazon Kinesis Data Streams to receive streaming data from multiple producers. The data stream runs in Account A. The company wants to use an AWS Lambda function in Account B to process the data from the stream. The company creates a Lambda execution role in Account B that has permissions to access data from the stream in Account A.

What additional step must the company take to meet this requirement?

Options:

A.  

Create a service control policy (SCP) to grant the data stream read access to the cross-account Lambda execution role. Attach the SCP to Account

A.  

B.  

Add a resource-based policy to the data stream to allow read access for the cross-account Lambda execution role.

C.  

Create a service control policy (SCP) to grant the data stream read access to the cross-account Lambda execution role. Attach the SCP to Account B.

D.  

Add a resource-based policy to the cross-account Lambda function to grant the data stream read access to the function.

Discussion 0
Questions 57

A company is developing an application that runs on Amazon EC2 instances. Currently, the data that the application generates is temporary. However, the company needs to persist the data, even if the EC2 instances are terminated.

A data engineer must launch new EC2 instances from an Amazon Machine Image (AMI) and configure the instances to preserve the data.

Which solution will meet this requirement?

Options:

A.  

Launch new EC2 instances by using an AMI that is backed by an EC2 instance store volume that contains the application data. Apply the default settings to the EC2 instances.

B.  

Launch new EC2 instances by using an AMI that is backed by a root Amazon Elastic Block Store (Amazon EBS) volume that contains the application data. Apply the default settings to the EC2 instances.

C.  

Launch new EC2 instances by using an AMI that is backed by an EC2 instance store volume. Attach an Amazon Elastic Block Store (Amazon EBS) volume to contain the application data. Apply the default settings to the EC2 instances.

D.  

Launch new EC2 instances by using an AMI that is backed by an Amazon Elastic Block Store (Amazon EBS) volume. Attach an additional EC2 instance store volume to contain the application data. Apply the default settings to the EC2 instances.

Discussion 0
Questions 58

A data engineer is building a data pipeline. A large data file is uploaded to an Amazon S3 bucket once each day at unpredictable times. An AWS Glue workflow uses hundreds of workers to process the file and load the data into Amazon Redshift. The company wants to process the file as quickly as possible.

Which solution will meet these requirements?

Options:

A.  

Create an on-demand AWS Glue trigger to start the workflow. Create an AWS Lambda function that runs every 15 minutes to check the S3 bucket for the daily file. Configure the function to start the AWS Glue workflow if the file is present.

B.  

Create an event-based AWS Glue trigger to start the workflow. Configure Amazon S3 to log events to AWS CloudTrail. Create a rule in Amazon EventBridge to forward PutObject events to the AWS Glue trigger.

C.  

Create a scheduled AWS Glue trigger to start the workflow. Create a cron job that runs the AWS Glue job every 15 minutes. Set up the AWS Glue job to check the S3 bucket for the daily file. Configure the job to stop if the file is not present.

D.  

Create an on-demand AWS Glue trigger to start the workflow. Create an AWS Database Migration Service (AWS DMS) migration task. Set the DMS source as the S3 bucket. Set the target endpoint as the AWS Glue workflow.

Discussion 0
Questions 59

A company is using an AWS Transfer Family server to migrate data from an on-premises environment to AWS. Company policy mandates the use of TLS 1.2 or above to encrypt the data in transit.

Which solution will meet these requirements?

Options:

A.  

Generate new SSH keys for the Transfer Family server. Make the old keys and the new keys available for use.

B.  

Update the security group rules for the on-premises network to allow only connections that use TLS 1.2 or above.

C.  

Update the security policy of the Transfer Family server to specify a minimum protocol version of TLS 1.2.

D.  

Install an SSL certificate on the Transfer Family server to encrypt data transfers by using TLS 1.2.

Discussion 0
Questions 60

A company builds a new data pipeline to process data for business intelligence reports. Users have noticed that data is missing from the reports.

A data engineer needs to add a data quality check for columns that contain null values and for referential integrity at a stage before the data is added to storage.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Use Amazon SageMaker Data Wrangler to create a Data Quality and Insights report.

B.  

Use AWS Glue ETL jobs to perform a data quality evaluation transform on the data. Use an IsComplete rule on the requested columns. Use a ReferentialIntegrity rule for each join.

C.  

Use AWS Glue ETL jobs to perform a SQL transform on the data to determine whether requested columns contain null values. Use a second SQL transform to check referential integrity.

D.  

Use Amazon SageMaker Data Wrangler and a custom Python transform to create custom rules to check for null values and referential integrity.

Discussion 0
Questions 61

A data engineer must orchestrate a data pipeline that consists of one AWS Lambda function and one AWS Glue job. The solution must integrate with AWS services.

Which solution will meet these requirements with the LEAST management overhead?

Options:

A.  

Use an AWS Step Functions workflow that includes a state machine. Configure the state machine to run the Lambda function and then the AWS Glue job.

B.  

Use an Apache Airflow workflow that is deployed on an Amazon EC2 instance. Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.

C.  

Use an AWS Glue workflow to run the Lambda function and then the AWS Glue job.

D.  

Use an Apache Airflow workflow that is deployed on Amazon Elastic Kubernetes Service (Amazon EKS). Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.

Discussion 0
Questions 62

A company stores a 100 MB dataset in an Amazon S3 bucket as an Apache Parquet file. A data engineer needs to profile the data before performing data preparation steps on the data.

Which solution will meet this requirement in the MOST operationally efficient way?

Options:

A.  

Create a profile job on the dataset in AWS Glue DataBrew. Review the profile job results.

B.  

Stream the data into Amazon Managed Service for Apache Flink for SQL queries. Use the Apache Flink dashboard to profile the data.

C.  

Ingest the data into Amazon Redshift Spectrum. Use SQL queries to profile the data.

D.  

Load the data into an Amazon QuickSight dataset. Build a topic to profile the data with questions.

Discussion 0
Questions 63

A company has a data lake in Amazon S3. The company collects AWS CloudTrail logs for multiple applications. The company stores the logs in the data lake, catalogs the logs in AWS Glue, and partitions the logs based on the year. The company uses Amazon Athena to analyze the logs.

Recently, customers reported that a query on one of the Athena tables did not return any data. A data engineer must resolve the issue.

Which combination of troubleshooting steps should the data engineer take? (Select TWO.)

Options:

A.  

Confirm that Athena is pointing to the correct Amazon S3 location.

B.  

Increase the query timeout duration.

C.  

Use the MSCK REPAIR TABLE command.

D.  

Restart Athena.

E.  

Delete and recreate the problematic Athena table.

Discussion 0
Questions 64

A retail company needs to implement a solution to capture data updates from multiple Amazon Aurora MySQL databases. The company needs to make the updates available for analytics in near real time. The solution must be serverless and require minimal maintenance.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Set up AWS Database Migration Service (AWS DMS) tasks that perform schema conversions for each database. Load the changes into Amazon Redshift Serverless.

B.  

Use Amazon Managed Streaming for Apache Kafka (Amazon MSK) Connect with Debezium connectors to load data into Amazon Redshift Serverless.

C.  

Use AWS Database Migration Service (AWS DMS) to set up binary log replication to Amazon Kinesis Data Streams. Load the data into Amazon Redshift Serverless after schema conversion.

D.  

Use Aurora zero-ETL integrations with Amazon Redshift Serverless for each database to load Aurora MySQL changes in Amazon Redshift Serverless.

Discussion 0
Questions 65

A data engineer needs to securely transfer 5 TB of data from an on-premises data center to an Amazon S3 bucket. Approximately 5% of the data changes every day. Updates to the data need to be regularly proliferated to the S3 bucket. The data includes files that are in multiple formats. The data engineer needs to automate the transfer process and must schedule the process to run periodically.

Which AWS service should the data engineer use to transfer the data in the MOST operationally efficient way?

Options:

A.  

AWS DataSync

B.  

AWS Glue

C.  

AWS Direct Connect

D.  

Amazon S3 Transfer Acceleration

Discussion 0
Questions 66

A company implements a data mesh that has a central governance account. The company needs to catalog all data in the governance account. The governance account uses AWS Lake Formation to centrally share data and grant access permissions.

The company has created a new data product that includes a group of Amazon Redshift Serverless tables. A data engineer needs to share the data product with a marketing team. The marketing team must have access to only a subset of columns. The data engineer needs to share the same data product with a compliance team. The compliance team must have access to a different subset of columns than the marketing team needs access to.

Which combination of steps should the data engineer take to meet these requirements? (Select TWO.)

Options:

A.  

Create views of the tables that need to be shared. Include only the required columns.

B.  

Create an Amazon Redshift data than that includes the tables that need to be shared.

C.  

Create an Amazon Redshift managed VPC endpoint in the marketing team's account. Grant the marketing team access to the views.

D.  

Share the Amazon Redshift data share to the Lake Formation catalog in the governance account.

E.  

Share the Amazon Redshift data share to the Amazon Redshift Serverless workgroup in the marketing team's account.

Discussion 0
Questions 67

A company is designing a serverless data processing workflow in AWS Step Functions that involves multiple steps. The processing workflow ingests data from an external API, transforms the data by using multiple AWS Lambda functions, and loads the transformed data into Amazon DynamoDB.

The company needs the workflow to perform specific steps based on the content of the incoming data.

Which Step Functions state type should the company use to meet this requirement?

Options:

A.  

Parallel

B.  

Choice

C.  

Task

D.  

Map

Discussion 0
Questions 68

A financial services company stores financial data in Amazon Redshift. A data engineer wants to run real-time queries on the financial data to support a web-based trading application. The data engineer wants to run the queries from within the trading application.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.  

Establish WebSocket connections to Amazon Redshift.

B.  

Use the Amazon Redshift Data API.

C.  

Set up Java Database Connectivity (JDBC) connections to Amazon Redshift.

D.  

Store frequently accessed data in Amazon S3. Use Amazon S3 Select to run the queries.

Discussion 0
Questions 69

A company processes 500 GB of audience and advertising data daily, storing CSV files in Amazon S3 with schemas registered in AWS Glue Data Catalog. They need to convert these files to Apache Parquet format and store them in an S3 bucket.

The solution requires a long-running workflow with 15 GiB memory capacity to process the data concurrently, followed by a correlation process that begins only after the first two processes complete.

Options:

A.  

Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the workflow by using AWS Glue. Configure AWS Glue to begin the third process after the first two processes have finished.

B.  

Use Amazon EMR to run each process in the workflow. Create an Amazon Simple Queue Service (Amazon SQS) queue to handle messages that indicate the completion of the first two processes. Configure an AWS Lambda function to process the SQS queue by running the third process.

C.  

Use AWS Glue workflows to run the first two processes in parallel. Ensure that the third process starts after the first two processes have finished.

D.  

Use AWS Step Functions to orchestrate a workflow that uses multiple AWS Lambda functions. Ensure that the third process starts after the first two processes have finished.

Discussion 0
Questions 70

A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class.

A data engineer examined data access patterns to identify trends. During the first 6 months, most data files are accessed several times each day. Between 6 months and 2 years, most data files are accessed once or twice each month. After 2 years, data files are accessed only once or twice each year.

The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new storage solution must continue to provide high availability.

Which solution will meet these requirements in the MOST cost-effective way?

Options:

A.  

Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.

B.  

Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.

C.  

Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.

D.  

Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.

Discussion 0
Questions 71

A company wants to migrate data from an Amazon RDS for PostgreSQL DB instance in the eu-east-1 Region of an AWS account named Account_A. The company will migrate the data to an Amazon Redshift cluster in the eu-west-1 Region of an AWS account named Account_B.

Which solution will give AWS Database Migration Service (AWS DMS) the ability to replicate data between two data stores?

Options:

A.  

Set up an AWS DMS replication instance in Account_B in eu-west-1.

B.  

Set up an AWS DMS replication instance in Account_B in eu-east-1.

C.  

Set up an AWS DMS replication instance in a new AWS account in eu-west-1

D.  

Set up an AWS DMS replication instance in Account_A in eu-east-1.

Discussion 0
Questions 72

A data engineer must ingest a source of structured data that is in .csv format into an Amazon S3 data lake. The .csv files contain 15 columns. Data analysts need to run Amazon Athena queries on one or two columns of the dataset. The data analysts rarely query the entire file.

Which solution will meet these requirements MOST cost-effectively?

Options:

A.  

Use an AWS Glue PySpark job to ingest the source data into the data lake in .csv format.

B.  

Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to ingest the data into the data lake in JSON format.

C.  

Use an AWS Glue PySpark job to ingest the source data into the data lake in Apache Avro format.

D.  

Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to write the data into the data lake in Apache Parquet format.

Discussion 0
Questions 73

A company has an application that uses a microservice architecture. The company hosts the application on an Amazon Elastic Kubernetes Services (Amazon EKS) cluster.

The company wants to set up a robust monitoring system for the application. The company needs to analyze the logs from the EKS cluster and the application. The company needs to correlate the cluster's logs with the application's traces to identify points of failure in the whole application request flow.

Which combination of steps will meet these requirements with the LEAST development effort? (Select TWO.)

Options:

A.  

Use FluentBit to collect logs. Use OpenTelemetry to collect traces.

B.  

Use Amazon CloudWatch to collect logs. Use Amazon Kinesis to collect traces.

C.  

Use Amazon CloudWatch to collect logs. Use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to collect traces.

D.  

Use Amazon OpenSearch to correlate the logs and traces.

E.  

Use AWS Glue to correlate the logs and traces.

Discussion 0
Questions 74

A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks.

The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team's BI cluster.

The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster.

Which solution will meet these requirements?

Options:

A.  

Set up the sales team Bl cluster as a consumer of the ETL cluster by using Redshift data sharing.

B.  

Create materialized views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.

C.  

Create database views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.

D.  

Unload a copy of the data from the ETL cluster to an Amazon S3 bucket every week. Create an Amazon Redshift Spectrum table based on the content of the ETL cluster.

Discussion 0
Questions 75

A data engineer needs to use AWS Step Functions to design an orchestration workflow. The workflow must parallel process a large collection of data files and apply a specific transformation to each file.

Which Step Functions state should the data engineer use to meet these requirements?

Options:

A.  

Parallel state

B.  

Choice state

C.  

Map state

D.  

Wait state

Discussion 0
Questions 76

A company that operates globally must follow regulations that require data from an AWS Region to be accessible only within that Region.

A data engineer is creating a data pipeline that will create resources in the Region where the data engineer works. The data pipeline should have access to data only from the Region where the data engineer works. The pipeline uses Active Directory as an identity and authentication system. The pipeline uses a custom identity broker application to verify that employees are signed in to Active Directory and to obtain temporary credentials by using the AssumeRole API operation.

Which solution will meet the locality requirements with the LEAST administrative effort?

Options:

A.  

Create an IAM role that has permissions to create resources. Create a policy for each Region that ensures users can create resources only in that Region. Pass the policy as the session policy when employees obtain the temporary credentials.

B.  

Create an IAM role for data engineers in each Region separately. Instruct each data engineer to obtain temporary credentials by assuming the appropriate Region-specific IAM role.

C.  

Create an IAM group for each Region. Include the required IAM policies for each IAM group. Add users to each IAM group so that when users log in by obtaining the temporary credentials, the users will receive the appropriate access based on the IAM group.

D.  

Create individual IAM policies that allow users to create resources in a specific Region. Assign the policies to each data engineer. Allow users to assume the individually assigned role when the users log in to AWS.

Discussion 0