
Databricks Certified Data Engineer Professional Exam Question and Answers

Databricks Certified Data Engineer Professional Exam

Last Update: Apr 30, 2024
Total Questions: 120

We are offering FREE Databricks-Certified-Professional-Data-Engineer Databricks exam questions. Just sign up and provide your details, prepare with the free Databricks-Certified-Professional-Data-Engineer exam questions, and then move on to the complete pool of Databricks Certified Data Engineer Professional Exam test questions.

Databricks-Certified-Professional-Data-Engineer PDF: $35 (regular price $99.99)

Databricks-Certified-Professional-Data-Engineer Testing Engine: $42 (regular price $119.99)

Databricks-Certified-Professional-Data-Engineer PDF + Testing Engine: $56 (regular price $159.99)
Questions 1

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 & longitude > -20

Which statement describes how data will be filtered?

Options:

A.  

Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.

B.  

No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.

C.  

The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.

D.  

Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.

E.  

The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.

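A minimal PySpark sketch of the scenario above, assuming the table is registered as user_posts (the name is illustrative). Because longitude is not the partition column, Delta Lake relies on the per-file min/max statistics recorded in the transaction log to decide which data files might contain matching records.

# Illustrative table name; the filter mirrors the question.
matching_posts = (
    spark.read.table("user_posts")
    .filter("longitude < 20 AND longitude > -20")
)
matching_posts.explain()  # the physical plan shows the pushed-down data filters
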
Questions 2

Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?

Options:

A.  

spark.sql.files.maxPartitionBytes

B.  

spark.sql.autoBroadcastJoinThreshold

C.  

spark.sql.files.openCostInBytes

D.  

spark.sql.adaptive.coalescePartitions.minPartitionNum

E.  

spark.sql.adaptive.advisoryPartitionSizeInBytes

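As a hedged illustration of the parameter in option A: spark.sql.files.maxPartitionBytes caps how many bytes of file data are packed into one Spark partition at read time (128 MB by default). The source path below is a placeholder.

# Sketch: lower the cap to 64 MB and observe the effect on the partition count.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))
df = spark.read.format("parquet").load("/mnt/source/events")  # hypothetical path
print(df.rdd.getNumPartitions())
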
Questions 3

A developer has successfully configured credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace.

How can they use Repos to pull changes from the remote Git repository and commit and push changes to a branch that appeared as changes were pulled?

Options:

A.  

Use Repos to merge all differences and make a pull request back to the remote repository.

B.  

Use Repos to merge all differences and make a pull request back to the remote repository.

C.  

Use Repos to create a new branch, commit all changes, and push changes to the remote Git repository.

D.  

Use Repos to create a fork of the remote repository, commit all changes, and make a pull request on the source repository.

Questions 4

A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

Options:

A.  

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

B.  

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

C.  

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

D.  

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

E.  

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

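The registration code referenced above is not reproduced here; the following is only a rough sketch of the kind of definition the question describes, with names taken from the question text and user_id assumed to be the only shared column. A standard (non-materialized) view persists the query logic rather than its results, so the join runs against the current versions of the source tables at query time.

# Sketch only -- not the exam exhibit; assumes user_id is the only shared column.
spark.sql("""
    CREATE OR REPLACE VIEW recent_orders AS
    SELECT * FROM orders JOIN users USING (user_id)
""")
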
Questions 5

A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.

Which statement describes the contents of the workspace audit logs concerning these events?

Options:

A.  

Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.

B.  

Because User B last configured the jobs, their identity will be associated with both the job creation events and the job run events.

C.  

Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.

D.  

Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.

E.  

Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.

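For context, a hedged sketch of the kind of REST call User B's orchestration tool might make; the workspace URL, token, and job ID are placeholders. Audit log entries for such calls are attributed to the identity that owns the personal access token used to authenticate.

import requests

# Placeholders throughout -- illustrative only.
resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/run-now",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"job_id": 123},
)
resp.raise_for_status()
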
Questions 6

Which statement describes Delta Lake Auto Compaction?

Options:

A.  

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

B.  

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

C.  

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

D.  

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

E.  

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.

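A short sketch of how Auto Compaction and its companion feature, optimized writes, can be enabled per table through Delta table properties; the table name is illustrative.

spark.sql("""
    ALTER TABLE sales SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
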
Questions 7

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

A.  

When a database is being created, make sure that the LOCATION keyword is used.

B.  

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

C.  

When data is saved to a table, make sure that a full file path is specified alongside the Delta format.

D.  

When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.

E.  

When the workspace is being configured, make sure that external cloud object storage has been mounted.

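A minimal sketch of creating an external (unmanaged) Delta table by writing to an explicit storage path; the table name, path, and stand-in data are illustrative.

# Writing to an explicit path makes the resulting table external (unmanaged).
df = spark.range(10)                                  # stand-in data
(df.write.format("delta")
   .option("path", "/mnt/external/sales")             # hypothetical mount path
   .saveAsTable("sales_external"))
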
Questions 8

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:

A.  

The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.

B.  

Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.

C.  

Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.

D.  

Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.

E.  

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

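A hedged sketch of declaring types explicitly instead of relying on inference; field names and the source path are illustrative. An explicit schema gives stronger data-quality assurance because an ill-typed value surfaces as an error rather than silently widening the inferred column type.

from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

device_schema = StructType([
    StructField("device_id", LongType(), nullable=False),
    StructField("recorded_at", TimestampType(), nullable=True),
    StructField("payload", StringType(), nullable=True),  # nested JSON kept as a string for later parsing
])

raw = spark.read.schema(device_schema).json("/mnt/raw/device_recordings")  # hypothetical path
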
Questions 9

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:

A.  

importlib.resource_path

B.  

sys.path

C.  

os.path

D.  

pypi.path

E.  

pylib.source

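For reference, sys.path is the list of directories Python searches when resolving imports; it can be inspected or extended at runtime. The appended directory below is illustrative.

import sys

print(sys.path)                       # directories searched for importable modules
sys.path.append("/dbfs/custom_libs")  # illustrative: add another search location
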
Questions 10

The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards has a total processing time of 10 minutes.

Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

Options:

A.  

Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.

B.  

Schedule a Structured Streaming job with a trigger interval of 60 minutes.

C.  

Schedule a job to execute the pipeline once an hour on a new job cluster.

D.  

Configure a job that executes every time new data lands in a given directory.

Questions 11

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

Options:

A.  

%sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.

B.  

Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.

C.  

%sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.

D.  

Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.

E.  

%sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

Questions 12

A data engineer needs to capture the settings of an existing pipeline in the workspace and use them to create and version a JSON file that will be used to create a new pipeline.

Which command should the data engineer enter in a web terminal configured with the Databricks CLI?

Options:

A.  

Use the get command to capture the settings for the existing pipeline; remove the pipeline_id and rename the pipeline; use this in a create command

B.  

Stop the existing pipeline; use the returned settings in a reset command

C.  

Use the clone command to create a copy of an existing pipeline; use the get JSON command to get the pipeline definition; save this to git

D.  

Use list pipelines to get the specs for all pipelines; get the pipeline spec from the returned results, parse it, and use it to create a pipeline

Questions 13

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

An analyst who is not a member of the auditing group executes the following query:

Which result will be returned by this query?

Options:

A.  

All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.

B.  

All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.

C.  

All age values less than 18 will be returned as null values; all other columns will be returned with the values in user_ltv.

D.  

All records from all columns will be displayed with the values in user_ltv.

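The view definition and query from the exhibit are not reproduced above. As a generic sketch of the mechanism involved (column names and redaction rules are assumptions, not the exam exhibit), a view can redact values or filter rows based on group membership using is_member().

# Generic sketch, not the exam exhibit.
spark.sql("""
    CREATE OR REPLACE VIEW user_ltv_no_pii AS
    SELECT
        CASE WHEN is_member('auditing') THEN email ELSE 'REDACTED' END AS email,
        age,
        ltv
    FROM user_ltv
    WHERE is_member('auditing') OR age > 17
""")
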
Questions 14

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

Which statement describes this implementation?

Options:

A.  

The customers table is implemented as a Type 3 table; old values are maintained as a new column alongside the current value.

B.  

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

C.  

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

D.  

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

E.  

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

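The MERGE logic from the exhibit is not shown above. The following is a simplified, generic sketch of the SCD Type 2 idea referenced by several options (column names are illustrative): existing rows are expired rather than overwritten, and new rows are inserted; a complete implementation would also stage changed records so that their new versions are inserted in the same pass.

# Simplified SCD Type 2 sketch, not the exam exhibit.
spark.sql("""
    MERGE INTO customers AS c
    USING updates AS u
    ON c.customer_id = u.customer_id AND c.is_current = true
    WHEN MATCHED AND c.address <> u.address THEN
      UPDATE SET is_current = false, end_date = u.effective_date
    WHEN NOT MATCHED THEN
      INSERT (customer_id, address, is_current, effective_date, end_date)
      VALUES (u.customer_id, u.address, true, u.effective_date, null)
""")
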
Questions 15

Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().

Which of the following statements is correct?

Options:

A.  

DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.

B.  

By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.

C.  

The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.

D.  

Neither the DBFS root nor mounted storage can be accessed when using %sh in a Databricks notebook.

E.  

The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.

Questions 16

What is the first line of a Databricks Python notebook when viewed in a text editor?

Options:

A.  

%python

B.  

# Databricks notebook source

C.  

-- Databricks notebook source

D.  

//Databricks notebook source

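For reference, the source of a Databricks Python notebook opened in a text editor looks roughly like the snippet below: the file begins with a comment marker, and cells are separated by COMMAND markers.

# Databricks notebook source
print("first cell")

# COMMAND ----------

print("second cell")
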
Questions 17

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

Options:

A.  

Regex

B.  

Julia

C.  

pyspark.ml.feature

D.  

Scala Datasets

E.  

C++

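A hedged sketch of using a regular expression to pull key areas out of driver log4j output; the log path shown is a typical location and is illustrative.

import re

error_pattern = re.compile(r"ERROR\s+([^\s:]+):?\s*(.*)")
errors = []
with open("/databricks/driver/logs/log4j-active.log") as log_file:  # illustrative path
    for line in log_file:
        match = error_pattern.search(line)
        if match:
            errors.append(match.groups())  # (logger name, message text)
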
Questions 18

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.

Which of the following likely explains these smaller file sizes?

Options:

A.  

Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations

B.  

Z-order indices calculated on the table are preventing file compaction

C.  

Bloom filter indices calculated on the table are preventing file compaction

D.  

Databricks has autotuned to a smaller target file size based on the overall size of data in the table

E.  

Databricks has autotuned to a smaller target file size based on the amount of data in each partition

Questions 19

The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database.

After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged).

Which statement describes what will happen when the above code is executed?

Options:

A.  

The connection to the external table will fail; the string "redacted" will be printed.

B.  

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.

C.  

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.

D.  

The connection to the external table will succeed; the string value of password will be printed in plain text.

E.  

The connection to the external table will succeed; the string "redacted" will be printed.

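A hedged sketch of the pattern the question describes; the scope and key names are illustrative. The value is retrieved from the secrets module, and printing it in a notebook shows a redacted placeholder rather than the secret itself.

password = dbutils.secrets.get(scope="jdbc", key="db_password")  # illustrative scope/key
print(password)  # notebook output shows "[REDACTED]", not the secret value
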
Questions 20

A data engineer wants to run unit tests using common Python testing frameworks on Python functions defined across several Databricks notebooks currently used in production.

How can the data engineer run unit tests against functions that work with data in production?

Options:

A.  

Run unit tests against non-production data that closely mirrors production

B.  

Define and unit test functions using Files in Repos

C.  

Define unit tests and functions within the same notebook

D.  

Define and import unit test functions from a separate Databricks notebook

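A rough sketch of the Files in Repos approach named in option B, under the assumption that shared logic lives in a plain .py module inside the repo; file and function names are illustrative.

# helpers/cleaning.py -- a plain Python file in the repo, importable from notebooks and tests
def strip_whitespace(value: str) -> str:
    return value.strip()

# tests/test_cleaning.py -- run with a standard framework such as pytest
from helpers.cleaning import strip_whitespace

def test_strip_whitespace():
    assert strip_whitespace("  abc ") == "abc"
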
Questions 21

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.

Which describes how Delta Lake can help to avoid data loss of this nature in the future?

Options:

A.  

The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.

B.  

Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.

C.  

Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.

D.  

Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.

E.  

Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.

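A hedged sketch of the bronze-layer pattern described in the last option; the broker, topic, checkpoint path, and table name are placeholders. Landing the complete Kafka payload plus its metadata in a Delta table preserves a replayable history beyond Kafka's retention window.

(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "posts")                        # placeholder topic
    .load()                                              # key, value, topic, partition, offset, timestamp
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/bronze_posts")
    .toTable("bronze_posts"))
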
Questions 22

The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.

The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.

Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?

Options:

A.  

model.predict(df, columns)

B.  

df.map(lambda x: model(x[columns])).select("customer_id", "predictions")

C.  

df.select("customer_id", model(*columns).alias("predictions"))

D.  

df.apply(model, columns).select("customer_id", "predictions")

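For context, a hedged sketch of how such a model is typically applied; the model URI and feature column names are illustrative. mlflow.pyfunc.spark_udf wraps the logged model as a Spark UDF that can be invoked on a list of columns.

import mlflow.pyfunc

model = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")  # illustrative URI
columns = ["feature_1", "feature_2", "feature_3"]                                   # illustrative features

predictions_df = df.select("customer_id", model(*columns).alias("predictions"))
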
Questions 23

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task.

Which statement explains what is preventing this privilege transfer?

Options:

A.  

Databricks jobs must have exactly one owner; "Owner" privileges cannot be assigned to a group.

B.  

The creator of a Databricks job will always have "Owner" privileges; this configuration cannot be changed.

C.  

Other than the default "admins" group, only individual users can be granted privileges on jobs.

D.  

A user can only transfer job ownership to a group if they are also a member of that group.

E.  

Only workspace administrators can grant "Owner" privileges to a group.

Questions 24

Which is a key benefit of an end-to-end test?

Options:

A.  

It closely simulates real world usage of your application.

B.  

It pinpoints errors in the building blocks of your application.

C.  

It provides testing coverage for all code paths and branches.

D.  

It makes it easier to automate your test suite

Questions 25

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States.

The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.

Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?

Options:

A.  

Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.

B.  

Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.

C.  

Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.

D.  

Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.

E.  

Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.

Questions 26

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.

Which statement describes a main benefit that offsets this additional effort?

Options:

A.  

Improves the quality of your data

B.  

Validates a complete use case of your application

C.  

Troubleshooting is easier since all steps are isolated and tested individually

D.  

Yields faster deployment and execution times

E.  

Ensures that all steps interact correctly to achieve the desired end result

Questions 27

Which statement describes Delta Lake optimized writes?

Options:

A.  

A shuffle occurs prior to writing to try to group data together resulting in fewer files instead of each executor writing multiple files based on directory partitions.

B.  

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

C.  

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.

D.  

Before a job cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.

Questions 28

Which statement regarding stream-static joins and static Delta tables is correct?

Options:

A.  

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.

B.  

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.

C.  

The checkpoint directory will be used to track state information for the unique keys present in the join.

D.  

Stream-static joins cannot use static Delta tables because of consistency issues.

E.  

The checkpoint directory will be used to track updates to the static Delta table.

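A minimal sketch of a stream-static join; table and column names are illustrative. In this pattern the static Delta table is re-read for each microbatch, so the join sees that table's latest available version at microbatch time.

orders_stream = spark.readStream.table("orders")   # streaming source
customers_static = spark.read.table("customers")   # static Delta table

enriched = orders_stream.join(customers_static, on="customer_id", how="left")
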
Questions 29

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows: Note that proposed changes are in bold.

Which step must also be completed to put the proposed query into production?

Options:

A.  

Increase the shuffle partitions to account for additional aggregates

B.  

Specify a new checkpointLocation

C.  

Run REFRESH TABLE delta.`/item_agg`

D.  

Remove .option('mergeSchema', 'true') from the streaming write

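A hedged sketch of the issue at stake; the DataFrame, table, and path names are illustrative. Adding a new aggregate changes the streaming query's state schema, so the revised query cannot resume from the old checkpoint and the write needs a fresh checkpoint location.

# item_agg_df is assumed to be the revised aggregated streaming DataFrame.
(item_agg_df.writeStream
    .outputMode("complete")
    .option("checkpointLocation", "/mnt/checkpoints/item_agg_v2")  # new location for the revised query
    .toTable("item_agg"))
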
Questions 30

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.

Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.

Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?

Options:

A.  

Both commands will succeed. Executing show tables will show that countries_af and sales_af have been registered as views.

B.  

Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af; if this entity exists, Cmd 2 will succeed.

C.  

Cmd 1 will succeed and Cmd 2 will fail; countries_af will be a Python variable representing a PySpark DataFrame.

D.  

Both commands will fail. No new variables, tables, or views will be created.

E.  

Cmd 1 will succeed and Cmd 2 will fail; countries_af will be a Python variable containing a list of strings.

Questions 31

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.

In which location can one review the timeline for cluster resizing events?

Options:

A.  

Workspace audit logs

B.  

Driver's log file

C.  

Ganglia

D.  

Cluster Event Log

E.  

Executor's log file

Questions 32

A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

Which kind of test does the above line exemplify?

Options:

A.  

Integration

B.  

Unit

C.  

Manual

D.  

functional

Questions 33

The data engineering team maintains a table of aggregate statistics through batch nightly updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows:

The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is:

store_id INT, sales_date DATE, total_sales FLOAT

If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?

Options:

A.  

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each update.

B.  

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.

C.  

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

D.  

Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

E.  

Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.

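A rough sketch of the batch overwrite approach from option A; the aggregate logic is a simplified stand-in, and a real implementation would cover all of the listed time windows. Because total_sales in the Type 1 daily_store_sales table can change after auditing, recomputing from the full source and overwriting the summary keeps the report consistent with the audited values.

from pyspark.sql import functions as F

daily = spark.read.table("daily_store_sales")
summary = daily.groupBy("store_id").agg(
    F.sum("total_sales").alias("total_sales_all_time")  # simplified stand-in aggregate
)
(summary.write.format("delta")
    .mode("overwrite")
    .saveAsTable("store_sales_summary"))
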
Questions 34

A Delta Lake table was created with the below query:

Consider the following query:

DROP TABLE prod.sales_by_store

If this statement is executed by a workspace admin, which result will occur?

Options:

A.  

Nothing will occur until a COMMIT command is executed.

B.  

The table will be removed from the catalog but the data will remain in storage.

C.  

The table will be removed from the catalog and the data will be deleted.

D.  

An error will occur because Delta Lake prevents the deletion of production data.

E.  

Data will be marked as deleted but still recoverable with Time Travel.

Questions 35

The data governance team has instituted a requirement that all tables containing Personally Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true.

The following SQL DDL statement is executed to create a new table:

Which command allows manual confirmation that these three requirements have been met?

Options:

A.  

DESCRIBE EXTENDED dev.pii_test

B.  

DESCRIBE DETAIL dev.pii_test

C.  

SHOW TBLPROPERTIES dev.pii_test

D.  

DESCRIBE HISTORY dev.pii_test

E.  

SHOW TABLES dev

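As a quick, hedged illustration using the table name from the options: DESCRIBE EXTENDED returns column-level comments along with the detailed table information, which includes the table comment and table properties such as contains_pii.

display(spark.sql("DESCRIBE EXTENDED dev.pii_test"))
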
Questions 36

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:

df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?

Options:

A.  

date = spark.conf.get("date")

B.  

input_dict = input()

date= input_dict["date"]

C.  

import sys

date = sys.argv[1]

D.  

date = dbutils.notebooks.getParam("date")

E.  

dbutils.widgets.text("date", "null")

date = dbutils.widgets.get("date")
