Databricks Certified Associate Developer for Apache Spark 3.5 – Python Question and Answers

Last Update Nov 30, 2025
Total Questions : 136


Question 1

4 of 55.

A developer is working on a Spark application that processes a large dataset using SQL queries. Despite having a large cluster, the developer notices that the job is underutilizing the available resources. Executors remain idle for most of the time, and logs reveal that the number of tasks per stage is very low. The developer suspects that this is causing suboptimal cluster performance.

Which action should the developer take to improve cluster utilization?

Options:

Increase the value of spark.sql.shuffle.partitions

Reduce the value of spark.sql.shuffle.partitions

Enable dynamic resource allocation to scale resources as needed

Increase the size of the dataset to create more partitions

Question 2

34 of 55.

A data engineer is investigating a Spark cluster that is experiencing underutilization during scheduled batch jobs.

After checking the Spark logs, they noticed that tasks are often getting killed due to timeout errors, and there are several warnings about insufficient resources in the logs.

Which action should the engineer take to resolve the underutilization issue?

Options:

Set the spark.network.timeout property to allow tasks more time to complete without being killed.

Increase the executor memory allocation in the Spark configuration.

Reduce the size of the data partitions to improve task scheduling.

Increase the number of executor instances to handle more concurrent tasks.

Question 3

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows with duplicates across all fields to ensure accurate financial reporting.

Which approach should the data scientist use to deduplicate the transactions using PySpark?

Options:

df = df.dropDuplicates()

df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))

df = df.filter(F.col("transaction_id").isNotNull())

df = df.dropDuplicates(["transaction_amount"])

Question 4

Given the schema:

event_ts TIMESTAMP,

sensor_id STRING,

metric_value LONG,

ingest_ts TIMESTAMP,

source_file_path STRING

The goal is to deduplicate based on event_ts, sensor_id, and metric_value. Which approach achieves this?

Options:

dropDuplicates on all columns (wrong criteria)

dropDuplicates with no arguments (removes based on all columns)

groupBy without aggregation (invalid use)

dropDuplicates on the exact matching fields
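To make the difference between the options concrete, here is a minimal plain-Python sketch of subset-based deduplication, the behaviour that a call such as dropDuplicates(["event_ts", "sensor_id", "metric_value"]) expresses. It is a toy model and not Spark's implementation: the helper name drop_duplicates and the sample rows are hypothetical, and unlike this sketch Spark does not promise which of several duplicate rows survives.

```python
def drop_duplicates(rows, subset=None):
    """Keep the first row seen for each distinct combination of key columns.

    With subset=None every column is part of the key, which mirrors
    calling dropDuplicates() with no arguments.
    """
    seen = set()
    kept = []
    for row in rows:
        keys = subset if subset is not None else sorted(row)
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical sample: the same reading ingested twice from different files.
readings = [
    {"event_ts": "2025-01-01 00:00:00", "sensor_id": "s1", "metric_value": 7,
     "ingest_ts": "2025-01-01 00:01:00", "source_file_path": "/a.json"},
    {"event_ts": "2025-01-01 00:00:00", "sensor_id": "s1", "metric_value": 7,
     "ingest_ts": "2025-01-01 00:05:00", "source_file_path": "/b.json"},
    {"event_ts": "2025-01-01 00:10:00", "sensor_id": "s2", "metric_value": 9,
     "ingest_ts": "2025-01-01 00:11:00", "source_file_path": "/a.json"},
]

# Keyed on every column, the two copies differ (ingest_ts, source_file_path).
all_columns = drop_duplicates(readings)
# Keyed on the three business columns, the two copies collapse into one row.
by_event = drop_duplicates(readings, ["event_ts", "sensor_id", "metric_value"])
print(len(all_columns), len(by_event))
```

In this sample the ingestion metadata differs between the two copies, so deduplicating on all columns leaves both in place, while deduplicating on the three named columns removes one.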

Question 5

40 of 55.

A developer wants to refactor older Spark code to take advantage of built-in functions introduced in Spark 3.5.

The original code:

from pyspark.sql import functions as F

min_price = 110.50

result_df = prices_df.filter(F.col("price") > min_price).agg(F.count("*"))

Which code block should the developer use to refactor the code?

Options:

result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))

result_df = prices_df.where(F.lit("price") > min_price).groupBy().count()

result_df = prices_df.withColumn("valid_price", when(col("price") > F.lit(min_price), True))

result_df = prices_df.filter(F.lit(min_price) > F.col("price")).count()

Question 6

A Spark application is experiencing performance issues in client mode because the driver is resource-constrained.

How should this issue be resolved?

Options:

Add more executor instances to the cluster

Increase the driver memory on the client machine

Switch the deployment mode to cluster mode

Switch the deployment mode to local mode

Question 7

54 of 55.

What is the benefit of Adaptive Query Execution (AQE)?

Options:

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

It automatically distributes tasks across nodes in the cluster and does not perform runtime adjustments to the query plan.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

Question 8

What is a feature of Spark Connect?

Options:

It supports DataStreamReader, DataStreamWriter, StreamingQuery, and Streaming APIs

It supports the DataFrame, Functions, Column, and SparkContext PySpark APIs

It supports only PySpark applications

It has built-in authentication

Question 9

What is the benefit of Adaptive Query Execution (AQE)?

Options:

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

It automatically distributes tasks across nodes in the cluster and does not perform runtime adjustments to the query plan.

Question 10

A data engineer is working on a Streaming DataFrame streaming_df with the given streaming data:

Which operation is supported with streaming_df?

Options:

streaming_df.select(countDistinct("Name"))

streaming_df.groupby("Id").count()

streaming_df.orderBy("timestamp").limit(4)

streaming_df.filter(col("count") < 30).show()

Question 11

What is the benefit of using Pandas on Spark for data transformations?

Options:

It is available only with Python, thereby reducing the learning curve.

It computes results immediately using eager execution, making it simple to use.

It runs on a single node only, utilizing the memory with memory-bound DataFrames and hence cost-efficient.

It executes queries faster using all the available cores in the cluster and provides Pandas's rich set of features.

Question 12

33 of 55.

The data engineering team created a pipeline that extracts data from a transaction system.

The transaction system stores timestamps in UTC, and the data engineers must now transform the transaction_datetime field to the “America/New_York” timezone for reporting.

Which code should be used to convert the timestamp to the target timezone?

Options:

raw.withColumn("transaction_datetime", from_utc_timestamp(col("transaction_datetime"), "America/New_York"))

raw.withColumn("transaction_datetime", to_utc_timestamp(col("transaction_datetime"), "America/New_York"))

raw.withColumn("transaction_datetime", date_format(col("transaction_datetime"), "America/New_York"))

raw.withColumn("transaction_datetime", convert_timezone(col("transaction_datetime"), "America/New_York"))

Question 13

49 of 55.

In the code block below, aggDF contains aggregations on a streaming DataFrame:

aggDF.writeStream \
    .format("console") \
    .outputMode("???") \
    .start()

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

Options:

AGGREGATE

COMPLETE

REPLACE

APPEND

Question 14

55 of 55.

An application architect has been investigating Spark Connect as a way to modernize existing Spark applications running in their organization.

Which requirement blocks the adoption of Spark Connect in this organization?

Options:

Debuggability: the ability to perform interactive debugging directly from the application code

Upgradability: the ability to upgrade the Spark applications independently from the Spark driver itself

Complete Spark API support: the ability to migrate all existing code to Spark Connect without modification, including the RDD APIs

Stability: isolation of application code and dependencies from each other and the Spark driver

Question 15

43 of 55.

An organization has been running a Spark application in production and is considering disabling the Spark History Server to reduce resource usage.

What will be the impact of disabling the Spark History Server in production?

Options:

Prevention of driver log accumulation during long-running jobs

Improved job execution speed due to reduced logging overhead

Loss of access to past job logs and reduced debugging capability for completed jobs

Enhanced executor performance due to reduced log size

Question 16

41 of 55.

A data engineer is working on the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.

The DataFrame has columns:

id | Name | count | timestamp
---------------------------------
1 | USA | 10
2 | India | 20
3 | England | 50
4 | India | 50
5 | France | 20
6 | India | 10
7 | USA | 30
8 | USA | 40

Which code fragment should the engineer use to sort the data in the Name and count columns?

Options:

df1.orderBy(col("count").desc(), col("Name").asc())

df1.sort("Name", "count")

df1.orderBy("Name", "count")

df1.orderBy(col("Name").desc(), col("count").asc())
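As a rough illustration of the ordering the question describes (count descending, with Name ascending as a tie-breaker), the following plain-Python sketch sorts the eight rows from the table above. It only models the ordering that an expression such as orderBy(col("count").desc(), col("Name").asc()) requests and is not PySpark code; the timestamp column is omitted because the table shows no values for it.

```python
# Rows copied from the table in the question: (id, Name, count).
rows = [
    (1, "USA", 10), (2, "India", 20), (3, "England", 50), (4, "India", 50),
    (5, "France", 20), (6, "India", 10), (7, "USA", 30), (8, "USA", 40),
]

# Highest count first; ties are broken by Name in ascending order.
ordered = sorted(rows, key=lambda r: (-r[2], r[1]))
names_and_counts = [(name, count) for _, name, count in ordered]
print(names_and_counts)
```

Sorting by Name first, as some of the options do, would instead group the rows alphabetically and only use count within each Name.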

Question 17

A data scientist is working with a Spark DataFrame called customerDF that contains customer information. The DataFrame has a column named email with customer email addresses. The data scientist needs to split this column into username and domain parts.

Which code snippet splits the email column into username and domain columns?

Options:

customerDF.select(
    col("email").substr(0, 5).alias("username"),
    col("email").substr(-5).alias("domain")
)

customerDF.withColumn("username", split(col("email"), "@").getItem(0)) \
    .withColumn("domain", split(col("email"), "@").getItem(1))

customerDF.withColumn("username", substring_index(col("email"), "@", 1)) \
    .withColumn("domain", substring_index(col("email"), "@", -1))

customerDF.select(
    regexp_replace(col("email"), "@", "").alias("username"),
    regexp_replace(col("email"), "@", "").alias("domain")
)
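For readers less familiar with the string functions in these options, here is a small plain-Python sketch of what splitting on "@" and taking a delimiter-based substring each produce for one address. The function names and the sample address are hypothetical, the code is an analogy rather than PySpark, and it does not take a position on which option the exam expects.

```python
def split_email(email):
    """Analogy for split(col("email"), "@") followed by getItem(0) and getItem(1)."""
    parts = email.split("@")
    return parts[0], parts[1]


def substring_index_pair(email):
    """Analogy for substring_index(..., "@", 1) and substring_index(..., "@", -1)."""
    before_first = email.split("@", 1)[0]   # everything before the first "@"
    after_last = email.rsplit("@", 1)[-1]   # everything after the last "@"
    return before_first, after_last


sample = "jane.doe@example.com"  # hypothetical address with a single "@"
via_split = split_email(sample)
via_substring_index = substring_index_pair(sample)
print(via_split, via_substring_index)
```

By contrast, a fixed-width substr or a regexp_replace that merely deletes "@" cannot separate the two parts for addresses of varying length.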

Question 18

25 of 55.

A Data Analyst is working on employees_df and needs to add a new column where a 10% tax is calculated on the salary.

Additionally, the DataFrame contains the column age, which is not needed.

Which code fragment adds the tax column and removes the age column?

Options:

employees_df = employees_df.withColumn("tax", col("salary") * 0.1).drop("age")

employees_df = employees_df.withColumn("tax", lit(0.1)).drop("age")

employees_df = employees_df.dropField("age").withColumn("tax", col("salary") * 0.1)

employees_df = employees_df.withColumn("tax", col("salary") + 0.1).drop("age")

Question 19

What is the risk associated with converting a large Pandas API on Spark DataFrame back to a Pandas DataFrame?

Options:

The conversion will automatically distribute the data across worker nodes

The operation will fail if the Pandas DataFrame exceeds 1000 rows

Data will be lost during conversion

The operation will load all data into the driver's memory, potentially causing memory overflow

Question 20

Given the code:

df = spark.read.csv("large_dataset.csv")

filtered_df = df.filter(col("error_column").contains("error"))

mapped_df = filtered_df.select(split(col("timestamp"), " ").getItem(0).alias("date"), lit(1).alias("count"))

reduced_df = mapped_df.groupBy("date").sum("count")

reduced_df.count()

reduced_df.show()

At which point will Spark actually begin processing the data?

Options:

When the filter transformation is applied

When the count action is applied

When the groupBy transformation is applied

When the show action is applied
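The distinction between transformations and actions in this question can be sketched with a toy class in plain Python. This is only an analogy under stated assumptions: LazyFrame, its methods, and the sample log lines are hypothetical, and the model ignores Spark details such as planning, optimization, and any work done when a file is first read.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations only record a plan."""

    runs = 0  # how many times any plan has actually been executed

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan

    def filter(self, predicate):
        # Transformation: returns a new object with a longer plan, no work done.
        return LazyFrame(self._source, self._plan + (("filter", predicate),))

    def select(self, func):
        # Transformation: likewise only extends the recorded plan.
        return LazyFrame(self._source, self._plan + (("select", func),))

    def _run(self):
        LazyFrame.runs += 1
        rows = list(self._source)
        for kind, fn in self._plan:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

    def count(self):
        # Action: the first point at which the recorded plan is executed.
        return len(self._run())


logs = ["2025-01-01 error disk", "2025-01-01 ok", "2025-01-02 error net"]
filtered = LazyFrame(logs).filter(lambda line: "error" in line)
mapped = filtered.select(lambda line: line.split(" ")[0])
runs_before_action = LazyFrame.runs
n = mapped.count()
runs_after_action = LazyFrame.runs
print(runs_before_action, n, runs_after_action)
```

In this model nothing is processed while filter and select are chained; the counter only moves when count is called, and a later action would run the plan again.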

Question 21

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior?

Choose 2 answers:

Options:

The Spark engine requires manual intervention to start executing transformations.

Only actions trigger the execution of the transformation pipeline.

Transformations are executed immediately to build the lineage graph.

The Spark engine optimizes the execution plan during the transformations, causing delays.

Transformations are evaluated lazily.

Question 22

42 of 55.

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

Options:

Replace .bucketBy() with .partitionBy("event_year", "event_month")

Change the bucket count (42) to a lower number

Add .sortBy() after .bucketBy()

Replace .bucketBy() with .partitionBy("event_year") only

Question 23

16 of 55.

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior? (Choose 2 answers)

Options:

Transformations are executed immediately to build the lineage graph.

The Spark engine optimizes the execution plan during the transformations, causing delays.

Transformations are evaluated lazily.

The Spark engine requires manual intervention to start executing transformations.

Only actions trigger the execution of the transformation pipeline.

Question 24

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.

Which code fragment meets the requirements?

Options:

regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
)

regions = dict(
    regions_df
    .select('region_id', 'region')
    .sort('region_id')
    .take(3)
)

regions = dict(
    regions_df
    .select('region_id', 'region')
    .limit(3)
    .collect()
)

regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort(desc('region_id'))
    .take(3)
)
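The options above all pass a list of rows to dict(). Since a PySpark Row is a tuple subclass, a two-field Row behaves like a (key, value) pair, so column order decides which field becomes the key. The plain-Python sketch below imitates this with a namedtuple; the Row stand-in and the region data are hypothetical, because the table from the question is not reproduced in the source.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row with the fields in (key, value) order.
Row = namedtuple("Row", ["region", "region_id"])

# Hypothetical contents of the small regions table.
regions_rows = [Row("EMEA", 3), Row("APAC", 1), Row("LATAM", 4), Row("NA", 2)]

# Sort ascending by region_id and keep the first three rows, as sort(...).take(3) would.
smallest_three = sorted(regions_rows, key=lambda r: r.region_id)[:3]

# dict() treats each two-field row as (key, value): region -> region_id.
regions = dict(smallest_three)
print(regions)
```

Selecting the columns in the opposite order would produce a region_id -> region mapping instead, and skipping the sort would return three arbitrary rows rather than the smallest three.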

Question 25

45 of 55.

Which feature of Spark Connect should be considered when designing an application that plans to enable remote interaction with a Spark cluster?

Options:

It is primarily used for data ingestion into Spark from external sources.

It provides a way to run Spark applications remotely in any programming language.

It can be used to interact with any remote cluster using the REST API.

It allows for remote execution of Spark jobs.

Question 26

A data engineer is running a batch processing job on a Spark cluster with the following configuration:

10 worker nodes

16 CPU cores per worker node

64 GB RAM per node

The data engineer wants to allocate four executors per node, each executor using four cores.

What is the total number of CPU cores used by the application?

Options:

160
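The arithmetic behind this figure can be written out as a short check. The variable names are illustrative; the numbers come directly from the question.

```python
worker_nodes = 10
cores_per_node = 16
executors_per_node = 4
cores_per_executor = 4

# Each node must be able to host its executors' cores: 4 * 4 = 16 <= 16.
fits_on_node = executors_per_node * cores_per_executor <= cores_per_node

total_executors = worker_nodes * executors_per_node
total_cores = total_executors * cores_per_executor
print(fits_on_node, total_executors, total_cores)
```

This layout uses every core on every worker, so no cores are left over on the nodes for anything else.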

Question 27

30 of 55.

A data engineer is working on a num_df DataFrame and has a Python UDF defined as:

def cube_func(val):
    return val * val * val

Which code fragment registers and uses this UDF as a Spark SQL function to work with the DataFrame num_df?

Options:

spark.udf.register("cube_func", cube_func)
num_df.selectExpr("cube_func(num)").show()

num_df.select(cube_func("num")).show()

spark.createDataFrame(cube_func("num")).show()

num_df.register("cube_func").select("num").show()

Question 28

A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

Options:

Optimize the data processing logic by repartitioning the DataFrame.

Modify the Spark configuration to disable garbage collection

Increase the memory allocated to the Spark Driver.

Cache large DataFrames to persist them in memory.

Question 29

A developer wants to refactor some older Spark code to leverage built-in functions introduced in Spark 3.5.0. The existing code performs array manipulations manually. Which of the following code snippets utilizes new built-in functions in Spark 3.5.0 for array operations?

Options:

result_df = prices_df \
    .withColumn("valid_price", F.when(F.col("spot_price") > F.lit(min_price), 1).otherwise(0))

result_df = prices_df \
    .agg(F.count_if(F.col("spot_price") >= F.lit(min_price)))

result_df = prices_df \
    .agg(F.min("spot_price"), F.max("spot_price"))

result_df = prices_df \
    .agg(F.count("spot_price").alias("spot_price")) \
    .filter(F.col("spot_price") > F.lit("min_price"))

Question 30

A data engineer uses a broadcast variable to share a DataFrame containing millions of rows across executors for lookup purposes. What will be the outcome?

Options:

The job may fail if the memory on each executor is not large enough to accommodate the DataFrame being broadcasted

The job may fail if the executors do not have enough CPU cores to process the broadcasted dataset

The job will hang indefinitely as Spark will struggle to distribute and serialize such a large broadcast variable to all executors

The job may fail because the driver does not have enough CPU cores to serialize the large DataFrame

Question 31

44 of 55.

A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming.

They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.

Which code snippet fulfills this requirement?

Options:

query = df.writeStream \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start()

query = df.writeStream \
    .outputMode("append") \
    .trigger(continuous="5 seconds") \
    .start()

query = df.writeStream \
    .outputMode("append") \
    .trigger(once=True) \
    .start()

query = df.writeStream \
    .outputMode("append") \
    .start()

Question 32

A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.

How can this be achieved?

Options:

By configuring the option checkpointLocation during readStream

By configuring the option recoveryLocation during the SparkSession initialization

By configuring the option recoveryLocation during writeStream

By configuring the option checkpointLocation during writeStream

Question 33

An engineer has two DataFrames: df1 (small) and df2 (large). A broadcast join is used:

from pyspark.sql.functions import broadcast

result = df2.join(broadcast(df1), on='id', how='inner')

What is the purpose of using broadcast() in this scenario?

Options:

It filters the id values before performing the join.

It increases the partition size for df1 and df2.

It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.

It ensures that the join happens only when the id values are identical.

Question 34

A developer wants to test Spark Connect with an existing Spark application.

What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

Options:

Execute their pyspark shell with the option --remote "https://localhost"

Execute their pyspark shell with the option --remote "sc://localhost"

Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell

Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code

Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code

Question 35

Given this code:

.withWatermark("event_time", "10 minutes")
.groupBy(window("event_time", "15 minutes"))
.count()

What happens to data that arrives after the watermark threshold?

Options:

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

The watermark ensures that late data arriving within 10 minutes of the latest event_time will be processed and included in the windowed aggregation.
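To reason about these options, it can help to compute a watermark by hand. The plain-Python sketch below assumes 15-minute tumbling windows aligned to the epoch and uses a simplified rule: the watermark is the latest event time seen minus 10 minutes, and a record is still aggregated while the end of its window lies after the watermark. The function names and timestamps are hypothetical, and the real engine advances the watermark between micro-batches, so boundary behaviour may differ from this model.

```python
from datetime import datetime, timedelta

DELAY = timedelta(minutes=10)    # from withWatermark("event_time", "10 minutes")
WINDOW = timedelta(minutes=15)   # from window("event_time", "15 minutes")
EPOCH = datetime(1970, 1, 1)


def window_end(event_time):
    """End of the 15-minute tumbling window that contains event_time."""
    offset = (event_time - EPOCH) % WINDOW
    return event_time - offset + WINDOW


def is_accepted(event_time, max_event_time_seen):
    """Simplified rule: keep the record while its window end is after the watermark."""
    current_watermark = max_event_time_seen - DELAY
    return window_end(event_time) > current_watermark


latest_seen = datetime(2025, 1, 1, 12, 30)         # watermark is therefore 12:20
slightly_late = datetime(2025, 1, 1, 12, 17)       # window 12:15 to 12:30
very_late = datetime(2025, 1, 1, 12, 5)            # window 12:00 to 12:15

accepted_slightly_late = is_accepted(slightly_late, latest_seen)
accepted_very_late = is_accepted(very_late, latest_seen)
print(accepted_slightly_late, accepted_very_late)
```

In this model the record that is 13 minutes old still lands in an open window, while the one that is 25 minutes old belongs to a window the watermark has already passed.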

Question 36

A data analyst wants to add a column named date derived from a timestamp column. Which code snippet should the analyst use?

Options:

dates_df.withColumn("date", f.unix_timestamp("timestamp")).show()

dates_df.withColumn("date", f.to_date("timestamp")).show()

dates_df.withColumn("date", f.date_format("timestamp", "yyyy-MM-dd")).show()

dates_df.withColumn("date", f.from_unixtime("timestamp")).show()

Question 37

2 of 55.

Which command overwrites an existing JSON file when writing a DataFrame?

Options:

df.write.json("path/to/file")

df.write.mode("append").json("path/to/file")

df.write.option("overwrite").json("path/to/file")

df.write.mode("overwrite").json("path/to/file")

Question 38

37 of 55.

A data scientist is working with a Spark DataFrame called customerDF that contains customer information.

The DataFrame has a column named email with customer email addresses.

The data scientist needs to split this column into username and domain parts.

Which code snippet splits the email column into username and domain columns?

Options:

customerDF = customerDF \
    .withColumn("username", split(col("email"), "@").getItem(0)) \
    .withColumn("domain", split(col("email"), "@").getItem(1))

customerDF = customerDF.withColumn("username", regexp_replace(col("email"), "@", ""))

customerDF = customerDF.select("email").alias("username", "domain")

customerDF = customerDF.withColumn("domain", col("email").split("@")[1])

Question 39

9 of 55.

Given the code fragment:

import pyspark.pandas as ps

pdf = ps.DataFrame(data)

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

Options:

pdf.to_pandas()

pdf.to_spark()

pdf.to_dataframe()

pdf.spark()

Question 40

48 of 55.

A data engineer needs to join multiple DataFrames and has written the following code:

from pyspark.sql.functions import broadcast

data1 = [(1, "A"), (2, "B")]
data2 = [(1, "X"), (2, "Y")]
data3 = [(1, "M"), (2, "N")]

df1 = spark.createDataFrame(data1, ["id", "val1"])
df2 = spark.createDataFrame(data2, ["id", "val2"])
df3 = spark.createDataFrame(data3, ["id", "val3"])

df_joined = df1.join(broadcast(df2), "id", "inner") \
    .join(broadcast(df3), "id", "inner")

What will be the output of this code?

Options:

The code will work correctly and perform two broadcast joins simultaneously to join df1 with df2, and then the result with df3.

The code will fail because only one broadcast join can be performed at a time.

The code will fail because the second join condition (df2.id == df3.id) is incorrect.

The code will result in an error because broadcast() must be called before the joins, not inline.
