Databricks Certified Machine Learning Associate Exam Question and Answers

Last Update Dec 2, 2024
Total Questions : 74

Questions 1

A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.

Which of the following describes why?

Options:

A.  

Gradient boosting is not a linear algebra-based algorithm which is required for parallelization

B.  

Gradient boosting requires access to all data at once which cannot happen during parallelization.

C.  

Gradient boosting calculates gradients in evaluation metrics using all cores which prevents parallelization.

D.  

Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step.
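
As a point of reference for this question, here is a minimal sketch of a boosting loop for squared error. It is a toy, not a real tree learner: each "tree" is assumed to be a single constant (the mean residual), and the names fit_boosted_constants, n_rounds, and learning_rate are hypothetical. It shows how each round consumes the predictions left by the round before it.

```python
# Toy gradient boosting for squared error, using only the standard library.
# Each "tree" here is just a constant (the mean residual), which keeps the
# sketch short; a real boosted model fits a decision tree at this step.

def fit_boosted_constants(y, n_rounds=3, learning_rate=0.5):
    predictions = [0.0] * len(y)
    steps = []
    for _ in range(n_rounds):
        # Residuals are computed from the predictions of all earlier rounds,
        # so round k cannot start until round k - 1 has finished.
        residuals = [target - pred for target, pred in zip(y, predictions)]
        step = learning_rate * sum(residuals) / len(residuals)
        steps.append(step)
        predictions = [pred + step for pred in predictions]
    return steps, predictions


steps, predictions = fit_boosted_constants([2.0, 4.0, 6.0])
print(steps)        # [2.0, 1.0, 0.5]
print(predictions)  # [3.5, 3.5, 3.5]
```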

Questions 2

In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?

Options:

A.  

When the features are of the categorical type

B.  

When the features are of the boolean type

C.  

When the features contain a lot of extreme outliers

D.  

When the features contain no outliers

E.  

When the features contain no missing values
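
To make the mean-versus-median comparison concrete, the following sketch imputes one missing value in a small made-up list that contains a single extreme value. The data and the impute helper are hypothetical and use only the standard library.

```python
from statistics import mean, median


def impute(values, strategy):
    # Fill None entries with the mean or the median of the observed values.
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]


# Made-up feature values with one missing entry and one extreme outlier.
prices = [30, 32, None, 36, 38, 40, 5000]

print(impute(prices, "median")[2])  # 37.0, close to the typical values
print(impute(prices, "mean")[2])    # pulled far upward by the outlier
```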

Questions 3

A machine learning engineer is trying to scale a machine learning pipeline pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block:

A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model.

Which of the following is a negative consequence of the approach suggested by the colleague?

Options:

A.  

The model will take longer to train for each unique combination of hyperparameter values

B.  

The feature engineering stages will be computed using validation data

C.  

The cross-validation process will no longer be

D.  

The cross-validation process will no longer be reproducible

E.  

The model will be refit one more time per cross-validation fold

Questions 4

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

A.  

Logistic regression

B.  

Spark ML cannot distribute linear regression training

C.  

Iterative optimization

D.  

Least-squares method

E.  

Singular value decomposition

Questions 5

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model in parallel. They elect to use the Hyperopt library to facilitate this process.

Which of the following Hyperopt tools provides the ability to optimize hyperparameters in parallel?

Options:

A.  

fmin

B.  

SparkTrials

C.  

quniform

D.  

search_space

E.  

objective_function
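
For orientation, this is a rough sketch of how the Hyperopt pieces named in the options usually fit together. The quadratic objective stands in for training and scoring a scikit-learn model, and tune_in_parallel, max_evals, and parallelism are hypothetical names and values. Hyperopt is imported inside the function, so it only needs to be installed (with a Spark cluster available) when tune_in_parallel is actually called.

```python
def objective(params):
    # Hypothetical loss; in practice this would train and score a model.
    return (params["x"] - 3.0) ** 2


def tune_in_parallel(max_evals=32, parallelism=4):
    # Lazy import: requires hyperopt and a running Spark session.
    from hyperopt import SparkTrials, fmin, hp, tpe

    search_space = {"x": hp.uniform("x", -10, 10)}
    # fmin drives the search; SparkTrials distributes the trial
    # evaluations across Spark workers.
    return fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=SparkTrials(parallelism=parallelism),
    )
```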

Questions 6

Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?

Options:

A.  

The vectorized pandas UDFs allow for the use of type hints

B.  

The vectorized pandas UDFs process data in batches rather than one row at a time

C.  

The vectorized pandas UDFs allow for pandas API use inside of the function

D.  

The vectorized pandas UDFs work on distributed DataFrames

E.  

The vectorized pandas UDFs process data in memory rather than spilling to disk

Questions 7

A machine learning engineer has identified the best run from an MLflow Experiment. They have stored the run ID in the run_id variable and identified the logged model name as "model". They now want to register that model in the MLflow Model Registry with the name "best_model".

Which lines of code can they use to register the model associated with run_id to the MLflow Model Registry?

Options:

A.  

mlflow.register_model(run_id, "best_model")

B.  

mlflow.register_model(f"runs:/{run_id}/model", "best_model")

C.  

mlflow.register_model(f"runs:/{run_id}/model")

D.  

mlflow.register_model(f"runs:/{run_id}/best_model", "model")
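
As background for the options above, the sketch below shows how a runs:/ model URI is assembled from a run ID and the artifact path the model was logged under, and how it would be passed to mlflow.register_model together with a registry name. The helper names and the run ID "abc123" are hypothetical. MLflow is imported lazily, so the URI helper runs without it installed.

```python
def build_run_model_uri(run_id, artifact_path="model"):
    # "runs:/<run_id>/<artifact_path>" points at a model logged under a run.
    return f"runs:/{run_id}/{artifact_path}"


def register_logged_model(run_id, registry_name="best_model"):
    # Lazy import: requires mlflow and a reachable tracking server.
    import mlflow

    # First argument: where the logged model lives.
    # Second argument: the name to register it under.
    return mlflow.register_model(build_run_model_uri(run_id), registry_name)


print(build_run_model_uri("abc123"))  # runs:/abc123/model
```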

Questions 8

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

Options:

A.  

One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

B.  

One-hot encoding is dependent on the target variable’s values which differ for each application.

C.  

One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

D.  

One-hot encoding is not a common strategy for representing categorical feature variables numerically.

Questions 9

A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.

Which of the following possible explanations for this difference is invalid?

Options:

A.  

The second model is much more accurate than the first model

B.  

The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE

C.  

The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE

D.  

The first model is much more accurate than the second model

E.  

The RMSE is an invalid evaluation metric for regression problems
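
The scale issue behind this question can be checked numerically. The sketch below uses made-up prices and a hypothetical model that predicts log(price) perfectly, then computes RMSE against the actual prices with and without exponentiating the predictions first.

```python
import math


def rmse(predictions, actuals):
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(actuals))


actual_prices = [100.0, 200.0, 400.0]  # made-up values

# A hypothetical model that predicts log(price) perfectly.
log_scale_predictions = [math.log(p) for p in actual_prices]

# Comparing log-scale predictions to price-scale labels mixes two scales.
mismatched = rmse(log_scale_predictions, actual_prices)

# Exponentiating first puts predictions and labels on the same scale.
matched = rmse([math.exp(p) for p in log_scale_predictions], actual_prices)

print(mismatched)  # large, even though the model is perfect
print(matched)     # effectively zero
```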

Questions 10

A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.

Which of the following approaches can the data scientist use to accomplish this MLflow run organization?

Options:

A.  

They can turn on Databricks Autologging

B.  

They can specify nested=True when starting the child run for each unique combination of hyperparameter values

C.  

They can start each child run inside the parent run's indented code block using mlflow.start_run()

D.  

They can start each child run with the same experiment ID as the parent run

E.  

They can specify nested=True when starting the parent run for the tuning process
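
For readers unfamiliar with the parent and child run layout described in this question, here is one possible shape for it. The helpers expand_grid and run_tuning, the run name "tuning", and the metric name "score" are all hypothetical. MLflow is imported inside run_tuning, so the grid helper can be tried without it.

```python
from itertools import product


def expand_grid(grid):
    # Turn {"a": [1, 2], "b": [3]} into [{"a": 1, "b": 3}, {"a": 2, "b": 3}].
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def run_tuning(grid, train_and_score):
    # Lazy import: requires mlflow and a tracking location.
    import mlflow

    with mlflow.start_run(run_name="tuning"):        # parent run
        for params in expand_grid(grid):
            with mlflow.start_run(nested=True):      # one child run per combination
                mlflow.log_params(params)
                mlflow.log_metric("score", train_and_score(params))


print(expand_grid({"max_depth": [2, 4], "n_estimators": [10]}))
```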

Questions 11

A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable.

They have developed this code block to accomplish this task:

The code block is returning an error.

Which of the following adjustments does the data scientist need to make to accomplish this task?

Options:

A.  

They need to specify the method parameter to the OneHotEncoder.

B.  

They need to remove the line with the fit operation.

C.  

They need to use StringIndexer prior to one-hot encoding the features.

D.  

They need to use VectorAssembler prior to one-hot encoding the features.
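
The code block this question refers to is not reproduced here, so the following is not a reconstruction of it. It is a separate sketch, assuming Spark 3.0 or later, of one common way to one-hot encode string columns: index the strings numerically first, then encode the indices. The helper names and the _index and _ohe suffixes are hypothetical. PySpark is imported inside the function and is only needed when it is called.

```python
def indexed_and_encoded_names(input_columns):
    # Hypothetical naming scheme for the intermediate and final columns.
    index_cols = [f"{c}_index" for c in input_columns]
    ohe_cols = [f"{c}_ohe" for c in input_columns]
    return index_cols, ohe_cols


def one_hot_encode(features_df, input_columns):
    # Lazy import: requires pyspark (3.0+ for the multi-column parameters).
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer

    index_cols, ohe_cols = indexed_and_encoded_names(input_columns)
    # OneHotEncoder works on numeric category indices, not raw strings.
    indexer = StringIndexer(inputCols=input_columns, outputCols=index_cols)
    encoder = OneHotEncoder(inputCols=index_cols, outputCols=ohe_cols)
    pipeline_model = Pipeline(stages=[indexer, encoder]).fit(features_df)
    return pipeline_model.transform(features_df)
```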

Questions 12

A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:

• 10.0

• 12.0

• 17.0

Which of the following values represents the overall cross-validation root-mean-squared error?

Options:

A.  

13.0

B.  

17.0

C.  

12.0

D.  

39.0

E.  

10.0
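
The arithmetic can be checked in a couple of lines, assuming the overall score is the unweighted mean of the per-fold scores (Spark ML's CrossValidator, for example, averages fold metrics this way).

```python
from statistics import mean

# Per-fold validation RMSE values from the question.
fold_rmse = [10.0, 12.0, 17.0]

# Assumption: the overall score is the plain average across folds.
cv_rmse = mean(fold_rmse)
print(cv_rmse)  # 13.0
```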

Questions 13

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

Options:

A.  

One-hot encoding is not supported by most machine learning libraries.

B.  

One-hot encoding is dependent on the target variable's values which differ for each application.

C.  

One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

D.  

One-hot encoding is not a common strategy for representing categorical feature variables numerically.

E.  

One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

Questions 14

A team is developing guidelines on when to use various evaluation metrics for classification problems. The team needs to provide input on when to use the F1 score over accuracy.

Which of the following suggestions should the team include in their guidelines?

Options:

A.  

The F1 score should be utilized over accuracy when the number of actual positive cases is identical to the number of actual negative cases.

B.  

The F1 score should be utilized over accuracy when there are greater than two classes in the target variable.

C.  

The F1 score should be utilized over accuracy when there is significant imbalance between positive and negative classes and avoiding false negatives is a priority.

D.  

The F1 score should be utilized over accuracy when identifying true positives and true negatives are equally important to the business problem.
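
To see how the two metrics can diverge, the sketch below scores a hypothetical classifier that predicts the negative class for every row of a made-up dataset with 5 positives and 95 negatives. The accuracy_and_f1 helper is written out by hand so that no library is required.

```python
def accuracy_and_f1(actual, predicted):
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 here when the denominator is 0.
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Made-up imbalanced labels: 5 positives, 95 negatives.
actual = [1] * 5 + [0] * 95
# A classifier that always predicts the majority (negative) class.
predicted = [0] * 100

accuracy, f1 = accuracy_and_f1(actual, predicted)
print(accuracy)  # 0.95
print(f1)        # 0.0, because every positive case was missed
```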

Questions 15

A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.

They have written the following incomplete code block:

Which of the following pieces of code can be used to fill in the above blank to complete the task?

Options:

A.  

applyInPandas

B.  

mapInPandas

C.  

predict

D.  

train_model

E.  

groupedApplyIn
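
The incomplete code block from the question is not reproduced here. As a separate illustration, this sketch shows the general grouped pattern in the Pandas Function API, where a function receiving one pandas DataFrame per group is applied to a grouped PySpark DataFrame. The wrapper name train_models_per_group and the parameters group_col and result_schema are hypothetical. df is assumed to be a PySpark DataFrame, and train_model is assumed to take and return a pandas DataFrame matching result_schema.

```python
def train_models_per_group(df, train_model, group_col, result_schema):
    # df: a PySpark DataFrame (assumed).
    # train_model: function taking one group's rows as a pandas DataFrame
    #              and returning a pandas DataFrame matching result_schema.
    # Each group is handed to train_model separately, so groups can be
    # processed in parallel across the cluster.
    return df.groupBy(group_col).applyInPandas(train_model, schema=result_schema)
```

A call might look like train_models_per_group(df, train_model, "device_id", "device_id string, mse double"), where the column names and schema string are made up for illustration.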

Questions 16

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.

Which of the following code blocks will accomplish this task?

Options:

A.  

spark_df[spark_df["price"] > 0]

B.  

spark_df.filter(col("price") > 0)

C.  

SELECT * FROM spark_df WHERE price > 0

D.  

spark_df.loc[spark_df["price"] > 0,:]

E.  

spark_df.loc[:,spark_df["price"] > 0]

Questions 17

A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline. They then set up a cross-validation process with pipeline as the estimator in the following code block:

Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?

Options:

A.  

The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model

B.  

The process will leak data from the training set to the test set during the evaluation phase

C.  

The process will be unable to parallelize tuning due to the distributed nature of pipeline

D.  

The process will leak data prep information from the validation sets to the training sets for each model

Questions 18

Which of the following statements describes a Spark ML estimator?

Options:

A.  

An estimator is a hyperparameter grid that can be used to train a model

B.  

An estimator chains multiple algorithms together to specify an ML workflow

C.  

An estimator is a trained ML model which turns a DataFrame with features into a DataFrame with predictions

D.  

An estimator is an algorithm which can be fit on a DataFrame to produce a Transformer

E.  

An estimator is an evaluation tool to assess the quality of a model

Questions 19

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Options:

A.  

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

B.  

pandas API on Spark DataFrames are more performant than Spark DataFrames

C.  

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

D.  

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

E.  

pandas API on Spark DataFrames are unrelated to Spark DataFrames

Questions 20

A data scientist is using Spark ML to engineer features for an exploratory machine learning project.

They decide they want to standardize their features using the following code block:

Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set.

Which of the following changes can the data scientist make to address the concern?

Options:

A.  

Utilize the MinMaxScaler object to standardize the training data according to global minimum and maximum values

B.  

Utilize the MinMaxScaler object to standardize the test data according to global minimum and maximum values

C.  

Utilize a cross-validation process rather than a train-test split process to remove the need for standardizing data

D.  

Utilize the Pipeline API to standardize the training data according to the test data's summary statistics

E.  

Utilize the Pipeline API to standardize the test data according to the training data's summary statistics
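
The concern raised in the code review comes down to which rows the summary statistics are computed from. The sketch below illustrates that principle in plain Python rather than Spark ML. The data is made up, fit_scaler and apply_scaler are hypothetical helpers, and the population standard deviation is an arbitrary choice for this example.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    # Summary statistics are learned from the training rows only.
    return mean(train_values), pstdev(train_values)


def apply_scaler(values, center, scale):
    # The same learned statistics are reused for any other split.
    return [(v - center) / scale for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up training feature values
test = [6.0]                       # made-up held-out value

center, scale = fit_scaler(train)
scaled_train = apply_scaler(train, center, scale)
scaled_test = apply_scaler(test, center, scale)
print(scaled_test)
```

Because the test row never contributes to center or scale, nothing about the held-out data influences the preprocessing applied during training.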

Questions 21

A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column’s median value.

They have developed the following code block to accomplish this task:

The code block is not accomplishing the task.

Which reason describes why the code block is not accomplishing the imputation task?

Options:

A.  

It does not impute both the training and test data sets.

B.  

The inputCols and outputCols need to be exactly the same.

C.  

The fit method needs to be called instead of transform.

D.  

It does not fit the imputer on the data to create an ImputerModel.
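
The code block the question refers to is not reproduced here, so the following is only a general sketch of the two-step pattern Spark ML's Imputer follows: the Imputer is fit on a DataFrame to produce an ImputerModel, and that model then transforms the data. The function name impute_with_median and the numeric_columns parameter are hypothetical. PySpark is imported inside the function and is only needed when it is called.

```python
def impute_with_median(features_df, numeric_columns):
    # Lazy import: requires pyspark.
    from pyspark.ml.feature import Imputer

    imputer = Imputer(
        strategy="median",
        inputCols=numeric_columns,
        outputCols=numeric_columns,  # overwrite the original columns
    )
    # Step 1: fit computes each column's median and returns an ImputerModel.
    imputer_model = imputer.fit(features_df)
    # Step 2: the fitted model fills the missing values.
    return imputer_model.transform(features_df)
```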

Questions 22

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:

prediction DOUBLE

actual DOUBLE

Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

A)

B)

C)

D)

Options:

A.  

Option A

B.  

Option B

C.  

Option C

D.  

Option D
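
The four candidate code blocks are not reproduced on this page. For context only, here is one way RMSE is commonly computed in Spark ML for a DataFrame with the schema given in the question, using RegressionEvaluator. This is a sketch, not a reconstruction of any of the options. The function name compute_rmse is hypothetical, and PySpark is imported lazily.

```python
def compute_rmse(preds_df):
    # Lazy import: requires pyspark.
    from pyspark.ml.evaluation import RegressionEvaluator

    evaluator = RegressionEvaluator(
        predictionCol="prediction",  # column names from the question's schema
        labelCol="actual",
        metricName="rmse",
    )
    # evaluate returns the metric as a Python float.
    return evaluator.evaluate(preds_df)
```

With a running Spark session, rmse = compute_rmse(preds_df) would assign the result to the rmse variable.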
