Google Professional Data Engineer Exam Question and Answers

Last Update Jan 25, 2025
Total Questions : 372

Questions 1

What is the recommended way to switch between SSD and HDD storage for your Google Cloud Bigtable instance?

Options:

A.  

create a third instance and sync the data from the two storage types via batch jobs

B.  

export the data from the existing instance and import the data into a new instance

C.  

run parallel instances where one is HDD and the other is SSD

D.  

the selection is final and you must resume using the same storage type

Questions 2

Which Google Cloud Platform service is an alternative to Hadoop with Hive?

Options:

A.  

Cloud Dataflow

B.  

Cloud Bigtable

C.  

BigQuery

D.  

Cloud Datastore

Questions 3

Which of the following is NOT one of the three main types of triggers that Dataflow supports?

Options:

A.  

Trigger based on element size in bytes

B.  

Trigger that is a combination of other triggers

C.  

Trigger based on element count

D.  

Trigger based on time

Questions 4

What are two of the benefits of using denormalized data structures in BigQuery?

Options:

A.  

Reduces the amount of data processed, reduces the amount of storage required

B.  

Increases query speed, makes queries simpler

C.  

Reduces the amount of storage required, increases query speed

D.  

Reduces the amount of data processed, increases query speed

Questions 5

Which of these numbers are adjusted by a neural network as it learns from a training dataset (select 2 answers)?

Options:

A.  

Weights

B.  

Biases

C.  

Continuous features

D.  

Input values
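
To make the distinction in this question concrete, here is a minimal sketch of a single linear neuron trained by gradient descent. The function name `train_neuron`, the learning rate, the epoch count, and the toy data are illustrative assumptions and not part of the exam material. The loop updates only the weight and the bias; the input values and targets are read but never modified.

```python
def train_neuron(inputs, targets, learning_rate=0.05, epochs=500):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # the only numbers the training loop adjusts
    n = len(inputs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(inputs, targets):
            error = (w * x + b) - y
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_neuron(xs, ys)
print(round(w, 2), round(b, 2))
```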

Questions 6

Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day’s events. They also want to use streaming ingestion. What should you do?

Options:

A.  

Create a table called tracking_table and include a DATE column.

B.  

Create a partitioned table called tracking_table and include a TIMESTAMP column.

C.  

Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.

D.  

Create a table called tracking_table with a TIMESTAMP column to represent the day.
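
The cost concern in this question comes down to how many rows a daily query has to read. The sketch below is a toy, in-memory stand-in for a table partitioned by the day of a timestamp column; it is not BigQuery code, and the class `PartitionedTable` and its methods are hypothetical names chosen for illustration. It shows that a query restricted to one day only touches that day's partition.

```python
from collections import defaultdict
from datetime import datetime, timezone


class PartitionedTable:
    """Toy stand-in for a table partitioned by the day of a timestamp column."""

    def __init__(self):
        self.partitions = defaultdict(list)  # date -> list of rows

    def insert(self, event_time, payload):
        # Each row lands in the partition for its calendar day.
        self.partitions[event_time.date()].append((event_time, payload))

    def query_day(self, day):
        """Return the rows for one day and the number of rows scanned."""
        rows = self.partitions.get(day, [])
        return rows, len(rows)

    def total_rows(self):
        return sum(len(rows) for rows in self.partitions.values())


table = PartitionedTable()
for day in (1, 2, 3):
    for hour in range(24):
        event_time = datetime(2025, 1, day, hour, tzinfo=timezone.utc)
        table.insert(event_time, f"event-{day}-{hour}")

rows, scanned = table.query_day(datetime(2025, 1, 2).date())
print(scanned, "of", table.total_rows(), "rows scanned")
```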

Questions 7

MJTelco’s Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?

Options:

A.  

The zone

B.  

The number of workers

C.  

The disk size per worker

D.  

The maximum number of workers

Questions 8

You need to compose visualizations for operations teams with the following requirements:

Which approach meets the requirements?

Options:

A.  

Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show only suboptimal links in a table.

B.  

Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates the metric, and shows only suboptimal rows in a table in Google Sheets.

C.  

Load the data into Google Cloud Datastore tables, write a Google App Engine Application that queries all rows, applies a function to derive the metric, and then renders results in a table using the Google charts and visualization API.

D.  

Load the data into Google BigQuery tables, write a Google Data Studio 360 report that connects to your data, calculates a metric, and then uses a filter expression to show only suboptimal rows in a table.

Questions 9

You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.

Which two actions should you take? (Choose two.)

Options:

A.  

Ensure all the tables are included in a global dataset.

B.  

Ensure each table is included in a dataset for a region.

C.  

Adjust the settings for each table to allow a related region-based security group view access.

D.  

Adjust the settings for each view to allow a related region-based security group view access.

E.  

Adjust the settings for each dataset to allow a related region-based security group view access.

Questions 10

MJTelco is building a custom interface to share data. They have these requirements:

    They need to do aggregations over their petabyte-scale datasets.

    They need to scan specific time range rows with a very fast response time (milliseconds).

Which combination of Google Cloud Platform products should you recommend?

Options:

A.  

Cloud Datastore and Cloud Bigtable

B.  

Cloud Bigtable and Cloud SQL

C.  

BigQuery and Cloud Bigtable

D.  

BigQuery and Cloud Storage

Questions 11

You need to compose visualizations for operations teams with the following requirements:

    Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute)

    The report must not be more than 3 hours delayed from live data.

    The actionable report should only show suboptimal links.

    Most suboptimal links should be sorted to the top.

    Suboptimal links can be grouped and filtered by regional geography.

    User response time to load the report must be <5 seconds.

You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month. What should you do?

Options:

A.  

Look through the current data and compose a series of charts and tables, one for each possible combination of criteria.

B.  

Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.

C.  

Export the data to a spreadsheet, compose a series of charts and tables, one for each possible combination of criteria, and spread them across multiple tabs.

D.  

Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.
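
The telemetry requirement above implies a data volume that can be worked out directly. The short calculation below assumes one stored row per installation per sample, which the question does not state explicitly; the constant names are illustrative.

```python
INSTALLATIONS = 50_000
WEEKS = 6
SAMPLES_PER_MINUTE = 1  # "sampling once every minute"

minutes_in_window = WEEKS * 7 * 24 * 60
rows_in_window = INSTALLATIONS * minutes_in_window * SAMPLES_PER_MINUTE
rows_per_day = INSTALLATIONS * 24 * 60 * SAMPLES_PER_MINUTE

print(f"{minutes_in_window:,} samples per installation in the window")
print(f"{rows_in_window:,} rows in the 6-week window")
print(f"{rows_per_day:,} new rows per day")
```

Under that assumption the 6-week window holds about 3 billion rows, which is the scale the reporting layer has to filter and sort within the 5-second response target.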

Questions 12

Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than it was previously. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You were asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

Options:

A.  

Rewrite the job in Pig.

B.  

Rewrite the job in Apache Spark.

C.  

Increase the size of the Hadoop cluster.

D.  

Decrease the size of the Hadoop cluster but also rewrite the job in Hive.

Questions 13

You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required.

You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)

Options:

A.  

Redis

B.  

HBase

C.  

MySQL

D.  

MongoDB

E.  

Cassandra

F.  

HDFS with Hive

Questions 14

Your company is loading comma-separated values (CSV) files into Google BigQuery. The data is fully imported successfully; however, the imported data is not matching byte-to-byte to the source file. What is the most likely cause of this problem?

Options:

A.  

The CSV data loaded in BigQuery is not flagged as CSV.

B.  

The CSV data has invalid rows that were skipped on import.

C.  

The CSV data loaded in BigQuery is not using BigQuery’s default encoding.

D.  

The CSV data has not gone through an ETL phase before loading into BigQuery.

Questions 15

You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible. What should you do?

Options:

A.  

Change the processing job to use Google Cloud Dataproc instead.

B.  

Manually start the Cloud Dataflow job each morning when you get into the office.

C.  

Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.

D.  

Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.

Questions 16

You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:

    The user profile: What the user likes and doesn’t like to eat

    The user account information: Name, address, preferred meal times

    The order information: When orders are made, from where, to whom

The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?

Options:

A.  

BigQuery

B.  

Cloud SQL

C.  

Cloud Bigtable

D.  

Cloud Datastore

Questions 17

Your company produces 20,000 files every hour. Each data file is formatted as a comma-separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.

You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)

Options:

A.  

Introduce data compression for each file to increase the rate of file transfer.

B.  

Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.

C.  

Redesign the data ingestion process to use the gsutil tool to send the CSV files to a storage bucket in parallel.

D.  

Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.

E.  

Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
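
The scenario says bandwidth utilization is low even though the transfer barely keeps up. A rough calculation illustrates why that can happen. It counts payload bytes only, treats 4 KB as an upper bound on file size, and compares the time available per file with the stated 200 ms latency when files are sent strictly one after another. The function names are illustrative, and protocol overhead is ignored.

```python
FILES_PER_HOUR = 20_000
MAX_FILE_BYTES = 4 * 1024   # "less than 4 KB", used as an upper bound
LINK_BITS_PER_SECOND = 50 * 1_000_000
LATENCY_SECONDS = 0.2       # 200 ms, as stated in the question


def bandwidth_utilization(files_per_hour):
    """Fraction of the link used if only payload bytes are counted."""
    bits_per_second = files_per_hour * MAX_FILE_BYTES * 8 / 3600
    return bits_per_second / LINK_BITS_PER_SECOND


def seconds_available_per_file(files_per_hour):
    """Time budget per file when files are sent strictly one at a time."""
    return 3600 / files_per_hour


for label, volume in (("current", FILES_PER_HOUR), ("doubled", 2 * FILES_PER_HOUR)):
    print(
        f"{label}: {bandwidth_utilization(volume):.2%} of the link, "
        f"{seconds_available_per_file(volume):.3f} s available per file "
        f"vs {LATENCY_SECONDS} s latency"
    )
```

Under these assumptions the payload uses well under 1% of the link at either volume, while the per-file time budget is already shorter than one 200 ms latency period. That points at per-file round trips, not bandwidth, as the limiting factor.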

Questions 18

You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity ‘Movie’ the property ‘actors’ and the property ‘tags’ have multiple values but the property ‘date released’ does not. A typical query would ask for all movies with actor= ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

Options:

A.  

Option A

B.  

Option B

C.  

Option C

D.  

Option D

Questions 19

You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee. How can you make that data available while minimizing cost?

Options:

A.  

Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.

B.  

Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.

C.  

Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.

D.  

Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.
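
One of the options above derives FullName at query time through a view instead of storing it. The sketch below illustrates that idea using SQLite from the Python standard library as a stand-in; it is not BigQuery code. The view name `UsersWithFullName` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO Users VALUES (?, ?)",
    [("Ada", "Lovelace"), ("Alan", "Turing")],
)

# The view stores no data of its own; FullName is computed when queried.
conn.execute(
    "CREATE VIEW UsersWithFullName AS "
    "SELECT FirstName, LastName, FirstName || ' ' || LastName AS FullName "
    "FROM Users"
)

full_names = [
    row[0]
    for row in conn.execute(
        "SELECT FullName FROM UsersWithFullName ORDER BY FullName"
    )
]
table_columns = [row[1] for row in conn.execute("PRAGMA table_info(Users)")]
print(full_names)
print(table_columns)
```

The underlying table keeps its two original columns, so no existing data is rewritten or duplicated.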

Questions 20

You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?

Options:

A.  

Load the data every 30 minutes into a new partitioned table in BigQuery.

B.  

Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery.

C.  

Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore.

D.  

Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.

Questions 21

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

Options:

A.  

Store the common data in BigQuery as partitioned tables.

B.  

Store the common data in BigQuery and expose authorized views.

C.  

Store the common data encoded as Avro in Google Cloud Storage.

D.  

Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.

Questions 22

Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.

Which approach should you take?

Options:

A.  

Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.

B.  

Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.

C.  

Use the NOW() function in BigQuery to record the event’s time.

D.  

Use the automatically generated timestamp from Cloud Pub/Sub to order the data.

Questions 23

Flowlogistic’s CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they’ve purchased a visualization tool to simplify the creation of BigQuery reports. However, they’ve been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?

Options:

A.  

Export the data into a Google Sheet for visualization.

B.  

Create an additional table with only the necessary columns.

C.  

Create a view on the table to present to the visualization tool.

D.  

Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.

Questions 24

Flowlogistic’s management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

Options:

A.  

Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

B.  

Cloud Pub/Sub, Cloud Dataflow, and Local SSD

C.  

Cloud Pub/Sub, Cloud SQL, and Cloud Storage

D.  

Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

Questions 25

You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. Which solution should you choose?

Options:

A.  

Create an authorized view on the BigQuery table to control data access, and provide third-party companies with access to that view.

B.  

Use Cloud Scheduler to export the data on a regular basis to Cloud Storage, and provide third-party companies with access to the bucket.

C.  

Create a separate dataset in BigQuery that contains the relevant data to share, and provide third-party companies with access to the new dataset.

D.  

Create a Cloud Dataflow job that reads the data in frequent time intervals, and writes it to the relevant BigQuery dataset or Cloud Storage bucket for third-party companies to use.

Questions 26

As your organization expands its usage of GCP, many teams have started to create their own projects. Projects are further multiplied to accommodate different stages of deployments and target audiences. Each project requires unique access control configurations. The central IT team needs to have access to all projects. Furthermore, data from Cloud Storage buckets and BigQuery datasets must be shared for use in other projects in an ad hoc way. You want to simplify access control management by minimizing the number of policies. Which two steps should you take? Choose 2 answers.

Options:

A.  

Use Cloud Deployment Manager to automate access provision.

B.  

Introduce resource hierarchy to leverage access control policy inheritance.

C.  

Create distinct groups for various teams, and specify groups in Cloud IAM policies.

D.  

Only use service accounts when sharing data for Cloud Storage buckets and BigQuery datasets.

E.  

For each Cloud Storage bucket or BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.

Questions 27

You are planning to load some of your existing on-premises data into BigQuery on Google Cloud. You want to either stream or batch-load data, depending on your use case. Additionally, you want to mask some sensitive data before loading into BigQuery. You need to do this in a programmatic way while keeping costs to a minimum. What should you do?

Options:

A.  

Use the BigQuery Data Transfer Service to schedule your migration. After the data is populated in BigQuery, use the connection to the Cloud Data Loss Prevention (Cloud DLP) API to de-identify the necessary data.

B.  

Create your pipeline with Dataflow through the Apache Beam SDK for Python, customizing separate options within your code for streaming, batch processing, and Cloud DLP. Select BigQuery as your data sink.

C.  

Use Cloud Data Fusion to design your pipeline, use the Cloud DLP plug-in to de-identify data within your pipeline, and then move the data into BigQuery.

D.  

Set up Datastream to replicate your on-premise data on BigQuery.

Questions 28

You work for a large ecommerce company. You store your customers' order data in Bigtable. You have a garbage collection policy set to delete the data after 30 days and the number of versions is set to 1. When the data analysts run a query to report total customer spending, the analysts sometimes see customer data that is older than 30 days. You need to ensure that the analysts do not see customer data older than 30 days while minimizing cost and overhead. What should you do?

Options:

A.  

Set the expiring values of the column families to 30 days and set the number of versions to 2.

B.  

Use a timestamp range filter in the query to fetch the customer's data for a specific range.

C.  

Set the expiring values of the column families to 29 days and keep the number of versions to 1.

D.  

Schedule a job daily to scan the data in the table and delete data older than 30 days.

Questions 29

The CUSTOM tier for Cloud Machine Learning Engine allows you to specify the number of which types of cluster nodes?

Options:

A.  

Workers

B.  

Masters, workers, and parameter servers

C.  

Workers and parameter servers

D.  

Parameter servers

Questions 30

Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?

Options:

A.  

Dataproc Worker

B.  

Dataproc Viewer

C.  

Dataproc Runner

D.  

Dataproc Editor

Questions 31

Which of these are examples of a value in a sparse vector? (Select 2 answers.)

Options:

A.  

[0, 5, 0, 0, 0, 0]

B.  

[0, 0, 0, 1, 0, 0, 1]

C.  

[0, 1]

D.  

[1, 0, 0, 0, 0, 0, 0]
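
For readers unfamiliar with the term, a sparse vector is one in which most entries are zero, so it can be stored as only its non-zero positions. The helper names `to_sparse` and `sparsity` below are illustrative; the sketch simply converts each listed vector to an index-to-value mapping and reports the fraction of zeros, without taking a position on the answer.

```python
def to_sparse(dense):
    """Return {index: value} for the non-zero entries of a dense list."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def sparsity(dense):
    """Fraction of entries that are zero."""
    return dense.count(0) / len(dense)


vectors = [
    [0, 5, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1],
    [0, 1],
    [1, 0, 0, 0, 0, 0, 0],
]
for vector in vectors:
    print(vector, to_sparse(vector), round(sparsity(vector), 2))
```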

Questions 32

Cloud Dataproc charges you only for what you really use with _____ billing.

Options:

A.  

month-by-month

B.  

minute-by-minute

C.  

week-by-week

D.  

hour-by-hour

Questions 33

You are building a model to make clothing recommendations. You know a user’s fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?

Options:

A.  

Continuously retrain the model on just the new data.

B.  

Continuously retrain the model on a combination of existing data and the new data.

C.  

Train on the existing data while using the new data as your test set.

D.  

Train on the new data while using the existing data as your test set.

Questions 34

To give a user read permission for only the first three columns of a table, which access control method would you use?

Options:

A.  

Primitive role

B.  

Predefined role

C.  

Authorized view

D.  

It's not possible to give access to only the first three columns of a table.

Questions 35

Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?

Options:

A.  

Threading

B.  

Serialization

C.  

Dropout Methods

D.  

Dimensionality Reduction
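
Dropout, one of the options above, is a regularization technique that randomly zeroes activations during training. The sketch below is a plain-Python illustration of the common "inverted dropout" variant, not TensorFlow code. The function name `dropout`, the 0.5 rate, and the fixed seed are assumptions made for the example, and the rate is assumed to be less than 1.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate` and
    scale the survivors by 1 / (1 - rate). A no-op outside training."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so repeated runs give the same mask
layer_output = [1.0] * 10

dropped = dropout(layer_output, rate=0.5, rng=rng)
at_inference = dropout(layer_output, rate=0.5, rng=rng, training=False)

print(dropped)
print(at_inference)
```

Because a different random subset of units is silenced on each training step, the network cannot rely on any single unit, which is why the technique is associated with reducing overfitting.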

Questions 36

You are creating a model to predict housing prices. Due to budget constraints, you must run it on a single resource-constrained virtual machine. Which learning algorithm should you use?

Options:

A.  

Linear regression

B.  

Logistic classification

C.  

Recurrent neural network

D.  

Feedforward neural network
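
To give a sense of how little compute a simple regression model needs, the sketch below fits a one-variable linear regression in closed form, with no iterative training. The function name `fit_line` and the toy data are hypothetical; real housing data would have many features and noise.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: floor area vs. price, with price = 3 * area.
areas = [50, 70, 90, 110]
prices = [150, 210, 270, 330]

slope, intercept = fit_line(areas, prices)
predicted = slope * 100 + intercept
print(slope, intercept, predicted)
```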

Questions 37

Your company is using wildcard tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:

# Syntax error : Expected end of statement but got "-" at [4:11]
SELECT age
FROM
bigquery-public-data.noaa_gsod.gsod
WHERE
age != 99
AND _TABLE_SUFFIX = '1929'
ORDER BY
age DESC

Which table name will make the SQL statement work correctly?

Options:

A.  

`bigquery-public-data.noaa_gsod.gsod`

B.  

bigquery-public-data.noaa_gsod.gsod*

C.  

`bigquery-public-data.noaa_gsod.gsod`*

D.  

`bigquery-public-data.noaa_gsod.gsod*`

Questions 38

You are building new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?

Options:

A.  

Include ORDER BY DESC on timestamp column and LIMIT to 1.

B.  

Use GROUP BY on the unique ID column and timestamp column and SUM on the values.

C.  

Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.

D.  

Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.
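
The ROW_NUMBER option above can be pictured as "group rows by unique ID, rank each group, keep rank 1". The plain-Python sketch below emulates that pattern. Ordering each group by event timestamp, newest first, is an assumption added for the example (the option text does not say how rows are ordered), and the field names `id`, `event_ts`, and `value` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def latest_per_id(rows):
    """Emulate ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_ts DESC)
    followed by a filter that keeps only row number 1."""
    ordered = sorted(rows, key=itemgetter("id"))  # groupby needs sorted input
    result = []
    for _, group in groupby(ordered, key=itemgetter("id")):
        ranked = sorted(group, key=itemgetter("event_ts"), reverse=True)
        result.append(ranked[0])  # the row that would get row number 1
    return result


rows = [
    {"id": "a", "event_ts": 1, "value": 10},
    {"id": "a", "event_ts": 1, "value": 10},  # duplicate delivery
    {"id": "b", "event_ts": 2, "value": 20},
    {"id": "a", "event_ts": 3, "value": 11},
]
deduped = latest_per_id(rows)
print(deduped)
```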

Questions 39

Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?

Options:

A.  

Use a row key of the form .

B.  

Use a row key of the form .

C.  

Use a row key of the form #.

D.  

Use a row key of the form >##.

Questions 40

You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day. Which tool should you use?

Options:

A.  

cron

B.  

Cloud Composer

C.  

Cloud Scheduler

D.  

Workflow Templates on Cloud Dataproc

Questions 41

You are migrating a table to BigQuery and are deciding on the data model. Your table stores information related to purchases made across several store locations and includes information like the time of the transaction, items purchased, the store ID, and the city and state in which the store is located. You frequently query this table to see how many of each item were sold over the past 30 days and to look at purchasing trends by state, city, and individual store. You want to model this table to minimize query time and cost. What should you do?

Options:

A.  

Partition by transaction time; cluster by state first, then city, then store ID.

B.  

Partition by transaction time; cluster by store ID first, then city, then state.

C.  

Top-level cluster by state first, then city, then store ID.

D.  

Top-level cluster by store ID first, then city, then state.

Questions 42

You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To accomplish the design of separating storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc, when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do?

Options:

A.  

Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory

B.  

Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS

C.  

Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up

D.  

Allocate additional network interface card (NIC), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage

Questions 43

You are running a Dataflow streaming pipeline, with Streaming Engine and Horizontal Autoscaling enabled. You have set the maximum number of workers to 1000. The input of your pipeline is Pub/Sub messages with notifications from Cloud Storage. One of the pipeline transforms reads CSV files and emits an element for every CSV line. The job performance is low, the pipeline is using only 10 workers, and you notice that the autoscaler is not spinning up additional workers. What should you do to improve performance?

Options:

A.  

Use Dataflow Prime, and enable Right Fitting to increase the worker resources.

B.  

Update the job to increase the maximum number of workers.

C.  

Enable Vertical Autoscaling to let the pipeline use larger workers.

D.  

Change the pipeline code, and introduce a Reshuffle step to prevent fusion.

Questions 44

You need to create a SQL pipeline. The pipeline runs an aggregate SQL transformation on a BigQuery table every two hours and appends the result to another existing BigQuery table. You need to configure the pipeline to retry if errors occur. You want the pipeline to send an email notification after three consecutive failures. What should you do?

Options:

A.  

Create a BigQuery scheduled query to run the SQL transformation with schedule options that repeats every two hours, and enable email notifications.

B.  

Use the BigQueryUpsertTableOperator in Cloud Composer, set the retry parameter to three, and set the email_on_failure parameter to true.

C.  

Use the BigQueryInsertJobOperator in Cloud Composer, set the retry parameter to three, and set the email_on_failure parameter to true.

D.  

Create a BigQuery scheduled query to run the SQL transformation with schedule options that repeats every two hours, and enable notification to Pub/Sub topic. Use Pub/Sub and Cloud Functions to send an email after three failed executions.
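
The behaviour this question asks for, retry on error and notify only after three consecutive failures, can be sketched independently of any orchestration tool. The code below is a generic illustration and not Cloud Composer or Airflow code. The names `run_with_retries`, `flaky`, and `always_fails` are hypothetical, and counting three attempts in total (not one run plus three retries) is an assumption.

```python
def run_with_retries(task, max_attempts=3, notify=print):
    """Run `task` up to `max_attempts` times. Call `notify` once,
    and only if every attempt raised an exception."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as error:  # broad on purpose for the sketch
            last_error = error
    notify(f"task failed {max_attempts} times in a row: {last_error}")
    return None


sent = []  # stands in for an email outbox
attempts = {"count": 0}


def flaky():
    """Fails twice, then succeeds on the third attempt."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient error")
    return "ok"


def always_fails():
    raise RuntimeError("query error")


flaky_result = run_with_retries(flaky, notify=sent.append)
notifications_after_flaky = len(sent)
failed_result = run_with_retries(always_fails, notify=sent.append)

print(flaky_result, notifications_after_flaky)
print(failed_result, sent)
```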

Questions 45

Different teams in your organization store customer and performance data in BigQuery. Each team needs to keep full control of their collected data, be able to query data within their projects, and be able to exchange their data with other teams. You need to implement an organization-wide solution, while minimizing operational tasks and costs. What should you do?

Options:

A.  

Create a BigQuery scheduled query to replicate all customer data into team projects.

B.  

Enable each team to create materialized views of the data they need to access in their projects.

C.  

Ask each team to publish their data in Analytics Hub. Direct the other teams to subscribe to them.

D.  

Ask each team to create authorized views of their data. Grant the bigquery.jobUser role to each team.

Discussion 0
Questions 46

You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company's mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer's stated intention for contacting customer service. About 70% of customer requests are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer, more complicated requests. Which intents should you automate first?

Options:

A.  

Automate the 10 intents that cover 70% of the requests so that live agents can handle more complicated requests

B.  

Automate the more complicated requests first because those require more of the agents' time

C.  

Automate a blend of the shortest and longest intents to be representative of all intents

D.  

Automate intents in places where common words such as "payment" appear only once so the software isn't confused

Discussion 0
Questions 47

You are loading CSV files from Cloud Storage to BigQuery. The files have known data quality issues, including mismatched data types, such as STRINGS and INT64s in the same column, and inconsistent formatting of values such as phone numbers or addresses. You need to create the data pipeline to maintain data quality and perform the required cleansing and transformation. What should you do?

Options:

A.  

Use Data Fusion to transform the data before loading it into BigQuery.

B.  

Load the CSV files into a staging table with the desired schema, perform the transformations with SQL, and then write the results to the final destination table.

C.  

Create a table with the desired schema, load the CSV files into the table, and perform the transformations in place using SQL.

D.  

Use Data Fusion to convert the CSV files to a self-describing data format, such as AVRO, before loading the data to BigQuery.

Discussion 0
Questions 48

You have a data processing application that runs on Google Kubernetes Engine (GKE). Containers need to be launched with their latest available configurations from a container registry. Your GKE nodes need to have GPUs, local SSDs, and 8 Gbps bandwidth. You want to efficiently provision the data processing infrastructure and manage the deployment process. What should you do?

Options:

A.  

Use Compute Engine startup scripts to pull container images, and use gcloud commands to provision the infrastructure.

B.  

Use GKE to autoscale containers, and use gcloud commands to provision the infrastructure.

C.  

Use Cloud Build to schedule a job using Terraform build to provision the infrastructure and launch with the most current container images.

D.  

Use Dataflow to provision the data pipeline, and use Cloud Scheduler to run the job.

Discussion 0
Questions 49

You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud. You want to support transactions that scale horizontally. You also want to optimize data for range queries on nonkey columns. What should you do?

Options:

A.  

Use Cloud SQL for storage. Add secondary indexes to support query patterns.

B.  

Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.

C.  

Use Cloud Spanner for storage. Add secondary indexes to support query patterns.

D.  

Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.

Discussion 0
Questions 50

You launched a new gaming app almost three years ago. You have been uploading log files from the previous day to a separate Google BigQuery table with the table name format LOGS_yyyymmdd. You have been using table wildcard functions to generate daily and monthly reports for all time ranges. Recently, you discovered that some queries that cover long date ranges are exceeding the limit of 1,000 tables and failing. How can you resolve this issue?

Options:

A.  

Convert all daily log tables into date-partitioned tables

B.  

Convert the sharded tables into a single partitioned table

C.  

Enable query caching so you can cache data from previous months

D.  

Create separate views to cover each month, and query from these views
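To see why long date ranges run into the limit, it helps to count the daily tables involved. This plain-Python sketch enumerates table names in the LOGS_yyyymmdd format from the question. The function name and the sample dates are illustrative assumptions, not part of the question.

```python
from datetime import date, timedelta


def sharded_table_names(start, end, prefix="LOGS_"):
    """List one table name per calendar day from start to end, inclusive."""
    days = (end - start).days + 1
    return [f"{prefix}{(start + timedelta(days=i)):%Y%m%d}" for i in range(days)]


if __name__ == "__main__":
    # Sample range of three full calendar years (dates chosen only for illustration).
    names = sharded_table_names(date(2021, 1, 1), date(2023, 12, 31))
    print(len(names), names[0], names[-1])
```

A wildcard query over the whole sample range would touch one table per day, so a span of roughly three years already exceeds 1,000 tables.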

Discussion 0
Questions 51

You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?

Options:

A.  

Add a SideInput that returns a Boolean if the element is corrupt.

B.  

Add a ParDo transform in Cloud Dataflow to discard corrupt elements.

C.  

Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.

D.  

Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
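The per-element filtering that the options refer to can be illustrated without any Dataflow dependency. The sketch below is plain Python, not Beam code. `is_corrupt`, `drop_corrupt`, and the sample fields (`device_id`, `temperature`) are hypothetical, since the question does not say what makes a record corrupt.

```python
import json


def is_corrupt(raw_message):
    """Hypothetical check: a record is corrupt if it is not valid JSON
    or lacks the fields this example expects."""
    try:
        record = json.loads(raw_message)
    except json.JSONDecodeError:
        return True
    return not (isinstance(record, dict)
                and {"device_id", "temperature"} <= record.keys())


def drop_corrupt(messages):
    """Yield only parsed, valid records; emitting nothing for a bad element discards it."""
    for raw in messages:
        if not is_corrupt(raw):
            yield json.loads(raw)
```

The key idea is that an element-wise step may emit zero outputs for an input, which is how a bad record disappears from the stream.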

Discussion 0
Questions 52

Your company’s customer and order databases are often under heavy load. This makes performing analytics against them difficult without harming operations. The databases are in a MySQL cluster, with nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations. What should you do?

Options:

A.  

Add a node to the MySQL cluster and build an OLAP cube there.

B.  

Use an ETL tool to load the data from MySQL into Google BigQuery.

C.  

Connect an on-premises Apache Hadoop cluster to MySQL and perform ETL.

D.  

Mount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.

Discussion 0
Questions 53

You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

Options:

A.  

Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.

B.  

Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.

C.  

Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.

D.  

Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.

E.  

Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
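The conversion at the heart of this question, from an epoch time stored as a string to a real timestamp, is shown below in plain Python rather than BigQuery SQL. It assumes the DT strings hold whole seconds since the Unix epoch, which the question does not state explicitly. Both function names are made up for the example.

```python
from datetime import datetime, timezone


def epoch_string_to_timestamp(dt_value):
    """Parse an epoch-seconds string (assumed unit) into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(dt_value), tz=timezone.utc)


def session_duration_seconds(first_click, last_click):
    """Duration between two click events given as epoch-seconds strings."""
    start = epoch_string_to_timestamp(first_click)
    end = epoch_string_to_timestamp(last_click)
    return (end - start).total_seconds()
```

Doing this parse on every read is the recurring cost the question alludes to. Storing the converted value once avoids repeating it.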

Discussion 0
Questions 54

Your software uses a simple JSON format for all messages. These messages are published to Google Cloud Pub/Sub, then processed with Google Cloud Dataflow to create a real-time dashboard for the CFO. During testing, you notice that some messages are missing in the dashboard. You check the logs, and all messages are being published to Cloud Pub/Sub successfully. What should you do next?

Options:

A.  

Check the dashboard application to see if it is not displaying correctly.

B.  

Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.

C.  

Use Google Stackdriver Monitoring on Cloud Pub/Sub to find the missing messages.

D.  

Switch Cloud Dataflow to pull messages from Cloud Pub/Sub instead of Cloud Pub/Sub pushing messages to Cloud Dataflow.

Discussion 0
Questions 55

Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost. What should they do?

Options:

A.  

Redefine the schema by evenly distributing reads and writes across the row space of the table.

B.  

The performance issue should be resolved over time as the size of the Bigtable cluster is increased.

C.  

Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.

D.  

Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.

Discussion 0
Questions 56

Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered by Google App Engine and serves millions of users. How should you design the frontend to respond to a database failure?

Options:

A.  

Issue a command to restart the database servers.

B.  

Retry the query with exponential backoff, up to a cap of 15 minutes.

C.  

Retry the query every second until it comes back online to minimize staleness of data.

D.  

Reduce the query frequency to once every hour until the database comes back online.
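For readers unfamiliar with the term used in option B, capped exponential backoff can be sketched in a few lines of plain Python. The function name `backoff_delays`, the 1-second base delay, and the doubling factor are illustrative assumptions. Only the 15-minute cap comes from the option text.

```python
def backoff_delays(attempts, base_seconds=1.0, factor=2.0, cap_seconds=15 * 60):
    """Return the wait before each retry: base * factor**n, never above the cap."""
    delays = []
    for attempt in range(attempts):
        delay = base_seconds * (factor ** attempt)
        delays.append(min(delay, cap_seconds))
    return delays


if __name__ == "__main__":
    # The first retries come quickly; later ones level off at the cap.
    print(backoff_delays(12))
```

Real retry loops usually also add random jitter to each delay so that many clients do not retry at the same instant. That is omitted here to keep the sketch deterministic.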

Discussion 0
Questions 57

You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update. What should you do?

Options:

A.  

Update the current pipeline and use the drain flag.

B.  

Update the current pipeline and provide the transform mapping JSON object.

C.  

Create a new pipeline that has the same Cloud Pub/Sub subscription and cancel the old pipeline.

D.  

Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline.

Discussion 0
Questions 58

You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need an instant notification to be sent to your monitoring tool when new data is appended to a certain table using an insert job, but you do not want to receive notifications for other tables. What should you do?

Options:

A.  

Make a call to the Stackdriver API to list all logs, and apply an advanced filter.

B.  

In the Stackdriver logging admin interface, enable a log sink export to BigQuery.

C.  

In the Stackdriver logging admin interface, enable a log sink export to Google Cloud Pub/Sub, and subscribe to the topic from your monitoring tool.

D.  

Using the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.

Discussion 0
Questions 59

Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects to allow them to work with multiple GCP products in their projects. Your organization requires that all BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in your company can access the data access logs for all projects. What should you do?

Options:

A.  

Enable data access logs in each Data Analyst’s project. Restrict access to Stackdriver Logging via Cloud IAM roles.

B.  

Export the data access logs via a project-level export sink to a Cloud Storage bucket in the Data Analysts’ projects. Restrict access to the Cloud Storage bucket.

C.  

Export the data access logs via a project-level export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project with the exported logs.

D.  

Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.

Discussion 0
Questions 60

You have an upstream process that writes data to Cloud Storage. This data is then read by an Apache Spark job that runs on Dataproc. These jobs are run in the us-central1 region, but the data could be stored anywhere in the United States. You need to have a recovery process in place in case of a catastrophic single region failure. You need an approach with a maximum of 15 minutes of data loss (RPO=15 mins). You want to ensure that there is minimal latency when reading the data. What should you do?

Options:

A.  

1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions.

2. Enable turbo replication.

3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the us-south1 region.

4. In case of a regional failure, redeploy your Dataproc cluster to the us-south1 region and continue reading from the same bucket.

B.  

1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions.

2. Enable turbo replication.

3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the same region.

4. In case of a regional failure, redeploy the Dataproc clusters to the us-south1 region and read from the same bucket.

C.  

1. Create a Cloud Storage bucket in the US multi-region.

2. Run the Dataproc cluster in a zone in the us-central1 region, reading data from the US multi-region bucket.

3. In case of a regional failure, redeploy the Dataproc cluster to the us-central2 region and continue reading from the same bucket.

D.  

1. Create two regional Cloud Storage buckets, one in the us-central1 region and one in the us-south1 region.

2. Have the upstream process write data to the us-central1 bucket. Use the Storage Transfer Service to copy data hourly from the us-central1 bucket to the us-south1 bucket.

3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in that region.

4. In case of regional failure, redeplo

Discussion 0
Questions 61

A data scientist has created a BigQuery ML model and asks you to create an ML pipeline to serve predictions. You have a REST API application with the requirement to serve predictions for an individual user ID with latency under 100 milliseconds. You use the following query to generate predictions: SELECT predicted_label, user_id FROM ML.PREDICT (MODEL ‘dataset.model’, table user_features). How should you create the ML pipeline?

Options:

A.  

Add a WHERE clause to the query, and grant the BigQuery Data Viewer role to the application service account.

B.  

Create an Authorized View with the provided query. Share the dataset that contains the view with the application service account.

C.  

Create a Cloud Dataflow pipeline using BigQueryIO to read results from the query. Grant the Dataflow Worker role to the application service account.

D.  

Create a Cloud Dataflow pipeline using BigQueryIO to read predictions for all users from the query. Write the results to Cloud Bigtable using BigtableIO. Grant the Bigtable Reader role to the application service account so that the application can read predictions for individual users from Cloud Bigtable.

Discussion 0
Questions 62

Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?

Options:

A.  

Field promotion

B.  

Randomization

C.  

Salting

D.  

Hashing
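The four options name row-key design techniques. As a rough illustration only, here is how each might shape a key for a time-series reading. The key layout (the `#` separator), the field names, and the bucket count are assumptions made for the example. They are not taken from the question or from the Bigtable API, and the sketch does not indicate which option is the intended answer.

```python
import hashlib
import random


def timestamp_first_key(device_id, ts):
    # Baseline: keys sort by time, so concurrent writes land next to each other.
    return f"{ts}#{device_id}"


def field_promotion_key(device_id, ts):
    # Field promotion: move a field from the row data into the key prefix.
    return f"{device_id}#{ts}"


def salted_key(device_id, ts, buckets=4):
    # Salting: prefix a small bucket number calculated from the timestamp.
    return f"{ts % buckets}#{ts}#{device_id}"


def hashed_key(device_id, ts):
    # Hashing: replace the readable key with a digest; scans by time range are lost.
    return hashlib.sha256(f"{ts}#{device_id}".encode()).hexdigest()


def randomized_key(device_id, ts):
    # Randomization: a random prefix spreads writes but makes rows hard to find again.
    return f"{random.randrange(1_000_000):06d}#{ts}#{device_id}"
```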

Discussion 0
Questions 63

What are all of the BigQuery operations that Google charges for?

Options:

A.  

Storage, queries, and streaming inserts

B.  

Storage, queries, and loading data from a file

C.  

Storage, queries, and exporting data

D.  

Queries and streaming inserts

Discussion 0
Questions 64

Which of these statements about BigQuery caching is true?

Options:

A.  

By default, a query's results are not cached.

B.  

BigQuery caches query results for 48 hours.

C.  

Query results are cached even if you specify a destination table.

D.  

There is no charge for a query that retrieves its results from cache.

Discussion 0
Questions 65

How would you query specific partitions in a BigQuery table?

Options:

A.  

Use the DAY column in the WHERE clause

B.  

Use the EXTRACT(DAY) clause

C.  

Use the __PARTITIONTIME pseudo-column in the WHERE clause

D.  

Use DATE BETWEEN in the WHERE clause

Discussion 0
Questions 66

Which software libraries are supported by Cloud Machine Learning Engine?

Options:

A.  

Theano and TensorFlow

B.  

Theano and Torch

C.  

TensorFlow

D.  

TensorFlow and Torch

Discussion 0
Questions 67

Suppose you have a table that includes a nested column called "city" inside a column called "person", but when you try to submit the following query in BigQuery, it gives you an error.

SELECT person FROM `project1.example.table1` WHERE city = "London"

How would you correct the error?

Options:

A.  

Add ", UNNEST(person)" before the WHERE clause.

B.  

Change "person" to "person.city".

C.  

Change "person" to "city.person".

D.  

Add ", UNNEST(city)" before the WHERE clause.

Discussion 0
Questions 68

Which of the following is NOT true about Dataflow pipelines?

Options:

A.  

Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner

B.  

Dataflow pipelines can consume data from other Google Cloud services

C.  

Dataflow pipelines can be programmed in Java

D.  

Dataflow pipelines use a unified programming model, so can work both with streaming and batch data sources

Discussion 0
Questions 69

Which Java SDK class can you use to run your Dataflow programs locally?

Options:

A.  

LocalRunner

B.  

DirectPipelineRunner

C.  

MachineRunner

D.  

LocalPipelineRunner

Discussion 0
Questions 70

What are the minimum permissions needed for a service account used with Google Dataproc?

Options:

A.  

Execute to Google Cloud Storage; write to Google Cloud Logging

B.  

Write to Google Cloud Storage; read to Google Cloud Logging

C.  

Execute to Google Cloud Storage; execute to Google Cloud Logging

D.  

Read and write to Google Cloud Storage; write to Google Cloud Logging

Discussion 0
Questions 71

Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery. Which three approaches can you take? (Choose three.)

Options:

A.  

Disable writes to certain tables.

B.  

Restrict access to tables by role.

C.  

Ensure that the data is encrypted at all times.

D.  

Restrict BigQuery API access to approved users.

E.  

Segregate data across multiple tables or databases.

F.  

Use Google Stackdriver Audit Logging to determine policy violations.

Discussion 0
Questions 72

You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action on these anomalous events as they occur. Your custom HTTPS endpoint keeps getting an inordinate amount of duplicate messages. What is the most likely cause of these duplicate messages?

Options:

A.  

The message body for the sensor event is too large.

B.  

Your custom endpoint has an out-of-date SSL certificate.

C.  

The Cloud Pub/Sub topic has too many messages published to it.

D.  

Your custom endpoint is not acknowledging messages within the acknowledgement deadline.

Discussion 0
Questions 73

Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you do first?

Options:

A.  

Use Google Stackdriver Audit Logs to review data access.

B.  

Get the identity and access management (IAM) policy of each table.

C.  

Use Stackdriver Monitoring to see the usage of BigQuery query slots.

D.  

Use the Google Cloud Billing API to see what account the warehouse is being billed to.

Discussion 0
Questions 74

You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old. What should you do?

Options:

A.  

Disable caching by editing the report settings.

B.  

Disable caching in BigQuery by editing table details.

C.  

Refresh your browser tab showing the visualizations.

D.  

Clear your browser history for the past hour, then reload the tab showing the visualizations.

Discussion 0
Questions 75

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

Options:

A.  

Add capacity (memory and disk space) to the database server by the order of 200.

B.  

Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.

C.  

Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.

D.  

Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Discussion 0
Questions 76

You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)

Options:

A.  

There are very few occurrences of mutations relative to normal samples.

B.  

There are roughly equal occurrences of both normal and mutated samples in the database.

C.  

You expect future mutations to have different features from the mutated samples in the database.

D.  

You expect future mutations to have similar features to the mutated samples in the database.

E.  

You already have labels for which samples are mutated and which are normal in the database.

Discussion 0