

NVIDIA AI Infrastructure (NCP-AII) Questions and Answers


Last Update Apr 15, 2026
Total Questions : 71

Questions 1

An engineer needs to verify the current firmware versions of all components (ATF, BSP, NIC, UEFI) on a BlueField-3 DPU's BMC. Which Redfish API command provides this information?

Options:

A.  

mlxconfig -d q

B.  

curl -k -u root: -X GET https:// /redfish/v1/UpdateService/FirmwareList

C.  

mstflint -d query full

D.  

curl -k -u root: -X GET https:// /redfish/v1/UpdateService/FirmwareInventory
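
In the DMTF Redfish schema, the UpdateService resource links to a FirmwareInventory collection whose Members array points at one resource per firmware component. As a minimal sketch of how such a collection response could be read, the Python below parses a sample payload; the payload and its component names are hypothetical placeholders, not output from a real BlueField BMC.

```python
import json

# Hypothetical sample of a Redfish collection response. Real BMCs return
# their own member names, so treat these IDs as placeholders.
SAMPLE_COLLECTION = json.loads("""
{
  "@odata.id": "/redfish/v1/UpdateService/FirmwareInventory",
  "Members": [
    {"@odata.id": "/redfish/v1/UpdateService/FirmwareInventory/example_component_a"},
    {"@odata.id": "/redfish/v1/UpdateService/FirmwareInventory/example_component_b"}
  ],
  "Members@odata.count": 2
}
""")


def member_uris(collection: dict) -> list[str]:
    """Return the @odata.id of every member in a Redfish collection."""
    return [m["@odata.id"] for m in collection.get("Members", [])]


def component_ids(collection: dict) -> list[str]:
    """Return the last path segment of each member URI."""
    return [uri.rstrip("/").rsplit("/", 1)[-1] for uri in member_uris(collection)]


if __name__ == "__main__":
    for cid in component_ids(SAMPLE_COLLECTION):
        print(cid)
```

Each member URI would then be fetched individually to read that component's version.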

Questions 2

After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?

Options:

A.  

Switch from Ring to Tree algorithms via NCCL_ALGO=TREE

B.  

Reduce message size to decrease network utilization

C.  

Increase NCCL_IB_TIMEOUT to tolerate longer latencies

D.  

Inspect InfiniBand link quality metrics (BER, symbol errors) and replace faulty cables

Questions 3

A system administrator needs to install a container toolkit and successfully run the following commands:

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime docker

What step should be taken next to finish the installation?

Options:

A.  

dpkg -i doca-host-repo-ubuntu_amd64.deb

B.  

apt-get install cuda-drivers

C.  

systemctl restart docker

D.  

apt-get remove nvidia-container-toolkit

Questions 4

A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?

Options:

A.  

Run a deep learning workload to stress test the GPUs and check whether the issue persists.

B.  

Check the NVIDIA System Management Interface (nvidia-smi) for GPU status and temperatures.

C.  

Power drain then restart the DGX and check if the performance degradation resolves.

D.  

Increase the fan speed to maximum and check whether the performance improves.

Questions 5

You are following the official steps to install the NVIDIA Container Toolkit using a package manager on Ubuntu. After importing the NVIDIA package repository and GPG key, what is the next action?

Options:

A.  

Reboot the host system to apply the repository changes and proceed.

B.  

Install the nvidia-container-toolkit package using your package manager.

C.  

Format the disk to clear any existing NVIDIA-related dependencies first.

D.  

Download the CUDA toolkit installer from NVIDIA's official website.

Questions 6

A financial services firm is deploying an AI model for fraud detection that requires rapid inference and data retrieval across multiple sites. Which feature should their storage system prioritize?

Options:

A.  

Multi-protocol data access with low latency.

B.  

High capacity with moderate speed.

C.  

Tape backup systems.

D.  

Low-cost HDD solutions.

Questions 7

An administrator needs to verify HA functionality after configuring BCM (Bright Cluster Manager). Which command confirms the active head node and failover readiness?

Options:

A.  

cmsh status to check HA status and active/standby roles.

B.  

nvsm show health to validate GPU status on both head nodes.

C.  

systemctl restart cmdaemon to force a failover test.

D.  

ping to test basic connectivity.

Questions 8

After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?

Options:

A.  

Reduction of problem size (N) to accelerate computation.

B.  

MPI-aware GPU communication that reduces CPU bottlenecks and GPU idle time.

C.  

Doubling of GPU clock speeds through firmware updates and relevant configuration.

D.  

Automatic NVLink bandwidth doubling via driver updates.

Questions 9

A team is installing the NVIDIA Run:ai control plane on a Kubernetes cluster. Which two (2) options are most critical to validate before proceeding? (Pick the 2 correct responses below)

Options:

A.  

Helm is installed on the installer machine.

B.  

Ensure Kubernetes is running on the cluster.

C.  

All cluster nodes have NVIDIA GPUs installed.

D.  

NTP is disabled to simplify time synchronization.

Questions 10

You are evaluating the integration of NVIDIA BlueField DPUs into your data center's storage architecture to optimize AI workloads. The storage solution chosen has incorporated BlueField DPUs to enhance performance and efficiency. Which of the following benefits directly results from this integration?

Options:

A.  

Unlimited scalability by adding more DPUs without architectural changes.

B.  

Elimination of latency issues in data processing tasks.

C.  

Reduced CPU load by offloading data processing tasks to DPUs.

D.  

Enhanced I/O performance with NVMe storage access speeds.

Questions 11

ClusterKit's NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

Options:

A.  

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

B.  

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.  

Critical failure; expected is >390 GB/s for HDR InfiniBand.

D.  

Inconclusive; rerun with --stress=cpu to validate.
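
One way to sanity-check a figure like 350 GB/s is to convert the link rate from gigabits to gigabytes and compare the measurement against a per-node peak. The sketch below assumes eight 400 Gb/s links per node, which the question does not state; the function names are illustrative only.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a rate in gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


def fabric_efficiency(measured_gbyte_s: float,
                      link_gbit_s: float,
                      links_per_node: int) -> float:
    """Return measured bandwidth as a fraction of the theoretical peak.

    links_per_node is an assumption supplied by the caller; the peak is
    simply the per-link rate multiplied by the number of links.
    """
    peak_gbyte_s = gbit_to_gbyte(link_gbit_s) * links_per_node
    return measured_gbyte_s / peak_gbyte_s


if __name__ == "__main__":
    # Assumed topology: 8 links of 400 Gb/s each (50 GB/s per link).
    ratio = fabric_efficiency(350, 400, 8)
    print(f"{ratio:.1%} of theoretical peak")
```

Under that assumption the theoretical peak is 400 GB/s, so 350 GB/s corresponds to 87.5% of peak.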

Questions 12

A system engineer needs to set the vGPU scheduling behavior for all GPUs to share the scheduling equally with the default time slice length. What command should be used?

Options:

A.  

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"

B.  

esxcli graphics module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"

C.  

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=FRL=0x01"

D.  

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x00"

Questions 13

A 24-hour HPL burn-in fails with "illegal value" errors during the first iteration. Which initial troubleshooting step resolves this without compromising burn-in validity?

Options:

A.  

Switch from FP64 to FP32 precision.

B.  

Disable GPU affinity.

C.  

Reduce test duration to 12 hours.

D.  

Verify the matrix size is divisible by block size.
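
The divisibility check described in option D is simple arithmetic on the HPL problem size N and block size NB. A minimal sketch follows; the helper names and the example values are illustrative, not taken from any HPL tooling.

```python
def is_multiple_of_block(n: int, nb: int) -> bool:
    """Return True when problem size n is a positive multiple of block size nb."""
    return n > 0 and nb > 0 and n % nb == 0


def round_down_to_block(n: int, nb: int) -> int:
    """Return the largest multiple of nb that does not exceed n."""
    if nb <= 0:
        raise ValueError("block size must be positive")
    return (n // nb) * nb


if __name__ == "__main__":
    n, nb = 100_000, 384  # example values only
    print(is_multiple_of_block(n, nb))   # False
    print(round_down_to_block(n, nb))    # 99840
```

Rounding N down rather than up keeps the matrix within the memory budget that the original N was sized for.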

Questions 14

A network engineer is tasked with configuring the management, storage, and compute networks for a new DGX BasePOD deployment. Which statement best describes the network segmentation required for optimal operation?

Options:

A.  

A single VLAN for all types of network traffic.

B.  

Two networks: one for management and one for compute.

C.  

Four networks: compute, storage, out-of-band, and management.

Questions 15

An engineer needs to validate NVLink Switch functionality on a DGX H100 system with 8 GPUs. Which NCCL command verifies intra-node NVLink bandwidth?

Options:

A.  

broadcast_perf -b 8 -e 16G -f 2 -g 8 without split configuration

B.  

all_reduce_perf -b 8 -e 16G -f 2 -g 4 with NCCL_TESTS_SPLIT="MOD 2"

C.  

all_reduce_perf -b 8 -e 16G -f 2 -g 1 repeated 8 times

D.  

all_reduce_perf -b 8 -e 16G -f 2 -g 8 with NCCL_TESTS_SPLIT="OR 0x7"
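
In the nccl-tests binaries, -b and -e set the minimum and maximum message size, -f sets the multiplication factor between sizes, and -g sets the number of GPUs per thread. The sketch below reproduces the message-size sweep implied by -b 8 -e 16G -f 2; it assumes size suffixes are binary (G = 1024^3), and the helper names are hypothetical.

```python
# Assumption: K/M/G suffixes are treated as binary multiples.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> int:
    """Parse a size such as '8' or '16G' into a byte count."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def sweep_sizes(begin: str, end: str, factor: int) -> list[int]:
    """List every message size from begin to end, multiplying by factor."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    sizes = []
    size, limit = parse_size(begin), parse_size(end)
    while size <= limit:
        sizes.append(size)
        size *= factor
    return sizes


if __name__ == "__main__":
    sizes = sweep_sizes("8", "16G", 2)
    print(len(sizes), sizes[0], sizes[-1])
```

With these inputs the sweep covers 32 sizes, from 8 bytes up to 16 GiB.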

Questions 16

After initial setup and health checks, the DGX H100 system administrator wants to verify that containers can access GPUs before running production workloads. Which method is recommended for this validation?

Options:

A.  

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 systemctl

B.  

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 ls -la

C.  

sudo docker run --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

D.  

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

Questions 17

A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?

Options:

A.  

The command output is ignored if the system powers on without errors.

B.  

At least half of the GPUs report Status_Health = OK.

C.  

All GPUs report Status_Health = OK and Health = OK for each device.

D.  

Only the head node's GPUs need to be healthy.
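
A readiness rule of the "every device must be healthy" kind can be expressed as a simple filter. The sketch below assumes each GPU has already been collected into a dictionary carrying the Status_Health and Health fields named in the options; the record layout and the device key are hypothetical and do not reflect actual cmsh output.

```python
def unhealthy_gpus(records: list[dict]) -> list[str]:
    """Return the device names whose health fields are not both 'OK'."""
    return [
        r["device"]
        for r in records
        if r.get("Status_Health") != "OK" or r.get("Health") != "OK"
    ]


def ready_for_workloads(records: list[dict]) -> bool:
    """True only if at least one GPU was reported and none is unhealthy."""
    return bool(records) and not unhealthy_gpus(records)


if __name__ == "__main__":
    # Hypothetical records for two GPUs on one node.
    sample = [
        {"device": "gpu0", "Status_Health": "OK", "Health": "OK"},
        {"device": "gpu1", "Status_Health": "OK", "Health": "Warning"},
    ]
    print(unhealthy_gpus(sample))       # ['gpu1']
    print(ready_for_workloads(sample))  # False
```

Treating an empty result as "not ready" guards against a query that silently returned no devices.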

Questions 18

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?

Options:

A.  

Local SSD cache allows users to increase the number of NFS threads on the server without impacting storage reliability.

B.  

Using local SSD cache in RAID-0 enables direct GPU access to files without host CPU involvement, further boosting performance.

C.  

Local SSD cache in RAID-0 is necessary to provide redundancy in case one of the drives fails during long training runs.

D.  

A local SSD cache in RAID-0 ensures that most training data is read only once from the network, significantly reducing NFS traffic.

Questions 19

A media company is developing an AI platform for video content analysis that requires storing and processing large volumes of unstructured video data. The platform must support high throughput for data ingestion and provide efficient access for real-time analytics. Given these requirements, which storage strategy should the company implement?

Options:

A.  

Tape storage for its cost-effectiveness and archival capabilities

B.  

Block storage for low latency and high performance

C.  

File storage for hierarchical organization and easy navigation

D.  

Object storage for scalability and metadata management

Questions 20

A company has a registered NGC account and their server has NGC CLI installed. What step should be taken first to gain access to NGC?

Options:

A.  

ngc config get

B.  

ngc init

C.  

ngc config set

D.  

ngc config update

Questions 21

A systems engineer is updating firmware across a large DGX cluster using automation. What is the best practice for minimizing risk and ensuring cluster health during and after the process?

Options:

A.  

Drain nodes from the scheduler, run pre-update diagnostics, update firmware in batches, and verify health post-update before scaling to the next batch.

B.  

To save time, simultaneously update all nodes in the cluster without draining or diagnostics.

C.  

Update nodes that have reported faults, leaving others on older firmware.

D.  

Drain nodes from the scheduler, update firmware in batches, skip diagnostics and verify health post-update before scaling to the next batch.
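
The batch-by-batch workflow that some of these options describe (drain, diagnose, update, verify, then move on) can be sketched as a small control loop. The code below is an illustration only: the drain, diagnose, update, and verify callables are hypothetical hooks that a real deployment would back with its own scheduler and firmware tooling.

```python
from typing import Callable, Iterator, Sequence


def batches(nodes: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of nodes, each at most size long."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(nodes), size):
        yield nodes[start:start + size]


def rolling_update(nodes: Sequence[str],
                   batch_size: int,
                   drain: Callable[[str], None],
                   diagnose: Callable[[str], bool],
                   update: Callable[[str], None],
                   verify: Callable[[str], bool]) -> list[str]:
    """Update nodes batch by batch, stopping at the first failed check."""
    done: list[str] = []
    for batch in batches(nodes, batch_size):
        for node in batch:
            drain(node)  # remove the node from scheduling first
        if not all(diagnose(node) for node in batch):
            raise RuntimeError(f"pre-update diagnostics failed in {list(batch)}")
        for node in batch:
            update(node)
        if not all(verify(node) for node in batch):
            raise RuntimeError(f"post-update health check failed in {list(batch)}")
        done.extend(batch)  # only now proceed to the next batch
    return done
```

Raising on the first failed check keeps a bad firmware image from spreading beyond a single batch.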
